Overview
Twine-data (twine for short) is a binary format that aims at combining simplicity, compactness, performance of (de)serialization, self descriptiveness, and expressiveness.
Unlike most of the modern self-descriptive formats (CBOR, MsgPack, JSON, S-expressions, etc.), twine's data model is not a tree. Rather, a twine byte stream represents a DAG of values, which is important to efficiently serialize data structures that exhibit sharing.
A simple example
A very simple example, corresponding to the JSON value {"a": ["hello", ["hello"]], "x": true}
,
would look like this:
$ cat example.json
{"a": ["hello", ["hello"]], "x": true}
# convert to twine
$ twine-utils from-json example.json -o example.twine
# the serialized bytes
$ hexyl example.twine
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 45 68 65 6c 6c 6f 61 f6 ┊ 62 f8 f3 72 41 61 f5 41 │Ehelloa×┊b××rAa×A│
│00000010│ 78 01 06 ┊ │x•• ┊ │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
# a debug view
$ twine-utils dump example.twine
[0x0]: "hello"
[0x6]: [@0x0] (len=1)
[0x8]: [@0x0, @0x6] (len=2)
[0xb]: {"a": @0x8, "x": true} (len=2)
We can already see some interesting properties of this Twine file:
- The string
"hello"
occurs only once in the file. This is not mandatory, but it's enabled by the sharing feature of Twine, namely, pointers (printed as@0x0
,0x6
,0x8
, etc in the dump output). - A pointer to a value is the numeric offset of that value in the file.
- The map and arrays resemble JSON somewhat, but cannot be nested. Instead, an array or map can only refer to another array or map via a pointer.
- Strings are not quoted, but length prefixed. We'll see how this works later, but the very first byte in the file,
0x45
, is actually a combination of "tag 4" (string), and "length 5".
Sharing and DAGs
Twine emerged from work on computational logic tools, where logical terms (mathematical formulas) can grow large, and tend to exhibit a lot of sharing.
SMT-LIB and the need to represent DAGs
This sharing is routine for some classes of tools, such as SMT solvers.
SMT solvers do not have a standard binary format, but the standard text format, SMT-LIB, based on S-expression, relies heavily on let
-bindings to represent sharing.
Tools that consume SMT-LIB typically turn let
-bindings such as (let ((x (+ 1 2))) (+ x x))
into a let-free DAG immediately (here, (+ (+ 1 2) (+ 1 2))
with perfect sharing) during parsing; but tools that produce SMT-LIB need to do more work to introduce these let
-bindings, pick unique variable names, etc. and the sharing cannot work across terms in distinct contexts.
The goal of Twine is to provide a substrate, like JSON or CBOR do in other contexts, to efficiently represent such large DAGs on disk. For this, Twine provides both pointers (implicitly followed on reading by default), and references (deserialized as actual offsets, never implicitly followed).
Self-descriptive format
In A Survey of JSON-compatible Binary Serialization Specifications, Juan Cruzz Viotti and Mital Kinderkhedia split JSON-compatible serialization formats between schema-driven and schema-less. Schema-less formats are self-descriptive, meaning it's possible to parse and manipulate them without having access to the underlying schema. Consequently, there exists many JSON editors and viewers, CBOR tools such as cbor.me, and S-expression centric editor plugins (e.g in emacs).
Usefulness for logic
For twine, leaning towards self-descriptiveness means that it's possible to implementat tools that work on any DAG. In the context of computational logic, a useful example is _pruning_ of a large DAG into a (potentially much) smaller DAG containing only values reachable from a given root.A possible use case is the representation of a proof trace. A trace here means that a SMT solver or theorem prover is going to log every single logic inference it does, as it searches for a proof. This trace naturally forms a DAG since each inference can be used to derive multiple further inference steps.
Once the empty clause (false
) is found, the inferences used to derive it can be extracted from the trace.
This is where pruning comes in: a different byte stream can be built by simply traversing the full trace, starting from the empty clause, and rebuilding a new DAG with only these steps.
Description
Let's now examine in more details the data format. We're going to discuss the data format first, before diving into encoding into bytes.
Data model
Twine, like CBOR or JSON, offers a specific data model that applications can rely on to encode their specific types into a byte stream. However, where CBOR or JSON offer a straightforward tree-shaped document model with arrays, maps, integers, strings, etc. Twine offers something akin to a heap of shallow values that can point to previous shallow values.
Tree-like data model
Still, it's possible to think of Twine values as being (imperfectly) represented by the following types:
#![allow(unused)] fn main() { /// An offset in a byte array. pub type Offset = usize; pub enum Value { Null, Bool(bool), Int64(i64), Float(f64), /// Text, in UTF8 String(String), /// Binary blob. Bytes(Vec<u8>), /// A variant with 0 arguments. Variant0(VariantIdx), /// A reference to a full value /// (which comes at an earlier offset). Ref(Offset), /// An implicitly followed reference to a full value /// (which comes at an earlier offset). Pointer(Offset), /// Tagged value, like CBOR Tag(Tag, Box<Value>), Array(Vec<Value>), /// Map made of key/value pairs Map(Vec<(Value, Value)>), /// Variant of an enum/sum type Variant(VariantIdx, Vec<Value>), } pub type Tag = u64; /// For efficiency, variants use indices, not names. pub struct VariantIdx(pub u32); }
- We can see the usual CBOR basics, like
null
, booleans, 64 bit floats, 64-bits integers (signed), UTF-8 strings, and byte strings. - We also have CBOR-like tags, with an integer tag and the tagged value.
- Arrays and Maps are similar to CBOR. Ordering in maps is not constrained and should be deserialized exactly in the order it is written on disk.
- Variants correspong to sum types/enums: an integer variant index, with arguments.
- Pointers contain the offset of the pointed-to value. Semantically, a pointer is dereferenced implicitly when convenient; for example, a user trying to read an integer value from offset
p0
, whenp0
contains the valuePointer(p1)
, will try to read an integer from offsetp1
instead. This is useful for sharing of large strings or binary blobs. - References contain the offset of the pointed-to value, and are returned as-is. This is useful to represent DAG edges in a way that's easily accessible by the user, who gets control over when to dereference the offset.
Heap of shallow values
A more correct view, which is used in current Twine implementations, is one that maps offsets in the byte stream to shallow values. Shallow values do not exhibit significant nesting and are quick to read and traverse at deserialization time.
Shallow values can still contain other shallow values if these are immediates (basically, scalars), as this doesn't introduce actual nesting.
Immediates
In terms of Rust code we would have immediates being defined by something like this excerpt of twine-data.rs
:
#![allow(unused)] fn main() { pub enum Immediate<'a> { /// CBOR/JSON null. Null, Bool(bool), Int64(i64), Float(f64), /// Text, in UTF8 String(&'a str), /// Binary blob. Bytes(&'a [u8]), /// A variant with 0 arguments. Variant0(VariantIdx), /// An explicit reference to a full value /// (which comes at an earlier offset). Ref(Offset), /// An automatically followed reference to a full value /// (which comes at an earlier offset). Pointer(Offset), } }
So these are encoded directly, without indirections, and can be stored as-is in arrays, maps, and variant arguments. A few notes:
- The distinction between
String
andBytes
is that the former is UTF8 text, and the latter can be anything (including\0
and invalid unicode — an example would be an image or a cryptographic hash.) Float
is a small simplification here, floats on the wire can be 32 or 64 bits. The serialization API can provide both if desired.Variant0
represents an enum variant that takes no argument. It is essentially an optimization.Pointer
is an offset to a value that comes earlier in the byte stream (you can only point at values already serialized earlier, so their offset is smaller). It is followed implicitly and automatically by deserializers. This is how something like transparent deduplication of strings can be done.Ref
also contains an offset (pointing to a value coming earlier in the byte stream). However it is visible to the user and is a full value on its own.- Skipping over a value should be fast and require no recursion.
Shallow values
Now, as said earlier, a Twine byte stream represents a map from offsets to shallow values. A value can always be referred to by its offset in the stream.
Shallow values are represented by the following (simplified) type:
#![allow(unused)] fn main() { pub enum ShallowValue<'a> { Imm(Immediate<'a>), Tag(Tag, Offset), Array { n_items: usize }, Map { n_items: usize }, Variant { idx: VariantIdx, n_args: usize } } }
So, immediates are valid shallow values using ShallowValue::Imm
;
but we also get the following CBOR-like cases:
Tag
is a pointer to a value (theOffset
) alongside an integer tag.Array
is followed byn_items
immediates values. An array of integers will contain the integers immediately, but an array of maps will only contain pointers to the maps, which have to have been serialized prior to the array.Map
is followed by2*n_items
immediate values, in the orderk1, v1, k2, v2, …
.
In addition, we have variants:
Variant
has a variant index (the enum case) and a number of arguments that follow as immediates, just like an array. A variant with 0 arguments can be more efficiently represented usingShallowValue::Imm(Immediate::Variant0(idx))
.
Summary
So, Twine distinguishes immediates (scalars, including pointers and references) as a subset of shallow values. Shallow values are stored in a byte stream at given offsets and can point to other values that have smaller offsets.
Wire format
Now that we have immediates and shallow values, let's see how they are encoded.
This part is where Twine is relatively unique. The closest (to my knowledge) to Twine's data model is Fleece, and Fleece makes different design choices to make different tradeoffs — see why not Fleece for more details —, but the way Twine encodes individual values is closest to CBOR.
Values and immediates in Twine always start with a header byte.
The header byte is divided into high
and low
nibbles (most significant and least significant). The high nibble is the kind of the value; the low nibble is a small integer values with kind-dependent meaning:
low
is a small integer. Iflow=15
, the header byte is followed by a LEB128 valuelow_rest
, and the resulting integer islow_rest + 15
low
is exactly as above, with optional LEB128 followup, but represents a length or number of itemslow
is a small integer that represents a small enumeration of values. Each value has a specific meaning.
From now on we refer to these are high
(4 bits), and n
for the small integer obtained from low
and the optional following LEB128 integer.
Here are the possible kinds:
kind | meaning of low | value |
---|---|---|
0 | enum | special scalars: true , false , null |
1 | small int | non-negative integer |
2 | small int | negative integer |
3 | enum | float |
4 | length | UTF-8 text string (length is number of bytes) |
5 | length | byte string (length is number of bytes) |
6 | length | array (length is number of items) |
7 | length | map (length is number of key/value pairs) |
8 | small int | tag (offset follows) |
9 | - | reserved |
10 | small int | variant0 (integer is the variant index) |
11 | small int | variant1 (integer is the variant index) |
12 | small int | variantN (integer is the variant index) |
13 | - | reserved |
14 | small int | reference (integer is the offset delta) |
15 | small int | pointer (integer is the offset delta) |
Kind=0: special scalars
Layout in bytes:
┌─────────┐
│offset: 0│
├─────────┤
│┌─┬───┐ │
││0│low│ │
│└─┴───┘ │
└─────────┘
These are some immediate values, using this table:
high | low | value |
---|---|---|
0 | 0 | false |
0 | 1 | true |
0 | 2 | null |
0 | _ | reserved |
Kind=1: non-negative integer.
Layout in bytes:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌─┬───┐ │LEB128 length extension│
││1│low│ │ │
│└─┴───┘ │ │
└─────────┴───────────────────────┘
The value is the integer n
. Only values in [0…2^63-1]
should
be encoded this way, so that they fit in i64
.
For example, the integer 42
is encoded as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 1f 1b ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Kind=2: negative integer
Layout in bytes:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌─┬───┐ │LEB128 length extension│
││2│low│ │ │
│└─┴───┘ │ │
└─────────┴───────────────────────┘
The value, given n
, is the signed integer -n-1
.
Only values in [-2^63…-1]
should be encoded this way, so they fit
in a i64
.
For example, -2
is:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 21 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
and -27
is:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 2f 0b ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Note that the value we encode is 26
, which overflows 15; so we write low=15
and then 26-15 = 11 = 0x0b
as a single byte in LEB128.
Kind=3: float
Layout in bytes:
┌─────────┬─────────────────┐
│offset: 0│offset: 1..5 │
├─────────┼─────────────────┤
│┌─┬─┐ │f32 little endian│
││3│0│ │ │
│└─┴─┘ │ │
└─────────┴─────────────────┘
┌─────────┬─────────────────┐
│offset: 0│offset: 1..9 │
├─────────┼─────────────────┤
│┌─┬─┐ │f64 little endian│
││3│1│ │ │
│└─┴─┘ │ │
└─────────┴─────────────────┘
The value is a 32 or 64 bits float, encoded as its bits in little endian.
If low = 0
, the precision is 32 bits; if low = 1
, the precision is 64 bits. Other values of low
are reserved.
For example, 42.5 : f64
is represented as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 31 00 00 00 00 00 40 45 ┊ 40 │
└────────┴─────────────────────────┴─────────────────────────┘
This corresponds to 0x4045400000000000
in little endian,
as per https://float.exposed/0x4045400000000000 .
Kind=4: UTF-8 text
Byte layout:
┌─────────┬───────────────────────┬───────────────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…n1+len│
├─────────┼───────────────────────┼───────────────────┤
│┌─┬───┐ │LEB128 length extension│string data │
││4│low│ │ │ │
│└─┴───┘ │ │ │
└─────────┴───────────────────────┴───────────────────┘
For example, "hello world! 😁"
becomes:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 4f 02 68 65 6c 6c 6f 20 ┊ 77 6f 72 6c 64 21 20 f0 │
│00000010│ 9f 98 81 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Note how the header+length is 0x4f02
because the length in bytes is greater than 15.
Kind=5: byte string
Byte layout:
┌─────────┬───────────────────────┬───────────────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…n1+len│
├─────────┼───────────────────────┼───────────────────┤
│┌─┬───┐ │LEB128 length extension│bytes │
││5│low│ │ │ │
│└─┴───┘ │ │ │
└─────────┴───────────────────────┴───────────────────┘
This is like text, but the content is an opaque blob with not constraints.
Kind=6: array
Byte layout:
┌─────────┬───────────────────────┬──────────┬──────────┬─────────┬────────────┐
│offset: 0│offset: 1…n1 │offset: … │offset: … │offset: …│offset: … │
├─────────┼───────────────────────┼──────────┼──────────┼─────────┼────────────┤
│┌─┬───┐ │LEB128 length extension│┌────────┐│┌────────┐│… │┌──────────┐│
││6│low│ │ ││array[0]│││array[1]││ ││array[n-1]││
│└─┴───┘ │ │└────────┘│└────────┘│ │└──────────┘│
└─────────┴───────────────────────┴──────────┴──────────┴─────────┴────────────┘
The number of items is n
, and items just follow.
The important restriction here is that items must be immediate values,
so that iteration remains fast; skipping over an immediate value can be
done by reading the header and deducing how many additional bytes
need to be skipped.
For example, [[42], 1, 2, 3]
is represented as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 61 1f 1b 64 f3 11 12 13 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Note here that the sub-array [42]
is 0x611f1b
(length=1, followed by 0x1f1b
), and then the main array starts at offset=3
with header 0x64
.
Kind=7: map
Byte layout:
┌─────────┬───────────────────────┬─────────┬──────────┬─────────┬──────────┬────────────┐
│offset: 0│offset: 1…n1 │offset: …│offset: … │offset: …│offset: … │offset: … │
├─────────┼───────────────────────┼─────────┼──────────┼─────────┼──────────┼────────────┤
│┌─┬───┐ │LEB128 length extension│┌──────┐ │┌────────┐│… │┌────────┐│┌──────────┐│
││6│low│ │ ││key[0]│ ││value[0]││ ││key[n-1]│││value[n-1]││
│└─┴───┘ │ │└──────┘ │└────────┘│ │└────────┘│└──────────┘│
└─────────┴───────────────────────┴─────────┴──────────┴─────────┴──────────┴────────────┘
Like arrays, maps use n
to encode the number of key/value pairs.
For example, {"a": 42, "b": false}
is represented as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 72 41 61 1f 1b 41 62 00 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
We can see the keys 0x4161
and 0x4162
.
Note that in general, keys don't have to be strings.
Kind=8: tag
Byte layout:
┌─────────┬───────────────────────┬─────────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…│
├─────────┼───────────────────────┼─────────────┤
│┌─┬───┐ │LEB128 length extension│┌─────┐ │
││6│low│ │ ││value│ │
│└─┴───┘ │ │└─────┘ │
└─────────┴───────────────────────┴─────────────┘
The tag is n
, and the value follows immediately as an immediate.
Tagging an array or map or other tag requires the value to be written first, and referred to using a pointer.
Kind=10,11,12: variant
Byte layout:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌──┬───┐ │LEB128 length extension│
││10│low│ │ │
│└──┴───┘ │ │
└─────────┴───────────────────────┘
┌─────────┬───────────────────────┬─────────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…│
├─────────┼───────────────────────┼─────────────┤
│┌──┬───┐ │LEB128 length extension│┌─────┐ │
││11│low│ │ ││value│ │
│└──┴───┘ │ │└─────┘ │
└─────────┴───────────────────────┴─────────────┘
┌─────────┬───────────────────────┬───────────────┬─────────┬─────────┬─────────┬───────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…n2│offset: …│offset: …│offset: …│offset: … │
├─────────┼───────────────────────┼───────────────┼─────────┼─────────┼─────────┼───────────┤
│┌──┬───┐ │LEB128 length extension│LEB128 │┌───────┐│┌───────┐│… │┌─────────┐│
││12│low│ │ │ ││args[0]│││args[1]││ ││args[n-1]││
│└──┴───┘ │ │ │└───────┘│└───────┘│ │└─────────┘│
└─────────┴───────────────────────┴───────────────┴─────────┴─────────┴─────────┴───────────┘
Here we have multiple cases:
high=10
: just like a non-negative integer, but n` is the variant indexhigh=11
: just like a tag, butn
is the variant index. The single argument must be an immediate.high=12
: the variant index isn
, and is followed by a LEB128 number indicating how many arguments follow immediately. The arguments must be immediates.
Kind=14: reference
Layout in bytes:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌──┬───┐ │LEB128 length extension│
││14│low│ │ │
│└──┴───┘ │ │
└─────────┴───────────────────────┘
The reference is, logically, an offset (coming strictly earlier in the byte stream).
If the reference is at offset p
, then the offset it points to is p-n-1
.
For example, the value 42
followed by a reference to it is encoded as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 1f 1b e1 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Kind=15: pointer
Layout in bytes:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌──┬───┐ │LEB128 length extension│
││15│low│ │ │
│└──┴───┘ │ │
└─────────┴───────────────────────┘
This is like a reference, but it's implicitly followed.
For example, the value 42
followed by a pointer to it is encoded as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 1f 1b f1 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Final block
A Twine byte stream can end with a finalizer block, which is used to provide the entrypoint into the DAG. In other words, this entrypoint is the toplevel, outermost value that was serialized via Twine. It is often an array of map.
The last byte of a twine stream with finalizer, at offset p == twine_string.len()-1
, is a u8
delta.
If twine_string[p] == n
, then twine_string[p-n-1]
must be a valid twine value.
The entrypoint, if it fits in less than 256 bytes, can be stored directly this way; otherwise, a pointer to the entrypoint is stored instead.
For a large entrypoint at offset p_e
, then, the finalizer block is:
offset | value |
---|---|
… | … |
p_e | entrypoint value |
… | … |
p_ptr | Pointer(p_e) |
p | (p - p_ptr_e - 1) as u8 |
Why not […]?
There are already many serialization formats. Why invent a new one?
Why not JSON?
JSON is quite unsuitable for our purposes, really. It's slow to read and write (among other things because of string escaping), unless heroic efforts are deployed. It also doesn't specify anything about numbers and lacks the ability to represent byte strings and DAG references.
Why not CBOR?
CBOR is simpler, more efficient, more compact, and faster than JSON, and is one of the major sources of inspiration for Twine. However, it doesn't easily represent sharing within the same file.
Tags can be used to represent pointers (e.g. dasl uses tag 42 to mark external pointers to other content-addressed values), but CBOR libraries generally don't provide ways to read from an offset or to access the offset of a value being encoded for further reference.
Historical note about CBOR-pack
Before Twine, we were using our homegrown cbor-pack, a CBOR value of the shape {"h": [<cbor>, <cbor>, …, <cbor>], "k": <cbor>}
. The key, k
, would be the toplevel value (the entry point); the heap, h
, would contain an array of CBOR values. Tag 6 would be used to refer to values in h
by their index, so that 6(29)
would actually represent the value h[29]
. The downsides are that the whole array h
would have to be parsed into a CBOR AST in memory during deserialization; and it's hard to directly write a stream of bytes during serialization (because h
's array needs to be prefixed by a variable-width length).
Twine provides native sharing, and makes it easy to write bytes to some file or growing buffer during serialization, in one pass.
Why not Fleece?
Fleece is a binary encoding with a data model similar to JSON, but it also allows 0-copy reading and traversal of values, directly from a byte array (either in memory or memory-mapped). It is another big source of inspiration for Twine.
Compared to Fleece, Twine is somewhat simpler to implement, at the cost of some performance for 0-copy reading. Twine maps are not sorted by ascending keys; Twine arrays are not indexable in O(1)
.
However, Twine doesn't need to distinguish between wide and narrow collections because values don't have to fit in 2 or 4 bytes each. The complex bitpacking of Fleece is replaced by 4-bits tags à la CBOR.
Additionally, twine offers explicit references, which give users control over traversal of the DAG if they need it.
Implementations
There are only a few implementations currently:
- twine-data.rs (crate.io);
- a fairly bespoke OCaml implementation, in use in a larger project.
Test files
Here are a few test files, alongside a JSON equivalent:
-
A classic twitter user info file:
- twitter.json (not as small as it could be because of indentation)
- twitter.twine
-
The small example from the overview
-
An example from the JSON spec with a single object:
-
An example from the JSON spec with multiple objects in an array:
About
The format was designed mostly by @c-cube, aided by useful discussions with his colleagues @bronsa and @wintersteiger.