Overview

Twine-data (twine for short) is a binary format that aims at combining simplicity, compactness, performance of (de)serialization, self descriptiveness, and expressiveness.

Unlike most of the modern self-descriptive formats (CBOR, MsgPack, JSON, S-expressions, etc.), twine's data model is not a tree. Rather, a twine byte stream represents a DAG of values, which is important to efficiently serialize data structures that exhibit sharing.

A simple example

A very simple example, corresponding to the JSON value {"a": ["hello", ["hello"]], "x": true}, would look like this:

$ cat example.json
{"a": ["hello", ["hello"]], "x": true}

# convert to twine
$ twine-utils from-json example.json -o example.twine

# the serialized bytes
$ hexyl example.twine
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 45 68 65 6c 6c 6f 61 f6 ┊ 62 f8 f3 72 41 61 f5 41 │Ehelloa×┊b××rAa×A│
│00000010│ 78 01 06                ┊                         │x••     ┊        │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

# a debug view
$ twine-utils dump example.twine
[0x0]: "hello"
[0x6]: [@0x0] (len=1)
[0x8]: [@0x0, @0x6] (len=2)
[0xb]: {"a": @0x8, "x": true} (len=2)

We can already see some interesting properties of this Twine file:

  • The string "hello" occurs only once in the file. This is not mandatory, but it's enabled by the sharing feature of Twine, namely, pointers (printed as @0x0, 0x6, 0x8, etc in the dump output).
  • A pointer to a value is the numeric offset of that value in the file.
  • The map and arrays resemble JSON somewhat, but cannot be nested. Instead, an array or map can only refer to another array or map via a pointer.
  • Strings are not quoted, but length prefixed. We'll see how this works later, but the very first byte in the file, 0x45, is actually a combination of "tag 4" (string), and "length 5".

Sharing and DAGs

Twine emerged from work on computational logic tools, where logical terms (mathematical formulas) can grow large, and tend to exhibit a lot of sharing.

SMT-LIB and the need to represent DAGs

This sharing is routine for some classes of tools, such as SMT solvers. SMT solvers do not have a standard binary format, but the standard text format, SMT-LIB, based on S-expression, relies heavily on let-bindings to represent sharing. Tools that consume SMT-LIB typically turn let-bindings such as (let ((x (+ 1 2))) (+ x x)) into a let-free DAG immediately (here, (+ (+ 1 2) (+ 1 2)) with perfect sharing) during parsing; but tools that produce SMT-LIB need to do more work to introduce these let-bindings, pick unique variable names, etc. and the sharing cannot work across terms in distinct contexts.

The goal of Twine is to provide a substrate, like JSON or CBOR do in other contexts, to efficiently represent such large DAGs on disk. For this, Twine provides both pointers (implicitly followed on reading by default), and references (deserialized as actual offsets, never implicitly followed).

Self-descriptive format

In A Survey of JSON-compatible Binary Serialization Specifications, Juan Cruzz Viotti and Mital Kinderkhedia split JSON-compatible serialization formats between schema-driven and schema-less. Schema-less formats are self-descriptive, meaning it's possible to parse and manipulate them without having access to the underlying schema. Consequently, there exists many JSON editors and viewers, CBOR tools such as cbor.me, and S-expression centric editor plugins (e.g in emacs).

Usefulness for logic For twine, leaning towards self-descriptiveness means that it's possible to implementat tools that work on any DAG. In the context of computational logic, a useful example is _pruning_ of a large DAG into a (potentially much) smaller DAG containing only values reachable from a given root.

A possible use case is the representation of a proof trace. A trace here means that a SMT solver or theorem prover is going to log every single logic inference it does, as it searches for a proof. This trace naturally forms a DAG since each inference can be used to derive multiple further inference steps.

Once the empty clause (false) is found, the inferences used to derive it can be extracted from the trace. This is where pruning comes in: a different byte stream can be built by simply traversing the full trace, starting from the empty clause, and rebuilding a new DAG with only these steps.

Description

Let's now examine in more details the data format. We're going to discuss the data format first, before diving into encoding into bytes.

Data model

Twine, like CBOR or JSON, offers a specific data model that applications can rely on to encode their specific types into a byte stream. However, where CBOR or JSON offer a straightforward tree-shaped document model with arrays, maps, integers, strings, etc. Twine offers something akin to a heap of shallow values that can point to previous shallow values.

Tree-like data model

Still, it's possible to think of Twine values as being (imperfectly) represented by the following types:

#![allow(unused)]
fn main() {
/// An offset in a byte array.
pub type Offset = usize;

pub enum Value {
    Null,
    Bool(bool),
    Int64(i64),
    Float(f64),
    /// Text, in UTF8
    String(String),
    /// Binary blob.
    Bytes(Vec<u8>),
    /// A variant with 0 arguments.
    Variant0(VariantIdx),
    /// A reference to a full value
    /// (which comes at an earlier offset).
    Ref(Offset),
    /// An implicitly followed reference to a full value
    /// (which comes at an earlier offset).
    Pointer(Offset),
    /// Tagged value, like CBOR
    Tag(Tag, Box<Value>),
    Array(Vec<Value>),
    /// Map made of key/value pairs
    Map(Vec<(Value, Value)>),
    /// Variant of an enum/sum type
    Variant(VariantIdx, Vec<Value>),
}

pub type Tag = u64;

/// For efficiency, variants use indices, not names.
pub struct VariantIdx(pub u32);
}
  • We can see the usual CBOR basics, like null, booleans, 64 bit floats, 64-bits integers (signed), UTF-8 strings, and byte strings.
  • We also have CBOR-like tags, with an integer tag and the tagged value.
  • Arrays and Maps are similar to CBOR. Ordering in maps is not constrained and should be deserialized exactly in the order it is written on disk.
  • Variants correspong to sum types/enums: an integer variant index, with arguments.
  • Pointers contain the offset of the pointed-to value. Semantically, a pointer is dereferenced implicitly when convenient; for example, a user trying to read an integer value from offset p0, when p0 contains the value Pointer(p1), will try to read an integer from offset p1 instead. This is useful for sharing of large strings or binary blobs.
  • References contain the offset of the pointed-to value, and are returned as-is. This is useful to represent DAG edges in a way that's easily accessible by the user, who gets control over when to dereference the offset.

Heap of shallow values

A more correct view, which is used in current Twine implementations, is one that maps offsets in the byte stream to shallow values. Shallow values do not exhibit significant nesting and are quick to read and traverse at deserialization time.

Shallow values can still contain other shallow values if these are immediates (basically, scalars), as this doesn't introduce actual nesting.

Immediates

In terms of Rust code we would have immediates being defined by something like this excerpt of twine-data.rs:

#![allow(unused)]
fn main() {
pub enum Immediate<'a> {
    /// CBOR/JSON null.
    Null,
    Bool(bool),
    Int64(i64),
    Float(f64),
    /// Text, in UTF8
    String(&'a str),
    /// Binary blob.
    Bytes(&'a [u8]),
    /// A variant with 0 arguments.
    Variant0(VariantIdx),
    /// An explicit reference to a full value
    /// (which comes at an earlier offset).
    Ref(Offset),
    /// An automatically followed reference to a full value
    /// (which comes at an earlier offset).
    Pointer(Offset),
}
}

So these are encoded directly, without indirections, and can be stored as-is in arrays, maps, and variant arguments. A few notes:

  • The distinction between String and Bytes is that the former is UTF8 text, and the latter can be anything (including \0 and invalid unicode — an example would be an image or a cryptographic hash.)
  • Float is a small simplification here, floats on the wire can be 32 or 64 bits. The serialization API can provide both if desired.
  • Variant0 represents an enum variant that takes no argument. It is essentially an optimization.
  • Pointer is an offset to a value that comes earlier in the byte stream (you can only point at values already serialized earlier, so their offset is smaller). It is followed implicitly and automatically by deserializers. This is how something like transparent deduplication of strings can be done.
  • Ref also contains an offset (pointing to a value coming earlier in the byte stream). However it is visible to the user and is a full value on its own.
  • Skipping over a value should be fast and require no recursion.

Shallow values

Now, as said earlier, a Twine byte stream represents a map from offsets to shallow values. A value can always be referred to by its offset in the stream.

Shallow values are represented by the following (simplified) type:

#![allow(unused)]
fn main() {
pub enum ShallowValue<'a> {
    Imm(Immediate<'a>),
    Tag(Tag, Offset),
    Array { n_items: usize },
    Map { n_items: usize },
    Variant { idx: VariantIdx, n_args: usize }
}
}

So, immediates are valid shallow values using ShallowValue::Imm; but we also get the following CBOR-like cases:

  • Tag is a pointer to a value (the Offset) alongside an integer tag.
  • Array is followed by n_items immediates values. An array of integers will contain the integers immediately, but an array of maps will only contain pointers to the maps, which have to have been serialized prior to the array.
  • Map is followed by 2*n_items immediate values, in the order k1, v1, k2, v2, ….

In addition, we have variants:

  • Variant has a variant index (the enum case) and a number of arguments that follow as immediates, just like an array. A variant with 0 arguments can be more efficiently represented using ShallowValue::Imm(Immediate::Variant0(idx)).

Summary

So, Twine distinguishes immediates (scalars, including pointers and references) as a subset of shallow values. Shallow values are stored in a byte stream at given offsets and can point to other values that have smaller offsets.

Wire format

Now that we have immediates and shallow values, let's see how they are encoded.

This part is where Twine is relatively unique. The closest (to my knowledge) to Twine's data model is Fleece, and Fleece makes different design choices to make different tradeoffs — see why not Fleece for more details —, but the way Twine encodes individual values is closest to CBOR.

Values and immediates in Twine always start with a header byte. The header byte is divided into high and low nibbles (most significant and least significant). The high nibble is the kind of the value; the low nibble is a small integer values with kind-dependent meaning:

  • low is a small integer. If low=15, the header byte is followed by a LEB128 value low_rest, and the resulting integer is low_rest + 15
  • low is exactly as above, with optional LEB128 followup, but represents a length or number of items
  • low is a small integer that represents a small enumeration of values. Each value has a specific meaning.

From now on we refer to these are high (4 bits), and n for the small integer obtained from low and the optional following LEB128 integer.

Here are the possible kinds:

kindmeaning of lowvalue
0enumspecial scalars: true, false, null
1small intnon-negative integer
2small intnegative integer
3enumfloat
4lengthUTF-8 text string (length is number of bytes)
5lengthbyte string (length is number of bytes)
6lengtharray (length is number of items)
7lengthmap (length is number of key/value pairs)
8small inttag (offset follows)
9-reserved
10small intvariant0 (integer is the variant index)
11small intvariant1 (integer is the variant index)
12small intvariantN (integer is the variant index)
13-reserved
14small intreference (integer is the offset delta)
15small intpointer (integer is the offset delta)

Kind=0: special scalars

Layout in bytes:

┌─────────┐
│offset: 0│
├─────────┤
│┌─┬───┐  │
││0│low│  │
│└─┴───┘  │
└─────────┘

These are some immediate values, using this table:

highlowvalue
00false
01true
02null
0_reserved

Kind=1: non-negative integer.

Layout in bytes:

┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1           │
├─────────┼───────────────────────┤
│┌─┬───┐  │LEB128 length extension│
││1│low│  │                       │
│└─┴───┘  │                       │
└─────────┴───────────────────────┘

The value is the integer n. Only values in [0…2^63-1] should be encoded this way, so that they fit in i64.

For example, the integer 42 is encoded as:

┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 1f 1b                   ┊                         │
└────────┴─────────────────────────┴─────────────────────────┘

Kind=2: negative integer

Layout in bytes:

┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1           │
├─────────┼───────────────────────┤
│┌─┬───┐  │LEB128 length extension│
││2│low│  │                       │
│└─┴───┘  │                       │
└─────────┴───────────────────────┘

The value, given n, is the signed integer -n-1. Only values in [-2^63…-1] should be encoded this way, so they fit in a i64.

For example, -2 is:

┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 21                      ┊                         │
└────────┴─────────────────────────┴─────────────────────────┘

and -27 is:

┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 2f 0b                   ┊                         │
└────────┴─────────────────────────┴─────────────────────────┘

Note that the value we encode is 26, which overflows 15; so we write low=15 and then 26-15 = 11 = 0x0b as a single byte in LEB128.

Kind=3: float

Layout in bytes:

┌─────────┬─────────────────┐
│offset: 0│offset: 1..5     │
├─────────┼─────────────────┤
│┌─┬─┐    │f32 little endian│
││3│0│    │                 │
│└─┴─┘    │                 │
└─────────┴─────────────────┘

┌─────────┬─────────────────┐
│offset: 0│offset: 1..9     │
├─────────┼─────────────────┤
│┌─┬─┐    │f64 little endian│
││3│1│    │                 │
│└─┴─┘    │                 │
└─────────┴─────────────────┘

The value is a 32 or 64 bits float, encoded as its bits in little endian. If low = 0, the precision is 32 bits; if low = 1, the precision is 64 bits. Other values of low are reserved.

For example, 42.5 : f64 is represented as:

┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 31 00 00 00 00 00 40 45 ┊ 40                      │
└────────┴─────────────────────────┴─────────────────────────┘

This corresponds to 0x4045400000000000 in little endian, as per https://float.exposed/0x4045400000000000 .

Kind=4: UTF-8 text

Byte layout:

┌─────────┬───────────────────────┬───────────────────┐
│offset: 0│offset: 1…n1           │offset: n1+1…n1+len│
├─────────┼───────────────────────┼───────────────────┤
│┌─┬───┐  │LEB128 length extension│string data        │
││4│low│  │                       │                   │
│└─┴───┘  │                       │                   │
└─────────┴───────────────────────┴───────────────────┘

For example, "hello world! 😁" becomes:

┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 4f 02 68 65 6c 6c 6f 20 ┊ 77 6f 72 6c 64 21 20 f0 │
│00000010│ 9f 98 81                ┊                         │
└────────┴─────────────────────────┴─────────────────────────┘

Note how the header+length is 0x4f02 because the length in bytes is greater than 15.

Kind=5: byte string

Byte layout:

┌─────────┬───────────────────────┬───────────────────┐
│offset: 0│offset: 1…n1           │offset: n1+1…n1+len│
├─────────┼───────────────────────┼───────────────────┤
│┌─┬───┐  │LEB128 length extension│bytes              │
││5│low│  │                       │                   │
│└─┴───┘  │                       │                   │
└─────────┴───────────────────────┴───────────────────┘

This is like text, but the content is an opaque blob with not constraints.

Kind=6: array

Byte layout:

┌─────────┬───────────────────────┬──────────┬──────────┬─────────┬────────────┐
│offset: 0│offset: 1…n1           │offset: … │offset: … │offset: …│offset: …   │
├─────────┼───────────────────────┼──────────┼──────────┼─────────┼────────────┤
│┌─┬───┐  │LEB128 length extension│┌────────┐│┌────────┐│…        │┌──────────┐│
││6│low│  │                       ││array[0]│││array[1]││         ││array[n-1]││
│└─┴───┘  │                       │└────────┘│└────────┘│         │└──────────┘│
└─────────┴───────────────────────┴──────────┴──────────┴─────────┴────────────┘

The number of items is n, and items just follow. The important restriction here is that items must be immediate values, so that iteration remains fast; skipping over an immediate value can be done by reading the header and deducing how many additional bytes need to be skipped.

For example, [[42], 1, 2, 3] is represented as:

┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 61 1f 1b 64 f3 11 12 13 ┊                         │
└────────┴─────────────────────────┴─────────────────────────┘

Note here that the sub-array [42] is 0x611f1b (length=1, followed by 0x1f1b), and then the main array starts at offset=3 with header 0x64.

Kind=7: map

Byte layout:

┌─────────┬───────────────────────┬─────────┬──────────┬─────────┬──────────┬────────────┐
│offset: 0│offset: 1…n1           │offset: …│offset: … │offset: …│offset: … │offset: …   │
├─────────┼───────────────────────┼─────────┼──────────┼─────────┼──────────┼────────────┤
│┌─┬───┐  │LEB128 length extension│┌──────┐ │┌────────┐│…        │┌────────┐│┌──────────┐│
││6│low│  │                       ││key[0]│ ││value[0]││         ││key[n-1]│││value[n-1]││
│└─┴───┘  │                       │└──────┘ │└────────┘│         │└────────┘│└──────────┘│
└─────────┴───────────────────────┴─────────┴──────────┴─────────┴──────────┴────────────┘

Like arrays, maps use n to encode the number of key/value pairs.

For example, {"a": 42, "b": false} is represented as:

┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 72 41 61 1f 1b 41 62 00 ┊                         │
└────────┴─────────────────────────┴─────────────────────────┘

We can see the keys 0x4161 and 0x4162. Note that in general, keys don't have to be strings.

Kind=8: tag

Byte layout:

┌─────────┬───────────────────────┬─────────────┐
│offset: 0│offset: 1…n1           │offset: n1+1…│
├─────────┼───────────────────────┼─────────────┤
│┌─┬───┐  │LEB128 length extension│┌─────┐      │
││6│low│  │                       ││value│      │
│└─┴───┘  │                       │└─────┘      │
└─────────┴───────────────────────┴─────────────┘

The tag is n, and the value follows immediately as an immediate. Tagging an array or map or other tag requires the value to be written first, and referred to using a pointer.

Kind=10,11,12: variant

Byte layout:

┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1           │
├─────────┼───────────────────────┤
│┌──┬───┐ │LEB128 length extension│
││10│low│ │                       │
│└──┴───┘ │                       │
└─────────┴───────────────────────┘

┌─────────┬───────────────────────┬─────────────┐
│offset: 0│offset: 1…n1           │offset: n1+1…│
├─────────┼───────────────────────┼─────────────┤
│┌──┬───┐ │LEB128 length extension│┌─────┐      │
││11│low│ │                       ││value│      │
│└──┴───┘ │                       │└─────┘      │
└─────────┴───────────────────────┴─────────────┘

┌─────────┬───────────────────────┬───────────────┬─────────┬─────────┬─────────┬───────────┐
│offset: 0│offset: 1…n1           │offset: n1+1…n2│offset: …│offset: …│offset: …│offset: …  │
├─────────┼───────────────────────┼───────────────┼─────────┼─────────┼─────────┼───────────┤
│┌──┬───┐ │LEB128 length extension│LEB128         │┌───────┐│┌───────┐│…        │┌─────────┐│
││12│low│ │                       │               ││args[0]│││args[1]││         ││args[n-1]││
│└──┴───┘ │                       │               │└───────┘│└───────┘│         │└─────────┘│
└─────────┴───────────────────────┴───────────────┴─────────┴─────────┴─────────┴───────────┘

Here we have multiple cases:

  • high=10: just like a non-negative integer, but n` is the variant index
  • high=11: just like a tag, but n is the variant index. The single argument must be an immediate.
  • high=12: the variant index is n, and is followed by a LEB128 number indicating how many arguments follow immediately. The arguments must be immediates.

Kind=14: reference

Layout in bytes:

┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1           │
├─────────┼───────────────────────┤
│┌──┬───┐ │LEB128 length extension│
││14│low│ │                       │
│└──┴───┘ │                       │
└─────────┴───────────────────────┘

The reference is, logically, an offset (coming strictly earlier in the byte stream). If the reference is at offset p, then the offset it points to is p-n-1.

For example, the value 42 followed by a reference to it is encoded as:

┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 1f 1b e1                ┊                         │
└────────┴─────────────────────────┴─────────────────────────┘

Kind=15: pointer

Layout in bytes:

┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1           │
├─────────┼───────────────────────┤
│┌──┬───┐ │LEB128 length extension│
││15│low│ │                       │
│└──┴───┘ │                       │
└─────────┴───────────────────────┘

This is like a reference, but it's implicitly followed.

For example, the value 42 followed by a pointer to it is encoded as:

┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 1f 1b f1                ┊                         │
└────────┴─────────────────────────┴─────────────────────────┘

Final block

A Twine byte stream can end with a finalizer block, which is used to provide the entrypoint into the DAG. In other words, this entrypoint is the toplevel, outermost value that was serialized via Twine. It is often an array of map.

The last byte of a twine stream with finalizer, at offset p == twine_string.len()-1, is a u8 delta. If twine_string[p] == n, then twine_string[p-n-1] must be a valid twine value. The entrypoint, if it fits in less than 256 bytes, can be stored directly this way; otherwise, a pointer to the entrypoint is stored instead.

For a large entrypoint at offset p_e, then, the finalizer block is:

offsetvalue
p_eentrypoint value
p_ptrPointer(p_e)
p(p - p_ptr_e - 1) as u8

Why not […]?

There are already many serialization formats. Why invent a new one?

Why not JSON?

JSON is quite unsuitable for our purposes, really. It's slow to read and write (among other things because of string escaping), unless heroic efforts are deployed. It also doesn't specify anything about numbers and lacks the ability to represent byte strings and DAG references.

Why not CBOR?

CBOR is simpler, more efficient, more compact, and faster than JSON, and is one of the major sources of inspiration for Twine. However, it doesn't easily represent sharing within the same file.

Tags can be used to represent pointers (e.g. dasl uses tag 42 to mark external pointers to other content-addressed values), but CBOR libraries generally don't provide ways to read from an offset or to access the offset of a value being encoded for further reference.

Historical note about CBOR-pack

Before Twine, we were using our homegrown cbor-pack, a CBOR value of the shape {"h": [<cbor>, <cbor>, …, <cbor>], "k": <cbor>}. The key, k, would be the toplevel value (the entry point); the heap, h, would contain an array of CBOR values. Tag 6 would be used to refer to values in h by their index, so that 6(29) would actually represent the value h[29]. The downsides are that the whole array h would have to be parsed into a CBOR AST in memory during deserialization; and it's hard to directly write a stream of bytes during serialization (because h's array needs to be prefixed by a variable-width length).

Twine provides native sharing, and makes it easy to write bytes to some file or growing buffer during serialization, in one pass.

Why not Fleece?

Fleece is a binary encoding with a data model similar to JSON, but it also allows 0-copy reading and traversal of values, directly from a byte array (either in memory or memory-mapped). It is another big source of inspiration for Twine.

Compared to Fleece, Twine is somewhat simpler to implement, at the cost of some performance for 0-copy reading. Twine maps are not sorted by ascending keys; Twine arrays are not indexable in O(1).

However, Twine doesn't need to distinguish between wide and narrow collections because values don't have to fit in 2 or 4 bytes each. The complex bitpacking of Fleece is replaced by 4-bits tags à la CBOR.

Additionally, twine offers explicit references, which give users control over traversal of the DAG if they need it.

Implementations

There are only a few implementations currently:

Test files

Here are a few test files, alongside a JSON equivalent:

About

The format was designed mostly by @c-cube, aided by useful discussions with his colleagues @bronsa and @wintersteiger.