Overview

Twine-data (twine for short) is a binary format that aims to combine simplicity, compactness, (de)serialization performance, self-descriptiveness, and expressiveness.

Unlike most modern self-descriptive formats (CBOR, MsgPack, JSON, S-expressions, etc.), twine's data model is not a tree. Rather, a twine byte stream represents a DAG of values, which makes it possible to efficiently serialize data structures that exhibit sharing.

A simple example

A very simple example, corresponding to the JSON value {"a": ["hello", ["hello"]], "x": true}, would look like this:

$ cat example.json
{"a": ["hello", ["hello"]], "x": true}

# convert to twine
$ twine-utils from-json example.json -o example.twine

# the serialized bytes
$ hexyl example.twine
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 45 68 65 6c 6c 6f 61 f6 ┊ 62 f8 f3 72 41 61 f5 41 │Ehelloa×┊b××rAa×A│
│00000010│ 78 01 06                ┊                         │x••     ┊        │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

# a debug view
$ twine-utils dump example.twine
[0x0]: "hello"
[0x6]: [@0x0] (len=1)
[0x8]: [@0x0, @0x6] (len=2)
[0xb]: {"a": @0x8, "x": true} (len=2)

We can already see some interesting properties of this Twine file:

  • The string "hello" occurs only once in the file. This is not mandatory, but it's enabled by Twine's sharing feature, namely, pointers (printed as @0x0, @0x6, @0x8, etc. in the dump output).
  • A pointer to a value is the numeric offset of that value in the file.
  • Maps and arrays resemble their JSON counterparts somewhat, but they cannot be nested directly. Instead, an array or map can only refer to another array or map via a pointer.
  • Strings are not quoted, but length-prefixed. We'll see how this works later, but the very first byte in the file, 0x45, is actually a combination of "tag 4" (string) and "length 5", as shown in the sketch after this list.
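As a small illustration of that last point, here is a minimal Python sketch, assuming the example.twine file produced above, that splits a header byte into its tag and length nibbles. Only "tag 4 = string" is established by this example; the full tag table and the encoding of larger lengths come later.

def split_header(byte):
    # high nibble: the tag; low nibble: a small length/value
    return byte >> 4, byte & 0x0F

data = open("example.twine", "rb").read()
tag, length = split_header(data[0])
print(tag, length)             # 4 5 -- tag 4 (string), length 5
print(data[1 : 1 + length])    # b'hello' -- the string payload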

Sharing and DAGs

Twine emerged from work on computational logic tools, where logical terms (mathematical formulas) can grow large, and tend to exhibit a lot of sharing.

SMT-LIB and the need to represent DAGs

This sharing is routine for some classes of tools, such as SMT solvers. SMT solvers do not have a standard binary format, but the standard text format, SMT-LIB, based on S-expressions, relies heavily on let-bindings to represent sharing. Tools that consume SMT-LIB typically turn let-bindings such as (let ((x (+ 1 2))) (+ x x)) into a let-free DAG during parsing (here, (+ (+ 1 2) (+ 1 2)) with perfect sharing); but tools that produce SMT-LIB need to do more work to introduce these let-bindings, pick unique variable names, etc., and the sharing cannot work across terms in distinct contexts.
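To make the consumer side concrete, here is a hypothetical Python sketch (not taken from any actual SMT-LIB parser) of this expansion: let-bound variables are substituted, and nodes are interned ("hash-consed") so that structurally equal subterms become the same object.

# Hypothetical sketch: expanding let-bindings into a hash-consed DAG.
# Terms are tuples like ("+", a, b); leaves are ints or symbols (strings).

def expand(term, env, memo):
    if isinstance(term, str):                # a variable or symbol
        return env.get(term, term)
    if not isinstance(term, tuple):          # a constant
        return term
    if term[0] == "let":                     # ("let", var, bound, body)
        _, var, bound, body = term
        return expand(body, {**env, var: expand(bound, env, memo)}, memo)
    node = tuple(expand(t, env, memo) for t in term)
    return memo.setdefault(node, node)       # interning: reuse equal nodes

t = ("let", "x", ("+", 1, 2), ("+", "x", "x"))
dag = expand(t, {}, {})
print(dag)                                   # ('+', ('+', 1, 2), ('+', 1, 2))
print(dag[1] is dag[2])                      # True: the subterm is shared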

The goal of Twine is to provide a substrate, like JSON or CBOR do in other contexts, to efficiently represent such large DAGs on disk. For this, Twine provides both pointers (implicitly followed on reading by default) and references (deserialized as actual offsets, never implicitly followed).
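As a purely hypothetical illustration of the difference (actual decoder APIs vary per implementation), consider a toy in-memory model where a decoded file is a dict from offsets to values:

from dataclasses import dataclass

@dataclass(frozen=True)
class Pointer:
    offset: int      # followed implicitly when reading

@dataclass(frozen=True)
class Ref:
    offset: int      # surfaced to the caller, never followed implicitly

def read_value(values, offset):
    v = values[offset]
    while isinstance(v, Pointer):    # transparent indirection
        v = values[v.offset]
    return v

values = {0x0: "hello", 0x6: Pointer(0x0), 0x8: Ref(0x0)}
print(read_value(values, 0x6))       # hello -- the pointer was followed
print(read_value(values, 0x8))       # Ref(offset=0) -- kept as an offset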

Self-descriptive format

In A Survey of JSON-compatible Binary Serialization Specifications, Juan Cruz Viotti and Mital Kinderkhedia split JSON-compatible serialization formats into schema-driven and schema-less. Schema-less formats are self-descriptive, meaning it's possible to parse and manipulate them without having access to the underlying schema. Consequently, there exist many JSON editors and viewers, CBOR tools such as cbor.me, and S-expression-centric editor plugins (e.g. in Emacs).

Usefulness for logic

For twine, leaning towards self-descriptiveness means that it's possible to implement tools that work on any DAG. In the context of computational logic, a useful example is _pruning_ of a large DAG into a (potentially much) smaller DAG containing only the values reachable from a given root.

A possible use case is the representation of a proof trace. A trace here means that an SMT solver or theorem prover logs every single logical inference it makes as it searches for a proof. This trace naturally forms a DAG, since each inference can be used to derive multiple further inference steps.

Once the empty clause (false) is found, the inferences used to derive it can be extracted from the trace. This is where pruning comes in: a different byte stream can be built by simply traversing the full trace, starting from the empty clause, and rebuilding a new DAG with only these steps.
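Reusing the toy model from the sketch above (again hypothetical; real tools would operate on the byte stream itself), pruning boils down to a reachability traversal that copies only what the root transitively points to:

from dataclasses import dataclass

@dataclass(frozen=True)
class Pointer:       # same toy Pointer as in the earlier sketch
    offset: int

def children(v):
    # yield the Pointer edges directly contained in a value
    if isinstance(v, Pointer):
        yield v
    elif isinstance(v, (list, tuple)):
        for x in v:
            yield from children(x)
    elif isinstance(v, dict):
        for x in v.values():
            yield from children(x)

def prune(values, root):
    # keep only the values reachable from `root`, preserving sharing
    keep, stack = {}, [root]
    while stack:
        off = stack.pop()
        if off in keep:
            continue                 # already visited: a shared node
        keep[off] = values[off]
        for p in children(values[off]):
            stack.append(p.offset)
    return keep

# The values from the dump above; pruning from the inner array at 0x6
# keeps just it and "hello", dropping the outer array and the map.
values = {
    0x0: "hello",
    0x6: [Pointer(0x0)],
    0x8: [Pointer(0x0), Pointer(0x6)],
    0xb: {"a": Pointer(0x8), "x": True},
}
print(sorted(prune(values, 0x6)))    # [0, 6]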