Data model

Twine, like CBOR or JSON, offers a specific data model that applications can rely on to encode their specific types into a byte stream. However, where CBOR or JSON offer a straightforward tree-shaped document model with arrays, maps, integers, strings, etc. Twine offers something akin to a heap of shallow values that can point to previous shallow values.

Tree-like data model

Still, it's possible to think of Twine values as being (imperfectly) represented by the following types:

#![allow(unused)]
fn main() {
/// An offset in a byte array.
pub type Offset = usize;

pub enum Value {
    Null,
    Bool(bool),
    Int64(i64),
    Float(f64),
    /// Text, in UTF8
    String(String),
    /// Binary blob.
    Bytes(Vec<u8>),
    /// A variant with 0 arguments.
    Variant0(VariantIdx),
    /// A reference to a full value
    /// (which comes at an earlier offset).
    Ref(Offset),
    /// An implicitly followed reference to a full value
    /// (which comes at an earlier offset).
    Pointer(Offset),
    /// Tagged value, like CBOR
    Tag(Tag, Box<Value>),
    Array(Vec<Value>),
    /// Map made of key/value pairs
    Map(Vec<(Value, Value)>),
    /// Variant of an enum/sum type
    Variant(VariantIdx, Vec<Value>),
}

pub type Tag = u64;

/// For efficiency, variants use indices, not names.
pub struct VariantIdx(pub u32);
}
  • We can see the usual CBOR basics, like null, booleans, 64 bit floats, 64-bits integers (signed), UTF-8 strings, and byte strings.
  • We also have CBOR-like tags, with an integer tag and the tagged value.
  • Arrays and Maps are similar to CBOR. Ordering in maps is not constrained and should be deserialized exactly in the order it is written on disk.
  • Variants correspong to sum types/enums: an integer variant index, with arguments.
  • Pointers contain the offset of the pointed-to value. Semantically, a pointer is dereferenced implicitly when convenient; for example, a user trying to read an integer value from offset p0, when p0 contains the value Pointer(p1), will try to read an integer from offset p1 instead. This is useful for sharing of large strings or binary blobs.
  • References contain the offset of the pointed-to value, and are returned as-is. This is useful to represent DAG edges in a way that's easily accessible by the user, who gets control over when to dereference the offset.

Heap of shallow values

A more correct view, which is used in current Twine implementations, is one that maps offsets in the byte stream to shallow values. Shallow values do not exhibit significant nesting and are quick to read and traverse at deserialization time.

Shallow values can still contain other shallow values if these are immediates (basically, scalars), as this doesn't introduce actual nesting.

Immediates

In terms of Rust code we would have immediates being defined by something like this excerpt of twine-data.rs:

#![allow(unused)]
fn main() {
pub enum Immediate<'a> {
    /// CBOR/JSON null.
    Null,
    Bool(bool),
    Int64(i64),
    Float(f64),
    /// Text, in UTF8
    String(&'a str),
    /// Binary blob.
    Bytes(&'a [u8]),
    /// A variant with 0 arguments.
    Variant0(VariantIdx),
    /// An explicit reference to a full value
    /// (which comes at an earlier offset).
    Ref(Offset),
    /// An automatically followed reference to a full value
    /// (which comes at an earlier offset).
    Pointer(Offset),
}
}

So these are encoded directly, without indirections, and can be stored as-is in arrays, maps, and variant arguments. A few notes:

  • The distinction between String and Bytes is that the former is UTF8 text, and the latter can be anything (including \0 and invalid unicode — an example would be an image or a cryptographic hash.)
  • Float is a small simplification here, floats on the wire can be 32 or 64 bits. The serialization API can provide both if desired.
  • Variant0 represents an enum variant that takes no argument. It is essentially an optimization.
  • Pointer is an offset to a value that comes earlier in the byte stream (you can only point at values already serialized earlier, so their offset is smaller). It is followed implicitly and automatically by deserializers. This is how something like transparent deduplication of strings can be done.
  • Ref also contains an offset (pointing to a value coming earlier in the byte stream). However it is visible to the user and is a full value on its own.
  • Skipping over a value should be fast and require no recursion.

Shallow values

Now, as said earlier, a Twine byte stream represents a map from offsets to shallow values. A value can always be referred to by its offset in the stream.

Shallow values are represented by the following (simplified) type:

#![allow(unused)]
fn main() {
pub enum ShallowValue<'a> {
    Imm(Immediate<'a>),
    Tag(Tag, Offset),
    Array { n_items: usize },
    Map { n_items: usize },
    Variant { idx: VariantIdx, n_args: usize }
}
}

So, immediates are valid shallow values using ShallowValue::Imm; but we also get the following CBOR-like cases:

  • Tag is a pointer to a value (the Offset) alongside an integer tag.
  • Array is followed by n_items immediates values. An array of integers will contain the integers immediately, but an array of maps will only contain pointers to the maps, which have to have been serialized prior to the array.
  • Map is followed by 2*n_items immediate values, in the order k1, v1, k2, v2, ….

In addition, we have variants:

  • Variant has a variant index (the enum case) and a number of arguments that follow as immediates, just like an array. A variant with 0 arguments can be more efficiently represented using ShallowValue::Imm(Immediate::Variant0(idx)).

Summary

So, Twine distinguishes immediates (scalars, including pointers and references) as a subset of shallow values. Shallow values are stored in a byte stream at given offsets and can point to other values that have smaller offsets.