Data model
Twine, like CBOR or JSON, offers a specific data model that applications can rely on to encode their specific types into a byte stream. However, where CBOR or JSON offer a straightforward tree-shaped document model with arrays, maps, integers, strings, etc. Twine offers something akin to a heap of shallow values that can point to previous shallow values.
Tree-like data model
Still, it's possible to think of Twine values as being (imperfectly) represented by the following types:
#![allow(unused)] fn main() { /// An offset in a byte array. pub type Offset = usize; pub enum Value { Null, Bool(bool), Int64(i64), Float(f64), /// Text, in UTF8 String(String), /// Binary blob. Bytes(Vec<u8>), /// A variant with 0 arguments. Variant0(VariantIdx), /// A reference to a full value /// (which comes at an earlier offset). Ref(Offset), /// An implicitly followed reference to a full value /// (which comes at an earlier offset). Pointer(Offset), /// Tagged value, like CBOR Tag(Tag, Box<Value>), Array(Vec<Value>), /// Map made of key/value pairs Map(Vec<(Value, Value)>), /// Variant of an enum/sum type Variant(VariantIdx, Vec<Value>), } pub type Tag = u64; /// For efficiency, variants use indices, not names. pub struct VariantIdx(pub u32); }
- We can see the usual CBOR basics, like
null
, booleans, 64 bit floats, 64-bits integers (signed), UTF-8 strings, and byte strings. - We also have CBOR-like tags, with an integer tag and the tagged value.
- Arrays and Maps are similar to CBOR. Ordering in maps is not constrained and should be deserialized exactly in the order it is written on disk.
- Variants correspong to sum types/enums: an integer variant index, with arguments.
- Pointers contain the offset of the pointed-to value. Semantically, a pointer is dereferenced implicitly when convenient; for example, a user trying to read an integer value from offset
p0
, whenp0
contains the valuePointer(p1)
, will try to read an integer from offsetp1
instead. This is useful for sharing of large strings or binary blobs. - References contain the offset of the pointed-to value, and are returned as-is. This is useful to represent DAG edges in a way that's easily accessible by the user, who gets control over when to dereference the offset.
Heap of shallow values
A more correct view, which is used in current Twine implementations, is one that maps offsets in the byte stream to shallow values. Shallow values do not exhibit significant nesting and are quick to read and traverse at deserialization time.
Shallow values can still contain other shallow values if these are immediates (basically, scalars), as this doesn't introduce actual nesting.
Immediates
In terms of Rust code we would have immediates being defined by something like this excerpt of twine-data.rs
:
#![allow(unused)] fn main() { pub enum Immediate<'a> { /// CBOR/JSON null. Null, Bool(bool), Int64(i64), Float(f64), /// Text, in UTF8 String(&'a str), /// Binary blob. Bytes(&'a [u8]), /// A variant with 0 arguments. Variant0(VariantIdx), /// An explicit reference to a full value /// (which comes at an earlier offset). Ref(Offset), /// An automatically followed reference to a full value /// (which comes at an earlier offset). Pointer(Offset), } }
So these are encoded directly, without indirections, and can be stored as-is in arrays, maps, and variant arguments. A few notes:
- The distinction between
String
andBytes
is that the former is UTF8 text, and the latter can be anything (including\0
and invalid unicode — an example would be an image or a cryptographic hash.) Float
is a small simplification here, floats on the wire can be 32 or 64 bits. The serialization API can provide both if desired.Variant0
represents an enum variant that takes no argument. It is essentially an optimization.Pointer
is an offset to a value that comes earlier in the byte stream (you can only point at values already serialized earlier, so their offset is smaller). It is followed implicitly and automatically by deserializers. This is how something like transparent deduplication of strings can be done.Ref
also contains an offset (pointing to a value coming earlier in the byte stream). However it is visible to the user and is a full value on its own.- Skipping over a value should be fast and require no recursion.
Shallow values
Now, as said earlier, a Twine byte stream represents a map from offsets to shallow values. A value can always be referred to by its offset in the stream.
Shallow values are represented by the following (simplified) type:
#![allow(unused)] fn main() { pub enum ShallowValue<'a> { Imm(Immediate<'a>), Tag(Tag, Offset), Array { n_items: usize }, Map { n_items: usize }, Variant { idx: VariantIdx, n_args: usize } } }
So, immediates are valid shallow values using ShallowValue::Imm
;
but we also get the following CBOR-like cases:
Tag
is a pointer to a value (theOffset
) alongside an integer tag.Array
is followed byn_items
immediates values. An array of integers will contain the integers immediately, but an array of maps will only contain pointers to the maps, which have to have been serialized prior to the array.Map
is followed by2*n_items
immediate values, in the orderk1, v1, k2, v2, …
.
In addition, we have variants:
Variant
has a variant index (the enum case) and a number of arguments that follow as immediates, just like an array. A variant with 0 arguments can be more efficiently represented usingShallowValue::Imm(Immediate::Variant0(idx))
.
Summary
So, Twine distinguishes immediates (scalars, including pointers and references) as a subset of shallow values. Shallow values are stored in a byte stream at given offsets and can point to other values that have smaller offsets.