Wire format
Now that we have immediates and shallow values, let's see how they are encoded.
This part is where Twine is relatively unique. The closest (to my knowledge) to Twine's data model is Fleece, and Fleece makes different design choices to make different tradeoffs — see why not Fleece for more details —, but the way Twine encodes individual values is closest to CBOR.
Values and immediates in Twine always start with a header byte.
The header byte is divided into high
and low
nibbles (most significant and least significant). The high nibble is the kind of the value; the low nibble is a small integer values with kind-dependent meaning:
low
is a small integer. Iflow=15
, the header byte is followed by a LEB128 valuelow_rest
, and the resulting integer islow_rest + 15
low
is exactly as above, with optional LEB128 followup, but represents a length or number of itemslow
is a small integer that represents a small enumeration of values. Each value has a specific meaning.
From now on we refer to these are high
(4 bits), and n
for the small integer obtained from low
and the optional following LEB128 integer.
Here are the possible kinds:
kind | meaning of low | value |
---|---|---|
0 | enum | special scalars: true , false , null |
1 | small int | non-negative integer |
2 | small int | negative integer |
3 | enum | float |
4 | length | UTF-8 text string (length is number of bytes) |
5 | length | byte string (length is number of bytes) |
6 | length | array (length is number of items) |
7 | length | map (length is number of key/value pairs) |
8 | small int | tag (offset follows) |
9 | - | reserved |
10 | small int | variant0 (integer is the variant index) |
11 | small int | variant1 (integer is the variant index) |
12 | small int | variantN (integer is the variant index) |
13 | - | reserved |
14 | small int | reference (integer is the offset delta) |
15 | small int | pointer (integer is the offset delta) |
Kind=0: special scalars
Layout in bytes:
┌─────────┐
│offset: 0│
├─────────┤
│┌─┬───┐ │
││0│low│ │
│└─┴───┘ │
└─────────┘
These are some immediate values, using this table:
high | low | value |
---|---|---|
0 | 0 | false |
0 | 1 | true |
0 | 2 | null |
0 | _ | reserved |
Kind=1: non-negative integer.
Layout in bytes:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌─┬───┐ │LEB128 length extension│
││1│low│ │ │
│└─┴───┘ │ │
└─────────┴───────────────────────┘
The value is the integer n
. Only values in [0…2^63-1]
should
be encoded this way, so that they fit in i64
.
For example, the integer 42
is encoded as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 1f 1b ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Kind=2: negative integer
Layout in bytes:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌─┬───┐ │LEB128 length extension│
││2│low│ │ │
│└─┴───┘ │ │
└─────────┴───────────────────────┘
The value, given n
, is the signed integer -n-1
.
Only values in [-2^63…-1]
should be encoded this way, so they fit
in a i64
.
For example, -2
is:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 21 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
and -27
is:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 2f 0b ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Note that the value we encode is 26
, which overflows 15; so we write low=15
and then 26-15 = 11 = 0x0b
as a single byte in LEB128.
Kind=3: float
Layout in bytes:
┌─────────┬─────────────────┐
│offset: 0│offset: 1..5 │
├─────────┼─────────────────┤
│┌─┬─┐ │f32 little endian│
││3│0│ │ │
│└─┴─┘ │ │
└─────────┴─────────────────┘
┌─────────┬─────────────────┐
│offset: 0│offset: 1..9 │
├─────────┼─────────────────┤
│┌─┬─┐ │f64 little endian│
││3│1│ │ │
│└─┴─┘ │ │
└─────────┴─────────────────┘
The value is a 32 or 64 bits float, encoded as its bits in little endian.
If low = 0
, the precision is 32 bits; if low = 1
, the precision is 64 bits. Other values of low
are reserved.
For example, 42.5 : f64
is represented as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 31 00 00 00 00 00 40 45 ┊ 40 │
└────────┴─────────────────────────┴─────────────────────────┘
This corresponds to 0x4045400000000000
in little endian,
as per https://float.exposed/0x4045400000000000 .
Kind=4: UTF-8 text
Byte layout:
┌─────────┬───────────────────────┬───────────────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…n1+len│
├─────────┼───────────────────────┼───────────────────┤
│┌─┬───┐ │LEB128 length extension│string data │
││4│low│ │ │ │
│└─┴───┘ │ │ │
└─────────┴───────────────────────┴───────────────────┘
For example, "hello world! 😁"
becomes:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 4f 02 68 65 6c 6c 6f 20 ┊ 77 6f 72 6c 64 21 20 f0 │
│00000010│ 9f 98 81 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Note how the header+length is 0x4f02
because the length in bytes is greater than 15.
Kind=5: byte string
Byte layout:
┌─────────┬───────────────────────┬───────────────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…n1+len│
├─────────┼───────────────────────┼───────────────────┤
│┌─┬───┐ │LEB128 length extension│bytes │
││5│low│ │ │ │
│└─┴───┘ │ │ │
└─────────┴───────────────────────┴───────────────────┘
This is like text, but the content is an opaque blob with not constraints.
Kind=6: array
Byte layout:
┌─────────┬───────────────────────┬──────────┬──────────┬─────────┬────────────┐
│offset: 0│offset: 1…n1 │offset: … │offset: … │offset: …│offset: … │
├─────────┼───────────────────────┼──────────┼──────────┼─────────┼────────────┤
│┌─┬───┐ │LEB128 length extension│┌────────┐│┌────────┐│… │┌──────────┐│
││6│low│ │ ││array[0]│││array[1]││ ││array[n-1]││
│└─┴───┘ │ │└────────┘│└────────┘│ │└──────────┘│
└─────────┴───────────────────────┴──────────┴──────────┴─────────┴────────────┘
The number of items is n
, and items just follow.
The important restriction here is that items must be immediate values,
so that iteration remains fast; skipping over an immediate value can be
done by reading the header and deducing how many additional bytes
need to be skipped.
For example, [[42], 1, 2, 3]
is represented as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 61 1f 1b 64 f3 11 12 13 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Note here that the sub-array [42]
is 0x611f1b
(length=1, followed by 0x1f1b
), and then the main array starts at offset=3
with header 0x64
.
Kind=7: map
Byte layout:
┌─────────┬───────────────────────┬─────────┬──────────┬─────────┬──────────┬────────────┐
│offset: 0│offset: 1…n1 │offset: …│offset: … │offset: …│offset: … │offset: … │
├─────────┼───────────────────────┼─────────┼──────────┼─────────┼──────────┼────────────┤
│┌─┬───┐ │LEB128 length extension│┌──────┐ │┌────────┐│… │┌────────┐│┌──────────┐│
││6│low│ │ ││key[0]│ ││value[0]││ ││key[n-1]│││value[n-1]││
│└─┴───┘ │ │└──────┘ │└────────┘│ │└────────┘│└──────────┘│
└─────────┴───────────────────────┴─────────┴──────────┴─────────┴──────────┴────────────┘
Like arrays, maps use n
to encode the number of key/value pairs.
For example, {"a": 42, "b": false}
is represented as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 72 41 61 1f 1b 41 62 00 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
We can see the keys 0x4161
and 0x4162
.
Note that in general, keys don't have to be strings.
Kind=8: tag
Byte layout:
┌─────────┬───────────────────────┬─────────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…│
├─────────┼───────────────────────┼─────────────┤
│┌─┬───┐ │LEB128 length extension│┌─────┐ │
││6│low│ │ ││value│ │
│└─┴───┘ │ │└─────┘ │
└─────────┴───────────────────────┴─────────────┘
The tag is n
, and the value follows immediately as an immediate.
Tagging an array or map or other tag requires the value to be written first, and referred to using a pointer.
Kind=10,11,12: variant
Byte layout:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌──┬───┐ │LEB128 length extension│
││10│low│ │ │
│└──┴───┘ │ │
└─────────┴───────────────────────┘
┌─────────┬───────────────────────┬─────────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…│
├─────────┼───────────────────────┼─────────────┤
│┌──┬───┐ │LEB128 length extension│┌─────┐ │
││11│low│ │ ││value│ │
│└──┴───┘ │ │└─────┘ │
└─────────┴───────────────────────┴─────────────┘
┌─────────┬───────────────────────┬───────────────┬─────────┬─────────┬─────────┬───────────┐
│offset: 0│offset: 1…n1 │offset: n1+1…n2│offset: …│offset: …│offset: …│offset: … │
├─────────┼───────────────────────┼───────────────┼─────────┼─────────┼─────────┼───────────┤
│┌──┬───┐ │LEB128 length extension│LEB128 │┌───────┐│┌───────┐│… │┌─────────┐│
││12│low│ │ │ ││args[0]│││args[1]││ ││args[n-1]││
│└──┴───┘ │ │ │└───────┘│└───────┘│ │└─────────┘│
└─────────┴───────────────────────┴───────────────┴─────────┴─────────┴─────────┴───────────┘
Here we have multiple cases:
high=10
: just like a non-negative integer, but n` is the variant indexhigh=11
: just like a tag, butn
is the variant index. The single argument must be an immediate.high=12
: the variant index isn
, and is followed by a LEB128 number indicating how many arguments follow immediately. The arguments must be immediates.
Kind=14: reference
Layout in bytes:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌──┬───┐ │LEB128 length extension│
││14│low│ │ │
│└──┴───┘ │ │
└─────────┴───────────────────────┘
The reference is, logically, an offset (coming strictly earlier in the byte stream).
If the reference is at offset p
, then the offset it points to is p-n-1
.
For example, the value 42
followed by a reference to it is encoded as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 1f 1b e1 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘
Kind=15: pointer
Layout in bytes:
┌─────────┬───────────────────────┐
│offset: 0│offset: 1…n1 │
├─────────┼───────────────────────┤
│┌──┬───┐ │LEB128 length extension│
││15│low│ │ │
│└──┴───┘ │ │
└─────────┴───────────────────────┘
This is like a reference, but it's implicitly followed.
For example, the value 42
followed by a pointer to it is encoded as:
┌────────┬─────────────────────────┬─────────────────────────┐
│00000000│ 1f 1b f1 ┊ │
└────────┴─────────────────────────┴─────────────────────────┘