-
Notifications
You must be signed in to change notification settings - Fork 139
tuple serialization format
tuples represent metadata in uniboot. a tuple is a map from utf8 strings (buffers) to a union type containing tuples and buffers. tuples are almost exactly like JSON objects except there is no representation for number, and tuples are allowed to contain cycle references.
in the runtime, each tuple value is associated with a dynamic runtime type (not entirely true, buffer values aren't tagged, which we should fix)
keys which do not exist in a given tuple are assigned a default value of the zero-length buffer
tuple serialization is used to append data into the filesystem log, and eventually as a distributed monitoring and control protocol. the serialization is tied to the notion of an ordered stream to allow for inter-object references. current examples include the filesystem transaction log and remote tcp connections.
the current reference for the tuple serialization format is in uniboot/runtime/tuple.c
for internal values, the tuple serialization uses an integer encoding where each chunk (nominally a byte)of the value contains an additional bit (in the most significant position), which indicates whether there are additional chunks of bits to be included in the value.
for each chunk of size N, the implementation:
- shifts the current value left by (N-1)
- ors the lowermost N-1 bits into the result.
- decodes the next chunk of the highest bit (N) is a 1
- returns the value otherwise
its an entirely arbitrary choice as to whether the first chunk encodes the most or least significant bits, currently we choose most.
each value in the tuple serialization is started by a common header value. the header contains 3 fields: o type (1 bit, either tuple = 1 or buffer =0 o immediate (1 bit, immediate = 1, reference = 0) o a length (varint as above, where the first chunk is the 6 remaining lower bits of the first header byte, and subsequent bytes thereafter)
7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|
type | reference | length continuation | len4 | len3 | len2 | len1 | len0 |
followed by zero or more:
7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|
length continuation | len6 | len5 | len4 | len3 | len2 | len1 | len0 |
if the header reference bit is set, then the header refers to an object which has already been seen. references :
o provide a first level compression, so that things like keys aren't
repeated
o allow mutations to existing tuples within a stream to be represented.
o allow representation of cyclic tuple graphs
both the encoder and the decoder keep a dictionary of mappings between stream indices and runtime objects. on the sender side this is from objects to indices, and on the receiver side for in
there is an open issue as to whether it would be helpful to maintain multiple sets of dictionary indices based on the type. this would result in a slightly more compact representation and provide some internal consistency checking. sets could be just tuples and buffers, but keys could also be maintained disjointly.
it may also be fruitful to waste another bit to exclude an immediate value from being tracked as a reference in some usages.
the serialization for tuples is: (tuple(1), immediate(1), number of values) (tuple(1), reference(0), number of values) for each value a key name (a buffer, immediate or reference), a value which is either a tuple or a buffer.
deletion of a key is accomplished by setting the key to the empty buffer
(buffer(0), immedate(1), length of the string) (buffer(0), reference(0), length of the string)
numbers are currently encoded and decoded as base 10 utf8 strings. it may be more convenient to encode them variable length integers