Skip to content

IPLD and CID

Samuel Burnham edited this page May 30, 2022 · 4 revisions

IPLD

IPLD Stands for Interplanetary Linked Data. It is described on the repository's README.md as "a series of standards and formats for describing data in a content-addressing-emphatic way."

It is not too dissimilar from data formats like JSON as seen in the list of variants or kinds. What makes the data format linked is that it natively supports:

  1. encoding and decoding (a codec) of a block of data into a unique CID
  2. CID links being embedded in one block pointing to the CIDs for other blocks.

Before we continue describing IPLD, let us take a detour to understand how CIDs can be generated and parsed. In its most basic form a CID is an abstract datatype built out of two parts: the multicodec, and the multihash. Very briefly, the multihash is a self-describing hash of the constituent data structure of the block, and the multicodec is additional data used that helps decode the rest of the block.

CID (Content Identifier)

A CID is a binary identifier calculated from any piece of data. It is an identifier for the content of the data, and not any location to find it. The encoding scheme for CID follows the Type-length-value structure. In particular its binary form begins with the multicodec substring, which encodes various identifiers describing the type of data, followed by a string of bits describing the length of the digest or hashed bits, and finally ending with the digest itself.

The multibase field is then added when the binary CID is converted to a string for a more concise and human-readable format. In order to account for different base encodings, a short (often one-character) prefix is added, e.g. b for base32. In practice, one can distinguish the two different CID versions by checking the first few characters of the CID string. If it starts with Qm, it is version 0 and if it starts with anything else (often bafy...) it is version 1. See the CID cheat sheet below for a worked example.

Returning to IPLD, DAG-PB is one of the IPLD formats that can be encoded using the CID standard. Others include DAG-JSON and DAG-CBOR. The DAG stands for directed acyclic graph, which is precisely the primitive data-structure that is generated in this scheme with links between CIDs.

An explanation on DAGs

Suppose some block of data A contains a CID linking to another block B. For example, in the DAG-JSON standard this can be implemented as A having a primitive JSON sub-object { "/" : "baf..." } where baf... is the CID of block B. On the other hand, it is impossible (this is too strong a claim) for B to contain a link to A, as the CID for A depends on the CID of B itself. Hence the link structure on for DAG-JSON and its related variants is a directed acyclic graph.

CID Cheat Sheet

CIDMultibaseMulticodecMultihash

Cidv0:

<multihash-algo><multihash-length><multihash-digest>

where multihash-length is in bytes

Implicit fields:

  • multibase = base58btc
  • multicodec-version = cidv0
  • multicodec-content-type = dag-pb

Cidv1:

<multibase><multicodec-version><multicodec-content-type><multihash>

where multihash is the same as Cidv0

  • The multibase table contains the different base encodings used as a prefix to the CID's string form.
  • The multicodec table contains codecs for the version, content type, and hash algorithm.
  • Content types are a somewhat ambiguous concept. Usually, they refer to the codec used (e.g. dag-cbor) to convert to IPLD before hashing, but they are also used for other multiformats and use-case specific data (e.g. the Lurk multicodec)

CID Conversion Example

Given some IPLD data: Ipld::List(Ipld::Integer(5)), convert to CID using standard multiformats seen in Yatima.

  1. Serialize the IPLD as bytes using the DAG-CBOR codec, based on CBOR
    List: major type = 4, length = 1 => [4 << 5 | 1]
    Integer: major type = 0, data = 5 => [0 << 5 | 5]
    Result: [0x81, 0x05]
    
    • See Radiya's DAG-CBOR implementation for the full serialization algorithm
  2. Hash the DAG-CBOR bytes using sha2-256
    b8aa8c1e3b7597cd09410468c2d70dcc83a8cb1c3663fd1b2a4b4290daba16d7
    
  3. Add the multihash algorithm and digest length
    sha2-256 = 0x12
    32 bytes = 0x20
    <0x12><0x20><b8aa8c1e3b7597cd09410468c2d70dcc83a8cb1c3663fd1b2a4b4290daba16d7>
    
  4. Add the multicodec version and content type
    v1 = 0x01
    dag-cbor = 0x71
    <0x01><0x71><0x12><0x20><b8aa8c1e3b7597cd09410468c2d70dcc83a8cb1c3663fd1b2a4b4290daba16d7>
    
  5. Encode as RFC4648 base32 and add the multibase prefix
    base32 conversion of the non-digest fields:
    0000 0001 0111 0001 0001 0010 0010 0000 => 00000 00101 11000 10001 00100 01000 00... => `afyrei...`
    Then add base32 "b" prefix
    Result: bafyreifyvkgb4o3vs7gqsqiendbnodomqoumwhbwmp6rwkslikinvoqw24
    
    • Note: The multicodec and multihash fields are unsigned-varints, but none are greater than 0x7f so the binary is unchanged.

Varints increase flexibility and extensibility by allowing for theoretically infinite length in a multicodec or multihash, although there is a practical limit of 9 bytes currently. Each encoded byte has one continuation bit and 7 data bits.

255 => 1111 1111
Encode right to left starting with the LSB => 111 1111
Need a continuation bit to encode the MSB => 1111 1111
Append the more significant bit(s) => 1111 1111 1
Pad with zeros until reaching continuation bit 0 => 1111 1111 0000 0001

300 => 0001 0010 1100
010 1100
1010 1100
1010 1100 000 0010
1010 1100 0000 0010

TODOs:

  • Write something about Merkle DAGs and the actual implementation.
  • Write something about the Protobuf data format.

References: