Gossamer Update #1: Merkle-Patricia Trie

Gossamer Update #1: Merkle-Patricia Trie

Authored by Elizabeth

Hey everyone! At ChainSafe we've recently completed the Trie milestone of our Polkadot runtime environment. Check it out in our repo! In our previous article, we discussed some background for Polkadot and went over our timeline for the project. In this article, I'd like to explain the structure of our Merkle-Patricia trie and how it's similar and different to that of Ethereum.

The trie is a data structure used to store the state of the network. We have a database as well as a storage trie; why do we need both? The database is used to look up key-value pairs; the trie is used to confirm the integrity of the data in storage. The trie is an arrangement of the data in storage that can be efficiently hashed to give a Merkle root, allowing for easy validation of the state.

The trie has two types of nodes: branches and leaves. Each branch can have up to 16 children located at indices 0 to 15 (or 0x0 to 0xf, in hexidecimal) as well as a value. Each leaf stores only a value and cannot have children. Both types of nodes also have a partial key, which will be referred to within the node as just key. (If you're familiar with Ethereum's trie, their extension nodes have been merged into the key field of the nodes.)

Let's go over a small example. Let's say we have an empty trie and we want to insert the key-value pairs <0x2378, "penguin"> and <0x2bc, "noot"> into it. The resulting trie looks like this:

Notice that both our keys start with 2. The common key is added to the key field of the root branch. The next nibble in the first key is 3, so we add a leaf node containing the value to the branch at index 3. The rest of the key that hasn't been covered, 78, is added to the key field of the leaf node. Similarly for the second key, the next nibble is b, so we add a leaf node to the branch at index b. The leaf node also contains the rest of the key that hasn't been covered up to that point as well as the value.

Let's add another key-value pair to our trie: <0x2bf, "noodle">. At our root branch, we already have a node at 0x2b, and it's not a branch, so we can't just add our new node to it. We need to convert the existing leaf into a branch. The resulting trie looks like this:

We can trace the keys to the values through the two branches. Notice that the new leaf keys are null; this is because their keys totally are covered by branches above them.

Let's say we add <0x2b, "rat"> to the trie. 2 is covered by the first branch, and there's a branch at index b. So, this value will be put in the value field of that branch.

Note that a branch must have at least two children or one child and a value. Otherwise, it can be simplified into either a leaf or combined with another branch. If we remove <0x2378, "penguin"> from the trie, we don't need the root branch anymore; its key can be combined with the branch below it.

We've combined the previous branch's key, 2, and the child index of the branch below it, b, and added it to the key field of the previous child branch. This keeps the trie nice and compact :)

To get the Merkle value of the storage, we define a way to hash every node in the trie that depends on the nodes below it. First, some terms:

nodePrefix : a 2-bit value which says whether the node is a leaf, branch with value, branch without value, or other (eg. empty trie)
encodedKey: the SCALE encoding of the node's key
encodedValue: the SCALE encoding of the node's value
childrenBitmap: a 16-bit value representing whether there is a child at that index or not. eg: a branch has children at index 0 and index 6. then, the childrenBitmap is 1000001000000000
H(N): the Merkle value (blake2b hash) of node N

To get the Merkle value of a node, the node has to be encoded in a specific way, then the encoding is hashed. A leaf node is encoded using nodePrefix | encodedKey | encodedValue . (Note: | denotes concatentation.) A branch node is encoded using nodePrefix | encodedKey | childrenBitmap | H(c_0) ... H(c_15) | encodedValue where H(c_i) is the hash of the encoding of its child at index i. Since a leaf has no children and is always at the bottom of the trie, it doesn't depend on any other nodes. However, the encoding of a branch depends on all its children and thus every node below it. By altering any node in the trie, we end up altering the encoding of the root node, and thus its hash.

You can learn more about the Polkadot trie in the Polkadot Runtime Environment Specification: https://github.com/w3f/polkadot-re-spec

That's all for the trie :)

While we were working on the trie, we also began working on the libp2p networking layer of gossamer, the RPC server, and integrating the WASM runtime. The networking layer allows the client to connect to the Polkadot network and send and receive messages from other nodes. The WASM runtime is the swappable portion of the code that validates blocks and extrinsics. Here's a link to our repo if you'd like to check it out.

If you'd like to get involved, we also have a couple of good first issues open:

We'd love for you to contribute. Thanks for reading!