Introducing Merkle DAGs

When we represent graphs on a computer, we don’t have the luxury of drawing arrows; we have to encode our data structure by providing a concrete representation of its nodes and edges. Since CIDs can uniquely identify a node, we can use them to express an edge from one node to another. By doing so we create a special sort of DAG known as a Merkle DAG, in honor of the researcher who first described such structures. (Any graph representation that utilizes content addressing in this way is necessarily a DAG, as we’ll soon see).

Let’s take a look at how we can build a Merkle DAG, again using our file directory as an example. The first step is to encode the leaf nodes of our graph—in this case our image files—and give each of them a CID.

For this example, we’ll simplify the representation of these nodes to two attributes: the name of the file, and the data corresponding to the file’s contents. These attributes, bundled together, make up the data of our node, represented below in the orange box.

The label above the node is a simplified representation of the unique CID that’s derived by passing the data of node itself through our cryptographic hashing algorithm. (For an in-depth look at the formatting of a CID, visit our Anatomy of a CID tutorial.) Note that this label is not a part of the node itself.

We can begin building our Merkle DAG by creating its leaf nodes first—one node for every file in our hierarchy—labelling each with its unique CID:

This image depicts a leaf node for each file in the example hierarchy. The nodes correspond to "freshwater.png", "tropical.png", "blowfish.png", "tabby.png", and "black.png", respectively, and are labled "baf...1" to "baf...5".

The node structure for our intermediate nodes—the subdirectories of our hierarchy—has to be a little bit different. Each of these nodes will also contain a name, corresponding to the name of the directory; however, the "content" of a directory node is the list of files and directories it contains, rather than the content of any specific file. We can represent this as a list of CIDs, each of which links to another node in the graph. This list, together with the name of the directory, constitutes the data for these nodes, and from this data we can again derive a CID, as shown below:

A Merkle DAG displaying the "cats" directory. The node for "cats" is labeled "baf...7"; it embeds the CIDs of the nodes corresponding to cat images ("baf...4" for "tabby.png" and "baf...5" for "black.png"). Single-headed arrows are drawn from the "cats" node to the child nodes.

In this figure, the "cats" subdirectory is represented as a very small DAG. The node with the CID "baf...7" links to the nodes that constitute its children, "baf...4" and "baf...5", by embedding their CIDs in its own representation.

Now that we’ve derived representations for both types of nodes in our graph, we can continue to build the graph from the bottom up:

A complete Merkle DAG for the example hierarchy. The node for the "fish" folder is labeled "baf...6", the node for the "cats" folder is labeled "baf...7", and the node for the "pics" folder is labeled "baf...8". All nodes are drawn as described in the previous images.

In a Merkle DAG, each node’s CID depends on every single one of its descendents; should any of those be different, their own labels would also be different. If, for example, the picture of a tabby cat were photoshopped in some way, then its respective node in the graph would receive a different CID. Because the CID of a child node is part of the data of its parent, that parent—in this case the "cats" directory—would itself change, causing it to receive a new CID as well. In turn, the CID of the node for the "pics" directory would also change. This means that we always have to build a DAG from the bottom up: parent nodes cannot be created until CIDs of their children can be determined.

In general, any change to a node in a Merkle DAG is propagated to each of the changed node’s ancestors. However, a change made in one branch of the DAG won’t force a change in the CIDs of nodes in other branches; a node’s CID only changes in response to a change in its own data or that of its descendents. For example, changing "blowfish.png" doesn’t require baf...1, baf...2, baf...4, baf...5, or baf...7 to change.

Note that structures made of nodes that embed the CIDs of their children necessarily cannot contain cycles. The cryptographic functions used in the construction of CIDs and therefore our Merkle DAG make it impossible to describe a "self-referential" path through the graph. This is an important security guarantee: if we traverse a Merkle DAG, we can be certain that we won’t end up in an infinite loop.

Now that we’ve seen how to use CIDs to create structured data, let’s examine some of the properties of Merkle DAGs that we can rely upon!

Take the quiz!

If we were to edit the tabby cat image in our file hierarchy and construct a new Merkle DAG containing the same elements, how many nodes would the new Merkle DAG and previous Merkle DAG differ by?

1 - Only the tabby cat node would change.

3 - The tabby cat node and each of its ancestors would change.

4 - All of the cat pictures and their parent nodes would change.

8 - The entire DAG would need to change.

Oops, you haven't selected the right answer yet!

Feeling stuck? We'd love to hear what's confusing so we can improve this lesson. Please share your questions and feedback.