Merkle DAGs: Distributability

Merkle DAGs inherit the distributability of CIDs. Using content addressing for DAGs has several interesting consequences for their distribution. The first, of course, is that anybody who has a DAG is capable of acting as a provider for that DAG. The second is that when we’re retrieving data encoded as a DAG, like a directory of files, we can take advantage of this to retrieve all of a node’s children in parallel, potentially from a number of different providers! The third is that file servers are not limited to centralized data centers, giving our data greater reach. Finally, because each node in a DAG has its own CID, the DAG it represents can be shared and retrieved independently of any DAG it is itself embedded in.
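To make the parallel-retrieval point concrete, here is a minimal sketch in Python. It is not how IPFS or any real library implements this. The `toy_cid` function is a stand-in that uses a plain hex SHA-256 digest rather than the real CID format, and the `providers` list, `make_node`, `fetch`, and `fetch_children` are all hypothetical names invented for illustration. The providers are in-memory dictionaries that each hold only part of the DAG.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor


def toy_cid(block: bytes) -> str:
    """Toy stand-in for a CID (assumption): hex SHA-256 of the block's bytes."""
    return hashlib.sha256(block).hexdigest()


def make_node(data: str, links: list) -> bytes:
    """Serialize a node deterministically so equal content gives equal bytes."""
    return json.dumps({"data": data, "links": links}, sort_keys=True).encode()


# Build a tiny directory-like DAG: one root linking to three leaf "files".
leaves = [make_node(f"contents of file {i}", []) for i in range(3)]
root = make_node("directory", [toy_cid(leaf) for leaf in leaves])

# Hypothetical providers: each one holds only some of the blocks.
providers = [
    {toy_cid(root): root, toy_cid(leaves[0]): leaves[0]},
    {toy_cid(leaves[1]): leaves[1], toy_cid(leaves[2]): leaves[2]},
    {toy_cid(leaves[0]): leaves[0], toy_cid(leaves[2]): leaves[2]},
]


def fetch(cid: str) -> bytes:
    """Ask each provider in turn; accept the first block whose hash matches."""
    for store in providers:
        block = store.get(cid)
        if block is not None and toy_cid(block) == cid:
            return block
    raise KeyError(f"no provider has {cid}")


def fetch_children(root_cid: str) -> list:
    """Fetch the root, then fetch all of its children concurrently."""
    links = json.loads(fetch(root_cid))["links"]
    with ThreadPoolExecutor() as pool:
        blocks = list(pool.map(fetch, links))  # map preserves link order
    return [json.loads(block)["data"] for block in blocks]


files = fetch_children(toy_cid(root))
print(files)
```

Because every block is checked against the identifier it was requested by, it does not matter which provider answers. That is what lets the children be requested from different places at the same time.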

Case Study: Distributing Large Datasets

As an example, consider the distribution of a large, popular, scientific dataset. Today, on the location-addressed web:

The researcher sharing the file is responsible for maintaining the file server and its associated costs
The same server is likely used to respond to requests all over the world
The data itself may be distributed monolithically, as a single file archive
It’s hard to locate alternative providers of the same data
The data is likely split into large chunks that must be downloaded serially from a single provider
It’s hard for others to share datasets that build on the original data

Merkle DAGs help us alleviate all of these problems. By distributing the dataset as a content-addressed DAG:

Anybody who wants to can help distribute the file
Nodes from all over the world can participate in serving the data
Each part of the DAG has its own CID that can be distributed independently
It’s easy to find alternative providers of the same data
The nodes forming the DAG are small, and can be downloaded in parallel from many different providers
Larger datasets that build on the original can simply link to the original dataset’s root node as a child in a larger DAG

All of this works to promote scalable and redundant access to this important data.
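The last point in the list, embedding the original dataset in a larger DAG, can be sketched as follows. This is an illustrative toy, not a real IPFS API. The identifier is a plain SHA-256 hex digest standing in for a CID, the `put` and `reachable` helpers are hypothetical, and the node contents are placeholder strings.

```python
import hashlib
import json


def toy_cid(block: bytes) -> str:
    """Toy stand-in for a CID (assumption): hex SHA-256 of the block's bytes."""
    return hashlib.sha256(block).hexdigest()


def put(store: dict, data: str, links: list) -> str:
    """Serialize a node, save it under its own hash, and return that hash."""
    block = json.dumps({"data": data, "links": links}, sort_keys=True).encode()
    cid = toy_cid(block)
    store[cid] = block
    return cid


def reachable(store: dict, cid: str) -> set:
    """Collect the identifiers of every node reachable from cid."""
    seen = set()
    stack = [cid]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(json.loads(store[current])["links"])
    return seen


store = {}

# The original dataset: a root linking to two chunks (placeholder contents).
chunk_a = put(store, "chunk A", [])
chunk_b = put(store, "chunk B", [])
original = put(store, "original dataset", [chunk_a, chunk_b])

# A derived dataset links to the original root as one of its children.
annotations = put(store, "annotations", [])
extended = put(store, "extended dataset", [original, annotations])

original_nodes = reachable(store, original)
extended_nodes = reachable(store, extended)
print(len(original_nodes), len(extended_nodes))
```

Linking the original into the extended dataset does not change the original's identifier or any of its blocks. Anyone who already holds the original can therefore serve that part of the larger DAG without downloading anything new.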

Take the quiz!

Which of the following statements about datasets distributed as Merkle DAGs is FALSE?

Anyone who has a copy of the dataset—or a subset of the DAG—can choose to help distribute it.

Because it has a single root node, the full dataset must be retrieved from a single provider.

Different segments of a dataset can be retrieved in parallel from a variety of providers or data centers around the globe.
