Merkle DAGs: Deduplication

Merkle DAGs offer a straightforward way of achieving data deduplication, efficiently storing data by encoding redundant sections as links. This can take place on both small and large scales.

For an example of small-scale data duplication, consider the use case of tracking changes files in a directory over time (this is often called versioning). One change we could make to this directory is to delete the "fish" directory, replacing it with a directory called "dogs". These changes result in a new DAG, representing an updated state of the directory. However, all of the nodes representing the "cats" directory and its files are common to both DAGs. Therefore, we can reuse them, as depicted below, where the orange nodes represent nodes that are only used in the original DAG, the green nodes represent those that are common to both, and the blue nodes represent the extra nodes needed for the new state.

This means that we can actually store both versions of the "pics" directory, without taking up twice as much space as it takes to store a single version! Git, a common version control system, uses Merkle DAGs in a very similar way to track changes to source code in software projects!

Deduplication makes an even bigger difference when you expand this concept to a larger scale. For example, consider the use case of browsing the web! When a person visits a web page using a browser, the browser must first download any resources associated with the page, including images, text, and styling markup. You may have noticed that a lot of web pages actually look fairly similar, using slightly different variants of a common theme. Often, there's a lot of redundancy here—the themes are mostly the same, with tweaks here and there in the data that describes them.

On the location-addressed web, each of these themes, despite sharing lots of the same data, is often completely re-downloaded, as there's no easy way to identify this redundancy. In contrast, if Merkle DAGs were used to distribute the themes, they'd all share an identifiable common core that remained unchanged from one theme to the next, and so the browser could be smart enough to avoid downloading this component more than once. Whenever a user visited a new website using a variation of the theme, the browser would only need to download the nodes of its DAG corresponding to the parts that are different, having already previously downloaded the rest! For resources that are used extremely widely on the Internet (think WordPress themes, the Bootstrap CSS library, or common JavaScript libraries), this could result in tremendous reduction in wasteful, needless downloads!

Content addressing enables us to form something of a global, distributed file system. Using Merkle DAGs, you can "store" a massive dataset without really having to store it: as long as you have an Internet connection, you can retrieve the pieces you need at any given point in time. In fact, nobody has to store the entire dataset! CIDs allow us to seamlessly link and build data collections across computers, helping everybody to use storage space more efficiently (though, it should be noted, there are also often times we want to duplicate data for redundancy).

We also aren’t limited in the granularity of our DAGs: rather than working with large nodes that correspond to entire files, we can split a file up into chunks and make a DAG out of them! When we do so, we can often find ways to deduplicate the contents of similar files!

Take the quiz!

Deduplication...

Allows us to save space by eliminating repeated information in data.

Has no tradeoffs, only benefits!

Can be implemented at multiple levels of granularity.

A and C.

All of the above.

Oops, you haven't selected the right answer yet!

Feeling stuck? We'd love to hear what's confusing so we can improve this lesson. Please share your questions and feedback.