Multihash

Occasionally, a hashing algorithm may be proven to be insecure, meaning it no longer complies with the characteristics that we defined earlier. This has already happened with sha1. With time, other algorithms may prove to be insufficient for content addressing in IPFS and other distributed information systems. For this reason, and in order to support multiple cryptographic algorithms, we need to be able to know which algorithm was used to generate the hash of specific content.

What is the hashing algorithm used in a hash?

So how can we do this? To support multiple hashing algorithms, we use multihash.

Multihash format

A multihash is a self-describing hash which itself contains metadata that describes both its length and what cryptographic algorithm generated it. Multiformats CIDs are future-proof because they use multihash to support multiple hashing algorithms rather than relying on a specific one.

Multihashes follow the TLV pattern (type-length-value). Essentially, the "original hash" is prefixed with the type of hashing algorithm applied and the length of the hash.

Multihash format

type: identifier of the cryptographic algorithm used to generate the hash (e.g. the identifier of sha2-256 would be 18 - 0x12 in hexadecimal) - see the multicodec table for all the identifiers
length: the actual length of the hash (using sha2-256 it would be 256 bits, which equates to 32 bytes)
value: the actual hash value

In order to represent a CID as a compact string instead of plain binary (a series of 1s and 0s), we can use base encoding. When IPFS was first created, it used base58btc encoding to create CIDs that looked like this:

QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjQuPU

Multihash formatting and base58btc encoding enabled this first version of the CID, now referred to as Version 0 (CIDv0), and its initial Qm... characters remain easy to spot.

However, with time, doubts arose about whether this multihash format would be sufficient:

How do we know what method was used to encode the data?
How do we know what method was used to create the string representation of the CID? Will we always be using base58btc?

To address these concerns, an evolution to the next version of a CID was necessary. In the following lessons we'll explore what was added to the specification to lead us to the current CID version: CIDv1.

Take the quiz!

How do CIDs support multiple cryptographic algorithms?

The length of the hash tells us which cryptographic algorithm was used. If the length is 256 bits (32 bytes) then the algorithm used was sha2-256. If the length is 512 bits (64 bytes) then the algorithm used was sha2-512.

They prefix the hash with a unique identifier that flags which algorithm was used to generate the hash.

They prefix the hash with a unique identifier that flags both the algorithm used to generate the hash and the length of the hash value.

Oops, you haven't selected the right answer yet!

Feeling stuck? We'd love to hear what's confusing so we can improve this lesson. Please share your questions and feedback.