Let's first examine the model we've traditionally used to access data.
URLs (Uniform Resource Locators) are the primary addresses we give each other for data on the centralized web (you know, that plain old web we're all used to). They make it possible for us to create links and connect data on the web, so they serve a valuable purpose. (The web would be pretty lousy without links!) However, URLs are based on the location where data is stored, not on the contents of the resource stored there. We call this location addressing, and it presents us with some problems.
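To see what "location addressing" means in practice, here's a tiny sketch in Python that breaks a made-up URL (example.com is just a placeholder, not part of this lesson) into its parts. Every component describes where to look for the data; none of them describe what the data is.

```python
from urllib.parse import urlparse

# A made-up URL, split into its components. Each piece tells us *where*
# to look, and nothing tells us *what* we'll actually find there.
url = urlparse("https://example.com/photos/summer/photo-042.png")

print(url.scheme)   # "https"       -> how to ask for the data
print(url.netloc)   # "example.com" -> which authority to ask
print(url.path)     # "/photos/summer/photo-042.png" -> where it sits on that server
```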
Most of us have a lot of experience with URLs, and we make some assumptions about them based on our experience. When we see https://www.puppies.com/beagle.jpg, for example, we'll probably guess from the file name and extension that the data stored at that location is an image of a beagle (in JPEG format), but we can't verify this from the URL alone. There may very well be a photo of a chihuahua hiding at beagle.jpg, or even worse, an adorable kitty!
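To make that concrete, here's a small sketch (Python, standard library only) of what it actually takes to learn what's stored at a URL: we have to download the bytes and inspect them ourselves. The URL in the comment is the hypothetical address from the example above, and the check only tells us the file format, not whether the photo really shows a beagle.

```python
import urllib.request

def looks_like_jpeg(url: str) -> bool:
    """Download whatever lives at `url` and check whether it's really JPEG data.

    The filename in the URL is just a label; the only way to know what we
    received is to inspect the content itself (JPEG files begin with the
    bytes FF D8 FF).
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()
    return data[:3] == b"\xff\xd8\xff"

# The hypothetical URL from the example above -- the .jpg extension is no guarantee:
# looks_like_jpeg("https://www.puppies.com/beagle.jpg")
```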
Through the domain name, URLs indicate which authority we should go to for the data. Even though the web is already decentralized in a sense (since anyone can link to anyone else), the links referencing the data are location-based, so the data itself must be centralized on an authority in order for us to find it. We make assumptions about these authorities (or domains), just as we do with file names. For example, we might assume that a file hosted at puppies.com is safer to open than one hosted at evilhacker.com, but we can't be sure of it.
Ultimately, the contents of a file hosted on the centralized web have no direct relationship with its location-based address. If we see a picture of an adorable puppy and are told it's stored on the web, there's no way for us to guess the URL that would lead us to the image. We can determine neither the domain, which tells us who's hosting it, nor the filename.
Since we can't verify what content resides at a particular URL and are dependent on central authorities (and human kindness) to label things as they really are, it's easy for us to get tricked by malicious actors.
It's also easy for 42,000 people to store exactly the same photo of that adorable beagle, but all on different domains and with different filenames, leading to a lot of redundancy. Let's face it, even on our own laptops most of us have accidentally saved the same document as download.pdf and download(01).pdf without realizing it, or saved iterations of the same term paper over and over again with v1 or 2018-12-18 added to the title. The web is a confusing mess of data that's saved multiple times at different URLs, and there's no easy way to tell which items are identical to each other.
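One way to cut through that mess is to compare the contents themselves rather than the names. The sketch below (again Python, standard library only) hashes the raw bytes of two files; identical hashes mean identical content, no matter what the files are called or where they live. The filenames are just the hypothetical duplicates from the paragraph above.

```python
import hashlib

def content_hash(path: str) -> str:
    """Return the SHA-256 hash of a file's raw bytes.

    The hash depends only on the contents -- the filename and the folder
    (or domain) the file lives in play no part in the result.
    """
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# The two accidentally duplicated downloads from the paragraph above:
# if these print the same value, the files are byte-for-byte identical.
# print(content_hash("download.pdf"))
# print(content_hash("download(01).pdf"))
```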
There must be a better way!