First off, I find the concept of content-based addressing dope af. 👀
It's an extremely powerful tool for building services that are fundamentally more performant, scalable and secure. 💪
It's related to immutability, decentralization, data integrity, and more buzzwords...
What the hell are you talking about?
You can think of content-based addressing as fingerprinting for data.
Just like how fingerprints allow you to:
- Identify a person based on their fingerprint
- Refer to a fingerprint as a unique ID for the person
- Tell if two people are the same person based on their fingerprints
- Quickly test to see if a person is in a database using just their fingerprint
Just replace "person" with "data" in the above descriptions and you have a rough overview of what content-based addressing enables.
Put another way, content-based addressing allows you to uniquely and efficiently reference data based on its actual content as opposed to something external like an ID or a URL.
Database-generated IDs, random GUIDs, and URLs are all useful in their own right, but they're not quite as powerful as data fingerprinting (more on this below).
Shut up and show me some code
Let's see how this looks with some real-world code that I've used for reals:
This snippet leaves out the hash function (more on that below), but it does represent the core algorithm pretty clearly.
It creates a content-based hash of myData that is a unique representation of that object based on the keys we care about, [ 'keyFoo', 'keyBar' ].
In short: if two content-based IDs are the same, the data in those objects is the same.
No need for a deep comparison. No need for Redux. Just pure immutable goodness.
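To make that concrete, here is a minimal sketch of the idea, not the exact production code. It assumes a small pickKeys helper standing in for lodash.pick and a stand-in SHA-256 hash so the example runs on its own; the names getFingerprint and pickKeys and the sample values are hypothetical:

```javascript
const crypto = require('crypto')

// Hypothetical helper: keep only the listed keys (similar in spirit to lodash.pick).
// Keys are copied in the order given, so the result has a predictable key order.
function pickKeys(obj, keys) {
  const out = {}
  for (const key of keys) {
    if (key in obj) out[key] = obj[key]
  }
  return out
}

// Stand-in hash: SHA-256 of the JSON form of the input, hex-encoded.
function hash(input) {
  return crypto.createHash('sha256').update(JSON.stringify(input)).digest('hex')
}

function getFingerprint(data, keys) {
  return hash(pickKeys(data, keys))
}

const keys = ['keyFoo', 'keyBar']
const myData = { keyFoo: 'foo', keyBar: 42, updatedAt: '2020-01-01' }
const sameContent = { keyFoo: 'foo', keyBar: 42, updatedAt: '2021-06-15' }
const otherContent = { keyFoo: 'foo', keyBar: 43, updatedAt: '2020-01-01' }

// Same values for the keys we care about, so the fingerprints match.
console.log(getFingerprint(myData, keys) === getFingerprint(sameContent, keys)) // true
// One value differs, so the fingerprints differ.
console.log(getFingerprint(myData, keys) === getFingerprint(otherContent, keys)) // false
```

Comparing two fingerprints is a single string comparison, no matter how large the underlying objects are.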
So how does this actually work?
First, we start with some data, myData. This could be a model from your database or some object containing Redux-like app state, for instance.
Second, we clean our data to ensure that we're only considering parts of the data we actually care about via lodash.pick. This step is optional, but usually you'll want to clean your data like this before proceeding. I've found in practice that most of the time there will be parts of your data that aren't actually representative of the uniqueness of your model (we'll refer to this extra stuff as metadata 😉).
As an example, let's say I want to create unique IDs for all of the rows in a SQL table. Many SQL setups will add metadata to your table like the date an entry was created or modified, and it's unlikely we'd want this metadata to affect our notion of uniqueness. In other words, if two rows were inserted into the table at different times but have the exact same values according to our application's business logic, then we want to treat them as having the same fingerprint, so we filter out this extra metadata.
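Here's a rough sketch of that SQL example. The column names created_at and updated_at, the sample rows, and the helper names omitMetadata and fingerprintRow are all assumptions for illustration; your schema will differ:

```javascript
const crypto = require('crypto')

// Hypothetical metadata columns; real names depend on your schema.
const METADATA_COLUMNS = ['created_at', 'updated_at']

function omitMetadata(row) {
  const cleaned = {}
  // Sort column names so the result does not depend on column order.
  for (const column of Object.keys(row).sort()) {
    if (!METADATA_COLUMNS.includes(column)) cleaned[column] = row[column]
  }
  return cleaned
}

function fingerprintRow(row) {
  const json = JSON.stringify(omitMetadata(row))
  return crypto.createHash('sha256').update(json).digest('hex')
}

// Same business values, inserted at different times.
const rowA = { email: 'a@example.com', plan: 'pro', created_at: '2020-01-01T00:00:00Z' }
const rowB = { plan: 'pro', email: 'a@example.com', created_at: '2020-03-15T12:30:00Z' }

console.log(fingerprintRow(rowA) === fingerprintRow(rowB)) // true
```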
Third, we simplify our cleaned data into a stable, efficient representation that we can store and use for quick comparisons. Most of the time this step involves some sort of cryptographic hash to normalize the way we refer to our content in a unique, concise manner.
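One detail worth a quick sketch is the "stable" part: plain JSON.stringify follows key insertion order, so two objects with the same content can serialize differently. The stableStringify helper below is a hypothetical, simplified version that assumes JSON-like input (no undefined values, functions, or cycles):

```javascript
// Recursively serialize a JSON-like value with object keys in sorted order.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return '[' + value.map(stableStringify).join(',') + ']'
  }
  if (value !== null && typeof value === 'object') {
    const body = Object.keys(value)
      .sort()
      .map((key) => JSON.stringify(key) + ':' + stableStringify(value[key]))
      .join(',')
    return '{' + body + '}'
  }
  // Strings, numbers, booleans, and null.
  return JSON.stringify(value)
}

const a = { keyFoo: 1, keyBar: { x: true, y: [1, 2] } }
const b = { keyBar: { y: [1, 2], x: true }, keyFoo: 1 }

console.log(JSON.stringify(a) === JSON.stringify(b)) // false
console.log(stableStringify(a) === stableStringify(b)) // true
console.log(stableStringify(a)) // {"keyBar":{"x":true,"y":[1,2]},"keyFoo":1}
```

Hashing the stable string instead of the raw JSON means key order no longer leaks into the fingerprint.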
There are some details this explanation is glossing over, but that's the beauty of the NPM ecosystem – we don't have to understand all the bits & pieces to take advantage of their abstractions.
Let's hash this thing out
Up until now, we've glossed over the hashing aspect of things, so let's see what this looks like in code:
Note that there are lots of different ways you could define your hash function. This example uses a very common SHA256 hash function and outputs a 64-character hex encoding of the results.
Here is an example output fingerprint:
Here is an alternative hash implementation that uses the Node.js crypto package directly:
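A minimal sketch of what a crypto-based hash could look like. It takes a string or Buffer, so objects are assumed to be serialized first; the function name hash mirrors the text, the rest is an assumption:

```javascript
const crypto = require('crypto')

// SHA-256 of a string or Buffer, returned as a 64-character hex string.
function hash(input) {
  return crypto.createHash('sha256').update(input).digest('hex')
}

console.log(hash('abc').length) // 64
```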
Both of these hash implementations are equivalent for our purposes.
The most important thing to keep in mind here is that we want to use a cryptographic hash function to output a compact, practically unique fingerprint that changes if our input data changes and remains the same if our input data remains the same.
So where should I go from here?
Once you start thinking about how data can be uniquely defined by its content, the applications are really endless.
Here are a few use cases where I've personally found this approach useful:
- Generating unique identifiers for immutable deployments of serverless functions at Saasify. I know ZEIT uses a very similar approach to optimize their lambda deployments and package dependencies.
- Generating unique identifiers for videos based on the database schema we used to generate them at Automagical. If two videos have the same fingerprint, they should have the same content. One note here is that it's often useful to add a version number to your data before hashing since changes in our video renderer resulted in changes to the output videos.
- Caching Stripe plans and coupons that have the same parameters across different projects and accounts at Saasify.
- Caching client-side messages and HTTP metadata in a React webapp for Eko.
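Two of these ideas, adding a version number before hashing and caching by fingerprint, can be sketched together. Everything here is hypothetical (RENDERER_VERSION, render, getOrRender, and the sample params); it's a toy in-memory cache, not how any of the projects above actually do it:

```javascript
const crypto = require('crypto')

// Hypothetical version tag; bump it whenever the renderer's output changes.
const RENDERER_VERSION = 2

function fingerprint(params, version = RENDERER_VERSION) {
  // Params are assumed to be built with a consistent key order.
  const json = JSON.stringify({ version, params })
  return crypto.createHash('sha256').update(json).digest('hex')
}

const cache = new Map()
let renderCount = 0

// Hypothetical stand-in for an expensive step such as rendering a video.
function render(params) {
  renderCount += 1
  return `rendered:${params.title}`
}

function getOrRender(params) {
  const key = fingerprint(params)
  if (!cache.has(key)) cache.set(key, render(params))
  return cache.get(key)
}

getOrRender({ title: 'intro', duration: 30 })
getOrRender({ title: 'intro', duration: 30 }) // cache hit, no second render
console.log(renderCount) // 1

// Same params under a different version get a different fingerprint.
console.log(fingerprint({ title: 'intro' }, 1) === fingerprint({ title: 'intro' }, 2)) // false
```

Because the version is part of the hashed content, bumping it invalidates every old cache entry without any explicit cleanup logic.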
If you enjoy this stuff, I would recommend checking out:
- The power of content-based addressing - An awesome intro to the topic with a focus on content identifiers (CIDs) as they're used in IPFS.
- Multihashes - Self-describing hashes. 💪
- Merkle trees - A recursive data structure built on top of content-based hashes.
- Rabin fingerprinting - A polynomial-based fingerprinting scheme that powers efficient string searching algorithms like Rabin-Karp.
- IPFS - InterPlanetary File System.
- libp2p - Modular building blocks for decentralized applications.
- Saasify - An easier way for devs to earn passive income... Oh wait, that's my company and it's not really related to content-based addressing but cut me some slack haha 😂