That is well-written and a potentially good application for crypto-shredding, which has a lot of drawbacks in other contexts.
The issue with crypto-shredding in many applications is that it aggressively pessimizes the software architectures possible in support of the data model. For things like databases and analytics, that feature increases the required hardware footprint -- compute and storage -- by orders of magnitude. The loss of efficiency has little to do with the crypto per se, since hardware support makes that cheap, but with how the change in representation interacts with idiomatic data structures and algorithms for high-efficiency data systems.
Many system internals are essentially a bag of tricks for constructing the maximally compressive representation of the important relationships in a data model while preserving practical tractability of operations over the data model. Crypto-shredding architectures have the effect of "encrypting" before "compressing", rendering the latter moot.
Hard disagree. I need to forward this post to my CISO and have them be impressed with my breadth of reading and depth of knowledge. When we chat over brewskies later, I want to be able to casually ask what they thought about using HMAC-SHA512 instead of GHASH for making salamanders more opaque. But I can't do that because the number of cartoon furries will make them think I'm not totes profesh. Maybe we can get this post in a more professional setting like LinkedIn or something?
The art included in there is not of a bad quality, and they do link to the details of the artists including other art they had made which is also good.
But it does not really help with this article, either. They do not seem to be made specifically for this article; they show things which do not have much to do with it (this is not a criticism of the art or of the article, but of the way that the art is included in this article). It seems unnecessary to me to add such things into this article (it just takes up more space).
I think it has to do with the timing and history of the internet, which carried a bunch of correlated groups (not just those two) forward together.
Both are fairly niche groups with no real geographic concentration. When someone can't cast around for a local club in town, the internet becomes a game-changer for finding the far-flung and diffuse others in the group. Then once you've found those other people (and related furry/security media to consume/publish) computers are again the way to get anything done, as opposed to snail-mail, phone calls or physical travel.
It might have something to do with the furry community being queer majority from early on, and starting in a time where information security was a personal matter of life and death.
How early on is that? What I remember of furry culture from the early-mid 1990s internet was pretty straight, at least to judge from all the announcements of heterosexual marriages, all the porn going around, and prominent internet-furry voices. I was a bit surprised when furry culture saw a resurgence in wider internet culture during the Tumblr era and now was strongly linked to non-cishet themes.
Bi and similar are also queer. A lot of what appears "straight" is by and for people who are some variety of not monosexual. I've actually seen rants from people outside the fandom claiming it's "all straight men" using example art that I knew for a fact was made by someone who was neither straight nor a man.
Computer industries have correlated psychometric selection effects with a bunch of subcultures. Many of these select for e.g. what is colloquially referred to as "autism", although that term refers to several probably-unrelated things in different contexts.
I’m confused about the purpose of truncating the salt.
How does the attacker learn anything about the salt without breaking Argon2id first? It’s not like the salt is included in the Sigsum ledger.
If the attacker did somehow learn the salt (truncated HMAC), it seems like that would immediately break security, regardless of truncation. Suppose the attacker starts with a dictionary of all possible Actor IDs, which seems reasonable, and that they calculate the truncated HMAC for every possibility until it matches. Due to the truncation, the attacker may get one or more false username matches in addition to the true username match. But if the attacker has narrowed down the username to two or three possibilities, they’ve already basically won. And they could just check the possibilities against Argon2id to learn which one it is.
On the other hand, if the attacker doesn’t know anything about the salt, then why would it matter if the salt did uniquely identify a user? How would they “observe that the recent-merkle-root is the same for two messages, but their salts differ”?
It’s quite possible that I’m misunderstanding something. I am not a cryptography expert.
But more broadly, it just seems wrong to use a salted password hash but have the salt be based on the same password that’s being hashed. I don’t think it’s _insecure_, because the salt also includes a high-entropy part (the recent-merkle-root). But I don’t see how it increases security at all. The security should be the same as if the salt only included the high-entropy part.
In other words, the entire “HMAC -> truncate -> Argon2id” process can be seen as a password hash in its own right, where the input consists of the secret username, plus various bits of public information such as the recent-merkle-root. But that’s essentially a form of rolling your own crypto, and it’s unnecessary. Argon2id is already a secure password hash by itself, so you can just feed that data directly to it.
Also, the way OP picked the length of the truncated salt was troubling to me: "Too small, and we improve the economics of precomputation. Too large, and we risk creating a reliably crib for distinct plaintext values" -- then went on to do a tricky computation to find the balance. The inputs of this computation seem to be some assumptions about the number of users on the network and the computing power available to adversaries -- both of which could be wrong, and/or conceivably change by orders of magnitude over a few years.
Past tense, because OP has changed the post (perhaps in response to your comment) and now says "In my earlier design, we needed to truncate the salt and rely on understanding the birthday bound to reason about its security. This is no longer the case, since each salt is randomized by the same random value used in key derivation."
I haven't read the changes in enough detail to say anything about the new design.
> I haven't read the changes in enough detail to say anything about the new design.
Sorry about that, but the entire reason I decided to write these blog posts was to get feedback about different aspects of my proposal before I tag a release. In one bite, it's too easy to get distracted and miss details.
The commitment (Argon2 salt || Argon2 output) is included in the ciphertext blob.
The format of each encrypted attribute is:
h || r || Q || t || c
Where "h" is the version ID (0x01 currently), "r" is a 256-bit random value for key derivation, "Q" is the plaintext commitment, and "t" is the authentication tag. All of these values are fixed-length. Then "c" is the ciphertext, which is variable length.
> If the attacker did somehow learn the salt (truncated HMAC), it seems like that would immediately break security, regardless of truncation.
> Due to the truncation, the attacker may get one or more false username matches in addition to the true username match. But if the attacker has narrowed down the username to two or three possibilities, they’ve already basically won. And they could just check the possibilities against Argon2id to learn which one it is.
That's an interesting point. It might be beneficial to exclude the plaintext, then, given that we already have a high-entropy value (the previous Merkle root).
We could also just use "r" to calculate the salt, since that's right there.
Thank you for this feedback. My primary motivation in writing this blog post (and the ones that follow it which will focus on different design decisions and trade-offs) was to elicit better feedback for the current draft of my proposal.
Not sure I'm a big fan of focusing on concrete cryptographic primitives; I suspect all you need is a commitment scheme and change the leaves to be commitments to the data (instead of the data itself), but I found it quite difficult to ascertain whether this would meet the requirements
> Not sure I'm a big fan of focusing on concrete cryptographic primitives
It's a versioned protocol.
If you use version 1 of the protocol (which is still being specified), you get a hard-coded suite of cryptographic primitives. There is no flexibility in this list.
If we want to offer an upgrade, we'll specify a new version that has a different suite of cryptographic primitives, with a distinct version identifier.
The concrete primitives are just the ones I'm selecting for v1.
That's fine, I just (subjectively) think that it's difficult to reason about security properties when you have a specific primitive (which satisfies a lot of potentially irrelevant properties)
Somehow, when I accessed it first I got a HTML file, but then I tried again and got JSON instead. Why is that?
> Counter Mode (CTR) is a secure way to turn a block cipher into a stream cipher.
Note that block ciphers can be reversible. CTR mode does not require that the block cipher is reversible. However, it can be made non-reversible, which is how ChaCha20 works; it is a block cipher in CTR mode, but with an extra step to make it difficult to reverse.
> Somehow, when I accessed it first I got a HTML file, but then I tried again and got JSON instead. Why is that?
There's something weird going on with the WordPress.com <-> ActivityPub integration. If you access it, especially from a Fediverse server, sometimes it tries to respond with JSON.
Unfortunately, I haven't been able to find the root cause to propose a fix.
> Note that block ciphers can be reversible. CTR mode does not require that the block cipher is reversible. However, it can be made non-reversible, which is how ChaCha20 works; it is a block cipher in CTR mode, but with an extra step to make it difficult to reverse.
100% correct. This is just a terse informal list of assumptions I'm making about the cryptographic primitives. It might seem silly, but I always include my assumptions with my security goals or threat model documents just so nobody has to guess what they are many years later if one of them is proven false.
> This is just a terse informal list of assumptions I'm making about the cryptographic primitives. It might seem silly, but I always include my assumptions with my security goals or threat model documents just so nobody has to guess what they are many years later if one of them is proven false.
For all that the author doesn’t like exotic crypto, this seems like the sort of problem that the sort of fancy cryptography that the Ethereum folks love would actually be appropriate for. The goals are, approximately:
1. There exists a log, and the log follows certain rules, and everyone agrees on a summary of the log from which log entries can be validated and from which the fact that the log follows the rules can be validated.
2. One can validate a record without retrieving the whole log.
3. Appends can be done, and new summaries generated, without access to some of the entries.
Cryptocurrency, as far as I know, does not care about #3, whereas this project does. But maybe similar sorts of exotic crypto could generate a stronger form of shredding: a scheme where deleted PII only persists in the form of a constant or logarithmic size ever changing summary. It’s a bit hard to argue that, say, 10kb of data contains PII from a million users.
> For all that the author doesn’t like exotic crypto
Hah, I think that's a little imprecise.
I actually love exotic crypto; it makes my life interesting, after all. But I will not rush to deploy it into production, and I'm not going to incorporate it into a project designed with such a humble purpose like making DMs encrypted on the Fediverse. The rules I laid out in this post were with that context in mind.
Let's think about the problem qualitatively, without worrying about specific protocol details, KDF's, AES modes, or bit lengths.
Let us say each DB record is a JSON string containing a name and a base58-encoded public key, for example record[57] = '{"name" : "bob", "pubkey" : "1A1zP1e"}'.
If you have some hash function H = sha256 let us say, you could calculate h = H('{"name" : "bob", "pubkey" : "1A1zP1e"}')[:N] and then set record[57] = h (I'll use a 24-bit h = 71b686 for my examples, OP suggests a 48-bit).
Then when Bob asks for his data to be redacted (say, via GDPR-nuke), you can forget 71b686=H('{"name" : "bob", "pubkey" : "1A1zP1e"}'), leaving the DB with just record[57] = 71b686 but this number is meaningless. (You can be extra sure it's meaningless by encouraging / requiring clients to put a high-entropy throwaway "salt" field inside the JSON.) Note, you can't just del record[57] because that breaks the data store's append-only property.
You want to stream updates to your DB to (untrusted, decentralized) clients to host their own replicas (so they can independently verify you're answering your queries honestly). Those client replicas need to handle queries of the form SELECT i WHERE record[i]["name"] == "bob", or at least verify that a purported response to SELECT i WHERE record[i]["name"] == "bob" is correct. This seems...tricky, if you can't give the clients the "name" field because when you can't trust the client replicas to "forget" it when they're "supposed" to.
Unfortunately, OP is clouded with low-level details to the point that I'm honestly not sure if OP has genuinely discovered a clever way to solve the tricky part, or if OP's claimed solution doesn't actually work.
For that matter, I'm not even convinced the problem OP's trying to solve is possible. How can the replicas verify a query result without having access to the redactable data? I don't understand it at all. OP if you're reading, clarification would be most welcome :)
> If you have some hash function H = sha256 let us say, you could calculate h = H('{"name" : "bob", "pubkey" : "1A1zP1e"}')[:N] and then set record[57] = h (I'll use a 24-bit h = 71b686 for my examples, OP suggests a 48-bit).
You're responding to an earlier revision of the blog post, and the design has been updated to not truncate salts this way (nor store them at all), based on feedback on the design.
> This seems...tricky, if you can't give the clients the "name" field because when you can't trust the client replicas to "forget" it when they're "supposed" to.
The ciphertext is the aspect that's committed to the Merkle tree. The plaintext and the key are not committed to the Merkle tree, and can be freely deleted. That's the important bit.
As for your other concern, the replication aspect is specified clearly.
If a non-conformant replica persists copies of the plaintext indefinitely, then you can now use the same legal process to request erasure from the replica. If they do not comply, then you have the same amount of legal recourse you had to begin with. This becomes a lawyer problem, not a technology problem.
> How can the replicas verify a query result without having access to the redactable data?
The thing you can continue to verify, after a record has been forgotten, is all of the other records in the ledger.
Before erasure, that includes the record in question. After erasure, you shouldn't have anything to verify, except that history hasn't been altered beyond the decryption key being zeroed out.
If you assert your right to be forgotten, the system should forget. You can no longer be verified, because as far as the system is concerned, the record in scope does not exist.
But either way, history remains intact. You don't break the Merkle proofs.
You can erase records by zeroing out the key, but you cannot alter them in any other way.
The plaintext commitment prevents a server that's committed to a ciphertext, Enc(k1, Alice), from responding with a plaintext for Bob.
The commitment is a slow hash that's included in the calculation of the authentication tag over the ciphertext (which is what's committed to the Merkle tree). The tree survives.
If this still doesn't make sense, check the docs on GitHub.
> OP if you're reading, clarification would be most welcome :)
Here's a short and oversimplified workflow to help illustrate it:
Alice -> Directory:
[Encrypted Data], [Keys]
Directory -> SigSum:
[Encrypted Data]
Bob -> Directory:
[Alice]?
Directory -> Bob:
[Encrypted Data], [Plaintext], [Merkle Proofs over Encrypted Data]
Is that easier to follow?
If Alice later requests her keys deleted, then you end up with:
Bob -> Directory:
[Alice]?
Directory -> Bob:
[Encrypted Data], nil, [Merkle Proofs over Encrypted Data]
Honestly, I don't get it. Maybe I don't know enough about the fediverse?
Doesn't the fediverse already have a mechanism where each user can set a profile, which everyone else has a common view of and other people can't change? Couldn't a parallel mechanism distribute public keys in a similar way?
What's the issue here - that the server where you have your account can replace your keys?
I know SSL's "certificate transparency" tries to detect replaced and mis-issued certificates after the fact, by putting them all into an immutable log that can be searched by domain name. It doesn't stop the attack, but it at least makes it detectable after the fact. Yet if the entries in the log can be deleted, so there's no longer a log that would prove an attack had happened, wouldn't that prevent the log fulfilling its sole purpose?
> Doesn't the fediverse already have a mechanism where each user can set a profile, which everyone else has a common view of and other people can't change?
Your instance admin can change it. The goal of encrypting DMs is to protect against honest-but-curious instance admins, especially since often multiple have access to the stored DM contents on disk.
> Couldn't a parallel mechanism distribute public keys in a similar way?
How do you validate that the public key you're seeing is your friend's, and not a snooping administrator performing a MitM attack by substituting the public key with theirs?
> What's the issue here - that the server where you have your account can replace your keys?
Yes, partly.
> I know SSL's "certificate transparency" tries to detect replaced and mis-issued certificates after the fact, by putting them all into an immutable log that can be searched by domain name. But if the entries in the log can be deleted, so there's no longer a log that would prove the attack had happened, wouldn't that prevent the log fulfilling its sole purpose?
The goal is to ensure everyone has the same view of which actor has which currently-trusted public keys. Additionally, the deletion of past log entries requires a legal takedown request. What this process actually looks like is out of scope for my specification, but I imagine it will usually involve lawyers.
> The goal is to ensure everyone has the same view of which actor has which currently-trusted public keys.
But isn't the profile/bio already a mechanism that gives everyone the same view of some per-account details, set by the person who controls the account?
Alas I'm not smart enough to understand your threat model, as I'm just a simple small town software developer who don't even know what a Edwards25519 curve or a SUF-CMA is.
If an admin or hacker tampers with a user's bio - i.e. gains control of a user's account - isn't that essentially indistinguishable from a user who lost their phone/forgot their password?
Is the intention that E2EE have a separate, parallel account recovery mechanism with more difficult requirements?
A single Public Key Directory will serve many hosts. (You can further add replication and cross-directory checkpointing between directories, to ensure the whole network is transparent, but that's not our concern right now.)
If a host decides to misbehave, there's an independently hosted record of every change to the state machine that maps Actor IDs to public keys. This state machine is backed by a Merkle tree, which includes a proof of inclusion for every change to its state, thereby making it an append-only ledger.
In my design, you don't have to trust the hosts' admins to behave, because the directory keeps them honest.
You also don't have to trust as single directory to behave, since its primary purpose is to be transparent and independently auditable.
If a user wants to revoke their key, they can, but it's an additional protocol message appended to the directory's transparency log.
If a user loses all their secret keys, one of two things can happen:
1. Their instance admin issues a BurnDown, which allows the user to enroll a new public key as if they were a new user.
2. If the user has issued a Fireproof message, the instance admin cannot issue a BurnDown, and the user is locked out of the protocol. This is a trade-off.
Users can move hosts. Users can publish auxiliary data (e.g., SSH public keys), if the extension for that key type is defined and enabled by the directory. Even an attacker with root access to the host cannot arbitrarily or selectively replace a user's public key with their own. With your suggestion, this substitution is trivial.
And that's basically how the system works. But that's not the focus of this blog post.
This blog post is how to solve the seemingly incompatible design goals of:
1. Offer userbase consistency verification for users' public keys.
2. Don't cryptographically inhibit an EU citizen's rights under Article 17 of the GDPR.
3. Don't introduce backdoors to the transparency log that allow them to perform the same trivial attack that your suggestion enables.
Totally unrelated to the thread, just a comment on something I noticed: You seem to be posting from 2 different accounts: soatok and some_furry. Dunno if that’s on purpose, but in the off chance that it is not: Now you know.
Love you blog, it’s nice to read something written by an actual human being nowadays. I keep several of you articles bookmarked for reference.
> You seem to be posting from 2 different accounts: soatok and some_furry. Dunno if that’s on purpose, but in the off chance that it is not: Now you know.
When folks have interesting comments that warrant a response (or answer, if it contains a question), I very quickly stumble over HN's rate limit. Using this second account works around it.
As a rule, I only use this fallback account when my main (some_furry) is unable to respond to someone's questions.
If a user has to call into your keyserver to get a key before they can start a conversation with a new friend, as you're the sole authority who can decrypt the Merkle tree entries - does that introduce any problems?
And how will you authenticate shredding requests? Does that just happen out-of-band?
> If a user has to call into your keyserver to get a key before they can start a conversation with a new friend, as you're the sole authority who can decrypt the Merkle tree entries - does that introduce any problems?
It would, but they're addressed by the total design.
> And how will you authenticate shredding requests? Does that just happen out-of-band?
Essentially, yes, it's out-of-band. The actual shredding isn't part of the protocol.
The way I see it is, this only matters when the requestor's lawyers issue a takedown for their client's Personal Data (previously referred to erroneously as PII, though the distinction between the two jargony terms wasn't something I ever needed to care about).
If I didn't take the steps outlined in this blog post, the director's operator would be in a precarious legal situation.
But with this specification, the operator just queries their database for the in-scope records and deletes the stored key.
How that's actually implemented in software, and how the operator verifies that the legal notice for the takedown is authentic, aren't problems I have a readily available solution for. There may not even be a one-size-fits-all solution here.
As I've said, my goal isn't "GDPR Compliance". That's not a property I'm advertising. My goal is to create Key Transparency and a PKI without Authorities for the Fediverse.
I simply don't want to make it logistically impossible for someone else to deploy this in the EU.
The issue with crypto-shredding in many applications is that it aggressively pessimizes the software architectures possible in support of the data model. For things like databases and analytics, that feature increases the required hardware footprint -- compute and storage -- by orders of magnitude. The loss of efficiency has little to do with the crypto per se, since hardware support makes that cheap, but with how the change in representation interacts with idiomatic data structures and algorithms for high-efficiency data systems.
Many system internals are essentially a bag of tricks for constructing the maximally compressive representation of the important relationships in a data model while preserving practical tractability of operations over the data model. Crypto-shredding architectures have the effect of "encrypting" before "compressing", rendering the latter moot.
reply