Why I Have Settled on XChaCha20+Blake3 for AEAD (mccarty.io)
127 points by vlmutolo on Nov 29, 2021 | 85 comments



As much as we see AES-GCM, this particular observation has concerned me:

"The GCM slide provides a list of pros and cons to using GCM, none of which seem like a terribly big deal, but misses out the single biggest, indeed killer failure of the whole mode, the fact that if you for some reason fail to increment the counter, you're sending what's effectively plaintext (it's recoverable with a simple XOR). It's an incredibly brittle mode, the equivalent of the historically frighteningly misuse-prone RC4, and one I won't touch with a barge pole because you're one single machine instruction away from a catastrophic failure of the whole cryptosystem, or one single IV reuse away from the same. This isn't just theoretical, it actually happened to Colin Percival, a very experienced crypto developer, in his backup program tarsnap. You can't even salvage just the authentication from it, that fails as well with a single IV reuse ("Authentication Failures in NIST version of GCM", Antoine Joux)."

https://www.metzdowd.com/pipermail/cryptography/2016-March/0...


> the single biggest, indeed killer failure of the whole mode, the fact that if you for some reason fail to [implement a core and simple feature of the mode], you're sending what's effectively plaintext (it's recoverable with a simple XOR). It's an incredibly brittle mode

I think most people would agree that if you omit parts of a cryptographic algorithm, it is not surprising that it's broken.

Also, who even implements GCM by hand? Who's supposed to forget to increment that counter?

> the equivalent of the historically frighteningly misuse-prone RC4

Libraries let you call RC4(data, key) and it would work as designed. It was up to you to realize that you shouldn't reuse the key. With GCM, incrementing the counter is done in the library and not exposed for you to forget.

Don't roll your own crypto is the only takeaway here.


> With GCM, incrementing the counter is done in the library and not exposed for you to forget.

GCM has two counters: the message block counter (4 bytes) and the message counter / nonce (12 bytes). The former is inside the crypto library; the latter is supplied by the client and can be mismanaged like any other IV.
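To make that split concrete, here's a minimal sketch assuming the RustCrypto aes-gcm crate's 0.10-era API (not any particular application's code): the per-block counter never appears anywhere, but the 96-bit message nonce is entirely the caller's to manage.

    use aes_gcm::{
        aead::{Aead, AeadCore, KeyInit, OsRng},
        Aes256Gcm,
    };

    fn main() {
        // The per-block CTR counter lives inside the library; you never see it.
        let key = Aes256Gcm::generate_key(OsRng);
        let cipher = Aes256Gcm::new(&key);

        // The 96-bit message nonce is supplied by the caller, and reusing it
        // under the same key is the catastrophic failure mode being discussed.
        let nonce = Aes256Gcm::generate_nonce(&mut OsRng);

        let ciphertext = cipher.encrypt(&nonce, b"hello".as_ref()).unwrap();
        let plaintext = cipher.decrypt(&nonce, ciphertext.as_ref()).unwrap();
        assert_eq!(plaintext, b"hello".to_vec());
    }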


Isn't this what happened to the PS3? The documentation said "use a random number" and they calculated it once and put it in as a constant instead of... ya know... using a random number (each time)?


On the PS3 they lacked a real random number in the ECDSA computation. That effectively broke the signature scheme and actually made it possible to recover the private key.

For more info see for example:

* https://www.cs.uaf.edu/2012/fall/cs441/students/sc_ps3.pdf

* A presentation at CCC about how hackers broke the different security features of the PS3, including ECDSA: https://media.ccc.de/v/27c3-4087-en-console_hacking_2010#t=2...



I think Gutmann is referring to the GCM nonce when he talks about the "counter"; GCM has a famously short nonce, so the common advice is to use an incrementing counter as your nonce rather than risk colliding random nonces. In that context, "forgetting to increment the counter" is a huge footgun, not a "core and simple feature of the mode".

(You can't really forget to increment the underlying CTR counter; your system won't work.)


I was actually thinking of a rowhammer attack on the counter.


That might be a valid thing to defend against by using specific algorithms that protect against it, but that's distinct from what you wrote about it being easy to forget to implement a crucial part of it. Everything in cryptography is easy to get wrong, that's why we use tested libraries. But if there is new advice on protecting against attacks like spectre, rowhammer, and others that exploit hardware problems, I'd be interested to learn more. If you have a blog or paper where someone explores the fragility of different algorithms in that regard, let me know!


That submission illustrates a common, bizarre thinking pattern seen with programmers. "Everything else sucks" so they adopt solution X. Then other programmers start to criticise solution X. "Monoculture", "fanboyism" or whatever the lingo. Meanwhile everyone forgets about why "everything else sucks". Other programmers will even try to counter the allegation that everything else sucks, as if it were not true.

"Everything else sucks" is why I "settle" for volunteer-supported, free, UNIX-like OS. It is why I "settle" for using djb's software (the stuff IETF would never approve of, not just the crypto work). It does not mean I think that these solutions are ideal in any objective sense. It means "everything else sucks". And when these solutions help me avoid the pain of having to use "everything else", then it stands to reason they will appear not only better than everything else but even high quality in an objective sense. (To use Gutmann's analogy, the oasis looks refreshing, regardless of the actual water quality.)

Another example is the discussion of the k language/interpreter on HN. Commenters would focus on criticisms of k instead of considering why other solutions cannot achieve the same performance. As one person put it, the question should not be "Why is this so concise and fast?" but "Why is everything else so bloated and slow?" (paraphrasing)


If this particular failure mode is so easy to crack, why is testing specifically for this failure mode not acceptable (ie via continuous integration testing)?


SIV modes fix this.


SIV mode (mostly) fixes this by (usually) jettisoning most of the benefit of using GCM in the first place, which is that it is very fast when hardware-accelerated.

Compared with large random nonces, I think nonce-misuse-resistance is somewhat oversold.


No it doesn't. SIV is almost as fast as AES-GCM when implemented with AES-CTR and GMAC (in my benchmarks on x64 and Apple M1). It requires two passes to encrypt but on the second pass the packet is in L1 cache already.

EDIT: nonce misuse resistance is IMHO a robustness thing. Yes, large nonces give you similar bounds, but there are numerous banana peels like cloning VMs, live migration, IoT devices with bad PRNGs, incorrectly implemented concurrent counters, concurrent counters that work fine on x64 and fail on ARM, etc. that can cause nonce reuse.

Sudden death from nonce reuse just gives me the willies. Kind of a big footgun.


I honestly don't care if you want to use straight SIV, AES-GCM-SIV, or some hand-rolled in-between SIV. SIV is fine. But it's not GCM.


To your additions to the comment above: the point of large nonces is that you always just use random nonces; you retain the RNG point of failure (which you almost certainly have anyways) but you don't have the "concurrent counters" problem anymore, since you're never going to use counters as a nonce (or, if you do, you do so in addition to a 128 bit random nonce).

A nice property of the XChaCha+Blake scheme proposed here is that Blake doesn't have any nonce dependency at all, and most of the terror of nonce misuse comes from how the MAC deals with repeated nonces, not how the underlying stream cipher does (though, obviously, a reliably repeating nonce gives you a pretty devastating confidentiality failure).


That may be true, but does it really matter? The name of the cryptosystem seems pretty irrelevant unless you have some compliance checkbox to check.


It makes the "SIV fixes this" a non-sequitur. "SIV fixes this" is like saying "not using GCM fixes this" which (for avoiding a problem in GCM) is somewhat uninteresting.


Actually you're right. I was equivocating because there are SIV modes that use GMAC, the MAC component of GCM, for its performance benefits on hardware.


I don't like that the author seems to be confusing Blake3's MAC mode with HMAC. Just because it's a hash with a key that's safe to use as a MAC (and acts as a PRF) doesn't mean it's HMAC, there's no double hashing nor is there the use of the HMAC-specific padding constants. HMAC is a particular standard, not any hash-based MAC.

I agree that XChaCha20+Blake3 is a good construction.

It's also "easy" to make an XChaCha20-Blake3-SIV construction, by using the Blake3 MAC of the plaintext as the Nonce for XChaCha20 as well as the tag. That makes it a deterministic key-committing authenticated encryption with associated data. If you stick a 256-bit random nonce in the AAD, then you have a full Authenticated Hedged Encryption with Associated Data (AHEAD) system. The disadvantage is (as always with SIV-like constructions) that it's two-pass, so not suitable for streaming encryption, but still good for quite a few uses and a bit safer than non-deterministic AEADs.


That's a valid point on wording, one I tried to tread carefully around with the "effectively an HMAC" phrasing. This post was written for a more general audience and I didn't want to go too far into the details; I wrote it to have something to point to when the inevitable questions roll in.

Balancing the lies-to-children is always a hard job.


What's your perspective on reusing the same key between XChaCha20 and Blake3 invocations? While not necessarily a footgun in practice, reusing the same key between two contexts is generally avoided as a matter of good cryptographic hygiene.

You could either require a single larger key that's (effectively) a key for each cipher concatenated with one another, or use one 256-bit key from which you derive independent keys via a pass through the hashing function. I personally prefer something closer to the former, since it requires less cryptography in general and there aren't many situations where you're sweating the size difference between 512-bit keys vs. 256-bit keys.

I suspect most systems implementing XChaCha20+Blake3 just reuse the key, which is probably fine... but it leaves a bad taste in my mouth and seems like just another potential risk that's not all that difficult to sidestep. And we're going to feel really dumb one day if it turns out there is a way to exploit that kind of key reuse after all.


In the library I wrote this post for, the encryption and MAC keys are separate, with both keys randomly generated and then persisted to disk encrypted with an Argon2-produced key (with some tertiary validation), which is what I generally recommend. The use of Blake3 as the MAC _should_ avoid most of the problems with using a key that's derived from a password, but I very much like not leaving that door open.


RFC 8439 specifies an interesting method where the Poly1305 key is itself generated using the ChaCha20 block function:

https://datatracker.ietf.org/doc/html/rfc8439#section-2.6
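For the curious, a minimal sketch of that trick using the chacha20 crate (the function name is mine): the one-time Poly1305 key is just the first 32 bytes of the keystream block at counter 0, and the message itself is encrypted starting at counter 1.

    use chacha20::cipher::{KeyIvInit, StreamCipher, StreamCipherSeek};
    use chacha20::ChaCha20;

    // Returns the one-time Poly1305 key and encrypts `buf` in place,
    // following the RFC 8439 layout: block 0 feeds the MAC key,
    // blocks 1.. carry the message keystream.
    fn encrypt_and_derive_poly_key(key: [u8; 32], nonce: [u8; 12], buf: &mut [u8]) -> [u8; 32] {
        let mut cipher = ChaCha20::new(&key.into(), &nonce.into());

        // Counter 0: encrypting 32 zero bytes yields the raw keystream,
        // which becomes the one-time Poly1305 key.
        let mut poly_key = [0u8; 32];
        cipher.apply_keystream(&mut poly_key);

        // Skip the unused half of block 0 so encryption starts at counter 1.
        cipher.seek(64u64);
        cipher.apply_keystream(buf);
        poly_key
    }

The RFC also pins down how the AAD, ciphertext, and their lengths are padded and fed into Poly1305, which this sketch doesn't cover.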


Wouldn't you normally just spool separate encryption and authentication keys out of HKDF for this?


Sure, that's what I was getting at with "a pass through the hashing function" though I could have been more specific. Again, I slightly prefer just using a single larger two-part key (or two separate keys) if reasonable, mostly out of a desire to require the minimum necessary cryptographic constructs.

But the point of my question wasn't necessarily to debate those two approaches as much as it was to politely bring up the detail of using independent keys for each cipher. I'm not sure I'd necessarily call it a footgun as much as I would call it proper hygiene.


I guess I'm just saying it would be super weird to see a modern cryptosystem share keys for any pair of constructions, because the standard pattern here is to start with some root secret (usually a DH agreement, maybe with a PSK mixed in) and then just HKDF out all the specific secrets needed to drive the rest of the system. I don't know of something that could really blow up if you used a Blake KMAC with the same key as ChaCha20, but if you saw that in a real design, you'd assume other things were wrong with it, right?


I would, but it was unstated in the original article which is why I wanted to bring it up.

XChaCha20+Blake3 as an encrypt-then-mac AEAD is just simple enough that people might think to wire it up themselves.


HKDF requires HMAC, and you've got Blake3's KDF mode as well. Personally I'd get 2 keys using Blake3's KDF, then use one for Blake3's MAC and the other for XChaCha20.
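Something like this, for instance (a sketch using the blake3 crate's derive_key; the context strings are made up for illustration):

    // Derive two independent subkeys from one root key. Distinct, hard-coded
    // context strings keep the derived keys domain-separated.
    fn split_keys(root_key: &[u8]) -> ([u8; 32], [u8; 32]) {
        let enc_key = blake3::derive_key("myapp 2021-11-29 XChaCha20 encryption key", root_key);
        let mac_key = blake3::derive_key("myapp 2021-11-29 BLAKE3 MAC key", root_key);
        (enc_key, mac_key)
    }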


I'm getting myself into a lot of trouble today randomly throwing around terms, as if HKDF meant "any secure hash-based KDF, like HKDF but, you know, with Blake3".


Yep. I know you know better, but since it's essentially the same carelessness I first complained about I figured I should be consistent and complain again!

One of the nice things about Blake3 is that it does have a KDF mode built in. And a MAC mode. And there's a (stalled) proposal to build in an AHEAD mode, though IMO it needs more academic security analysis (IIRC that's part of why it stalled). It might not be as interestingly innovative as the sponge construction Keccak introduced, but it's a very versatile primitive with excellent software performance.


No, you're 100% right to call it out! I went into this thinking that much of this discussion was pedantic and missed the point that a generic composition of a good stream cipher and a hash MAC has benefits --- and leave it mostly agreeing that the formalisms that people seemed hung up on are indeed pretty important.


The formalisms are important. I wish they weren't, and that this stuff was all ready to go and easy to use correctly. But sadly there's no such commonly available system; everything has tradeoffs, and many of them are subtle.

The SIV modes are great, they're much easier to use. When used in a full AHEAD construction (where you stick a random nonce in the AAD of a deterministic AEAD) you get nearly a "best of both worlds": non-deterministic encryption but without the catastrophic failure properties of something like GCM. But they're inherently 2-pass. So the user might have to deal with "chunking" their data, which can be annoying if they're streaming, etc. And since there are two passes over the plaintext with two different algorithms (one for the MAC to make the SIV, one to encrypt), that's two "traces" a side-channel-observing attacker has, which can make some attacks more powerful.


I'd just say it's 'effectively a MAC ("eg HMAC")'. Small but important distinction.


Meanwhile, XChaCha20 is still stuck at the stage of being a draft at the IETF, with no apparent movement for close to two years now[0]. I'd like to see it standardized, considering its ubiquity possibly dwarfs that of pure ChaCha20.

[0] https://datatracker.ietf.org/doc/draft-irtf-cfrg-xchacha/


On the other hand: who cares? IETF standards only matter if we let them matter. No cryptography engineer believes that XChaCha20 is problematic for lack of an RFC.


That might be the case when you already know a lot about cryptography. But when you're a newbie, an IETF standard which hasn't been marked as obsolete is a safe choice, in this case ChaCha20+Poly1305 (RFC 8439). Besides, if you don't have a standard document, there's the risk that your XChaCha20 is not the same as someone else's XChaCha20, leading to interoperability issues.


The IETF is not providing as much safety for implementors as your comment implies that it does. Cryptography engineering is dangerous, even for standardized crypto (the implementation track record for IETF-standardized cryptography is not good).


PGP has RFCs (2440 and 4880) but people still complain about it.

https://datatracker.ietf.org/doc/html/rfc4880


Because nothing in the PGP RFCs matters; all that matters is what the most widely installed version of GnuPG accepts. This is part of what I'm talking about when I'm derisive of IETF cryptography standards, which are, generally, a force for evil.


That makes me wonder if people are complaining about OpenPGP or instead about GnuPG? It probably doesn't really matter, but many newer encryption solutions ridicule PGP in general. Now I wonder if they mean to ridicule GnuPG instead?


It's all the same thing. People talk about the standard as if it's an important thing by itself, but it has literally no meaning outside the installed base. You can't rehabilitate OpenPGP by observing that the standard allows you to implement it safely, because the installed base is what controls.


IETF standards matter to IETF protocols, which will then choose from what the IETF has already standardized because trying to push your own RFC through is hard enough as-is; you don't really want to get into a kerfuffle with the CFRG at the same time.


TLS and JWT are both "IETF protocols", but the IETF police won't arrest you if you slot XChaCha20 into either of them.

This is actually how it's supposed to work: RFCs are meant, at least somewhat routinely, to trail implementations. "Rough consensus and running code" used to be the motto. Get XChaCha20 deployed, and then let the IETF standardize it because they have to. That's what happened with Curve25519.


Ah, but lots of middle manager types do.


If middle managers are dictating your cryptography decisions, (1) you've got bigger problems and (2) you're using AES, not Chapoly.


Correct: Middle managers almost universally care about NIST recommended algorithms.


"is it in tls 1.3? ok good, ship it"


Not my blog. Author works on a backup/archival tool called "Asuran", which I assume motivates a lot of the research into AEAD.

I had never heard of the "partitioning oracle" attack, which is a really interesting way to exploit Poly1305 tags under certain conditions. I never would have assumed that it's practical to generate multiple ciphertexts that all authenticate under the same Poly1305 tag. That property leads to some interesting attacks.


You might also be interested in RSA blind signatures. They hardly ever come up so I was surprised to learn what seemed like an important property years after I thought I knew what RSA did.


This isn't an AEAD, it's just AE.

AEAD implies the support for additional Authenticated Data (AD); this is just Encrypt-then-MAC, which is a good construction for authenticated encryption, but it's not AEAD (by definition).

If you want an AEAD, I designed one based on ChaCha20 and BLAKE3 as a proof-of-concept a while ago: https://github.com/soatok/experimental-caead

The important details:

1. You need to split a single key into 2 distinct keys (1 for [X]ChaCha, 1 for BLAKE3).

2. You need a canonical way to feed the ciphertext and AAD into BLAKE3.

3. You need to ensure your MACs are compared in constant-time.

The GitHub repository I linked above does all this.
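For anyone wondering what those three points look like in practice, here's a minimal sketch of the generic composition. This is not the code from the repository above; it assumes the blake3 and chacha20 crates, and the context strings, framing, and function names are illustrative choices.

    use chacha20::cipher::{KeyIvInit, StreamCipher};
    use chacha20::XChaCha20;

    fn subkeys(key: &[u8; 32]) -> ([u8; 32], [u8; 32]) {
        // (1) Split one key into two domain-separated keys.
        (
            blake3::derive_key("caead demo XChaCha20 key", key),
            blake3::derive_key("caead demo BLAKE3 MAC key", key),
        )
    }

    fn tag(mac_key: &[u8; 32], nonce: &[u8; 24], aad: &[u8], ct: &[u8]) -> blake3::Hash {
        // (2) Canonical encoding: length-prefix the AAD so the AAD/ciphertext
        // boundary is unambiguous.
        let mut mac = blake3::Hasher::new_keyed(mac_key);
        mac.update(&(aad.len() as u64).to_le_bytes());
        mac.update(aad);
        mac.update(nonce);
        mac.update(ct);
        mac.finalize()
    }

    fn seal(key: &[u8; 32], nonce: [u8; 24], aad: &[u8], msg: &[u8]) -> (Vec<u8>, [u8; 32]) {
        let (enc_key, mac_key) = subkeys(key);
        let mut ct = msg.to_vec();
        XChaCha20::new(&enc_key.into(), &nonce.into()).apply_keystream(&mut ct);
        let t = tag(&mac_key, &nonce, aad, &ct);
        (ct, *t.as_bytes())
    }

    fn open(key: &[u8; 32], nonce: [u8; 24], aad: &[u8], ct: &[u8], t: &[u8; 32]) -> Option<Vec<u8>> {
        let (enc_key, mac_key) = subkeys(key);
        // (3) blake3::Hash compares in constant time via its PartialEq impl.
        if tag(&mac_key, &nonce, aad, ct) != *t {
            return None;
        }
        let mut pt = ct.to_vec();
        XChaCha20::new(&enc_key.into(), &nonce.into()).apply_keystream(&mut pt);
        Some(pt)
    }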


The library that this was written for does in fact do all of the above, and I actually did look at your library for inspiration when I was doing background research. Big fan of your blog by the way.

Though given the blog post is devoid of that context, and it's apparently blown up a little, this is a valid wording concern and I'll edit the title.


I think this is an interesting conversation, but as normative feedback it seems pretty pedantic.

The "AAD" in "AEAD" is message metadata; canonically, it's stuff like sequence numbers and metadata that appear in headers that need to be in cleartext to make transports work. Another example: Chapoly AAD is used to incorporate the transcript hash into every AEAD invocation of the WireGuard handshake.

It seems like the obvious way you'd get AAD into generically-composed ChaCha+Blake is simply to include it in the Blake KMAC along with the ciphertext. In other words: the obvious way. Is there a footgun here that doesn't have an equivalent in GCM? I don't see it.

The KDF point doesn't really have anything to do with AAD, right --- all you're saying here is that you need to KDF the ChaCha and Blake keys. I guess you're pointing out that GCM and Chapoly effectively do this for you in library implementations. Which, sure.


"Is there a footgun here that doesn't have an equivalent in GCM?"

Maaaybe- if you're doing the composition yourself.

You have to differentiate between the AAD and the ciphertext somehow, and you can screw that up (e.g. not putting the split location in the AD, sending it unauthenticated instead), while if you're using somebody else's GCM implementation they hopefully handle this correctly.

If somebody tells you to "do AEAD with this block cipher and this generic MAC", it's tempting to just ... find an implementation of the block cipher, find an implementation of the MAC, and play legos, neglecting subtleties like the split location, proper key derivation, etc.

It's far less tempting to implement GCM by composing implementations of CTR and GMAC because ... where would you even find an implementation of GMAC that's not part of a GCM implementation anyways?


Yes, the AAD encoding is a good point, and you're right (and so is everyone else that pointed this out, including the above commenter) --- there is an actual footgun here with how the AAD is encoded.


It's also feedback that doesn't really land, since the library I'm working on (the one this blog post was written in the context of) actually does make extensive use of AAD, namely for verifying segment headers in files, which must appear in plaintext for the protocol to work. But I'll allow it, since this was posted to HN without context.


Haha well I did try (unsuccessfully) to give a little context in a comment when I posted it, pointing people to Asuran. But that was immediately buried by far more interesting comments.

Next time it will be title text.


> The KDF point doesn't really have anything to do with AAD, right --- all you're saying here is that you need to KDF the ChaCha and Blake keys. I guess you're pointing out that GCM and Chapoly effectively do this for you in library implementations. Which, sure.

The motivation for the KDF point is just to ensure we're not using the same key for two different algorithms. This condition almost never leads to any practical security risks (unless you're, I dunno, mixing AES-CBC with CBC-MAC?), but it makes people who care a lot about security proofs happier if we avoid it.


Right, I certainly agree with you that if you're specifying a composition of ChaCha and Blake, you want to get these details right --- you need a KDF (most designs that use straight-up Chapoly already need one, because they're keyed from a DH) and, as you point out, a canonical encoding for the ciphertext and AAD.

I think that's a good argument against just composing constructions "on the fly" casually, but to me, there are some kind of clear benefits to replacing the poly mac in Chapoly with a hash KMAC; it's worth going through this exercise (maybe just once) and then considering using it instead of a "standard" AEAD.

By way of example: with the KDF and encoding --- which you can pull from x/crypto --- it seems like you could pretty easily make this proposed system fit Golang's AEAD interface.


In my understanding, Blake3 is fast mostly because it does fewer rounds than blake2. The logic is that there is no need to be conservative in the number of rounds, because the current public attacks work only on a few rounds, so we can just do a bit more rounds than that and be good.

I personally avoid Blake3 and stick with blake2.


It's a bit more complicated than that.

Yes, the core of blake3 is a modified blake2b with fewer rounds, and yes the person who wrote the "Too much Crypto" paper is one of the blake authors, but there are other aspects of the construction that contribute as well.

Blake3 operates in a merkle tree mode using the modified blake2 as a node hashing function, which in and of itself complicates attacks for reasons explained in the blake3 paper. Blake2 also didn't really have that many rounds to start with; Blake3 only reduces the round count from 10 to 7. A good chunk (indeed most) of the speedup in Blake3 is due to the change in construction: the merkle tree mode allowing for unbounded parallelism, and not having the SIMD-friendly version be a variant. The parallel variants of Blake2 are already _almost_ as fast as Blake3 even without the round reduction (hell, BLAKE2sp can actually be faster than Blake3 in the right conditions).


Worth noting that the 10 and 7 are double ChaCha rounds, which means that the strength of BLAKE3's bit diffusion is closer to ChaCha14 than ChaCha7.

Given that the best attack against ChaCha fails to break ChaCha8, it's reasonable to conclude that BLAKE3 is secure.


Merkle tree of what, file blocks?


It's section 2.1 of the paper: https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...

Though, note, blake3 still provides enhanced resistance against the attacks against blake2 even in the case where you only have one block, due to the change in how the fundamental hashing primitive is used.


Reducing the rounds from 10 to 7 results in a ~10/7 = ~1.4x speedup. But single-threaded BLAKE3 is 5-6x faster than BLAKE2b, and most of that comes from SIMD optimizations. Thatonelutenist (the author of the original post) points out in a sibling comment that BLAKE2sp closes a lot of that gap, and that's very true. BLAKE2sp and BLAKE2bp can take advantage of many of the same SIMD optimizations that BLAKE3 can. But not all :)

A fully optimized implementation of BLAKE2bp/sp is about 2x faster than BLAKE2b and more than 3x faster than BLAKE2s. That's an enormous speedup. Consider that SHA-3 has a reputation for being slow, but the software speed difference between SHA-512 and SHA3-256 is smaller than this. There are some complications when you get into short input performance, but any application that needs to hash files (where short input performance is dominated by the cost of opening the file) would almost certainly be better served by BLAKE2bp/sp than by BLAKE2b/s. In that sense, I think the biggest practical advantage of BLAKE3 is just that it doesn't ask users to navigate these tradeoffs. We discussed this a bit in the intro section of the BLAKE3 paper.


> Blake3 is fast, several times faster than ChaCha20 on my machine

The speed differences between BLAKE3 and ChaCha20 depend a lot on what benchmarks you run, and I want to elaborate a bit on the details, in case anyone running benchmarks sees something different and wonders why.

The whole BLAKE family is derived from ChaCha, and there are some close performance relationships between them. As CiPHPerCoder pointed out in another comment, what the BLAKE family calls "rounds" are what ChaCha calls "doublerounds", so we need to multiply by two to compare them. Since BLAKE3 does 7 rounds, its performance should be vaguely similar to ChaCha14, or a factor of ~20/14 faster than ChaCha20.

SIMD optimizations matter an enormous amount here. An implementation of BLAKE3 or ChaCha20 using recent AVX-512 SIMD instructions, running on a CPU that supports them, can be an order of magnitude faster than a generic implementation. This affects both functions equally, but it means that you need to be very careful when benchmarking, to make sure that the implementations you're testing have all been similarly optimized.

One wrinkle here is that, even if BLAKE3 and ChaCha20 are both implemented with AVX-512 optimizations, ChaCha20 will be able to take advantage of many of those optimizations at smaller input sizes. That's because each 64-byte ChaCha20 output block is independent of all the others, so full AVX-512 parallelism can kick in at 16*64 = 1024 bytes of output. (The number 16 comes from the fact that ChaCha20 and BLAKE3 use 32-bit words, and 16 of those words fit in a 512-bit vector.) In contrast, BLAKE3 on the input side needs to parallelize 1024-byte chunks/leaves, rather than individual blocks, so it needs 16*1024 = 16384 bytes of input to reach the same degree of parallelism. So if you benchmark an input size below 16 KiB, you might find that ChaCha20 is faster for that reason. (Note that BLAKE3's output side is more similar to ChaCha20 than it is to BLAKE3's input side. But in practice the official BLAKE3 implementations have some missing optimizations in the XOF.)

Another difference that will come up is multithreading. Both BLAKE3 and ChaCha20 can be multithreaded in theory, but multithreading ChaCha20 is almost never done. This is for a few reasons: 1) BLAKE3's multithreading capabilities have to do with its Merkle tree structure and are one of the things that make it special, so of course the official implementation wants to showcase that. In contrast, any block-based stream cipher like AES-CTR or ChaCha20 can be trivially multithreaded, and it's not very interesting to demonstrate it. 2) Hashing gigantic files is something we actually need to do sometimes, and multithreading can be beneficial in that use case, when we don't have a read bottleneck. In contrast, encrypting a gigantic file is pretty rare, and it tends to involve prohibitive memory requirements and/or an output bottleneck. Most encrypted ciphertexts tend to be pretty small, like TLS records, and it's rare to have a good use case for multithreaded ChaCha20.

So if you're benchmarking BLAKE3 (b3sum) against large input files, there's a good chance your benchmark is benefitting from multithreading. But if you try to encrypt the same file, and you manage not to have an output bottleneck, your ChaCha20 implementation almost certainly isn't multithreaded. So that's another thing we need to be careful about when we compare them.


Tip: while you still can, edit your post to change * to \* to fix the formatting and make the multiplication symbol show.


Done, thank you.


On a side note, what does the name BLAKE stand for? Are you just fans of the Blake's 7 sci-fi show?


The original BLAKE hash function was based on a previous design called LAKE: https://www.aumasson.jp/data/papers/AMP08.pdf (2008). That said, I don't know where the LAKE name came from.


>10. Even AES256 has a block size of 128 bits, which means you can start expecting collisions after 2^64 encryptions. This sounds like a lot, and the exact details of how this can cause a mode of operation to break down depend on the particular mode, but that’s a not an unachievable amount of data, the internet does that much every few months (assuming 1 block per encryption)...

Correct me if I am wrong, but doesn't the fact that the internet is not encrypting everything with the same key and nonce mean that you can't just add up all the data transferred for this? Or does this mean that in the future we might be encrypting 256 EB (10^18) sessions/files?


Of course you can't, it's just a visual aid. The point is to say that if a cipher _completely_ breaks down at (some imaginable amount of data), then it's probably not behaving itself too well at (some much more reasonable, but still large, amount of data). AES-CTR already starts to get questionable in some respects at 2^40 encryptions with the same key (the nonce isn't relevant, it changes with every block anyway), which is only 128TiB. Sure, that's a lot for the average joe, but one could easily imagine someone wanting to encrypt that much data with a single key; just go check out /r/datahoarder if you don't believe me.

And sure, good key rotation fixes that, but that's another foot-gun, and how should the average end user know if their application is using proper key rotation or not?


It's giving an idea of the order of magnitude. Kind of like comparing outputs of an industry to outputs of countries. It's also an argument that we are within a few orders of magnitude of it mattering today, so therefore tomorrow we could be in trouble.


> and how the [AES] block size is too small

> The block size [of ChaCha20] is 512 bits, compared to AES’s 128 bits, making a wide variety of attacks, such as birthday bound or PRP-PRF distinguishing attacks many orders of magnitude less practical

It's too bad the AES competition didn't accept Rijndael's option of 256-bit block size. I wonder why?

( https://en.wikipedia.org/wiki/Advanced_Encryption_Standard#c... )


IIRC, the AES competition specifically called for a 128-bit block size, citing the hardware requirements of that era.

(This is an invitation to invoke Cunningham's Law if I'm misremembering.)


The 16 byte block size was part of the brief for the AES competition; it was established before Rijndael was selected.


Is OCB usage picking up now that it's no longer patent-encumbered? IIRC it has great performance compared to other AEAD methods


Yes, OCB is still around 50% faster than the next best[1] AEAD (which is AES-GCM) and almost 3x faster than Chapoly on current x86, and comes pretty close to 10 GB/s/core on a desktop CPU. That being said, all of these are really, really fast in absolute terms, way beyond line rate even for 10 GbE, which is a measly 1.2 GB/s. In practical terms, all of these modern algorithms are so fast (with hardware support for AES) that very few applications will see a significant burden from symmetric crypto, so you can choose pretty much whatever you feel comfortable with. EtM was usually quite a bit worse, as the hashes used for HMAC were usually much slower in comparison. Though Blake3 would seem to eliminate that concern.

[1] in terms of performance, on a modern x86, with AES-NI


More and more both personal computers and servers are using ARM CPUs without AES acceleration, though.

Also, write speeds of modern SSDs now exceed 5 GB/s.


> More and more both personal computers and servers are using ARM CPUs without AES acceleration, though.

The RPi SoC is an exception here - most ARMv8 SoCs should have AES instructions.

> Also, write speeds of modern SSDs now exceed 5 GB/s.

True, but I don't think the expectation is to utilize that throughput with a single application, rather, to still have fast I/O even under load or with multiple applications, and to satisfy burst I/O quickly.


OCB isn't committing, so it doesn't solve the stated problem.


So the major reason to not use this is... that no one does this. Right?

I wrote up an implementation of this and it was fairly trivial. The properties all make sense - the main change here is swapping poly1305 for blake3. But BLAKE3 promises (and we seem to trust this!) everything Poly1305 does and collision resistance.


If you are using the reduced rounds of Blake3, seems like you might as well also go to XChaCha12 or XChaCha8?


I.e., as one of the Blake3 authors calls for in Too Much Crypto (2019)[1] (8-round ChaCha in particular).

There is a 2021 update in the Conclusion section:

> Edit (May 24, 2021): 17 months after the original publication of this paper, its most concrete impact seems to be the adoption of ChaCha8 in a number of projects, notably thanks to its addition to the chacha20 crate of the RustCrypto project. We are not aware of new research results contradicting the paper’s claims, nor of results that would justify (from our point of view) a revision of the proposed reduced-round versions in §§5.3.

[1]: https://eprint.iacr.org/2019/1492.pdf (PDF)
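For reference, the RustCrypto chacha20 crate mentioned in that edit exposes the reduced-round variants directly; a minimal sketch (key/nonce values here are placeholders):

    use chacha20::cipher::{KeyIvInit, StreamCipher};
    use chacha20::ChaCha8;

    fn main() {
        let key = [0x42u8; 32];
        let nonce = [0x24u8; 12];
        // Same API shape as ChaCha20, just 8 rounds instead of 20.
        let mut cipher = ChaCha8::new(&key.into(), &nonce.into());

        let mut data = *b"hello reduced-round world";
        cipher.apply_keystream(&mut data); // encrypt in place
    }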


"8. Source: My Ass" :-)



