Assuming wireguard hashes data shorter than 4k (i.e. most network packets), ther...

loeg · on Jan 9, 2020

That isn't literally true; the reduced rounds make it faster on small inputs, too. And jumbo packets can be 4kB or 9000B or whatever, if wireguard is used on such an interface.

aidenn0 · on Jan 10, 2020

Does BLAKE3 reduce rounds vs BLAKE2s?

loup-vaillant · on Jan 10, 2020

7 rounds instead of 10.

Though for Wireguard, you'd compete with Blake2b as well, which has the advantage of using 64-bit words. And if you want a fair comparison, you should reduce the rounds of Blake2b down to 8 (instead of 12), as recommended in Aumasson's "Too Much Crypto".

On a 64-bit machine, such a reduced Blake2b would be much faster than Blake3 on inputs greater than 128 bytes and smaller than 4 KiB.
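A back-of-envelope sketch of that claim, counting rounds only. The block sizes, chunk size, and round counts are the published ones (with the 8-round Blake2b being the hypothetical reduced variant). The cost model itself is an assumption: it treats one round on 64-bit words as costing about the same as one round on 32-bit words on a scalar 64-bit machine, and it ignores SIMD, finalization details, and everything else. The function names are made up for illustration.

```python
import math

# (block size in bytes, rounds per compression call)
BLAKE2S = (64, 10)
BLAKE2B = (128, 12)
BLAKE2B_REDUCED = (128, 8)  # hypothetical 8-round variant, per "Too Much Crypto"

BLAKE3_BLOCK = 64    # bytes per compression call
BLAKE3_CHUNK = 1024  # bytes per tree leaf
BLAKE3_ROUNDS = 7


def blake2_rounds(n_bytes, params):
    """Total rounds for a sequential BLAKE2 hash of n_bytes (rough model)."""
    block, rounds = params
    # Even an empty message costs one compression call.
    blocks = max(1, math.ceil(n_bytes / block))
    return blocks * rounds


def blake3_rounds(n_bytes):
    """Total rounds for BLAKE3, counting leaf blocks plus parent nodes."""
    chunks = max(1, math.ceil(n_bytes / BLAKE3_CHUNK))
    blocks = 0
    remaining = n_bytes
    for _ in range(chunks):
        chunk_len = min(remaining, BLAKE3_CHUNK)
        blocks += max(1, math.ceil(chunk_len / BLAKE3_BLOCK))
        remaining -= chunk_len
    # A binary tree over `chunks` leaves has chunks - 1 parent nodes,
    # each costing one compression call.
    parents = chunks - 1
    return (blocks + parents) * BLAKE3_ROUNDS


if __name__ == "__main__":
    for n in (64, 256, 1024, 1500, 4096):
        print(n, blake2_rounds(n, BLAKE2S),
              blake2_rounds(n, BLAKE2B_REDUCED), blake3_rounds(n))
```

Under this model a 1024-byte input costs 160 rounds for Blake2s, 112 for Blake3, and 64 for the reduced Blake2b. The last one is on words twice as wide, which is where the scalar 64-bit advantage would come from.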

loeg · on Jan 10, 2020

They address this in the paper, to some extent. With SIMD, you get 128, 256, or 512 bits of vector. You can fill that with 4, 8, or 16 32-bit words, or with 2, 4, or 8 64-bit words. But either way you're processing N bits in parallel.
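The lane arithmetic, spelled out as a tiny sketch (the helper name `lanes` is just for illustration, and this only counts how many words fit in a register, nothing about actual throughput):

```python
def lanes(vector_bits, word_bits):
    """Number of words of word_bits that fit in one SIMD register."""
    return vector_bits // word_bits


if __name__ == "__main__":
    for vector_bits in (128, 256, 512):
        n32 = lanes(vector_bits, 32)
        n64 = lanes(vector_bits, 64)
        # Either word size fills the same number of bits per register.
        print(f"{vector_bits}-bit vector: {n32} x 32-bit or {n64} x 64-bit")
```

Halving the word size doubles the lane count, so the bits processed per vector instruction stay the same.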

The concern about 64-bit machines and using 64-bit word sizes vs 32-bit word sizes really only matters if your 64-bit machine doesn't have SIMD vector extensions. (All amd64 hardware, for example, has at least SSE2.) And as they point out, being 32-bit native really helps on low-end 32-bit machines without SIMD instructions.

(Re: the hypothetical, if WireGuard were to do a protocol revision and replace BLAKE2s with this, it would make sense to also replace ChaCha20 with ChaCha8 or ChaCha12 at the same time. I doubt the WG authors will do any such thing any time soon.)

loup-vaillant · on Jan 10, 2020

I was talking about small-ish inputs, for which vectorisation wouldn't help.

loeg · on Jan 10, 2020

Yes. It is addressed in TFA, as well as in the current top-voted comment on the article.