It looks like the speedup is coming from two main changes.
The first change is reducing the number of rounds from 10 to 7. Think of it like making a smoothie: you add bits of fruit to the drink (the input data), then pulse the blades to blend it up (producing the output hash). This change runs the blades for 7 seconds instead of 10 each time fruit is added. The authors cite evidence that the extra 3 seconds aren't doing much - once the fruit is fully liquid, extra blending doesn't help - but I worry that this reduces the security margin. Maybe those extra 3 rounds aren't useful against current attacks, but they may be useful against unknown future attacks.
The other change they make is to break the input into 1KiB chunks, then hash each chunk independently. Finally, they combine the individual chunk hashes into a single big hash using a binary tree. The benefit is that if you have 4KiB of data, you can use 4-way SIMD instructions to process all four chunks simultaneously. The more data you have, the more parallelism you can unlock, unlike traditional hash functions that process everything sequentially. On top of that, modern SIMD instructions can handle 2 x 32-bit operations just as fast as 1 x 64-bit operation, so building the algorithm out of 32-bit arithmetic doesn't cost anything, but gives a big boost to low-end 32-bit CPUs that struggle with 64-bit arithmetic. The tree structure is a big win overall.
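To make the chunk-and-tree idea concrete, here's a toy sketch in Python: hash each 1 KiB chunk independently, then combine pairs of child hashes up a binary tree. This is only an illustration of the shape of the construction (using SHA-256 as a stand-in compression function); real BLAKE3 uses its own compression function plus domain-separation flags, and its tree layout differs in the details.

```python
import hashlib

CHUNK = 1024  # 1 KiB chunks, as described above

def tree_hash(data: bytes) -> bytes:
    # Hash each 1 KiB chunk independently. In a real implementation these
    # leaf hashes could run in parallel across SIMD lanes or threads.
    leaves = [hashlib.sha256(data[i:i + CHUNK]).digest()
              for i in range(0, max(len(data), 1), CHUNK)]
    # Combine pairs of child hashes up a binary tree until one root remains.
    while len(leaves) > 1:
        paired = []
        for i in range(0, len(leaves) - 1, 2):
            paired.append(hashlib.sha256(leaves[i] + leaves[i + 1]).digest())
        if len(leaves) % 2:            # a lone odd node is promoted unchanged
            paired.append(leaves[-1])
        leaves = paired
    return leaves[0]
```

With 4 KiB of input, the list comprehension produces four independent leaf hashes, which is exactly the parallelism opportunity the comment describes.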
> but I worry that this reduces the security margin. Maybe those extra 3 rounds aren't useful against current attacks, but they may be useful against unknown future attacks.
This was covered in more detail in the earlier "Too Much Crypto" paper [1], which argued that many standards use excessively high round counts. Note that Aumasson is an author of both BLAKE3 and "Too Much Crypto".
> Our goal is to propose numbers of rounds for which we have strong confidence that the algorithm will never be wounded
They take algorithms, apply the past 10 years of public crypto research, and shave off rounds until the algorithm just about starts falling apart. AFAIU, having (impractical) security-reducing attacks against the chosen round count is the target.
I prefer to have ample confidence in my crypto algorithms. Would not recommend BLAKE3 (without those extra rounds).
Thinking back to Schneier’s running commentary on SHA3, using hierarchical hashes was part of a general attempt to increase the internal state space of the hashes. SHA-1's output exposes all of its internal state bits. So once you get two prefixes with the same hash, appending a common suffix results in the same output hash.
With hashes of hashes, the prefixes have to be the same length, and possibly very short.
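The extension property above can be illustrated with `hashlib`'s `copy()`: two hash objects with identical internal state stay identical after absorbing any common suffix. (Here the matching states come trivially from copying; in a real attack they would come from a state collision on two different prefixes.)

```python
import hashlib

# Two hash objects whose internal states match (here, by copying) remain
# matched after any common suffix - which is why a state collision on two
# prefixes extends to a full collision on prefix+suffix.
h1 = hashlib.sha1(b"prefix-A")
h2 = h1.copy()                    # same internal state as h1
h1.update(b"common suffix")
h2.update(b"common suffix")
assert h1.digest() == h2.digest()
```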
$ time dd if=./1GB of=/dev/null bs=65536
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.10451 s, 10.3 GB/s
real 0m0.107s
sha256sum uses 128KiB blocks when reading, which may be more optimized than the openssl program.
sha256sum supports using OpenSSL's routines, as they are generally faster. This is enabled on Red Hat-flavored distros and Arch, but not Debian-flavored ones as yet.
The upcoming release of sha256sum (coreutils 8.32) will auto-enable use of OpenSSL >= v3, as OpenSSL's licence changed to Apache in that version.
I remember watching Bao, a general purpose cryptographic tree hash, and perhaps the fastest hash function in the world: https://www.youtube.com/watch?v=Dya9c2DXMqQ a while ago.
Excuse my confusion. I understand "the Bao project evolved into BLAKE3", but "is now based on BLAKE3" confuses me. Bao is based on blake3? But isn't bao ... the blake3 itself now? Circular dependency detected.
Ha, yes, that's confusing. The Bao project was originally two things: 1) a custom tree hash mode, and 2) an encoding format and verified streaming implementation based on that tree hash. The first half evolved into BLAKE3. Now the Bao project itself is just the second half.
The README lists 4 designers, including yourself. However the Bao project doesn't list anybody, so presumably you are the only designer. What exactly were the contributions of the other 3 people to warrant being listed?
At what point did the Bao project become "BLAKE3" and why?
Which, yeah, that alone will get you a significant improvement over Blake2B. But it definitely doesn't account for the huge improvement they're showing. Most of that, I think, is the ability to take advantage of AVX512 parallelism. The difference will be more incremental on AVX2-only amd64 or other platforms.
[1]: Well, TMC recommended 8 rounds for Blake2B and 7 for Blake2S.
> ah, the catch. blake3 targets 128 bit security. It competes with SipHash for speed and security.
No no, BLAKE3 is a general-purpose cryptographic hash just like BLAKE2, SHA-2, and SHA-3. The confusion here is that a hash function's security level is half of its output size, because of the birthday problem. BLAKE3, like BLAKE2s and SHA-256, has a 256-bit output and a 128-bit security level. (BLAKE3 also supports extendable output, but that doesn't affect the security level.)
A hash can have different security levels against different attacks. BLAKE3 appears to have 128 bits of security against all attacks.
SHA3-256 was originally designed to have 128 bits of collision security and 256 bits of preimage security. NIST then made a change to it giving it 128 bits of security against all attacks. A lot of people got mad. Then NIST caved and changed it back to 128 bits of collision security and 256 bits of preimage security.
It looks like BLAKE3 agrees with how NIST wanted SHA3 to be. I wonder if people will be mad at BLAKE3.
For a fairer performance comparison against SHA3, you should compare against SHAKE128(256). That is, the version with 128 bits of security all around and a 256-bit output (how NIST wanted it). Although maybe it's pointless, because according to Wikipedia SHAKE128(256) is only 8% faster than SHA3-256 for large inputs.
> BLAKE3 appears to have 128 bits of security against all attacks.
That's not accurate. The best pseudo-preimage attack on BLAKE2s has complexity 2^{253.8} against 6.75 rounds (section 3.2 of https://eprint.iacr.org/2019/1492.pdf ). The best full-preimage attack on BLAKE2s is against 2.75 rounds. BLAKE3's round function is identical to BLAKE2s (although used in a different mode). Currently there isn't any known classical preimage attack on BLAKE3 better than these ones against reduced BLAKE2s. This should be interpreted with caution since the design has only just been published.
[Disclosure of interest: I know Zooko and work for Electric Coin Company. This is only based on a cursory review of the paper, though; I had not seen it prior to publication.]
>Although maybe it's pointless, because according to Wikipedia SHAKE128(256) is only 8% faster than SHA3-256 for large inputs.
This is mainly due to SHA3's humongous 1600-bit state, which is not very friendly to embedded systems. In sponge constructions with smaller states, or generally primitives with smaller states, the difference is much larger.
Also in general I would say that small message performance is usually more important than large message performance, since large messages with desktop/laptop CPUs are so incredibly fast anyway with most hash functions that the bottleneck goes somewhere else. (Storage, network, etc.)
They're probably right to be upset by that change. Collisions require a lot more resources than simply trying every input on a GPU/ASIC, and as https://news.ycombinator.com/item?id=22007810 points out, an attacker that can only cover some fraction of the key space is massively more disadvantaged when they're performing a collision attack. You need more raw bits of preimage resistance to make the attacks equally hard.
I don't know if you need twice as many bits of preimage resistance, but I'd feel a lot more comfortable with an extra 32.
> Collisions require a lot more resources than simply trying every input on a GPU/ASIC
Would a 128 bit difficulty preimage attack against SHAKE128(256) be as simple as trying every single input? Trying every single input would be a 256 bit difficulty attack I would assume. To get it to a 128 bit attack I would think the attack would need something more advanced.
Yeah, I should have stuck with my initial wording about having major space requirements and not just time requirements.
You do need a more complex calculation to try to mount a preimage attack. But from what I can figure out it still tends to use big blocks of standalone arithmetic, something that's still very easy to accelerate with limited I/O.
> BLAKE3, like BLAKE2s .. has .. a 128-bit security level.
Skimming and interpreting the Too Much Crypto paper[1], the security target is strictly less than 128-bit security. If it maintained 128-bit security, it would be considered too many rounds.
Not really. 128 bits for collision (using a 256 bit hash) is not the same as 128 bits for key recovery or preimage. It is much, much stronger.
Preimage and key recovery attacks have a linear drop off: halving the number of tries halves your chances of success. Collisions have a quadratic drop off: halving your number of tries divides your chances of success by four.
Moreover, it is easier to find one key if you have many keys to try (finding 1 key among N is N times easier than finding any specific key). No such considerations for collisions: finding any collision is just as hard as finding one collision among many.
512-bit hashes were never needed for their security levels. Though in some circumstances (deriving multiple keys, EdDSA), bigger digests come in handy.
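The linear vs quadratic drop-off above can be made concrete with the standard approximations for an n-bit hash after q guesses: preimage success is roughly q / 2^n, while collision success (by the birthday bound) is roughly q^2 / 2^(n+1).

```python
# Standard approximations for an n-bit hash after q guesses:
#   preimage success  ~ q / 2**n          (linear in q)
#   collision success ~ q**2 / 2**(n+1)   (quadratic in q)
def preimage_p(q, n):
    return q / 2**n

def collision_p(q, n):
    return q**2 / 2**(n + 1)

n = 256
q = 2**120
# Halving the guesses halves preimage odds but quarters collision odds.
assert preimage_p(q, n) / preimage_p(q // 2, n) == 2
assert collision_p(q, n) / collision_p(q // 2, n) == 4
```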
That's an interesting point I hadn't thought of before. If someone says "I want < 1% chance of an attack" then 128 bits of collision resistance will be stronger than 128 bits of preimage resistance.
On the other hand if someone says "I want < 99% chance of an attack" then 128 bits of preimage resistance will be more secure. But people probably don't have this goal, so 128 bits of collision resistance is better overall.
But another aspect is that once you start finding collisions, you're likely to find more of them, faster (whereas preimages keep coming linearly). This actually counts against collision resistance.
> (finding 1 key among N is N times easier than finding any specific key)
Yes, but only up to a point. You have to compare the cost of computing a hash vs comparing the hash. If the time to hash is 10000x the time to compare, then no matter how large N is, you'll never get faster than a 10000x speedup. This is essentially Amdahl's law. But this speedup is still quite significant.
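A quick back-of-envelope for that cap, using hypothetical unit costs (H for one hash, C for one comparison): each guess against N targets costs H + N*C but covers N targets, versus H + C per guess against one target, so the speedup saturates near H/C no matter how large N gets.

```python
def speedup(H, C, N):
    # One guess costs H (hash) + N*C (compare against N targets) and
    # covers N targets, vs H + C per guess for a single target.
    return (H + C) / ((H + N * C) / N)

H, C = 10_000, 1          # hypothetical: hashing is 10000x the compare cost
assert speedup(H, C, 1) == 1.0
assert speedup(H, C, 10**9) < H / C + 1        # capped by the cost ratio
assert speedup(H, C, 10**9) > 0.99 * (H / C)   # ...and it gets close to it
```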
I am also curious about how it performs as a PRF in places where e.g. Chacha20 is used as a keystream generator now. Also as a reduced round variant in places where non-cryptographic PRNGs are used for very fast RNG needs: JSF, SFC, Lehmer, Splitmix, PCG.
In my extremely limited testing (on AVX2, but not AVX512 hardware), (buffered) reduced (four) round Chacha is only about 1.5-2x slower than fast non-cryptographic PRNGs like JSF, SFC, Lehmer, or pcg64_fast (all with Clang -O2 -flto, the fast PRNGs are header-only implementations and only chacha is two files).
This thing still uses 7 rounds, but that is easy to tune down. Very neat.
I can't seem to find any non-rust implementations in the works yet, so I may sit down and adapt the reference to C# this weekend. Anyone know how the single/few threads performance holds up excluding avx512?
It's of course much faster than most of the other crypto hashes, but not faster than the hardware variants SHA1-NI and SHA256-NI.
About 4x faster than blake2.
Faster than SipHash, not faster than SipHash13.
The tests fail dramatically on MomentChi2, which describes how well the user-provided random seed is mixed in. I tried mixing a seed into IV[0], as with all other hardened crypto hashes, and into all 8 IVs, which didn't help. So I'm not convinced that a seeded IV is properly mixed in. That's outside the usage pattern of a crypto or digest hash (b3sum), but inside normal usage.
The Rust staticlib is still in the works; it would parallelize the hashing in chunks for big keys. For small keys it should even be a bit slower. b3sum is so much faster because it uses many more tricks, such as mmap.
Do you have single-thread cpb benchmark figures on amd64 hardware without AVX512?
Clearly the benefits of AVX512 really exaggerate the comparison on hardware that supports it, and the benefit over Blake2S is pretty muted on hardware without vector intrinsics (low end 32-bit ARM). But I'm interested in the middle — e.g., Zen1/2 AMD, Broadwell and earlier Intel x86-64.
I'm also curious for benchmarks on amd64 without AVX512, as that's the majority of existing x86 metal and the entire AMD product line, even the new Zen2 stuff.
Why are there no AES-like hashing algorithms out there? The AES design is very suitable as a building block in a hash if you remove the "add round key" operation.
I helped design Meow Hash using AES-NI. It is not general purpose crypto strength, but ridiculously fast, targeting a theoretical performance of 16 bytes per cycle on some processors, too fast for memory to keep up. https://github.com/cmuratori/meow_hash
Regarding the speed, meow_hash is certainly performant, but when hashing a single file, b3sum actually seems to be faster, perhaps due to using more threads. Could meow_hash be made even faster by running more operations in parallel?
time ~/.cargo/bin/b3sum ubuntu-19.10-beta-desktop-amd64.iso
0899b731b6b57d75a65273fe29d54802653b3bbe3bae6732140c487c4f0ece71 ubuntu-19.10-beta-desktop-amd64.iso
real 0m0,180s
user 0m1,735s
sys 0m0,092s
time meow_example ubuntu-19.10-beta-desktop-amd64.iso
meow_example 0.5/calico - basic usage example of the Meow hash
(C) Copyright 2018-2019 by Molly Rocket, Inc. (https://mollyrocket.com)
See https://mollyrocket.com/meowhash for details.
Hash of "ubuntu-19.10-beta-desktop-amd64.iso":
E5CC524A-522BDE7E-ED34A277-5C7D73AF
This made me curious. Is it because at this stage it is a proposal that has not yet been verified/analysed or are there actual reasons that you know of that make this not "general purpose strong"?
I don't actually have proof that it isn't crypto strength. But comparing it to other algorithms that have been broken, it seems unlikely that it would hold given the rather modest amount of computation done.
I do believe that it meets the requirements for being a MAC function, and I'm completely certain that it is a great non-cryptographic hash function.
Is it possible to benchmark against blake2 etc. with the same number of rounds, testing both reducing blake2's rounds and increasing blake3's? Also, in that vein, offering the version with more rounds could win over the "paranoid" by mostly being a faster Blake2 thanks to SIMD, with extra features thanks to the Merkle tree.
Benchmark #1: cat b1
Time (mean ± σ): 1.076 s ± 0.007 s [User: 5.3 ms, System: 1069.4 ms]
Range (min … max): 1.069 s … 1.093 s 10 runs
Benchmark #2: sha256sum b1
Time (mean ± σ): 6.583 s ± 0.064 s [User: 5.440 s, System: 1.137 s]
Range (min … max): 6.506 s … 6.695 s 10 runs
Benchmark #3: sha1sum b1
Time (mean ± σ): 6.322 s ± 0.086 s [User: 5.212 s, System: 1.103 s]
Range (min … max): 6.214 s … 6.484 s 10 runs
Benchmark #4: b2sum b1
Time (mean ± σ): 13.184 s ± 0.108 s [User: 12.090 s, System: 1.080 s]
Range (min … max): 13.087 s … 13.382 s 10 runs
Benchmark #5: b3sum b1
Time (mean ± σ): 577.0 ms ± 5.4 ms [User: 12.276 s, System: 0.669 s]
Range (min … max): 572.4 ms … 587.0 ms 10 runs
Benchmark #6: md5sum b1
Time (mean ± σ): 14.851 s ± 0.175 s [User: 13.717 s, System: 1.117 s]
Range (min … max): 14.495 s … 15.128 s 10 runs
Summary
'b3sum b1' ran
1.86 ± 0.02 times faster than 'cat b1'
10.96 ± 0.18 times faster than 'sha1sum b1'
11.41 ± 0.15 times faster than 'sha256sum b1'
22.85 ± 0.28 times faster than 'b2sum b1'
25.74 ± 0.39 times faster than 'md5sum b1'
gotdang that's some solid performance. (here running against 10GiB of random bytes; machine has the Sha ASM extensions, which is why sha256/sha1 perform so well)
edit: actually not a straight algo comparison, as b3sum here is heavily benefiting from multi-threading; without that it looks more like this:
Benchmark #1: cat b1
Time (mean ± σ): 1.090 s ± 0.007 s [User: 2.9 ms, System: 1084.8 ms]
Range (min … max): 1.071 s … 1.096 s 10 runs
Benchmark #2: sha256sum b1
Time (mean ± σ): 6.480 s ± 0.097 s [User: 5.359 s, System: 1.115 s]
Range (min … max): 6.346 s … 6.587 s 10 runs
Benchmark #3: sha1sum b1
Time (mean ± σ): 6.120 s ± 0.090 s [User: 5.027 s, System: 1.082 s]
Range (min … max): 5.979 s … 6.233 s 10 runs
Benchmark #4: b2sum b1
Time (mean ± σ): 12.866 s ± 0.208 s [User: 11.722 s, System: 1.133 s]
Range (min … max): 12.549 s … 13.124 s 10 runs
Benchmark #5: b3sum b1
Time (mean ± σ): 5.813 s ± 0.079 s [User: 4.606 s, System: 1.202 s]
Range (min … max): 5.699 s … 5.933 s 10 runs
Benchmark #6: md5sum b1
Time (mean ± σ): 14.355 s ± 0.184 s [User: 13.305 s, System: 1.039 s]
Range (min … max): 14.119 s … 14.605 s 10 runs
Summary
'cat b1' ran
5.33 ± 0.08 times faster than 'b3sum b1'
5.62 ± 0.09 times faster than 'sha1sum b1'
5.95 ± 0.10 times faster than 'sha256sum b1'
11.81 ± 0.21 times faster than 'b2sum b1'
13.17 ± 0.19 times faster than 'md5sum b1'
still beating the dedicated sha extensions, but not nearly as dramatically.
Anywhere you hash stuff and would like it to be faster. Content-addressable systems use hashing to generate identifiers for content (where the same content has the same identifier, and different content in practice always has a different ID). If you have a 2GB file you want an identifier for, using BLAKE3 would make that a lot faster.
> Capable of verified streaming and incremental updates, again because it's a Merkle tree.
I don't really understand what this means (verified streaming + incremental updates), could someone clarify? Merkle Trees are simple (https://en.wikipedia.org/wiki/Merkle_tree for people who don't know)
Basically, to verify a video file using serial hash functions, you need to download the entire video file before you can perform the hashing. With BLAKE3 you can verify each chunk of the video as it is being streamed, because the hash internally is just a Merkle tree.
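A toy sketch of the idea (not BLAKE3's actual encoding format): pin a Merkle root up front, then verify the first chunk the moment it arrives, using only the trusted root and the sibling hash. Two leaves keep the example short; a real tree generalizes this with a sibling-hash path per chunk.

```python
import hashlib

H = lambda b: hashlib.sha256(b).digest()

# Sender: build a 2-leaf Merkle tree over the stream's chunks.
chunk0, chunk1 = b"first 1 KiB of video...", b"second 1 KiB of video..."
leaf0, leaf1 = H(chunk0), H(chunk1)
root = H(leaf0 + leaf1)           # this root is what the verifier trusts

# Receiver: verify chunk0 before chunk1 has even arrived, using only the
# trusted root and the sibling hash leaf1 (sent alongside the chunk).
def verify_first_chunk(chunk, sibling_hash, trusted_root):
    return H(H(chunk) + sibling_hash) == trusted_root

assert verify_first_chunk(chunk0, leaf1, root)
assert not verify_first_chunk(b"tampered", leaf1, root)
```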
Thanks for the link mauricio! That's very interesting and makes a ton of sense. Gonna have a lot of fun playing around with BLAKE3 for different things.
Fast hashes are useful for signing, MACs (symmetric "signatures" so to speak), key derivation (HKDF and all kinds of Diffie-Hellman handshakes come to mind), as part of cryptographically secure PRNGs (though most of the world has moved on to stream ciphers for that instead) and probably more.
While programming, just try to think of a scenario where having a mapping between some kind of arbitrary data (and maybe a key) and a fixed-size, uniformly random-looking output could be useful. Opportunities to sprinkle some hashes on things come up quite often when you look for them.
So I’m not super familiar with things like this, but for example, WireGuard uses BLAKE2 for hashing. What level of undertaking would it be to move from BLAKE2 to BLAKE3 in regards to WireGuard? Can you just pop out BLAKE2 and pop in BLAKE3?
The two hashes aren't compatible, so a hash of the same message will yield two different hashes under BLAKE2 and BLAKE3.
As far as I can tell, BLAKE2 has effectively all the properties BLAKE3 has (arbitrary output length, keyed hash mode), so the upgrade for communications boils down to negotiating/determining which of the two hash functions to use over the wire (with all the downsides that come with agility of cryptographic primitives); stored hashes have to be recomputed and replaced (or you could store an is-BLAKE2/is-BLAKE3 flag and update hashes as you touch them, similar to how password hashes are swapped at login time).
Note that BLAKE3 existing doesn't break BLAKE2. It's perfectly fine to just keep trucking BLAKE2, it's just that BLAKE3 has better performance characteristics that make it very attractive.
Wireguard uses pre-shared keys, which are manually (or through some separate program) configured on each end, so one could in theory make "oh, and by the way we will use BLAKE3 instead of BLAKE2" part of that pre-shared key.
Assuming wireguard hashes data shorter than 4k (i.e. most network packets), there is no reason to switch; BLAKE3 is only faster than BLAKE2 on data longer than 4k.
That isn't literally true; the reduced rounds make it faster on small inputs, too. And jumbo packets can be 4kB or 9000B or whatever, if wireguard is used on such an interface.
Though for Wireguard, you'd compete with Blake2b as well, which has the advantage of using 64-bit words. And if you want a fair comparison, you should reduce the rounds of Blake2b down to 8 (instead of 12), as recommended in Aumasson's "Too Much Crypto".
On a 64-bit machine, such a reduced Blake2b would be much faster than Blake3 on inputs greater than 128 bytes and smaller than 4KiB.
They address this in the paper, to some extent. With SIMD, you get 128, 256, or 512 bits of vector. You can either store 32x4, 32x8, 32x16, or 64x2, 64x4, 64x8 words. But either way you're processing N bits in parallel.
The concern about 64-bit machines and using 64-bit word sizes vs 32-bit word sizes really only matters if your 64-bit machine doesn't have SIMD vector extensions. (All amd64 hardware, for example, has at least SSE2.) And as they point out, being 32-bit native really helps on low-end 32-bit machines without SIMD intrinsics.
(Re: the hypothetical, if wireguard were to do a protocol revision and replace Blake2B with this, it would make sense to also replace Chacha20 with Chacha8 or 12 at the same time. I doubt the WG authors will do any such thing any time soon.)
> as part of cryptographically secure PRNGs (though most of the world has moved on to stream ciphers for that instead)
My understanding is that plenty of stream ciphers are based on hashes. For example each block of the stream can be hash(key + nonce + block counter + constants) that you xor with your plaintext (or don't, if you just want a CSPRNG).
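A minimal sketch of that construction, purely illustrative (using SHA-256 as the hash, a fixed-width counter, and simple concatenation; a vetted design would use a proper PRF/KDF framing): each keystream block is hash(key || nonce || counter), and encryption and decryption are the same XOR.

```python
import hashlib

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Keystream block i is sha256(key || nonce || offset); XOR with the
    # data. Applying the function twice recovers the original bytes.
    out = bytearray()
    for off in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + off.to_bytes(8, "little")).digest()
        chunk = data[off:off + 32]
        out += bytes(a ^ b for a, b in zip(chunk, block))
    return bytes(out)

key, nonce = b"k" * 32, b"n" * 12
msg = b"attack at dawn, bring snacks"
ct = keystream_xor(key, nonce, msg)
assert keystream_xor(key, nonce, ct) == msg   # XOR stream: decrypt == encrypt
```

Dropping the XOR and just emitting the blocks gives the CSPRNG variant the comment mentions.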
BLAKE and Chacha are pretty intimately related; if this is faster, I don't see any reason to not use it as a CSPRNG over, say, Chacha20. You may have to be careful about not rolling the counter, unlike Chacha20 (which can easily be extended to have a 128-bit counter).
> unlike Chacha20 (which can easily be extended to have a 128-bit counter)
Is this frequently done in practice? The CSPRNG code for ChaCha20 I've looked at rotates the key itself using 32 out of every 768 bytes. In that case rolling the counter isn't a concern.
The 128-bit counter would work, and would remove the 32 bytes of overhead. The speed difference however is fairly negligible, and you would lose forward secrecy in the process (if your unrotated seed gets stolen, all past random numbers are revealed).
Now I wonder where this 768 bytes could possibly come from. It's only a multiple of 256, which can only take advantage of 128-bit vectors (4 blocks at a time). Ideally you want an 8 way parallelism (AVX2) or even 16 way parallelism (AVX-512). That is, either 512 byte blocks, or 1024 byte blocks.
> Now I wonder where this 768 bytes could possibly come from.
This is totally implementation defined, it's not required by the spec. As loeg says (below) I was looking at a reference implementation by djb. I did a quick skim of OpenBSD's arc4random (which also uses ChaCha20) and if I'm reading it correctly, it rekeys every 1024 bytes.
> Ideally you want an 8 way parallelism (AVX2) or even 16 way parallelism (AVX-512)
My guess is that 768 was thought to be a decent enough trade-off between maximum and average latency for calls to the CSPRNG. I wouldn't be surprised to see that most implementations that are optimized for specific CPU architectures use different values.
Oh, I see. Then people blindly copied it, without taking into account that Chacha20 has bigger blocks than AES, and could benefit from vector implementations (while AES has bit slicing or AES-NI).
I don't know about frequently; it is done at least once. (Of course, nothing about a wide counter prevents you from rotating the key, too, for forward secrecy properties.)
Hashes are useful in a wide variety of cryptographic applications, of which file signing is just one example. But you are correct, you would not want to use this by itself for password hashing. But that is true of any cryptographic hash, not just Blake3.
I'm not sure what you mean, do you mean something like Bcrypt on its own isn't enough for password hashes?
My understanding is that for password hashing you want two things: cryptographic security (can't go from hash -> password by attacking the algorithm), and being "as slow as is usable" to counter brute-force attacks. Also, per-password salts like bcrypt uses counter rainbow-table attacks.
There are two different kinds of cryptographic hashes. It's confusing because there isn't standard terminology to distinguish between them, so I'm just going to call them Type 1 and Type 2. Type 1 hashes are designed to be fast and Type 2 hashes are designed to be slow. MD5, SHA, and Blake are Type 1 hashes. Bcrypt is a Type 2 hash, as is PBKDF2. Type 2 hashes are built using Type 1 hashes as components. Type 2 hashes also typically include additional features, like salts, that Type 1 hashes do not have.
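The Type 1 / Type 2 distinction in code: PBKDF2 builds a deliberately slow password hash out of a fast hash (SHA-256 here) by iterating it many times, with a per-password salt. The iteration count is an illustrative choice, not a recommendation.

```python
import hashlib
import os

password = b"correct horse battery staple"
salt = os.urandom(16)        # per-password salt, stored alongside the hash

# Type 1: one fast pass of SHA-256 - great for data, wrong for passwords.
fast = hashlib.sha256(salt + password).digest()

# Type 2: PBKDF2 iterates the fast hash (600k times here) to slow down
# brute-force guessing by the same factor.
slow = hashlib.pbkdf2_hmac("sha256", password, salt, 600_000)

assert len(fast) == len(slow) == 32
```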
And there's a third and most important property: constant time, i.e. timing-resistant. Usually not too fast and not too slow, but always the same speed for each input; no shortcuts.
No, hashes do not need to be constant time because they aren't keyed. In fact, they cannot be constant time because they have to handle arbitrary-length input.
The first two references you cite are attacks against a password hash. Blake3 is not a password hash, it is a general-purpose hash.
The third reference is not an attack on the hash function timing, it is an attack against hash table timing, which is not at all the same thing. From the abstract:
"Our attack does not rely on any weakness of a particular hash function and can work against any hash."
Once again: Blake3 is a general-purpose hash, not a password hash. General-purpose hashes cannot be constant-time because they must operate on inputs of arbitrary length.
Then I think you misunderstood rurban's post. He's saying it is a property cryptographic hash functions must have that they are constant time, constant power. Yes, all cryptographic hashes suitable for general-purpose use have this property. That's part of why they are good general-purpose hash functions.
And each round of Blake3 is constant-time, irrespective of input. We're talking about rounds (or, alternatively, constant-time for equivalent length input).
> He's saying it is a property cryptographic hash functions must have that they are constant time, constant power.
It's a little unclear whether rurban was saying that they must have this property, or merely that they do have it. But that is neither here nor there because...
> each round of Blake3 is constant-time
That is not the same thing as the entire algorithm being constant-time.
This whole thread has turned into a horrible mess.
Yes, all else being equal, constant time/power is nice to have. But the only circumstance under which it is necessary is when you are processing secret data in a situation where an adversary can potentially observe side channels. And this is true for any algorithm, not just hashes. Furthermore, the most common application of hashing a secret is password hashing, and there it is much more important that the hash be expensive than that it be constant time and power.
But Blake3 is not a password hash. It can be used as a component of a password hash, and there its constant-round time becomes a useful property. But to emphasize this in the context in which rurban's comment appears is at best badly misleading.