> ...on a faster machine now but with sha extensions vs single-threaded blake3; blake3 is about 2.5x faster than sha256 now
But one of the sweet benefits of blake3 is that it is parallelized by default. I picked blake3 not because it's 2.5x faster than sha256 when running b3sum with "--num-threads 1", but because, with the default settings, it's 16x faster than sha256 (on a machine with "only" 8 cores).
And Blake3, unlike some other "parallelizable" hashes, gives the same hash whether you run it with one thread or any number of threads (IIRC there are hashes that ship different executables depending on whether you want the single-threaded or the multi-threaded version, and they give different checksums).
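To illustrate why the thread count need not affect the result, here is a toy tree hash in Python. It is only a sketch and is not BLAKE3: it uses SHA-256 from the standard library as the node function, and the names `leaf`, `parent`, and `tree_hash` are made up for this example. The point it shows is that the tree shape depends only on the chunk size and the input length, so the worker count changes who computes each leaf but not what is computed.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024  # toy chunk size; the tree layout depends only on this and len(data)


def leaf(chunk: bytes) -> bytes:
    # Domain-separate leaves from parents with a one-byte prefix.
    return hashlib.sha256(b"\x00" + chunk).digest()


def parent(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def tree_hash(data: bytes, workers: int) -> str:
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    # Leaves are hashed in parallel; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        level = list(pool.map(leaf, chunks))
    # Combine pairs level by level; an odd node is carried up unchanged.
    while len(level) > 1:
        nxt = [parent(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()


if __name__ == "__main__":
    data = bytes(range(256)) * 4000  # about 1 MB of sample input
    print(tree_hash(data, 1) == tree_hash(data, 8))
```

A design that instead split the input into one contiguous slice per thread would bake the thread count into the result, which is presumably how the hashes with separate single-threaded and multi-threaded variants end up with different checksums.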
I tried on my machine (which is a bit slower than yours) and I get:
990 ms openssl sha256
331 ms b3sum --num-threads 1
70 ms b3sum
That's where the big performance gain is when using Blake3 IMO (although being 2.5x faster than a fast sha256 even when single-threaded is already nice).
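For anyone who wants the ratios spelled out, here is a small sketch that computes them from the three timings above. The dictionary keys and the `speedup` helper are just names picked for this example. On these particular numbers the ratios come out near 3x single-threaded and 14x with the default thread count, in the same ballpark as the 2.5x and 16x quoted earlier.

```python
# Wall-clock timings in milliseconds, copied from the measurements above.
timings_ms = {
    "openssl sha256": 990,
    "b3sum --num-threads 1": 331,
    "b3sum": 70,
}


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster `candidate` is than `baseline`."""
    return timings_ms[baseline] / timings_ms[candidate]


if __name__ == "__main__":
    pairs = [
        ("openssl sha256", "b3sum --num-threads 1"),
        ("openssl sha256", "b3sum"),
        ("b3sum --num-threads 1", "b3sum"),
    ]
    for base, cand in pairs:
        print(f"{cand} vs {base}: {speedup(base, cand):.1f}x")
```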
Yup. For comparison, hashing the same file as above with all 32 threads on my system takes about 45 ms with b3sum fully parallel. It does run into diminishing returns fairly quickly, though: unsurprisingly, hyperthreading brings no improvement in perf, but the gains also drop off pretty fast as more cores are added.
b3sum --num-threads 16 /tmp/rand_1G ran
1.01 ± 0.02 times faster than b3sum --num-threads 15 /tmp/rand_1G
1.01 ± 0.02 times faster than b3sum --num-threads 14 /tmp/rand_1G
1.03 ± 0.02 times faster than b3sum --num-threads 13 /tmp/rand_1G
1.04 ± 0.02 times faster than b3sum --num-threads 12 /tmp/rand_1G
1.07 ± 0.02 times faster than b3sum --num-threads 11 /tmp/rand_1G
1.10 ± 0.02 times faster than b3sum --num-threads 10 /tmp/rand_1G
1.13 ± 0.02 times faster than b3sum --num-threads 9 /tmp/rand_1G
1.20 ± 0.03 times faster than b3sum --num-threads 8 /tmp/rand_1G
1.27 ± 0.03 times faster than b3sum --num-threads 7 /tmp/rand_1G
1.37 ± 0.02 times faster than b3sum --num-threads 6 /tmp/rand_1G
1.53 ± 0.05 times faster than b3sum --num-threads 5 /tmp/rand_1G
1.72 ± 0.03 times faster than b3sum --num-threads 4 /tmp/rand_1G
2.10 ± 0.04 times faster than b3sum --num-threads 3 /tmp/rand_1G
2.84 ± 0.06 times faster than b3sum --num-threads 2 /tmp/rand_1G
5.03 ± 0.12 times faster than b3sum --num-threads 1 /tmp/rand_1G
(Thread counts above 16 are elided from this run, as they all take about the same time as 16.)
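To make the drop-off easier to see, the relative numbers above can be turned into speedup over one thread and parallel efficiency (speedup divided by thread count). This is a rough sketch: the ratios are copied from the output above, the ± error bars are ignored, and `speedup_vs_one` and `efficiency` are names invented here.

```python
# "16 threads ran X times faster than N threads", keyed by N (from the output above).
ratio_16_vs_n = {
    16: 1.00, 15: 1.01, 14: 1.01, 13: 1.03, 12: 1.04, 11: 1.07,
    10: 1.10, 9: 1.13, 8: 1.20, 7: 1.27, 6: 1.37, 5: 1.53,
    4: 1.72, 3: 2.10, 2: 2.84, 1: 5.03,
}


def speedup_vs_one(n: int) -> float:
    """Speedup of n threads relative to a single thread."""
    return ratio_16_vs_n[1] / ratio_16_vs_n[n]


def efficiency(n: int) -> float:
    """Fraction of ideal linear scaling achieved with n threads."""
    return speedup_vs_one(n) / n


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} threads: {speedup_vs_one(n):.2f}x, "
              f"efficiency {efficiency(n):.0%}")
```

Read this way, doubling from 8 to 16 threads buys only about 1.2x, and efficiency at 16 threads is roughly a third of ideal.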
Figure 4 from https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak... is a related benchmark. On that particular machine we saw good scaling up to 16 threads, but yeah, somewhere in that neighborhood you start to run into memory bandwidth issues. Which is the problem you want to have, I guess :)