If we're talking about vectorized instruction sets like AVX (Intel/AMD) or NEON (ARM), the advantage is clearly with SHA256. I don't think Blake3 has any hardware implementation at all yet.
Your typical cell phone running ARMv8 / NEON will be more efficient with the SHA256 instructions than whatever software routine you need to run Blake3. Dedicated hardware inside the cores is very difficult to beat on execution speed or efficiency.
I admit that I haven't run any benchmarks on my own. But I'd be very surprised if any software routine were comparable to the dedicated SHA256 instructions found on modern cores.
Followup to this, now with further blake3 improvements: on a faster machine, with sha extensions, vs single-threaded blake3, blake3 is about 2.5x faster than sha256 now (b3sum 1.5.0 vs openssl 3.0.11). b3sum is about 9x faster than sha256sum from GNU coreutils 9.3, which does not use the sha extensions.
Benchmark 1: openssl sha256 /tmp/rand_1G
  Time (mean ± σ):   576.8 ms ± 3.5 ms  [User: 415.0 ms, System: 161.8 ms]
  Range (min … max): 569.7 ms … 580.3 ms  10 runs

Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
  Time (mean ± σ):   228.7 ms ± 3.7 ms  [User: 168.7 ms, System: 59.5 ms]
  Range (min … max): 223.5 ms … 234.9 ms  13 runs

Benchmark 3: sha256sum /tmp/rand_1G
  Time (mean ± σ):   2.062 s ± 0.025 s  [User: 1.923 s, System: 0.138 s]
  Range (min … max): 2.046 s … 2.130 s  10 runs

Summary
  b3sum --num-threads 1 /tmp/rand_1G ran
    2.52 ± 0.04 times faster than openssl sha256 /tmp/rand_1G
    9.02 ± 0.18 times faster than sha256sum /tmp/rand_1G
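For context, those mean times translate into rough throughputs. A quick back-of-the-envelope calculation, assuming the test file is exactly 1 GiB (the filename suggests it, but that's an assumption):

```python
# Rough throughput from the hyperfine means above, assuming /tmp/rand_1G
# is exactly 1 GiB (2**30 bytes).
SIZE = 2**30  # bytes

times = {
    "openssl sha256 (sha-ni)": 0.5768,   # seconds, mean from the run above
    "b3sum --num-threads 1":   0.2287,
    "sha256sum (coreutils)":   2.062,
}

for name, t in times.items():
    print(f"{name}: {SIZE / t / 1e9:.2f} GB/s")
```

So single-threaded blake3 is doing roughly 4.7 GB/s here versus roughly 1.9 GB/s for hardware-accelerated sha256.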
> ...on a faster machine now but with sha extensions vs single-threaded blake3; blake3 is about 2.5x faster than sha256 now
But one of the sweet benefits of blake3 is that it is parallelized by default. I picked blake3 not because it's 2.5x faster than sha256 when running b3sum with "--num-threads 1" but because, with the defaults, it's 16x faster than sha256 (on a machine with "only" 8 cores).
And Blake3, unlike some other "parallelizable" hashes, gives the same hash whether you run it with one thread or many (IIRC there are hashes that ship separate single-threaded and multi-threaded versions, and those produce different checksums).
I tried on my machine (which is a bit slower than yours) and I get:
  990 ms  openssl sha256
  331 ms  b3sum --num-threads 1
   70 ms  b3sum
That's where the big performance gain is when using Blake3 IMO (even though 2.5x faster than a fast sha256 even when single-threaded is already nice).
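The "same hash regardless of thread count" property falls out of defining the hash over a fixed chunk tree, so the scheduling of subtree work can't change the result. Here's a toy illustration in Python; this is NOT BLAKE3's actual tree mode (it uses SHA-256 as the node hash and a made-up padding rule), it just shows why a tree hash is scheduling-independent:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024  # toy chunk size (BLAKE3 also chunks input in 1 KiB units)

def leaf(chunk):
    # Domain-separate leaf hashes from interior-node hashes.
    return hashlib.sha256(b"leaf" + chunk).digest()

def parent(left, right):
    return hashlib.sha256(b"node" + left + right).digest()

def root(hashes):
    # Reduce leaf hashes pairwise up to a single root. The tree shape is
    # fixed by the input length, not by how the leaves were computed.
    while len(hashes) > 1:
        if len(hashes) % 2:
            hashes = hashes + [hashes[-1]]  # pad odd levels (toy rule)
        hashes = [parent(hashes[i], hashes[i + 1])
                  for i in range(0, len(hashes), 2)]
    return hashes[0]

def tree_hash(data, pool=None):
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    # Leaves can be hashed serially or in parallel; the root is identical.
    leaves = list(pool.map(leaf, chunks)) if pool else [leaf(c) for c in chunks]
    return root(leaves)

if __name__ == "__main__":
    data = b"x" * 10_000
    with ThreadPoolExecutor(max_workers=4) as pool:
        assert tree_hash(data) == tree_hash(data, pool)  # same root either way
```

A serial SHA-256 over the whole stream can't be split up like this, which is exactly why it doesn't get the multi-core win.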
Yup. For comparison, with the same file as above and all 32 threads on my system, I get about 45 ms with fully parallel b3. It does run into diminishing returns fairly quickly though; unsurprisingly there's no improvement from hyperthreading, but the gains also drop off pretty fast as cores are added.
b3sum --num-threads 16 /tmp/rand_1G ran
  1.01 ± 0.02 times faster than b3sum --num-threads 15 /tmp/rand_1G
  1.01 ± 0.02 times faster than b3sum --num-threads 14 /tmp/rand_1G
  1.03 ± 0.02 times faster than b3sum --num-threads 13 /tmp/rand_1G
  1.04 ± 0.02 times faster than b3sum --num-threads 12 /tmp/rand_1G
  1.07 ± 0.02 times faster than b3sum --num-threads 11 /tmp/rand_1G
  1.10 ± 0.02 times faster than b3sum --num-threads 10 /tmp/rand_1G
  1.13 ± 0.02 times faster than b3sum --num-threads 9 /tmp/rand_1G
  1.20 ± 0.03 times faster than b3sum --num-threads 8 /tmp/rand_1G
  1.27 ± 0.03 times faster than b3sum --num-threads 7 /tmp/rand_1G
  1.37 ± 0.02 times faster than b3sum --num-threads 6 /tmp/rand_1G
  1.53 ± 0.05 times faster than b3sum --num-threads 5 /tmp/rand_1G
  1.72 ± 0.03 times faster than b3sum --num-threads 4 /tmp/rand_1G
  2.10 ± 0.04 times faster than b3sum --num-threads 3 /tmp/rand_1G
  2.84 ± 0.06 times faster than b3sum --num-threads 2 /tmp/rand_1G
  5.03 ± 0.12 times faster than b3sum --num-threads 1 /tmp/rand_1G
(runs with more than 16 threads elided, as they're all roughly equal to the 16-thread time)
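The flattening is easy to put numbers on. A back-of-the-envelope sketch (again assuming the file is exactly 1 GiB): at ~45 ms the hasher is pulling data at a rate that starts to look like memory bandwidth, and the 16-thread run is only about 5x the single-thread time:

```python
# Rough numbers from the runs above, assuming /tmp/rand_1G is 2**30 bytes.
SIZE = 2**30

t_all_threads = 0.045  # ~45 ms with all threads permitted
print(f"aggregate read rate: {SIZE / t_all_threads / 1e9:.1f} GB/s")

# hyperfine reported 16 threads as 5.03x faster than 1 thread:
print(f"parallel efficiency at 16 threads: {5.03 / 16:.0%}")
```

~24 GB/s aggregate and ~31% parallel efficiency at 16 threads is consistent with the job shifting from compute-bound to memory-bound.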
Figure 4 from https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak... is a related benchmark. On that particular machine we saw good scaling up to 16 threads, but yeah somewhere in that neighborhood you start to run into memory bandwidth issues. Which is the problem you want I guess :)
Benchmark 1: openssl sha256 /tmp/rand_1G
  Time (mean ± σ):   540.0 ms ± 1.1 ms  [User: 406.2 ms, System: 132.0 ms]
  Range (min … max): 538.5 ms … 542.3 ms  10 runs

Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
  Time (mean ± σ):   279.6 ms ± 0.8 ms  [User: 213.9 ms, System: 64.4 ms]
  Range (min … max): 278.6 ms … 281.1 ms  10 runs

Benchmark 3: sha256sum /tmp/rand_1G
  Time (mean ± σ):   509.0 ms ± 6.3 ms  [User: 386.4 ms, System: 120.5 ms]
  Range (min … max): 504.6 ms … 524.2 ms  10 runs
Further research suggests that GNU coreutils cksum will use libcrypto in some configurations (though not mine); I expect that both of your commands above are actually using sha-ni.
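One quick way to check that on Linux/x86 is to look for the `sha_ni` flag in /proc/cpuinfo. A small sketch (this only tells you what the CPU advertises, not what a particular binary actually uses; the flag name is x86-specific, and ARM reports a "Features" line instead):

```python
def cpu_has_flag(flag, cpuinfo="/proc/cpuinfo"):
    """Return True if the first CPU's 'flags' line lists `flag` (Linux, x86)."""
    try:
        with open(cpuinfo) as f:
            for line in f:
                if line.startswith("flags"):
                    return flag in line.split()
    except OSError:
        pass  # not Linux, or /proc unavailable
    return False

if __name__ == "__main__":
    print("sha_ni:", cpu_has_flag("sha_ni"))
    print("avx2:  ", cpu_has_flag("avx2"))
```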
I'd be curious to see power consumption. SHA (and AES) are usually available as what amounts to an ASIC built into the processor, while this requires a lot more work to be done with vector instructions.
That's not how it works on modern CPUs. Power draw at "100%" utilization can vary widely depending on what part of the core is being utilized. The SIMD units are typically the most power hungry part of the CPU by a large margin so just because a job finishes in less time doesn't mean total energy is necessarily lower.
The AES and SHA instructions are part of the vector units so their energy will be similar to other integer SIMD instructions. The overhead of issuing the instruction is higher than the work that it does so the details don't matter.
This is less precise than the perf numbers since I don't really have a way to measure power directly, but rerunning the benchmarks above locked to a CPU core, the core boosted to about the same level (around 5.5 GHz) for all 3 commands, so power draw should be roughly the same.
> The blake3 Rust crate, which includes optimized implementations for SSE2, SSE4.1, AVX2, AVX-512, and NEON, with automatic runtime CPU feature detection on x86. The rayon feature provides multithreading.
There aren't blake3 instructions, like some hardware has for SHA1, but it does use hardware acceleration.
edit: Re-reading, I think you're saying "If we're going to talk about hardware acceleration, SHA1 still has the advantage because of specific instructions" - that is true.
I just tested the C implementation in a utility I wrote[0], and at least on macOS, where SHA256 is hardware accelerated beyond just NEON, BLAKE3 ends up 5-10% slower than SHA256 from CommonCrypto (the Apple-provided crypto library) for the same input set.
As far as I'm aware, Apple does not expose any of the hardware crypto functions, so unless what exists supports BLAKE3 and they add support in CommonCrypto, there's no advantage to using it from a performance perspective.
The rust implementation is multithreaded and ends up beating SHA256 handily, but again, for my use case the C impl is only single threaded, and the utility assumes a single threaded hasher with one running on each core. Hashing is the bottleneck for `dedup`, so finding a faster hasher would have a lot of benefits.
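That one-single-threaded-hasher-per-core scheme is straightforward to sketch with a process pool. Python with hashlib.sha256 standing in for the hasher; the function names and the 1 MiB read size are illustrative, not taken from the dedup utility:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

def hash_file(path, algo="sha256", bufsize=1 << 20):
    """Hash one file with a single-threaded hasher, streaming 1 MiB reads."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return path, h.hexdigest()

def hash_files(paths):
    # One worker process per core; each runs its own single-threaded hasher.
    # Parallelism comes from hashing many files at once, not from splitting
    # one file (which is what BLAKE3's tree mode would add).
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return dict(pool.map(hash_file, paths))
```

With this layout a faster single-threaded hash is a direct win, since every core is already saturated with its own file.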
Keep in mind that many CPUs out there don't support those instructions (notably Intel's Skylake and ARM's Cortex A72). BLAKE3 will be significantly faster than SHA2 on many platforms out there.
[snip]
> AVX in Intel/AMD, Neon and Scalable Vector Extensions in Arm, and RISC-V Vector computing in RISC-V. BLAKE3 can take advantage of all of it.
Uh huh... AVX/x86 and NEON/ARM you say?
https://www.felixcloutier.com/x86/sha256rnds2
https://developer.arm.com/documentation/ddi0596/2021-12/SIMD...