If we're talking about vectorized instruction sets like AVX (Intel/AMD) or NEON (ARM), the advantage is clearly with SHA256. I don't think Blake3 has any hardware implementation at all yet.
Your typical cell phone running ARMv8 / NEON will be more efficient with the SHA256 instructions than whatever software routine you need to run Blake3. Dedicated hardware inside the cores is very difficult to beat on execution speed or efficiency.
I admit that I haven't run any benchmarks on my own. But I'd be very surprised if any software routine were comparable to the dedicated SHA256 instructions found on modern cores.
Followup to this, now with further blake3 improvements: on a faster machine, with sha extensions, vs single-threaded blake3, blake3 is about 2.5x faster than sha256 now (b3sum 1.5.0 vs openssl 3.0.11). b3sum is about 9x faster than sha256sum from GNU coreutils 9.3, which does not use the sha extensions.
Benchmark 1: openssl sha256 /tmp/rand_1G
  Time (mean ± σ):   576.8 ms ± 3.5 ms  [User: 415.0 ms, System: 161.8 ms]
  Range (min … max): 569.7 ms … 580.3 ms  10 runs

Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
  Time (mean ± σ):   228.7 ms ± 3.7 ms  [User: 168.7 ms, System: 59.5 ms]
  Range (min … max): 223.5 ms … 234.9 ms  13 runs

Benchmark 3: sha256sum /tmp/rand_1G
  Time (mean ± σ):   2.062 s ± 0.025 s  [User: 1.923 s, System: 0.138 s]
  Range (min … max): 2.046 s … 2.130 s  10 runs

Summary
  b3sum --num-threads 1 /tmp/rand_1G ran
    2.52 ± 0.04 times faster than openssl sha256 /tmp/rand_1G
    9.02 ± 0.18 times faster than sha256sum /tmp/rand_1G
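For context, those mean times translate into rough throughputs. A quick back-of-the-envelope calculation, assuming the test file is exactly 1 GiB (the filename suggests it, but that's an assumption):

```python
# Rough throughput from the hyperfine means above, assuming /tmp/rand_1G
# is exactly 1 GiB (2**30 bytes).
SIZE = 2**30  # bytes

times = {
    "openssl sha256 (sha-ni)": 0.5768,   # seconds, mean from the run above
    "b3sum --num-threads 1":   0.2287,
    "sha256sum (coreutils)":   2.062,
}

for name, t in times.items():
    print(f"{name}: {SIZE / t / 1e9:.2f} GB/s")
```

So single-threaded blake3 is doing roughly 4.7 GB/s here versus roughly 1.9 GB/s for hardware-accelerated sha256.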
> ...on a faster machine now but with sha extensions vs single-threaded blake3; blake3 is about 2.5x faster than sha256 now
But one of the sweet benefits of blake3 is that it is parallelized by default. I picked blake3 not because it's 2.5x faster than sha256 when running b3sum with "--num-threads 1" but because, with the defaults, it's 16x faster than sha256 (on a machine with "only" 8 cores).
And Blake3, unlike some other "parallelizable" hashes, gives the same hash whether you run it with one thread or many (IIRC there are hashes that ship separate single-threaded and multi-threaded versions, and those produce different checksums).
I tried on my machine (which is a bit slower than yours) and I get:
  990 ms  openssl sha256
  331 ms  b3sum --num-threads 1
   70 ms  b3sum
That's where the big performance gain is when using Blake3 IMO (even though 2.5x faster than a fast sha256 even when single-threaded is already nice).
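The "same hash regardless of thread count" property falls out of defining the hash over a fixed chunk tree, so the scheduling of subtree work can't change the result. Here's a toy illustration in Python; this is NOT BLAKE3's actual tree mode (it uses SHA-256 as the node hash and a made-up padding rule), it just shows why a tree hash is scheduling-independent:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024  # toy chunk size (BLAKE3 also chunks input in 1 KiB units)

def leaf(chunk):
    # Domain-separate leaf hashes from interior-node hashes.
    return hashlib.sha256(b"leaf" + chunk).digest()

def parent(left, right):
    return hashlib.sha256(b"node" + left + right).digest()

def root(hashes):
    # Reduce leaf hashes pairwise up to a single root. The tree shape is
    # fixed by the input length, not by how the leaves were computed.
    while len(hashes) > 1:
        if len(hashes) % 2:
            hashes = hashes + [hashes[-1]]  # pad odd levels (toy rule)
        hashes = [parent(hashes[i], hashes[i + 1])
                  for i in range(0, len(hashes), 2)]
    return hashes[0]

def tree_hash(data, pool=None):
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    # Leaves can be hashed serially or in parallel; the root is identical.
    leaves = list(pool.map(leaf, chunks)) if pool else [leaf(c) for c in chunks]
    return root(leaves)

if __name__ == "__main__":
    data = b"x" * 10_000
    with ThreadPoolExecutor(max_workers=4) as pool:
        assert tree_hash(data) == tree_hash(data, pool)  # same root either way
```

A serial SHA-256 over the whole stream can't be split up like this, which is exactly why it doesn't get the multi-core win.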
Yup. For comparison, with the same file as above and all 32 threads on my system, I get about 45 ms with fully parallel b3. It does run into diminishing returns fairly quickly though; unsurprisingly there's no improvement from hyperthreading, but the gains also drop off pretty fast as cores are added.
b3sum --num-threads 16 /tmp/rand_1G ran
  1.01 ± 0.02 times faster than b3sum --num-threads 15 /tmp/rand_1G
  1.01 ± 0.02 times faster than b3sum --num-threads 14 /tmp/rand_1G
  1.03 ± 0.02 times faster than b3sum --num-threads 13 /tmp/rand_1G
  1.04 ± 0.02 times faster than b3sum --num-threads 12 /tmp/rand_1G
  1.07 ± 0.02 times faster than b3sum --num-threads 11 /tmp/rand_1G
  1.10 ± 0.02 times faster than b3sum --num-threads 10 /tmp/rand_1G
  1.13 ± 0.02 times faster than b3sum --num-threads 9 /tmp/rand_1G
  1.20 ± 0.03 times faster than b3sum --num-threads 8 /tmp/rand_1G
  1.27 ± 0.03 times faster than b3sum --num-threads 7 /tmp/rand_1G
  1.37 ± 0.02 times faster than b3sum --num-threads 6 /tmp/rand_1G
  1.53 ± 0.05 times faster than b3sum --num-threads 5 /tmp/rand_1G
  1.72 ± 0.03 times faster than b3sum --num-threads 4 /tmp/rand_1G
  2.10 ± 0.04 times faster than b3sum --num-threads 3 /tmp/rand_1G
  2.84 ± 0.06 times faster than b3sum --num-threads 2 /tmp/rand_1G
  5.03 ± 0.12 times faster than b3sum --num-threads 1 /tmp/rand_1G
(runs with more than 16 threads elided, as they're all roughly equal to the 16-thread time)
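The flattening is easy to put numbers on. A back-of-the-envelope sketch (again assuming the file is exactly 1 GiB): at ~45 ms the hasher is pulling data at a rate that starts to look like memory bandwidth, and the 16-thread run is only about 5x the single-thread time:

```python
# Rough numbers from the runs above, assuming /tmp/rand_1G is 2**30 bytes.
SIZE = 2**30

t_all_threads = 0.045  # ~45 ms with all threads permitted
print(f"aggregate read rate: {SIZE / t_all_threads / 1e9:.1f} GB/s")

# hyperfine reported 16 threads as 5.03x faster than 1 thread:
print(f"parallel efficiency at 16 threads: {5.03 / 16:.0%}")
```

~24 GB/s aggregate and ~31% parallel efficiency at 16 threads is consistent with the job shifting from compute-bound to memory-bound.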
Figure 4 from https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak... is a related benchmark. On that particular machine we saw good scaling up to 16 threads, but yeah somewhere in that neighborhood you start to run into memory bandwidth issues. Which is the problem you want I guess :)
Benchmark 1: openssl sha256 /tmp/rand_1G
  Time (mean ± σ):   540.0 ms ± 1.1 ms  [User: 406.2 ms, System: 132.0 ms]
  Range (min … max): 538.5 ms … 542.3 ms  10 runs

Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
  Time (mean ± σ):   279.6 ms ± 0.8 ms  [User: 213.9 ms, System: 64.4 ms]
  Range (min … max): 278.6 ms … 281.1 ms  10 runs

Benchmark 3: sha256sum /tmp/rand_1G
  Time (mean ± σ):   509.0 ms ± 6.3 ms  [User: 386.4 ms, System: 120.5 ms]
  Range (min … max): 504.6 ms … 524.2 ms  10 runs
Further research suggests that GNU coreutils cksum will use libcrypto in some configurations (though not mine); I expect that both of your commands above are actually using sha-ni.
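One quick way to check that on Linux/x86 is to look for the `sha_ni` flag in /proc/cpuinfo. A small sketch (this only tells you what the CPU advertises, not what a particular binary actually uses; the flag name is x86-specific, and ARM reports a "Features" line instead):

```python
def cpu_has_flag(flag, cpuinfo="/proc/cpuinfo"):
    """Return True if the first CPU's 'flags' line lists `flag` (Linux, x86)."""
    try:
        with open(cpuinfo) as f:
            for line in f:
                if line.startswith("flags"):
                    return flag in line.split()
    except OSError:
        pass  # not Linux, or /proc unavailable
    return False

if __name__ == "__main__":
    print("sha_ni:", cpu_has_flag("sha_ni"))
    print("avx2:  ", cpu_has_flag("avx2"))
```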
I'd be curious to see power consumption. SHA (and AES) are usually available as what amounts to an ASIC built into the processor, while this requires a lot more work to be done with vector instructions.
That's not how it works on modern CPUs. Power draw at "100%" utilization can vary widely depending on what part of the core is being utilized. The SIMD units are typically the most power hungry part of the CPU by a large margin so just because a job finishes in less time doesn't mean total energy is necessarily lower.
The AES and SHA instructions are part of the vector units so their energy will be similar to other integer SIMD instructions. The overhead of issuing the instruction is higher than the work that it does so the details don't matter.
This is less precise than the perf numbers since I don't really have a way to measure power directly, but rerunning the benchmarks above locked to a CPU core, the core boosted to about the same level (around 5.5 GHz) for all 3 commands, so power draw should be roughly the same.
> The blake3 Rust crate, which includes optimized implementations for SSE2, SSE4.1, AVX2, AVX-512, and NEON, with automatic runtime CPU feature detection on x86. The rayon feature provides multithreading.
There aren't blake3 instructions, like some hardware has for SHA1, but it does use hardware acceleration.
edit: Re-reading, I think you're saying "If we're going to talk about hardware acceleration, SHA1 still has the advantage because of specific instructions" - that is true.
I just tested the C implementation in a utility I wrote[0], and at least on macOS, where SHA256 is hardware accelerated beyond just NEON, BLAKE3 ends up 5-10% slower than SHA256 from CommonCrypto (the Apple-provided crypto library) for the same input set.
As far as I'm aware, Apple does not expose any of the hardware crypto functions, so unless what exists supports BLAKE3 and they add support in CommonCrypto, there's no advantage to using it from a performance perspective.
The rust implementation is multithreaded and ends up beating SHA256 handily, but again, for my use case the C impl is only single threaded, and the utility assumes a single threaded hasher with one running on each core. Hashing is the bottleneck for `dedup`, so finding a faster hasher would have a lot of benefits.
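That one-single-threaded-hasher-per-core scheme is straightforward to sketch with a process pool. Python with hashlib.sha256 standing in for the hasher; the function names and the 1 MiB read size are illustrative, not taken from the dedup utility:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

def hash_file(path, algo="sha256", bufsize=1 << 20):
    """Hash one file with a single-threaded hasher, streaming 1 MiB reads."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return path, h.hexdigest()

def hash_files(paths):
    # One worker process per core; each runs its own single-threaded hasher.
    # Parallelism comes from hashing many files at once, not from splitting
    # one file (which is what BLAKE3's tree mode would add).
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return dict(pool.map(hash_file, paths))
```

With this layout a faster single-threaded hash is a direct win, since every core is already saturated with its own file.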
Keep in mind that many CPUs out there don't support those instructions (notably Intel's Skylake and ARM's Cortex A72). BLAKE3 will be significantly faster than SHA2 on many platforms out there.
[snip]
> AVX in Intel/AMD, Neon and Scalable Vector Extensions in Arm, and RISC-V Vector computing in RISC-V. BLAKE3 can take advantage of all of it.
Uh huh... AVX/x86 and NEON/ARM you say?
https://www.felixcloutier.com/x86/sha256rnds2
https://developer.arm.com/documentation/ddi0596/2021-12/SIMD...