SHA256 was based on SHA1 (which is weak). BLAKE was based on ChaCha20, which was based on Salsa20 (which are both strong).
NIST/NSA have repeatedly signaled lack of confidence in SHA256: first by hastily organising the SHA3 contest in the aftermath of Wang's break of SHA1
No: SHA2 lacks the structure the SHA1 attack relies on it (SHA1 has a linear message schedule, which made it possible to work out a differential cryptanalysis attack on it).
Blake's own authors keep saying SHA2 is secure (modulo length extension), but people keep writing stuff like this. Blake3 is a good and interesting choice on the real merits! It doesn't need the elbow throw.
While there is more confidence now on the security of SHA-2, or rather the lack of transference of the SHA-1 approach to SHA-2, this was not the case in 2005-2006 when NIST decided to hold the SHA-3 competition. See for example the report on Session 4 of the 2005 NIST workshop on hash functions [1].
For those who didn't click the link, it should be noted that they're suggesting this because it would be easier to deploy (in places that have a SHA-2 implementation but not SHA-3), not for reasons related to security or anything like that. Looking at the responses, there's also some disagreement on whether it would offer equal security for the particular use case of ML-DSA and ML-KEM (as the final version of Dilithium and Kyber standardized by NIST will be called).
> they're suggesting this because it would be easier to deploy (in places that have a SHA-2 implementation but not SHA-3), not for reasons related to security
That’s a bit absurd, right? Sure, the NSA didn’t overtly say, “we propose you use SHA-2 because we can break it.” That doesn’t mean it’s secure against them.
We can’t look at their stated justification for supporting one algorithm over another because the NSA lies. Their very _purpose_ as an organization is to defeat encryption, and one tactic is to encourage the industry to use something they can defeat while reassuring people it’s secure. We need to look at their recommendations with a lot of suspicion and assume an ulterior motive.
The NSA purpose is also to provide cybersecurity to protect US combat operations, which means they have to secure encryption.[1] I wouldn't go as far as to say the NSA should be trusted, or that they haven't tried to compromise encryption before, just that their motivations are contradictory.
Besides you aren't accounting for reverse psychology. What if SHA2 was insecure, and Blake was secure, and the NSA just tricked people into not using Blake by saying that it's secure? If we can't trust what the NSA says, it would be wisest to disregard what they say, rather than react to it.
[1] https://www.nsa.gov/About/Mission-Combat-Support/
We provide wireless and wired secure communications to our warfighters and others in uniform no matter where they are, whether traveling through Afghanistan in a Humvee, diving beneath the sea, or flying into outer space. Our cybersecurity mission also produces and packages the codes that secure our nation's weapons systems.
Additionally, we set common protocols and standards so that our military can securely share information with our allies, NATO and coalition forces around the world. Interoperability is a key to successful joint operations and exercises.
Most people who publicly opine on the Blake vs. SHA2 debate seem to be relatively uninformed on the realities of each one. SHA2 and the Blakes are both usually considered to be secure.
The performance arguments most people make are also outdated or specious: the original comparisons of Blake vs SHA2 performance on CPUs were largely done before Intel and AMD had special SHA2 instructions.
Sorry, I should have been more precise. JP Aumasson is specifically who I'm thinking of; he's made the semi-infamous claim that SHA2 won't be broken in his lifetime. The subtext I gather is that there's just nothing on the horizon that's going to get it. SHA1 we saw coming a ways away!
The summary is that either you attack a very reduced round variant and you get "practical" complexity for the attack, or you attack almost a full round variant and you get an entirely practical attack.
So I think your interpretation of the subtext is entirely correct.
Who I'm sure actually is informed, but in this particular case is tweeting things that do honestly sound like one of the uninformed commentators pclmulqdq mentioned. I'm not sure why, since as tptacek said, blake3 is good and maybe even preferable on it's own merits without venturing into FUD territory. And if you still wanted to get into antiquated design arguments, picking on SHA256's use of a construction that allows length extension attacks seems like more fair game.
There are fundamentally two kinds of attacks, preimage which splits into two:
In a first-preimage attack, you know a hash value but not the message that created it, and you want to discover any message with the known hash value; in the second-preimage attack, you have a message and you want to find a second message that has the same hash. Attacks that can find one type of preimage can often find the other as well. A weak algorithm allows this to be done in less than 2^(hash length) attempts.
And then there is collision: two messages which produce the same hash. A weak algorithm allows this to be done in less than 2^(half of hash length) attempts.
Weak means that a mathematical flaw has been discovered that makes it inherently insecure, or that it is so simple that modern computer technology makes it possible to use “brute force” to crack. Strong means that it's not either.
Smart idea doing the hash choice per-commit. Just make sure that somebody putting in an obscure hash doesn't mess up everybody's usage of the repo if they don't have a library to evaluate that hash installed.
There will be a set of presets of hash function and settings; if BLAKE3 fails, then I'll actually have to add SHA3 or something, with a set of settings, as presets.
The per-commit storage will then be an enum identifying the hash and its settings.
This will let me do other things, like letting companies use a 512-bit hash if they expect the repo to be large.
> letting companies use a 512-bit hash if they expect the repo to be large.
A repo would have to have more than 1e32 documents for a one in a trillion chance of a collision with a 256 bit hash. (Total annual world data production is estimated at less than 1e24 bytes.)
A 512 bit hash thus seems overkill for almost all purposes.
For normal VCS's, you are absolutely right. And you're actually right for mine, but I decided to redo the math to make sure.
My VCS will track things at a finer level than documents. In a C file, it will track individual functions and structs. In a Java file, it will track individual fields and classes. In a Microsoft Word document, it might track individual paragraphs. And in a Blender file, it will track each object, material, texture, etc. individually.
Yes, it will handle binary files.
Anyway, it will also be designed for non-technical users. To that end, it will hook into the source software and do a "commit" every time the user saves.
It will also track individual directories to make renames work.
I am a ctrl+s freak, so I save once a minute or more. However, other people are not, so let's assume 10 minutes (for autosave, perhaps).
Now let's assume a monorepo for a company of 100,000 people. And let's assume that when they save every 10 minutes, they save one object in one file (also tracked) two directories down. That means they create 5 hashes every 10 minutes (the fifth is the top level).
Let's assume an effective 6-hour work day.
That's 5 objects times 6 times per hour times 6 hours. That's 180 objects a day per person.
That's 18,000,000 total objects per day. Times 5 for days in a week, times 50 for work weeks in a year.
That's 4.5 billion.
Let's multiply that by 40 for 40 years that the repo exists, which includes some of the oldest software.
That's 1.8e11 objects. According to [1], a 128-bit hash would not be enough for the error correction on a disk at that point.
However, a 256-bit hash would give us a 10^31 objects before reaching that point, which gives us 10^20 times 40 years of space.
Yep, you're absolutely right that 512 bits is overkill. I stand corrected.
You're tracking things at the content level? How will you deal with files that are purposely broken, or which cause the parser to take impractical (but finite) times to complete? Also, tracking the history of a class makes sense to some extent, but you say you want to commit every time there's a save. How will you maintain a history when most commits are likely to contain unparseable code and so break the continuity of objects?
> How will you deal with files that are purposely broken, or which cause the parser to take impractical (but finite) times to complete?
I've never seen a language parser do that, but if I run into a language that does that, I'll probably have my VCS track it at the file level, based on tokens or lines.
Dumb languages don't get nice things. :)
> How will you maintain a history when most commits are likely to contain unparseable code and so break the continuity of objects?
This is less of a problem with binary files (assuming the source software does not have bugs in output), but with source files, you're right that that problem does exist.
As of right now, I would do a token-based approach. This approach removes the need for whitespace-only commits, and if I track the tokens right, I should be able to identify which right brace used to end the function until the broken code was saved. Then I would just save the function as broken using that same right brace.
For example, say you have this:
int main() {
return 0;
}
My VCS would know that the right brace corresponds to the end of the function.
Then you write this:
int main() {
if (global_bool) {
return 0;
}
Yes, a dumb system might think that the right brace is for the `if`.
However, if you break it down by tokens, the VCS will see that `if (global_bool) {` were added before the return, so it should be able to tell that the right brace still ends the function.
I hope that makes sense.
Another plausible way to do it (at least in C) would be to look for things that look like declarations. The series of tokens `<type> <name> <left_paren>` is probably a function declaration. Java would be easier; its declarations are more wordy.
I still have to prove this is possible, but I think it is.
In those cases you can just do error recovery in the parser (truncating an erroring function for example) and then store out-of-band information necessary to reconstruct the original file
This is also necessary to deal with whitespace for example (if you just reformat the code, you didnt change the ast but you changed the file)
Maybe you’re already aware, but you glossed over something: Since you’re using the hash to locate/identify the contect (you mentioned Merkle and git), if you support multiple hash functions you need some assurance that the chance of collisions is low across all supported hash functions. For example two identical functions that differ only in the value of their padding bytes (when the input size doesn’t match the block size) can’t coexist.
The multihash format is an excellent format that I am tempted to use.
However, there are a two general problems:
* The spec is not done, and it doesn't look like much has been done.
* While I agree with the FAQ that agreeing on a set of hash functions is possible, the format only allows 256 possible hash functions, so it can run out of space, especially since there can be multiple formats of the same function (BLAKE2B and BLAKE2S, for example), and especially since they want to allow non-cryptographic hash functions.
Then specifically for my use case:
* There is the problem brought up by AdamN [1]: if multihash is supported, an obscure hash may be supported, and that may cause problems.
* As an extension of that, once a hash function and set of settings is supported, it's supported forever, so I want to be more picky, and an enum allows me to restrict what is usable.
* By using a 16-bit enum, I have more space to grow.
* And finally, by using an enum, if my software encounters a repo with a enum value greater than all of the values it knows, it knows that that repo needs a later version of the software, since I will only add enum values.
zpaq archiver solves that by including decompression bytecode inside archives. So check if repository supports your algorithm, if not then include it inside your commit.
For short inputs, Blake3 behaves very similar to Blake2, on which it is based. From Blake's wikipedia page [1]:
BLAKE3 is a single algorithm with many desirable features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE and BLAKE2, which are algorithm families with multiple variants. BLAKE3 has a binary tree structure, so it supports a practically unlimited degree of parallelism (both SIMD and multithreading) given long enough input.
While I really like Blake3, for all reasons mentioned in this article, I have to say it does have one tiny disadvantage over older hashes like SHA-256: its internal state is slightly bigger (due to the tree structure which allows it to be highly parallelizable). This can matter when running on tiny microcontrollers with only a few kilobytes of memory.
It's hard to give a short answer to that question :)
- Yes, if you know your input is short, you can use a smaller state. The limit is roughly a BLAKE2s state plus (32 bytes times the log_2 of the number of KiB you need to hash). Section 5.4 of https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak... goes into this.
- But it's hard to take advantage of this space optimization, because no libraries implement it in practice.
- But the reason libraries don't implement it is that almost no one needs it. The max state size is just under 2 KiB, which is small enough even for https://github.com/oconnor663/blake3-6502.
- But it would be super easy to implement if we just put the "CV stack" on the heap instead of allocating the whole thing as an array up front.
- But the platforms that care about this don't have a heap.
@caesarb mentioned really tiny microcontrollers, even tinier than the 6502 maybe. The other place I'd expect to see this optimization is in a full hardware implementation, but those are rare. Most hardware accelerators for hash functions provide the block operation, and they leave it to software to deal with this sort of bookkeeping.
However, for smaller inputs (~1024 bytes and down), the performance gap between it and everything else (blake2, sha256) gets much narrower, because you don't get to benefit from the structural parallelization.
If you're mostly dealing with small inputs, raw hash throughput is probably not high on your list of concerns - In the context of a protocol or application, other costs like IO latency probably completely dwarf the actual CPU time spent hashing.
If raw performance is no longer high on your list of priorities, you care more about the other things - ubiquitous and battle-tested library support (blake3 is still pretty bleeding-edge, in the grand scheme of things), FIPS compliance (sha256), greater on-paper security margin (blake2). Which is all to say, while blake3 is great, there are still plenty of reasons not to prefer it for a particular use-case.
I agree that if you can, BLAKE3 (or even BLAKE2) are nicer choices than SHA2. However I would like to add the following comments:
* SHA-2 fixes the problems with SHA-1. SHA-1 was a step up over SHA-0 that did not completely resolve flaws in SHA-0's design (SHA-0 was broken very quickly).
* JP Aumasson (one of the B3 authors) has said publicly a few times SHA-2 will never be broken: https://news.ycombinator.com/item?id=13733069 is an indirect source, can't seem to locate a direct one from Xitter (thanks Elon).
Thus it does not necessarily follow that SHA-2 is a bad choice because SHA-1 is broken.
However, I don't think we can say for sure if SHA2 will be broken. Cryptography is hard like that.
In addition, SHA2 is still vulnerable to length extension attacks, so in a sense, SHA2 is broken, at least when length extension attacks are part of the threat model.
If you want to be pedantic we can say there is definitely a collision in SHA-2. Assume we have 2^256 unique inputs. Hash them all and assume no collisions. Now, if we have one more unique input (so 2^256 + 1 inputs) we have a collision. The same logic applies to BLAKE3.
However we do actually know quite a bit on how to design hash functions to make this hard to do in practice. The latest cryptanalysis (to actually find a collision) either requires a vastly reduced number of rounds or is is computationally infeasible. There's no clear flaw like there was with SHA1, where the path to finding a collision has been known since ~2004.
Length extension "attacks" sure, that's an unfortunate design choice. But it doesn't impact at all on collision resistance, which is what is implied by suggesting SHA1 is vulnerable then SHA2 is.
In the end, if you can use BLAKE3 or BLAKE2, great, I probably would as well. There isn't always a choice (e.g. there's no blake3 support in most crypto hardware) and if there isn't, sha3 or sha2 are fine choices.
What I dislike about BLAKE3 is that they added explicit logic to ensure that identical chunks stored at different offsets result in different Merkle tree nodes (a.k.a. the ‘chunk counter’).
Though this feature is well intended, it makes this hash function hard to use for a storage system where you try to do aggressive data deduplication.
Furthermore, on platforms that provide native instructions for SHA hashing, BLAKE3 isn’t necessarily faster, and certainly more power hungry.
The storage system doing this wouldn’t use that part of the hash, it would do it itself so no issues? (Hash chunks, instead of feeding everything in linearly)
Otherwise the hash isn’t going to be even remotely safe for most inputs?
Answer: identify chunks via something like rsyncs rolling window or GearHash, then name those chunks by Blake3.
Trying to use Blake3's tree structure directly to dedupe is a misunderstanding of the problem you're trying to solve. Removing the counter would not let you use Blake3 alone for this purpose.
Could you point to how this is implemented and how it can be used? From the sound of it, you're trying to do something like rsync's running-window comparison?
Imagine the case where you're trying to create a storage system for a large number of virtual machine images (e.g., you're trying to build your own equivalent of AWS Machine Images). There is obviously a lot of duplication between parts of images. And not necessarily at the same offset, but also at different offsets that are n*2^k bytes apart, where 2^k represents the block/sector size.
You could consider building this storage system on top of BLAKE3's tree model. Namely you store blocks as small Merkle tree. And an image is basically a collection of blocks that has a different 'hat' on top of it. Unfortunately, BLAKE3 makes this hard, because the same block will end up having a different Merkle tree node depending on the offset at which it's stored.
Author of HashBackup here. I don't see how any kind of hash tree would be effective at de-duplicating VM machine images, other than the degenerate case of an exact copy, which is easy to detect with a single file hash.
Most OSes use 4K block sizes. To get the best dedup you have to hash every 4K block and lookup each one individually in a dedup table. Two VM images could both contain an identical 4GB file, but every 4K block of that file could be stored at different offsets in the VM images. A tree hash wouldn't let you dedup anything but identical sections stored at identical offsets, whereas a dedup table and 4K blocks allows you to dedup the entire file.
> You could consider building this storage system on top of BLAKE3's tree model.
Consider a crypto currency pow that did that without the chunk counter. It'd be trivially exploitably by precalculating all the tree but the chunk that changed per nonce.
You can use a CDC algorith, but if you know that duplication mostly occurs at power-of-two boundaries, there is no need to use that. Deduplicating on a binary tree basis is sufficient.
It's an interesting set of reasons, but I prefer Keccak/SHA-3 over SHA-256, SHA-512, and BLAKE. I trust the standards body and public competition and auditing that took place - more so than a single author trumpeting the virtues of BLAKE.
Ironic, because the final NIST report explaining their choice mentions that BLAKE has more open examination of cryptanalysis than Keccak as a point in favor of BLAKE.
At the end of the day, what really matters for most people is
1) Certifications (FIPS...)
2) Speed.
SHA-256 is fast enough for maybe 99,9% of use cases as you will saturate your I/O way before SHA-256 becomes your bottleneck[0][1]. Also, from my experience with the different available implementations, SHA-256 is up to 1.8 times faster than Blake3 on arm64.
I mostly agree with you, but there are a couple other bullet points I like to throw in the mix:
- Length extension attacks. I think all of the SHA-3 candidates did the right thing here, and we would never accept a new cryptographic hash function that didn't do the right thing here, but SHA-2 gets a pass for legacy reasons. That's understandable, but we need to replace it eventually.
- Kind of niche, but BLAKE3 supports incremental verification, i.e. checking the hash of a file while you stream it rather learning whether it was valid at the end of the stream. https://github.com/oconnor663/bao. That's useful if you know the hash of a file but you don't necessarily trust the service that's storing it.
I think SHA-256 is still marginal for speed in modern environments unless your I/O is unusually limited relative to CPU. Current servers can support 10s of GB/s combined throughput for network and storage, which is achievable in practice for quite a few workloads. Consequently, you have to plan for the CPU overhead of the crypto at the same GB/s throughput since it is usually applied at the I/O boundaries. The fact that SHA256 requires burning the equivalent of several more cores relative to Blake3 has been a driver in Blake3 anecdotally creeping into a lot of data infrastructure code lately. At these data rates, the differences in performance of the hash functions is not a trivial cost in the cases where you would use a hash function (instead of e.g. authenticated encryption).
The arm64 server case is less of a concern for other reasons. Those cores are significantly weaker than amd64 cores, and therefore tend to not be used for data-intensive processing regardless. This allows you to overfit for AVX-512 or possibly use SHA256 on arm64 builds depending on the app.
There is a strong appetite for as much hashing performance per core as possible for data-intensive processing because it consumes a significant percentage of the total CPU time in many cases. Due to the rapid growing scale, non-cryptographic hash functions are no longer fit for purpose much of the time.
Fast hash functions are really important, and SHA256 is really slow. Switching the hash function where you can is enough to result in user-visible speedups for common hashing use cases; verifying build artifacts, seeing if on-disk files changed, etc. I was writing something to produce OCI container images a few months ago, and the 3x SHA256 required by the spec for layers actually takes on the order of seconds. (.5s to sha256 a 50MB file, on my 2019-era Threadripper!) I was shocked to discover this. (gzip is also very slow, like shockingly slow, but fortunately the OCI spec lets you use Zstd, which is significantly faster.)
SHA256 is very fast on most modern CPUs, i.e. all AMD Zen, all Intel Atom since 2016, Intel Core Ice Lake or newer, Armv8 and Armv9.
I use every day both SHA-256 and BLAKE3. BLAKE3 is faster only because it is computed by multiple threads using all available CPU cores. When restricted to a single thread, it is slower on CPUs with fast hardware SHA-256.
The extra speed of BLAKE3 is not always desirable. The fact that it uses all cores can slow down other concurrent activities, without decreasing the overall execution time of the application.
There are cases when the computation of a hash like SHA-256 can be done as a background concurrent activity, or when the speed of hashing is limited by the streaming speed of data from the main memory or from a SSD, so spawning multiple threads does not gain anything and it only gets in the way of other activities.
So the right choice between SHA-256 and BLAKE3 depends on the application. Both can be useful. SHA-256 has the additional advantage that it needs negligible additional code, only a few lines are necessary to write a loop that computes the hash using the hardware instructions.
>I use every day both SHA-256 and BLAKE3. BLAKE3 is faster only because it is computed by multiple threads using all available CPU cores. When restricted to a single thread, it is slower on CPUs with fast hardware SHA-256.
That's not actually my experience. Last I tested, BLAKE3 was about twice as fast, single-threaded, as SHA256 on a Zen 3 CPU, which has the extensions.
Lower down in the thread someone else did a comparison, and came out with a similar result.
"modern hardware" deserves some caveats. AMD has supported those extensions since the original Zen, but Intel CPUs generally lacked them until only about 2 years ago.
For many years, starting in 2016, Intel has supported SHA-256 only in their Atom CPUs.
The reason seems to be that the Atom CPUs were compared in Geekbench with ARM CPUs, and without hardware SHA the Intel CPUs would have obtained worst benchmark scores.
In their big cores, SHA has been added in 2019, in Ice Lake (while Comet Lake still lacked it, being a Skylake derivative), and since then all newer Intel CPUs have it.
So except for the Intel Core CPUs, the x86 and ARM CPUs have had hardware SHA for at least 7 years, while the Intel Core CPUs have had it for the last 4 years.
>SHA has been added in 2019, in Ice Lake (while Comet Lake still lacked it, being a Skylake derivative)
Ice Lake was effectively a paper launch with low volume, repeated delays and mediocre performance. The server CPUs weren't released until 2021.
In terms of relevant quantities and relevant markets (e.g. not Atom or gaming laptops), Intel CPUs have only been "shipping" with those extensions for around 2.5 years, not 4.
Not even close. CRC32 can easily run at >50GB/s single thread on this i7-12700K CPU (VPCLMULQDQ implementation). The BLAKE3 page claims around 7GB/s single thread. Fudging the figures a bit to cater to CPU differences, BLAKE3 is still a far cry from CRC32.
To be fair, it really depends on the platform. There's an argument to be made that platforms where you care about the difference are specifically the ones where BLAKE3 is slower (no SIMD, no threads).
If we're talking about vectorized instruction sets like AVX (Intel/AMD) or NEON (aka: ARM), the advantage is clearly with SHA256. I don't think Blake3 has any hardware implementation at all yet.
Your typical cell phone running ARMv8 / NEON will be more efficient with the SHA256 instructions than whatever software routine you need to run Blake3. Dedicated hardware inside the cores is very difficult to beat on execution speed or efficiency.
I admit that I haven't run any benchmarks on my own. But I'd be very surprised if any software routine were comparable to the dedicated SHA256 instructions found on modern cores.
followup to this now with further blake3 improvements, on a faster machine now but with sha extensions vs single-threaded blake3; blake3 is about 2.5x faster than sha256 now. (b3sum 1.5.0 vs openssl 3.0.11). b3sum is about 9x faster than sha256sum from coreutils (GNU, 9.3) which does not use the sha extensions.
Benchmark 1: openssl sha256 /tmp/rand_1G
Time (mean ± σ): 576.8 ms ± 3.5 ms [User: 415.0 ms, System: 161.8 ms]
Range (min … max): 569.7 ms … 580.3 ms 10 runs
Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
Time (mean ± σ): 228.7 ms ± 3.7 ms [User: 168.7 ms, System: 59.5 ms]
Range (min … max): 223.5 ms … 234.9 ms 13 runs
Benchmark 3: sha256sum /tmp/rand_1G
Time (mean ± σ): 2.062 s ± 0.025 s [User: 1.923 s, System: 0.138 s]
Range (min … max): 2.046 s … 2.130 s 10 runs
Summary
b3sum --num-threads 1 /tmp/rand_1G ran
2.52 ± 0.04 times faster than openssl sha256 /tmp/rand_1G
9.02 ± 0.18 times faster than sha256sum /tmp/rand_1G
> ...on a faster machine now but with sha extensions vs single-threaded blake3; blake3 is about 2.5x faster than sha256 now
But one of the sweet benefit of blake3 is that it is parallelized by default. I picked blake3 not because it's 2.5x faster than sha256 when running b3sum with "--num-threads 1" but because, with the default, it's 16x faster than sha256 (on a machine with "only" 8 cores).
And Blake3, contrarily to some other "parallelizable" hashes, give the same hash no matter if you run it with one thread or any number of threads (IIRC there are hashes which have different executables depending if you want to run the single-threaded or multi-threader version of the hash, and they give different checksums).
I tried on my machine (which is a bit slower than yours) and I get:
990 ms openssl sha256
331 ms b3sum --num-threads 1
70 ms b3sum
That's where the big performance gain is when using Blake3 IMO (even though 2.5x faster than a fast sha256 even when single-threaded is already nice).
yup, for comparison, same file as above using all the threads (32) on my system, I get about 45ms with fully parallel permitted b3. It does run into diminishing returns fairly quickly though; unsurprisingly no improvements in perf using hyperthreading, but also improvements drop off pretty fast with more cores.
b3sum --num-threads 16 /tmp/rand_1G ran
1.01 ± 0.02 times faster than b3sum --num-threads 15 /tmp/rand_1G
1.01 ± 0.02 times faster than b3sum --num-threads 14 /tmp/rand_1G
1.03 ± 0.02 times faster than b3sum --num-threads 13 /tmp/rand_1G
1.04 ± 0.02 times faster than b3sum --num-threads 12 /tmp/rand_1G
1.07 ± 0.02 times faster than b3sum --num-threads 11 /tmp/rand_1G
1.10 ± 0.02 times faster than b3sum --num-threads 10 /tmp/rand_1G
1.13 ± 0.02 times faster than b3sum --num-threads 9 /tmp/rand_1G
1.20 ± 0.03 times faster than b3sum --num-threads 8 /tmp/rand_1G
1.27 ± 0.03 times faster than b3sum --num-threads 7 /tmp/rand_1G
1.37 ± 0.02 times faster than b3sum --num-threads 6 /tmp/rand_1G
1.53 ± 0.05 times faster than b3sum --num-threads 5 /tmp/rand_1G
1.72 ± 0.03 times faster than b3sum --num-threads 4 /tmp/rand_1G
2.10 ± 0.04 times faster than b3sum --num-threads 3 /tmp/rand_1G
2.84 ± 0.06 times faster than b3sum --num-threads 2 /tmp/rand_1G
5.03 ± 0.12 times faster than b3sum --num-threads 1 /tmp/rand_1G
(over 16 elided from this run as they're all ~= the 16 time)
Figure 4 from https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak... is a related benchmark. On that particular machine we saw good scaling up to 16 threads, but yeah somewhere in that neighborhood you start to run into memory bandwidth issues. Which is the problem you want I guess :)
Benchmark 1: openssl sha256 /tmp/rand_1G
Time (mean ± σ): 540.0 ms ± 1.1 ms [User: 406.2 ms, System: 132.0 ms]
Range (min … max): 538.5 ms … 542.3 ms 10 runs
Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
Time (mean ± σ): 279.6 ms ± 0.8 ms [User: 213.9 ms, System: 64.4 ms]
Range (min … max): 278.6 ms … 281.1 ms 10 runs
Benchmark 3: sha256sum /tmp/rand_1G
Time (mean ± σ): 509.0 ms ± 6.3 ms [User: 386.4 ms, System: 120.5 ms]
Range (min … max): 504.6 ms … 524.2 ms 10 runs
further research suggests that GNU coreutils cksum will use libcrypto in some configurations (though not mine); I expect that both both your commands above are actually using sha-ni
I'd be curious to see power consumption. SHA (and AES) are usually available as what amounts to an ASIC built into the processor, while this requires a lot more work to be done with vector instructions.
That's not how it works on modern CPUs. Power draw at "100%" utilization can vary widely depending on what part of the core is being utilized. The SIMD units are typically the most power hungry part of the CPU by a large margin so just because a job finishes in less time doesn't mean total energy is necessarily lower.
The AES and SHA instructions are part of the vector units so their energy will be similar to other integer SIMD instructions. The overhead of issuing the instruction is higher than the work that it does so the details don't matter.
this is less precise than the perf numbers as I don't really have a way to measure power directly, but with rerunning the benchmarks above locked to a cpu core, it boosted ~the same level for all 3 commands (about 5.5ghz), so should be ~the same power usage.
> The blake3 Rust crate, which includes optimized implementations for SSE2, SSE4.1, AVX2, AVX-512, and NEON, with automatic runtime CPU feature detection on x86. The rayon feature provides multithreading.
There aren't blake3 instructions, like some hardware has for SHA1, but it does use hardware acceleration.
edit: Re-reading, I think you're saying "If we're going to talk about hardware acceleration, SHA1 still has the advantage because of specific instructions" - that is true.
I just tested the C implementation on a utility I wrote[0] and at least on macOS where SHA256 is hardware accelerated beyond just NEON, BLAKE3 ends up being slower than SHA256 from CommonCrypto (the Apple provided crypto library). BLAKE3 ends up being 5-10% slower for the same input set.
As far as I'm aware, Apple does not expose any of the hardware crypto functions, so unless what exists supports BLAKE3 and they add support in CommonCrypto, there's no advantage to using it from a performance perspective.
The rust implementation is multithreaded and ends up beating SHA256 handily, but again, for my use case the C impl is only single threaded, and the utility assumes a single threaded hasher with one running on each core. Hashing is the bottleneck for `dedup`, so finding a faster hasher would have a lot of benefits.
Keep in mind that many CPUs out there don't support those instructions (notably Intel's Skylake and ARM's Cortex A72). BLAKE3 will be significantly faster than SHA2 on many platforms out there.
- SHA-256 has hardware acceleration on many platforms, but SHA-512 mostly doesn't.
- Setting aside hardware acceleration, SHA-256 is faster on 32-bit platforms, like a lot of embedded devices. If you have to choose between "fast on a desktop" vs "fast in embedded", it can make sense to assume that desktops are always fast enough and that your bottlenecks will be in embedded.
On older 64-bit CPUs without hardware SHA-256 (i.e. up to the Skylake derivatives), SHA-512 is faster.
Many recent Arm CPUs have hardware SHA-512 (and SHA-3).
Intel will add hardware SHA-512 starting with Arrow Lake S, to be launched at the end of 2024 (the successor in desktops of the current Raptor Lake Refresh).
Most 64-bit CPUs that have been sold during the last 4 years and many of those older than that have hardware SHA-256.
One metric that is seldom mentioned for crypto algos is code complexity.
I really wish researchers would at least pay lip service to it.
TEA (an unfortunately somewhat weak symmetric cipher) was a very nice push in that direction.
TweetNaCl was another very nice push in that direction by djb
Why care about that metric you ask?
Well here are a couple of reasons:
- algo fits in head
- algo is short -> cryptanalysis likely easier
- algo is short -> less likely to have buggy implementation
- algo is short -> side-channel attacks likely easier to analyse
- algo fits in a 100 line c++ header -> can be incorporated into anything
- algo can be printed on a t-shirt, thereby skirting export control restrictions
- algo can easily be implemented on tiny micro-controllers
> One metric that is seldom mentioned for crypto algos is code complexity. ... TEA (an unfortunately somewhat weak symmetric cipher) was a very nice push in that direction.
Spec is also a push in that direction [0]. It's code looks to be as complex as TEA's (1/2 a page of C), blindingly fast, yet as far I know has no known attacks despite being subject to a fair bit of scrutiny. About the only reason I can see for it not being largely ignored is it was designed by NSA.
SHA3 is also a simple algorithm. Downright pretty, in fact. It's a pity it's so slow.
How does the extended output work, and what's the point of extended output?
From what I can see, BLAKE3 has 256 bits of security, and extended output doesn't provide any extra security. In this case, what's the point of extended output over doing something like padding with 0-bits or extending by re-hashing the previous output and appending it to the previous output (eg, for 1024 bits, doing h(m) . h(h(m)) . h(h(h(m))) . h(h(h(h(m))))). Either way, you get 256 bits of security.
Is it just because the design of the hash makes it simple to do, so it's just offered as a consistent option for arbitrary output sizes where needed, or is there some greater purpose that I'm missing?
> From what I can see, BLAKE3 has 256 bits of security, and extended output doesn't provide any extra security.
128 bits of collision resistance but otherwise correct. As a result of that we usually just call it 128 bits across the board, but yes in an HMAC-like use case you would generally expect 256 bits of security from the 256 bit output. Extended outputs don't change that, because the internal chaining values are 256 bits even when the output is larger.
> extending by re-hashing the previous output and appending it to the previous output
It's not quite that simple, because you don't want later parts of your output to be predictable from earlier parts (which might be published, depending on the use case). You also want it to be parallelizable.
You could compute H(m) as a "pre-hash" and then make an extended output something like H(H(m)|1)|H(H(m)|2)|... That's basically what BLAKE3 is doing in the inside. The advantage of having the algorithm do it for you is that 1) it's an "off the shelf" feature that doesn't require users to roll their own crypto and 2) it's slightly faster when the input is short, because you don't have to spend an extra block operation computing the pre-hash.
> what's the point of extended output?
It's kind of niche, but for example Ed25519 needs a 512 bit hash output internally to "stretch" its secret seed into two 256-bit keys. You could also use a BLAKE3 output reader as a stream cipher or a random byte generator. (These sorts of use cases are why it's nice not to make the caller tell you the output length in advance.)
That makes sense. I hadn't thought about using that as a PRNG, but the idea is interesting to me. I might play around with it and profile it to see how these use cases play out. Implementing a BLAKE3-backed Rust rand::RngCore sounds like a fun little exercise, and would make it easy to profile compared to other PRNGs.
Actually, looking at that trait right now, I see that there are already ChaCha implementations, so the concept is already being exercised in the same family.
Thanks for the explanation. I'm far from a security expert, so more off-the-shelf bits at my disposal means fewer opportunities for me to accidentally implement security vulnerabilities by trying to do it myself.
It's very hard to see Blake3 getting included in FIPS. Meanwhile, SHA256 is. That's probably the biggest deciding factor on whether you want to use it or not.
I dunno, if your crypto choices were just "the best thing that won't be included in FIPS" you would do pretty well; blake3, chacha20, 25519 sigs & dh...
Keccak is my preference. Keccak is substantially easier to implement in hardware: fewer operations and no carry propagation delay issue because there's no addition.
SHA-256 has the advantage that it is used for BitCoin. It is the biggest bug bounty of all time. There see literally billions riding on the security of SHA-256.
It bears mentioning `shasum` is better supported in that it ships with operating systems (macOS, I guess Linux depends on the distro, don’t know about Windows) and is available directly from programming languages (Ruby, Swift, Python, …) without the need for libraries.
Even if BLAKE3 is massively faster, it’s not like I ever noticed SHA256’s supposed slowness. But I do notice its ubiquity.
Based on the article, I would consider switching to BLAKE3 immediately where I use checksums. But until it gets wider support (might be easier with a public domain license instead of the current ones) I can’t really do it because I need to do things with minimal dependencies.
Best of luck to the BLAKE3 team on making their tool more widely available.
SHA256 was based on SHA1 (which is weak). BLAKE was based on ChaCha20, which was based on Salsa20 (which are both strong).
NIST/NSA have repeatedly signaled lack of confidence in SHA256: first by hastily organising the SHA3 contest in the aftermath of Wang's break of SHA1
No: SHA2 lacks the structure the SHA1 attack relies on it (SHA1 has a linear message schedule, which made it possible to work out a differential cryptanalysis attack on it).
Blake's own authors keep saying SHA2 is secure (modulo length extension), but people keep writing stuff like this. Blake3 is a good and interesting choice on the real merits! It doesn't need the elbow throw.