A SHA-1 chosen-prefix collision attack (zdnet.com)
178 points by hansoolo on May 14, 2019 | 71 comments



Their attacks are based on previous chosen-prefix work from Marc Stevens, who tweeted this about the attack [1]:

"Their $100K figure is based on as-of-yet undisclosed improvements. History shows many claims of low-cost SHA-1 attacks that have not stood up to peer review. I am very sceptical that their attack costs in total less than the $110K building block (SHAttered) that they use."

[1]: https://twitter.com/realhashbreaker/status/11282604227868549...


Marc Stevens quotes $500K, which is very much still a threat (even an order of magnitude more would be). Plenty of organizations would be willing to spend that much pocket change on a single attack.

The game-changer is that it's chosen-prefix. A vendor can produce a pair of entirely different binaries with the same hash, and most importantly, they look and behave sanely except for the last few blocks of the file. This is easily hidden, especially if the binary is encrypted.

It's not a stretch of the imagination to see how, for example, an IP camera vendor could do exactly this. Yes, it requires a nefarious/complicit vendor, or an insider who can pull this off undetected (not everyone has a fully automated build/release pipeline).

So it changes the threat model. SHAttered was waved off by many because the threat model didn't convincingly apply to them. Example: git. That analysis needs to be repeated.

(All this assuming the attack described in the paper is correct and practical in real world implementation)


> SHAttered was waved off by many because the threat model didn't convincingly apply to them. Example: git.

Git quickly switched to the sha1collisiondetection library[1] by default after the SHAttered attack was published. It's a SHA-1 library written by the authors of the paper that describes the attack.

Edit: Marc Stevens says that the existing library will mitigate this new attack: https://twitter.com/realhashbreaker/status/11284190295369236...

1. https://github.com/cr-marcstevens/sha1collisiondetection


That still does not solve the issue with OpenPGP signatures though, does it?


It does, because a tag pointing to the malicious content wouldn't hash with sha1collisiondetection's modified SHA-1, just like you can't add the SHAttered PDFs to git.


Git isn't really designed for cryptographic security, is it? I have heard that Linus wants it mostly secure so people can verify the integrity of the Linux source code, but it's not its core competency, so to speak. Though I suppose a project the size of the Linux kernel could be a serious target for a collision attack.

Regardless, it's switching to SHA-256: https://stackoverflow.com/questions/28159071/why-doesnt-git-...


Git signatures are designed for cryptographic security, and they (currently) rely on the SHA-1 hashes.

When you're signing a commit (or a tag) you're just signing the commit (or tag) message, which includes the SHA-1 hash of the relevant "root" tree object (or commit object). The tree object, in turn, is (in effect) a list of SHA-1 hashes of the files directly in the directory, along with their names and permissions, plus the hashes of the tree objects corresponding to any subdirectories.

Consequently, if you replaced a file with another having the same SHA-1 hash (and the same file size — a considerable complication), all the hashes would remain the same and the signature would still be valid.
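The size constraint is not arbitrary: git hashes every object together with a header that contains its length. A minimal Python sketch of the blob case (the function name is mine):

  import hashlib

  def git_blob_sha1(data: bytes) -> str:
      # Git hashes "blob <size>\0" + contents, so an object's length is
      # baked into its hash.
      return hashlib.sha1((b"blob %d\x00" % len(data)) + data).hexdigest()

  print(git_blob_sha1(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a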

Obviously, once git transitions to SHA-256, the problem will disappear.


An easy example with git would be to create a pair of read-only repositories: one public-facing which is cloned by the general public, and one with (potentially entirely) different contents which can be selectively pulled depending on the client.

There's a complication with the few appended trailing blocks being invalid data, but the format might allow it, and git doesn't verify its integrity recursively.


> git doesn't verify its integrity recursively

Yes it does. On clone the entire history is recursively hashed, and incrementally on fetch.

There isn't even any place in the protocol to transfer pre-hashed content; it must be hashed to make it addressable.


> Git isn't really designed for cryptographic security, is it?

Well, the git documentation says it is: https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work

It is certainly not helpful that Linus at the same time says conflicting things publicly. It would be nice to have some clearly documented expected security properties of the git structure.

This confusion is all a bit unfortunate. While the attack scenarios are obscure, with a secure hash function Git would have some really nice properties to use it in other areas, it would effectively be a secure append-only log. (Some people call this something with the B word which I'll avoid, but that's effectively what it is.)


FWIW, this isn't official Git documentation, just a book about Git.


“Everyone should switch to (in order of preference): • BLAKE2b / BLAKE2s • SHA-512/256 • …”

You know, SHA-512/256 was a terrible name. For someone who’s not a cryptographer, it’s way too easy to confuse the single algorithm SHA-512/256, which resists length extension attacks, with the pair of algorithms SHA-512 / SHA-256, which do not.


"truncated sha512" would be a much better way to talk about it.


It isn't exactly truncated SHA-512 — that is, you can't compute a SHA-512/256 hash by computing a SHA-512 hash and then truncating it. Although the algorithm is the same as doing that, the initial state of the hash context is different.

(The same is true for SHA-224, which could be called SHA-256/224; it's a truncated SHA-256, but with a different internal state.)
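You can check this directly, e.g. in Python, assuming your interpreter's OpenSSL build exposes sha512_256 (recent ones generally do):

  import hashlib

  data = b"hello"
  truncated = hashlib.sha512(data).hexdigest()[:64]    # first 256 bits of SHA-512
  real = hashlib.new("sha512_256", data).hexdigest()   # the actual SHA-512/256
  print(truncated == real)  # False: different IVs, different digests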


True, that’s a good point.


Damn, I knew this and I'd forgotten it. Thanks for pointing it out.


That just goes to show how bad the name is. It doesn't just trick you before you learn the distinction; it keeps on giving!


Can collision attacks produce data of the same size? I suspect it would be dramatically more difficult to produce a collision with the same data size as the original.

So perhaps, the easiest way to defend against a collision attack is to transmit the size of the data alongside the checksum. Like a checksum, it is extremely lightweight and easy to check.

In my data framework project (where I use sha1 to identify blocks), I've been looking for an excuse to do this, because knowing size of a block is incredibly useful for applying performance heuristics. I suspect it is also a significant defense against collision attacks. Am I wrong?
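Something like the following toy sketch is what I have in mind (the names are made up, and it's just the idea, not hardened code):

  import hashlib

  def tag(data: bytes) -> str:
      # Pair the length with the digest; verification must check both.
      return "%d-%s" % (len(data), hashlib.sha1(data).hexdigest())

  def verify(data: bytes, expected: str) -> bool:
      return tag(data) == expected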


I included this comment in a response to a child comment, but reproducing it here to make sure you see it:

If your output includes the size of the input, then what you've got is no longer a hash function. A hash function is defined as taking an arbitrary input and returning an n-bit output for some fixed value of n. By including the size of the input in your output, the size of your output grows logarithmically with the size of your input. This may seem pedantic, but fixed-size output and arbitrary-size input are extremely important for general usage of a hash function.


SHA-1 and SHA-256 already include the message length in the padding that gets hashed, and that length field is limited to 2^64 bits, so they do not support truly arbitrary input sizes.
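Concretely, the final padding block encodes the message length in bits, which is where the 2^64 limit comes from. A sketch of just the SHA-256 padding rule (not a full implementation):

  def sha256_padding(message_len_bytes: int) -> bytes:
      # Append 0x80, then zero bytes, then the message length in bits as
      # a 64-bit big-endian integer, so the padded size is a multiple of
      # 64 bytes. The 8-byte length field is what caps input at 2^64 bits.
      zeros = (55 - message_len_bytes) % 64
      return b"\x80" + b"\x00" * zeros + (message_len_bytes * 8).to_bytes(8, "big")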


For what it's worth, SHAttered specifically altered a small part of an existing file. Perhaps with chosen-prefix the story is different.

https://shattered.io (check out the two pdfs)


I forgot to specifically state - I think the two pdfs are the same size.


(This is a chosen-prefix attack; there is no "the original".)

In all the attacks on MD5 or SHA-1 I'm aware of, the generated colliding strings had the same length.


So SHAttered was "give me two different documents with the same sha1, I don't care what they are". I guess according to Wikipedia that's a "classical" collision.

Then there's "here's two different documents. Append to both of them until the results have the same hash". This is "chosen prefix" collision.

My question is, is there a name for an attack that says "here's a document. give me another document with the same hash."? (in which case, "the original" would have meaning).


It's called "second preimage attack".


Or you could just switch hash algorithms to something stronger.


Yes, if I interpret your suggestion correctly. How would you know that the attacker has not manipulated the size parameter?

That's the best case. Worst case you end up with a memory vulnerability (see Heartbleed: https://xkcd.com/1354/).


> How would you know that the attacker has not manipulated the size parameter?

This scenario doesn't make a lot of sense. Say I have a goodfile that is 512 bytes long and hashes to 3d8850e1, and someone else wants to produce badfile and convince you that it's my goodfile. GP's suggestion is that I publish a size-plus-hash value "512-3d8850e1" for you to check against. If the attacker is in a position to alter the size part, they're also in a position to alter the hash part, in which case why even bother with a collision? They can just change the hash to be whatever badfile hashes to.

The true answer to GP is that if you do this, it's no longer a hash function. A hash function is defined as taking an arbitrary input and returning an n-bit output for some fixed value of n. By including the size of the input in your output, the size of your output grows logarithmically with the size of your input. This may seem pedantic, but fixed-size output and arbitrary-size input are extremely important for general usage of a hash function.


From the paper, this doesn't seem to be able to create a collision while retaining the same length of input data.

It seems that checking both the hash and input length would be a very cheap way of identifying attempts at hash collisions.


If you're going to modify your code to add an extra check you might as well just switch to another algorithm entirely.


A lot of existing schemes already check both.

The above comment didn't mention modification explicitly and could be read as talking about the level of threat this poses.


This is one reason why HMAC-SHA1 is still safe.


This is incorrect; the safety of HMAC-SHA1 doesn't have anything to do with input-length comparisons. HMAC-SHA1 is still safe because of how an HMAC operates:

Among other operations, HMAC begins by taking the secret key, XOR'ing it with a magic value not under your control, and using this as the first block when calculating an initial hash. In order to guard against an unlikely but potential pathological key / magic value combination, a similar operation is performed as a second round using a different magic value, and this time operating over the hash output from the first round. As such, HMAC operations are safe against chosen prefix attacks against the underlying hash function, because the first block in either round of hashing is entirely outside of your control.

See https://i.imgur.com/PPlVPr0.png for a visual reference. In this diagram, Y is the value being HMAC'ed. As you can see, any attack on the hash function which requires control of the prefix of the value being hashed is a non-starter.
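If it helps, here is the same two-round structure written out as a minimal Python sketch (real code should just call the standard hmac module):

  import hashlib, hmac

  def hmac_sha1(key: bytes, msg: bytes) -> bytes:
      # From-scratch sketch of the two-round construction described above.
      block_size = 64                         # SHA-1 block size in bytes
      if len(key) > block_size:
          key = hashlib.sha1(key).digest()
      key = key.ljust(block_size, b"\x00")
      ipad = bytes(b ^ 0x36 for b in key)     # inner "magic value"
      opad = bytes(b ^ 0x5C for b in key)     # outer "magic value"
      inner = hashlib.sha1(ipad + msg).digest()
      return hashlib.sha1(opad + inner).digest()

  # Matches the standard library:
  assert hmac_sha1(b"key", b"msg") == hmac.new(b"key", b"msg", hashlib.sha1).digest()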


Thanks, I stand corrected.


Where did you read this? A chosen-prefix attack that produces strings of different lengths would be rather unusual.




Last night I was pondering about future "compression" schemes that relied on hyper-powerful quantum computers that can resolve hash collisions very, very quickly, so that rather than literally compressing the data, you'd just share a hash and then this hyper-powerful computer will enumerate possible strings of data that give that hash, and then check them against some secondary condition (another hash?) to find the right one.

I don't know how feasible this actually is, but it made for an interesting sci-fi thought.

edit: Thanks for all of the enlightening replies!


There are infinitely many strings that produce a given hash. Infinitely many of them will also produce a secondary hash that you check against. There is no general-purpose scheme to compress arbitrary data, regardless of available computational power, because if you can compress all length-N plaintext strings then you must end up with at least 2^N distinct compressed strings which will take on average at least N bits of storage.


The problem is the sheer number of strings that map to each hash.

Each SHA-256 hash has, on average, 2^(8 * 1024 - 256) ≈ 10^2389 possible 1 KB preimages.

So... perhaps it could be done, but the quantum computer would have to be programmed with very good selection criteria.
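(For what it's worth, the arithmetic above checks out:)

  from math import log10

  preimage_bits = 8 * 1024 - 256           # bits per 1 KB input minus 256-bit digest
  print(round(preimage_bits * log10(2)))   # 2389, i.e. ~10^2389 preimages per hash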


No, there is no way this could be done, because there is no way to know which of multiple colliding inputs was the right one. Imagine a one-bit hash function. You start your “decompression” process and read in a 0. What input produced that bit? Literally fifty percent of all possible strings would produce that same output. Without more information, you cannot choose between them. And the information needed to choose correctly is exactly the same amount of information in the original input.

https://en.m.wikipedia.org/wiki/Pigeonhole_principle
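To make the one-bit example concrete, a toy sketch:

  from collections import Counter

  # A toy 1-bit "hash": the parity of the input's bits. Half of all
  # 3-bit inputs land on each output value, so the output alone cannot
  # tell you which input produced it.
  def one_bit_hash(x: int) -> int:
      return bin(x).count("1") % 2

  print(Counter(one_bit_hash(x) for x in range(8)))  # Counter({0: 4, 1: 4})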


Yup, the pigeonhole principle really defines a hard limit on compression, in that it is impossible to compress (make smaller) ALL files.

The way compression algorithms get around that is by exploiting the fact that we rarely want to send around arbitrary data: real-world data has lots of redundancy in it, so we can make algorithms where real-world data gets mapped to compressed files smaller than the original (by finding the redundancies in the data), and purely random data gets mapped to files that are actually slightly longer than the input.


Often rendered as the pithy phrase "A compression algorithm is actually an expansion algorithm with some interesting failure cases."


IIRC the best quantum attack against SHA256 clocked in at about 2^85 cost. It's still not practical for most modern hash functions.


2^85 is for finding a collision; finding a preimage of a specific hash would take 2^128 steps of Grover's algorithm. That's thought to be beyond any possible future computer system.


Either way, 85 bits of post-quantum security makes me feel safe.


It may turn out that finding those additional 'hint' bits ("another hash") that dramatically reduce the search space of your computer for a given input is NP hard.


It must be about time for SHA-4.


Nope. SHA-2 (known to developers as SHA224, SHA256, SHA384, and SHA512) was the replacement for SHA-1. SHA-3 was created as an insurance policy in case the SHA-2 family was broken too. So far, it hasn't been.

We won't need a SHA-4 any time soon. SHA-2 is fine, BLAKE2 is fine (and faster), SHA-3 is fine.


Also, SHA-2 underpins Bitcoin; it's the ultimate billion dollar pot of gold. SHA-2 is perhaps the most exhaustively researched (both publicly and privately) cryptographic hash because of this.


Do you know why it's said to be last-resort in the article?


SHA-512/256 is the second resort, and it’s the 512 bit variant of SHA-2 truncated to 256 bits (SHA-384 is similar). Generally any SHA followed by a relatively large number is actually a variant of SHA-2.


Catalin was quoting me in the article, so it's only fair that I elaborate here.

There are four common flavors of the SHA2 family you're likely to run into:

  - SHA-224
  - SHA-256
  - SHA-384
  - SHA-512
And then there are two more variants of "truncated SHA-512" (except they also use different initialization vectors than SHA-512, which is kind of an important detail)

  - SHA-512/256
  - SHA-512/224
These latter two don't have nearly the cross-platform support that the first four do.

For example, in PHP, hash('sha256', 'some text') has worked since PHP 5.1.2 (without PECL), but hash('sha512/256', 'some text') didn't work until PHP 7.1.0 (which is only a few years old).

See for yourself: https://3v4l.org/A1dZc

As a typical software developer, you might see SHA-{magic/numbers} and probably discover from Google/StackOverflow/etc. that they're in the SHA2 family and that the SHA2 family is secure, and then reason that whatever you're doing must also be secure.

But there's a problem.

SHA-256 and SHA-512 (not the truncated varieties) are known to be vulnerable to length extension attacks. This is only a problem if you're using these hash functions in a vulnerable way. (Which isn't as uncommon as you'd think in homebrew crypto.)
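The classic vulnerable pattern is the naive "secret-prefix MAC" below (a sketch of the anti-pattern, not something to ship):

  import hashlib

  # Anti-pattern: because SHA-256's output is its full internal state, an
  # attacker who knows len(secret) can extend msg (plus the original
  # padding) and compute a valid tag for the longer message without ever
  # learning the secret.
  def naive_mac(secret: bytes, msg: bytes) -> str:
      return hashlib.sha256(secret + msg).hexdigest()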

If you're using HMAC, length extension attacks are a moot point. There's a reason 'tptacek always recommends HMAC for symmetric authentication.

SHA-224, SHA-384, SHA-512/224, and SHA-512/256 are not vulnerable to length extension attacks.

BLAKE2 is not vulnerable to length-extension attacks. It's also at least as secure as SHA2, but faster than MD5. SHA3 is at least as secure as BLAKE2, but is significantly slower in software.
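As an aside, BLAKE2 also has a built-in keyed mode, so (for example, in Python's hashlib) you can get a length-extension-proof MAC without the HMAC wrapper at all:

  import hashlib

  # Keyed BLAKE2b: a MAC that is immune to length extension by design.
  tag = hashlib.blake2b(b"message", key=b"secret key", digest_size=32).hexdigest()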

Thus, the recommendations are:

  - BLAKE2
    for speed and security
  
  - SHA-512/256
    if you want speed, length-extension attack resistance, and arbitrary standards compliance
  
  - SHA3-256
    if you care more about security and what the FIPS authors think than speed
  
  - SHA-384 
    if you're concerned about backwards compatibility but don't want to accidentally cause
    junior developers that poorly mimic your designs to introduce LEAs into their code
  
  - Any other SHA-2 family hash function 
    if you're not interested in all of this nuance and want something
    guaranteed to be secure for the next few years that is widely implemented
Of course, there should be a huge asterisk with this list: If you're not a crypto expert, you shouldn't be making this decision. Just stop using SHA1.

It's also worth noting that Marc Stevens-- hash function breaker extraordinaire-- disagrees with my recommendation because BLAKE2 isn't a {NIST,FIPS,ISO,whateverStandardsBodyYouTrust}-approved hash function, and for better or worse, believes strongly in reinforcing public trust in standards organizations. https://twitter.com/realhashbreaker/status/11283816001468948...

He has a point in general, but in this specific case, I think BLAKE2 is going to become the de jure SHA2 successor at least until SHA3 hardware acceleration becomes ubiquitous.

Also, standards bodies have a nasty habit of digging in their heels on their mistakes, instead of issuing new guidance in response to research. See also: WPA3 and Dragonfly vs SPAKE2-EE or OPAQUE.

For a higher-level example, look at the failures baked into the JOSE standards (JWT, etc.) versus PASETO.

Fun fact: If you take JOSE and replace JSON with CBOR, without fixing any of the protocol security problems, you get COSE... which reared its ugly head in W3C's WebAuthn standard. Bad standards never die.

Until we get a standards committee that isn't garbage, I don't entirely agree with Marc's appeal to faith here.

While NIST et al. certainly do a better job at deciding on primitives than self-styled post-2010 cypherpunks (y'know, the ones that try to cascade a bunch of ciphers together in case one is broken but then use CRC32 for mixing files into the encryption key?), their failure to correct (let alone learn from) their mistakes and update recommendations in a timely manner is a problem that can't be dealt with through blind adherence to whatever they publish.


> I think BLAKE2 is going to become the de jure SHA2 successor at least until SHA3 hardware acceleration becomes ubiquitous.

I think you mean "de facto", unless you think NIST is going to amend SHA-3 at this point to designate BLAKE over Keccak.

As someone on the sidelines, having read the linked Twitter discussion you had with Marc Stevens your summary of it honestly seems a bit disingenuous.

You mention speed prominently, but fail to mention his counterargument that raw hash speed isn't relevant for most applications.

In practice hashing speed is drowned out by other things; you're not going to have "Android/iOS devices" (as you bring up) hashing GBs of data as a common use case, and even if you did, the cycles/byte for the hash are nothing compared to everything else.

For applications where hashing speed does matter (e.g. some server needing to batch-hash things) you have the option of buying hardware-accelerated SHA-256 to get on the order of 30% faster than BLAKE: https://bench.cr.yp.to/results-hash.html

Then as you note downthread your criteria of "at least as secure" only takes into account "known attacks". The absurd logical conclusion of that criteria taken at face value is that we'd all be better off if we each used our own bespoke hash function, since cryptanalysis would never be able to keep up.

Or, in other words, if the algorithm that became SHA-1 hadn't been picked by NIST in 1995 it would be a viable 160-bit hash function today, since there would likely be no known attacks against it, as it would have been obscure enough that nobody would have bothered with it.

So the criteria for a "secure" hash must consider some balance of its algorithm, as well as (or more importantly) the cumulative amount of attention cryptographers have spent on it.


> The absurd logical conclusion of that criteria taken at face value is that we'd all be better off if we each used our own bespoke hash function, since cryptanalysis would never be able to keep up.

The reason this "logical conclusion" sounds so absurd is that your reasoning here is absurd.

What matters isn't "cryptanalysts haven't published an attack", what matters is "cryptanalysts tried and failed to attack the design".

If you conflate the two, you will end up confused.


A reductio ad absurdum is always going to be absurd if taken at face value, but that doesn't mean the underlying point it's making isn't valid.

Obviously I'm not saying that BLAKE hasn't gotten any analysis. It was a SHA-3 finalist, it's been looked at by a lot of smart people.

I am saying it's a continuum, and it's safe to assume SHA-2 has gotten a lot more eyeballs & man hours by now. Both due to its age, and due to its widespread production use as a NIST standard.

Is that a trite point? Yes, but one you managed to omit in your long summary of a Twitter exchange with a SHA-1 researcher, which is why I'm pointing it out. You're seemingly treating "analyzed" as a boolean property.


> Is that a trite point? Yes

The next question I would ask myself in this context is, "Should I have even raised the trite point in the first place?"


It's trite because it should go without saying, but one you managed to selectively omit when summarizing a Twitter discussion. That makes you seem disingenuous.

It's clear from anyone who reads that Twitter discussion that the thrust of Marc's point is not that we should be "reinforcing public trust in standards". To summarize the discussion in those terms amounts to attacking a strawman.


> the thrust of Marc's point is not that we should be "reinforcing public trust in standards"

It isn't?

https://twitter.com/realhashbreaker/status/11283305376455639...

https://twitter.com/realhashbreaker/status/11283322027758632...

https://twitter.com/realhashbreaker/status/11283784887324385...

"Marc Stevens wants to reinforce the public trust in standards, especially among non-experts" isn't an uncharitable summary of this conversation.

> That makes you seem disingenuous.

If I'm mistaken about the point he's defending, it's not an act of dishonesty.

Given that he at multiple points agreed with my arguments that standards bodies make dumb mistakes and yet continued to double down on "we should just follow standards anyway" without caveat, there aren't many alternative interpretations that I'm aware of.

If I still seem disingenuous, it might be that you want me to seem that way. In which case, there's no point in either of us continuing to participate because it has ceased to be about security and instead has become a discussion of ego, which I'm frankly uninterested in.


Actually, I did add caveats and do recognize that not all standards are equally important. In the tweets you link I'm talking about secure standards, as in standards that are actually considered secure.

But my point was not "just follow standards anyway". This tweet clearly shows that: https://twitter.com/realhashbreaker/status/11283830292552581...

Rather, that it is better to work through standards as a community, as opposed to your way, where individuals promote their pet primitives. Standards serve a role as a focal point: they have the benefit of being big targets, getting a lot of scrutiny, and making it easier for non-experts to find out if somebody has found security issues.


Okay, then let's work together to get better standards than the ones currently available.


I'm one of the co-authors of the SHA-1 cryptanalysis paper. I fully agree with Marc Stevens: you should not be using BLAKE2, but SHA-2 or SHA-3. If speed is an issue, a reduced-round SHA-3 variant (KangarooTwelve) will do the job.


Has KangarooTwelve received more analysis than BLAKE2? Or are both recommendations morally equivalent in the context of standardization and confidence in security margins?

To be clear: I agree with Marc in general, but there are exceptions where I disagree with Marc's "only use {FIPS,NIST,etc.}-approved crypto":

* I prefer Ed25519 over foot-bullety ECDSA (i.e. before RFC 6979).

* I prefer to never use RSA at all.

* I prefer XChaCha20-Poly1305 over AES-GCM, since the former is always constant-time without requiring specialized hardware and you have a negligible chance of nonce misuse even after an absurd number of messages.

* I prefer Argon2id and scrypt over PBKDF2 for key stretching.

* I prefer Argon2id and bcrypt over PBKDF2 for password storage.

* I prefer BLAKE2 over SHA-256, especially when length extension attacks are within the threat model of the protocol being discussed. SHA3 is good. Most SHA2 hash functions are fine (as long as they're not being used stupidly). But BLAKE2 is not only faster, thanks to libsodium, it's more readily available to developers than SHA3 (which, like AES, is only performant with specialized hardware circuits in software that takes advantage of said circuits).

None of the algorithms I've mentioned above are random pet projects by hobbyists.

To resolve this impasse, what needs to happen is: Standards bodies (NIST, FIPS, ISO, etc.) need to stop digging their heels in on sunk-cost fallacies and give at least some of these algorithms a fair consideration. Then one of two things will follow:

1. They'll be found to be as secure as the cryptographers who have studied them already believe, and therefore appropriate for standardization.

2. New attacks will be found, and the state of the art can be moved forward.

And this trend of cryptography towards requiring specialized hardware to be fast and secure? I oppose it wholeheartedly.


What do you mean by "at least as secure"?

They are different algorithms, so I don't see how their security can be provably related.


Saying "BLAKE2 is at least as secure as SHA-256" is saying that:

1. The best known attacks against SHA-256 have a cost of 2^i for some value of i.

2. The best known attacks against BLAKE2 have a cost of 2^j for some value of j.

3. j >= i


Ahh so it's because of length extension attacks. Thanks!


deleted


SHA-2 is used in many places in Bitcoin. Its collision resistance is used too, e.g. in Merkle trees.


Is there any reason not to just use SHA-3? It sounds like it's a real swiss army knife to hear the authors talk about it.


Right now it has worse library support, worse hardware support, and isn't analysed as thoroughly as SHA-512/SHA-256. Also, SHA-3 seems to be fighting with BLAKE2 for popularity; it's not immediately clear to me that SHA-3 will be popular in 10 years (a point for usage in API design, etc.).

But those are temporary problems that might not even matter to you.


There are reasons to prefer other functions, but none of those are disqualifying marks on SHA-3.

If you have SHA-3 available, just use it. Everything listed in that article is secure today and will probably be secure 10 years from now.



