From the same thread (Peter Gutmann, Fri Feb 24 00:42:36 EST 2017):
"After sitting through an endless flood of headless-chicken messages on multiple media about SHA-1 being fatally broken, I thought I'd do a quick writeup about what this actually means. In short:
Reports of SHA-1's demise are considerably exaggerated.
What CWI/Google have done is confirmed what we've known for a long time, that SHA-1 is shaky. Using a nation-state's worth of resources and a year of time (https://security.googleblog.com/2017/02/announcing-first-sha...), they've shown that, with a very carefully-crafted document, you can create a collision. Their presentation of the results is detailed and accurate, it's the panicked misinterpretation of those results that are the problem."
$110,000 is not "a nation-state's worth of resources". I agree with the rest though: the sky is not falling, but people shouldn't react to baseless alarmist claims with baseless overconfident claims.
The implied meaning might have been "a significant item in a nation-state's cyber attack budget"? I'm pretty sure they did not mean "the total budget of a nation-state" or anything of the sort, since that's obviously wrong.
One has to agree, an entity willing to drop a cool $100k on finding a single SHA1 collision to try and attack your git repo is a lot closer to nation-state level than the for-the-lulz level.
Why do people always say 'nation-state' specifically in these cases, as well? Some of the richest states in the world aren't nation-states, like the UK.
I imagine because what they actually mean (state) gets ambiguous and confusing because of the united states, which are not really states in the same sense.
$100k isn't that much money, particularly since collisions can be reused for multiple attacks w/ length-extension. Heck, Bitcoin has had (ineffective) spam attacks that have probably cost around that much, and there's good reason to suspect they've been privately funded by angry trolls.
There's a lot of people for whom $100k is "fuck you" money.
"fuck you money" is something different - it's the amount of wealth you need (varying per individual) where you can comfortably say "fuck you" to a particular job or opportunity or proposal someone makes to you if you don't want to do it. I believe the term you're looking for is something like "chump change"
I have seen it used in that sense (to mean the same as chump change) in LinkedIn articles by random recruiters, so I guess it will suffer the fate of literally vs. figuratively. Terrible... but use dictates meaning, if it goes mainstream.
Can you give an example? "Fuck you money" is pretty literal already: the money required to be able to say "fuck you." I can't see how it can make any sense in any other context.
"For a single malicious C file in the linux kernel, once"
(My understanding of the method is it might be extendable to modifying a comment mid-file and then introducing later code, instead of modifying a JPG inside a PDF)
It cannot. The Google implementation must effectively be done on a blob as the result would not be usable in a structure specific document. What is more, things that require a block chain (like git) are NOT covered with this current attack as both the source and resulting have to be worked on.
Currently the attack vector only works when you can get both documents to "work towards each other" to produce a valid identical SHA1 value.
This was a fixed prefix collision attack. That means they can make two documents (P | A | anything) and (P | B | anything) for a fixed P, and they can find A, B, such that A and B are different but
SHA1(P | A | anything) = SHA1(P | B | anything)
The Merkle-Damgård construction (used in MD4, MD5, SHA1 and SHA2 but not in SHA3 and some other modern hashes) invariably means length extension is possible, if you can collide two documents then you can add a suffix to both and also get a collision.
This is how there's already a web site where you feed it images and it makes a "different" colliding PDF, it's just using Google's result with a different suffix after the 128-byte collision near the start.
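To make the "add a suffix to both and still collide" point concrete, here's a minimal sketch on a toy hash. Everything in it is an assumption for illustration: `toy_md`, the 8-byte block and the 16-bit chaining state are made up so a collision can be brute-forced in a fraction of a second. It is not SHA-1 and not the CWI/Google technique, it only shares the Merkle-Damgård shape.

```python
import hashlib
from itertools import count

BLOCK = 8            # toy block size in bytes (SHA-1 uses 64)
STATE = 2            # toy chaining state in bytes (SHA-1 uses 20)
IV = b"\x00" * STATE


def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function: truncated SHA-256 of state + block."""
    return hashlib.sha256(state + block).digest()[:STATE]


def chain(state: bytes, data: bytes) -> bytes:
    """Feed block-aligned data through the compression function."""
    assert len(data) % BLOCK == 0
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state


def toy_md(message: bytes) -> bytes:
    """Merkle-Damgard construction with length padding."""
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 8) % BLOCK)
    padded += len(message).to_bytes(8, "big")
    return chain(IV, padded)


def find_collision(prefix: bytes):
    """Brute-force blocks A != B with the same state after P|A and P|B."""
    mid = chain(IV, prefix)
    seen = {}
    for n in count():
        block = n.to_bytes(BLOCK, "big")
        state = compress(mid, block)
        if state in seen:
            return seen[state], block
        seen[state] = block


prefix = b"P" * BLOCK
a, b = find_collision(prefix)
# Once the internal states match, any shared suffix keeps the hashes equal.
for suffix in (b"", b" anything", b" a much longer shared suffix"):
    assert toy_md(prefix + a + suffix) == toy_md(prefix + b + suffix)
```

The collision lives in the internal state after P | A and P | B. Both messages have the same length, so the padding is identical too, and whatever comes afterwards is processed identically on both sides.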
I think the initial r&d to get to this point is more along the lines of a nation-state investment. Google paid much more than $110k to get this working. It's not clear exactly how much it would cost to "weaponize", either.
That was just SHA1. Linus mentioned the other day that there was another layer and that they weren't worried. It would take considerably more resources to crack that again. But it is rather jolly to speculate about such things and other users of SHA1 (Windows?) might not nearly be so immune?
Is that the cost for just the compute resources assuming time from people with expertise is free? Or setting up the resources to have a stable of people with the right background... Once you have that, then yes maybe it's 100k.
It's worth noting that the figure of $110000 was not mentioned in the referenced message, so probably Peter Gutmann was thinking at a different scale when he wrote "a nation-state's worth of resources".
Security is quite often about the amount of money you have to put in to get something or somebody hacked.
110,000 USD is in the ballpark of state level players when we are talking about forging documents to avoid any sort of tampering detection. It is of practically zero use to small-time hackers or script kiddies. Why would anybody invest 110K into a collision? What is the practical use of it?
The thing people fear is (1) A collision that lets you have good code pass review, then have evil code released to users; (2) That happening to Linux/Android/Firefox/Chrome; (3) The cost of creating a remote code execution exploit being lower than the market value of that exploit on the black market.
I don't know how /realistic/ this fear is. Certainly, if everyone PGP signs all their commits, it's a much-reduced risk - but how many projects mandate that?
or some less scrutinised but widely deployed package
There are many low cost ways of doing "$100k worth of AWS" computation. E.g. botnets, distributed volunteering, moonlight use of employers' idle servers etc etc.
Also: that cost is certain to drop, and it might drop quite quickly - simply due to software and hardware improvements. If anything algorithmic shows up, it could change dramatically. Let's not wait for that to happen.
You'd ideally want to do this with a binary blob (firmware or graphics driver, because you know there's one sitting in git somewhere). Then, how is anyone going to know the difference?
> Why would anybody invest 110K into a collision? What is the practical use of it?
Suppose you are on the verge of completing a major sale to some large, nervous purchaser -- perhaps a major world military. This is a decent-sized but not huge sale: $2 billion, with profits of around $200 million. The other major competitor for this contract is built around Linux and your offering relies on a custom operating system.
Your head of sales thinks that the purchasing agent seems particularly concerned about security issues with the operating system -- keeps asking questions like "So, can you document that your system is less vulnerable than some 'open source' system?". The head of sales makes a rough guess that a news story about vulnerabilities in Linux might sway the chance of winning the contract by around 5%.
So: that's $10 million in value to your company that might be created by generating publicity about the vulnerability of Git, so long as that publicity is generated at the right moment in time. What's the chance that 1% of that can be "found" to make it happen?
The thing is: $110,000 is actually a very SMALL amount of money, relative to the amounts of money that many influential people manage on a daily basis. The use doesn't have to be very practical for it to be well worth it.
Patches in Linux are reviewed by multiple people before merging. Even if you create a collision and submit a patch, you cannot really do much without write access to the repo. It is even more difficult because the person merging the patch will not fast-forward in most cases.
This attack still does not allow inserting arbitrary data in arbitrary places, which is what an attack on Linux would need. Finally, SHA1 in git also takes size into consideration, which makes this attack even more expensive[2].
People should really chill out. There are cheaper attack vectors than collisions.
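For anyone wondering what "takes size into consideration" looks like in practice, here's a small sketch of how git derives a blob ID: it hashes a header containing the object type and the content length, then the content. `git_blob_id` is just a name I picked for illustration. Whether this makes an attack meaningfully more expensive is the parent's claim, not something this snippet proves; it only shows that a collision computed on raw files doesn't carry over as-is.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 over 'blob <size>\\0' followed by the content, as git does for blobs."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


data = b"hello world\n"
print(git_blob_id(data))                 # same ID as `git hash-object` on this content
print(hashlib.sha1(data).hexdigest())    # plain SHA-1 of the file, a different value
```

Because the header changes the prefix being hashed, two files that collide under plain SHA-1 will generally not collide as git blobs unless the attacker computed the collision with the header in mind.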
Notice that the attack I described does not require actually merging in the patch, it only requires that news stories be written about how there might be such a vulnerability.
I am NOT implying that this is anything but hypothetical. I have absolutely no reason to believe that anything like this has been attempted. I'm just trying to point out that for many out there, $100K is chump change.
If there's no practical use for it then even state level players won't bother with it.
If there's ever a practical use for it (i.e. money to be made) $110k is totally accessible to the private sector. It's definitely not "a nation-state's worth of resources" which is the quote I was replying to.
Fortunately there doesn't appear to be a whole lot of practical use for these collisions for the time being.
In other words, SHA-1 is still nowhere near as insecure as MD5, for which collisions can be generated in seconds on hardware everyone already has.
I feel like this still ignores most of what Linus said on why git isn't broken. In particular "it's fairly trivial to detect the fingerprints of using this attack" in his Google+ post. https://plus.google.com/+LinusTorvalds/posts/7tp2gYWQugL
And there are already patches on the mailing list for that.
I think the fingerprint argument is pretty weak actually. There is still a lot of unreadable content in git repos, including binary blobs in the kernel.
You don't understand the fingerprint argument. For the specific SHA-1 attack, it's possible to detect, while calculating the SHA-1 hash of an object, whether the bit pattern indicative of this specific attack is present. This is done automatically, without needing any human intervention. This is one of the things which Google released immediately as part of their announcement.
The other thing which people seem to miss is that it requires 6,500 years of CPU computation for the _first_ phase of the SHA1 attack, and 110 years of GPU computation for the _second_ phase of the attack. You need to do both phases in order to successfully carry out this attack. And even if you do, Google released code so that someone can easily tell if the object they were hashing was one created using this particular attack, which required 6,500 CPU-years + 110 GPU-years of computation.
But alas, it's a lot more fun to run around screaming that the sky is falling.....
Thanks, I was wrong when saying "fingerprinting". The fingerprinting technique is actually quite reassuring. I was thinking of where he says
"But if you use git for source control like in the kernel, the stuff you really care about is source code, which is very much a transparent medium. If somebody inserts random odd generated crud in the middle of your source code, you will absolutely notice. " , which I still think is a very weak argument.
It might or might not be true for any particular developer, and his argument does not refute the claim that the SHA1 integrity checks for that code are being rendered useless. I specifically recall that Linus previously described the hashed chain of commits as something which would prevent malicious insertion of code. And this has now, at least to some degree, been compromised.
He did provide some solid countermeasures and migration plans, but I think he could have given more acknowledgment to all the people who predicted this attack. It would have been a good idea to prepare for changing the hash function eventually.
keep in mind you have to maintain/commit the initial blob and then later the malicious one (again and again, this is no pre-image attack - the initial blob has to have a well designed place with random jazz ready to be replaced)
You could just place a malicious one from the get go and no one would know (or they would know just as much -- blobs do rely on virtually unconditional trust)
True. But I thought that the point of the hashes was to ensure that something which you had already verified (through review or testing or whatever) could not be tampered with without the changes being brought to your attention. And this property no longer holds.
Yeah, but in your case you would just get the binary, verify it and push it yourself.
If you're using some weird way of getting a binary that you have already verified, but that could somehow differ, and you're hoping that git will catch the difference, you're doing it wrong to begin with.
"But if you use git for source control like in the kernel, the stuff you really care about is source code, which is very much a transparent medium. If somebody inserts random odd generated crud in the middle of your source code, you will absolutely notice.
"
If you're working on such a massively important git repo with very poor security measures and trust levels at $200k break in status that are practical... yeah, maybe bigger problems.
It's possible to go in and replace the hash algorithm with something else, which none of these "git is going to ruin everything with not replacing SHA1 this instant!" people seem to bother with, to prove their points, instead of endless posturing.
http://stackoverflow.com/a/34599081 has actually gone about doing it, but it has been over a year since that, and as Linus says, there have been multiple collision mitigations added as well, so tests should probably be re-done
My git with different hash would be useless. It wouldn't be able to interact with github, bitbucket, or pull/push to anyone else's repositories. I may as well rename the package.
Fixing this is going to require breaking backwards compatibility with every program that works with git -- it's going to be a huge undertaking, because early in git's design they didn't support multiple hash functions.
The forged hash collision will always be a weak point. Things will get worse with cost reduction of processing power. It's a losing battle.
The only way to detect without error that two files are identical is by comparing the files. This comparison can be sped up by comparing compressed versions of the files.
The other functionality of hashes is to build a presumably unique file identifier. The byte sequence of the compressed file could serve as identifier.
So instead of using the file system as index with the sha1 name as file name, we would have to build a specific database organized as a set whose values (compressed files) would be the keys. A hash index could be used to speed up the search and equality test. Here a very fast hash would do the trick. Sha1 or a faster hash would be ok. The file system could then be used to organize the hash buckets as git does.
File comparison would of course first compare compressed and uncompressed file size. Or use other hashes or longer hash values to detect different files. When all these values are identical, then a file comparison must be performed to detect if we have a collision.
File compression can only get better and faster.
So basically git would only need to add hash collision detection and the capacity to support different objects with the same hash identifier.
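A rough sketch of what that could look like, purely to illustrate the idea and not how git's object store works: each hash maps to a bucket of compressed contents, and a byte comparison decides whether an incoming file is already present or is a genuine collision. `CollisionAwareStore` and its `(hash, index)` identifiers are names and choices I made up for the example.

```python
import hashlib
import zlib


class CollisionAwareStore:
    """Hash-indexed store that keeps colliding objects side by side."""

    def __init__(self, hash_fn=lambda data: hashlib.sha1(data).hexdigest()):
        self._hash = hash_fn
        self._buckets = {}  # hash -> list of compressed contents

    def put(self, data: bytes):
        """Store data and return (hash, index within the hash bucket)."""
        key = self._hash(data)
        packed = zlib.compress(data)
        bucket = self._buckets.setdefault(key, [])
        for index, existing in enumerate(bucket):
            if existing == packed:      # full comparison of the compressed bytes
                return key, index       # same content, already stored
        bucket.append(packed)           # same hash, different content: collision
        return key, len(bucket) - 1

    def get(self, key: str, index: int = 0) -> bytes:
        return zlib.decompress(self._buckets[key][index])


# Force a collision with a deliberately terrible hash to exercise the bucket logic.
store = CollisionAwareStore(hash_fn=lambda data: "same")
print(store.put(b"good code"))   # first object in the bucket
print(store.put(b"evil code"))   # same hash, gets its own slot
print(store.put(b"good code"))   # recognised as already present
```

The catch is that the object ID is now the pair (hash, index) rather than the hash alone, so anything that refers to objects by bare hash would still need a way to say which one it means.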
With reasonably long cryptographically strong hashes it's not a weak point (or you could call almost all cryptography a weak point). The weak point is the hard-coding of algorithms and sizes into Git (if I understood the problem correctly). Software should be written with a more generic approach, so algorithms can be changed and migrated when necessary. SHA-1 is considered weak for many years, git should've migrated from it already.
> With reasonably long cryptographically strong hashes it's not a weak point (or you could call almost all cryptography a weak point).
We do not yet know if one-way functions truly exist, so from a theoretical standpoint, any hash function is a weak point if you do not properly handle malicious collisions.
> SHA-1 is considered weak for many years, git should've migrated from it already.
Linus addressed this many years ago, when he was working on the first version of Git. It was chosen, despite the fact that it was known to be weakened. I don't know if they lost sight of this, or they genuinely still believe that malicious colliding hashes are not a problem. I do not know enough about the intimate details of Git to comment on that fact.
Git could have been written with a pluggable hash system, but for something that needs to be changed let's say once every ten years, is the up front cost really worth it?
That's not good enough though if you need to work with repositories that were made with previous releases. Just converting them is one thing, but what if you need to stay compatible indefinitely.
And what about old releases that encounter a new repo?
And what about URLs and emails that reference commit hashes? Think archives of mailing lists that suddenly become useless unless there's a way to keep both hashes around.
Yes. These are all solvable problems (maybe not the old-release needing to handle new-style repo gracefully), but the complexity is much higher than upgrading a global constant.
I think you could solve most problems by just enabling a different hash function with no backwards compatibility. Repositories have a format-version somewhere, and migrating from git to git-with-new-hash should be a fairly simple operation. You can always edit commit messages to add "corresponds to commit <sha1> in <old repo>". This is of course not as nice as full backwards compatibility, but it gets rid of the security problem for relatively cheap.
The problem isn't the number of bits (yet). The problem is the choice of hash. A truncated sha2 would still be fine for years to come (barring a sha2 break, which doesn't look imminent). 160bits may not be a huge margin, but it's still enough.
If you use a strong hash with say 256 bits, it's not a weak point. Random collisions are less likely than cosmic rays flipping bits in your programs and unless you don't believe in strong cryptography attackers can't do much better.
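To put rough numbers on "random collisions are less likely than cosmic rays", here's a back-of-the-envelope sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)). The object count of 10^12 is an arbitrary assumption for illustration, and this only covers accidental collisions, not forged ones.

```python
import math


def random_collision_probability(num_objects: int, hash_bits: int) -> float:
    """Birthday approximation for at least one accidental collision."""
    pairs = num_objects * (num_objects - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2 ** hash_bits)


n = 10 ** 12  # hypothetical number of objects, far more than any real repo
for bits in (160, 256):
    print(bits, random_collision_probability(n, bits))
```

Even at 160 bits the accidental-collision probability is vanishingly small, which is why the argument in this thread is about deliberately constructed collisions and the choice of hash rather than the raw number of bits.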
This argument still assumes that Git uses SHA1 for security. Linus points that out and John doesn't attempt to refute it, simply ignoring it. Linus should have used Murmur, CityHash or something - a SHA1 collision was going to happen eventually. By using a content identification hash function we could have avoided this argument entirely.
What you need to understand, is that Git's data structures are essentially annotated Merkle trees [1]. So whatever you sign, be it a tag or a commit, it will be nested sha1 hashes like [someData].sha1([someData].sha1([someData].[aFile]).[someData]).[someData] . And at every level you can conceivably construct a hash collision. So if e.g. you create your own commit on top of a commit of an attacker and sign your commit, you are only signing a (sha1 of a sha1 of a) sha1 of the commit of the attacker. If the attacker's commit was crafted to enable a sha1 collision somewhere, then your signed commit doesn't cover the files and commits you see, but only the sha1 hashes of those objects.
This kind of hairy distinction of what a signature was supposed to mean and what it actually covers is what you get with (semi-)broken cryptographic primitives. It's awful and, frankly, unnecessary.
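Here's a stripped-down sketch of that nesting, following git's object layout as I understand it (a typed, length-prefixed header, tree entries holding raw 20-byte IDs, a commit naming its tree by hex ID). The helper names and the author line are made up for illustration, and real commits carry more fields (parents, timestamps, an optional signature header).

```python
import hashlib


def git_object_id(kind: bytes, body: bytes) -> bytes:
    """SHA-1 over '<kind> <size>\\0' + body, returned as 20 raw bytes."""
    header = kind + b" " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).digest()


def tree_body(entries) -> bytes:
    """Each entry is (mode, name, object_id); the tree stores only the IDs."""
    ordered = sorted(entries, key=lambda entry: entry[1])
    return b"".join(mode + b" " + name + b"\x00" + oid for mode, name, oid in ordered)


def commit_body(tree_id: bytes, message: bytes) -> bytes:
    """The commit mentions the tree only by its hex ID."""
    ident = b"A U Thor <author@example.com> 0 +0000\n"  # hypothetical identity
    return (b"tree " + tree_id.hex().encode() + b"\n"
            + b"author " + ident + b"committer " + ident + b"\n" + message)


source = b"int main(void) { return 0; }\n"
blob_id = git_object_id(b"blob", source)
tree = tree_body([(b"100644", b"main.c", blob_id)])
tree_id = git_object_id(b"tree", tree)
commit = commit_body(tree_id, b"Initial commit\n")
commit_id = git_object_id(b"commit", commit)

# What a signature over the commit actually sees: the tree's hash, not the file.
print(commit.decode())
print(commit_id.hex())
```

Nothing in `commit` depends on `source` except through `blob_id` and then `tree_id`, so any other blob with the same ID would leave the commit, and a signature over it, byte-for-byte unchanged.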
I appreciate that info, +1. My main point was just if you're dealing with a repository where actors that have those 6 figures to spare to attack you (and that's a minority):
1) you've got to rely on a lot better security than the minimal if at all security provided by git (it assumes a web of trust). If people are signing off with PGP sigs but not watching diffs, you've got big problems.
2) You're probably far more likely to be exploited by far cheaper methods at this point. If they have access to a trusted contributor's keys, it's far more cost effective to slip in other tricks than sha-1 collisions right now. I'd say this is the main point so far, but admittedly maybe not in the future.
3) It sounds like Linus and the git devs have admitted they need to migrate from sha-1, but also I haven't seen any cheap, exploitable PoC for git yet based on this due to how they actually mix in other info instead of raw sha-1 hashes of the files.
4) As far as I know, and I'm sure I'm subject to correction, but there hasn't been a WebKit svn repo-esque calamity yet like what they've experienced dropping 2 sha-1 collision PDFs into the repo in a Git context yet.
Again, I'm totally open to new info, but the sky-is-falling attitude right now is what I'm mainly arguing against.
If you drop two identical files into a linux repo then they will be rejected by the maintainers. You don't even need to get into a technical solution to prevent it.
However, as far as I know, when you sign a git commit you are actually signing the hash of the commit.
With SHA-1 broken in the current way it essentially means someone with 110K to burn could forge a commit and reuse my PGP signature.
Yes, if you sign off on commits you haven't reviewed to confirm the diffs at all, and they're carefully crafted files to make the collision, you may commit a duplicate sha-1, which still doesn't even break Git.
This is an important distinction. Without a preimage attack on sha-1, the only vulnerability is if some part (yet any part) of the git objects reachable from the signed tag or commit is one half of a prepared collision pair.
Well holy smokes. I don't know which repository you contribute to, but if you're getting undermined by such James Bond-esque deception by super villains, in addition to someone spending 6 figures into breaking your stuff, I'd hope you'd at least review the commits you sign with your key after glancing at it.
In addition, you'd have to have everyone else not notice it, all the insanely cheaper exploits not been tried on your current setup, and all the other stars aligning...
That might be a hint that Git isn't something you should rely on to handle security. Literally the first step to the entire thing: pick any email or name...
Please provide reasonable security policies in your repos--and if someone is exploited, you've probably got far bigger problems than someone duplicating a sha-1. Not necessarily, but highly likely your system is owned.
My original statement is slightly incorrect; you can. The commit hash is probably used as part of the signature. It would have been better to sign the commit blob directly, as Git stores the length alongside the hash from that point down in the DAG (making prefix attacks impossible).
I could be crazy, but aren't the hashes the diffs of checkins?
It's certainly possible to create a valid patch file that causes a collision, but it seems really hard to make a collision that looks like a valid pull request.
I understand your concern (I think) but consider all the extra stuff that has to happen for someone to accept a pr.
Edit.
I do agree it's time to start thinking about moving to the fire exits
No, that's not how Git works. There are multiple objects in Git that use SHA-1 for identification. A common point of confusion is when someone thinks a commit is essentially a diff. It's not, each commit is a snapshot that can be used to reconstruct the entire work tree. You get the diff when you compare the commit to its parent.
-S[<keyid>], --gpg-sign[=<keyid>]
GPG-sign commits. The keyid argument is optional and defaults to the committer identity; if specified, it must be stuck to the option without a space.
I don't know much cryptography. Wouldn't an attack require you to forge a commit object which is a good-looking patch, along with a valid signature (signed from someone you trust), which has the same identity (SHA1 hash)?
The attack is not as difficult as that. If you can create a valid git object which collides with another git object, signatures for the previous object tree (which is identified by SHA-1 hash) will be valid for the new object tree (which has the same hash).
So a collision in a blob that represents a file (or any other internal git object) will result in your old signature still being valid for the new file that corresponds to the git collision.
"But the _real_ security comes from the fact that git is distributed, which means that a developer should never actually use a public tree for his development."
Well, he was saved by SHA-1 being still cryptographic enough to rely on the head-sha of his tree to know nobody changed anything after somebody broke into kernel.org. Not sure how his fetch/push policy is, but my guess is this would have been more of a headache if it would have been MD5.
That's not a problem with git. If anything it's an issue with github, but it's a pretty insignificant one IMO.
Yes, if you say you are billg@microsoft.com and make a commit to some repo on github, github will look up the username associated with billg@microsoft.com and show that user as the committer. Should it do that? Eh, probably not but this has come up a few times and github hasn't changed it. So by now we should just start to educate ourselves that this is how github is intended to work.
This reads like a giant "I told you so." circa 2005
"In the next few years, nasty people will teach him the threat model"
I'd like to see those very forceful claims substantiated. Git hasn't said moving forward it will never change from sha-1 and 2005 was a far different era than 2017 for crypto. Let's keep that in perspective.
1) Yes, Git should move to a better hash function at some point.
2) No, even easy "malicious" collisions in SHA-1 will still not break most of Git's usages. You're already trusting the repo you're pulling because of TLS, you're already trusting the commits you're getting because of peer-review (you read the commit) and a web-of-trust (you trust your collaborators). (And you're trusting commits even more when they're signed.)
SHA1 is still OK for identifying files. The probability of random collision is still very low. The only problem is for forged collisions.
The object store could be modified to support file collision. One way to disambiguate collision is to use a randomly generated byte sequence as SHA1 seed or hashed before the file data. This random byte sequence would behave like a salt and disable any forged collisions. A single seed for the whole repository would be enough. It should remain secret to prevent forging a collision with the two hashes. It's harder but not impossible.
To test if a given file is in the object store, one first computes the SHA1 key to use as file name. If no collision ever occurred with an object, a file with that SHA1 name will be present in the store. That file contains the usual data plus the second hash computed with the random seed. This second hash could be added as needed to keep backward compatibility and provide silent automatic upgrade.
When one needs to test if the file is present in the store, one computes the normal SHA1 key and the secondary hash with the seed. We locate the object in the store using the first SHA and test for file equality with the randomly seeded hash.
Using a faster hash like BLAKE2 to compute the randomly seeded hash could mitigate the cost of computing two hashes. It should be parameterized this time and the hash size should be variable.
If a collision is detected, that is the secondary hash differs, the file is replaced by a directory with the common SHA1 as name. The colliding files would be stored in the directory using the secondary hash as name. Or the files could be packed in a single tar-like file with the secondary hash used as file identifier.
This should be enough to protect against forged collisions which is the only real problem. The required change to git would be limited. The only serious disadvantage is the need to compute the randomly seeded hash.
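A minimal sketch of the scheme described above, under my own assumptions: the secondary hash is keyed BLAKE2b (using the `key` parameter of Python's `hashlib.blake2b`) instead of a seeded SHA1, the seed is a per-repository secret, and the "directory of colliding files" is modelled as a nested dict. `SaltedStore` is a made-up name, not anything in git.

```python
import hashlib
import secrets


class SaltedStore:
    """SHA-1 as the public name, a secretly keyed hash to tell colliders apart."""

    def __init__(self, seed=None):
        # Per-repository secret seed; an attacker who knows it could target both hashes.
        self._seed = seed if seed is not None else secrets.token_bytes(32)
        self._objects = {}  # sha1 hex -> {secondary hex: content}

    def _secondary(self, data: bytes) -> str:
        return hashlib.blake2b(data, key=self._seed, digest_size=32).hexdigest()

    def put(self, data: bytes):
        primary = hashlib.sha1(data).hexdigest()
        secondary = self._secondary(data)
        self._objects.setdefault(primary, {})[secondary] = data
        return primary, secondary

    def contains(self, data: bytes) -> bool:
        primary = hashlib.sha1(data).hexdigest()
        return self._secondary(data) in self._objects.get(primary, {})

    def has_collision(self, primary: str) -> bool:
        """True when two different contents share this SHA-1 name."""
        return len(self._objects.get(primary, {})) > 1


store = SaltedStore()
primary, secondary = store.put(b"reviewed source file\n")
print(store.contains(b"reviewed source file\n"))   # True
print(store.contains(b"something else\n"))         # False
print(store.has_collision(primary))                # False unless a collider shows up
```

Since the seed is secret and local, the secondary hash can't be used as a shared object name between repositories, which is probably the hard part of fitting this into a distributed system.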
When I designed fingerprint (https://github.com/ioquatix/fingerprint) I allowed multiple checksums, which means you CAN migrate from one checksum to another pretty easily. However, I didn't use a Merkle tree in the initial design and so I hope to revisit it at some point to improve it.
It should be trivial to add an additional checksum to git. Not to replace how SHA1 is currently used, but to add essentially per-commit checksum, which is a checksum of the entire commit contents (including the checksum of the previous commit). It wouldn't be as elegant as using SHA256 in place of SHA1, but at least you could, with some effort, validate the source tree in a cryptographically secure way.
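Something like the following is what I'd picture for the extra per-commit checksum, as a sketch only. How "the entire commit contents" get serialized is left as an assumption (here it's just a bytes value per commit), and `chained_checksum` / `checksum_history` are hypothetical names.

```python
import hashlib


def chained_checksum(previous_checksum: bytes, commit_contents: bytes) -> bytes:
    """SHA-256 over the previous checksum and this commit's full contents."""
    h = hashlib.sha256()
    h.update(previous_checksum)
    h.update(len(commit_contents).to_bytes(8, "big"))  # length prefix avoids ambiguity
    h.update(commit_contents)
    return h.digest()


def checksum_history(commits) -> list:
    """Return the running checksum after each commit, oldest first."""
    checksum = b"\x00" * 32  # assumed starting value for the first commit
    history = []
    for contents in commits:
        checksum = chained_checksum(checksum, contents)
        history.append(checksum)
    return history


original = checksum_history([b"commit 1 files", b"commit 2 files", b"commit 3 files"])
tampered = checksum_history([b"commit 1 FILES", b"commit 2 files", b"commit 3 files"])
print(original[-1].hex())
print(tampered[-1].hex())  # differs: a change in any earlier commit propagates forward
```

Validating the tip checksum then covers the whole history's contents directly, independent of whatever the SHA-1 object IDs say.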
Why does Linus think that the source file in question has to acquire an incomprehensible blob in the middle in order for a commit hash collision to occur? Can't the attacker just make all the changes he wants and then insert a random 1 KiB file somewhere to compensate for the commit hash? It could be totally tucked away somewhere you don't expect... you wouldn't see it just by looking at source code.
The demonstration is a pair of PDFs that display differently. However, Google didn't find two PDFs that just happen to collide. Instead, they built one PDF that contains two jpegs and a switch that selects one of them for display. The neat bit is that the file has the same hash regardless of the switch's setting.
This attack won't work on plain text source because it won't look like source code.
Forgive my ignorance. I don't understand the threat model being worried about here. If someone breaks into your CVS/SVN server and rewrites project history, how is that less bad than someone breaking into your remote git repo and rewriting project history? Don't both attacks require a break-in at the server/remote? Or does CVS/SVN have a better method of detecting such a break-in?
I don't get why the discovery of a non-preimage attack is causing so much consternation.
If the mere existence of collisions is not acceptable to your VCS, then your VCS can't use a hash, period.
If you're worried about an intentional attack, it's no closer today than it was last week: the attacker doesn't control the output hash of the collision, or either input.
No VCS is 100% secure against the possibility of catastrophic failure. If an asteroid wiped out all life on earth there is no VCS that can handle that gracefully. So as long as a hash collision is less likely than that, using a hash and not handling collisions gracefully doesn't make the VCS substantially less safe.
Git should have been written with pluggable hashes, right. Also, SHA1 could have been invented stronger against collisions in the first place.
In retrospect, many decisions that would have simplified the present problems look obvious, but at the time those decisions were made they never are. This feeling (it's called retrospective predictability) is not a function of the current problem or of previous wrong decisions, but of the randomness of events in complex systems; open source software implementing fresh ideas in a new way is random and complex enough for this.
Luckily there's a long thread on the bitcoin-dev mailing list about how disastrous the consequences can be if somebody is able to change the git-tree because of this attack scenario, so I'm not the only one who thinks that signing only the commit can cause billions of dollars worth of damage. And this doesn't take into account that almost all companies use open source software developed on github, so I believe that any remote possibility of adding malware to open source software is an emergency situation.
"After sitting through an endless flood of headless-chicken messages on multiple media about SHA-1 being fatally broken, I thought I'd do a quick writeup about what this actually means. In short:
Reports of SHA-1's demise are considerably exaggerated.
What CWI/Google have done is confirmed what we've known for a long time, that SHA-1 is shaky. Using a nation-state's worth of resources and a year of time (https://security.googleblog.com/2017/02/announcing-first-sha...), they've shown that, with a very carefully-crafted document, you can create a collision. Their presentation of the results is detailed and accurate, it's the panicked misinterpretation of those results that are the problem."
Continues here: http://www.metzdowd.com/pipermail/cryptography/2017-February...