> There is also the risk (which cannot really be made to go away) that the longer hashes used with SHA-256 may break tools developed outside of the Git project
Easy fix if that is really an issue: just truncate SHA-256. The length of the hash is not the issue that needs fixing (even if it's a nice side benefit).
> that is only the first step in the development of a successful attack. Finding a collision of any type is hard; finding one that is still working code, that has the functionality the attacker is after, and that looks reasonable to both humans and compilers is quite a bit harder — if it is possible at all.
I mean, if you have any sort of binary files in your repo, that's pretty doubtful.
The way you mostly do this is that the colliding part is a short binary blob which is embedded, and the file has code outside the colliding part that does different things depending on the value of the blob.
Yeah, getting that past human review with a source code file is going to be tricky. OTOH, if you have any sort of binary assets in your Git repo (this might even include images, depending on the attack goals, e.g. a goatse attack), this seems a lot more plausible.
P.S. To be clear, I agree that the SHA-1DC variant removes most of the urgency.
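The blob-branching pattern described above can be sketched roughly like this; the blob contents are placeholders rather than real colliding data, and `run_payload` is a hypothetical stand-in for whatever the file actually does:

```python
# Hypothetical sketch of the pattern described above: two versions of a
# file share identical surrounding code but embed different binary blobs
# that collide under the broken hash. Behavior branches on a blob byte.
# The blob values below are placeholders, not real colliding data.

def run_payload(blob: bytes) -> str:
    # Identical source in both file versions; only the blob differs.
    if blob[0] == 0:
        return "benign behavior"
    return "malicious behavior"

BENIGN_BLOB = b"\x00" + b"A" * 63     # stand-in for colliding block A
MALICIOUS_BLOB = b"\x01" + b"A" * 63  # stand-in for colliding block B

print(run_payload(BENIGN_BLOB))     # -> benign behavior
print(run_payload(MALICIOUS_BLOB))  # -> malicious behavior
```

Since reviewers see only one version of the file, the source looks identical either way; all the attacker-controlled difference lives in the opaque blob.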
>> that is only the first step in the development of a successful attack. Finding a collision of any type is hard; finding one that is still working code, that has the functionality the attacker is after, and that looks reasonable to both humans and compilers is quite a bit harder — if it is possible at all.
> I mean, if you have any sort of binary files in your repo, that's pretty doubtful.
> The way you mostly do this is that the colliding part is a short binary blob which is embedded, and the file has code outside the colliding part that does different things depending on the value of the blob.
> Yeah, getting that past human review with a source code file is going to be tricky. OTOH, if you have any sort of binary assets in your Git repo (this might even include images, depending on the attack goals, e.g. a goatse attack), this seems a lot more plausible.
My view is that if you find yourself rationalizing away potential cryptographic issues with "I bet this will be hard to successfully attack in practice", you're probably better off just fixing the problem if you can. Once you've moved from just relying on cryptographic security to non-cryptographic factors like human code reviews or constrained input formats, you've made it significantly more complicated to evaluate the security of your system and significantly increased the risk that an attacker comes up with an approach you haven't considered.
It's very tempting to conclude that a cryptographic attack isn't really an issue for your system and you don't have to change anything, but that conclusion is almost certainly not based on a real understanding of the risk you're accepting. Just using SHA-256 or something similar is almost always a better answer than coming up with some more complicated reason to keep using SHA-1.
Interesting note: there are already standard representations for truncated SHA-2, namely SHA-512/224 and SHA-512/256, but unfortunately none with the output length of SHA-1. Even more interesting, those truncated variants are more secure against length-extension attacks.
A proper solution would be to define a SHA-160, similarly to how SHA-224 is defined: use different initialization constants and truncate the output of the core SHA-256 algorithm. But I guess it would be a bit more difficult to implement than simply truncating the output of an existing SHA-256 implementation.
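The simple-truncation approach, as opposed to the distinct-IV SHA-160 sketched above, is trivial with any off-the-shelf SHA-256 implementation. A minimal Python sketch (the function name is made up for illustration):

```python
import hashlib

def truncated_sha256_160(data: bytes) -> str:
    """Naive truncation: first 160 bits (40 hex chars) of SHA-256.

    The "proper" SHA-160 described above would additionally use
    distinct initialization constants, which hashlib cannot express.
    """
    return hashlib.sha256(data).hexdigest()[:40]

digest = truncated_sha256_160(b"hello")
assert len(digest) == 40  # same length as a SHA-1 hex digest
```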
Truncating the SHA-256 hashes does sound like a reasonable intermediate step and should also enable interoperability (a guess on my side: if it is only about referencing objects, it probably does not matter how the keys were generated).
At some point one could then transition to the full hashes and make the truncated ones an option.
I'm wondering what tooling is heavily dependent on the length of the hashes. It could potentially matter if you want to keep the size of the transmitted data small (at work, we once considered Git as a versioned database for an IoT use case…).
I think the ancestor comment meant SHA1(SHA256(X)) instead. Not clear to me how that wouldn’t have collisions, too. Just that the underlying commits that generate the collisions would need to look different.
I don't follow. If the algorithm is SHA1(SHA256(X)) all an attacker can modify is X. Yes it's possible to find a SHA1() collision, but finding X where the SHA256() will generate a collision -- that is SHA1(SHA256(X)) == SHA1(SHA256(Y)) -- is still required.
The question is does the SHA1 step make this any easier?
Don't you still have to either break SHA256 (predicting the hash it will generate) or do this by brute force?
I'm OP. I meant SHA1(SHA256(X)), but I have no idea whether that makes a collision more difficult than SHA1(X), or what the other implications are. It was a way to reduce the hash length without truncation.
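For concreteness, the SHA1(SHA256(X)) construction under discussion would look something like this (a sketch, not a proposal for Git's actual object format):

```python
import hashlib

def sha1_of_sha256(data: bytes) -> str:
    # Feed the raw 32-byte SHA-256 digest into SHA-1, producing a
    # 160-bit (SHA-1-length) result. Note the attacker controls only
    # `data`, not the SHA-1 input, which is why a plain SHA-1 collision
    # does not directly translate into a collision here.
    inner = hashlib.sha256(data).digest()
    return hashlib.sha1(inner).hexdigest()

assert len(sha1_of_sha256(b"example")) == 40
```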
I'd imagine it's an easy error to make: just read a SHA-1's worth of characters from Git, or sprinkle validation into code that says "okay, this is not a SHA-1-length hash, there must be something wrong with the data".
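A hypothetical example of the kind of brittle, length-based validation such tools might contain (`looks_like_sha1` is made up for illustration):

```python
def looks_like_sha1(ref: str) -> bool:
    # Brittle validation of the kind described: it hard-codes the
    # 40-hex-character SHA-1 length, so a valid 64-character SHA-256
    # object id would be rejected as "something wrong with the data".
    return len(ref) == 40 and all(c in "0123456789abcdef" for c in ref)

print(looks_like_sha1("a" * 40))  # -> True
print(looks_like_sha1("a" * 64))  # -> False
```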
I don't know how the collision detector works, but in general, you don't even need binary files, do you? Just add a comment containing near-arbitrary data at the end of some line in a source file. Bonus points if it's preceded by enough whitespace to fool the reader into thinking there's nothing there.
I was just going by the fact that it's a lot harder to trick a human in a text format. Most collisions involve a bunch of binary data that isn't valid UTF-8, which looks very conspicuous in a text file.