> The use of SHA-1 in Git is not for security purposes, [...] Only, it is. When ...

gsu2 · on Dec 31, 2022

Git isn't relying on collision resistance, it's relying on second-preimage[0] resistance, which is to say: in order to sneak a hash collision into a git repository, you have to sneak _something else_ that's already trusted (e.g. via code review) into the repository; collisions can't (yet) be generated for arbitrary hashes.
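
To make the difference in attack cost concrete, here is a minimal sketch that uses a deliberately tiny hash: SHA-1 truncated to 16 bits. `toy_hash`, `BITS`, and the message formats are made up for illustration, and real attacks on SHA-1 are structural rather than brute force. Picking both messages yourself (a collision) takes on the order of 2^8 tries here, while matching a message somebody else fixed (a second preimage) takes on the order of 2^16.

```python
import hashlib
from itertools import count

BITS = 16  # toy digest size, an assumption chosen so brute force finishes quickly


def toy_hash(data: bytes) -> int:
    """First BITS bits of SHA-1, standing in for a weak hash."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - BITS)


def find_collision():
    """Attacker chooses BOTH messages: birthday search, roughly 2**(BITS/2) tries."""
    seen = {}
    for tries in count(1):
        msg = b"free-%d" % tries
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, tries
        seen[h] = msg


def find_second_preimage(target: bytes):
    """Attacker must match a message fixed by someone else: roughly 2**BITS tries."""
    goal = toy_hash(target)
    for tries in count(1):
        msg = b"forged-%d" % tries
        if msg != target and toy_hash(msg) == goal:
            return msg, tries


TRUSTED = b"reviewed and trusted file"
a, b, collision_tries = find_collision()
forged, preimage_tries = find_second_preimage(TRUSTED)
print("collision found after", collision_tries, "tries")
print("second preimage found after", preimage_tries, "tries")
```

The exact counts depend on the inputs, but the gap between the two searches grows exponentially with the digest size.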

I haven't heard of any second-preimage attacks against MD5, much less SHA-1, so mlindner was correct in asserting that MD5 would be fine (assuming 128 bits are enough). See also the analysis in [1].

More to the point, if you're able to sneak something into a repository in the first place (e.g. a benign file that generates a collision with a malicious file), then you're probably able to sneak in something more directly (e.g. [2]) that won't rely on both getting something in a trusted repository and then cloning from a different, untrusted source.

[0]: https://en.wikipedia.org/wiki/Preimage_attack

[1]: this is getting a bit old, but should still be relevant? https://electriccoin.co/blog/lessons-from-the-history-of-att...

[2]: https://en.wikipedia.org/wiki/IDN_homograph_attack

oconnor663 · on Dec 31, 2022

> if you're able to sneak something into a repository in the first place (e.g. a benign file that generates a collision with a malicious file), then you're probably able to sneak in something more directly

Could you imagine using an implementation of TLS that "probably" authenticated your network traffic though? I think there are two separate reasons we prefer to make strong guarantees in cryptography:

1. That's often really what I need. If I'm downloading e.g. software updates over the network, I really need those to be authentic.

2. Even when I arguably don't need strong authenticity, like just reading some news articles, I want to use the same strong tools, because I don't want to have to study and understand (much less teach) the situations where some weaker tool fails. Inevitably I'll get that wrong or just forget, and I'll end up using the weak tool in some case where I should've used the strong one.

In this case, if I imagine teaching how commit signing works with a weak hash function, it sounds like "Signing commits means that no one can sneak malicious content into your repository, unless they first steal your secret signing key, or else you ever committed (or allowed anyone else to commit) a non-text file that they created." Actually writing that second part out makes it feel really bad to me.

seba_dos1 · on Jan 1, 2023

> "Signing commits means that no one can sneak malicious content into your repository

Signing commits does not mean that, even when using a cryptographically secure hash function. All it means is that you put your signature over a particular state of the repo (and, by extension, its parent states). It has nothing to do with preventing "sneaking things in", although it could be a (small) part of the whole set of measures taken to prevent someone from doing that.
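
A rough sketch of why a signature over one commit ID also covers the parent states: each commit's ID is a hash over a body that includes its parents' IDs. The object header (`<type> <size>\0`) matches what git hashes, but the commit body below is simplified (no author or committer lines, and a blob ID stands in for a real tree ID), and `history_tip` is a hypothetical helper.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    # Git hashes "<type> <size>\0<body>" to get an object ID.
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree_id: str, parent_ids: list, message: str) -> str:
    # Simplified commit body: real commits also carry author/committer lines.
    lines = [f"tree {tree_id}"] + [f"parent {p}" for p in parent_ids]
    body = "\n".join(lines) + "\n\n" + message + "\n"
    return object_id("commit", body.encode())


def history_tip(file_contents: list) -> str:
    """Build a linear history, one commit per file version, and return the tip ID."""
    parents = []
    tip = ""
    for i, content in enumerate(file_contents):
        tree = object_id("blob", content)  # stand-in for a real tree ID
        tip = commit_id(tree, parents, f"change {i}")
        parents = [tip]
    return tip


honest = history_tip([b"v1\n", b"v2\n", b"v3\n"])
tampered = history_tip([b"v1 with backdoor\n", b"v2\n", b"v3\n"])
print(honest != tampered)  # the tip ID changes when the oldest commit changes
```

This only holds as long as nobody can produce two different objects with the same ID, which is the property the rest of the thread is arguing about.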

eru · on Jan 1, 2023

> All it means is that you put your signature over a particular state of the repo (and, by extension, its parent states).

That's technically true. Though in practice I think the implied social contract is that signing of a commit means you signal some kind of approval for the diff between the signed commit and its immediate predecessor(s).

gsu2 · on Dec 31, 2022

I'm not 100% sure I understand your point, but it sounds like you're concerned about signing something using a weak hash function (i.e. where the hash of something is what actually gets signed)?

If that's the case, then my point is pretty simple: yes, SHA-1 is broken for signing untrusted input (due to weak collision resistance), but it is not broken (so far) for signing trusted input (due to strong preimage resistance).

My point earlier was primarily that the contents of a repository are generally trusted (via mechanisms like code review), and signing trusted content still works even with SHA-1.

Note that certificate signing vulnerabilities (which I assume is why TLS was mentioned?) usually rely on a malicious actor presenting one certificate and then presenting a different cert later; they can't arbitrarily fake existing certs from somebody else.

The analogous scenario for git repositories would be to have a malicious actor make a commit (or blob, tree, etc.) that could be swapped out for another. But if you already have malicious actors able to make commits in your repository, then the hash function doesn't matter: they can cause damage in many, many other ways.

eru · on Jan 1, 2023

> The analogous scenario for git repositories would be to have a malicious actor make a commit (or blob, tree, etc.) that could be swapped out for another. But if you already have malicious actors able to make commits in your repository, then the hash function doesn't matter: they can cause damage in many, many other ways.

The malicious actor can pose as a good-faith contributor and submit Pull Requests to your repository.

You review the code in the PR, and perhaps even prove it correct. Later on, the malicious actor can do the swapping trick. (Eg by running a mirroring service for your repository.)

gsu2 · on Jan 1, 2023

> You review the code in the PR, and perhaps even prove it correct. Later on, the malicious actor can do the swapping trick. (Eg by running a mirroring service for your repository.)

Having a copy of code that is reviewable and then searching for a malicious collision is a preimage attack; extending two chosen prefixes (e.g. one "valid" and one "malicious") until they meet at a hash collision is how most practical (?) collision attacks work. The latter scenario produces large junk sections in the results, which should be obvious under even mild scrutiny.

If the reviewer misses the kilobytes of garbage in the middle of a file they're reviewing, then an attacker can just sneak malicious code in directly without requiring a hash collision.

If the project relies on an effectively unreviewable binary file that could hold kilobytes of junk (like some YAML files I've seen...), then that's already breaking the review process without requiring a hash collision.

Ignoring all of that, anybody grabbing code from an untrusted source is already vulnerable to whatever attacks that untrusted source wants to employ, with "exploiting hash collision" being one of the higher-effort attacks that could be mounted.

Essentially, any repository that would be vulnerable to any of the known hash collision attacks (via bad review, untrusted upstream, etc.) would be vulnerable to more mundane, easier attacks against the same weaknesses that do not depend on hash collisions.

eru · on Jan 3, 2023

> Having a copy of code that is reviewable and then searching for a malicious collision is a preimage attack;

No, it's not. You can sneak extra entropy into minor formatting choices or variable names etc, or exactly what you write in your commit messages. Or probably even ordering of files in your directories. (I don't think the git protocol enforces that files have to be in eg alphabetical order.)
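
A small sketch of that "extra entropy" point, under the assumption that trailing whitespace is invisible to a reviewer: every semantically equivalent variant of a file gets its own object ID, so an attacker has many candidates to search through. `SOURCE_LINES` and `variants` are made up for illustration; this shows only the degrees of freedom, not an actual collision search.

```python
import hashlib
from itertools import product


def blob_id(content: bytes) -> str:
    # Same construction git uses for blobs: sha1("blob <size>\0" + content).
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


SOURCE_LINES = ["def add(a, b):", "    return a + b", "", "print(add(1, 2))"]


def variants(lines):
    """Yield every version of the file with or without a trailing space on each line."""
    for choice in product(["", " "], repeat=len(lines)):
        text = "\n".join(line + pad for line, pad in zip(lines, choice)) + "\n"
        yield text.encode()


ids = {blob_id(v) for v in variants(SOURCE_LINES)}
print(len(ids), "distinct object IDs from", 2 ** len(SOURCE_LINES), "equivalent files")
```

With n lines there are 2^n such variants from this one trick alone, before touching variable names or commit messages.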

> Ignoring all of that, anybody grabbing code from an untrusted source is already vulnerable to whatever attacks that untrusted source wants to employ, with "exploiting hash collision" being one of the higher-effort attacks that could be mounted.

I'm not sure. If your hash works fine, as long as someone trusted gives you the commit hash, anyone untrusted can give you the actual source.
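
As a sketch of what that looks like, simplified to a single blob instead of a whole commit: recompute the object ID of whatever the untrusted source sent and compare it with the ID you got from the trusted party. `accept_from_untrusted` is a hypothetical helper, not a git API.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    # Git's blob ID: sha1("blob <size>\0" + content).
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def accept_from_untrusted(content: bytes, trusted_id: str) -> bytes:
    """Return content only if it hashes to the ID obtained from a trusted party."""
    actual = git_blob_id(content)
    if actual != trusted_id:
        raise ValueError(f"object ID mismatch: expected {trusted_id}, got {actual}")
    return content


# The trusted party only has to hand over 40 hex characters.
trusted_id = git_blob_id(b"hello world\n")

accept_from_untrusted(b"hello world\n", trusted_id)  # passes
try:
    accept_from_untrusted(b"hello world, plus a backdoor\n", trusted_id)
except ValueError as err:
    print("rejected:", err)
```

The check is only as strong as the hash: if the mirror can serve a different file with the same ID, it passes.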

And if you mean accepting PRs: accepting PRs from the untrusted internet is basically how open source works.

dagenix · on Dec 31, 2022

I don't believe that is accurate.

First - if git really didn't care about collision resistance, there wouldn't have been a need to switch to SHA1DC as the hash function. They switched because they care enough that they were willing to accept the performance penalty.

Second - imagine this scenario: a user creates two commits with the same hash, one with a valid change and the second with a malicious one. The collision could be created by playing around with some data in a binary file - so, this is a collision attack not 2nd pre-image. The user then submits the change to the upstream and gets it approved. The user maintains a mirror of the upstream repo into which they place the malicious commit. Anyone that pulls from this mirror will think they have the same code as the upstream, even if they compare hashes.

So don't use an untrusted mirror? I guess - but that is something that should be possible with a strong hash. And if git really didn't want you to do that, it would provide for better ways of tracking where objects were actually pulled from.

Anyway, collision attacks are real and can impact git. They just aren't as bad as a 2nd pre-image attack.

gsu2 · on Jan 1, 2023

> First - if git really didn't care about collision resistance, there wouldn't have been a need to switch to SHA1DC as the hash function. They switched because they care enough that they were willing to accept the performance penalty.

Git didn't _need_ to switch to SHA1DC, but they did because the cost was minimal and it's still a good idea to defend against known attacks.

> Second - imagine this scenario: a user creates two commits with the same hash, one with a valid change and the second with a malicious one. The collision could be created by playing around with some data in a binary file - so, this is a collision attack not 2nd pre-image. The user then submits the change to the upstream and gets it approved.

This is a general problem with binary files: they're hard to properly review. Having unreviewable files in a repository (binaries, machine-generated configs, etc.) is already a security problem; hash collisions would just be one (very difficult) way of exploiting that problem.

> The user maintains a mirror of the upstream repo into which they place the malicious commit. Anyone that pulls from this mirror will think they have the same code as the upstream, even if they compare hashes.

Having people pull data from an attacker-controlled source is a security issue, regardless of hash collisions.

> So don't use an untrusted mirror? I guess - but that is something that should be possible with a strong hash. And if git really didn't want you to do that, it would provide for better ways of tracking where objects were actually pulled from.

Git was designed for collaboration between trusted parties; collaboration between untrusted parties (e.g. pulling changes from untrusted sources) is a much harder problem that git doesn't pretend to solve.

> Anyway, collision attacks are real and can impact git. They just aren't as bad as a 2nd pre-image attack.

Collision attacks are real, but they have yet to impact git (beyond adopting SHA1DC, I guess), despite how big of a target popular git repositories are.

dagenix · on Jan 1, 2023

> Git didn't _need_ to switch to SHA1DC, but they did because the cost was minimal and it's still a good idea to defend against known attacks.

I'm confused about how a SHA1 collision being found is an "attack" if git truly doesn't care about collision resistance.

> This is a general problem with binary files: they're hard to properly review. Having unreviewable files in a repository (binaries, machine-generated configs, etc.) is already a security problem; hash collisions would just be one (very difficult) way of exploiting that problem.

I don't think you can ignore the use case - people do check binaries into git with the expectation that git will keep track of them.

> Git was designed for collaboration between trusted parties; collaboration between untrusted parties (e.g. pulling changes from untrusted sources) is a much harder problem that git doesn't pretend to solve.

Maybe that is how git was designed. But it's not how git is used. People do pull from repos that they don't fully trust. Maybe just to examine a change before throwing it away. What people don't expect is that, by pulling from such a source, an unexpected file could get into their repository due to a collision attack. That is why git switched to SHA1DC: if git truly didn't support that use case, they wouldn't have needed to.

> Collision attacks are real, but they have yet to impact git (beyond adopting SHA1DC, I guess), despite how big of a target popular git repositories are.

I agree that collision attacks are real but aren't a practical issue yet. What I was responding to was your comment:

> I haven't heard of any second-preimage attacks against MD5, much less SHA-1, so mlindner was correct in asserting that MD5 would be fine (assuming 128 bits are enough). See also the analysis in [1].

In that comment, it seems that you were saying that collision attacks weren't a problem at all. But it seems like you are saying in your more recent comment that "collision attacks are real"?

eru · on Jan 3, 2023

> This is a general problem with binary files: they're hard to properly review. Having unreviewable files in a repository (binaries, machine-generated configs, etc.) is already a security problem; hash collisions would just be one (very difficult) way of exploiting that problem.

That's not a problem in general. Eg having a binary bmp in your repository is fine as far as reviews go.

eru · on Jan 3, 2023

> Git was designed for collaboration between trusted parties; [...]

No.

Git was designed for development of the Linux kernel. Contributors to the Linux kernel are generally not trusted.

eru · on Jan 1, 2023

> Git isn't relying on collision resistance, it's relying on second-preimage[0] resistance, which is to say: in order to sneak a hash collision into a git repository, you have to sneak _something else_ that's already trusted (e.g. via code review) into the repository; collisions can't (yet) be generated for arbitrary hashes.

Yes, I know. I was arguing against the more general claim that 'The use of SHA-1 in Git is not for security purposes,'.

Of course, for anything crypto-related we go by the maxim 'guilty until proven innocent'. MD5 might not have a published second-preimage attack yet, but it's broken enough that you shouldn't rely on it for anything anymore: it's not an acceptable crypto-hash, and if you don't need a crypto-hash, you can use something simpler like a CRC instead.
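
For contrast, a minimal sketch of what "don't need a crypto-hash" means in practice: a CRC catches accidental corruption, but an adversary gets a collision from a plain birthday search in well under a second. The inputs are pseudo-random 8-byte strings derived from a counter (an arbitrary choice for this sketch), and `find_crc32_collision` is a hypothetical helper.

```python
import hashlib
import zlib
from itertools import count


def find_crc32_collision():
    """Birthday search over pseudo-random 8-byte messages until two share a CRC-32."""
    seen = {}
    for n in count():
        # Derive a random-looking message from the counter (illustrative choice).
        msg = hashlib.sha256(b"%d" % n).digest()[:8]
        checksum = zlib.crc32(msg)
        if checksum in seen and seen[checksum] != msg:
            return seen[checksum], msg, n + 1
        seen[checksum] = msg


a, b, tries = find_crc32_collision()
print("two different messages, same CRC-32:", a.hex(), b.hex())
print("checksum", hex(zlib.crc32(a)), "found after", tries, "tries")
```

With a 32-bit checksum the search is expected to need on the order of 2^16 messages, and CRCs are linear, so collisions can also be constructed directly without any search.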