I have no idea how to interpret that output. "SHA1" is listed last under "Hashed Sub: preferred hash algorithms(sub 21)(5 bytes)" but I have no clue what that means. I created the key with a recent gpg using default settings this year.
That's your preference for what sort of hash you would like others to use when sending you a signed message. It is in order from most preferred to least preferred so SHA1 is last. Look for the "Hash alg" of the signature packets for your signing key and encryption subkey. Mine looks like this for both:
Strange that people are still implementing technology around SHA1. It is still challenging to find collisions, but nonetheless SHA1 is provably broken.
Just like security in the physical world, sometimes the weakness is not that great in practice. If you watch enough lockpicking videos on YouTube you may come to the same conclusion.
Would I still trust MD5 to be a strong check against random/non-malicious corruption? Yes. Would I trust it against malicious corruption? No.
I think a lot of semi-delusion happens around whether cryptographic hash functions are actually being used for integrity, and what kind of integrity that is supposed to mean.
I would argue against this two ways: First, if the only concern was bit-error detection, forensic tools would use a simple checksum. CRC, for example, is far faster to compute than any of these cryptographic functions. Second, one of the functions of evidential custody is to provide some defense against malicious tampering with evidence, even if unlikely. And for this purpose MD5 is clearly an inferior choice to a SHA2 function.
The real reason that forensics tools stick to MD5, I would contend, is a combination of the state of the art in filesystem forensics tools being surprisingly bad and bureaucratic lock-in related to documentation and interoperability (the chain of custody form asks for MD5, after all, not SHA512).
This kind of factors into one of my hills to die on, which is that careless or cargo-cult selection of hash functions causes problems. When you select a hash or integrity function for files, you need to make a clear decision about whether your goal is to resist collisions or simply to detect bit errors. You should then choose a function that is state of the art for one of those two categories. The whole line of reasoning behind "MD5 is good enough because we're not really using it for security" is just asking for trouble. When you split the difference you either have a bad integrity check, a bad security control, or more likely both (MD5).
> CRC, for example, is far faster to compute than any of these cryptographic functions.
...and offers far less protection even from random corruption; the biggest common CRC is 64 bits, which is really small in comparison to 128-bit MD5 or 160-bit SHA1.
I assume GP just used CRC as an example everyone is familiar with. A more modern non-cryptographic hash function like FNV or MurmurHash can produce more output bits (up to 1024 and 128 respectively), reducing random collision probability to a comparable level.
In addition, CRCs are actually quite good at detecting common errors in data transfer (not just single bit flips but burst errors as well). In fact you can prove that certain kinds of errors will always be detected by an n-bit CRC -- which is something you can't prove with modern cryptographic hash functions. Yeah they're useless against attackers but that's not the threat model if you're "not using MD5 for security".
If you're dealing with a situation in which errors occur at a rate high enough to see random collisions in even CRC16, you need FEC if not spread-spectrum techniques. CRCs can provably detect various error patterns of up to n bits, depending on the choice of polynomial... and CRC dates back many years and is far from the state of the art. MD5 can produce collisions with as little as 6 bits in error, far less than the 128-bit length would suggest.
Cryptographic hash functions are designed with different objectives and methods and are often less well suited, and almost always less well analyzed, for simple error-detection situations.
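If anyone wants a quick sanity check of the single-bit-error claim, here's a throwaway demo (Python, zlib.crc32) -- not a proof, just showing the guarantee in action:

    import zlib

    data = bytearray(b"a block of data crossing a noisy link")
    original = zlib.crc32(data)

    # Flip one bit anywhere in the payload: a CRC always detects
    # any single-bit error, so the checksum must change.
    data[7] ^= 0x01
    assert zlib.crc32(data) != original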
Since you're not doing any birthday attacks in checking for corruption, 64 bits is perfectly fine. Corruption checking with 64 bits is safer than many uses of md5 would be, even if md5 had no flaws.
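Back-of-the-envelope version of that (my own numbers, nothing more): when you check one file against one stored checksum there's no birthday effect, so a random corruption slips through with probability about 2^-k for a k-bit hash.

    # Chance that a randomly corrupted file still matches its stored k-bit
    # checksum (one file vs. one stored value, so no birthday effect).
    def undetected_corruption_probability(hash_bits):
        return 2.0 ** -hash_bits

    print(undetected_corruption_probability(64))   # ~5.4e-20 per check
    print(undetected_corruption_probability(128))  # ~2.9e-39 (MD5-sized output)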
I could understand if hashing is a large portion of their CPU time. But if it's not, why not compute MD5 and also something good and store both hashes?
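If the tooling exposes the hashing step at all, storing both is close to free, since you only read the data once. Rough sketch of what I mean (my own, not from any forensics product):

    import hashlib

    def dual_digest(path, chunk_size=1 << 20):
        """Read the file once, feeding an MD5 and a SHA-256 digest in parallel."""
        md5, sha256 = hashlib.md5(), hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
                sha256.update(chunk)
        return md5.hexdigest(), sha256.hexdigest()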
Yes this is a good caveat. When I've tried the OpenSSL implementations on older ARM chips (a Raspberry Pi Zero), I've seen MD5 come out on top. I'm not sure exactly why it switches.
I use SHA1 for duplicate detection on hard drives. It's more robust than MD5 and fast on modern processors. Nothing wrong with that. But I would not use it for crypto.
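For anyone curious, the general shape of that (a sketch, not my actual tooling) is just grouping files by digest:

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        """Group files under `root` by SHA-1 digest; buckets with >1 entry are duplicates."""
        by_digest = defaultdict(list)
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                h = hashlib.sha1()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                by_digest[h.hexdigest()].append(path)
        return {d: paths for d, paths in by_digest.items() if len(paths) > 1}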
Against someone malicious swapping out drives for a malicious end? Or as a guard against accidentally swapping out the wrong hard drive?
For the former, MD5 isn't secure (although it's not really an attack you can pull off after the fact). For the latter, I guess so, but it'd be hilarious if whatever attacker is causing you to do forensics made all the hard drives have the same hash to make your life more complicated.
Forensics work mostly involves people doing criminal or civil wrongs.
Network intrusions, or someone sending a fucked-up email at work, would require you to retrieve data, whether that's some logs or a full hard disk copy.
The idea is that I do analysis and can prove that what I started with (hash1) is the same as when I ended (hash2). If the hash changed, either there was a hardware failure (on write blockers) or a mistake was made when analysing the data.
If they are already good at keeping multiple points of view in mind (that MD5 can be broken for one application but still works for others), then they should also be able to reason that MD5 might work passably for this application, but it is time to move on.
That's kinda the issue: you only want to guarantee that the image didn't change (i.e., no one planted anything and the chain of evidence from the crime scene to your nice forensics lab is untampered with).
I think once someone makes a tool that allows you to add a file to an image and have it resolve to the same hash (anti-forensics as a field), it will probably change. Most tools do offer dual MD5 and SHA256 support though.
Obviously that example is a chosen prefix collision, but this data is coming from an untrusted source after all, so there’s really nothing to stop the attacker choosing the prefix in advance and then publicly shredding trust in the hash at a later date. In practice it sounds like you’d also have a hash of the complete file system, but at this point you’d have to question what advantage there is to using MD5 at all. Attacks never get worse, only ever get better, and the last thing you’d want is for the dam to burst during a lengthy and important investigation.
Being able to construct B such that hash(B) = hash(A) for some given document A is a second pre-image attack, and there is no viable second pre-image attack for SHA-1 or even MD5.
What does exist is collision, where you construct documents C and C' such that hash(C) = hash(C').
This means bad guys can persuade humans (and more importantly in most cases, machines) to sign seemingly innocuous document C, and then produce apparent proof that document C' was signed, even though the human / machine has never seen C' and would not have signed it.
For deduplicated storage you really want A=B <=> H(A)=H(B) to hold for all data you are going to see, so accessible constructions for collisions should already bar an algorithm from this application.
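And if you don't want to bet everything on that equivalence, the defensive variant is cheap: treat a digest match as a candidate and compare bytes before actually deduplicating. Toy sketch of what I mean (hypothetical, not any real product):

    import hashlib

    class DedupStore:
        """Toy content-addressed store that doesn't blindly trust H(A)=H(B) => A=B."""

        def __init__(self):
            self._blobs = {}  # digest -> blob

        def put(self, blob):
            digest = hashlib.sha256(blob).hexdigest()
            existing = self._blobs.get(digest)
            if existing is not None and existing != blob:
                # H(A) = H(B) but A != B: the property we rely on has failed.
                raise ValueError("hash collision detected")
            self._blobs[digest] = blob
            return digest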
I would, depending on what I was protecting. I use it every day in Git, for example, on projects where integrity is of moderate importance. Now, against a nation-state? No way.
Against nation state, that ship sailed a long time ago.
We're currently in the realm of 10-20 thousand dollars to make SHA-1 collisions. That's in the realm of an individual, albeit a reasonably (but not extremely) wealthy one. In 5-10 years it may be in the realm of J. random hacker.
I've never seen a colliding pair in a "true" hash function (not just CRC) be of differing lengths. They've always been a pair of blocks with very few differing bits, such that the differences "cancel out" in the computation and the internal state is the same regardless of which of the pair is run through it. I suspect trying to create a colliding pair with differing lengths is much more difficult, since SHA1 also includes the length as part of the calculation.
I'm a bit divided. On one hand I agree with you, and I even used MD5 for things that don't need security, but then I started thinking about it and came to the conclusion that you never know how your code might be exploited in the future.
It's not "strange"... earlier hash algorithms tend to enjoy better performance and compatibility, and there are still use cases for which SHA1 (and even MD5) aren't broken.
SHA-256 has special instructions in silicon for it. Performance is not a reasonable excuse. If performance mattered and you didn't need SHA's security properties, you would be using CRC32 or something.
But there are even better choices for non-cryptographic hash functions (FNV and MurmurHash for instance) because they were designed without the need to have a tradeoff between cryptographic security and efficiency.
You're assuming that all code performing SHA requires a "secure" hash.
It's only "broken" for security applications, and there are tons of others for which security is not even a consideration, or where collision is not possible due to input constraints.
There are hashes which require much less computing power and thus have better throughput if you don't require cryptographic security (examples include FNV-1 and MurmurHash). Using SHA-1 because you "don't need it for security" is still a poor choice.
I'd also argue that the default choice of hash function should always be a secure one until a strong counterargument is given, because it turns out that at some point most developers assume hashes have some minimal security properties (Linus always claimed that the choice of hash function in Git was not about cryptographic security -- and yet the security of the GPG signing of Git objects depends entirely on SHA-1 being cryptographically secure because Git signs the commit hash, and the BitKeeper attack against Linux's source tree only failed because the attack was done so poorly they didn't try to find a collision in the commit ID hash function). Better to be safe than sorry.
I agree SHA-1 is a poor choice for signing git objects, for reasons that should have been obvious when it was implemented.
Unfortunately I don't think the same developers who are assuming all hashes have some kind of magic security properties, are the ones who would follow your best practice for default choice of hash functions.
Also I wish some of those higher throughput hashes were available in e.g. python's hashlib by default. For applications that can include minimal or no external packages, hashlib.sha1 still turns out to be the best choice in certain cases.
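It's at least easy to measure what the stdlib gives you; a throwaway benchmark like this (numbers will obviously vary by machine and OpenSSL build):

    import hashlib
    import timeit

    payload = b"x" * (1 << 20)  # 1 MiB of dummy data
    for name in ("md5", "sha1", "sha256", "blake2b"):
        algo = getattr(hashlib, name)
        seconds = timeit.timeit(lambda: algo(payload).digest(), number=200)
        print(f"{name:8s} {seconds:.3f}s for 200 MiB")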
Someone in my company is using MD5 for passwords. This is recent software, like built in the last decade. Passwords were not a mystery to anyone when it was written, at least not anyone who cared about security.
I don't know what to say about it. It's not the first time something dumb has happened, and I'm tired of screaming at the brick wall.
Depends on how they use it. HMAC-MD5 is not broken, for example. MD5 could potentially be a building block for password hashing still (but it has to be a _building block_)
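For instance, PBKDF2 will happily use MD5 as its underlying HMAC hash, which is the "building block" sense. Illustrative only -- for anything new you'd reach for bcrypt/scrypt/Argon2 instead:

    import hashlib
    import os

    # MD5 as a *building block*: PBKDF2 with HMAC-MD5 underneath.
    # Shown only to illustrate the point; pick a modern KDF for real systems.
    salt = os.urandom(16)
    derived = hashlib.pbkdf2_hmac("md5", b"hunter2", salt, 200_000)
    print(derived.hex())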
This is one of the many ways that shows how broken (in the "defective software" sense, not necessarily directly the cryptographic sense, yet) the entire PGP/GPG ecosystem is.
Did my research here. I was expecting GnuPG to never emit a message including SHA1, just use it to verify signatures from the existing ones. A pretty sensible thing, AFAIK.
To my surprise, gpg2 added SHA1 as a valid MD algo in a new key because the specification says it MUST be supported and implementations should use it if no other is found in key preferences. However, it is the one with the lowest priority and will not be used with any key that was generated in the last decade or had its preferences changed to use another MD.
A key created with the default options:
gaius@baseship:/tmp$ mkdir newhome
gaius@baseship:/tmp$ gpg --homedir newhome --gen-key
gpg: WARNING: unsafe permissions on homedir '/tmp/newhome'
gpg (GnuPG) 2.2.4; Copyright (C) 2017 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
gpg: keybox '/tmp/newhome/pubring.kbx' created
Note: Use "gpg --full-generate-key" for a full featured key generation dialog.
GnuPG needs to construct a user ID to identify your key.
Real name: Gaius Baltar
Email address: baltar@defense.caprica.kob
You selected this USER-ID:
"Gaius Baltar <baltar@defense.caprica.kob>"
Change (N)ame, (E)mail, or (O)kay/(Q)uit? o
We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
gpg: /tmp/newhome/trustdb.gpg: trustdb created
gpg: key 0832341A4A0CE664 marked as ultimately trusted
gpg: directory '/tmp/newhome/openpgp-revocs.d' created
gpg: revocation certificate stored as '/tmp/newhome/openpgp-revocs.d/C64E6636890D8BAACA843AFC0832341A4A0CE664.rev'
public and secret key created and signed.
pub rsa3072 2020-10-24 [SC] [expires: 2022-10-24]
C64E6636890D8BAACA843AFC0832341A4A0CE664
uid Gaius Baltar <baltar@defense.caprica.kob>
sub rsa3072 2020-10-24 [E] [expires: 2022-10-24]
gaius@baseship:/tmp$ gpg --homedir newhome --edit-key baltar
gpg: WARNING: unsafe permissions on homedir '/tmp/newhome'
gpg (GnuPG) 2.2.4; Copyright (C) 2017 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Secret key is available.
gpg: checking the trustdb
gpg: marginals needed: 3 completes needed: 1 trust model: pgp
gpg: depth: 0 valid: 1 signed: 0 trust: 0-, 0q, 0n, 0m, 0f, 1u
gpg: next trustdb check due at 2022-10-24
sec rsa3072/0832341A4A0CE664
created: 2020-10-24 expires: 2022-10-24 usage: SC
trust: ultimate validity: ultimate
ssb rsa3072/89519D45B7AECE4D
created: 2020-10-24 expires: 2022-10-24 usage: E
[ultimate] (1). Gaius Baltar <baltar@defense.caprica.kob>
gpg> showpref
[ultimate] (1). Gaius Baltar <baltar@defense.caprica.kob>
Cipher: AES256, AES192, AES, 3DES
Digest: SHA512, SHA384, SHA256, SHA224, SHA1
Compression: ZLIB, BZIP2, ZIP, Uncompressed
Features: MDC, Keyserver no-modify
gpg>
I don't know that the hash preferences in the public key are relevant to the issue talked about in the linked email thread. Currently GPG (and Sequoia) will reject SHA1 signatures on keys if those signatures claim to be recent.
The preferences would allow you to send me two messages whose signatures used the same SHA1 hash. I would never know that it had happened, as I would never see the hash, only the valid signature. Both signatures would be logically valid as they both came from your deliberate action.
Au contraire, this is a good example of how well the PGP/GPG ecosystem deals with the realities of a long term cryptography standard. People with SHA1 hashes in their signatures have a reasonable way to transition away from those signatures while maintaining the reputation of those identities. There really isn't anything out there as good.
I should also point out that the attack this is addressing (Shambles) has close to no practical implications. So this is a good example of how the PGP/GPG ecosystem takes security breakage very seriously.
I don't think git is using SHA1 for security, though (outside of potentially when signing commits, but that's generally done with gpg, which you drew a distinction with). Assuming you're talking about commit hashes, I'm not particularly worried about the threat of someone crafting a commit with a malicious hash for my repo, because I have much bigger issues than commit hash duplicates if someone malicious is able to commit to my repo. Given that, SHA1 is just as safe to use for my commit hashes as it's always been; the fact that crafting collisions on purpose has gotten easier doesn't make accidental collisions any more likely to happen.
How would hashing twice help with a collision? Surely if your hash function is A(C()) then if you find a collision in C() then that is also a collision in A(C())?
Also, just saying “we should use 100 year crypto” is about as useful as saying “we should hurry up and cure cancer”.
Nobody has a provably secure hashing function. It's actually a huge problem in cryptography that a lot of the time hashing is modelled as a perfect random oracle, and there are schemes which are secure only if modelled with a random oracle, but insecure when instantiated with any concrete hashing function.
Probably means (A(text) XOR C(text)) or something. Then you would need to find a collision in both algorithms for the same input and that will be arguably much harder.
I.e. just because someone demonstrates a lack of knowledge in his/her post, you can still show some flexibility for positive interpretation ;).
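Concretely, I read the suggestion as something like this (a sketch; I'm picking SHA-256 and BLAKE2s only because they're both 256-bit and in the stdlib, not because they're the "American" and "Chinese" designs):

    import hashlib

    def xor_combined(data):
        """XOR of two independent 256-bit digests, per the reading above."""
        a = hashlib.sha256(data).digest()
        c = hashlib.blake2s(data).digest()  # also 32 bytes by default
        return bytes(x ^ y for x, y in zip(a, c)).hex()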
If you xor the hashes together, it may be as weak as the weaker of the two functions, though it probably depends on their structure.
Getting much stronger hashing with double the compute time is easy. Double the number of rounds - most algorithms have about 30-60% security margins. It would probably avoid any breaks of the hash (so 80 bits for SHA1), though it can't be enough if the hash is simply too small (64-bit security for MD5/4/2).
The two collisions are not independent of each other, so it's more like finding the collision in whichever function is stronger. (I.e. finding a collision in md5 + sha1 is roughly as hard as finding a collision in just sha1.)
In the past (long enough ago that using md5 almost made sense), I used a combination of md5 and sha1 to verify file integrity. Both checksums had to match or the file was rejected.
It seems obvious that that's at least as strong as using sha1 by itself. I've always assumed it was significantly stronger than sha1 by itself, but you're saying it isn't.
Creating an md5 collision isn't terribly difficult these days, but I've never heard of an md5 collision that's also a sha1 collision.
Can you cite a source with more information on this?
(I'm not doubting what you say, I'd just like more information.)
Although looking over this again, there are some caveats: if both hash functions suffer from shortcut attacks, you might not be able to combine both shortcut attacks, depending on the details of those attacks.
Regardless, in the sha1+md5 case, even if you can't combine both shortcut attacks (big if - not sure if that's true or not), the complexity of the sha1 collision is roughly the same as a birthday attack on md5, so I think you still end up with close to the same big-O as if you could combine both shortcut attacks.
We're remarkably close--ENIAC was first powered up in 1946. Kinda cool when you think about it--electronic computers are WWII-era technology and a lot of us reading this now will see their 100-year anniversary.
> But as far as I know, any two different hashing schemes applied sequentially has never been broken.
Suppose you used SHA256(MD5(x)) as your sequential hashing function.
If I can find two messages m1, and m2 that have the same MD5 digest, the outer SHA256 will also collide. MD5 is sufficiently broken that finding such messages is easy: https://github.com/thereal1024/python-md5-collision
Two different hashing schemes applied in parallel is better. However even there many uses of hashing functions require all the bits in the output digest to be equally unpredictable.
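To make the nesting point concrete, here's a minimal sketch of why the outer hash buys nothing; the colliding inputs themselves would come from tooling like the repo linked above:

    import hashlib

    def nested(data):
        # SHA256 over the 16-byte MD5 digest: the SHA256(MD5(x)) construction above.
        return hashlib.sha256(hashlib.md5(data).digest()).hexdigest()

    def outer_collides(m1, m2):
        """Any MD5-colliding pair m1 != m2 collides under the nested hash too."""
        assert m1 != m2
        assert hashlib.md5(m1).digest() == hashlib.md5(m2).digest()
        return nested(m1) == nested(m2)  # necessarily True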
Two hashing schemes in parallel has problems too. Suppose your parallel hash is SHA3+Ident, where Ident is either the first 100 bytes of the message or the message padded with NUL up to 100 bytes. Because Ident has no preimage resistance, the combo SHA3+Ident has very poor preimage resistance too.
By "in parallel" I don't mean XOR. I mean concatenation. So in my example sha256+md5 would concatenate sha256's 256 bit digest, and md5's 128 bit digest, to form a 384 bit digest.
The intuitive explanation for why XOR instead of concatenation doesn't work, is it lets the output of the broken hash function influence the output of the remaining good hash function. That is obviously cause for concern!
> The intuitive explanation for why XOR instead of concatenation doesn't work, is it lets the output of the broken hash function influence the output of the remaining good hash function. That is obviously cause for concern!
The bits of a good hash function should be random. XORing with some unrelated-ish thing should be no problem. So that's not enough of an explanation.
I still don't understand what you're claiming is wrong with XOR. As I understand it, the XOR of two secure hashes should be at least as strong as each one. Of course you can find ways to deliberately come up with a broken scheme if you really want to do something spectacularly dumb (like XORing a variation of the same algorithm with itself), at which point the problem isn't really XOR. But assuming you're being reasonable and showing some sense in the process, the likelihood of someone somehow finding an exploitable correlation between the two without breaking either one should be astronomically small, to the point of not warranting a second thought.
Just as a mental exercise (I'm no expert), how about: GoodHash = HF(input) + HF(reversed-input)?
Your final hash value will still be twice as big, yes, but reversing the input should yield a different result? This is of course for non-crypto usage.
If one of your hashes is iterated, you also have the multicollision problem: once you find a collision in your weak, iterated hash, you can quickly produce more collisions, so that the strength of your construction just devolves to that of the stronger hash.
I believe he is responding to "Two different hashing schemes applied in parallel is better."
My understanding is if you have a collision in the stronger function, you have an infinite family of collisions in the stronger function due to the length extension attack, from there you should be able to base the collision for the weaker function on one of the stronger function's (length extended) collisions.
This is just as weak as C. Any collision in C is by definition a collision in A(C(data)).
You probably meant A(data) ^ C(data) [XOR], which is mostly only as strong as the stronger of A and C. Google multicollisions in iterated hash functions for more details.
Hashing twice doesn't actually help in many cases that hashes are used. And I don't think there is a such thing as a "provably secure hashing algorithm", although I would very much be interested in learning about the topic if you have a reference :)
If the point of using two hashes is resistance against people breaking one of the hashing algorithms used, you’re always going to have to pick more than one algorithm to get that benefit. However, I don’t think nesting the calculations is actually necessary: Blake2(msg)+sha512(msg) would give you the same protection.
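In stdlib terms that's just concatenating the two digests, so a forger would need a simultaneous collision in both. Sketch:

    import hashlib

    def combined_digest(msg):
        """BLAKE2b digest followed by SHA-512 digest (64 + 64 = 128 bytes)."""
        return hashlib.blake2b(msg).digest() + hashlib.sha512(msg).digest()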
> Therefore, why not get the Americans to develop their best hashing algorithm, the Chinese to develop their best algorithm, and then use A(C(data)).
Would it not be better to report both A(data) and C(data) (or maybe all three), so you have to simultaneously find a collision in both? Using A(C(data)), finding a collision in C(data) gives a collision in A(C(data)) as well.
Further, there are other considerations - it doesn't matter how secure the hash is if it's unusably slow.
Depends what you mean by "provably secure." Hashing algorithms are usually only practical because of what compromises they make to achieve "good enough."
- At the most basic level, (practical) hashing cannot map every input to a distinct output, since consistent output length is usually highly desirable. This means that for sufficiently long inputs you will get collisions. What often determines how good a hash is, is how hard those collisions are to find.
- If your inputs map 1:1 to outputs, rainbow tables become harder but you open yourself up to a number of other possible attacks starting with the obvious length analysis.
- Hashing also often has to be very performant, which leads to its own compromises, but in many applications (like message digests) throughput performance is given a lot of weight.
In the most formal way, I guess it's possible to have a provably secure hash for all possible attacks, but this would most likely require a 1:1 mapping between inputs and outputs. This would be impractical for a few reasons.
The output has to be indistinguishable from random, and colliding n bits needs to take an exponential amount of time.
Output length can be anything. As long as it's indistinguishable from random, the chance of collision will be exponentially rare, and you can truncate or concatenate to get the length you want.
You're correct on all points, though this kind of implementation (with variable-length output) is fairly rare AFAIK. Having a fixed-length output lets you reliably define for example a database field length.
I'd love to see a practical implementation of that though, if you know where to find that kind of project off-hand!
Ya, there are also speed considerations when it comes to hashing that are more important than total security. If your mobile phone catches on fire, or your bandwidth is cut by half, the extra security isn't worth the trade.
Also go back to 1920 and design a public crypto system that can't be broken today.