Hacker News new | past | comments | ask | show | jobs | submit login
The long tail of MD5 (tedunangst.com)
80 points by zdw on Dec 4, 2014 | hide | past | favorite | 56 comments



These days, I find using MD5 absolutely inexcusable. SHA-1 is available pretty much everywhere, SHA-2 algorithms are quite readily available too. BLAKE2 (https://blake2.net) is slightly faster than MD5, and significantly more secure. It's based on the SHA-3 finalist BLAKE. Additionally, there are versions of the algorithm that parallelise using SIMD, and there are both 32-bit and 64-bit optimised versions too. It also gives you the possibility of customising your hash, using it as an HMAC at no extra cost, salting it, or adding a personalisation key to effectively have different hash functions for different purposes. As a nice extra, it also uses a third less RAM than SHA-2 or SHA-3. If output length is a concern, truncating the output (with a corresponding increase in the likeliness of collisions) is perfectly fine with any good hash function.


MD5 also remains quite popular for file hashing/hash set analysis. I.e., is this file a member of a known set? For example, NIST's NSRL still includes MD5 hashes.

* This is not an endorsement of MD5 for such use cases.


Was going to make this point. But lets instead talk about the endorsement aspect, because lots of people in the forensic community take the "its popular so i use them, but i neither endorse nor oopose" position.

There is certainly risk of hash collisions in files, but most attacks require you to generate two different files with the same hash? Do people think a pre-image attack is feasible? If feasible, at what scale? What type of actor can give me a file that has different content, but the same hash as winword.exe. If they can, what is the likelihood that said file will also be executable or contain a linkable library

*similarly, and hypocritically, not an endorsement of md5


Do the security problems with MD5 make it bad as a hashing algorithm?


Not unless hash collisions are also a security issue (see http://bugs.python.org/issue13703). But it's much slower than a traditional hashtable hashing function and not much slower than a sha family hash.

Maybe that's the sweet spot for your project, but I think most non-security uses of MD5 are because it already exists in most languages and is pretty universal


Depends. Can an attacker introduce files? If so, he can probably introduce collisions. In most uses of a hash, that's an issue—a performance (DoS) issue if nothing else.

If the application was just looking for a longer CRC/Adler32/etc. and not depending on the hash to be strong against attack, then MD5 wasn't an appropriate choice in the first place, as its much slower than need be. But the security problems are irrelevant.


Do you know the contexts your hashing requirement is likely to be used in, and/or may be adapted to?

The worst and most persistent security problems emerge when someone defends half-assery as acceptable because, take your pick, it's a quick hack / it's a personal project / it's a small project / it's not used for critical infrastructure.

Until eventually is one or more of the above are violated.

The primary advantage of MD5 is that the hashes are (slightly) shorter than those of other checksum methods. This makes it slightly more convenient to manually compare or transmit hashes.

My problem is that I happen to have used md5sum so often and for so long that it's wired into my own wetware and muscle-memory, and I'm not sure which of the alternative SHA sums I should use, and which of those are widely available. I honestly don't know the answer to that off the top of my head, and "what to use instead of md5sum" as a DDG or Google search doesn't turn up a clearly useful answer. "sha1 sha256 sha384 sha512 which to use" does better.

And locally I've got utilities for SHA 1, 224, 256, and 512 installed, that I can tell.

Looks as if SHA-2 and SHA-3, which include keylengths of 224-512, are considered secure:

http://en.wikipedia.org/wiki/Secure_Hash_Algorithm

Then there are some openssl utilities. But let's not go there.


Generally the tuple (<length>,<hash>) is unique since the attacks on MD5 all involve changing the number of octets in the hashed source. That said, it isn't secure if you can change the length independently of changing the hash. Hence the challenge of using it cryptographically. If you're confident you know the correct 'length' value (acts as a sort of openly shared secret in this case) then you trust the hash.


I believe this is wrong, and in this case it's very dangerous information. See for example the Wikipedia article for two chunks of data that only differ in a few bits (not length!) and hash the to same MD5 hash: https://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities


Not necessarily, but if that's your use case (and the hashing speed isn't extremely critical), why not use a better algorithm with a larger output?


Are collisions really that much of a concern (other than for security reasons)? IIRC they are still incredibly rare even with older hashing algorithms like MD5.


Yes, if you don't care about collisions* or being able to reverse the hash (a pre-image attack) then there's nothing wrong with MD5 per se. However, MD5 was designed as a secure algorithm, and makes tradeoffs, mainly in speed, to that effect. If you don't need security, you should use something designed for that, like CRC.

* note that even if it isn't immediately obvious, in most situations you really don't want collisions to be an easy thing to cause. For example, many algorithms and structures involving hashes (even a simple in-memory hash map) suffer massively degraded performance when there's a large number of collisions. If your user can craft an input that causes operations to take orders of magnitude longer, that's a denial of service attack.


Collisions render hash functions useless as a component of most (but not all) digital signature algorithms


Good article that really shows why design decisions can have impact 20, 30 years or more.

This is why I get fairly upset when people design something new I n 2014 that uses md5. Yes even if your application does not use md5 for anything security related, the mere fact that you use a bad, slow algorithm should be considered wrong. And adding new dependencies will make it harder for us to migrate away and extend that tail even further.

If you need a good hash function that is fast, use SipHas. If you need a really secure one, use SHA-512, SHA-3, BLAKE or just SHA-256.


Do NOT use SipHash as a cryptographic hash function. It is designed to be a PRF, and its output length is way too small to make it collision-resistant when used as a hash. SHA-3, SHA-512/256 or BLAKE(2) (in increasing order of performance) are suitable cryptographic hash functions.


I think you meant: it [SipHash] is NOT designed to be a PRF.


No, I meant what I wrote. SipHash is fine as a keyed primitive -- 128-bit security against key recovery, (64 - t)-bit security against 2^t forgery attempts.

However, as a hash function (e.g., with a fixed key), it is entirely too weak: 2^32 work for a collision, and 2^64 work for arbitrary (second-)preimages.


OK... so if it's suitable as a PRF, then you should be able to widen the output by using two different keys and concatenating the results, no?


Unfortunately not: it's only 64 times slower to generate 2^64 collisions for hash function by generic attacks than just one, and those 2^64 messages will contain a collision for another hash function of length 128. The total time is only the sum of the times for each one.


The attack you describe---Joux's multicollisions, as far as I can tell---applies to Merkle-Damgard functions. SipHash is a spongy, wide-pipe, design where that sort of thing doesn't work. The generic bound is (k!)^(1/k) ⋅ 2^(n(k-1)/k)) work to find a k-collision, or approximately k ⋅ 2^n for large k.


Ah, that makes sense. Thanks!


I believe the spirit of the original poster's message was that SipHash is not strong enough to be used for cryptographic purposes.


No, SipHash is acceptable cryptographic PRF; it's just not acceptable as a standalone general-purpose cryptographic hash.

A PRF is secure as long as attackers don't get to know exactly how it works, which, with SipHash, they don't if they don't know the key used for it.

A cryptographic hash function is secure --- it's Hard to find two inputs that produce the same output, it's (even) Hard(er) to find a specific input that generates a given output --- even when it's not keyed.

SipHash is a cryptographic building block that is performant in part because it takes advantage of relaxing the constraint of being a good cryptographic hash function, and only tries to be a good cryptographic PRF.


Makes sense, thanks for the clarification.


Slow? I thought md5 was faster than the SHA-* family


Yes, way faster. It would be great if it weren't for those pesky collisions.


Collisions aren't a concern unless you're hashing trillions of records. It's not really a concern for the vast majority of software projects.


Collisions are a concern if an attacker has any input into what you're hashing, and can therefore induce collisions. MD5 is broken in that there are real-world attacks by which an attacker can do so.


Content-MD5 has been removed from HTTPbis.

Don't forget VBA digital signatures BTW, for which MD5 is the only choice. I wonder how feasible a collision attack would be.


Yeah a couple years ago I was trying to determine whether to support Content-MD5 in some HTTP software I was writing I could find nobody actually using it. There was an attempt a decade ago to make Firefox validate with it, but that died: https://bugzilla.mozilla.org/show_bug.cgi?id=232030

Also the author is wrong that there isn't a proposed replacement -- it's the "Digest:" header (RFC 3230) although I don't think anybody uses that either.


Id hazard the most prolific use case of md5 as http checksum is AWS S3. You can, IIRC, send content-md5 on PUT and POST and theyll reject with a 400 if the body hash doesnt match. Conversely the default etag is the body md5.


Want to kill MD5?

Provide public-domain easy-to-compile/use versions for all languages, and furthermore, get their google page ranks high.

Do not underestimate laziness. If Joe Random can find a suitable MD5 algorithm in 10 seconds but it takes 30 seconds to find a suitable SHA algorithm, guess which one gets used?


Honestly, this is not so much a problem. Cryptographers like to write simple C public domain implementations of their algorithms and then people go to work for months and years squeezing performance out of them, cryptanalyzing them, recoding them in various languages and releasing their own implementations under new licenses. You can see this happening with pretty much every newly designed crypto component from hashes to ciphers to entire crypto systems.

The problem is exactly the one elucidated in the blog post above: the long tail of baked in brokenness. These systems were never designed to be extensible, they are cooked into code that hasn't been serviced in years, perhaps even decades in some cases. They're bolted into specifications in ways that either obsolete the technology completely, or make it so incredibly complicated to update the technology that doing so outweighs the apparent cost. And that's without considering problems like deployment and phase-out.

These types of problems make it very likely we will be stuck with the stupidity of DES and MD5 in strange places until it becomes a fire drill and then all the sudden people will be baking in SHA-1/SHA-2 or BLAKE2 and we'll be going through these very same motions again in 5-15 years, wondering why we didn't learn from the mistakes we made last time.


HMAC MD5 is a perfectly reasonable choice for some situations.


A situation like not caring.


Everybody knows that MD5 is as terribly useless as ROT13

Huh?



Collision attacks are trivial. It's been almost ten years since Ron Rivest declared MD5 broken.


Well, ROT13 is at least immune to collision attacks.


Generating a collision for the "encrypted" phrase "unpx" is not hard.


It's not hard--it's impossible. The only value that ROT13 returns "unpx" for is "hack." There is no possible collision.


Congratulations, you found a collision.

What's meant by that is if you can find a source string that hashes to the same thing. Using ROT13 is pointless since it's trivial to generate those.

MD5 is only slightly harder.


Reproducing the message is not a collision. In a collision attack, you find a message that produces the same hash as another message. You can't take the exact same message and call it a collision. ROT13 is immune to collision attacks by its nature, and the humor is in the fact that its "immunity" comes from not being a one-way hash. It's immune...but it's also practically useless for obfuscating the message, thus it doesn't protect your message at all. Thanks for taking great pains to make sure the joke got ruined.


I know, but this does not make MD5 as broken as ROT13.


Considering that the sentence continues with something that is clearly a joke (the SHA-3 standardisation isn't finalized yet, let alone wasn't thought of 20 years ago), some hyperbole is to be expected.

And the point is fair, there's not really anything today where MD5 is an obvious choice.


Depends on what you place in the expression broken. Md5 is useless in anything remotely related to security. An as a non secure function it is slow.


> Md5 is useless in anything remotely related to security

There are serious flaws with MD5, and I would never recommend its use today. But there is no known attack that can generate a malicious binary with an arbitrary hash. The attacker needs to control both files to generate collisions, which is a huge limitation.

Maybe those attacks will exist some day. But I wouldn't be setting fire to all your legacy software without first understanding exactly how they use MD5, and if that is vulnerable to any known attack.


Given there are SHA256 engines that produce terahashes per second, you could probably do tens to hundreds of terahashes depending on sophistication.

At that point even a sloppy brute-force approach would have a high probability of a collision on at least one key of a sufficiently large key-set. Against a specific target it may or may not be viable.


I think with practical chosen prefix attacks this may have changed.


As a non secure function its "killer app" is being everywhere on every platform and being massive overkill for many applications.


Rot13 is actually useful. I have perl scripts which download my bank statements, running from cron. Gets the latest every month. These perl scripts use passwords, but without knowing that they are there, who would find them?

The trouble is, they have to pass the name of the form field to the server on the other side, which is some variation of "password". So anyone doing a simple string search will eventually find the perl scripts, and have access to my passwords (supposing they have root).

If I rot13 the form field name, and then have the name be rot13('cnffjbeq') in the script, this makes is unsearchable.


Note to self: add "cnffjbeq" to my password-finder scripts.


Might as well add rot-1 through rot-25 to it as well... not like there's a storage constraint, is there?


That's pretty heavy security through obscurity. A more robust solution would be to have all of those passwords encrypted with a master key. You can make the script prompt you for the master key when you start it up, then it runs without needed any more input, but you aren't storing sensitive information.


Yes.

A master key. Which would have to be included in the perl script... so that when they find the script, they have all my passwords instead of just a few. You've solved it. Why didn't I think of that?

> You can make the script prompt you for the master key when you start it up,

Why the fuck would I want to sit around being a meat robot inputting 9 passwords at the beginning of each month just so I can have a copy of my bank statements and electric bill?

The whole point was that this was in a crontab.


You're deluding yourself. Security through obscurity is barely worth the effort.

If you want this done in a fashion that's secure, deploy it on a small device like a Raspberry Pi and firewall that thing so aggressively you can barely get in. It's hard to hack what you can't connect to.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: