These days, I find using MD5 absolutely inexcusable.
SHA-1 is available pretty much everywhere, SHA-2 algorithms are quite readily available too.
BLAKE2 (https://blake2.net) is slightly faster than MD5, and significantly more secure. It's based on the SHA-3 finalist BLAKE. Additionally, there are versions of the algorithm that parallelise using SIMD, and there are both 32-bit and 64-bit optimised versions too.
It also gives you the possibility of customising your hash, using it as an HMAC at no extra cost, salting it, or adding a personalisation key to effectively have different hash functions for different purposes.
As a nice extra, it also uses a third less RAM than SHA-2 or SHA-3.
If output length is a concern, truncating the output (with a corresponding increase in the likeliness of collisions) is perfectly fine with any good hash function.
MD5 also remains quite popular for file hashing/hash set analysis. I.e., is this file a member of a known set? For example, NIST's NSRL still includes MD5 hashes.
* This is not an endorsement of MD5 for such use cases.
Was going to make this point. But lets instead talk about the endorsement aspect, because lots of people in the forensic community take the "its popular so i use them, but i neither endorse nor oopose" position.
There is certainly risk of hash collisions in files, but most attacks require you to generate two different files with the same hash? Do people think a pre-image attack is feasible? If feasible, at what scale? What type of actor can give me a file that has different content, but the same hash as winword.exe. If they can, what is the likelihood that said file will also be executable or contain a linkable library
*similarly, and hypocritically, not an endorsement of md5
Not unless hash collisions are also a security issue (see http://bugs.python.org/issue13703). But it's much slower than a traditional hashtable hashing function and not much slower than a sha family hash.
Maybe that's the sweet spot for your project, but I think most non-security uses of MD5 are because it already exists in most languages and is pretty universal
Depends. Can an attacker introduce files? If so, he can probably introduce collisions. In most uses of a hash, that's an issue—a performance (DoS) issue if nothing else.
If the application was just looking for a longer CRC/Adler32/etc. and not depending on the hash to be strong against attack, then MD5 wasn't an appropriate choice in the first place, as its much slower than need be. But the security problems are irrelevant.
Do you know the contexts your hashing requirement is likely to be used in, and/or may be adapted to?
The worst and most persistent security problems emerge when someone defends half-assery as acceptable because, take your pick, it's a quick hack / it's a personal project / it's a small project / it's not used for critical infrastructure.
Until eventually is one or more of the above are violated.
The primary advantage of MD5 is that the hashes are (slightly) shorter than those of other checksum methods. This makes it slightly more convenient to manually compare or transmit hashes.
My problem is that I happen to have used md5sum so often and for so long that it's wired into my own wetware and muscle-memory, and I'm not sure which of the alternative SHA sums I should use, and which of those are widely available. I honestly don't know the answer to that off the top of my head, and "what to use instead of md5sum" as a DDG or Google search doesn't turn up a clearly useful answer. "sha1 sha256 sha384 sha512 which to use" does better.
And locally I've got utilities for SHA 1, 224, 256, and 512 installed, that I can tell.
Looks as if SHA-2 and SHA-3, which include keylengths of 224-512, are considered secure:
Generally the tuple (<length>,<hash>) is unique since the attacks on MD5 all involve changing the number of octets in the hashed source. That said, it isn't secure if you can change the length independently of changing the hash. Hence the challenge of using it cryptographically. If you're confident you know the correct 'length' value (acts as a sort of openly shared secret in this case) then you trust the hash.
I believe this is wrong, and in this case it's very dangerous information. See for example the Wikipedia article for two chunks of data that only differ in a few bits (not length!) and hash the to same MD5 hash:
https://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities
Are collisions really that much of a concern (other than for security reasons)? IIRC they are still incredibly rare even with older hashing algorithms like MD5.
Yes, if you don't care about collisions* or being able to reverse the hash (a pre-image attack) then there's nothing wrong with MD5 per se.
However, MD5 was designed as a secure algorithm, and makes tradeoffs, mainly in speed, to that effect. If you don't need security, you should use something designed for that, like CRC.
* note that even if it isn't immediately obvious, in most situations you really don't want collisions to be an easy thing to cause. For example, many algorithms and structures involving hashes (even a simple in-memory hash map) suffer massively degraded performance when there's a large number of collisions. If your user can craft an input that causes operations to take orders of magnitude longer, that's a denial of service attack.
Good article that really shows why design decisions can have impact 20, 30 years or more.
This is why I get fairly upset when people design something new I n 2014 that uses md5. Yes even if your application does not use md5 for anything security related, the mere fact that you use a bad, slow algorithm should be considered wrong. And adding new dependencies will make it harder for us to migrate away and extend that tail even further.
If you need a good hash function that is fast, use SipHas. If you need a really secure one, use SHA-512, SHA-3, BLAKE or just SHA-256.
Do NOT use SipHash as a cryptographic hash function. It is designed to be a PRF, and its output length is way too small to make it collision-resistant when used as a hash. SHA-3, SHA-512/256 or BLAKE(2) (in increasing order of performance) are suitable cryptographic hash functions.
No, I meant what I wrote. SipHash is fine as a keyed primitive -- 128-bit security against key recovery, (64 - t)-bit security against 2^t forgery attempts.
However, as a hash function (e.g., with a fixed key), it is entirely too weak: 2^32 work for a collision, and 2^64 work for arbitrary (second-)preimages.
Unfortunately not: it's only 64 times slower to generate 2^64 collisions for hash function by generic attacks than just one, and those 2^64 messages will contain a collision for another hash function of length 128. The total time is only the sum of the times for each one.
The attack you describe---Joux's multicollisions, as far as I can tell---applies to Merkle-Damgard functions. SipHash is a spongy, wide-pipe, design where that sort of thing doesn't work. The generic bound is (k!)^(1/k) ⋅ 2^(n(k-1)/k)) work to find a k-collision, or approximately k ⋅ 2^n for large k.
No, SipHash is acceptable cryptographic PRF; it's just not acceptable as a standalone general-purpose cryptographic hash.
A PRF is secure as long as attackers don't get to know exactly how it works, which, with SipHash, they don't if they don't know the key used for it.
A cryptographic hash function is secure --- it's Hard to find two inputs that produce the same output, it's (even) Hard(er) to find a specific input that generates a given output --- even when it's not keyed.
SipHash is a cryptographic building block that is performant in part because it takes advantage of relaxing the constraint of being a good cryptographic hash function, and only tries to be a good cryptographic PRF.
Collisions are a concern if an attacker has any input into what you're hashing, and can therefore induce collisions. MD5 is broken in that there are real-world attacks by which an attacker can do so.
Yeah a couple years ago I was trying to determine whether to support Content-MD5 in some HTTP software I was writing I could find nobody actually using it. There was an attempt a decade ago to make Firefox validate with it, but that died:
https://bugzilla.mozilla.org/show_bug.cgi?id=232030
Also the author is wrong that there isn't a proposed replacement -- it's the "Digest:" header (RFC 3230) although I don't think anybody uses that either.
Id hazard the most prolific use case of md5 as http checksum is AWS S3. You can, IIRC, send content-md5 on PUT and POST and theyll reject with a 400 if the body hash doesnt match. Conversely the default etag is the body md5.
Provide public-domain easy-to-compile/use versions for all languages, and furthermore, get their google page ranks high.
Do not underestimate laziness. If Joe Random can find a suitable MD5 algorithm in 10 seconds but it takes 30 seconds to find a suitable SHA algorithm, guess which one gets used?
Honestly, this is not so much a problem. Cryptographers like to write simple C public domain implementations of their algorithms and then people go to work for months and years squeezing performance out of them, cryptanalyzing them, recoding them in various languages and releasing their own implementations under new licenses. You can see this happening with pretty much every newly designed crypto component from hashes to ciphers to entire crypto systems.
The problem is exactly the one elucidated in the blog post above: the long tail of baked in brokenness. These systems were never designed to be extensible, they are cooked into code that hasn't been serviced in years, perhaps even decades in some cases. They're bolted into specifications in ways that either obsolete the technology completely, or make it so incredibly complicated to update the technology that doing so outweighs the apparent cost. And that's without considering problems like deployment and phase-out.
These types of problems make it very likely we will be stuck with the stupidity of DES and MD5 in strange places until it becomes a fire drill and then all the sudden people will be baking in SHA-1/SHA-2 or BLAKE2 and we'll be going through these very same motions again in 5-15 years, wondering why we didn't learn from the mistakes we made last time.
Reproducing the message is not a collision. In a collision attack, you find a message that produces the same hash as another message. You can't take the exact same message and call it a collision. ROT13 is immune to collision attacks by its nature, and the humor is in the fact that its "immunity" comes from not being a one-way hash. It's immune...but it's also practically useless for obfuscating the message, thus it doesn't protect your message at all. Thanks for taking great pains to make sure the joke got ruined.
Considering that the sentence continues with something that is clearly a joke (the SHA-3 standardisation isn't finalized yet, let alone wasn't thought of 20 years ago), some hyperbole is to be expected.
And the point is fair, there's not really anything today where MD5 is an obvious choice.
> Md5 is useless in anything remotely related to security
There are serious flaws with MD5, and I would never recommend its use today. But there is no known attack that can generate a malicious binary with an arbitrary hash. The attacker needs to control both files to generate collisions, which is a huge limitation.
Maybe those attacks will exist some day. But I wouldn't be setting fire to all your legacy software without first understanding exactly how they use MD5, and if that is vulnerable to any known attack.
Given there are SHA256 engines that produce terahashes per second, you could probably do tens to hundreds of terahashes depending on sophistication.
At that point even a sloppy brute-force approach would have a high probability of a collision on at least one key of a sufficiently large key-set. Against a specific target it may or may not be viable.
Rot13 is actually useful. I have perl scripts which download my bank statements, running from cron. Gets the latest every month. These perl scripts use passwords, but without knowing that they are there, who would find them?
The trouble is, they have to pass the name of the form field to the server on the other side, which is some variation of "password". So anyone doing a simple string search will eventually find the perl scripts, and have access to my passwords (supposing they have root).
If I rot13 the form field name, and then have the name be rot13('cnffjbeq') in the script, this makes is unsearchable.
That's pretty heavy security through obscurity. A more robust solution would be to have all of those passwords encrypted with a master key. You can make the script prompt you for the master key when you start it up, then it runs without needed any more input, but you aren't storing sensitive information.
A master key. Which would have to be included in the perl script... so that when they find the script, they have all my passwords instead of just a few. You've solved it. Why didn't I think of that?
> You can make the script prompt you for the master key when you start it up,
Why the fuck would I want to sit around being a meat robot inputting 9 passwords at the beginning of each month just so I can have a copy of my bank statements and electric bill?
You're deluding yourself. Security through obscurity is barely worth the effort.
If you want this done in a fashion that's secure, deploy it on a small device like a Raspberry Pi and firewall that thing so aggressively you can barely get in. It's hard to hack what you can't connect to.