Though be careful with the phrase "data integrity" -- comparing two files on a file system by their MD5 hash is probably fine, but comparing the serialization of two PDUs over the wire based on their MD5 hash may be problematic.
I think your meaning is that MD5 certainly still exhibits an avalanche effect: changing a single bit in the input changes about half the bits in the output. And if you trust the way you retrieve the hashed data and the hash (say, both are on a local hard drive) then yes, it's certainly acceptable for that use. But collisions and second pre-image generation being faster than brute force are why people generally don't want to use it (MD5) when its use spans trust domains.
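To make the avalanche claim concrete, here is a minimal sketch that flips each input bit in turn and averages how many of MD5's 128 output bits change. The helper names (`bit_difference`, `flip_bit`, `average_avalanche`) and the sample string are mine, purely for illustration. It only shows diffusion; it says nothing about collision or pre-image resistance.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the bit at position `index` inverted."""
    out = bytearray(data)
    out[index // 8] ^= 1 << (index % 8)
    return bytes(out)


def average_avalanche(data: bytes) -> float:
    """Average number of MD5 output bits that change per single-bit input flip."""
    base = hashlib.md5(data).digest()
    total_bits = len(data) * 8
    changed = sum(
        bit_difference(base, hashlib.md5(flip_bit(data, i)).digest())
        for i in range(total_bits)
    )
    return changed / total_bits


if __name__ == "__main__":
    sample = b"The quick brown fox jumps over the lazy dog"
    print(f"average bits changed: {average_avalanche(sample):.1f} of 128")
```

If the avalanche property holds, the average should land near 64, i.e. about half of the 128 output bits.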
My point is:
a. Don't use MD5 just because Bruce Schneier published a popular book that said it was okay RIGHT BEFORE all the research damning it came out. (Personally... I think Bruce should publish a third edition of this text expressly to remove the bit about MD5 being okay. I cannot tell you how many hours of billable time I've wasted explaining to software engineers that no... MD5 is not recommended for use even though at one time it was considered acceptable. And if you know not to use MD5, you're not the software engineer I'm talking about.)
b. You can use SHA-256 and get avalanche, collision resistance AND second pre-image resistance. (Pretty sure you also get first pre-image resistance, but I haven't scanned the literature for that in a while.)
And while I'm thinking about it, let me add these points:
c. There are probably better hashing algorithms than MD5 for use with a hash map/table/tree.
d. If you're interested in how MD5 works, I recommend expanding the scope a bit and studying Merkle-Damgård generally. Why MD5 has problems and other hash functions that make use of the Merkle-Damgård construction don't (or have different problems, or the same problems at different amounts of input) is pretty interesting.
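For anyone who wants a starting point on point d, here is a toy sketch of the Merkle-Damgård shape: pad the message (including its length), split it into fixed-size blocks, and fold each block into a chaining state with a compression function. Everything here is an assumption for illustration. The block and state sizes are tiny, the padding uses a big-endian length, and `toy_compress` is a stand-in built from truncated SHA-256; it is not MD5's compression function. Real MD5 uses 512-bit blocks, a 128-bit state and a little-endian length field.

```python
import hashlib

BLOCK_SIZE = 16          # bytes per message block (toy value)
STATE_SIZE = 8           # bytes of chaining state (toy value)
IV = bytes(STATE_SIZE)   # fixed initial chaining value (toy value: all zeros)


def toy_compress(state: bytes, block: bytes) -> bytes:
    """Stand-in compression function: maps (state, block) to a new state."""
    return hashlib.sha256(state + block).digest()[:STATE_SIZE]


def md_pad(message: bytes) -> bytes:
    """Append 0x80, zero bytes, then the 64-bit message bit length."""
    padded = message + b"\x80"
    while (len(padded) + 8) % BLOCK_SIZE != 0:
        padded += b"\x00"
    return padded + (len(message) * 8).to_bytes(8, "big")


def toy_md_hash(message: bytes) -> bytes:
    """Iterate the compression function over the padded message blocks."""
    state = IV
    padded = md_pad(message)
    for offset in range(0, len(padded), BLOCK_SIZE):
        state = toy_compress(state, padded[offset:offset + BLOCK_SIZE])
    return state  # the final chaining state is the digest


if __name__ == "__main__":
    print(toy_md_hash(b"abc").hex())
```

The thing to notice is that the digest is just the last chaining state, so the security of the whole hash rests on the compression function and the state size. That is where MD5 and its relatives differ.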
And yes, if you happen to have an MD5 hardware accelerator or petabytes of data and MD5 hashes already, it's hard to change that overnight.
>comparing the serialization of two PDUs over the wire based on their MD5
I'm not sure what you mean by this, but this:
>people generally don't want to use it (MD5) when its use spans trust domains.
is exactly what I mean by "cryptography", i.e. guarding against intentional tampering. Are there a lot of people using it for this purpose? I don't remember seeing any.
>c. There are probably better hashing algorithms than MD5 for use with a hash map/table/tree.
For sure. Cryptographic functions (even obsolete ones) are almost always overkill and too slow for general data structures. Only use them if you can't find something more suitable for your data.
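As one illustration of "something more suitable", here is a minimal sketch of 32-bit FNV-1a, a small non-cryptographic hash that is sometimes used for hash tables. Picking FNV-1a is my choice for the example, not a recommendation from the thread, and `bucket_index` is a hypothetical helper. It offers no resistance to deliberately chosen colliding keys.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


def bucket_index(key: str, num_buckets: int) -> int:
    """Hypothetical helper: map a string key to a bucket in a hash table."""
    return fnv1a_32(key.encode("utf-8")) % num_buckets


if __name__ == "__main__":
    print(hex(fnv1a_32(b"a")))         # 0xe40c292c
    print(bucket_index("example", 16))
```

The whole function is one XOR and one multiply per byte, which is the kind of cost a general-purpose data structure usually wants.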
>And yes, if you happen to have an MD5 hardware accelerator or petabytes of data and MD5 hashes already, it's hard to change that overnight.
I was actually talking about SHA-256 acceleration, since I just saw like an hour ago that recent Intel CPUs have it. If your CPU has such instructions, by all means use SHA-256 instead of software MD5, if all you need is data integrity.
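For the plain data-integrity case, a minimal sketch with Python's `hashlib` might look like the following. The function names and the chunk size are my own. Whether the CPU's SHA instructions actually get used depends on how the underlying crypto library was built, so treat any speedup as an assumption to measure rather than a given.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """Compare two files by their SHA-256 digests."""
    return sha256_file(path_a) == sha256_file(path_b)
```

Usage is just `same_content("a.bin", "b.bin")`, where both paths are placeholders for files you already trust the storage of.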