> I think you're confusing proving that the hash function is collision resistant with the other goal which is hashing speed. If you really need a collision resistant hash you need to use a cryptographic hash function.
I wish this misconception would die. There is a great theory of algorithmic probabilistic hash functions, completely distinct from cryptographic hash functions. If you are designing a hash table, or a different algorithm using a hash function, you nearly always want the former kind.
The idea is that `Pr[h(x) = h(y)]` is small _no matter the inputs x and y_.
Here the probability is over the random seed of h.
Lots of good hash functions, like UMASH (https://engineering.backtrace.io/2020-08-24-umash-fast-enoug...), have this guarantee.
Other fast hash functions, like MurmurHash, don't.
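UMASH itself is fancier, but here's a minimal Python sketch of the same kind of guarantee using the textbook family `((a*x + b) % p) % m` (the prime, the table size, and the test pair are just illustrative). Note that x and y are fixed before the seed is drawn:

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, larger than any key below

def make_hash(m):
    # Drawing (a, b) picks one member of the family; this is the "seed".
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

# The adversary fixes distinct x and y FIRST; only then is the seed drawn.
x, y, m, trials = 12345, 67890, 1024, 100_000
collisions = 0
for _ in range(trials):
    h = make_hash(m)
    if h(x) == h(y):
        collisions += 1
print(collisions / trials)  # hovers around 1/m ~= 0.001, for ANY fixed pair
```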
When a function doesn't have this guarantee, it means I can find sets of values x1, x2, ... that will likely collide under _any_ or most seeds!
Sure, if your inputs are basically random, this probably won't happen, but people can still use this to DDoS your hash table, or whatever you are coding.
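Here's a toy sketch of that attack, with `bad_hash` standing in for any seedless mixing step (the multiplicative constant and the 16-bit bucket index are assumptions for the demo). Because the function is public and fixed, the colliding keys can be computed offline, once, and reused against everyone's tables:

```python
C = 2654435761  # Knuth's multiplicative constant; crucially, no seed anywhere

def bad_hash(x):
    return (x * C) & 0xFFFFFFFF

TABLE_BITS = 16  # suppose the table uses the low 16 bits as the bucket index

def bucket(x):
    return bad_hash(x) & ((1 << TABLE_BITS) - 1)

# Invert the public mixing step to manufacture keys that all land in bucket 0.
C_INV = pow(C, -1, 1 << 32)  # modular inverse of C mod 2^32 (C is odd)
evil_keys = [((i << 16) * C_INV) & 0xFFFFFFFF for i in range(1000)]

print({bucket(k) for k in evil_keys})  # {0}: every lookup scans one giant bucket
```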
Notice again, this has nothing to do with cryptography. It is all about probabilistic guarantees.
You can't just test the hash function on a fixed number of inputs and say it's good, since you may just have moved the "bad set" to somewhere else.
In this day and age there are super fast algorithmic hash functions with guaranteed low expected collisions. It's just silly to use one that you can break so easily.
> The idea is that `Pr[h(x) = h(y)]` is small _no matter the inputs x and y_.
That sounds like such a function is strongly collision resistant, which means it's also second preimage resistant. And that gets you most of the way to a cryptographic hash function.
Is the only difference that it doesn't have to be first preimage resistant? Compared to cryptographic hashes, does that expand the set of viable functions a lot, to allow first preimages while still not allowing second preimages?
> It is all about probabilistic guarantees
So are cryptographic hash functions.
When I search for `algorithmic probabilistic hash functions` I just get results about bloom filters.
Cryptographic hash functions like MD5, SHA-2, BLAKE2, etc. are deterministic functions, so it doesn't really make sense to talk about `Pr[h(x) = h(y)]`. Either they collide or they don't.
It's muddied a bit by the fact that cryptographers also use universal hashing (or probabilistic hashing, or what I called algorithmic hashing) for stuff like UMACs (https://en.m.wikipedia.org/wiki/UMAC#NH_and_the_RFC_UMAC), but they often have a lot of extra considerations on top of just collision resistance.
Some algorithms also need stronger probabilistic guarantees than just collision resistance (see e.g. https://en.m.wikipedia.org/wiki/K-independent_hashing#Indepe...). These properties are usually too hard to test for with an experimental testing suite like SMhasher, but if your hash function doesn't have them, people will be able to find inputs that break your algorithm.
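For reference, the textbook k-independent construction is just a random degree-(k-1) polynomial over a prime field. A rough Python sketch (the parameters are illustrative, and the final mod-m step makes the distribution only approximately uniform):

```python
import random

P = (1 << 61) - 1  # prime field size

def make_k_independent(k, m):
    # Random degree-(k-1) polynomial mod P: any k distinct keys receive
    # (close to) independent, (close to) uniform buckets in [0, m).
    coeffs = [random.randrange(P) for _ in range(k)]
    def h(x):
        acc = 0
        for c in coeffs:  # Horner's rule
            acc = (acc * x + c) % P
        return acc % m
    return h

h = make_k_independent(k=5, m=1024)  # 5-independence suffices for e.g. linear probing
print(h(42), h(43))
```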
> Cryptographic hash functions like MD5, SHA-2, BLAKE2, etc. are deterministic functions, so it doesn't really make sense to talk about `Pr[h(x) = h(y)]`. Either they collide or they don't.
Eh, that's how I usually see collision resistance described. The probability is over inputs generated however the attacker likes, i.e. by the most effective attack method available.
But I wouldn't say the hash you linked is nondeterministic just because it has a seed. You can seed MD5, SHA-2, and BLAKE2 by tossing bytes in as a prefix. It'll prevent the same attacks and you can give it the same analysis.
So I'm still not sure in what sense a hash like this is facing different requirements than a cryptographic hash.
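Concretely, something like this (a rough sketch with Python's hashlib; the seed size and the truncation to 8 bytes are arbitrary choices, and BLAKE2 even takes a key natively, so it doesn't need the prefix trick at all):

```python
import hashlib
import os

SEED = os.urandom(16)  # per-process random seed, unknown to the attacker

def seeded_sha256(data: bytes) -> int:
    # "Seed" a plain cryptographic hash by prefixing the message with it.
    digest = hashlib.sha256(SEED + data).digest()
    return int.from_bytes(digest[:8], "little")  # truncate to a table-index size

def keyed_blake2(data: bytes) -> int:
    # BLAKE2 accepts a key directly, so the prefix trick isn't even needed.
    digest = hashlib.blake2b(data, key=SEED, digest_size=8).digest()
    return int.from_bytes(digest, "little")

print(seeded_sha256(b"hello"), keyed_blake2(b"hello"))
```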
> You can seed MD5, SHA-2, and BLAKE2 by tossing bytes in as a prefix. It'll prevent the same attacks and you can give it the same analysis.
I'm curious if you can link to such an analysis. These functions are notoriously much harder to analyze than simple functions like `h(x) = ax+b mod p`, which is all you need for the probabilistic guarantee.
But even if you could analyze this, you would just end up with a universal hash function that's way slower than you need, because you didn't pick the right tool for the job.
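For contrast, the entire analysis of that simple family, in its bucketed form `((a*x + b) % p) % m` with a prime p, distinct x, y in Z_p, and a drawn uniformly from {1, ..., p-1}, fits in a few lines:

```latex
\[
  (ax + b) \equiv (ay + b) \pmod{p}
  \;\iff\; a(x - y) \equiv 0 \pmod{p}
  \;\iff\; a \equiv 0 \pmod{p}
\]
```

The last case is ruled out by a != 0, so the inner map is collision-free and any collision must come from the final reduction mod m; counting how many of the p residues share a bucket mod m gives `Pr[h(x) = h(y)] <= 1/m` over the random choice of (a, b).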
By definition, if they're secure then they should meet the requirements, right?
> But even if you could analyze this, you would just end up with a universal hash function that's way slower than you need, because you didn't pick the right tool for the job.
I understand that; I'm just trying to figure out in what sense a universal hash is easier to construct. But as you've gone through the descriptions here, I think I see how the required collision resistance is much, much simpler, and that there's an assumption that the attacker never sees the hash output.
> But I wouldn't say the hash you linked is nondeterministic just because it has a seed. You can seed MD5, SHA-2, and BLAKE2 by tossing bytes in as a prefix.
Yes, but the point is that hash functions used for hash tables are much, much faster than these cryptographic ones.