I'm disappointed that the hashing is just based on training on microbenchmarks and SMHasher, rather than designing a fast _provably_ universal hash.
Suites like SMHasher are never complete. They are just trying to catch the most common weaknesses. If you train on the test cases you'll only get an algorithm that passes the tests, but people can always find a set of values on which you will do badly.
I think you're confusing proving that the hash function is collision resistant with the other goal which is hashing speed. If you really need a collision resistant hash you need to use a cryptographic hash function, but outside of cryptographic applications that is rarely the requirement. And (huge caveat, this isn't my domain expertise) I'm not sure what security properties are really "proven" about existing cryptographic hash functions; AFAIK they are considered secure because we don't know how to break them, not because of some fundamental mathematical property about them.
For the other 99.999% of hashing applications there is a balance between collision resistance and hashing latency. For example, in a hash table (probably the most common use for a non-cryptographic hash function) there is a cost incurred by hash collisions because lookups on keys with collisions may have to do extra probing. On the other hand, every hash table lookup requires doing at least one hash operation, regardless of whether or not it collides. So it may make sense to have a slightly worse hash function (in the sense that it is more likely to have collisions with pathological inputs) if it has slightly lower latency. The only way to really know what is faster for a real world application is to have some kind of benchmark to train against as a loss function.
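To make that tradeoff concrete, here is a rough Rust sketch (my own illustration, not from the article; `WeakFastHasher` and its constants are made up) of plugging a deliberately simple hasher into a standard `HashMap` and timing it against the default randomly-seeded SipHash one. This is the kind of microbenchmark you would end up training against:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};
use std::time::Instant;

// A deliberately simple (and weak) multiplicative hasher: very low latency,
// but it is easy to construct colliding keys for it. Purely illustrative.
#[derive(Default)]
struct WeakFastHasher(u64);

impl Hasher for WeakFastHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.wrapping_mul(0x100000001b3).wrapping_add(b as u64);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

fn main() {
    // Compile with --release for anything resembling a meaningful comparison.
    let keys: Vec<u64> = (0..1_000_000).collect();

    // Default hasher (SipHash with a random seed): more work per lookup,
    // but resistant to adversarially chosen keys.
    let t = Instant::now();
    let mut m1: HashMap<u64, u64> = HashMap::new();
    for &k in &keys {
        m1.insert(k, k);
    }
    println!("default hasher:   {:?}", t.elapsed());

    // Weak-but-fast hasher: fewer cycles per hash, much worse worst case.
    let t = Instant::now();
    let mut m2: HashMap<u64, u64, BuildHasherDefault<WeakFastHasher>> = HashMap::default();
    for &k in &keys {
        m2.insert(k, k);
    }
    println!("weak fast hasher: {:?}", t.elapsed());
}
```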
> I think you're confusing proving that the hash function is collision resistant with the other goal which is hashing speed. If you really need a collision resistant hash you need to use a cryptographic hash function.
I wish this misconception would die. There is a great theory of algorithmic probabilistic hash functions, completely distinct from cryptographic hash functions. If you are designing a hash table, or a different algorithm using a hash function, you nearly always want the former kind.
The idea is that `Pr[h(x) = h(y)]` is small _no matter the inputs x and y_.
Here the probability is over the random seed of h.
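Concretely, a minimal sketch of such a family in Rust, using the classic `((a*x + b) mod p) mod m` construction (illustrative only, not UMASH itself; keys are assumed to be smaller than the prime, and the seed would normally come from a proper RNG):

```rust
// Universal hash family: h_{a,b}(x) = ((a*x + b) mod p) mod m, with p prime
// and (a, b) drawn once at random as the seed. For any fixed x != y below p,
// Pr over (a, b) that h(x) == h(y) is about 1/m -- the guarantee described above.
const P: u128 = (1 << 61) - 1; // a Mersenne prime

struct UniversalHash {
    a: u128, // 1 <= a < P, random
    b: u128, // 0 <= b < P, random
    m: u64,  // number of buckets
}

impl UniversalHash {
    fn new(seed: (u64, u64), m: u64) -> Self {
        UniversalHash {
            a: (seed.0 as u128 % (P - 1)) + 1,
            b: seed.1 as u128 % P,
            m,
        }
    }

    fn hash(&self, x: u64) -> u64 {
        debug_assert!((x as u128) < P); // guarantee only holds for keys below P
        (((self.a * x as u128 + self.b) % P) % self.m as u128) as u64
    }
}

fn main() {
    // The seed would normally come from a CSPRNG at table-creation time.
    let h = UniversalHash::new((0x9E3779B97F4A7C15, 0xDEADBEEF), 1 << 16);
    println!("{} {}", h.hash(42), h.hash(43));
}
```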
Lots of good hash functions, like UMASH (https://engineering.backtrace.io/2020-08-24-umash-fast-enoug...), have this guarantee.
Other fast hash functions, like MurmurHash, don't.
When a function doesn't have this guarantee, it means I can find sets of values x1, x2, ... that will likely collide under _any_ or most seeds!
Sure, if your inputs are basically random, this probably won't happen, but people can still use this to DDoS your hash table, or whatever you are coding.
Notice again, this has nothing to do with cryptography. It is all about probabilistic guarantees.
You can't just test the hash function on a fixed number of inputs and say it's good, since you may just have moved the "bad set" to somewhere else.
In this day and age there are super fast algorithmic hash functions with guaranteed low expected collisions. It's just silly to use one that you can break so easily.
> The idea is that `Pr[h(x) = h(y)]` is small _no matter the inputs x and y_.
That sounds like such a function is strongly collision resistant, which means it's also second preimage resistant. And that gets you most of the way to a cryptographic hash function.
Is the only difference that it doesn't have to be first preimage resistant? Compared to cryptographic hashes, does that expand the set of viable functions a lot, to allow first preimages while still not allowing second preimages?
> It is all about probabilistic guarantees
So are cryptographic hash functions.
When I search for `algorithmic probabilistic hash functions` I just get results about bloom filters.
Cryptographic hash functions like MD5, SHA-2, BLAKE2, etc. are deterministic functions, so it doesn't really make sense to talk about Pr[h(x)=h(y)]. Either they collide or they don't.
It's muddied a bit by the fact that cryptographers also use universal hashing (or probabilistic hashing, or what I called algorithmic hashing) for stuff like UMACs, https://en.m.wikipedia.org/wiki/UMAC#NH_and_the_RFC_UMAC , but they often have a lot of extra considerations on top of just collision resistance.
Some algorithms also need stronger probabilistic guarantees than just collision resistance (see e.g. https://en.m.wikipedia.org/wiki/K-independent_hashing#Independe... ). These properties are usually too hard to test for with an experimental testing suite like SMHasher, but if your hash function doesn't have them, people will be able to find inputs that break your algorithm.
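For reference, a hedged sketch of the polynomial construction from that Wikipedia article in Rust (the `PolyHash` name is mine; keys are assumed to be below the prime, and a real use would draw the coefficients from a proper RNG and reduce the output to a bucket range):

```rust
// k-independent hashing via a random polynomial of degree k-1 mod a prime:
// h(x) = (c_{k-1} x^{k-1} + ... + c_1 x + c_0) mod P. The k coefficients are
// the random seed; the family is k-wise independent for keys below P.
const P: u128 = (1 << 61) - 1; // Mersenne prime

struct PolyHash {
    coeffs: Vec<u128>, // k random values in [0, P)
}

impl PolyHash {
    // Horner's rule: ((c_{k-1} * x + c_{k-2}) * x + ...) * x + c_0, all mod P.
    fn hash(&self, x: u64) -> u64 {
        let x = x as u128 % P;
        let mut acc: u128 = 0;
        for &c in self.coeffs.iter().rev() {
            acc = (acc * x + c) % P;
        }
        acc as u64 // map to buckets with a further "% m" if needed
    }
}

fn main() {
    // Coefficients would normally be drawn uniformly at random from [0, P).
    let h = PolyHash { coeffs: vec![12345, 67890, 13579, 24680] }; // k = 4
    println!("{}", h.hash(42));
}
```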
> Cryptographic hash functions like MD5, SHA-2, BLAKE2, etc. are deterministic functions, so it doesn't really make sense to talk about Pr[h(x)=h(y)]. Either they collide or they don't.
Eh, that's how I usually see collision resistance described. The probability is based on generating fresh inputs with any method you want/the most effective attack method available.
But I wouldn't say the hash you linked is nondeterministic just because it has a seed. You can seed MD5, SHA-2, and BLAKE2 by tossing bytes in as a prefix. It'll prevent the same attacks and you can give it the same analysis.
So I'm still not sure in what sense a hash like this is facing different requirements than a cryptographic hash.
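For illustration, here is roughly what that prefix-seeding looks like, sketched in Rust assuming the `sha2` crate (my choice of library, not something from the thread):

```rust
// Sketch: "seeding" a fixed cryptographic hash by prefixing key material.
// Assumes the `sha2` crate; any SHA-256 API would do the same thing.
use sha2::{Digest, Sha256};

fn seeded_digest(seed: &[u8], data: &[u8]) -> Vec<u8> {
    let mut hasher = Sha256::new();
    hasher.update(seed); // the per-table / per-process random seed
    hasher.update(data); // the key being hashed
    hasher.finalize().to_vec()
}

fn main() {
    println!("{:x?}", seeded_digest(b"random seed", b"key"));
}
```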
> You can seed MD5, SHA-2, and BLAKE2 by tossing bytes in as a prefix. It'll prevent the same attacks and you can give it the same analysis.
I'm curious if you can link to such an analysis. These functions are notoriously much harder to analyze than simple functions like `h(x) = ax + b mod p`, which is all you need for the probabilistic guarantee.
But even if you could analyze this, you would just end up with a universal hash function that's way slower than you need, because you didn't pick the right tool for the job.
By definition, if they're secure then they should meet the requirements, right?
> But even if you could analyze this, you would just end up with a universal hash function that's way slower than you need, because you didn't pick the right tool for the job.
I understand that; I'm just trying to figure out how a universal hash is easier to construct. But as you've gone through the descriptions here, I think I see how the kind of collision resistance needed is much, much simpler, and there seems to be an assumption that the output of the hash will not be available to the attacker.
> But I wouldn't say the hash you linked is nondeterministic just because it has a seed. You can seed MD5, SHA-2, and BLAKE2 by tossing bytes in as a prefix.
Yes, but the point is that hash functions used for hash tables are much, much faster than these cryptographic ones.
>If you really need a collision resistant hash you need to use a cryptographic hash function, but outside of cryptographic applications that is rarely the requirement.
There are reasons to use (strongly) collision resistant hashes outside of cryptographic settings. E.g., the default Rust hash function, used in hash maps and sets, has strong collision resistance, because otherwise you could open up applications to DoS attacks (the attacker uses lots of inserts with collisions to kill performance of accesses and further inserts at those buckets).[0]
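For context, this is what Rust's standard `HashMap` does out of the box: every map is built with a `RandomState`, i.e. a SipHash-style hasher keyed with per-instance random material (the exact algorithm is an implementation detail), which is the seed randomization that defeats precomputed colliding key sets. A small sketch spelling out the default explicitly:

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;

fn main() {
    // Fresh random keys for this hasher instance.
    let seeded = RandomState::new();
    let mut map: HashMap<String, u32, RandomState> = HashMap::with_hasher(seeded);
    map.insert("attacker-chosen key".to_string(), 1);

    // HashMap::new() already does the equivalent of the above, which is why a
    // colliding key set found against one process won't carry over to another.
    let _default: HashMap<String, u32> = HashMap::new();
}
```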
>I'm not sure what security properties are really "proven" about existing cryptographic hash functions; AFAIK they are considered secure because we don't know how to break them, not because of some fundamental mathematical property about them.
There are provably secure hash functions[1] (typically using the same sort of primitives as public key crypto), but they're generally only used when certain properties need to be composed, and are often less secure than the non-provable ones in practice anyway. This is pretty similar to the state of symmetric vs. asymmetric cryptography in general: primitives like RSA, DH, etc. have much stronger proofs than AES, but algorithms built using AES for security are generally viewed as a lot less likely to be broken any time soon than algorithms built using typical asymmetric primitives for security, even ignoring things like quantum advantage.
Indeed, and this has been the case for quite a while now. You can always improve on some general algorithm by taking advantage of knowledge of the data, but that never generalizes and usually leads to worse performance on other data and/or new pathological cases that produce unusable results.
>Indeed, and this has been the case for quite a while now. You can always improve on some general algorithm by taking advantage of knowledge of the data, but that never generalizes and usually leads to worse performance on other data and/or new pathological cases that produce unusable results.
DeepMind did the exact same thing with AlphaTensor. While they do some genuinely incredible things, there's always a massive caveat that the media ignores. Still, I think it's great that they figured out a way to search a massive space where most of the solutions are wrong, and with only 16 TPUs running for 2 days max. Hopefully this can be repurposed into a more useful program, like one that finds proofs for theorems.
Ship the optimization framework in with the application, sample from the user data, and optimize for that? It isn’t overfitting if you overfit on the data you care about, right?
Data tends to change over time, and once a hash function is in use you can't really replace it easily without a lot of overhead, possibly quite a bit more overhead than what you saved in the first place. There are some examples of this in the sorting arena too, such as 'Timsort'; personally I haven't found any that gave a substantial boost, but there are probably cases where they do. Unless sorting or hashing (and lookup) are the main bottleneck for an application, I would spend my time on other aspects of it.