A Seven Dimensional Analysis of Hashing Methods [pdf]

aappleby · on April 26, 2017

Really interesting analysis, but constraining their keys to 64-bit integers and using 'hash functions' that only accept 64-bit integers as input skews the results away from real-world use cases.

In practice I've found that a variant of Robin Hood hashing with cache-line-sized bins and a hard limit on how deep to search for movable keys when inserting new keys (max of 3 possible bin locations per key, max search depth of 3 moves per insert) gives very good performance even with high load factors.

-Austin, author of MurmurHash
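
One possible reading of the bounded scheme described above, sketched in Python. This is not the commenter's implementation: the bin size of 8 slots, the use of `hashlib.blake2b` to derive the three candidate bins, and the names `BoundedBinTable`, `_candidates`, and `_place` are all assumptions made for illustration. It models only the "3 candidate bins, at most 3 moves per insert" limits, not the Robin Hood displacement ordering or the real memory layout.

```python
import hashlib

BIN_SLOTS = 8     # assumption: eight 8-byte keys fill one 64-byte cache line
NUM_CHOICES = 3   # each key may live in one of up to 3 candidate bins
MAX_MOVES = 3     # relocate at most 3 existing keys for a single insert


class BoundedBinTable:
    def __init__(self, num_bins):
        self.bins = [[] for _ in range(num_bins)]

    def _candidates(self, key):
        """Return the candidate bin indexes for key, duplicates removed."""
        out = []
        for seed in range(NUM_CHOICES):
            data = f"{seed}:{key!r}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            idx = int.from_bytes(digest, "little") % len(self.bins)
            if idx not in out:
                out.append(idx)
        return out

    def contains(self, key):
        return any(key in self.bins[i] for i in self._candidates(key))

    def insert(self, key):
        """Insert key; return False if no slot is found within MAX_MOVES."""
        if self.contains(key):
            return True
        return self._place(key, MAX_MOVES, exclude=None)

    def _place(self, key, moves_left, exclude):
        cands = [i for i in self._candidates(key) if i != exclude]
        # First choice: any candidate bin that still has a free slot.
        for i in cands:
            if len(self.bins[i]) < BIN_SLOTS:
                self.bins[i].append(key)
                return True
        if moves_left == 0:
            return False
        # Otherwise take over a resident key's slot and try to re-home it.
        for i in cands:
            for slot, victim in enumerate(list(self.bins[i])):
                self.bins[i][slot] = key
                if self._place(victim, moves_left - 1, exclude=i):
                    return True
                self.bins[i][slot] = victim  # undo and try the next victim
        return False


table = BoundedBinTable(num_bins=16)
inserted = [k for k in range(100) if table.insert(k)]
print(len(inserted), "of 100 keys placed in", 16 * BIN_SLOTS, "slots")
```

When `insert` returns False, a real table would grow or rehash; the hard limits exist so that the worst-case cost of a single insert stays bounded.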

tjalfi · on April 25, 2017

The actual title is "A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing".

nosefouratyou · on April 25, 2017

Recently I was reading about hash tables on Wikipedia, and I was surprised to learn how much the birthday problem influenced their design.

"A real world example of a hash table that uses a self-balancing binary search tree for buckets is the HashMap class in Java version 8."

https://en.wikipedia.org/wiki/Hash_table#Separate_chaining_w...
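
To make the birthday-problem connection concrete, here is a small sketch that computes the chance of at least one collision when keys are hashed uniformly at random into a fixed number of buckets. The function name `collision_probability` is made up for this example, and the uniform, independent hashing is an idealizing assumption.

```python
def collision_probability(n_keys, n_buckets):
    """Probability that at least two of n_keys share a bucket,
    assuming each key lands in a uniformly random bucket."""
    p_no_collision = 1.0
    for i in range(n_keys):
        # The (i+1)-th key must avoid the i buckets already occupied.
        p_no_collision *= max(0.0, 1.0 - i / n_buckets)
    return 1.0 - p_no_collision


# Classic birthday case: 23 keys into 365 buckets.
print(round(collision_probability(23, 365), 3))
```

With 365 buckets a collision becomes more likely than not at only 23 keys, which is why a hash table needs a collision strategy even at low load factors.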

rurban · on April 26, 2017

That is only there to counter worst-case attack scenarios, such as more than 100 collisions in a single bucket, which do not happen in realistic workloads.

Robin Hood hashing and its better cousin Hopscotch hashing, which was left out here, handle that scenario much better, with only ~1-3% performance loss on average in the best case, and much better performance numbers in the average write-heavy case, where their "move to front" strategy pays off. They also have the worst-case scenario covered, unlike all the others.

shouldbworking · on April 25, 2017

The birthday problem directly relates to Bloom and Cuckoo filters as well. The false positive rate is essentially how often a key that was never inserted collides with the hashed bits or fingerprints of keys that were.
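
For Bloom filters, the usual textbook estimate of that rate can be sketched as follows. It assumes k independent, uniform hash functions, and the function name `bloom_false_positive_rate` is made up for this example. Cuckoo filters use a different formula based on fingerprint length and are not covered here.

```python
import math


def bloom_false_positive_rate(n_keys, m_bits, k_hashes):
    """Approximate false positive rate of a Bloom filter with m_bits bits
    and k_hashes hash functions after n_keys insertions."""
    # Fraction of bits expected to be set after all insertions.
    fraction_set = 1.0 - math.exp(-k_hashes * n_keys / m_bits)
    # A false positive needs all k probed bits to be set already.
    return fraction_set ** k_hashes


# 10 bits per key with 7 hash functions gives a rate a little under 1%.
print(bloom_false_positive_rate(1_000_000, 10_000_000, 7))
```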