I've had some luck turning a cuckoo hash into a sort of LRU cache. Whenever an insert collides, replace the older item. IIRC, everything else stayed the same. Using more than 2 hashes really helped.
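Roughly what I had in mind, as a Python sketch from memory (the class name, the fixed capacity, and the "oldest of the candidate slots" eviction policy are my own choices, not anything standard):

    import random

    class CuckooCache:
        """Fixed-size cuckoo-style cache: instead of kicking entries around,
        a colliding insert evicts the oldest of the key's candidate slots."""

        def __init__(self, capacity=1024, num_hashes=3):
            self.capacity = capacity
            self.seeds = [random.randrange(2**32) for _ in range(num_hashes)]
            self.slots = [None] * capacity   # each slot: (key, value, age) or None
            self.clock = 0                   # insertion counter, used as "age"

        def _slots_for(self, key):
            return [hash((seed, key)) % self.capacity for seed in self.seeds]

        def get(self, key):
            for i in self._slots_for(key):
                entry = self.slots[i]
                if entry is not None and entry[0] == key:
                    return entry[1]
            return None                      # cache miss

        def put(self, key, value):
            self.clock += 1
            candidates = self._slots_for(key)
            # Reuse an empty slot, or the key's existing slot, if possible.
            for i in candidates:
                if self.slots[i] is None or self.slots[i][0] == key:
                    self.slots[i] = (key, value, self.clock)
                    return
            # Otherwise evict the oldest entry among the candidate slots.
            oldest = min(candidates, key=lambda i: self.slots[i][2])
            self.slots[oldest] = (key, value, self.clock)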
I'd just like to point out that the worse your hash algorithm is, the better cuckoo hashing compares to traditional hashing... therefore, study your hash functions!
The ones that come with most languages or libraries by default quite simply aren't optimized for any particular kind of input. There are hash functions that work better on strings and hash functions that work better on numbers (length bias). In a perfect world this wouldn't be true, but in the real world it is.
That said, may I recommend MurmurHash3: code.google.com/p/smhasher/wiki/MurmurHash3
Switch your hash tables to this. The performance difference is incredible.
Cuckoo hashing is not only optimal in theory, it is also very fast in practice. The only downside, though, is that insertion has a linear-time worst case. Thus it may not be the best solution if latency is an issue.
There was a talk around where they were using at most two replacements (or two hash tables) and two hash functions. At insert time you add the object to the first hash table; if there is already an object in that bucket, you then try the second hash table. If you follow this approach, a very large percentage of the buckets ends up full. The set of remaining objects (those that cannot be inserted into either table because both buckets are full) is so small that you can keep them in a list or in a "standard" hash table. At some point I had a secondary-storage implementation of this approach and it was good indeed.
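A rough Python sketch of that scheme as I remember it (names are mine, and the real implementation from the talk surely differed):

    class TwoTableHash:
        """Two tables, two hash functions, plus a tiny overflow dict for
        the few keys that find both of their buckets already occupied."""

        def __init__(self, capacity=1024):
            self.capacity = capacity
            self.t1 = [None] * capacity
            self.t2 = [None] * capacity
            self.overflow = {}               # the small leftover set

        def _h1(self, key):
            return hash(("h1", key)) % self.capacity

        def _h2(self, key):
            return hash(("h2", key)) % self.capacity

        def put(self, key, value):
            i1 = self._h1(key)
            if self.t1[i1] is None or self.t1[i1][0] == key:
                self.t1[i1] = (key, value)
                return
            i2 = self._h2(key)
            if self.t2[i2] is None or self.t2[i2][0] == key:
                self.t2[i2] = (key, value)
                return
            self.overflow[key] = value       # both buckets full: park it here

        def get(self, key):
            e = self.t1[self._h1(key)]
            if e is not None and e[0] == key:
                return e[1]
            e = self.t2[self._h2(key)]
            if e is not None and e[0] == key:
                return e[1]
            return self.overflow.get(key)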
Linear time worst case and amortized constant time are not mutually exclusive. You are right that most containers that we use have amortized constant time complexity, but they also have linear worst case complexity. Most times this is because an internal array has to be resized and the contents have to be copied over to the new enlarged array.
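To make the occasional copy concrete, here's a toy Python version of the usual doubling strategy (illustrative only; real containers are obviously more careful):

    class GrowableArray:
        """Toy dynamic array: appends are O(1) except when the backing
        storage fills up, at which point every element is copied (O(n))."""

        def __init__(self):
            self.capacity = 4
            self.size = 0
            self.storage = [None] * self.capacity

        def append(self, item):
            if self.size == self.capacity:
                # The rare linear-time step: double and copy everything over.
                self.capacity *= 2
                new_storage = [None] * self.capacity
                new_storage[:self.size] = self.storage
                self.storage = new_storage
            self.storage[self.size] = item
            self.size += 1

Doubling means the total copy work over n appends is O(n), which is where the "amortized constant" bound comes from.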
No, they're not mutually exclusive. But what actually matters in most software is "amortized constant time": linear time worst case only matters in hard real-time systems which can't tolerate any deviation from the expected runtime.
Even data structures like binary trees can end up with very bad worst cases unless great care is taken in allocating their nodes.
The linear-time behavior of cuckoo hashing is not triggered only when the table has to be enlarged, as with standard containers (say, std::vector or a Python list): a linear number of keys can be "cuckooed" out of their slots by a single insertion even when the load factor is under the capacity threshold.
In standard containers, if you pre-define the capacity (say with std::vector::reserve), all insertions are guaranteed to be constant time. Cuckoo hashing can't give this guarantee.
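To see where those displacement chains come from, here's a bare-bones two-choice cuckoo table in Python (a sketch, not production code; MAX_KICKS and the hash seeds are arbitrary):

    class CuckooTable:
        """Plain two-choice cuckoo hashing. The displacement loop in put()
        is where a single insert can bounce a long chain of keys between
        their two possible slots."""

        MAX_KICKS = 64

        def __init__(self, capacity=1024):
            self.capacity = capacity
            self.slots = [None] * capacity

        def _positions(self, key):
            return (hash(("a", key)) % self.capacity,
                    hash(("b", key)) % self.capacity)

        def get(self, key):
            # Access is worst-case constant time: at most two probes.
            for i in self._positions(key):
                if self.slots[i] is not None and self.slots[i][0] == key:
                    return self.slots[i][1]
            return None

        def put(self, key, value):
            # Update in place if the key already sits in one of its slots.
            for i in self._positions(key):
                if self.slots[i] is not None and self.slots[i][0] == key:
                    self.slots[i] = (key, value)
                    return
            entry = (key, value)
            i = self._positions(key)[0]
            for _ in range(self.MAX_KICKS):
                if self.slots[i] is None:
                    self.slots[i] = entry
                    return
                # Kick out the occupant and re-home it in its other slot.
                entry, self.slots[i] = self.slots[i], entry
                a, b = self._positions(entry[0])
                i = b if i == a else a
            raise RuntimeError("too many displacements; a real table would rehash")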
Chaining requires a memory allocation and a list insertion on collisions - is it fair to call that absolutely constant time? The probability of this happening depends on the number of elements in the container.
Pooling memory for allocating links turns the allocation into amortized constant time.
List insertion is O(1) if you insert it at the head, and while moving the most-recently-accessed link to the head of the chain doesn't change the worst case behavior, it greatly improves the average in practice (and is dead easy).
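A minimal Python sketch of chaining with move-to-front, for the record (names are mine):

    class Node:
        __slots__ = ("key", "value", "next")
        def __init__(self, key, value, next=None):
            self.key, self.value, self.next = key, value, next

    class ChainedTable:
        """Separate chaining where a successful lookup moves the hit node
        to the head of its chain: same worst case, better average."""

        def __init__(self, num_buckets=1024):
            self.num_buckets = num_buckets
            self.buckets = [None] * num_buckets

        def put(self, key, value):
            b = hash(key) % self.num_buckets
            node = self.buckets[b]
            while node is not None:
                if node.key == key:
                    node.value = value
                    return
                node = node.next
            # Head insertion is O(1) regardless of chain length.
            self.buckets[b] = Node(key, value, self.buckets[b])

        def get(self, key):
            b = hash(key) % self.num_buckets
            prev, node = None, self.buckets[b]
            while node is not None:
                if node.key == key:
                    if prev is not None:     # move the hit to the front
                        prev.next = node.next
                        node.next = self.buckets[b]
                        self.buckets[b] = node
                    return node.value
                prev, node = node, node.next
            return None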
As silentbycicle said, list insertion at the head is constant time.
About allocation, I agree with you that it is "slow", but all reasonable computation models count a memory allocation as a constant-time operation. If we want to go down that road and also measure memory operations, even random memory access would not be worst-case constant time: the MMU has to map the page to its physical location, and that is usually a logarithmic or amortized constant-time operation.
Well, there is perfect hashing [1]. Of course, it only works under several assumptions, and you pay for it in the time it takes to construct the hash function or in space.
I coded a cuckoo-hashed hash table for a class once, and recently had a lecture on cuckoo hashing given to my class by Rasmus Pagh himself (inventor of cuckoo hashing). It's really surprising how complex the probability theory behind cuckoo hashing is (and how tied it is to graph theory and bipartite matchings).
The amazing part is that not only is access O(1), the expected insertion time when there is no rehashing is also O(1), and rehashing occurs infrequently.
It looks like it was a workshop paper, from the 7th Workshop on Distributed Data and Structures. I don't know much about that particular workshop, but workshops often aim for a more informal atmosphere, since one of their main roles is fostering discussion (incl. of works in progress) and bringing communities of researchers together to discuss things.
I like it when academic papers include some personality or touches of informality (in headings or examples, for instance), as long as the required rigor isn't sacrificed elsewhere. That said, "why else is this cool?" is probably a bit further than I'd go.
I dig it: if you're going to use an informal expression and risk being completely dated within half a decade, why not pre-empt the retro wave with WHY ELSE IS THIS GROOVY?
Mitzenmacher, the author of Probability and Computing, has an interesting survey on cuckoo hashing. In it there are a number of open research problems that, for those of you with an interest, are worth looking over. The 7th is very interesting to me, regarding optimal ways to maintain a hash table in a parallel computing environment.
Every time I've tried comparing cuckoo hashing vs traditional hash algorithms in practice, the time taken to compute the additional hash functions outweighs any gains in performance.
Counter-intuitively, I've also noticed in many cases that using binary search over sorted elements in contiguous memory is actually faster than using a hash table at all.
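For example, something along these lines with Python's bisect (whether it actually beats a hash table depends entirely on the size of the table and the access pattern, so measure on your own workload):

    import bisect

    class SortedArrayMap:
        """Lookup by binary search over keys kept sorted in contiguous
        storage; read-only after construction in this sketch."""

        def __init__(self, items):          # items: iterable of (key, value)
            pairs = sorted(items)
            self.keys = [k for k, _ in pairs]
            self.values = [v for _, v in pairs]

        def get(self, key):
            i = bisect.bisect_left(self.keys, key)
            if i < len(self.keys) and self.keys[i] == key:
                return self.values[i]
            return None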
> the time taken to compute the additional hash functions outweighs any gains in performance.
Sounds like you're using the wrong hash functions.
Are you perchance using cryptographic hash functions (such as MD5, SHA, etc.)? Because they're not intended for use in data structures.
Most data structure algorithms just require a hash function with good avalanche behaviour and statistically even bit dispersion. The FNV hash will do this for you with just a MUL and an XOR per byte, which is (rough guess) at least 100 times faster than SHA. FNV hash (http://www.isthe.com/chongo/tech/comp/fnv/): it's super-effective!
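For reference, the 64-bit FNV-1a variant is about this much code (constants taken from the page linked above):

    FNV64_OFFSET = 0xcbf29ce484222325
    FNV64_PRIME = 0x100000001b3

    def fnv1a_64(data: bytes) -> int:
        """64-bit FNV-1a: one XOR and one multiply per byte."""
        h = FNV64_OFFSET
        for byte in data:
            h ^= byte
            h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF   # stay within 64 bits
        return h

    # e.g. bucket = fnv1a_64(b"some key") % num_buckets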
It very much depends on your hash table implementation, but I'm betting that your hash table is storing pointers to objects stored elsewhere. In this case, an array of elements in contiguous memory will be much faster because of locality. It's sometimes surprising how badly most of the data structures we think about treat caches.
I almost always prefer some kind of balanced tree to a hash table, because no matter the original specifications, sometime during the development of a program I usually need the elements in order.