Hash tables with O(1) worst-case lookup and space efficiency [pdf] (ru.is)
118 points by tianyicui on Feb 20, 2011 | 43 comments



I've had some luck turning a cuckoo hash into a sort of LRU cache. Whenever you do an insert, replace the older item. IIRC, everything else stayed the same. Using more than 2 hashes really helped.
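
A minimal sketch of that idea (two candidate slots per key, evict the older entry when both are full; the class name, the timestamps, and the second hash function are assumptions, not the commenter's actual code):

    import time

    class CuckooishCache:
        """Two-choice cache: on a collision, replace the older of the two candidates."""

        def __init__(self, size=1024):
            self.size = size
            self.slots = [None] * size  # each slot holds (key, value, last_touched)

        def _positions(self, key):
            # Two independent slot indices; these stand in for real hash functions.
            return hash(key) % self.size, hash((key, 0x9e3779b9)) % self.size

        def get(self, key):
            for i in self._positions(key):
                entry = self.slots[i]
                if entry is not None and entry[0] == key:
                    self.slots[i] = (key, entry[1], time.monotonic())  # refresh age
                    return entry[1]
            return None

        def put(self, key, value):
            now = time.monotonic()
            p1, p2 = self._positions(key)
            for i in (p1, p2):  # overwrite an existing entry or take an empty slot
                entry = self.slots[i]
                if entry is None or entry[0] == key:
                    self.slots[i] = (key, value, now)
                    return
            # Both candidates occupied: replace the older one, as described above.
            older = p1 if self.slots[p1][2] <= self.slots[p2][2] else p2
            self.slots[older] = (key, value, now)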


I'd been wondering how best to make an LRU cache, and now you come along and give a really good solution. Nice!


That is not LRU, it's "replace with some other random thing some time, based on a hash function".


This is really clever, thanks for the hack.


That's not a hack. It's an algorithm.


Ooh so a hack is sort of an eigentransform of an algorithm in the sense that a hacked algorithm is still an algorithm.


I'd just like to point out that the worse your hash algorithm is, the better cuckoo hashing compares to traditional hashing... therefore, study your hash functions!

The ones that come with most languages or libraries by default simply aren't optimized for any particular kind of input. There are hash functions that work better on strings and hash functions that work better on numbers (length bias). In a perfect world this wouldn't be true, but in the real world it is.

That said, may I recommend MurmurHash3: code.google.com/p/smhasher/wiki/MurmurHash3

Switch your hash tables to this. The performance difference is incredible.


Cuckoo hashing is not only optimal in theory, it is also very fast in practice. The only downside is that insertion has a linear-time worst case, so it may not be the best solution if latency is an issue.


There was a talk somewhere where they were using at most two placements (i.e. two hash tables) and two hash functions. At insert time you add the object to the first hash table; if there is already an object in that bucket, you try the second hash table. If you follow this approach, a very large percentage of the buckets end up full. The remaining set of objects (those that cannot be inserted because both of their buckets are full) is so small that you can keep them in a list or in a "standard" hash table. At some point I had a secondary-storage implementation of this approach and it was indeed good.
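
A rough sketch of that scheme (try the first table, then the second, and spill anything that doesn't fit into a small stash; the names and sizes here are made up):

    class TwoTableHashWithStash:
        """Two hash tables tried in order; keys that fit in neither go to a small stash."""

        def __init__(self, size=1024):
            self.size = size
            self.table1 = [None] * size
            self.table2 = [None] * size
            self.stash = {}  # the small "remaining objects" set from the comment

        def _h1(self, key):
            return hash(key) % self.size

        def _h2(self, key):
            return hash((key, 1)) % self.size

        def insert(self, key, value):
            i1 = self._h1(key)
            if self.table1[i1] is None or self.table1[i1][0] == key:
                self.table1[i1] = (key, value)
                return
            i2 = self._h2(key)
            if self.table2[i2] is None or self.table2[i2][0] == key:
                self.table2[i2] = (key, value)
                return
            self.stash[key] = value  # both buckets full: fall back to the stash

        def lookup(self, key):
            for entry in (self.table1[self._h1(key)], self.table2[self._h2(key)]):
                if entry is not None and entry[0] == key:
                    return entry[1]
            return self.stash.get(key)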


"Linear-time worst case" is different from "amortized constant time", which is what cuckoo hashing provides.

Almost all the containers we use on a regular basis provide only amortized constant time inserts.


Linear-time worst case and amortized constant time are not mutually exclusive. You are right that most containers we use have amortized constant-time inserts, but they also have linear worst-case complexity. Most of the time this is because an internal array has to be resized and the contents have to be copied over to the new, enlarged array.
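
Concretely, the linear-time step is the grow-and-copy inside a doubling array; a minimal sketch:

    class GrowableArray:
        """Append is amortized O(1), but an append that triggers a resize is O(n)."""

        def __init__(self):
            self.capacity = 4
            self.length = 0
            self.data = [None] * self.capacity

        def append(self, item):
            if self.length == self.capacity:
                self.capacity *= 2                   # the occasional linear-time step:
                new_data = [None] * self.capacity    # allocate a bigger array
                new_data[:self.length] = self.data   # and copy everything over
                self.data = new_data
            self.data[self.length] = item
            self.length += 1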


No, they're not mutually exclusive. But what actually matters in most software is "amortized constant time": linear time worst case only matters in hard real-time systems which can't tolerate any deviation from the expected runtime.

Even data structures like binary trees can end up with very bad worst cases unless great care is taken in allocating their nodes.


The linear-time behavior of cuckoo hashing is not triggered only when the table has to be enlarged, as with standard containers (say, std::vector or a Python list): a linear number of keys can be "cuckooed" out of their slots by a single insertion even when the load factor is under the capacity threshold.

In standard containers, if you pre-define the capacity (say, with std::vector::reserve), all insertions are guaranteed to be constant time. Cuckoo hashing can't give this guarantee.
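
For reference, the displacement chain under discussion is the kick-out loop in a textbook cuckoo insert; a minimal sketch (Python's built-in hash salted with a random seed stands in for real hash functions):

    import random

    class CuckooHash:
        """Two-table cuckoo hashing: lookups probe exactly two slots."""

        MAX_KICKS = 32  # give up and rehash after this many displacements

        def __init__(self, size=16):
            self.size = size
            self.tables = [[None] * size, [None] * size]
            self.seeds = [random.randrange(2**32), random.randrange(2**32)]

        def _slot(self, t, key):
            return hash((self.seeds[t], key)) % self.size

        def lookup(self, key):
            for t in (0, 1):
                entry = self.tables[t][self._slot(t, key)]
                if entry is not None and entry[0] == key:
                    return entry[1]
            return None

        def insert(self, key, value):
            for t in (0, 1):  # update in place if the key is already present
                i = self._slot(t, key)
                if self.tables[t][i] is not None and self.tables[t][i][0] == key:
                    self.tables[t][i] = (key, value)
                    return
            entry, t = (key, value), 0
            for _ in range(self.MAX_KICKS):
                i = self._slot(t, entry[0])
                entry, self.tables[t][i] = self.tables[t][i], entry  # kick out the occupant
                if entry is None:
                    return
                t = 1 - t  # the evicted key must go to its slot in the other table
            self._rehash(entry)  # chain too long: bigger tables, fresh hash functions

        def _rehash(self, pending):
            old = [e for table in self.tables for e in table if e is not None]
            old.append(pending)
            self.__init__(self.size * 2)
            for k, v in old:
                self.insert(k, v)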


I'm unaware of any hash table implementations which successfully insert elements in absolute constant time in the event of collisions.


Chained hash tables have constant time insertion, but I'm not sure about the amortized lookup time (the worst case is linear)


Chaining requires a memory allocation and a list insertion on collisions - is it fair to call that absolute constant time? The probability of this happening depends on the number of elements in the container.


Pooling memory for allocating links turns the allocation into amortized constant time.

List insertion is O(1) if you insert it at the head, and while moving the most-recently-accessed link to the head of the chain doesn't change the worst case behavior, it greatly improves the average in practice (and is dead easy).
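
A small sketch of that chaining scheme (insert-at-head plus move-to-front on lookup; the memory pool is left out and duplicate keys aren't handled):

    class Node:
        __slots__ = ("key", "value", "next")

        def __init__(self, key, value, next_node=None):
            self.key, self.value, self.next = key, value, next_node

    class ChainedHash:
        """Separate chaining with O(1) insert-at-head and move-to-front on lookup."""

        def __init__(self, buckets=256):
            self.buckets = [None] * buckets

        def insert(self, key, value):
            i = hash(key) % len(self.buckets)
            self.buckets[i] = Node(key, value, self.buckets[i])  # prepend: constant time

        def lookup(self, key):
            i = hash(key) % len(self.buckets)
            prev, node = None, self.buckets[i]
            while node is not None:
                if node.key == key:
                    if prev is not None:  # move-to-front: helps the average case
                        prev.next = node.next
                        node.next = self.buckets[i]
                        self.buckets[i] = node
                    return node.value
                prev, node = node, node.next
            return None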


As silentbycicle said, list insertion at the head is constant time.

About allocation, I agree with you that it is "slow", but all reasonable computation models count a memory allocation as a constant-time operation. If we want to go down that road and also measure memory operations, even random memory access would not be worst-case constant time: the MMU has to map the page to its physical location, and this is usually a logarithmic or amortized constant-time operation.


Yeah, allocation is usually included in the constant factors. Keep them in mind, though - things like locality can have a big impact.


You are kidding, right? The name "chain" itself suggests linear worst-case behavior.


"insertion".


My fault. Just saw constant time and chain in one sentence and panicked ;)


Well, there is perfect hashing [1]. Of course, it only works under several assumptions. And you pay for it in terms of the time it takes to construct the hash function, or in terms of space.

[1] http://en.wikipedia.org/wiki/Perfect_hash_function


An interesting algorithm, and an engaging presentation. If only more academic writing was this accessible!


Is an open-source implementation of this data structure available somewhere?



I recently did an exercise on cuckoo hashing at http://programmingpraxis.com/2011/02/01/cuckoo-hashing/.


I loved that article. It was very succinct.


I coded a cuckoo-hashed hash table for a class once, and recently had a lecture on cuckoo hashing given to my class by Rasmus Pagh himself (the inventor of cuckoo hashing). It's really surprising how complex the probability theory behind cuckoo hashing is (and how tied it is to graph theory and bipartite matchings).

The amazing part is that not only is access O(1); the expected insertion time when there is no rehashing is also O(1), and rehashing occurs infrequently.


"WHY ELSE IS THIS COOL?" seems a very casual heading for an academic article.


It looks like it was a workshop paper, from the 7th Workshop on Distributed Data and Structures. I don't know much about that particular workshop, but workshops often aim for a more informal atmosphere, since one of their main roles is fostering discussion (incl. of works in progress) and bringing communities of researchers together to discuss things.


I like it when academic papers include some personality or touches of informality (in headings or examples, say), as long as the required rigor is not sacrificed elsewhere. That said, "why else is this cool?" is probably a bit further than I'd go.


I dig--if you're going to use an informal expression and risk being completely dated within half a decade, why not pre-empt the retro wave with WHY ELSE IS THIS GROOVY?


Thought exactly the same. Define a "cool" way of doing things.


It's been my experience that the O(1) was on average, not worst case.


Cuckoo hashing has O(1) worst-case access time, and O(1) "average" (amortized, to be correct) insert time.


OK. That makes sense. If O(1) is worst case, what could be better :)


Well, smaller constant factors could be better :)


Mitzenmacher, a co-author of Probability and Computing, has an interesting survey on cuckoo hashing. In it there are a number of open research problems that, for those of you with an interest, are worth looking over. The 7th is very interesting to me: optimal ways to maintain a hash table in a parallel computing environment.

http://www.eecs.harvard.edu/~michaelm/postscripts/esa2009.pd...


Every time I've tried comparing cuckoo hashing vs traditional hash algorithms in practice, the time taken to compute the additional hash functions outweighs any gains in performance.

Counter-intuitively, I've also noticed in many cases that using binary search over sorted elements in contiguous memory is actually faster than using a hash table at all.

Has anyone else found this?
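
For context, the sorted-array alternative mentioned above is roughly this (a sketch: keys kept sorted in contiguous memory, looked up with binary search):

    import bisect

    keys = []    # kept sorted
    values = []  # parallel to keys

    def put(key, value):
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            values[i] = value
        else:
            keys.insert(i, key)      # O(n) insert, but lookups stay cache-friendly
            values.insert(i, value)

    def get(key, default=None):
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        return default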


> the time taken to compute the additional hash functions outweighs any gains in performance.

Sounds like you're using the wrong hash functions.

Are you using secure cryptographic hash functions perchance? (such as MD5, SHA, etc) Because they're not intended for use in data structures.

Most data structure algorithms just require a hash function with good avalanche behaviour and a statistically even bit dispersion. The FNV hash (http://www.isthe.com/chongo/tech/comp/fnv/) will do this for you with just a MUL and an XOR per byte, which is (rough guess) at least 100 times faster than SHA. It's super-effective!
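
To illustrate, FNV-1a really is just an XOR and a multiply per byte; a sketch of the 64-bit variant (constants from the page linked above):

    FNV64_OFFSET_BASIS = 0xcbf29ce484222325
    FNV64_PRIME = 0x100000001b3

    def fnv1a_64(data: bytes) -> int:
        h = FNV64_OFFSET_BASIS
        for byte in data:
            h ^= byte
            h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep it to 64 bits
        return h

    # e.g. as a bucket index: fnv1a_64(b"some key") % table_size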


It very much depends on your hash table implementation, but I'm betting that your hash table is storing pointers to objects stored elsewhere. In this case, an array of elements in contiguous memory will be much faster because of locality. It's sometimes surprising how badly most of the data structures we think about treat caches.


I almost always prefer some kind of balanced tree to a hash table, because no matter the original specifications, sometime during the development of a program I usually need the elements in order.



