Hash tables with O(1) worst-case lookup and space efficiency [pdf] (ru.is)
118 points by tianyicui on Feb 20, 2011 | 43 comments



I've had some luck turning a cuckoo hash into a sort of LRU cache. Whenever you do an insert, replace the older item. IIRC, everything else stayed the same. Using more than 2 hashes really helped.
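
A minimal sketch of that idea (two candidate slots per key, evict the older entry when both are full; the class name, the timestamps, and the second hash function are assumptions, not the commenter's actual code):

    import time

    class CuckooishCache:
        """Two-choice cache: on a collision, replace the older of the two candidates."""

        def __init__(self, size=1024):
            self.size = size
            self.slots = [None] * size  # each slot holds (key, value, last_touched)

        def _positions(self, key):
            # Two independent slot indices; these stand in for real hash functions.
            return hash(key) % self.size, hash((key, 0x9e3779b9)) % self.size

        def get(self, key):
            for i in self._positions(key):
                entry = self.slots[i]
                if entry is not None and entry[0] == key:
                    self.slots[i] = (key, entry[1], time.monotonic())  # refresh age
                    return entry[1]
            return None

        def put(self, key, value):
            now = time.monotonic()
            p1, p2 = self._positions(key)
            for i in (p1, p2):  # overwrite an existing entry or take an empty slot
                entry = self.slots[i]
                if entry is None or entry[0] == key:
                    self.slots[i] = (key, value, now)
                    return
            # Both candidates occupied: replace the older one, as described above.
            older = p1 if self.slots[p1][2] <= self.slots[p2][2] else p2
            self.slots[older] = (key, value, now)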


I'd been wondering how best to make an LRU cache, and now you come along and give a really good solution. Nice!


That is not LRU, it's "replace with some other random thing some time, based on a hash function".


This is really clever, thanks for the hack.


That's not a hack. It's an algorithm.


Ooh so a hack is sort of an eigentransform of an algorithm in the sense that a hacked algorithm is still an algorithm.


I'd just like to point out that the worse your hash algorithm is, the better cuckoo hashing compares to traditional hashing... therefore, study your hash functions!

The ones that come with most languages or libraries by default simply aren't optimized for any particular kind of input. There are hash functions that work better on strings and hash functions that work better on numbers (length bias). In a perfect world this wouldn't be true, but in the real world it is.

That said, may I recommend MurmurHash3: code.google.com/p/smhasher/wiki/MurmurHash3

Switch your hash tables to this. The performance difference is incredible.


Cuckoo hashing is not only optimal in theory, it is also very fast in practice. The only downside is that insertion has a linear-time worst case, so it may not be the best solution if latency is an issue.


There was a talk somewhere where they were using at most two placements (i.e. two hash tables) and two hash functions. At insert time you add the object to the first hash table; if there is already an object in that bucket, you try the second hash table. If you follow this approach, a very large percentage of the buckets end up full. The remaining set of objects (those that cannot be inserted because both of their buckets are full) is so small that you can keep them in a list or in a "standard" hash table. At some point I had a secondary-storage implementation of this approach and it was indeed good.
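
A rough sketch of that scheme (try the first table, then the second, and spill anything that doesn't fit into a small stash; the names and sizes here are made up):

    class TwoTableHashWithStash:
        """Two hash tables tried in order; keys that fit in neither go to a small stash."""

        def __init__(self, size=1024):
            self.size = size
            self.table1 = [None] * size
            self.table2 = [None] * size
            self.stash = {}  # the small "remaining objects" set from the comment

        def _h1(self, key):
            return hash(key) % self.size

        def _h2(self, key):
            return hash((key, 1)) % self.size

        def insert(self, key, value):
            i1 = self._h1(key)
            if self.table1[i1] is None or self.table1[i1][0] == key:
                self.table1[i1] = (key, value)
                return
            i2 = self._h2(key)
            if self.table2[i2] is None or self.table2[i2][0] == key:
                self.table2[i2] = (key, value)
                return
            self.stash[key] = value  # both buckets full: fall back to the stash

        def lookup(self, key):
            for entry in (self.table1[self._h1(key)], self.table2[self._h2(key)]):
                if entry is not None and entry[0] == key:
                    return entry[1]
            return self.stash.get(key)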


"Linear-time worst case" is different from "amortized constant time", which is what cuckoo hashing provides.

Almost all the containers we use on a regular basis provide only amortized constant time inserts.


Linear-time worst case and amortized constant time are not mutually exclusive. You are right that most containers we use have amortized constant-time inserts, but they also have linear worst-case complexity. Most of the time this is because an internal array has to be resized and the contents have to be copied over to the new, enlarged array.
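
Concretely, the linear-time step is the grow-and-copy inside a doubling array; a minimal sketch:

    class GrowableArray:
        """Append is amortized O(1), but an append that triggers a resize is O(n)."""

        def __init__(self):
            self.capacity = 4
            self.length = 0
            self.data = [None] * self.capacity

        def append(self, item):
            if self.length == self.capacity:
                self.capacity *= 2                   # the occasional linear-time step:
                new_data = [None] * self.capacity    # allocate a bigger array
                new_data[:self.length] = self.data   # and copy everything over
                self.data = new_data
            self.data[self.length] = item
            self.length += 1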


No, they're not mutually exclusive. But what actually matters in most software is "amortized constant time": linear time worst case only matters in hard real-time systems which can't tolerate any deviation from the expected runtime.

Even data structures like binary trees can end up with very bad worst cases unless great care is taken in allocating their nodes.


The linear-time behavior of cuckoo hashing is not triggered only when the table has to be enlarged, as with standard containers (say, std::vector or a Python list): a linear number of keys can be "cuckooed" out of their slots by a single insertion even when the load factor is under the capacity threshold.

In standard containers, if you pre-define the capacity (say, with std::vector::reserve), all insertions are guaranteed to be constant time. Cuckoo hashing can't give this guarantee.
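
For reference, the displacement chain under discussion is the kick-out loop in a textbook cuckoo insert; a minimal sketch (Python's built-in hash salted with a random seed stands in for real hash functions):

    import random

    class CuckooHash:
        """Two-table cuckoo hashing: lookups probe exactly two slots."""

        MAX_KICKS = 32  # give up and rehash after this many displacements

        def __init__(self, size=16):
            self.size = size
            self.tables = [[None] * size, [None] * size]
            self.seeds = [random.randrange(2**32), random.randrange(2**32)]

        def _slot(self, t, key):
            return hash((self.seeds[t], key)) % self.size

        def lookup(self, key):
            for t in (0, 1):
                entry = self.tables[t][self._slot(t, key)]
                if entry is not None and entry[0] == key:
                    return entry[1]
            return None

        def insert(self, key, value):
            for t in (0, 1):  # update in place if the key is already present
                i = self._slot(t, key)
                if self.tables[t][i] is not None and self.tables[t][i][0] == key:
                    self.tables[t][i] = (key, value)
                    return
            entry, t = (key, value), 0
            for _ in range(self.MAX_KICKS):
                i = self._slot(t, entry[0])
                entry, self.tables[t][i] = self.tables[t][i], entry  # kick out the occupant
                if entry is None:
                    return
                t = 1 - t  # the evicted key must go to its slot in the other table
            self._rehash(entry)  # chain too long: bigger tables, fresh hash functions

        def _rehash(self, pending):
            old = [e for table in self.tables for e in table if e is not None]
            old.append(pending)
            self.__init__(self.size * 2)
            for k, v in old:
                self.insert(k, v)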


I'm unaware of any hash table implementations which successfully insert elements in absolute constant time in the event of collisions.


Chained hash tables have constant time insertion, but I'm not sure about the amortized lookup time (the worst case is linear)


Chaining requires a memory allocation and a list insertion on collisions - is it fair to call that absolute constant time? The probability of this happening depends on the number of elements in the container.


Pooling memory for allocating links turns the allocation into amortized constant time.

List insertion is O(1) if you insert it at the head, and while moving the most-recently-accessed link to the head of the chain doesn't change the worst case behavior, it greatly improves the average in practice (and is dead easy).
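
A small sketch of that chaining scheme (insert-at-head plus move-to-front on lookup; the memory pool is left out and duplicate keys aren't handled):

    class Node:
        __slots__ = ("key", "value", "next")

        def __init__(self, key, value, next_node=None):
            self.key, self.value, self.next = key, value, next_node

    class ChainedHash:
        """Separate chaining with O(1) insert-at-head and move-to-front on lookup."""

        def __init__(self, buckets=256):
            self.buckets = [None] * buckets

        def insert(self, key, value):
            i = hash(key) % len(self.buckets)
            self.buckets[i] = Node(key, value, self.buckets[i])  # prepend: constant time

        def lookup(self, key):
            i = hash(key) % len(self.buckets)
            prev, node = None, self.buckets[i]
            while node is not None:
                if node.key == key:
                    if prev is not None:  # move-to-front: helps the average case
                        prev.next = node.next
                        node.next = self.buckets[i]
                        self.buckets[i] = node
                    return node.value
                prev, node = node, node.next
            return None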


As silentbycicle said, list insertion at the head is constant time.

About allocation, I agree with you that it is "slow", but all reasonable computation models count a memory allocation as a constant-time operation. If we want to go down that road and also measure memory operations, even random memory access would not be worst-case constant time: the MMU has to map the page to its physical location, and this is usually a logarithmic or amortized constant-time operation.


Yeah, allocation is usually included in the constant factors. Keep them in mind, though - things like locality can have a big impact.


You are kidding, right? The name "chain" itself suggests linear worst-case behavior.


"insertion".


My fault. Just saw constant time and chain in one sentence and panicked ;)


Well, there is perfect hashing [1]. Of course, it only works under several assumptions. And you pay for it in terms of the time it takes to construct the hash function, or in terms of space.

[1] http://en.wikipedia.org/wiki/Perfect_hash_function


An interesting algorithm, and an engaging presentation. If only more academic writing was this accessible!


Is an open-source implementation of this data structure available somewhere?



I recently did an exercise on cuckoo hashing at http://programmingpraxis.com/2011/02/01/cuckoo-hashing/.


I loved that article. It was very succinct.


I coded a cuckoo-hashed hash table for a class once, and recently had a lecture on cuckoo hashing given to my class by Rasmus Pagh himself (the inventor of cuckoo hashing). It's really surprising how complex the probability theory behind cuckoo hashing is (and how tied it is to graph theory and bipartite matchings).

The amazing part is that not only is access O(1); the expected insertion time when there is no rehashing is also O(1), and rehashing occurs infrequently.


"WHY ELSE IS THIS COOL?" seems a very casual heading for an academic article.


It looks like it was a workshop paper, from the 7th Workshop on Distributed Data and Structures. I don't know much about that particular workshop, but workshops often aim for a more informal atmosphere, since one of their main roles is fostering discussion (incl. of works in progress) and bringing communities of researchers together to discuss things.


I like it when academic papers include some personality or touches of informality (in headings or examples, say), as long as the required rigor is not sacrificed elsewhere. That said, "why else is this cool?" is probably a bit further than I'd go.


I dig--if you're going to use an informal expression and risk being completely dated within half a decade, why not pre-empt the retro wave with WHY ELSE IS THIS GROOVY?


Thought exactly the same. Define a "cool" way of doing things.


It's been my experience that the O(1) was on average, not worst case.


Cuckoo hashing has O(1) worst-case access time, and O(1) "average" (amortized, to be correct) insert time.


OK. That makes sense. If O(1) is worst case, what could be better :)


Well, smaller constant factors could be better :)


Mitzenmacher, a co-author of Probability and Computing, has an interesting survey on cuckoo hashing. In it there are a number of open research problems that, for those of you with an interest, are worth looking over. The 7th is very interesting to me: optimal ways to maintain a hash table in a parallel computing environment.

http://www.eecs.harvard.edu/~michaelm/postscripts/esa2009.pd...


Every time I've tried comparing cuckoo hashing vs traditional hash algorithms in practice, the time taken to compute the additional hash functions outweighs any gains in performance.

Counter-intuitively, I've also noticed in many cases that using binary search over sorted elements in contiguous memory is actually faster than using a hash table at all.

Has anyone else found this?
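
For context, the sorted-array alternative mentioned above is roughly this (a sketch: keys kept sorted in contiguous memory, looked up with binary search):

    import bisect

    keys = []    # kept sorted
    values = []  # parallel to keys

    def put(key, value):
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            values[i] = value
        else:
            keys.insert(i, key)      # O(n) insert, but lookups stay cache-friendly
            values.insert(i, value)

    def get(key, default=None):
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        return default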


> the time taken to compute the additional hash functions outweighs any gains in performance.

Sounds like you're using the wrong hash functions.

Are you using secure cryptographic hash functions perchance? (such as MD5, SHA, etc) Because they're not intended for use in data structures.

Most data structure algorithms just require a hash function with good avalanche behaviour and a statistically even bit dispersion. The FNV hash (http://www.isthe.com/chongo/tech/comp/fnv/) will do this for you with just a MUL and an XOR per byte, which is (rough guess) at least 100 times faster than SHA. It's super-effective!
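
To illustrate, FNV-1a really is just an XOR and a multiply per byte; a sketch of the 64-bit variant (constants from the page linked above):

    FNV64_OFFSET_BASIS = 0xcbf29ce484222325
    FNV64_PRIME = 0x100000001b3

    def fnv1a_64(data: bytes) -> int:
        h = FNV64_OFFSET_BASIS
        for byte in data:
            h ^= byte
            h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep it to 64 bits
        return h

    # e.g. as a bucket index: fnv1a_64(b"some key") % table_size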


It very much depends on your hash table implementation, but I'm betting that your hash table is storing pointers to objects stored elsewhere. In this case, an array of elements in contiguous memory will be much faster because of locality. It's sometimes surprising how badly most of the data structures we think about treat caches.


I almost always prefer some kind of balanced tree to a hash table, because no matter the original specifications, sometime during the development of a program I usually need the elements in order.



