The big thing about this paper is deletions. This is the first filter I've seen that seems both easy to understand and easy to implement while still allowing deletions. That it is close to Bloom filters in performance is a bonus.
Here's my attempt at summarizing the paper with the hope that someone who understands it better can correct me.
Cuckoo hashes are a type of closed hash table where any given item has a fixed number of places where it can be located. Unlike other closed hashes, if there is no room for an item in one of its possible locations, an earlier item is evicted (in the style of a cuckoo kicking an egg out of a stranger's nest and quickly laying its own) and sent to its alternate spot. This repeats until an empty slot is found, or some large number of attempts (500 in the paper) have been made. If no slot is found, the table can be rehashed to try again, or enlarged to make more space available.
Initially cuckoo hashes were proposed to have 2 slots for each item, corresponding to 2 different hashes. Over time, for greater space efficiency (load), they have evolved to allow a variable number of hashes >= 2 and a variable number of slots per bucket >= 1. This paper suggests that 2 hashes and 4 slots per bucket is a good default (a 2,4 cuckoo hash). One of the innovations in this paper is that instead of using 2 independent hashes (the theoretical best case), they use the output of a single hash function plus a short saved "signature" (fingerprint) that can be hashed and XORed with the current bucket index to find the alternate bucket. In this manner, the signature alone can be used to move an item to its alternate location without actually knowing or saving the original value.
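The XOR trick can be sketched in a few lines. This is my own illustrative version (the hash function, fingerprint size, and table size are all my assumptions, not the paper's exact choices); the key property is that XOR is its own inverse, so either bucket index plus the stored fingerprint recovers the other bucket:

```python
import hashlib

NUM_BUCKETS = 1 << 16  # power of two, so XOR-then-mod stays well behaved

def h(data: bytes) -> int:
    # Stand-in hash function (any decent hash would do here).
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def fingerprint(item: bytes, bits: int = 8) -> int:
    # Short signature stored in the table instead of the item itself.
    fp = h(b"fp" + item) & ((1 << bits) - 1)
    return fp or 1  # reserve 0 to mean "empty slot"

def indices(item: bytes):
    fp = fingerprint(item)
    i1 = h(item) % NUM_BUCKETS
    # Alternate bucket, derived from i1 and the fingerprint alone:
    i2 = (i1 ^ h(bytes([fp]))) % NUM_BUCKETS
    return i1, i2, fp

i1, i2, fp = indices(b"hello")
# From either bucket, XORing with hash(fp) yields the other bucket,
# so an entry can be kicked to its alternate home without the key.
assert (i2 ^ h(bytes([fp]))) % NUM_BUCKETS == i1
```

Note that this only round-trips cleanly because the bucket count is a power of two; the modulo then just keeps the low bits of the XOR.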
This 'cuckoo filter' can be queried (Lookup) to see if an item was previously added (Insert). If an item is present, a 'yes' answer is guaranteed. But if an item is not present, there is a small chance that a false positive 'yes' answer will be returned. The number of bits required to store each signature is comparable to the number of bits required for a Bloom filter with a comparable rate of false positives. Through various optimizations (such as storing the signatures in a bucket in a known order), the number of bits per item stored drops to slightly below that of a Bloom filter.
Unlike a Bloom filter, deletions are possible: a Delete removes one copy of the item's signature. Duplicate inserts are handled by storing multiple copies of the same signature. But since there are only a fixed number of slots where a given value can land, only a limited number of copies can be stored: 2 buckets * 4 slots = 8 in this case, minus any colliding entries already occupying those slots. If more inserts than that are attempted, either the insert must fail or the delete count will fall out of sync.
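Putting the pieces above together, here is a toy (2,4) cuckoo filter with Insert, Lookup, and Delete. This is a sketch under my own assumptions (8-bit fingerprints, SHA-256 as the hash, a power-of-two bucket count), not the paper's optimized implementation:

```python
import hashlib
import random

class CuckooFilter:
    """Toy (2,4) cuckoo filter: 2 candidate buckets, 4 slots each."""

    def __init__(self, num_buckets=1 << 10, slots=4, max_kicks=500):
        self.n = num_buckets          # power of two, so XOR stays in range
        self.slots = slots
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(self.n)]

    def _hash(self, data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _indices(self, item: bytes):
        fp = (self._hash(b"fp" + item) & 0xFF) or 1  # 8-bit, nonzero
        i1 = self._hash(item) % self.n
        i2 = (i1 ^ self._hash(bytes([fp]))) % self.n
        return fp, i1, i2

    def insert(self, item: bytes) -> bool:
        fp, i1, i2 = self._indices(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a random resident and relocate it,
        # cuckoo-style, up to max_kicks times.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ self._hash(bytes([fp]))) % self.n  # evictee's alternate
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        return False  # table too full; caller must resize or rehash

    def lookup(self, item: bytes) -> bool:
        fp, i1, i2 = self._indices(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: bytes) -> bool:
        fp, i1, i2 = self._indices(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)  # removes exactly one copy
                return True
        return False
```

Note how `delete` removes a single copy of the signature, which is why it is only safe to delete items that were actually inserted.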
Performance-wise, at the same level of space efficiency, insert speeds are better than a standard Bloom filter when the table is mostly empty (low load), but significantly worse at high load, when the table fills up and many evictions are needed per insert. But if trading some space is acceptable (staying at ~50% load), the cuckoo filter inserts much faster than its Bloom counterpart. Lookup speeds for existing keys are 2-3x the speed of a standard Bloom filter, and about 1.5x that of the "blocked" Bloom filter variant. Negative lookups (key not present) are about 2x faster than Bloom at high load, but about 3/4 the speed at low load.
Overall, it's a nice technique that should probably be used more often.
1. It has O(1) lookups (it has to probe at most 2 constant-size buckets) and O(1) amortised insertions (the probability of needing more than k swaps to find an empty slot decreases exponentially with k).
2. It's geared for low false positives, 3% or lower.
3. It does have false positives on lookups: signatures are lossy.
4. It has comparable space requirements to Bloom filters for the same error rate and set (not universe) cardinality.
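Point 4 can be made concrete with some back-of-the-envelope arithmetic. As I recall the formulas (an optimal Bloom filter needs about 1.44 * log2(1/eps) bits per item; a (2,4) cuckoo filter needs fingerprints of about log2(1/eps) + log2(2b) bits, amortized over a load factor around 0.95, with semi-sorting saving roughly one bit per entry):

```python
import math

eps = 0.01          # target false-positive rate (1%)
b, alpha = 4, 0.95  # slots per bucket and achievable load for a (2,4) table

bloom_bits = 1.44 * math.log2(1 / eps)                         # ~9.6 bits/item
cuckoo_bits = (math.log2(1 / eps) + math.log2(2 * b)) / alpha  # ~10.2 bits/item
semisort_bits = cuckoo_bits - 1 / alpha                        # ~9.1 bits/item
```

So at a 1% error rate the plain cuckoo filter is slightly above Bloom, and the semi-sorted variant slightly below, which matches the "slightly below a Bloom filter" claim above.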
def insert(x):
    # `hashes` is a list of k independent hash functions and
    # `bloom_filter` a bit array sized to the filter.
    for h in hashes:
        bloom_filter[h(x)] = True
The downside is that your false positive rate will increase as you insert elements. But, on the other hand, you get the guarantee of no false negatives.
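The snippet above can be made self-contained and paired with the corresponding lookup, which makes the no-false-negatives guarantee visible: an insert sets every bit its hashes select, so a later lookup of the same key finds them all set. The hash construction and sizes here are my own illustrative choices:

```python
import hashlib

M = 1 << 16                   # bit-array size (illustrative)
bloom_filter = [False] * M

def _hashes(x: bytes, k: int = 4):
    # Derive k index functions from one base hash (illustrative choice).
    for i in range(k):
        yield int.from_bytes(hashlib.sha256(bytes([i]) + x).digest()[:8],
                             "big") % M

def insert(x: bytes):
    for idx in _hashes(x):
        bloom_filter[idx] = True

def lookup(x: bytes) -> bool:
    # All k bits set => "probably present"; any bit clear => surely absent.
    return all(bloom_filter[idx] for idx in _hashes(x))

insert(b"cat")
assert lookup(b"cat")  # an inserted key can never be reported absent
```

As more keys are inserted, more bits flip to True and the chance that an absent key happens to hit only set bits (a false positive) grows, which is exactly the downside described above.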
This cuckoo filter has no false negatives (though it does have false positives), but their Insert() function clearly can hit a case where it returns Failure.
Bloom filters effectively represent the infinite set once the bit array is all ones. You can always add more, and everything will return as probably already in there.