The big thing about this paper is deletions. This is the first filter I've seen that seems both easy to understand and easy to implement while still allowing deletions. That it is close to Bloom filters in performance is a bonus.
Here's my attempt at summarizing the paper with the hope that someone who understands it better can correct me.
Cuckoo hashes are a type of closed hash table where any given item has a fixed number of places where it can be located. Unlike other closed hashes, if there is no room for an item in one of its possible locations, an earlier item is evicted (in the style of a cuckoo kicking an egg out of a stranger's nest and quickly laying its own) and sent to its alternate spot. This repeats until an empty slot is found, or some large number of attempts (500 in the paper) have been made. If no slot is found, the table can be rehashed to try again, or enlarged to make more space available.
Initially cuckoo hashes were proposed to have 2 slots for each item, corresponding to 2 different hashes. Over time, for greater space efficiency (load), they have evolved to allow a variable number of hashes >= 2 and a variable number of slots per bucket >= 1. This paper suggests that 2 hashes and 4 slots per bucket is a good default (a 2,4 cuckoo hash). One of the innovations in this paper is that instead of using 2 independent hashes (the theoretical best case), they use the output of a single hash function plus a short saved "signature" (fingerprint) that can be hashed and XORed with the current bucket index to find the alternate bucket. In this manner, the signature alone can be used to move an item to its alternate location without actually knowing or saving the original value.
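The XOR trick can be sketched in a few lines. This is my own illustrative version (the hash function, fingerprint size, and table size are all my assumptions, not the paper's exact choices); the key property is that XOR is its own inverse, so either bucket index plus the stored fingerprint recovers the other bucket:

```python
import hashlib

NUM_BUCKETS = 1 << 16  # power of two, so XOR-then-mod stays well behaved

def h(data: bytes) -> int:
    # Stand-in hash function (any decent hash would do here).
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def fingerprint(item: bytes, bits: int = 8) -> int:
    # Short signature stored in the table instead of the item itself.
    fp = h(b"fp" + item) & ((1 << bits) - 1)
    return fp or 1  # reserve 0 to mean "empty slot"

def indices(item: bytes):
    fp = fingerprint(item)
    i1 = h(item) % NUM_BUCKETS
    # Alternate bucket, derived from i1 and the fingerprint alone:
    i2 = (i1 ^ h(bytes([fp]))) % NUM_BUCKETS
    return i1, i2, fp

i1, i2, fp = indices(b"hello")
# From either bucket, XORing with hash(fp) yields the other bucket,
# so an entry can be kicked to its alternate home without the key.
assert (i2 ^ h(bytes([fp]))) % NUM_BUCKETS == i1
```

Note that this only round-trips cleanly because the bucket count is a power of two; the modulo then just keeps the low bits of the XOR.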
This 'cuckoo filter' can be queried (Lookup) to see if an item was previously added (Insert). If an item is present, a 'yes' answer is guaranteed. But if an item is not present, there is a small chance that a false positive 'yes' answer will be returned. The number of bits required to store each signature is comparable to the number of bits required for a Bloom filter with a comparable rate of false positives. Through various optimizations (such as storing the signatures in a bucket in a known order), the number of bits per item stored drops to slightly below that of a Bloom filter.
Unlike a Bloom filter, deletions are possible: a Delete removes one copy of the item's signature. Duplicate inserts are handled by storing multiple copies of the same signature. But since there are only a fixed number of slots where a given value can land, only a limited number of copies can be stored: 2 buckets * 4 slots = 8 in this case, minus any colliding entries already occupying those slots. If more inserts than that are attempted, either the insert must fail or the delete count will fall out of sync.
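Putting the pieces above together, here is a toy (2,4) cuckoo filter with Insert, Lookup, and Delete. This is a sketch under my own assumptions (8-bit fingerprints, SHA-256 as the hash, a power-of-two bucket count), not the paper's optimized implementation:

```python
import hashlib
import random

class CuckooFilter:
    """Toy (2,4) cuckoo filter: 2 candidate buckets, 4 slots each."""

    def __init__(self, num_buckets=1 << 10, slots=4, max_kicks=500):
        self.n = num_buckets          # power of two, so XOR stays in range
        self.slots = slots
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(self.n)]

    def _hash(self, data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _indices(self, item: bytes):
        fp = (self._hash(b"fp" + item) & 0xFF) or 1  # 8-bit, nonzero
        i1 = self._hash(item) % self.n
        i2 = (i1 ^ self._hash(bytes([fp]))) % self.n
        return fp, i1, i2

    def insert(self, item: bytes) -> bool:
        fp, i1, i2 = self._indices(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a random resident and relocate it,
        # cuckoo-style, up to max_kicks times.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ self._hash(bytes([fp]))) % self.n  # evictee's alternate
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        return False  # table too full; caller must resize or rehash

    def lookup(self, item: bytes) -> bool:
        fp, i1, i2 = self._indices(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: bytes) -> bool:
        fp, i1, i2 = self._indices(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)  # removes exactly one copy
                return True
        return False
```

Note how `delete` removes a single copy of the signature, which is why it is only safe to delete items that were actually inserted.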
Performance-wise, at the same level of space efficiency, insert speeds are better than a standard Bloom filter when the table is mostly empty (low load), but significantly worse at high load, when the table fills up and many evictions are needed per insert. But if trading some space is acceptable (staying at ~50% load), the cuckoo filter inserts much faster than its Bloom counterpart. Lookup speeds for existing keys are 2-3x the speed of a standard Bloom filter, and about 1.5x that of the "blocked" Bloom filter variant. Negative lookups (key not present) are about 2x faster than Bloom at high load, but about 3/4 the speed at low load.
Overall, it's a nice technique that should probably be used more often.
1. It has O(1) lookups (it has to probe at most 2 constant-size buckets) and O(1) amortised insertions (the probability of needing more than k swaps to find an empty slot decreases exponentially with k).
2. It's geared for low false positives, 3% or lower.
3. It does have false positives on lookups: signatures are lossy.
4. It has comparable space requirements to Bloom filters for the same error rate and set (not universe) cardinality.
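Point 4 can be made concrete with some back-of-the-envelope arithmetic. As I recall the formulas (an optimal Bloom filter needs about 1.44 * log2(1/eps) bits per item; a (2,4) cuckoo filter needs fingerprints of about log2(1/eps) + log2(2b) bits, amortized over a load factor around 0.95, with semi-sorting saving roughly one bit per entry):

```python
import math

eps = 0.01          # target false-positive rate (1%)
b, alpha = 4, 0.95  # slots per bucket and achievable load for a (2,4) table

bloom_bits = 1.44 * math.log2(1 / eps)                         # ~9.6 bits/item
cuckoo_bits = (math.log2(1 / eps) + math.log2(2 * b)) / alpha  # ~10.2 bits/item
semisort_bits = cuckoo_bits - 1 / alpha                        # ~9.1 bits/item
```

So at a 1% error rate the plain cuckoo filter is slightly above Bloom, and the semi-sorted variant slightly below, which matches the "slightly below a Bloom filter" claim above.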
def insert(x):
    # `hashes` is a list of k independent hash functions and
    # `bloom_filter` a bit array sized to the filter.
    for h in hashes:
        bloom_filter[h(x)] = True
The downside is that your false positive rate will increase as you insert elements. But, on the other hand, you get the guarantee of no false negatives.
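The snippet above can be made self-contained and paired with the corresponding lookup, which makes the no-false-negatives guarantee visible: an insert sets every bit its hashes select, so a later lookup of the same key finds them all set. The hash construction and sizes here are my own illustrative choices:

```python
import hashlib

M = 1 << 16                   # bit-array size (illustrative)
bloom_filter = [False] * M

def _hashes(x: bytes, k: int = 4):
    # Derive k index functions from one base hash (illustrative choice).
    for i in range(k):
        yield int.from_bytes(hashlib.sha256(bytes([i]) + x).digest()[:8],
                             "big") % M

def insert(x: bytes):
    for idx in _hashes(x):
        bloom_filter[idx] = True

def lookup(x: bytes) -> bool:
    # All k bits set => "probably present"; any bit clear => surely absent.
    return all(bloom_filter[idx] for idx in _hashes(x))

insert(b"cat")
assert lookup(b"cat")  # an inserted key can never be reported absent
```

As more keys are inserted, more bits flip to True and the chance that an absent key happens to hit only set bits (a false positive) grows, which is exactly the downside described above.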
This cuckoo filter has no false negatives (though it does have false positives), but their Insert() function clearly can hit a case where it returns Failure.
Bloom filters effectively represent the infinite set once the bit array is all ones. You can always add more, and everything will return as probably already in there.