Cuckoo Filter: Practically Better Than Bloom (2014) [pdf]

majke · on Oct 24, 2022

A mandatory comment that reducing random memory fetches is often way more important to perf than O() cost. Random memory read costs a lot.

https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

hinkley · on Oct 25, 2022

We have a big collective blindspot about the implications of order of complexity when you get down to O(nlogn), and to a lesser extent anything in the neighborhood of O(n√n), which is rarer but does happen.

A logn slowdown versus a linear speedup of two orders of magnitude - or more - is a set of lines that don't cross for any value of n any of us will see in our lifetimes, even if you're highly placed at a FAANG. The entire output of data from CERN throughout its entire history is a rounding error compared to n = 2^100.

And the fact of the matter is that long before n = 2^100, all of those constant time calculations we count on in Knuth's model of orders of complexity go from O(1) time to O(logn) anyway, so there is no such thing as an O(1) or O(n) algorithm when you have to represent everything as bignums.

vlmutolo · on Oct 25, 2022

This is one of the things Cuckoo filters do well: they need either one or two memory accesses per query, compared to a Bloom filter which needs k accesses (where k is typically greater than 5).

gsaussy · on Oct 25, 2022

True. I suspect improved filter performance is most useful for database development where preventing a read from unnecessarily hitting disk is the name of the game.

pickledish · on Oct 24, 2022

This was an excellent post, learned a lot, thanks for linking it!

awmarthur · on Oct 24, 2022

related and useful: "Ribbon filter: Practically smaller than Bloom and Xor" (https://arxiv.org/abs/2103.02515)

This paper does a good job of showing the space/time tradeoffs of various approximate sets.

Genbox · on Oct 24, 2022

Similarly, there is Binary Fuse Filters (https://arxiv.org/abs/2201.01174) [1]

The paper includes comparisons with Ribbon filter

[1] Graf, T. M., & Lemire, D. (2022). Binary Fuse Filters: Fast and Smaller Than Xor Filters. arXiv. https://doi.org/10.1145/3510449

slashdev · on Oct 25, 2022

Thanks for sharing! Ribbon filters, xor filters, these binary fuse filters, are all immutable sets where the keys are known at construction time. But there are many classes of problems where that's true.

nullc · on Oct 24, 2022

It's not really a very equal comparison to compare something that supports dynamic updates and which doesn't.

If you're willing to compare arbitrary operating behavior, a simple entropy coded bitset will achieve the information theoretic bound -- it just isn't updatable or queryable on the fly. :P

vlovich123 · on Oct 25, 2022

But these other filters are dynamically queryable on the fly. I think all these keys also support dynamic additions. What they don’t support (and neither does Bloom afaik) is removal of a key. For the kinds of typical applications for these, it tends to fit well enough.

vlovich123 · on Oct 26, 2022

I’m wrong. You need to give the entire set of keys upfront during construction for some of these filters.

nullc · on Oct 26, 2022

Yeah, I believe ~all of the ones with greater space efficiency than cuckoo filters (or blocked bloom, depending on the false positive rate) don't have incremental insert.

lmas · on Oct 25, 2022

I know there's scalable bloom filters, but are there any alternative data structure like bloom/cuckoo filters that allows for dynamic growth of inserted elements? Ie, I would like to add hundreds of new elements every day and let the filter grow it's capacity automagically.