Hacker News new | past | comments | ask | show | jobs | submit login
Cuckoo Filter: Practically Better Than Bloom (2014) [pdf] (eecs.harvard.edu)
80 points by Tomte on Oct 24, 2022 | hide | past | favorite | 13 comments



A mandatory comment that reducing random memory fetches is often way more important to perf than O() cost. Random memory read costs a lot.

https://blog.cloudflare.com/when-bloom-filters-dont-bloom/


We have a big collective blindspot about the implications of order of complexity when you get down to O(nlogn), and to a lesser extent anything in the neighborhood of O(n√n), which is rarer but does happen.

A logn slowdown versus a linear speedup of two orders of magnitude - or more - is a set of lines that don't cross for any value of n any of us will see in our lifetimes, even if you're highly placed at a FAANG. The entire output of data from CERN throughout its entire history is a rounding error compared to n = 2^100.

And the fact of the matter is that long before n = 2^100, all of those constant time calculations we count on in Knuth's model of orders of complexity go from O(1) time to O(logn) anyway, so there is no such thing as an O(1) or O(n) algorithm when you have to represent everything as bignums.


This is one of the things Cuckoo filters do well: they need either one or two memory accesses per query, compared to a Bloom filter which needs k accesses (where k is typically greater than 5).


True. I suspect improved filter performance is most useful for database development where preventing a read from unnecessarily hitting disk is the name of the game.


This was an excellent post, learned a lot, thanks for linking it!


related and useful: "Ribbon filter: Practically smaller than Bloom and Xor" (https://arxiv.org/abs/2103.02515)

This paper does a good job of showing the space/time tradeoffs of various approximate sets.


Similarly, there is Binary Fuse Filters (https://arxiv.org/abs/2201.01174) [1]

The paper includes comparisons with Ribbon filter

[1] Graf, T. M., & Lemire, D. (2022). Binary Fuse Filters: Fast and Smaller Than Xor Filters. arXiv. https://doi.org/10.1145/3510449


Thanks for sharing! Ribbon filters, xor filters, these binary fuse filters, are all immutable sets where the keys are known at construction time. But there are many classes of problems where that's true.


It's not really a very equal comparison to compare something that supports dynamic updates and which doesn't.

If you're willing to compare arbitrary operating behavior, a simple entropy coded bitset will achieve the information theoretic bound -- it just isn't updatable or queryable on the fly. :P


But these other filters are dynamically queryable on the fly. I think all these keys also support dynamic additions. What they don’t support (and neither does Bloom afaik) is removal of a key. For the kinds of typical applications for these, it tends to fit well enough.


I’m wrong. You need to give the entire set of keys upfront during construction for some of these filters.


Yeah, I believe ~all of the ones with greater space efficiency than cuckoo filters (or blocked bloom, depending on the false positive rate) don't have incremental insert.


I know there's scalable bloom filters, but are there any alternative data structure like bloom/cuckoo filters that allows for dynamic growth of inserted elements? Ie, I would like to add hundreds of new elements every day and let the filter grow it's capacity automagically.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: