We have a big collective blind spot about the implications of asymptotic complexity once you get down to O(n log n), and to a lesser extent anything in the neighborhood of O(n√n), which is rarer but does happen.
A log n slowdown versus a linear speedup of two orders of magnitude - or more - is a pair of lines that don't cross for any value of n any of us will see in our lifetimes, even if you're highly placed at a FAANG. The entire output of data from CERN throughout its entire history is a rounding error compared to n = 2^100.
And the fact of the matter is that long before n = 2^100, all of those constant-time operations we count on in Knuth's model of complexity go from O(1) to O(log n) anyway, because once values no longer fit in a machine word you have to represent everything as bignums - so at that scale there is no such thing as a true O(1) or O(n) algorithm.
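The lines-don't-cross claim above can be sanity-checked with a toy cost model. The unit constant on the n log n side and the 100x penalty on the linear side are illustrative assumptions, not measurements:

```python
# Toy cost model: an O(n log n) algorithm versus an O(n) algorithm whose
# constant factor is 100x worse (i.e. a two-orders-of-magnitude linear
# speedup). The curves only cross where log2(n) = 100, i.e. n = 2^100.
import math

def nlogn_cost(n: int) -> float:
    return n * math.log2(n)   # n log n, constant factor 1 (assumed)

def linear_cost(n: int) -> int:
    return 100 * n            # linear, constant factor 100x worse (assumed)

# Even at n = 2^60 (roughly exabyte scale), n log n is still cheaper,
# because log2(2^60) = 60 < 100.
assert nlogn_cost(2 ** 60) < linear_cost(2 ** 60)
```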
This is one of the things Cuckoo filters do well: they need at most two memory accesses per query, compared to a Bloom filter, which needs k accesses (where k is typically greater than 5).
True. I suspect improved filter performance is most useful for database development where preventing a read from unnecessarily hitting disk is the name of the game.
Thanks for sharing! Ribbon filters, xor filters, and these binary fuse filters are all immutable sets where the keys must be known at construction time. But there are many classes of problems where that constraint is perfectly fine.
It's not really an apples-to-apples comparison between something that supports dynamic updates and something that doesn't.
If you're willing to compare arbitrary operating behavior, a simple entropy coded bitset will achieve the information theoretic bound -- it just isn't updatable or queryable on the fly. :P
But these other filters are dynamically queryable on the fly. I think all of these filters also support dynamic additions. What they don't support (and neither does a plain Bloom filter, afaik) is removal of a key. For the typical applications of these structures, that tends to be good enough.
Yeah, I believe ~all of the ones with greater space efficiency than cuckoo filters (or blocked Bloom filters, depending on the false-positive rate) don't support incremental insert.
I know there are scalable Bloom filters, but are there any alternative data structures like Bloom/cuckoo filters that allow for dynamic growth in the number of inserted elements?
I.e., I would like to add hundreds of new elements every day and let the filter grow its capacity automagically.
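The core idea behind scalable Bloom filters (Almeida et al.) is to keep a list of fixed-size Bloom filters and start a fresh, larger one whenever the current one fills up; queries check every layer. A rough sketch of that layering - the sizes, hash count, and growth factor here are illustrative assumptions, and a real implementation would also tighten the per-stage false-positive rate as it grows:

```python
# Sketch of the scalable-Bloom-filter idea: chain fixed-size Bloom
# stages, growing each new stage geometrically. Parameters are
# illustrative assumptions, not tuned values.
import hashlib

class BloomStage:
    def __init__(self, capacity: int, num_hashes: int = 7):
        self.capacity = capacity
        self.count = 0
        self.num_hashes = num_hashes
        self.bits = bytearray(capacity * 10 // 8 + 1)   # ~10 bits per key

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            d = hashlib.blake2b(key, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(d, "big") % (len(self.bits) * 8)

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def __contains__(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

class ScalableBloom:
    def __init__(self, initial_capacity: int = 1024, growth: int = 2):
        self.growth = growth
        self.stages = [BloomStage(initial_capacity)]

    def add(self, key: bytes) -> None:
        if self.stages[-1].count >= self.stages[-1].capacity:
            # current stage is full: open a geometrically larger one
            self.stages.append(BloomStage(self.stages[-1].capacity * self.growth))
        self.stages[-1].add(key)

    def __contains__(self, key: bytes) -> bool:
        # a key may live in any stage, so query all of them
        return any(key in s for s in self.stages)
```

The cost of this trick is that the overall false-positive rate is roughly the sum over stages, which is why the paper shrinks each new stage's error budget geometrically.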
https://blog.cloudflare.com/when-bloom-filters-dont-bloom/