In my field (bioinformatics), there is a trend of using a single hash function with an over-sized bloom filter, which is then compressed with something like entropy-optimal RRR[1] or RoaringBitmap bit vectors.
The resulting bloom filter ends up taking the same space as a properly sized bloom filter. There are two advantages: you don't need to know the size of the set in advance, and each query needs only a single hash computation and memory access. However, updates are not well supported.
[1] Raman, R., Raman, V., & Rao, S. S. (2002). Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. See also this blog post: https://alexbowe.com/rrr/
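To make that concrete, here is a minimal Python sketch of the one-hash, over-sized filter. The size M and the hash choice are arbitrary for illustration, and the plain set of bit positions is just a stand-in for the RRR- or Roaring-compressed bit vector a real implementation would use:

    import hashlib

    M = 2**32  # deliberately over-sized bit space; no need to know the set size up front

    def h(key: str) -> int:
        # single hash per key: one computation and one memory probe per query
        return int.from_bytes(hashlib.blake2b(key.encode(), digest_size=8).digest(), "big") % M

    class OneHashFilter:
        def __init__(self):
            self.bits = set()  # stand-in for a compressed (RRR/Roaring) bit vector

        def add(self, key: str):
            self.bits.add(h(key))

        def __contains__(self, key: str) -> bool:
            return h(key) in self.bits

    kmers = OneHashFilter()
    kmers.add("ACGTACGTACGT")
    print("ACGTACGTACGT" in kmers)  # True
    print("TTTTTTTTTTTT" in kmers)  # almost certainly False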
Does anyone have interesting examples of when a bloom filter came in handy as a solution to a problem?
I would love to have some examples (beyond the standard contrived ones) to draw upon, to make it easier to intuitively notice when a bloom filter might be a good fit for a future problem I encounter.
In deduplication, we check whether a block has been seen before. If it has, we don't write it. If it hasn't, we write it down and record that information for next time. Checking whether a block has been seen requires looking in an index, which is large and expensive to query, while recording a newly stored block can be amortized.
So a bloom filter can tell you whether it's worth paying for the expensive index check to verify that the block is a duplicate, or whether you can skip that check and write the block directly.
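A rough sketch of that write path in Python; the set and dicts here are stand-ins for the real bloom filter, the large on-disk index, and the block store:

    import hashlib

    seen_filter = set()    # stand-in for a bloom filter (same "no / maybe" behaviour)
    expensive_index = {}   # stand-in for the large, slow index
    block_store = {}

    def write_if_new(block: bytes) -> bool:
        h = hashlib.sha256(block).hexdigest()
        # the expensive index is only consulted when the filter says "maybe"
        if h in seen_filter and h in expensive_index:
            return False                 # confirmed duplicate, skip the write
        # definitive "no" (or a false positive the index refuted): store the block
        block_store[h] = block
        expensive_index[h] = True        # this bookkeeping can be batched/amortized
        seen_filter.add(h)
        return True

    print(write_if_new(b"some block"))   # True, first time seen
    print(write_if_new(b"some block"))   # False, duplicate skipped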
The same is true for malicious site detection. Google offers this feature in Chrome, but we don't want to make an expensive network request for every URL we visit, and we can't really have a database of all malicious URLs downloaded to and kept up to date on everyone's computer. Instead, they can distribute a bloom filter of malicious sites. If the bloom filter says a site is malicious, we can afford to check the network for the final answer. If the bloom filter says it's not, we know we don't have to check (and that's the most common case, too!)
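As a sketch, the client-side check might look like this in Python; check_with_server is a hypothetical stand-in for the (rare) network round trip, and the filter itself would be built and shipped by the vendor:

    def is_probably_safe(url: str, malicious_filter, check_with_server) -> bool:
        if url not in malicious_filter:
            return True                      # definitive NO: load the page, no network call
        return not check_with_server(url)    # rare "maybe": confirm with the server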
In my opinion, the intuition to extract is this. If you need to check membership, the set is expensive to check, and getting a definitive NO is valuable, then you can consider a bloom filter (or another probabilistic structure like a cuckoo filter).
In dedup, we need to check membership, the index is big enough to not fit in RAM so it's expensive, and getting a definitive NO means we can skip the index check.
In malicious site detection, we need to check membership, the answer requires a network round trip so it's expensive, and getting a definitive NO means we can just move on to loading the site.
Schachter and Ceglowski's "LOAF", or List of all Friends, was a way to distribute a random projection of your email contacts publicly and anonymously. It was based on Bloom filters, and in fact I learned of Bloom filters by hearing Schachter's talk on LOAF at Eyebeam long ago. Basically, you distribute the projection in an email header or signature, and the recipient can probe it to see if the two of you might have any contacts in common.
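From that description, a toy version in Python might look like the following; the filter size, hash count, and addresses are made up for illustration:

    import hashlib

    M, K = 4096, 3  # filter size in bits and hash count, chosen arbitrarily here

    def positions(addr: str):
        d = hashlib.sha256(addr.lower().encode()).digest()
        return [int.from_bytes(d[4*i:4*i + 4], "big") % M for i in range(K)]

    def make_filter(contacts):
        bits = bytearray(M // 8)
        for addr in contacts:
            for p in positions(addr):
                bits[p // 8] |= 1 << (p % 8)
        return bits

    def maybe_knows(bits, addr):
        return all(bits[p // 8] & (1 << (p % 8)) for p in positions(addr))

    published = make_filter(["alice@example.com", "bob@example.com"])
    # the recipient probes the published filter with their own address book:
    print([a for a in ["bob@example.com", "carol@example.com"] if maybe_knows(published, a)])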
One of my favorite uses of bloom filters was a proposal for source-routed multicast some years ago. To work, it simply had the packet originator add all the next-hop addresses to a bloom filter, which was then the address of the packet. Each receiving node would then simply see which next-hop nodes matched the filter and forward the packet to them. Fast, small, computationally cheap, bounded in error, and elegant. I liked it.
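A toy version of that forwarding idea in Python; the filter size, hash count, and node IDs are arbitrary, and a real design would size the filter to bound spurious forwarding:

    import hashlib

    M, K = 256, 2  # bits in the in-packet filter and hashes per next hop (arbitrary)

    def positions(node_id: str):
        d = hashlib.sha256(node_id.encode()).digest()
        return [int.from_bytes(d[2*i:2*i + 2], "big") % M for i in range(K)]

    def address_for(next_hops):
        # the originator folds every next-hop ID into one filter: the packet's address
        bits = 0
        for n in next_hops:
            for p in positions(n):
                bits |= 1 << p
        return bits

    def forward_targets(bits, neighbours):
        # each node just probes its neighbours against the address and forwards
        return [n for n in neighbours if all((bits >> p) & 1 for p in positions(n))]

    addr = address_for(["r2", "r5"])
    print(forward_targets(addr, ["r1", "r2", "r3", "r5"]))  # typically ['r2', 'r5']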
Needed to answer the question "have these 2 user IDs ever interacted?" in a 30 million user system. We did this to lower the read pressure on the primary data store. The bloom filter was basically a highly space efficient cache layer that answered one question. All of our other caches ran in large sharded clusters to distribute load and storage. A single bloom filter server was able to service all requests. Pretty cool.
If the bloom filter says yes then we check against the primary data store (there might be a false positive). If the bloom filter says no then we trust the bloom filter (there are no false negatives). It works because the vast majority of answers are "no" for this particular question.
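Something like this, as a sketch; query_primary_store is a hypothetical fallback to the real data store, and the pair is canonicalised so the key is order-independent:

    def pair_key(a: int, b: int) -> str:
        return f"{min(a, b)}:{max(a, b)}"    # same key regardless of argument order

    def have_interacted(a: int, b: int, interaction_filter, query_primary_store) -> bool:
        if pair_key(a, b) not in interaction_filter:
            return False                     # no false negatives: trust the "no"
        return query_primary_store(a, b)     # the rare "maybe": confirm in the primary store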
I used one in a BitTorrent-like protocol: each block hash was added to the filter, which was then sent to peers. Peers could then have a high-accuracy way of checking whether a peer has a block without having to communicate with them.
Sorry, I wasn't clear: each peer maintains their own personal bloom filter that contains the hashes of the blocks they have. The filters are sent upon connecting with another peer and updated every so often.
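A minimal sketch of that in Python; sizes are arbitrary, and in practice you'd serialise the bit array to send it and tune M and K to the expected number of blocks:

    import hashlib

    M, K = 1 << 20, 4  # filter bits and hashes per block hash (arbitrary here)

    class BlockFilter:
        def __init__(self):
            self.bits = bytearray(M // 8)

        def _positions(self, block_hash: bytes):
            d = hashlib.sha256(block_hash).digest()
            return [int.from_bytes(d[4*i:4*i + 4], "big") % M for i in range(K)]

        def add(self, block_hash: bytes):
            for p in self._positions(block_hash):
                self.bits[p // 8] |= 1 << (p % 8)

        def probably_has(self, block_hash: bytes) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(block_hash))

    peer_a = BlockFilter()
    peer_a.add(b"\x11" * 32)                    # a block peer A holds
    # peer A sends its serialised filter on connect; a peer then probes it locally:
    print(peer_a.probably_has(b"\x11" * 32))    # True
    print(peer_a.probably_has(b"\x22" * 32))    # almost certainly False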
2013: https://news.ycombinator.com/item?id=6626128