How to write a Bloom filter in C++

nly · on April 12, 2016

I feel the use of vector<bool> is an iffy choice.

The Bitcoin codebase has a simple Bloom filter implementation that has been in use for some time. You can take a look at it here:

https://github.com/bitcoin/bitcoin/blob/master/src/bloom.h

https://github.com/bitcoin/bitcoin/blob/master/src/bloom.cpp

mavam · on April 12, 2016

What's wrong with vector<bool> (other than its name)? The interface is exactly what you need to implement a bit-level abstraction in a language where this isn't a first-class primitive.
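For anyone following along, here is a minimal sketch of the bit-level interface being discussed. The wrapper name `BitArray` and the wrap-around indexing are assumptions made for illustration, not code from the article. The usual caveat with std::vector<bool> is that it is a packed specialization whose operator[] returns a proxy object rather than a bool&, which is what tends to surprise generic code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a runtime-sized bit array backed by std::vector<bool>.
class BitArray {
public:
    explicit BitArray(std::size_t num_bits) : bits_(num_bits, false) {
        assert(num_bits > 0);  // an empty array would make the modulo below undefined
    }

    // Indices wrap around, so any hash value can be used directly as an index.
    void set(std::size_t i) { bits_[i % bits_.size()] = true; }
    bool test(std::size_t i) const { return bits_[i % bits_.size()]; }
    std::size_t size() const { return bits_.size(); }

private:
    // Packed specialization: one bit per element, operator[] yields a proxy, not bool&.
    std::vector<bool> bits_;
};
```

The size is chosen at runtime, which is convenient when the bit count is computed from the expected number of items.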

bmohlenhoff · on April 12, 2016

std::bitset is also a good choice if the number of desired bits is known at compile time.
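A rough sketch of what the std::bitset variant might look like, with the bit count as a template parameter. The name `FixedBits` is made up for this example. One thing to keep in mind: the storage lives inside the object itself, so a very large N on the stack can be a problem.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size bit store: N is fixed at compile time.
template <std::size_t N>
class FixedBits {
public:
    void set(std::size_t i) {
        assert(i < N);  // bitset::operator[] does no bounds checking
        bits_[i] = true;
    }

    bool test(std::size_t i) const {
        assert(i < N);
        return bits_[i];
    }

    // Number of bits currently set.
    std::size_t count() const { return bits_.count(); }

private:
    std::bitset<N> bits_;  // stored inline in the object, no heap allocation
};
```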

schmatz · on April 12, 2016

I think this would probably be the best choice after templating the filter

nate_martin · on April 12, 2016

This example works well for raw data but not for complex types. You could make the filter a template, taking the key and a "hasher" function as template args.

mavam · on April 12, 2016

Even better: N3980 [1]

This proposal decouples the implementation of hash functions from how types get hashed.

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n398...
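To give a feel for the decoupling, here is a simplified sketch loosely modeled on the shape of that proposal. The names `Fnv1a`, `UHash`, and `User` are assumptions for illustration and the interface is trimmed down, so do not read it as the proposal's exact API. The idea is that a type only says which bytes matter (via `hash_append`), while the algorithm that consumes those bytes is chosen separately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hash algorithm: only knows how to consume raw bytes (64-bit FNV-1a here).
class Fnv1a {
public:
    void operator()(const void* data, std::size_t len) noexcept {
        assert(data != nullptr || len == 0);
        const unsigned char* p = static_cast<const unsigned char*>(data);
        for (std::size_t i = 0; i < len; ++i) {
            state_ ^= p[i];
            state_ *= 1099511628211ULL;  // FNV-1a 64-bit prime
        }
    }
    std::uint64_t result() const noexcept { return state_; }

private:
    std::uint64_t state_ = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
};

// Types describe which bytes to feed in; they never pick the algorithm.
template <class Hasher>
void hash_append(Hasher& h, std::uint32_t v) { h(&v, sizeof v); }

template <class Hasher>
void hash_append(Hasher& h, const std::string& s) { h(s.data(), s.size()); }

struct User {  // hypothetical key type
    std::uint32_t id;
    std::string name;
};

template <class Hasher>
void hash_append(Hasher& h, const User& u) {
    hash_append(h, u.id);
    hash_append(h, u.name);
}

// Generic functor: pairs any Hasher with any type that provides hash_append.
template <class Hasher>
struct UHash {
    template <class T>
    std::uint64_t operator()(const T& value) const {
        Hasher h;
        hash_append(h, value);
        return h.result();
    }
};
```

With this split, swapping FNV-1a for another algorithm means writing one new hasher class, and none of the `hash_append` overloads have to change.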

schmatz · on April 12, 2016

Great suggestion; I wasn't sure of the idiomatic way to template this. Thanks for letting me know!

nate_martin · on April 12, 2016

Probably something like this:

template< class Key, class Hash = std::hash<Key> > class BloomFilter;
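Fleshing that declaration out a bit, one possible definition could look like the sketch below. Everything beyond the template signature is an assumption: the member names, the use of std::vector<bool> for storage, and in particular the way the k bit positions are derived. Here they come from double hashing (h1 + i * h2), where h2 is obtained by running h1 through a 64-bit mixing step with the splitmix64 finalizer constants. That is a swapped-in technique for illustration, not necessarily what the article does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

template <class Key, class Hash = std::hash<Key>>
class BloomFilter {
public:
    BloomFilter(std::size_t num_bits, std::size_t num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {
        assert(num_bits > 0 && num_hashes > 0);
    }

    void add(const Key& key) {
        const std::uint64_t h1 = Hash{}(key);
        const std::uint64_t h2 = mix(h1) | 1;  // force an odd step so it is never zero
        for (std::size_t i = 0; i < num_hashes_; ++i)
            bits_[(h1 + i * h2) % bits_.size()] = true;
    }

    // false means "definitely not added"; true means "possibly added".
    bool possibly_contains(const Key& key) const {
        const std::uint64_t h1 = Hash{}(key);
        const std::uint64_t h2 = mix(h1) | 1;
        for (std::size_t i = 0; i < num_hashes_; ++i)
            if (!bits_[(h1 + i * h2) % bits_.size()]) return false;
        return true;
    }

private:
    // 64-bit mixing step (splitmix64 finalizer constants); helps when Hash is
    // weak, e.g. an identity hash for integers.
    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    std::vector<bool> bits_;
    std::size_t num_hashes_;
};
```

Usage would then be along the lines of `BloomFilter<std::string> filter(1024, 3); filter.add("alice");`, and a custom key type only needs a suitable functor passed as the second template argument.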

bradleyjg · on April 12, 2016

I don't use C++, so I'm not sure how std::hash works or gets implemented, but the way Guava (Google's Java library) does it is by passing in a key and a funnel object. The funnel object is essentially responsible for decomposing the object into a byte stream. The advantage of doing it this way, rather than making the caller specify their own hash, is that the filter's author can use whichever hash they think has the best properties for a Bloom filter, such as murmurhash3.

schmatz · on April 12, 2016

I updated the blog post with your suggestion; CDN should be updated soon :)

sokoloff · on April 12, 2016

Learned something today. Thanks for the article.

Minor nit: it will save readers time if you call out that "p is the false positive error rate". (You reference the error rate, but don't attach a variable name to it.) I had to go to an external reference to figure that out, which meant I learned something else of course.
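For reference, assuming the article uses the standard sizing formulas, p ties in like this: for n expected items and a target false positive rate p, the bit count is m = -n ln(p) / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. A small sketch of that arithmetic follows; the names `BloomParams` and `optimal_params` are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct BloomParams {
    std::size_t num_bits;    // m
    std::size_t num_hashes;  // k
};

// n: expected number of inserted items
// p: target false positive rate, with 0 < p < 1
inline BloomParams optimal_params(std::size_t n, double p) {
    assert(n > 0 && p > 0.0 && p < 1.0);
    const double ln2 = std::log(2.0);
    // m = -n ln(p) / (ln 2)^2, rounded up to a whole number of bits
    const double m = std::ceil(-static_cast<double>(n) * std::log(p) / (ln2 * ln2));
    // k = (m / n) ln 2, rounded to the nearest integer and kept at least 1
    const double k = std::round(m / static_cast<double>(n) * ln2);
    return {static_cast<std::size_t>(m),
            static_cast<std::size_t>(k < 1.0 ? 1.0 : k)};
}
```

For example, n = 1000 and p = 0.01 works out to 9586 bits and 7 hash functions, i.e. a bit under 10 bits per item.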

schmatz · on April 12, 2016

Great suggestion, thanks! I've updated the article accordingly :)

barsonme · on April 12, 2016

Or in Go[0] or in C[1]...

[0] - https://github.com/EricLagergren/bloom
[1] - https://github.com/EricLagergren/bloom-c

j_s · on April 12, 2016

Is there a standard implementation "everybody uses"? Bloomd seems popular as a bloom filter server.

https://github.com/armon/bloomd

m00dy · on April 12, 2016

I also have some Bloom filter experiments in Python: https://github.com/erenyagdiran/BloomFilter

nvcken · on April 12, 2016

What are your laptop's specs, please?

schmatz · on April 12, 2016

Intel Core i7-4980HQ

nvcken · on April 12, 2016

What about the SSD and RAM? I think they may affect performance.

schmatz · on April 12, 2016

RAM is 1600MHz DDR3. I do have an SSD, but since the program's memory usage was ~32MB, I doubt it would have an effect haha