Hacker News new | past | comments | ask | show | jobs | submit login
How to write a Bloom filter in C++ (michaelschmatz.com)
130 points by schmatz on April 12, 2016 | hide | past | favorite | 19 comments



I feel the use of vector<bool> is an iffy choice.

The Bitcoin codebase has a simple Bloom filter implementation you can take a look at that has been in use for some time

https://github.com/bitcoin/bitcoin/blob/master/src/bloom.h

https://github.com/bitcoin/bitcoin/blob/master/src/bloom.cpp


What's wrong with vector<bool> (other than its name)? The interface is exactly what you need to implement a bit-level abstraction in a language where this isn't a first-class primitive.


std::bitset is also a good choice if the number of desired bits is known at compile time


I think this would probably be the best choice after templating the filter


This example works well for raw data but not for complex types. You could make the filter a template, taking the key and a "hasher" function as template args.


Even better: N3980 [1]

This proposal decouples the implementation of hash functions from how types get hashed.

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n398...


Great suggestion; I wasn't sure the idiomatic way to template this, thanks for letting me know!


Probably something like this:

template< class Key, class Hash = std::hash<Key> > class BloomFilter;


I don't use c++ so I'm not sure how std:hash works or gets implemented, but the way that guava (Google's java library) does it is by passing in a key and a funnel object. The funnel object is essentially responsible for decomposing the object into a byte stream. The advantage of doing it this way rather than making the caller specify his own hash is that you can use murmurhash3 which you thought had the best properties for the bloom filter.


I updated the blog post with your suggestion; CDN should be updated soon :)


Learned something today. Thanks for the article.

Minor nit: it will save readers time if you call out that "p is the false positive error rate". (You reference the error rate, but don't attach a variable name to it.) I had to go to an external reference to figure that out, which meant I learned something else of course.


Great suggestion, thanks! I've updated the article accordingly :)



Is there a standard implementation "everybody uses"? Bloomd seems popular as a bloom filter server.

https://github.com/armon/bloomd


I have also experiments about Bloom Filter in python https://github.com/erenyagdiran/BloomFilter


How is your laptop spec ? please


Intel Core i7-4980HQ


SSD, RAM ? I think maybe it affects performance


RAM is 1600MHz DDR3. I do have an SSD, but as the memory usage of the program was ~32MB, so I doubt it would have an effect haha




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: