Interesting anecdote: an alumnus of my university was chatting with me recently about his work at Google. Essentially, he replaced what he described as a "large, complex machine learning pipeline" with a "simple bloom filter" for displaying product results at the top of a search page if it determines you have searched for a product.
For example, if you search for "iPhone 5S", the filter determines whether to show something like this http://i.imgur.com/Dp3y1Gi.png (not sure if sponsored makes a difference here, possibly a bad example).
Are the shop sites (EE, etc.) ones you would have visited before the search?
> a "simple bloom filter" for displaying product results at the top of a search page if it determines you have searched for a product.
"have searched" is not clear to me. I was expecting Google would just spit out a JSON response with all the search query for page 1 (including sponsored ads). Can you please elaborate on this please?
> Are the shop sites (EE, etc.) ones you would have visited before the search?
No. I think you might be parsing his statement wrong - it's not "have searched" as in searched for previously, it's "have searched for a PRODUCT" as in, your search is for a product (something purchasable in a store, preferably an online store) as opposed to something else.
The Bloom filter figures out if your search is for a product, and if it is, puts that box w/ sponsored links to stores that sell that product up at the top of your results.
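No idea what Google's setup actually looks like, but as a toy illustration of the pattern being described (the catalogue contents, sizes, and SHA-256-based hashing below are all made up, not anything Google does):

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: membership tests with false positives but no false negatives."""

        def __init__(self, num_bits=1 << 20, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key):
            # Derive k bit positions by slicing a single SHA-256 digest.
            digest = hashlib.sha256(key.encode()).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    # Hypothetical usage: is this query (probably) a product query?
    product_queries = BloomFilter()
    for name in ["iphone 5s", "galaxy s5", "kindle paperwhite"]:   # stand-in catalogue
        product_queries.add(name)

    if "iphone 5s" in product_queries:
        print("show the shopping box (might be a false positive)")

The point is that the filter itself is tiny and the query-time check is a handful of hash-and-probe operations, which is presumably why it could replace a heavier pipeline for this yes/no decision.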
(HyperLogLogs are vaguely similar data structures in that you hash items into a compact summary, except instead of answering set-membership queries, they're geared toward cardinality estimation - counting distinct items.)
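If anyone wants a feel for that hashing idea, here's a rough HyperLogLog sketch in Python - the register count, bias constant, and SHA-1-based hash are my own choices, not taken from any particular implementation:

    import hashlib
    import math

    def _hash64(item):
        # Stable 64-bit hash (Python's built-in hash() is randomized per process).
        return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

    class HyperLogLog:
        """Rough HyperLogLog: estimate the number of distinct items seen."""

        def __init__(self, p=12):
            self.p = p
            self.m = 1 << p                 # number of registers (4096 here)
            self.registers = [0] * self.m

        def add(self, item):
            h = _hash64(item)
            bucket = h >> (64 - self.p)     # first p bits pick a register
            rest = h & ((1 << (64 - self.p)) - 1)
            # Rank = position of the leftmost 1-bit in the remaining bits.
            rank = (64 - self.p) - rest.bit_length() + 1
            self.registers[bucket] = max(self.registers[bucket], rank)

        def count(self):
            alpha = 0.7213 / (1 + 1.079 / self.m)
            raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
            zeros = self.registers.count(0)
            if raw <= 2.5 * self.m and zeros:
                # Small-range correction: fall back to linear counting.
                return self.m * math.log(self.m / zeros)
            return raw

    hll = HyperLogLog()
    for i in range(100_000):
        hll.add("user-%d" % (i % 20_000))
    print(int(hll.count()))                 # roughly 20000, give or take a few percent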
Not in the style of the submitted article, but HyperLogLog was mentioned [0] last year on Nick's Blog's Damn Cool Algorithms series [1]. Discussion on HN [2].
Well, there you are: a nice explanation of HyperLogLog, along with graphs, and a JavaScript demo to play with. And a tribute to Philippe Flajolet, who came up with the algorithm.
I hope you write it! I also imagine an encyclopedia of many different algorithm examples, each with demonstrations and graphical explanations. Bret Victor has talked about that sort of thing, and it greatly intrigues me.
I'm not sure what Bret Victor had in mind, but the second part of Skiena's text The Algorithm Design Manual already does a good job at this. Similar write-ups are available online [0], in less detail than the text, e.g. [1]. They all open with a pictorial representation of possible input and corresponding output to aid in scanning when you have a problem to solve. There is also the Encyclopedia of Algorithms [2] and the NIST Dictionary of Algorithms and Data Structures [3], with its history at [4].
Question: instead of using two hash functions, why not use just a single function, but with a = hash(foo) and b = hash(foo+"something")? A well-distributed hash function should give drastically different results for a and b for different values of foo, which is what we want, correct?
It's standard to have at least two hashing functions. You can then simulate having x hash functions simply by combining those two.[1]
But your question is entirely valid, and I'm not really sure why you couldn't do that. It might simply be that the kinds of hash functions used in Bloom filters (non-cryptographic ones) are not that well distributed.
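For what it's worth, your trick is close to the standard one: Kirsch and Mitzenmacher showed you can derive all k indexes from just two base hashes as g_i(x) = h1(x) + i*h2(x). A rough sketch (the hash choices here are arbitrary, not from any particular library):

    import hashlib

    def indexes(key, k, m):
        """Simulate k hash functions from two base hashes: g_i(x) = (h1(x) + i*h2(x)) mod m."""
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # keep the stride odd
        return [(h1 + i * h2) % m for i in range(k)]

    print(indexes("foo", k=5, m=1024))
    print(indexes("bar", k=5, m=1024))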
bloomd[1] is a "high performance c server for bloom filters" for those looking for a low-overhead and deployable implementation with quite a few clients, like Python and Ruby.
This looks very cool; however, I was able to get lots of false positives using the first few letters of the alphabet as test data. Could you reduce these by using k hashes and k bit vectors with m bits, i.e. one bit vector for each hash algorithm, at the expense of a little extra computation?
Here you can give it a false positive rate you're happy with and a number of elements. It'll give you the optimal size and number of hashes to use. Your suggestion would use far too much space.
The expense is space, not so much the extra computation.
The nice thing about Bloom filters is that (if your hash functions are good) you can compute the vector length given the number of set elements and the false-positive probability that is acceptable in your application.
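For reference, the standard sizing formulas (assuming idealized hash functions): for n items and a target false-positive rate p, m = -n*ln(p) / (ln 2)^2 bits and k = (m/n)*ln(2) hashes. A quick sketch:

    import math

    def bloom_parameters(n, p):
        """Optimal bit-array size m and hash count k for n items at false-positive rate p."""
        m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
        k = max(1, round(m / n * math.log(2)))
        return m, k

    print(bloom_parameters(1_000_000, 0.01))   # about 9.6 million bits (~1.2 MB) and 7 hashes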
Google Chrome not only uses Bloom filters for malicious URL filtering but also as part of the CSS selector matching process. I'm not sure how exactly, but allegedly it gives them a nice speed bump in the majority of cases without compromising on generality.
In this vein, a friend and I are researching/implementing a probabilistic key-value store for some bioinformatics applications (one is metagenomic organism and gene identification). It's fast and space-efficient, just like a BF (though obviously less so as it stores keys and not single-set-membership). Any use cases for that kind of thing in your sub-field? Always trying to figure out interesting new applications (we aren't ready to write it all up yet, but hope to at some point not too far down the road).
* I'm really just a dabbler / don't have a formal bioinformatics background. My friend is the genetics PhD.
Can you explain "probabilistic key-value store?" Would it be that each gene has some defined probability of belonging to a given organism, or is it probabilistic in the sense of having a defined error rate as BFs do?
The latter - probabilistic in the sense of having a defined error rate. In one of our use cases, the keys are simply k-mers and the values are organism or gene IDs.
Interesting. I'd never looked into how BFs work before and this is a nice clear explanation.
One thing I don't quite get - as you're hashing into the same bit array for each of the hashes - you must get false positives from combinations of hashes. So say you chose a bad combination of hash algorithms where one key hashed into bits a and b using the two different hashes respectively. Maybe some other key might hash into b and a. With a totally suboptimal choice of hash algorithms you could end up with a 50% error rate. Or am I misunderstanding something?
This is kind of a fundamental part of bloom filters. They're probabilistic data structures, which is the trade you make for space efficiency.
If you get a false back, you can guarantee that some item is not in the filter. If you get a true back, you can be fairly certain it is. Once you have a degree of certainty, you can do a more thorough check for membership while cutting out large numbers of negatives.
I get the concept - you're trading space for certainty. It's more the bit where it says:
"The more hash functions you have, the slower your bloom filter, and the quicker it fills up. *If you have too few, however, you may suffer too many false positives.*"
Just made me wonder - couldn't badly chosen hash functions actually give you more false positives?
I agree with what chamblin said, and to add one more point to that, you can choose the amount of uncertainty by using a larger or smaller bit array relative to the number of items you will have.
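To put a number on that, the usual approximation for the false-positive rate (assuming good, independent hash functions) is (1 - e^(-kn/m))^k, so for a fixed bit array both too few and too many hashes make things worse:

    import math

    def false_positive_rate(n, m, k):
        """Approximate false-positive rate for n inserted items, m bits, k hash functions
        (assumes independent, well-distributed hashes)."""
        return (1 - math.exp(-k * n / m)) ** k

    n, m = 1_000_000, 10_000_000
    for k in (1, 3, 7, 14):
        print(k, round(false_positive_rate(n, m, k), 4))   # roughly 0.095, 0.017, 0.008, 0.019

Badly chosen (correlated) hash functions would do worse than this formula predicts, which is I think the scenario the question was getting at.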
Another great Bloom Filter library not mentioned in this article is Bitly's implementation on GitHub[0]. It has the best of both worlds: a C implementation and higher-level wrappers (Golang, PHP, Python, and Common Lisp). If you don't want to play around with a fancier DBMS and really just want the damn filters, I would look here.
Last time I checked, the Cassandra implementation was really messed up. Another dev at our last company and I spent a few weeks debugging, adding test cases, and fixing the code to get it to work correctly.
Cassandra user in production here. Could you elaborate a little bit on what was messed up (size, false positives, etc.) and which version this was? Did you have trouble with Cassandra itself that you tracked down to the bloom filters, or were you trying to re-use the bloom filters in another project?
I'm only familiar with a Python clone of the Cassandra implementation ("hydra" -- used it a while back), but two issues I do remember are: 1) I believe it only uses 32-bit ints for the bit array addressing, so you can overflow it (and this also may be less-than-ideal from a hash distribution perspective, but I don't know offhand); and 2) as someone coming from a different background, I found the whole thing to have a bit too much "OO" magic, with several helper classes to set up the filter that (to me) obfuscate what's really going on.
Any real-world use case where you want to check set membership and false positives are allowed.
E.g. one particular Dutch natural language parser uses a left-corner algorithm. To prune the search space, the parser removes hypotheses with unlikely splines, since they will probably not lead to a good parse (the probability of a spline occurring in a good parse is estimated by parsing a large corpus).
This parser is written in Prolog and splines are represented as Prolog terms, so a natural approach is to assert likely splines as Prolog facts. Although most Prolog implementations do first-argument indexing and the splines are ground (contain no variables), it's not an ideal lookup:
- Asserting all the terms requires a lot of memory.
- Lookup requires hashing the term and possibly doing multiple comparisons when there is more than one term with that hash.
This is also an application where a false positive does not matter - it just means that you will not prune a particular spline. It can increase parsing time (since it increases the number of hypotheses), but since it prunes fewer splines, you are not throwing away parses.
For this reason, we replaced the 'splines as Prolog facts' approach with a Bloom filter: you hash the Prolog term one or more times and use constant-time lookup in a bit array. This change made this functionality faster and a lot more compact[2].
As a matter of fact, I implemented this so that at compile time the hashes are computed, the array is constructed and then written out as a C file. This is linked into the Prolog module that does the guided parsing. So, in binary versions of the parser the Bloom filter is actually never computed at run time[3].
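I don't know the actual build setup, but the general shape of that, sketched in Python (the sizes, hashing, array name, and example terms below are all invented), would be: hash the likely splines offline, set the bits, and emit the filled bit array as a C file that gets compiled and linked in:

    import hashlib

    NUM_BITS = 1 << 16        # made-up size
    NUM_HASHES = 4

    def positions(term):
        digest = hashlib.sha256(term.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % NUM_BITS
                for i in range(NUM_HASHES)]

    def emit_c_filter(terms, path="spline_filter.c"):
        """Set the bits for every likely spline, then dump the bit array as a C file."""
        bits = bytearray(NUM_BITS // 8)
        for term in terms:
            for pos in positions(term):
                bits[pos // 8] |= 1 << (pos % 8)
        with open(path, "w") as f:
            f.write("/* generated at build time - do not edit */\n")
            f.write("const unsigned char spline_filter[%d] = {\n" % len(bits))
            f.write(", ".join(str(b) for b in bits))
            f.write("\n};\n")

    emit_c_filter(["spline(np, vp)", "spline(det, n)"])    # invented example terms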
Jeremy Edberg mentioned in his talk on scaling reddit [0] that they used a bloom filter to provide fast (negative) lookup of whether or not a story or comment had been upvoted by the user. The negative case was important since it's often the case that the majority of stories or comments appearing on a page will not have been upvoted by the user.
You could use them for rough spell-checking by hashing all the words in a dictionary. You would then need some additional data structure to deal with false positives, but any word rejected would necessarily not be a member of the original dictionary. The BF would be much cheaper to store than the original dictionary, and searching for each word would be pretty fast (though not necessarily faster than exact string lookups).
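A tiny sketch of that idea (the word list, sizes, and hashing are made up) - a miss in the filter is a guaranteed misspelling, while a hit would still need confirming against the real dictionary:

    import hashlib

    NUM_BITS = 1 << 22
    NUM_HASHES = 7

    def positions(word):
        digest = hashlib.sha256(word.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % NUM_BITS
                for i in range(NUM_HASHES)]

    bits = bytearray(NUM_BITS // 8)
    for word in ["hello", "world", "bloom", "filter"]:   # stand-in for a real word list
        for pos in positions(word):
            bits[pos // 8] |= 1 << (pos % 8)

    def definitely_misspelled(word):
        # A miss is a guarantee: the word is not in the dictionary.
        # A hit only means "probably there"; confirm elsewhere if it matters.
        return not all(bits[pos // 8] & (1 << (pos % 8)) for pos in positions(word))

    print(definitely_misspelled("wrold"))   # True
    print(definitely_misspelled("hello"))   # False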
Don't know what you consider simple, but databases use bloom filters to do faster joins for example. Chrome also uses them for malware detection I think.
It's also useful on local hash joins because it's faster than a regular hash table lookup, so you can use it to eliminate some stuff before going to the big hash table.
To generalize, "possible match detected" for anything: you have a big, fat database about a bigger, fatter universe of identifiers, and a less-capable set of clients over a less-capable network link. They want a fast way of determining whether or not a particular identifier is worth querying the database for, and the false positives do not have a very high cost.
The conditions are relevant for some cloud-managed appliances, mobile-phone applications, and sometimes caching techniques in your own distributed applications.
I've used Bloom filters to implement simple breakpoints in an emulator. It makes sense since checking for membership happens every emulator step, and is essentially constant time with a bloom filter.