
I looked into Bloom filters after someone suggested them in a comment on my last post. I guess the possibility of false positives makes me a bit jumpy. I realize the error rate can be reduced with a larger data structure, but that prospect doesn't make me terribly excited (and pushing the error rate as close to zero as possible would seem to require a data structure large enough that it no longer provides much benefit over gzip compression). Although, please feel free to correct me if I'm wrong - if this does provide a tangible benefit with virtually no errors then I'd be very open to using this technique.



The way you'd want to use Bloom filters when you want zero chance of false positives is to filter out impossible options before falling back to a slower method that doesn't give false positives. If the slower method takes 10 ms, the Bloom filter takes 1 ms, and you have a 1% false positive rate, then a lookup for a word that isn't in the set costs about 1.1 ms on average (1 ms for the filter, plus a 1% chance of paying the 10 ms fallback), instead of 10 ms.
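A minimal sketch of that pattern, with bloomMightContain and slowLookup as hypothetical stand-ins (stubbed here so the example runs):

  // Bloom filter as a cheap pre-filter in front of a slow, exact lookup.
  // bloomMightContain and slowLookup are illustrative stubs, not real APIs.
  function bloomMightContain(word) {
    return true; // stand-in for the real filter's 1 ms "maybe/no" check
  }
  function slowLookup(word) {
    // stand-in for the exact 10 ms method, e.g. a search of the full dictionary
    return ["apple", "baker", "cedar"].indexOf(word) !== -1;
  }

  function contains(word) {
    // A "no" from the filter is always correct, so bail out immediately.
    if (!bloomMightContain(word)) return false;
    // A "maybe" is either a true hit or a false positive (~1% of misses),
    // so confirm with the slow, exact method.
    return slowLookup(word);
  }
  // Average cost per miss: cost(filter) + fpRate * cost(slowLookup),
  // e.g. 1 ms + 0.01 * 10 ms = 1.1 ms.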


Rather than the fallback method being the list of all good words, how feasible would it be to use a Bloom filter and then have the fallback method be an indexOf() check against a list of all known false positives?

Would there be some more enlightened way of computing such a list than just crunching through every combination of letters up to length n? That is to say, could one devise a bloom filter with hash functions such that given the configured filter, a list of all possible matching elements could be produced?


As far as I can see, finding all false positives for a Bloom filter is effectively impossible, since a false positive is simply a hash collision.


Okay, so... it's definitely possible, just really expensive. Depending on the lengths of words you cover, you have to crunch through a few trillion combinations of letters, looking for the false positives by brute force, like computing a rainbow table but without punctuation, mixed case, etc.

What I'm wondering, though, is if one could select a hashing function whose collisions are easily discoverable. For a "hash" function that's just m mod 8, the presence of a 1 bit means you had to have had an input that was 1, 9, 17, 25, 33, etc., up to the maximum word length. This may be a problem that's just as hard as or harder than the rainbow table approach, but I'd be interested to explore its feasibility.
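To make the m mod 8 toy concrete, here's a hedged sketch of what "easily discoverable collisions" would look like for such a trivially invertible hash (purely illustrative, treating inputs as integers):

  // For the toy hash h(m) = m % 8, every set bit b is hit by exactly the
  // inputs b, b + 8, b + 16, ..., so the collision set can be enumerated
  // directly instead of brute-forced.
  function preimages(bit, maxInput) {
    var out = [];
    for (var m = bit; m <= maxInput; m += 8) {
      out.push(m);
    }
    return out;
  }

  preimages(1, 40); // => [1, 9, 17, 25, 33]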


A Bloom filter with an error rate of 1/10^8 requires about 40 bits per item stored. That's about 5 bytes per element, which is comparable in magnitude to just storing the dictionary as a single string. It does also require quite a lot of independent hashes, but it's still going to be cheaper to do the lookups than a linear search through a string.

(On the other hand, it might not compare so favourably with "binary search until we've got it down to a smallish substring, then simple linear search".)
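For reference, a sketch of the textbook sizing math behind those numbers (bloomSize is a made-up helper name):

  // Optimal Bloom filter sizing for n items at false-positive rate p:
  //   m = -n * ln(p) / (ln 2)^2   total bits
  //   k = (m / n) * ln 2          number of hash functions
  function bloomSize(n, p) {
    var bits = Math.ceil(-n * Math.log(p) / (Math.LN2 * Math.LN2));
    var hashes = Math.round((bits / n) * Math.LN2);
    return { bits: bits, bitsPerItem: bits / n, hashes: hashes };
  }

  bloomSize(1, 1e-8);   // => 39 bits (~5 bytes) per item, k = 27 hashes
  bloomSize(100, 1e-8); // => ~3835 bits, i.e. about 480 bytes total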


The number of bits you need in a Bloom filter varies as a function of the number of items you have (in addition to the desired error rate). Assuming 100 items, a 1e-8 error rate only requires about 480 bytes (= ~5 bytes per item).

You also don't need to do a lot of independent hashes. You can just take a linear combination of the results of two hashes and get basically no performance degradation (see: http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf).
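The paper's trick, sketched (h1 and h2 are assumed to be two nonnegative integer hashes you already compute for the key):

  // Kirsch & Mitzenmacher: simulate k independent hashes using only two,
  // via g_i(x) = (h1(x) + i * h2(x)) mod m.
  function bloomIndices(h1, h2, k, m) {
    var out = [];
    for (var i = 0; i < k; i++) {
      out.push((h1 + i * h2) % m);
    }
    return out;
  }

  bloomIndices(17, 29, 5, 64); // => [17, 46, 11, 40, 5]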

Also, you probably don't actually need a 1e-8 false positive rate, since you're going to fall back to a canonical (but slow) lookup table for positives in most use cases. Bloom filters are most useful when your real dictionary is slow and you're expecting a lot of 'no match' results.


> Also, you probably don't actually need a 1e-8 false positive rate

Absolutely not. I just wanted to demonstrate that even if you do want a very low false-positive rate from your filter you don't need a very large data structure, contrary to jeresig's fears. Bloom filters are very, very space-efficient.


Yeah, I think they just fascinate me, and I'm still looking for a good use for them in real life.

I haven't had a lot of cases where I was willing to tolerate errors, but I figured in a word game nobody would notice. :)

One other thing to look into if you're using eval() to parse your JSON data: a couple of years ago, the Flickr developer blog observed that building their data structures with string.split() was much faster than parsing the JSON equivalent with eval(). YMMV on newer browsers.

http://code.flickr.com/blog/2009/03/18/building-fast-client-...
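Roughly the idea from that post, as a hedged sketch (the payload format here is invented for illustration):

  // Instead of eval()'ing a big JSON array, ship one delimited string
  // and split it with a single native call:
  var payload = "apple baker cedar";  // hypothetical space-delimited data
  var words = payload.split(" ");     // fast native parse: ["apple", ...]

  // The eval() approach it was measured against:
  var words2 = eval('["apple","baker","cedar"]');
  // On modern engines, prefer JSON.parse(), which is both safer and fast.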



