
The way you'd use a bloom filter when you need zero chance of false positives is to filter out impossible options before falling back to a slower method that doesn't give false positives. If the slower method takes 10 ms, the bloom filter takes 1 ms, and you have a 1% false positive rate, the average lookup costs about 1.1 ms instead of 10 ms.
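That pattern is easy to sketch. Here's a minimal, illustrative bloom filter in Python (the class, parameter names, and the SHA-256-based hashing scheme are my own choices, not anything from the thread); `lookup` consults the cheap filter first and only pays for the exact check on a "maybe":

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: m bits, up to 8 hash positions derived from SHA-256."""
    def __init__(self, m_bits, k_hashes):
        assert 1 <= k_hashes <= 8  # 32-byte digest -> at most eight 4-byte slices
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # integer used as a bitmask

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        # Derive k bit positions from successive 4-byte slices of the digest.
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "maybe present" (could be a false positive);
        # False is a definite negative.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

def lookup(word, bloom, exact_set):
    """Cheap filter first; only fall back to the slow exact store on a maybe."""
    if not bloom.might_contain(word):
        return False          # definite negative: no fallback cost paid
    return word in exact_set  # fallback catches the filter's false positives
```

The average-cost arithmetic above falls out of `lookup`: every query pays the filter cost, but only true hits plus the ~1% of false positives pay the fallback cost.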



Rather than the fallback method being the full list of all good words, how feasible would it be to use a bloom filter and then have the fallback be a membership check against a (much smaller) list of all known false positives?

Would there be some more enlightened way of computing such a list than just crunching through every combination of letters up to length n? That is to say, could one devise a bloom filter with hash functions such that given the configured filter, a list of all possible matching elements could be produced?


As far as I can see, finding all false positives for a bloom filter is effectively impossible, since a false positive is simply a hash collision.


Okay, so... it's definitely possible, just really expensive. Depending on the lengths of words you cover, you have to crunch through a few trillion combinations of letters, looking for the false positives manually, like computing a rainbow table, but without punctuation, mixed case, etc.
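The brute-force approach can be sketched directly. This toy version (the one-hash filter, the `hash_word` polynomial hash, and the tiny alphabet are all my own stand-ins to keep it runnable; a real run would use the actual filter and a full alphabet) crunches through every string up to a given length and keeps the filter hits that aren't real words:

```python
from itertools import product

def hash_word(w):
    # Simple deterministic polynomial hash (illustrative, not cryptographic).
    h = 0
    for c in w:
        h = h * 31 + ord(c)
    return h

def make_filter(words, m=64):
    """Toy one-hash bloom filter: set one bit per word in an m-bit mask."""
    bits = 0
    for w in words:
        bits |= 1 << (hash_word(w) % m)
    return bits, m

def might_contain(filt, w):
    bits, m = filt
    return bool(bits & (1 << (hash_word(w) % m)))

def enumerate_false_positives(filt, true_words, max_len, alphabet="ab"):
    """Crunch through every string up to max_len, keeping filter hits that
    aren't real words -- the brute-force, rainbow-table-style enumeration."""
    fps = []
    for length in range(1, max_len + 1):
        for combo in product(alphabet, repeat=length):
            cand = "".join(combo)
            if might_contain(filt, cand) and cand not in true_words:
                fps.append(cand)
    return fps
```

The cost is exactly what the comment says: |alphabet|^1 + |alphabet|^2 + ... + |alphabet|^max_len candidates, which for 26 letters and realistic word lengths runs into the trillions.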

What I'm wondering, though, is if one could select a hashing function where the collisions were easily discoverable. For a "hash" function that's just m mod 8, the presence of a 1 bit means the input had to have been 1, 9, 17, 25, 33, etc., up to the maximum word length. This may be a problem that's just as hard or harder than the rainbow table approach, but I'd be interested to explore its feasibility.
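For a hash that trivial, the preimages really are enumerable by construction rather than by brute force. A sketch of that idea (function name and parameters are mine, and it assumes the toy h(m) = m mod modulus hash from the comment with 0 <= bit_position < modulus):

```python
def preimages_mod(bit_position, modulus, max_value):
    """For the toy hash h(m) = m mod modulus, list every input up to
    max_value that would set the given bit. No search needed: the
    collisions form an arithmetic progression with step = modulus."""
    return list(range(bit_position, max_value + 1, modulus))

# e.g. bit 1 of an 8-bucket filter, inputs up to 33:
# preimages_mod(1, 8, 33) -> [1, 9, 17, 25, 33]
```

The hard part, of course, is that a hash this invertible is also a terrible bloom filter hash: the same structure that makes collisions enumerable makes them common and clustered, which is presumably why the comment suspects the trade-off may be no easier than the rainbow table approach.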





