The first I heard of a Bloom filter was when I was tasked with implementing one in iBooks, where I used it for search.
My recollection is a little fuzzy now, but as we paginated the ePub we would hash or index (the first three characters of?) every word on the page. Each page got its own bitfield or some such structure to hold these hashes.
When a user typed a word to search, we could potentially eliminate a large percentage of the ePub's pages: if a page had no hash match for the hashed search term, we knew the word was not on that page at all.
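If it helps to picture it, here's a rough sketch of the sort of per-page filter I mean. The bit-array size, the hash functions, and all the names are invented for illustration; this is not the actual iBooks code.

    // Toy per-page Bloom filter: each page owns a small bit array, and every
    // word on the page sets a few bits in it at parse time.
    struct PageFilter {
        private var bits = [Bool](repeating: false, count: 1024)
        private let hashCount = 3

        // Two simple byte-folding hashes, combined as h1 + i*h2 to get k indices.
        private func indices(for word: String) -> [Int] {
            var h1: UInt64 = 5381, h2: UInt64 = 0
            for byte in word.lowercased().utf8 {
                h1 = h1 &* 33 &+ UInt64(byte)
                h2 = h2 &* 131 &+ UInt64(byte) &+ 7
            }
            return (0..<hashCount).map { Int((h1 &+ UInt64($0) &* h2) % UInt64(bits.count)) }
        }

        mutating func insert(_ word: String) {
            for i in indices(for: word) { bits[i] = true }
        }

        // "false" is definite; "true" only means the page *might* contain the word.
        func mayContain(_ word: String) -> Bool {
            return indices(for: word).allSatisfy { bits[$0] }
        }
    }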
A best-case example might be searching for "Jabberwocky" in "Through the Looking-Glass". Very likely all but one page of the ePub would fall away.
As the article points out, a hash match for a page was not a guarantee the word was there. In that case we would drop down to a more standard string compare to find not only whether the word truly existed on the page but also where (and whether it appeared in more than one place).
The worst case was of course a search term like the word "the", where potentially every page would have to be exhaustively string-compared. But that kind of falls in the lap of the user: why did you search for "the"?
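Roughly, the search side looked like this in spirit, using the PageFilter sketch above. Again, hypothetical code rather than the shipping implementation, and the exact-match step here is just Foundation's string search standing in for whatever we actually used.

    import Foundation

    // Two-step search: use each page's filter to skip definite misses, then run a
    // real (case-insensitive) string search only on pages that might contain the term.
    func search(for term: String, pages: [String], filters: [PageFilter]) -> [Int] {
        var hits: [Int] = []
        for (index, filter) in filters.enumerated() where filter.mayContain(term) {
            // A filter hit can be a false positive, so confirm before reporting the page.
            if pages[index].range(of: term, options: .caseInsensitive) != nil {
                hits.append(index)
            }
        }
        return hits
    }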
This is really interesting, thanks for sharing! I'd imagine the three-character trick would have failed on programming-based ebooks, right? With all those underscores and stuff. Do you remember the preprocessing steps you took for this?
On a side note, I just spent like an hour on your blog. Fascinating stuff!
WebKit, at the time, was not very good at pagination, something ePub requires. The iBooks work resulted in a number of "Radars" filed against WebKit with regard to pagination. As a side effect, I suspect printing a web page from Safari (who does that?) has been markedly better since.