Let's build a full-text search engine

injb · on Aug 4, 2020

>To query the index, we are going to apply the same tokenizer and filters we used for indexing

From developing a Lucene-based search engine years ago, I learned a very useful trick for things like stemming and synonyms: don't use the same tokenizer/filters that you used for indexing. It's much more flexible if you instead expand the query terms, replacing a given term with a list of OR clauses with all the synonyms.

So during indexing, if the document has "fishing", then you just add that to the index.

Later, during querying, the user can search for "fish" with optional stemming enabled, and their query will get re-written into

"fish" OR "fishing" OR "fishy"

This way, the user has control at query-time, and you can manage your synonym/stemming database without having to rebuild your index.

Of course, I had to write a custom parser for this but it was well worth it imo.

blueline · on Aug 4, 2020

how do you populate/maintain the synonym/stemming database? is this something that “standard” datasets exist for? asking because this problem will fall on my plate at work sometimes in the next few months :-)

pseudalopex · on Aug 4, 2020

Stemming uses algorithms or Hunspell dictionaries.[1]

[1] https://cwiki.apache.org/confluence/display/solr/LanguageAna...

injb · on Aug 4, 2020

There was a standard set of synonyms that everyone was using at the time (around 12 years ago) that came in the form of a Lucene index. I'm not sure if it's still around or where you'd find it.

IIRC the stemming was done by an algorithm and not from a database...but I don't remember the exact details.

vaidhy · on Aug 4, 2020

when I was building this in circa 2006, I used porter stemmer. I do not know if that is still state of the art, but it worked really well for my purpose.

sanxiyn · on Aug 5, 2020

These days we use Snowball stemmer. OP also used it.

inertiatic · on Aug 5, 2020

Not a universal trick. Depending on how smart you end up getting and how big your queries are you might end up with an extremely large number of terms in your queries.

amelius · on Aug 5, 2020

Not by more than a small constant factor.

inertiatic · on Aug 5, 2020

The number of terms times the number of expansions times the types of expansion times the target fields.

In effect, a problem in many applications.

dumb1224 · on Aug 4, 2020

Good old days with Lucene powered indices! So many memories. That query expansion trick would have saved me many hours of tweaking.

aflag · on Aug 4, 2020

Wouldn't having a stemmed index and a non-stemmed one also allow the same usability? The user could then just select which index to use.

rpedela · on Aug 5, 2020

Yes and it can make certain use cases easier. There are pros and cons to both index-time and query-time stemming, synonyms, etc. I usually use index-time stemming and query-time synonyms.

sanxiyn · on Aug 5, 2020

That would work, but I am not sure it's worth maintaining two indices.

matlin · on Aug 4, 2020

I like that approach especially since index is more true to the source text.

mwsherman · on Aug 5, 2020

As you get into this, you’ll want to improve the tokenization to something less naive. (The naive approach is fine for the educational purpose of the article.)

I’ve made the mistake of simply splitting on spaces or punctuation, and eventually finding that will do the wrong thing.

Eventually I learned about Unicode text segmentation[0], which solves this pretty well for many languages. I believe the default Lucene tokenizer uses it. I implemented it for Go[1].

[0] https://unicode.org/reports/tr29/ [1] https://github.com/clipperhouse/uax29

rawoke083600 · on Aug 5, 2020

Yea agreed, as to my tokenization example. Back when I was doing search for a price comparison service. The model numbers of the products were sometimes entered as LGA775 or LGA-775 or LGA 77-5 or LGA—775 (unicode dash) or "LGA 775"

And that is just one example :) I ended up by developing a suite of "search tests" so whenever we were doing tuning (daily basis) we had a set of automatic search tests of about 80 weird/prev-wrong search cases we could check in a blink of an eye. One could specify the top X result needed to include at least Y number of products in XYZ category or have "brand" as ABC etc.. One might say it was "test-driven-search-tuning"

adsjhdashkj · on Aug 4, 2020

Recently i've been writing a sort of archival focused SQL database. Partly out of want, partly out of learning desire. FTS would be of great benefit if i added it.

So i ask, where do FTS and SQL constraints or implementations differ greatly? Why does Lucene exist rather than having the same features in MySQL or Postgres?

I'm asking in the context of road bumps, things to watch out for - should i try to merge the ideas and implementation of a SQL DB and FTS.

(Note: I am aware that some SQL has FTS, but i imagine it still differs or has weaknesses when compared to a strictly focused FTS engine like Lucene)

mamcx · on Aug 4, 2020

One major "weakness" of rdbms is that all internals are black blocks.

You can do "SELECT * FROM table"

but, weirdly, can't "SELECT * FROM A_Index".

NOTHING in the relational model requiere this limitation. Is purely accidental.

BTW, I'm building a relational lang and also, very long time ago, try to port lucene to Delphi.

So, what you need to support FTS and other kind of structures is... support them.

For interfacing, you can consider ANY key-value as a relation of exactly 2 columns. Any scalar, a relation of 1 column, 1 row. You can put relations in relations, so you can do recursive data.

Maybe the hardest is to have not only storage of btrees but also other stuff. I think you can fit FTS data in btrees anyway, but probably tweak it.

When building the relational lang I stumble upon how support a n-column data inside a btree. You can split:

    pk=key
    values=col1 col2 col3

or as I later "get", just store the full row and change the functions that do the comparing. In rust start with:

    pub struct Tree {
        index: usize, <--signal which column as key
        data: BTreeSet<Vec<...>>, <-- Store all row
    }

branko_d · on Aug 5, 2020

> but, weirdly, can't "SELECT * FROM A_Index".

In all honesty, most DBMSes have some sort of "hinting" mechanism that accomplishes roughly the same.

It's a cludge, though, usually in the form of specially formatted comments or some sort of syntactic "piggy-back" on top of the regular SQL, or even outside SQL (e.g. SQL Server plan guides).

I'd love to be able to specify the physical access path with more "first-class" syntax, and mix-and-match with logical access more easily.

adsjhdashkj · on Aug 4, 2020

How funny - i don't know SQL that well, and at work recently i was lamenting that i couldn't find a way to just query an index directly.

Funny how self learning can sometimes mirror what you end up asking in work - even though they're not related at all.

boberoni · on Aug 4, 2020

The first thing that I think of is that SQL is designed for relational schemas, which would involve JOINs operations as well as indexes that optimize for WHERE, GROUP BY, etc. Overall, you have a reasonable expectation of structure over your data.

On the other hand, FTS is working on less structured data a.k.a. variable length text. Therefore, an FTS system would need to include token filtering methods (which the article demonstrates) and optimize for the relevancy of search results. Another challenge for FTS is to create a ranking for your results.

I would imagine that Lucene makes it easier to tune and administer your own definition of "relevancy" and "ranking", based on your application's use case.

adsjhdashkj · on Aug 4, 2020

Gotcha. So it sounds like no immediate blockers for FTS with SQL, but that focused FTS solutions (Lucene/etc) typically offer greater flexibility than you would traditionally see with a SQL DB's FTS.

Appreciate the thoughts, thanks :)

fabian2k · on Aug 4, 2020

I don't really know enough about this for a proper answer, but some parts necessary for FTS are features in Postgres. The inverted index is the base building block, and Postgres does have this as GIN (generalized inverted index). The Postgres version is useful for more than just FTS, but as far as I understand it's pretty much the same thing under the hood as what dedicated FTS engines use.

From what I've read, the biggest weaknesses in Postgres FTS are in the language-specific stuff and not the core indexing. And especially in ranking results, where there are much more advanced methods than what Postgres uses.

JaggedNZ · on Aug 4, 2020

I've had great success with Postgres FTS for some complex (i.e large, non-standard, highly specialised corpus) and basic search (very small corpus) features.

Prior to switching to Postgres FTS had to maintain a seperate FTS applications index. I would say make sure that the gain of using a seperate FTS engine outweighs the maintenance cost, given that maintaining a FTS index in Postgres can be very minimal if configured correctly.

andyjpb · on Aug 4, 2020

This is a great intro to full-text search.

These techniques were the basis of some of the early web search engines (pre Google).

More of this stuff is explored in Witten Moffat and Bell's book "Managing Gigabytes" which dates from the era of those early search engines and when "Gigabytes" was something unimaginably large that certainly wouldn't fit in RAM and perhaps not even on a single disk.

ramraj07 · on Aug 4, 2020

Whatever terabytes or petabytes of data we have in indices today, is it all served from memory? I remember Google saying at some point they went fully in memory searches, don't know if they have kept at it or if it's the standard.

MR4D · on Aug 5, 2020

A petabyte of ram is less than $5MM - using just ram costs and consumer ram as well [0]. So given how much money they make and how many servers they have, I think it’s highly probable. The caching of images and web pages might be more difficult.

Plus, Google could easily just cache the the results for the top million queries without much ram at all by comparison.

[0] - https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16...

elorant · on Aug 5, 2020

What kind of motherboard could take a PB of RAM?

MR4D · on Aug 6, 2020

8,000 of them networked in a data center.

Back in 2011 it was thought they had a million servers, so 8,000 is easy for them.

[0] - https://www.datacenterknowledge.com/archives/2011/08/01/repo...

mrkeen · on Aug 4, 2020

I'd wager the amount of crawlable information on the web has grown much larger than a Google-like company like has grown its hardware.

GordonS · on Aug 4, 2020

Maybe even Google feels this way, since they seem to have deindexed so much old content?

karterk · on Aug 4, 2020

If you used a Trie for the inverted index, you can do fuzzy searches on tokens. This can help handle typo errors. This is the approach I've taken with Typesense: https://github.com/typesense/typesense

hbcondo714 · on Aug 4, 2020

> Most well-known FTS engine is Lucene (as well as Elasticsearch and Solr built on top of it)

What makes me attracted to these services are their hit highlighting[1] capabilities so you can show where the search query matches are. This is especially helpful in long documents but it appears in this case, the documents are just one-liners and the entire document is returned in the search results.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/curr...

Timothy055 · on Aug 4, 2020

Glad to see this mentioned! I worked on open sourcing the unified highlighter for Bloomberg along with Lucene/Solr committer David Smiley. It’s a very important feature for longer documents--in our case legal documents. We gave a presentation about it at Lucene/Solr Revolution in 2016.[1]

[1] https://youtu.be/tv5qKDKW8kk

boyter · on Aug 5, 2020

Not the point of the article, but the slow-down from using regular expressions can be avoided in this case by instead generating each case permutation and searching on that. Although its actually a bit more complex because you need to cut off doing this for long strings to avoid too many permutations but you get the idea.

Posting just in case someone thinks they need to build an index in order to case-insensitive search a few gigabytes of text. Its actually how tools like ripgrep tend to work (assuming it can do literal extraction of the regular expression). The moment you get to 10's of gigabytes though an index is probably what you need.

I actually have some Go code (since the example was in Go) which does this and assuming you are doing literal searches is a drop in replacement for re.FindAllIndex https://github.com/boyter/cs/blob/master/str/index.go#L110 which also does simple case folding for you.

GordonS · on Aug 4, 2020

Not sure if the OP will see this, but thanks! I was looking for an introduction to building basic FTS just last week, and didn't find anything at the right level for me - this article was absolutely perfect to understand the basics.

Not sure if you intend to write more articles, but I'd love to see this turned into a series where you gradually add different capabilities and techniques (e.g. using tries)

dstick · on Aug 4, 2020

Have a look at ElasticSearch. I was taken aback by how fast it was performing complicated full text queries in a dataset of 300 million web pages.

It was essentially a matter of scrape - clean (remove HTML) - index (add to the DB) - search. Easy as that. If you have prior experience with JS and scraping it should be a day’s work for the basics. The documentation is really well done and easy to follow and the API is intuitive.

softwaredoug · on Aug 4, 2020

Pretty fun - I also often recommend people explore the earlier versions of Lucene, still on Sourceforge -

https://sourceforge.net/projects/lucene/

aidos · on Aug 4, 2020

Agreed. Even configuring Lucerne used to be enough to give you a pretty deep understanding of how it works under the hood. Arguably that’s a bad thing, but I found it very enlightening.

Keverw · on Aug 5, 2020

Pretty neat. I was considering doing something like this to create full text search inside of some newSQL type databases I've been looking at but don't support full text yet, like CockroachDB or YugabyteDB. I know some people use another system for search and try to synchronize everything but personally more things to manage and rather have a single source of truth.

Really wish I could find a prefect system where if you got popular wouldn't have to worry about changing the application where you just throw more servers at it, with incremental backup (sorta disappointed some charge extra for this and not included in the open source community version but I guess in the future could upgrade once your making $$$ and have a large enough dataset where backups are slow), full text search, reference integrity (so you can't delete a foreign key that's referenced - so no dangling references), etc... I guess can't get everything you wish for, but things are clearly better now than they were even a decade ago. So if starting a greenfield project, a lot more choices thankfully though.

jordanlewis · on Aug 5, 2020

I'd be happy to work with you to make it happen in CockroachDB if you were so inclined :)

I added array inverted index support a little while ago in this PR: https://github.com/cockroachdb/cockroach/pull/45157

A simple full-text search engine would not be too different at the lower levels.

Keverw · on Aug 5, 2020

Interesting, a bit over my head. Haven't got into Go, into Typescript/Node but there's been some really cool Go project's I've came across. My idea was to just store an array of strings in JSON and generate a query by parsing the search query string, probably separating words out too could work do handle things like "required" or -required for boolean search and filtering out stop words.

I keep changing my mind on things and still super early on my project still, so far I'm thinking CockroachDB for storing more transactional stuff, inventory, accounts, etc and then use MongoDB for chat, social feeds, logging, etc.

I guess if I implemented my own search logic this way, could then give consistent search experience regardless if what they are searching is in the SQL or noSQL database too.

I guess store things in lowercase too, last I looked CockroachDB didn't have PG's CITEXT but for search lowercase makes sense anyways, but maybe you want to search someone's fist name in support probably enter it lowercase while preserving the original casing for display uses.

make3 · on Aug 4, 2020

The next step is usually to use tf-idf or bm25 to rank the matches https://en.m.wikipedia.org/wiki/Tf%E2%80%93idf https://en.m.wikipedia.org/wiki/Okapi_BM25

kristopolous · on Aug 4, 2020

I've always found these things push poor quality results to the top. The merits it ranks upon are redundancy and repetition.

They are the "next thing taught" but they aren't useful

boyter · on Aug 4, 2020

What would you suggest then?

Assuming you only have the term frequency values I cannot think of any other algorithm you can apply that gives reasonable results. Of course this breaks down as its easily gamed, but unless you are building a public search engine it should work.

Heck tools like lucene/elasticsearch use tf/idf and bm25 although they can include vector space on top of the query as well.

kristopolous · on Aug 5, 2020

The next thing should be talking about the different classes of documents and searches.

Sometimes thesaurus searching is appropriate, sometimes it isn't. Sometimes removing logical modifiers (ex: not, excluding, etc) is appropriate and other times, not.

Sometimes verbatim searching is right, sometimes not.

Sometimes approximation searching is right, something it's not.

For instance, I was on Amazon the other day and I was looking for bubble mailers, 8x12, 50 pcs ... I wasn't able to get amazon to exclude the other unit and lot sizes ... maybe that's the right answer sometimes. A "better" search system however, would see the numbers, know the frequency distribution of those values and present ranges for me to choose.

Google has a focus ambiguity with a similar "one search" approach: it ignores more focused words to give you more generalized results. which might be the right answer sometimes ...and sometimes not.

Here's another classification example: On Google books, when looking for say, "constitution" in older texts, OCR will sometimes misidentify it as "conftitution" because of the long-s in older typography. For this search, I shouldn't have to be manually searching both terms to get the results ... the search query in this case should do something different still, swap letters with common OCR missteps and then aggregate them into a single result.

I ran my own OCR on documents and wrote a small python program to do this using Damerau–Levenshtein distance. It's not hard, could easily be taught to undergrads.

Here's yet another example. let's say I have a garden and I found holes in say, my basil leaves and am trying to find bugs that may be causing that. In this search it would be fine to say "basil is an herb (that's the class of object to search), it finds that say, cabbage loopers eat mint, which is also an herb and returns those results" - this indirection is correct here, assuming these caterpillars eat both plants (they do).

Now let's say I'm looking for a recipe with my basil plant and it uses the same indirection logic and returns mint recipes. Or I'm looking for songs by Toni Basil and I get Toni Thyme and Toni Lavender in the results ... now it's the wrong decision.

Here's another example: suicide by tylenol overdose is common and prompt medical care is extremely important. Often people mix up the terms "aspirin" and "tylenol" and many use them interchangably, big mistake here.

Pretend someone finds their friend passed out and incorrectly searches on the web, "aspirin overdose". It doesn't look like it is life-threatening meanwhile the time is ticking. The engine should be saying "yo, are you sure it isn't tylenol? that'll kill you". Getting that right (similar terms, far more common event, extremely dangerous) requires multiple of the above systems I spoke of combined together. Sometimes right, sometimes wrong.

Anyway, multi-method classifications of documents and search intents, totally the next step.

boyter · on Aug 5, 2020

Got any papers that go though this in terms of pros/cons and such?

While I agree with you I have to wonder given what a lot of people have in terms of search problems are easily solve-able though the use of tf/idf or bm25 because it improves where they are starting to a level thats often good enough.

kristopolous · on Aug 5, 2020

I don't have a doctorate and I'm not an academic. Let me try to be helpful.

IDF is old, like 1950s-1970s for most of the work. You have to put things in context. Digital storage was for surveys and structured documents; things that could be computed upon. Wasting computers for literature and opinions wasn't part of the application space. Why would you be storing newspaper opinion columns on your univac?

So if we are looking at say, demographic population surveys, essentially almanac data, then IDF serves you well. If I was searching "Duluth", then the document concerning the measurements taken in Duluth will have that word in it frequently.

Academically there's something called a "Zipfian" distribution and some math to back it up, but I have very little confidence in my mathematical formalism skills (been working on it for 20 years, still not very good) so excuse me for skipping over it.

The pros/cons appear in "application" sections when describing a technique. You can find these on wikipedia. Here's wikipedia's category on it: https://en.wikipedia.org/wiki/Category:Information_retrieval...

I think, as a whole, we've understated the human interface problem in search. There's some technical tools that if are exposed and used can dramatically improve results.

For instance, a swap operator.

In some of my search systems I permit a syntax:

  "s(phrase[0]:...:phrase[n])"

That's because sometimes you can have a phrase say "x y z" where "z y x" and "x z y" are common but have different meanings and you only want a subset.

Then you have polysemy and homonomy ... so sometimes you want an exact terms, sometimes you want to fuzzy search, sometimes you want a collection of exact terms. So we can do exact with say "=", collection with a "|" and so on...

  's(=x:y0|y1|y2:z)"'

Now pretend you also want a range in there:

  's(=x:y0|y1a..y1z|y2:z)'

and so on. This syntax is also quite fast to process. You can permute all the elements without too much effort, look them up, and aggregate the results in very little time but we are really starting to get into a RegEx like query system which means a decent amount of technical knowledge is needed and that's where we have the interface problem.

These systems are relatively easy to code and run on modest hardware (as in, something you can easily fit under a desk) but require more from the user - a level that frankly you just aren't going to get, let's be real here. I've found lots of otherwise competent programmers who struggle with say regex and bpf, I think I have some natural talent there that is not super common.

This is an example of why I think search is maybe, 50% a human interface problem.

chudi · on Aug 5, 2020

the next next step is using something like learning to rank algorithms to try get better results!

keshavmr · on Aug 4, 2020

bleve is a FTS engine written fully in go. https://blevesearch.com/

mdaniel · on Aug 4, 2020

And, going for the HN gold, Toshi is a FTS server written in Rust: https://github.com/toshi-search/Toshi#readme (backed by the FTS library https://github.com/tantivy-search/tantivy#readme also in Rust)

FridgeSeal · on Aug 5, 2020

Adding to this, here's a couple more options (also written in Rust):

Meilisearch - https://meilisearch.com/ Sonic - https://github.com/valeriansaliou/sonic#readme

Sonic is slightly different in that it returns the indexes to the documents that you use to go and get from an external store, but the engineering behind it is super cool: it uses finite-state-transducers to build the trie's which are then search.

Edit: updated to correct link

mdaniel · on Aug 5, 2020

Thanks for reminding me about MeiliSearch!

I think you meant your 2nd link to be https://github.com/valeriansaliou/sonic#readme

FridgeSeal · on Aug 5, 2020

Yes I certainly did, thanks for the catch.

euske · on Aug 5, 2020

I found that the notion of token is very English centric. What you really want to do is to convert each sentence to a set of features, be it words, n-grams or prefixes/suffixes, etc. When searching, the way I implemented was to use those features for only narrowing the candidates, and apply a post processing to filter out inexact results.

zemo · on Aug 4, 2020

what about the opposite? given a large block of previously unseen text: check if the text contains a word in a given list of bad words. I'm thinking ... profanity filtering, for example. So user inputs a page or two of text, we check it for profanity.

andyjpb · on Aug 4, 2020

This is a much easier problem as the input is large but streamable and the database is small.

However, you would still need to use some of the tokenisation steps that he talks about such as finding word boundaries and stemming.

The performance of your problem would be bounded by the size of your unseen text.

Something like the The Knuth-Morris-Pratt string-search algorithm would be useful.

zemo · on Aug 4, 2020

thanks this is a good reference, this was the type of term I was looking for. I can implement this off of a paper.

it's a bit tougher than regular string matching because of homoglyph attacks (i.e., people trying to subvert the matching by writing f_ck or f0ck or fȗck or whatever) but I can probably do a pretty straightforward tokenization at ingest time to clean that up.

082349872349872 · on Aug 4, 2020

If your users don't ever discuss places such as Scunthorpe, England or Fucking, Austria, the tokenisation may be effective.

zemo · on Aug 4, 2020

it's a multiplayer video game so assume users will attempt to subvert it by any means possible.

new_guy · on Aug 4, 2020

I want to stick my long-necked Giraffe up your fluffy white bunny.

zemo · on Aug 4, 2020

yeah, not trying to do sentiment analysis, that's out of scope lol

sanxiyn · on Aug 5, 2020

This is a classic use case for Aho-Corasick algorithm.

zemo · on Aug 6, 2020

ah, perfect. yeah this is what I had in my head: taking the word list and making a big trie and walking it as I read the input. I figured there was some prior art here but was struggling to google it. thanks!

crawdog · on Aug 4, 2020

You could build a bloom filter with the profane words, then check your tokens for the page against that filter.

teraflop · on Aug 4, 2020

That would make sense if your list of expletives was too large to fit in memory, which would be... impressive, to say the least.

GordonS · on Aug 4, 2020

I'm Scottish, and I'll say it's definitely possible :p

Keyframe · on Aug 4, 2020

if profanities were single words only. What if they weren’t? You’d have to have a giant list of permutations and build a huge-ass bloom filter. Still doable though, but then spelling errors (or not)...

cdbattags · on Aug 5, 2020

Now someone do the same in Rust.

Can_Not · on Aug 6, 2020

Does this count? https://github.com/valeriansaliou/sonic

jrnichols · on Aug 4, 2020

for some reasons this sounds kind of like Gopher

known · on Aug 5, 2020

2 major issues;

    Google is doing near real-time indexing of ~200,000 websites; It's expensive for others to do that;

    Google PageRank has matured over period of time to prevent https://en.wikipedia.org/wiki/Google_bombing