Why Pandas? Because BM25 is one thing, but you also want to combine with other factors (recency, popularity, etc) easily computed in pandas / numpy...
BTW phrases are the hard thing. There are a lot of edge cases in phrase matching, not to mention slop, etc. And you want to compact positions into as little memory as possible
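To illustrate the uncompressed baseline (a toy sketch, not how searcharray stores things): a positional index keeps token positions per term and document, and a two-word phrase match is just a check that the positions line up:

def phrase_match(positions_a, positions_b):
    # the phrase "A B" occurs if some occurrence of B sits exactly
    # one token after an occurrence of A
    b_positions = set(positions_b)
    return any(p + 1 in b_positions for p in positions_a)

print(phrase_match([3, 17], [4, 9]))  # True: A at position 3, B at 4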
I also bought the book (it's Relevant, not Relevance, BTW); that one and Deep Learning for Search were invaluable when building a news search engine a few years ago (right before and in the middle of the pandemic).
It's one I point people at all the time when they ask me why something isn't working as expected in any standard search tool, and something I reference from time to time to refresh my own knowledge.
Have you found that including sentiment analysis has helped (or hurt) with phrases? I've also found that phrases are difficult to work with, and I'm wondering what I can do to improve performance.
How come you found this post and commented on it so quickly? Do you have some sort of search scanning the frontpage for terms of interest, or was it by chance? Just curious.
It's not the first time I've seen an article posted and then an expert in the field comment on it rather quickly. I thought I might be missing something about how other people use this site. I had no negative intentions in asking, and thanks for the answer ;)
Checks out. Most of the difficulty in search is dealing with the data volume. The logic itself is surprisingly easy, or it can be. Of course you can complicate it endlessly, but this project successfully cuts most of the fluff...
So with that in mind, approaching the problem not as one of how to make the search engine bigger, but the data smaller (physically or better signal/noise ratio) goes a very long way.
I've no idea what `k1` or `b` are. Nor is there a single comment in the entire file. Are comments considered unfashionable these days?
Looking at `_documents`, I'm guessing the keys are URLs and the values are the contents of those URLs, but I might be wrong.
The whole thing looks like it would've been a useful resource for people to learn how to build search engines with, something people could build on, had the writer bothered to document it. But as it is, I'm disappointed with the poor code.
That's explained in the article, which serves as the documentation for the code within the article. The link for BM25 goes to some of the math, and a little more searching the internet about BM25 parameters can lead you to some relevant articles on how to choose them.
If I wanted a catchy title for the post, I needed to cut the number of LOC as much as possible ;)
Joking apart, thanks for your feedback. I agree that usually it's better to have documentation and code together, but in this case since it's an educational project I decided to split code and documentation, and document the code in a blog post.
k1 and b are tuning parameters for the BM25 ranking function[1]. These are not names the OP invented; virtually every implementation and definitely every textbook uses these variable names. You call them k1 and b because otherwise nobody who's into information retrieval will understand what they do.
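For reference, here's roughly where they sit in the per-term BM25 score (a sketch using the usual textbook definitions, with the common default values):

def bm25_term_score(idf, tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    # k1 caps how much repeated occurrences of a term can keep boosting
    # the score; b controls how strongly the score is normalized by
    # document length relative to the average
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))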
this trend of `a: float` always reminds me of the Rich Hickey "you don't want types, you want proper names" talk. I really hate this (Go-inspired, it feels to me) tendency toward undescriptive single-letter variables, with the type system abused as a naming assistant.
Names can convey proper semantic information about what your program does. Use them, goddammit.
Sadly the 5y SE forgot about dependency injecting the computation... after all we're always on the brink of new physics discovery, we'd hate to have to rewrite all that code when we can readily swap in fresh laws with the proper architecture!
This is right, but if you are implementing a formula known in the domain or documented elsewhere, you can (and should) use the letters used in the formula (in this case, b and k1) instead of making up names.
I agree with you that better names are always preferable. But type hints also work as documentation. We can have both.
However, in this particular case the undescriptive names are like this for historical reasons. I agree these are not the best names, but they are the names used in the literature. If I were working in physics I would probably use "c" for the speed of light or "kb" for the Boltzmann constant, which are not very descriptive names either.
What is the point of flexing about LOC if it's not the total line count, given that we're using external deps? I know there is no SI unit for a codebase, but I think we should measure cognitive load somehow.
This grug not big brain. This grug ate food and read article on big screen at the same time. Now food on screen, not in belly. Grug still hungry, but need to find rag now to clean screen. Otherwise no more work and no more shiny rocks for grug. But grug thanks other grug for article. Grug belly empty, but grug brain full now.
The 80 line search engine isn't using any external deps though. It only imports collections, math and string, all in the standard library. Maybe it'd be more accurate to call it a search engine engine though. The crawler and interface aren't counted towards that goal, but they are obviously needed in some form, and the implementations presented add a bunch of lines and a lot of libraries.
But even then, those libraries aren't related to search engines. If we start counting generic dependencies like pandas and fastapi, we might as well start counting the millions of loc of the operating system needed to run this search engine, and the firmware in the network card. Maybe even account for the complexity of the hardware this is running on.
"X in N lines" is interesting because it's going to show you the minimal implementation of the interesting or central part of the solution without all the other moving parts of a production system, usually for the purpose of demonstrating how something works.
You can see how your 1-liner pitch doesn't fulfill this expectation.
You could just use an existing search engine and it would be 0 LOC, but I think you're missing the point. The focus wasn't on 80 LOC, but rather on being able to talk through it in a short blog post.
I don't think that LOC affects one's ability to effectively communicate the way their code operates. And if we really need a metric, lowering the number of operations required to attain the desired result would be better than lowering lines of code, since fewer ops = less explanation.
It's meaningful here because if it said "A search engine in 4000 lines of Python" most readers' eyes would glaze over, but 80 is short enough to warrant a glance.
I am aware of it, but as far as I understand, cyclomatic complexity is used to compare implementations of single flows (algorithms), not for codebases. What is the cyclomatic complexity of Facebook? :)
Here is a recommendation engine in < 20 lines of Python to go along with your search engine (if you keep session logs of clicked URLs).
from typing import List

def build_recommendations(logs: List[List[str]], window_size: int = 10,
                          max_recommendations_per_url: int = 50) -> dict:
    recommendations = {}
    for session in logs:
        for i in range(len(session) - 1):
            url_id = session[i]  # reference element
            window = session[i+1 : i+window_size]  # sliding window of subsequent clicks
            recommendations[url_id] = recommendations.get(url_id, {})
            for pos, next_link in enumerate(window):
                # elements in the window get decreasing weight proportional
                # to their distance from the reference element
                weight = window_size - pos
                recommendations[url_id][next_link] = recommendations[url_id].get(next_link, 0) + weight
    for url_id, link_recommendations in recommendations.items():
        # sort and truncate the recommendations
        recommendations[url_id] = dict(sorted(link_recommendations.items(),
                                              key=lambda item: item[1],
                                              reverse=True)[:max_recommendations_per_url])
    return recommendations
recommendations = build_recommendations(your_logs)
print(list(recommendations[some_url_id].keys())) # gives an ordered list of the recommended url_ids for the given url_id
With some tweaking (mix typed queries and clicked urls in the logs you feed) you can get a spellcheck suggestion out of it as well :)
This is very cool and very educational. Don't deploy it, though. :-)
I needed something like this once, but at a little bit larger scale (few tens of thousands of documents). The answer, as always, was [sqlite](https://www.sqlite.org/fts5.html); structurally though it's just what you have here but with someone else writing the inverted-index persistence layer.
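For anyone curious, the FTS5 version is pleasantly short. A rough sketch, assuming your sqlite build includes FTS5 (the table and data here are made up):

import sqlite3

db = sqlite3.connect("index.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(url, content)")
db.execute("INSERT INTO docs VALUES (?, ?)", ("https://example.com", "a tiny search engine"))
db.commit()

# FTS5 ranks by its built-in BM25 when you sort by rank
for url, in db.execute("SELECT url FROM docs WHERE docs MATCH ? ORDER BY rank",
                       ("search engine",)):
    print(url)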
Google allows you to search for "search engine" (notice the double quotes) and it’ll only show you results where the two words appear in this specific order.
At least only some of the time, unfortunately. What power users want is "grep for the Web", not "Google, tell me what you want me to see."
> What power users want is "grep for the Web", not "Google, tell me what you want me to see."
I can almost guarantee that nobody actually wants this. "Grep for the web" is strictly bad compared to a search engine that does the tiniest amount of query expansion. Google is definitely taking too many liberties in interpreting the query, but there are many things any search engine should do that will be a straight improvement over not doing them.
The problem with Google search right now is that it's hard to reason about why it gives the results it does, seemingly because they rely too heavily on embeddings to compare strings. It's frustrating when "cat food" matches "dog restaurant" because the two were semantically close in some embedding space that doesn't quite align with human reasoning.
If they'd built it on top of elasticsearch I'd agree with this sentiment, but given the actual search engine bit is implemented in those 80 lines, I think it's fair. The libraries being pulled are exactly the sort of libraries you shouldn't try to hand-roll.
(sometimes you see those articles like 'build your own search engine' and it's a guide on how to install searxng or yacy or something.)
Nice! It wouldn't be much work to add fuzzy-search functionality (all results with a prefix edit distance below some threshold delta, so that a search for "hackrnew" matches "hackernews"). Basically, what you do is add an additional inverted index, but this time the keys are n-grams of the words in your document collection (typically 3-grams), and the postings are the words (or their IDs) in which these n-grams occur. There is then a nice lemma that basically says the following (where PED is the prefix edit distance, and N(x) is the set of n-grams of word x):
If PED(x, y) <= delta, then |N(x) ∩ N(y)| >= |N(x)| - n ∙ delta
That is, x and y must have at least |N(x)| - n ∙ delta n-grams in common to have a PED(x, y) less than or equal to delta.
If you now have an input x, you calculate N(x) and retrieve all postings from the n-gram index for each n-gram of x. You can now merge all of these postings and get a list that looks like this: [worda, worda, worda, wordb, wordb, wordc] (for each n-gram that x and some word y have in common, you get one entry of y). If you merge the duplicates, you get: [(worda, 3), (wordb, 2), (wordc, 1)], and for each y (in this case worda, wordb, wordc), the number in the corresponding tuple is |N(x) ∩ N(y)|.
If this number is larger than |N(x)| - n ∙ delta, you explicitly compute PED(x, y) and check whether it is below your threshold. If the number is smaller, you can simply skip it, saving you large amounts of costly PED calculations.
Your result is a list of words y with a PED(x, y) to the input x below some threshold, and you can then use this list of words to query your existing index.
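A minimal sketch of the filtering step in Python (my own rough rendering of the idea above; the explicit PED check on the survivors is omitted):

from collections import defaultdict

def ngrams(word, n=3):
    padded = "$" * (n - 1) + word  # pad so short prefixes still yield n-grams
    return [padded[i:i+n] for i in range(len(padded) - n + 1)]

def build_ngram_index(vocabulary, n=3):
    index = defaultdict(list)
    for word in vocabulary:
        for gram in set(ngrams(word, n)):
            index[gram].append(word)
    return index

def fuzzy_candidates(x, index, n=3, delta=1):
    counts = defaultdict(int)  # approximates |N(x) ∩ N(y)| per word y
    for gram in set(ngrams(x, n)):
        for word in index[gram]:
            counts[word] += 1
    threshold = len(ngrams(x, n)) - n * delta
    # only these survivors need the costly explicit PED(x, y) computation
    return [w for w, c in counts.items() if c >= threshold]

index = build_ngram_index(["hackernews", "hacker", "python"])
print(fuzzy_candidates("hackrnew", index, delta=2))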
I used this approach many years ago to implement a fuzzy client-side JS search engine on https://dont.watch/ (if you look into the JS code, you can see that the inverted index and the (compressed) n-gram index are simply transferred in the JS file). The actual search engine is around 300 lines of JS, with no external dependencies and some very basic heuristics to improve the search results.
I would argue that Python does not inhibit this project as much as you imply. There are multiple benefits to Python, such as:
- A built-in dictionary type, used for indexing words (see the sketch after this list)
- Clean and easy to read code, which is one of Python's core strengths
- It's fast to draft code in, perfect for toy programs
- Easy async support, which the author comments on
- Plenty of libraries to do the heavy lifting of tasks not focused on by the post, such as hosting a web server, rendering template HTML, and parsing CLI arguments
Yes, Python is not fast relative to C or Rust, but it's perfect for this type of project.
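On that first point, the core index really can be just a dict from term to posting set; a toy illustration:

from collections import defaultdict

documents = {1: "a tiny search engine", 2: "an engine in python"}

index = defaultdict(set)  # term -> set of document ids
for doc_id, text in documents.items():
    for word in text.split():
        index[word].add(doc_id)

print(index["engine"])  # {1, 2}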
> This implementation doesn’t pretend to be a production-ready search engine, just a usable toy example showing how a search engine works under the hood.
I'm not a fan of Python, but the bigger "issue" here, if this was meant to be production code (it's clearly not), is not the choice of Python but the algorithm choices. It shows the principles, but for a production search engine there are a whole lot of tricks you'd apply whether you stick with Python or not.
And once you do, odds are you'd find most of the runtime spent in very small portions of code doing fairly basic composable operations on large compressed arrays, and you can get very far with just rewriting a tiny core in something faster.
The number of people who need a scale where this is hard to do fast enough is small...
Thanks for this lovely, well-written account. It made me wish I still taught this kind of coding, because it would be such a lovely example for students to work through. (We used to do a bootcamp of build-your-own Facebook and Google in a week.) Very inspirational.
Really cool project! Building a search engine from the ground up to tackle small site discoverability? That's a challenge I can get behind. I'm especially into how you used asyncio in Python for faster crawling – smart move.
But, let's talk scale and features. As it stands, handling bigger data sets or adding more complex search features might be tough. On the features side, playing around with query operators or n-gram indexing could seriously level up your search results.
Expanding beyond RSS for content could also give your engine a nice touch. Just throwing in my thoughts – been around the block with search tech a bit. Can't wait to see what's next for your project!
I think you'd probably get the best result with both. There's definitely real merit to a keyword-understanding of which terms appear in the title or as a named entity for example.
Usually you would want weighted n-grams, with 1-grams having the lowest weights. In many cases it's better to give them zero weight. For English, 4-grams react to generic phrases/idioms, 5+ are way too selective, and keywords alone usually reduce relevance too much. 2- and 3-grams are the best.
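To illustrate, a hypothetical scoring function in that spirit (the weights are made up to reflect the heuristic above, not tuned values):

NGRAM_WEIGHTS = {1: 0.0, 2: 1.0, 3: 1.0, 4: 0.25, 5: 0.0}  # made-up weights

def word_ngrams(tokens, n):
    return {tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)}

def weighted_overlap(query_tokens, doc_tokens):
    # sum per-n weights over the word n-grams the query and doc share
    return sum(weight * len(word_ngrams(query_tokens, n) & word_ngrams(doc_tokens, n))
               for n, weight in NGRAM_WEIGHTS.items() if weight)

print(weighted_overlap("build a search engine".split(),
                       "how to build a search engine in python".split()))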
I have myself dabbled a little bit in that subject. Some of my notes:
- some RSS feeds are protected by Cloudflare. It is true, however, that this is not the case for the majority of blogs. If you want to go further, Selenium would be a way to handle Cloudflare-protected links
- sometimes even headless Selenium is not enough, and a full-blown browser driven by Selenium is necessary to fool the protection
- sometimes even that is not enough
- then I started to wonder why some RSS feeds are so well protected by Cloudflare, but who am I to judge?
- sometimes it is beneficial to change the user agent. I feel bad for setting my user agent to Chrome, but again, why are RSS feeds so well protected?
- you cannot parse and read the entire Internet, so you always need to think about compromises. For example, in one of my projects I narrowed the scope of my searches to domains only. Now I can find most of the common domains, and I sort them by their "importance"
- RSS links do change. There need to be automated means of disabling feeds, to prevent checking inactive domains
- I do not see any configurable timeout for reading a page, but I am not familiar with aiohttp; some pages might waste your time (see the sketch after this list)
- I hate that some RSS feeds are not configured properly. Some sites do not provide a valid meta "link" with "application/rss+xml". Some RSS feeds have naive titles like "Home", or no title at all. Such a waste of opportunity
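For what it's worth, aiohttp does support timeouts via aiohttp.ClientTimeout; a minimal sketch (the feed URL is a placeholder):

import asyncio
import aiohttp

async def fetch(url: str) -> str:
    timeout = aiohttp.ClientTimeout(total=10)  # give up after 10 seconds overall
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            return await response.text()

print(asyncio.run(fetch("https://example.com/feed.xml"))[:200])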
My RSS feed parser, link archiver, and web crawler: https://github.com/rumca-js/Django-link-archive. The file rsshistory/webtools.py could be especially interesting. It is not advanced programming craft, but it got the job done.
Feel free to shoot me an email if you have any questions about how to use the files. They're very much there to fan the flames of other indie search projects :D
just fyi, PageRank and BM25 address the orthogonal problems of reputation/popularity (PageRank) and query/semantic matching (BM25). Saying you're using BM25 instead of PageRank makes no sense.
both are similar in the sense that each can be used to sort a set of sites/documents by relevance: PageRank assumes that sites with better in/out links are more relevant, while BM25 assumes that documents with high keyword matching are more relevant.
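For contrast, a toy sketch of the PageRank side: score pages purely from link structure, no query involved (made-up three-page graph, standard 0.85 damping):

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                # each page shares its rank among the pages it links to
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))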
https://github.com/softwaredoug/searcharray
https://github.com/softwaredoug/searcharray/blob/main/search...