What software engineers should know about search (2017) (scribe.rip)
352 points by todsacerdoti on Oct 18, 2021 | 129 comments



Biggest advice I can give is you probably don't need search if you're indexable by search bots.

No really. Look over people's shoulders sometime.

They'll just go to Google and type in their search followed by terms such as Wikipedia, imdb, Stackoverflow, YouTube, Bandcamp, Amazon, eBay, Yelp... all sites that spent a lot of time on their search and have done quite a decent job. Oh well.

So unless you really need it for some critical reason where you know your users aren't going to do their regular patterns of going through one of the general engines, close the ticket as out of scope and go home early.

Don't bother implementing hard to do and expensive to maintain features that nobody will use. It'll become more headache than fun real quick.


  > Biggest advice I can give is you probably don't need search if you're indexable by search bots.
I don't know how relevant it would be today, but ~2005 or so I had a site that used Google search for the internal "search" functionality. It was a webmaster feature of Google at the time, the google results would be displayed with my site's branding, colours, etc. Only results from my site were displayed, but there was a clear Google logo and link at the top. I think that ads were not shown, but I'm not 100% sure.

A/B testing showed significant (I don't remember the numbers) loss of visitors on that page. When presented with our own internal search, visitors would stay on the site the vast majority of the time. But when presented with the google search, visitors would often leave. I would say that the results that google returned were no less relevant than those our internal search provided, in most cases near identical except for ordering.


So as a casual web browser, I have been guilty of this. IMHO it just breaks the user's experience when they feel like they have been swept out to Google. Almost like when someone asks a question, and someone replies with a link from lmgtfy.


I wish that we had spoken a decade and a half ago ))


This is true in principle, but in practice, the use of 3rd party search has died down over the last decade. This is not the phenomenon of a superior way winning out.

Either search is not an important feature, and a suboptimal, DIY implementation that looks OK is good enough. Or, search is a primary feature and then you need control over it. IE, if you have an online store, travel site or dating app with a search based UI, then you'll probably roll your own.

There are cases where Google really is the best way. As you say, Stack Overflow, Wikipedia & such. Even so, you'll eventually roll your own. Spolsky's original UI concept totally leaned on Google for search, but SO still has its own.

At some point, you'll need results to take inventory into account or make autocomplete smarter about tags... and now the headache is yours anyway.

It would have been cool if the web had really developed into the hopeful, "semantic web era" where this kind of approach works. That didn't happen. Half the game is over control, and controlling UIs matters the most.

Also, the power of pagerank has dwindled as the web itself changed. Links aren't what they used to be. That makes Google relatively worse at what it was once best at. Google search as a whole is much, much richer but most of what makes Google good today has less to do with your use case anymore.

TLDR, Maybe google search works for searching wikipedia, which is perfectly loyal to the original WWW concept. Even they have a DIY search. If you work at Reddit though, and your search sucks, google will not fix this. Your search will just suck, and your app will be less usable.


> Either search is not an important feature, and a suboptimal, DIY implementation that looks OK is good enough. Or, search is a primary feature and then you need control over it. IE, if you have an online store, travel site or dating app with a search based UI, then you'll probably roll your own.

Most stores have garbage search and even worse filtering. >98 % of stores have only categories for filtering and only trash like "price" for sorting. In the 2 % of stores that can actually do faceted search the number of facets is too small (e.g. shoe stores which only have a "size" facet plus categories) or the data in the facets is garbage. This is likely a big contributor to sites like Geizhals existing, whose sole purpose is to offer decent search and filtering.
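To make "faceted search" concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory catalog (the product data and the field names `brand`, `size` and `color` are made up for illustration); a real store would push this work to its search engine. The idea is simply to filter on the selected facet values and then count the remaining products per facet value, so the UI can show filters that are actually useful.

```python
from collections import Counter

# Hypothetical catalog; the products and field names are illustrative only.
PRODUCTS = [
    {"name": "Trail Runner", "brand": "Acme", "size": 42, "color": "black"},
    {"name": "City Sneaker", "brand": "Acme", "size": 43, "color": "white"},
    {"name": "Hiking Boot", "brand": "Peak", "size": 42, "color": "brown"},
]


def apply_filters(products, filters):
    """Keep products whose fields equal every selected facet value."""
    return [p for p in products if all(p.get(k) == v for k, v in filters.items())]


def facet_counts(products, facet_fields):
    """Count how many products fall under each value of each facet field."""
    return {f: Counter(p[f] for p in products if f in p) for f in facet_fields}


# Shoes in size 42, plus the counts to display next to the remaining filters.
matches = apply_filters(PRODUCTS, {"size": 42})
counts = facet_counts(matches, ["brand", "color"])
```

The complaint above about "garbage data in the facets" shows up here directly: the counts are only as good as the fields on each product.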


Ninety percent of ecommerce sites have terrible search because ninety percent of everything is crap [1].

That doesn't mean that having a great search can't help your shop get repeat customers, just like good service, fast shipping or a good checkout flow can.

1: https://en.wikipedia.org/wiki/Sturgeon%27s_law


I don't disagree.

If you have bad search, and search is a primary feature, then your ux sucks. I just don't think "just use google" is a viable alternative to making search good.

If you're making an online store, search is probably important, and a major determinant of how well the software works. The best way to handle that is with search that doesn't suck. It may be hard, but that's the job. Fair point that search is not trivial to do well, but that doesn't mean it isn't the job.


> controlling UIs matters the most

Or as former Google design ethicist Tristan Harris wrote, "Whoever controls the menu controls the choices."

https://observer.com/2016/06/how-technology-hijacks-peoples-...


This only counts for public sites with semi-static information. I've been on plenty of internal/LOB projects where filtering, searching and slicing data and exporting it to various formats is one of the core features of the application, and where the data also changes as time marches forward (non-static sites). Thousands of non-tech people rely on those kinds of features to do their job. Oh, and some of them are behind firewalls/proxies/VPNs and never see the light of day on the public internet, so no search crawlers can see them - and even if they could, they are not optimized for crawlers and/or too dynamic.


Well they said in the very first sentence it only applies if you're indexed by search bots. Internal apps (typically) aren't accessible at all, and quickly-changing data isn't indexed in time to be relevant.


It can still be public and indexable, but you will still miss finer-grained filters.


I will very quickly stop using a given website if it doesn't provide a good, on-site search. If I'm looking for information that I know exists on a particular website, I see no reason to involve a third-party search engine in the equation at all.

Granted, a lot of sites out there that provide a search function do not necessarily need one, and would instead benefit from an organized, hierarchical index. With these types of websites, using the search function only adds unnecessary friction.


Given the state of search functionality built into websites I use often I can only imagine you have a very heavily curated list.

In my experience so many sites have searches that return bad results, restrict your ability to see a list of results due to some poorly thought through typeahead functionality, or simply do not work without third party scripts and cookies that I often resort to "site:<domain>" in ddg or worse, google.

I find hierarchical indexes only of marginal use when looking for information. I can only know where in the hierarchy something is if I'm very familiar with the website - because many things could logically be in more than one place.


> organized, hierarchical index

Yes!!

So many blogs without a list of posts in chronological order. So many sites without a list of pages. So annoying.


Would it bother you if foo.com took your query and sent "query site:foo.com" to Google? The results would be foo.com specific.


I often find this gives better results. Eg I'm much likely to find what I'm looking for if I search for "(query) site:reddit.com" than if I use reddit's own search feature.


Much more likely*


I agree with you but many sites use a third-party search engine to provide on-site search functionality. e.g. Algolia's DocSearch


Yes, but a good embedded search makes all the difference.

VueJS and Tailwind CSS are both indexed, but on those particular sites I use the site search _when I'm looking back for the reference of something I know_, because it's faster and more accurate than googling.

Granted, it's rare, but if you manage it, it's great.


Maybe. The point in those examples was that they all solved search in a specific contextualized manner impressively well and despite that people are creatures of habit and they'll want to port the same generalized patterns over that they do for everything else.

Not consciously. They'll just do it and expect it to work which is why effective indexing and "SEO" (as in, the engine can scrape the content and crawl around without getting confused) is likely the actual work to do to implement "search".


From my experience yesterday I can tell you that Zenodo is struggling to reliably and repeatably return results for basic boolean search queries. I think I’ll try parent’s search-the-site-via-Google approach.


Agree with your sentiment.

I cofounded ZIR AI to provide ML (vector)-based search as a PaaS solution, similar to what Algolia or Elasticsearch do for keyword searching.

We have a demo (https://zir-ai.com/demo) running over Quanta Magazine articles. Not only can it outperform the keyword search embedded on quantamagazine.org, but on some queries, it even outperforms Google with a site restrict (e.g. how old is the universe site:quantamagazine.org).

> Granted, it's rare, but if you manage it, it's great.

ML-powered search will make this commonplace, once the tech goes mainstream in the next 2-3 years.


I don't see it anymore, but people used to put a google search bar on their website that would use the sitesearch parameter to restrict results to their domain.

It looks like there is a modern alternative - it seems a little more complex to get going with, which might be why it's not so common.

https://developers.google.com/custom-search


We do it at https://dlang.org

It's not the greatest, as it seems to prefer references in the D forums over the manual pages, but it works well enough.


There used to be a big yellow blackbox server called Google search appliance that you could put in your data center and get internally indexed white label google search.


Before Google figured out what its business model was.


I was going to say something similar.

It’s not common, but it happens. For small sites it should possibly happen more. If you’re working hard to achieve mediocrity, you should put that energy somewhere else where you can at least get to good.


That's not always true.

I run https://volt.fm and search is one of the most used features. If it didn't have built-in search, I doubt any of the users would use Google instead to find the artists/songs they were looking for.


Even further, any ! on DDG is using the site's search so even if we go to a search engine (e.g. `foo !w`), it's possible that we are using the website's search anyway.


I'm sure you're right and you've done the analysis. What's the ratio of inbound search engine traffic to hits on your own search endpoint?


I don’t want users to leave the site and use google. That’s bad UX and DX at the same time as it only increases friction and uncertainty.

I want them to believe that if it is there, they’ll find it with search. But that depends on how good it is and how helpful the feedback is.


I think this very much depends on what kind of service your site provides, but in many cases I think we can argue that the Google fallback is good UX. It means the user doesn't need to figure out and navigate a new/different search system for each service.

If the GP's right and people really do just drop out and use Google every time (even though the target site has a search function), then I think it's hard to argue that it's bad UX. Unless you want to argue that the specific sites' search functions are poorly made.


> Unless you want to argue that the specific sites' search functions are poorly made.

You lose trust in the search function very quickly, if you gain it at all. Whether yours is effective and helpful does not only depend on the search implementation but also on whether your site actually has the information (that it should or is assumed to have) to be effective and helpful. A lot of factors come into this, representation, content, design both for search results and for the actual target pages.

Now if you have the trust, are helpful and effective, then the UX is drastically improved.

For one, search engine results have been degrading. There is _so_much_spam_ out there. Countless SEO sites that just crawl the web and generate crap output to show you ads, some seem to be handwritten as well. It is increasingly probable to get top results that are just presenting "stolen" content in some form or another. You get low effort blog-like posts that are just restating things they read in a discussion that is actually not the primary source, but a generated site, which is referring to the actual information somewhere else.

Secondly if you have in-built, decent search people will use it and they will be happy for it, because you'll be presenting them much better suggestions in a better way and you'll do it faster.

Think of some of the _best_ sites like MDN. Sure, you might enter the site via DDG or Google (in the former case you'll be searching directly like so: "!mdn [search_term]"). But when you're on the site you'll be happily navigating it via links and search, because it's just a very good site.

Other examples of this are: wikipedia, tailwind, clojuredocs, reactjs, hn.algolia...

They all have rich content, good search, useful links. It's a very effective combination.


Yes, when I'm browsing documentation, e.g. for Python, I don't want to leave the context and type "X Python", where X is what I'm looking for. Also Google searches in that case might give all kinds of irrelevant results, like StackOverflow pages (sometimes useful, but not if I want to read just the documentation), ads, and such. Typing "site:docs.python.org" is also out of the question.


but if the search isn't good enough I'll just go to Google anyway but now you've annoyed me with your own useless site search, which is what always happens


How is it bad developer experience (DX)?


We don't want to be even more reliant on optimizing for something we have little to no control over right? Maybe this is more of a mindset/approach thing. But I like making things where I can communicate the guarantees and assumptions with confidence.

Note I'm not saying we shouldn't care about search engines. They are extremely useful and important. I'm saying if you have the type of content that benefits from being navigated via search, then consider direct control over this.


Yeah, I suppose there are two ways to think about it: maintaining an outside dependency or offloading that dependency completely.


Ironically I use hackernews's search engine a lot

Also Google search is very "fluid". Sometimes, for example for searching documentation, you need something more advanced.


Is there any actual evidence that might be true? I encountered this argument many times, but only from programmers (who thought it was too difficult because of their tech stack), never from users.

As a user, any site that does not have search, no matter how good the site is, feels quite horrible because of that. Even search delegated to Google feels horrible -- romhacking comes to mind. It is also frustrating because some instances of title searches are notoriously unindexable, while a nice site search lets you narrow the context through tags or model-specific attributes (system, genre, ...).


The statement is intentionally hedging and qualitative. Some people do the usage pattern, some don't.

The advocacy is to push back on search as a difficult and often unnecessary problem to solve.

The caveat is my increasingly toxic pattern to push back on almost everything as unnecessary. It's probably overly antagonistic.

I'm a techno pessimist programmer. I didn't understand this attitude when I was younger but then again people often become that which they fail most to understand


Formulated like that, I'm very sympathetic to your point of view. Having a small set of well-refined features rather than a growing spaghetti of half done ones. I guess my own point of contention is that programmers would say "task X is hard" when it isn't that much, it is that their technical choices make it hard.

For instance, one Front-End engineer was really proud of using a very recent and trendy framework to rewrite the whole site from scratch. After deployment, people are ordering the wrong products. His reply was, "keeping the query filters in sync is too hard of a problem". It's really not.


An interesting example is the search query “<question about anything> reddit”.

Without the ‘reddit’ qualifier, the results are nearly always spammy and useless (as much as modern Reddit tries to compete here notwithstanding)


Reddit itself has an annoying dark-patterns UI. You have to edit the URL into old.reddit.com to get a reasonable page to read.


It's pretty often that I do a search, get bad results, then add reddit and get good results, so I hope some data analyst at Google sees those as opportunities to improve.


Or as a slight refinement implement it as feeding ("site:mysite.com " + user_query) through to Google, to spare your users 2000 Pinterest hits and 1000 scrapes of a very old Stack Overflow question that somehow matches.
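A minimal sketch of that refinement in Python, assuming the familiar `https://www.google.com/search?q=` results URL and the `site:` operator; the function name and the `mysite.com` default are placeholders. The site's search box would simply redirect to the URL this builds.

```python
from urllib.parse import urlencode


def site_search_url(user_query, site="mysite.com"):
    """Build a Google results URL restricted to a single site.

    `site` is a placeholder domain; the query string is URL-encoded so that
    spaces and the colon in the `site:` operator survive the redirect.
    """
    query = f"site:{site} {user_query.strip()}"
    return "https://www.google.com/search?" + urlencode({"q": query})


url = site_search_url("reset password")
```

Encoding through `urlencode` rather than plain string concatenation matters once users type `&`, `#` or quotes into the box.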


Indeed, there are above-board white-label options for embedded search from most of the indexers, usually for a nominal fee.

Really depends on what kind of content you're dealing with. Search can be as hard as you want it to be.


"Critical" is subjective but:

* Are Google indexes updated frequently enough to capture important changes?

* If you consider certain search terms synonymous, would Google do so too?

* Are Google result rankings compatible with yours?
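The synonym question is the easiest of the three to illustrate. A minimal sketch in Python, assuming a hypothetical hand-maintained synonym list and naive whitespace tokenization (both are stand-ins, not how any particular engine does it): each query term is expanded to its synonym group, and a document matches if it contains at least one variant of every term.

```python
# Hypothetical synonym groups; a real list would come from your own domain.
SYNONYMS = [
    {"sofa", "couch", "settee"},
    {"tv", "television"},
]


def expand_terms(query):
    """Return, for each query term, the set of terms treated as equivalent."""
    expanded = []
    for term in query.lower().split():
        group = next((g for g in SYNONYMS if term in g), {term})
        expanded.append(set(group))
    return expanded


def matches(document, query):
    """True if the document contains at least one variant of every query term."""
    words = set(document.lower().split())
    return all(words & variants for variants in expand_terms(query))
```

With your own search you decide what goes in that list; with a general engine you get whatever it has learned for the whole web.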


>They'll just go to Google and type in their search followed by terms such as Wikipedia, imdb, Stackoverflow, YouTube, Bandcamp, Amazon, eBay, Yelp... all sites that spent a lot of time on their search and have done quite a decent job. Oh well.

I'm someone who does this, and I do it because I've found that it either works better than a site's specific search and/or it's so much faster to go to Google and search than to try to find the search functionality of the specific site.


It depends. If your main UI involves search, you might want to make sure that it is actually usable, competitive and not embarrassing.

If your main business is selling stuff that users find via your search features, then doubly so of course since you are literally losing cash every time your search ends up not doing its job. Easy to measure too and most eCommerce companies that survive long enough do that obsessively because it shows in their recurring revenue if they don't.

Also, if your competitors have awesome search and you don't, your users might realize and jump ship. If you have content that is great but nobody is finding it, you might want to fix it by allowing them to find it via better search functionality.

If search is actually not critical to your UX or product, then by all means, cut corners. Google will happily redirect users to your competitors as well. Make sure to give them plenty of money for keywords. If you don't, somebody will. Either way, your analytics will tell you how people come to your site and what they do once they get there.

Either way, it's not that hard to build a decent search experience if you know what you are doing. The key point of this article is that many engineers kind of don't know what they are getting into and mess it up.


I really do not agree. There are many technical blogs that I like and navigating the content on their site is a pain. They will have interesting multipart blog posts and finding all the connected parts requires clicking through multiple pages of 10 entry blog posts to find what I want. Sometimes they will have links between related posts in the blog post if you are lucky. If they are writing about a specific concept, good luck you need to start clicking.


I have never ever been able to find anything on Wikipedia with search, except if I know the title of the article I'm looking for. But as I often do want the information from Wikipedia, I just search <search term + wiki>, and it takes me straight to where I want to be, whether I'm using DDG or Google. Works so much better than !w


Lately I had to scroll down on Google search results a lot to find the relevant Wikipedia article, it often being somewhere below some irrelevant images, followed by a completely unasked for and irrelevant map (why in gods name would I care for where the nearest factory for a product is?), some random blogspam, and ads.

It used to be that you reliably had the Wikipedia article at the top of your results to provide context and basic information in case you didn't know what your search term means, you could expect it to be there if it exists. Now you have to waste mental energy hunting it down, which is a waste if there's no article at all.


I've lost faith in the ability of the Google approach.

Their results seem to have been turning to trash but then again, so has everyone else's.

There's a few explanations. Easiest one is me, I'm getting dumber with age or have changed my standards. Second one is they are all using similar approaches and SEO'ing has ruined search. Third is Google sets "the standard" and the other engines tweak themselves to follow the goog, regardless of results.

Reality is it's probably all 3 and a few more that I haven't thought of


That does make you dependent on Google not starting to suck. For the first time last week I had the experience of Googling a programming question and realizing the results were so bad I should probably try searching StackOverflow directly. I did and got twice as many hits.


I agree. But I do think there is some room to look into the details. It is possible that these search features are targeted at someone other than the general audience.

And for something like eBay, or Amazon, they'd be mad as hatters to send any search traffic on their website out to Google where competitors are buying ad space. They don't really have a choice, they have to try and implement their own search even if only a small number of people use it. That is likely free money as far as they view the world.


A free text search on your content via a search input is the killer app of a search engine but it is not the only app of a search engine.

A search engine (and its backing index), just like a relational database, is a technology that allows you to build all sorts of solutions on top of it. Nobody says don't use a graph db because Facebook already has the best friend of a friend solution you could hope to find.


Yeah. Absolutely. There's plenty of merits in the pursuit.

But products are constrained by requirements, and most of the time when the manager thought search was in there, after a little talk search was not in there any more.

There's resources, deadlines, budgets, contracts to satisfy, blah blah blah. The insight is this costly feature can usually be shaved from the requirements with a little conversation.

A big secret of "10x" programming is realizing "90%" of the work isn't truly necessary.

But yes, search is totally a fantastic thing to do if you have the space and expertise to do it.


Are you saying a customer who came from a search engine should go back to it when in need of searching another product on your site?

It's such a bad practice to let the users leave for whatever reason.

Sites put every effort into keeping users from leaving, and this is an anti-pattern.

Also, if the site doesn't even have enough budget for a search function, I'd think the business or the manager has some problem.


I use DuckDuckGo's bangs[0] to search, say Wikipedia (!w, !wf), Word Reference (!wref, !wrfe), Hacker News (!hn). They are incredibly useful and using Google would be a waste of time for such queries. I'm glad these sites implemented search.

[0]: https://duckduckgo.com/bang?q=


If this is your situation there's a nice site search builder for DDG here: https://ddg.patdryburgh.com/

I use that for my static site as it means I don't need to introduce any moving parts. Probably not the best search experience but it gets the job done


If I think of myself as a user, that is spot on for information/entertainment, but far from the truth for products. The most extreme examples being everyone using "$THING Wikipedia" but nobody I know using "$THING Amazon" in Google search


You're totally right. Especially forum search engines are terrible. This is the best bang-for-buck solution.

If someone has any idea why forum search experience is so bad and knows how to improve it, please chime in. I have my hunches, but let's not get ahead of myself.


You might be right in most cases (for example blogs etc.) but it's just disturbing on apps like Spotify if you just can't find the music you want to listen to inside the app. Or when Netflix gives unavailable movies as search suggestions.


Third-party search comes at a huge price for me. Crawler bots kick the shit out of my websites, introducing tons of unnecessary noise and load.

I've chosen to block them all and only spread the word organically instead.

I have yet to begin tackling the search problem :)


Wikipedia actually doesn't want to be indexed by third party search engines. Mediawiki is a heap of underoptimized PHP garbage so pages are (expensively) rerendered every time you visit them.


> underoptimized PHP garbage

rather uncharitable description - that PHP implements dozens of interesting features and recall. It is true that it is a mess, however.


or do it like the w3c - make a search form with action https://www.w3.org/Help/search?q=xslt.

Privacy may be a thing, however.


Yeah, the key part is IF you are indexable. In the project I'm working on everything is behind a paywall.


Many sites with paywalls still open up their content for search bots.


A problem close to my heart. Good search is certainly still too difficult to pull off for small teams, and this was one of my motivations for building and open sourcing Typesense[1].

Most people think of search and immediately think of large data sets, but the problems that plague smaller datasets are equally interesting. It's less about performance and more about relevance. For example, searching across multiple fields for a compound query like "taylor swift style" requires breaking the query into segments (taylor swift | style) before searching the appropriate fields. There is also a class of problems that traditional search engines relying on BM25 or TF-IDF for ranking cannot reliably solve (e.g. searching on small texts like titles), where you have to consider the distance between matching words (which TF-IDF and BM25 miss). Lastly, there is also personalization, which is almost always left as an exercise to the reader :)
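To illustrate the "distance between matching words" point, here is a minimal sketch in Python. It is not how Typesense or any other engine implements proximity; the function names, the whitespace tokenization and the scoring formula are all assumptions made for illustration. It finds the shortest window of tokens that contains every query term and scores a title higher the tighter that window is.

```python
def min_span(doc_tokens, query_terms):
    """Length of the shortest token window containing every query term, or None."""
    needed = set(query_terms)
    last_seen = {}  # most recent position of each needed term
    best = None
    for i, tok in enumerate(doc_tokens):
        if tok in needed:
            last_seen[tok] = i
            if len(last_seen) == len(needed):
                # Shortest window ending at i starts at the oldest "last seen" position.
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def proximity_score(text, query):
    """1.0 when the query terms are adjacent, lower as they spread apart, 0.0 if any is missing."""
    tokens = text.lower().split()
    terms = query.lower().split()
    span = min_span(tokens, terms)
    return 0.0 if span is None else len(set(terms)) / span


close = proximity_score("Taylor Swift Style", "taylor swift")
far = proximity_score("swift guide by john taylor", "taylor swift")
```

Both titles contain the same two terms, so a pure term-frequency score would treat them much alike; the span is what separates them.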

[1]: https://github.com/typesense/typesense


As a long time lurker of your work, and someone who works on a fair number of search engines myself - I'm curious, how well does Typesense handle code search (punctuation, etc.)?


Finally added support for indexing and/or separating on specific symbols in the ongoing pre-release builds. With that, I think Typesense should be able to handle code search, but I have never tried to index and search on code myself :)


Cool! I’d be keen to try it out sometime if you have docs on how I’d get it to index specific symbols


Yes, please see here: https://github.com/typesense/typesense/issues/122#issuecomme...

The latest RC build is `0.22.0.rcs18`.


Typesense has been great for us so far. Easy to set up, works great with the simple queries that we need.

Lots of great additional functionality on top of search, for example security, with scoped API keys and the likes that we're looking forward to making use of.


Thank you for your kind words. We are just getting started :)


This phrase "Thank you for your kind words" has in the last 1-2 years become the standard replacement for what people used to say with just "Thanks". It sounds so cringey and mawkish.


For sure it depends on the quality of data and the target of the service. Basic knowledge of Elasticsearch will beat Google for confined data sets.

Major search engines have grown to the size where it is theoretically more beneficial to train a model and then query that model. In theory this will win, but in the real world this will not work, as we can already see with the failing Google search engine.


A limitation of Typesense, and a lot of alternatives, is that they don't really work on Windows. Typesense looks good for embedded situations (i.e. inside an app) except it won't run on Windows.


We don't publish a native Windows binary mainly due to the effort involved. However, I've heard from a couple of users that they've been able to use the Docker build or WSL to run the Linux binary on Windows machines.


Yeah I can't require users to set up Docker or WSL before running my application.


Wait, so this means I can not test it locally on my Windows machine?


The simplest advice to engineers would be to not think of Elasticsearch as some black box that will solve all of your search problems. In fact, if you've never implemented search before, it's the last tool you need. Postgres full text search is all you need. The most important thing in search is to surface relevant results, and no one can quantify relevancy. It's as unquantifiable as it gets in technology. You need to understand what results are relevant for your users and find a metric that would work to rank accordingly.
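As a rough idea of what the Postgres route looks like, here is a minimal sketch in Python that only builds the SQL and its parameters. The table `articles` and the columns `id`, `title` and `body` are hypothetical, and the `%s` placeholders assume a driver in the psycopg style. The Postgres pieces themselves (`to_tsvector`, `plainto_tsquery`, the `@@` match operator and `ts_rank`) are the standard full text search functions.

```python
def build_search_query(user_query, limit=10):
    """Return (sql, params) for a ranked Postgres full text search.

    Hypothetical schema: table `articles` with columns `id`, `title`, `body`.
    The user's text is passed as a parameter, never interpolated into the SQL.
    """
    sql = (
        "SELECT id, title, "
        "ts_rank(to_tsvector('english', title || ' ' || body), query) AS rank "
        "FROM articles, plainto_tsquery('english', %s) AS query "
        "WHERE to_tsvector('english', title || ' ' || body) @@ query "
        "ORDER BY rank DESC LIMIT %s"
    )
    return sql, (user_query, limit)


sql, params = build_search_query("dark matter", limit=5)
```

The pair would be handed to a cursor, e.g. `cursor.execute(sql, params)`. For anything beyond a toy table you would likely store the `tsvector` in its own indexed column instead of recomputing it per row, and `ts_rank` is only a starting point for the "what is relevant for your users" question.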


Something completely off topic about the content of the article but not when it comes to the title.

Github search is a great resource for any engineer. It has saved me so many hours trying to figure some API out when I know there must be so many people who have used it before.

"SomeApiCall" extension:.[js|rs|java|cs|etc.]

Or to find projects that use a dependency you are not sure how to use

"SomeDependency" extension:.[toml|json|etc.]
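If you run these searches often, the query can be scripted. A minimal sketch in Python, assuming the `extension:` qualifier works as written in the examples above and that `https://github.com/search` accepts `q` and `type=code` parameters; GitHub's search syntax has changed over time, so treat the qualifier as an assumption and check the current docs.

```python
from urllib.parse import urlencode


def github_code_search_url(term, extension):
    """Build a GitHub code search URL for an exact term in files of one extension.

    The `extension:` qualifier is taken from the comment above and may differ
    in newer versions of GitHub's search.
    """
    query = f'"{term}" extension:{extension}'
    return "https://github.com/search?" + urlencode({"q": query, "type": "code"})


url = github_code_search_url("SomeApiCall", "rs")
```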


i did not know about these[0] keywords so thanks for that :)

[0] https://docs.github.com/en/search-github/searching-on-github...


GitHub really needs a feature to skip searching in tests. Or maybe there is such a feature but I've never found it.


Not sure who needs to know this but scribe.rip is a relatively new alternative reader for Medium. The original article is at: https://medium.com/startup-grind/what-every-software-enginee...

HN's special treatment of medium.com links doesn't apply to scribe ones.


That explains why all the link texts are weirdly misaligned. Do the authors have any input or does it just scrape Medium for content?


A lot of websites scrape Medium without the authors having any idea, I assume it is another one of those


What would they have input about?


(2017) as well.

Previous discussion from the time, of this excellent article: https://news.ycombinator.com/item?id=15231302


Sometimes, search is broken design. The assumption in the article is that the search function is trying to return what you are looking for, however, Spotify and Netflix have their own ideas about what content they want you to consume.


Author of the article here: thanks for re-posting this! The article seems to be still relevant to many even 4 years later, which surprises me, given how quickly everything is changing in the field.

However, I personally am now focused (and bullish) on DNN-based semantic search. Having built several search experiences based on it I'm convinced it is the future.


Thanks so much for writing this article and pointing readers to the resources. I am planning to build my own personal search engine to reduce SEO spam in my searches, and the resources and concepts here were very helpful.

Regarding DNN-based semantic search, what would you say are the top notable works that one should dive into? What search experiences have you built that you feel were strongly enhanced by DNN-based methods?


Thanks for the great article. Could you write a follow-up about the changes since 2017 and explain more about DNN-based methods?


Google ruined search, by which I mean they made it so good that everyone expects search to be as good as Google.

When building an inexpensive app, the client will often ask for search. The UI designer will oblige and add a search bar to the app. Neither will give much thought to what the search will actually do except to say "make it work like Google".


I won't repeat what others have said about advances in natural language processing since 2017, but it's true that it's a solved problem if your problem isn't "perfect search" but a more realistic "excellent search".

> Use existing technologies first: As in most engineering problems, don’t reinvent the wheel yourself. When possible, use existing services or open source tools. If an existing SaaS (such as Algolia or managed Elasticsearch) fits your constraints and you can afford to pay for it, use it.

I work at an AI search company (Relevance AI) and even we see that InstantSearch.js is all some people need in terms of UI. We created a version of it that uses our NLP-backend but is still the same Algolia components on the frontend: https://www.npmjs.com/package/@relevanceai/instant-relevance

The reason is that those components work. I think these days you'd need to ask a lot of questions before completely rolling your own UI or NLP handling.


I often come back to the Relevant Search book, by Doug Turnbull and John Berryman. I'm sure some of the examples are a little dated now, but most of the advice and approach is still sound and it's a great tour through all the things you need to consider to build a great search experience.

https://www.manning.com/books/relevant-search


Doug Turnbull here - Thank you! :)


Friendly reminder to the friends at Atlassian. This could be a good starting point to fix your damn Confluence search


Look, if they fix search they might have to stop using "Confluence; where information goes to die" as a tagline... so not an easy sell.

Seriously though... the default Confluence search-as-you-type box is horrible; however, sometimes (often? always?) there is a more traditional search endpoint available that seems to give better results. See if either of these works for you:

<your confluence domain>/dosearchsite.action

-or-

<your confluence domain>/search/searchv3.action


I want to plug Algolia a little bit. For small teams, the results are truly incredible, and it provides a ton of features out-of-the-box without a lot of development time. I've seen small teams struggle mightily with both ElasticSearch and Solr; things usually start off OK, but results go downhill as the search needs get more complicated (indexing multiple kinds of documents; adding more and more different kinds of data to the index and trying to weight things properly; dealing with tradeoffs such as relevancy vs recency; etc.). In the future I would shy away from such powerful tools unless I knew I had sufficient engineering resources to dedicate to search and my problem was complex enough to merit it.


This article is incredible.

I've been running a news search engine API [1] for about 18 months, and I found a lot of new things.

The biggest takeaway (IMO) is to use existing SaaS at first. We've been using ElasticSearch Service for our v1, and it worked great. 0s downtime with no master nodes.

And yes, Elasticsearch itself is just a tool: you'd have to write your own search logic on top of it.

Also, start with something "strict" that gives decent results for people who won't read any tips or docs: 99.99% of people who use Google Search don't know about any advanced tips, and still get relevant results.

[1] https://newscatcherapi.com/news-api


Good read, misleading title. It should be more like "what you want to know if you seriously need to implement search functionality".


Yes. If I knew everything I "need" to know according to these articles ... I would know a lot of stuff I never need.


I think it would benefit readers to mention more than the two groups of search-related software covered: there are a lot of new-generation tools and libraries like MeiliSearch, Typesense, Sonic, Toshi, Bleve, ...


That’s a useful and well-written article. It was written 4 years ago, though, and NLP has improved significantly in that time (in English).

My experience using natural-language search queries has given me a set of expectations, as a user, that set a high bar.

I’ve written a fairly substantial backend for a search, and most of my original frontend code has been ripped out and replaced with new frontends, by now. I feel that the new frontends are an improvement, but we still have a fairly strictly-guided “guardrail” search.


Being from 2017, the article misses some of the coolest advances in semantic search, which is now pretty easy and lets you search in (almost) the same way you would ask a shop assistant when looking for something specific - "do you know where the thing with the cool circles and pointy bits is?" (maybe being a little more specific...)

Google do this and they're very good at it, but a lot of companies need their own search capabilities - think about those internal help pages. They usually seem super outdated compared to the semantic search capabilities of Google.

In the end there are only a few components to it. You use some NLP model to create what are called 'dense vectors'. Then you put all these dense vectors into an 'index' which is optimized for fast search (that comes under the umbrella of ANN search). Then, given a new query, you encode it the same way, compare it to the items in the index and return the most similar results.

I covered the search part of it (https://www.pinecone.io/learn/) with Pinecone, who provide managed-search - although here we mainly focus on Faiss (a great engine from Facebook AI). However, we're also looking to create some content covering the first half too, which is 'how to build dense vectors' using models like BERT, we have one post so far on that (https://www.pinecone.io/learn/dense-vector-embeddings-nlp/)
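
To make the three steps above concrete, here is a minimal sketch in plain Python. It assumes the vectors already exist: the 3-dimensional embeddings and document ids below are made up for illustration, not produced by BERT or any real model. It also scans every item by brute force, which is not ANN search; an engine like Faiss exists precisely to avoid that scan on large indexes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(index, query_vector, top_k=2):
    """Brute-force scan: score every item, return the ids of the top_k most similar."""
    scored = [(cosine_similarity(vec, query_vector), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Hypothetical embeddings; a real model would emit hundreds of dimensions.
INDEX = {
    "reset-password": [0.9, 0.1, 0.0],
    "change-email": [0.7, 0.3, 0.1],
    "office-parking": [0.0, 0.2, 0.9],
}

# The query vector stands in for an encoded user question.
print(search(INDEX, [0.8, 0.2, 0.0]))
```

The only design decision here is the similarity measure; cosine similarity is a common choice for text embeddings, but which one fits depends on how the model was trained.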


> Merging results of different kinds: e.g. Google showing results from Maps, News etc.

Google does have a UX bug in its search/map product.

Let's say I google for "kuka deutschland". In my case, the first result is a link to KUKA, with sublinks to different sections of their website.

Next there's a map with A-, B-, C-markers, and below them fields with A: KUKA Deutschland GmbH [other info, website, route] B: KUKA Systems GmbH [...] and so on.

If I now click on A: KUKA Deutschland GmbH or on the map, then the map opens completely, with a left sidebar containing more related places and an open popup for the currently selected item, which contains plenty of information about the marker on the map.

The issue is that this is not a Google Maps map: it is not hosted under maps.google.com, but any normal person would expect to be able to interact with it as if it were a normal map, with a button for satellite imagery, 3D view and all of this, and the ability to have the normal Google Maps sidebar at my disposal. But this does not exist.

The worst thing about this view is that there is no link to Google Maps which would open that same view in Google Maps.


The biggest thing I've learned about search is to be very precise about WHAT is being searched. In my experience most people are very loose with this, but by being really precise you can skip past a lot of bad search experiences.

Take clothing search for "blue t-shirt": what is the search space? Most engineers would naively shove the name/description into Elasticsearch and then be surprised when it doesn't work. Why? Because most product descriptions are not as explicit as "This is a blue t-shirt" (see also YouTube video descriptions).

What we actually found was that by searching categories (where blue t-shirts is roughly a category), and then just filtering clothing items to that category, search worked far better. Understanding what terms people used (categories/colours) and what was actually in the data and what data models we should be searching (not products!), we built a far more effective search experience.


One strategy for this is to parse the search string for facets. Presumably there would be a facet for "t-shirt" and another facet for "blue", modulo synonyms and typos. Selecting those facets should give very good results, even without hitting the full-text search.


Yes, that's a good option, but it still requires that you've got the facets in the index. I'm surprised at how many people say "just use Postgres full-text search" without thinking through what information will actually be made available to search, for example.


No, I'm not saying to have the facets in the index. I'm saying to parse the search text for facets, and to search on the facets.

If the whole string matches facets, then the full text search never gets touched, as in the example above. But if the example were e.g. "blue anime t-shirt" then you could facet on "blue" and "t-shirt" while FTS only has to contend with "anime".
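
A minimal sketch of that parsing step, assuming a small hand-written facet vocabulary (the `FACETS` table and the `parse_query` helper are hypothetical names for illustration). It matches whole tokens only and ignores the synonyms and typos mentioned earlier, which a real implementation would need to handle.

```python
# Hypothetical facet vocabulary; a real one would be derived from the catalogue.
FACETS = {
    "colour": {"blue", "black", "red"},
    "category": {"t-shirt", "shoes", "jeans"},
}

def parse_query(query):
    """Split a query into recognised facet values and leftover free text."""
    facets = {}
    leftover = []
    for token in query.lower().split():
        for facet_name, values in FACETS.items():
            if token in values:
                facets[facet_name] = token
                break
        else:
            # No facet claimed this token, so it is left for full-text search.
            leftover.append(token)
    return facets, " ".join(leftover)

print(parse_query("blue anime t-shirt"))
```

When the leftover string comes back empty, the full-text search can be skipped entirely and the facets used as plain filters.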


Since I didn't see this mentioned yet - I was writing a web search for a client's support data, and wasn't looking forward to setting up ElasticSearch. But I found the FTS5 extension built into SQLite. It was trivial to set up, my client was happy with the results after very little tuning, and it's SQLite so you've probably already got it on your computer: https://www.sqlite.org/fts5.html
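
For a sense of how little setup that involves, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the SQLite library behind your Python was compiled with FTS5 (common, but not guaranteed); the `tickets` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An FTS5 virtual table indexes every listed column for full-text search.
conn.execute("CREATE VIRTUAL TABLE tickets USING fts5(title, body)")
conn.executemany(
    "INSERT INTO tickets (title, body) VALUES (?, ?)",
    [
        ("Cannot log in", "The password reset email never arrives."),
        ("Invoice missing", "The March invoice is not in my account."),
        ("Slow export", "Exporting a large report takes ten minutes."),
    ],
)

def search(conn, query):
    """Return matching titles, best match first (FTS5's built-in rank)."""
    rows = conn.execute(
        "SELECT title FROM tickets WHERE tickets MATCH ? ORDER BY rank",
        (query,),
    )
    return [title for (title,) in rows]

print(search(conn, "invoice"))
print(search(conn, "password reset"))
```

Multiple bare words in an FTS5 query are combined with an implicit AND, so the second query only returns rows containing both terms.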


I think more software engineers should do themselves the favor of dabbling in search engine design.

It really is the gift that keeps on giving in terms of interesting problems: you get everything from large scale graph processing and esoteric data structures down to bit twiddling optimization, language processing, and dealing with seriously messy and sometimes adversarial real world data. Everything is challenging, but none of it is impossible, even as a solo project.


Underrated how effective indexOf can be for search if you don’t have an incredibly large set of data
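
As a rough illustration in Python, where `str.find` plays the role of `indexOf` (both return -1 when there is no match): the `substring_search` helper and the sample documents are hypothetical.

```python
def substring_search(documents, query):
    """Case-insensitive linear scan; returns documents containing the query."""
    needle = query.lower()
    return [doc for doc in documents if doc.lower().find(needle) != -1]

DOCS = [
    "How to reset your password",
    "Billing and invoices",
    "Password policy for admins",
]

print(substring_search(DOCS, "password"))
```

This is a scan over every document on every query, with no ranking, which is exactly why it only holds up on small data sets.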


Assuming you have an NVIDIA GPU, you can build a semantic image search engine by indexing CLIP embeds (image or text).

https://github.com/rom1504/clip-retrieval


HN discussion from 2017 on the same article at a different URL:

https://news.ycombinator.com/item?id=15231302


First thing would be to see if Elasticsearch works for you, because it will for many applications.


Donald Knuth's volume on Searching and Sorting needs a successor.


Link to Whoosh is not correct.


I don't think there is any correct link for Whoosh at this point. http://whoosh.ca/ is gone too.


tldr (4,473 words); Search is a classic NLP problem replete with all of natural languages' wide-ranging input and evaluation vagaries. I admire Russian get-to-the-point bluntness in most contexts, but for whatever cultural reason, their writing is always Tolstoyesque.


I don't understand why people still waste so much money on making some perfect search engine when they can just filter on tags. Every retail product sales website in the world works by filtering on tags, not deciphering search terms in the "right" way.

Want black shoes? I could search for "black shoes", and receive shoes with the brand/product name "Black". Or I could select 'category: shoes' and 'color: black' in the drop-down box. Hey, look, now I have a list of black shoes.


Because:

a) speaking "black shoes" into your phone is a better experience than scrolling down an arbitrarily long list of categories.

b) Is shoes a category? Is it clothing? Is it boots? Oh I need an ontology now? Will users find it intuitive to have to drill down on it???

c) Wait, the item comes in with an automatic description from a thousand different vendors. How do I decide what part of it maps to my "color" field? What if the color isn't in a separate field? How do I make sure that this item isn't dead in my inventory? How do I make sure that this popular item is retrievable for me too and not just my competitors that have a better search??? In this case just searching for "black" anywhere in the fields of the product, weighted for significance, gets me valid results without communicating to the user that they filtered on a color.
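
A minimal sketch of what "anywhere in the fields, weighted for significance" could look like. The field names, weights and products below are all assumptions for illustration; real weights would need tuning against actual queries. The match is a plain substring test, so "black" would also hit "blackberry".

```python
# Hypothetical weights: a hit in the title counts more than one in the description.
FIELD_WEIGHTS = {"title": 3.0, "colour": 2.0, "description": 1.0}

def score(product, term):
    """Sum the weights of every field that contains the term."""
    term = term.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        if term in product.get(field, "").lower():
            total += weight
    return total

def search(products, term):
    """Return SKUs of matching products, highest score first."""
    scored = [(score(p, term), p["sku"]) for p in products]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [sku for s, sku in ranked if s > 0]

PRODUCTS = [
    {"sku": "C3", "title": "Running shoes", "description": "White mesh with black sole"},
    {"sku": "A1", "title": "Black leather shoes", "description": "Classic oxford."},
    {"sku": "D4", "title": "Red sandals", "description": "Summer style."},
    {"sku": "B2", "title": "Oxford shoes", "colour": "black", "description": "Hand-stitched."},
]

print(search(PRODUCTS, "black"))
```

Note that products with no `colour` field at all still rank, which is the point of the comment: vendor data that never got mapped to a structured field remains retrievable.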


What do I do if I want burgundy shoes? Do I decide if I think it's red or purple or blue? Or do I go further and think what the person listing it thought it was? Do I hope the decision went the same way each time, and all shoes have their colour listed?

I don't do any of this. I filter by shoes and look with my eyes. Retail search is a poor example because existing retail tagging is poor.


Lucky you. I search eshops with

  black shoes site:https://shittyeshop.com
on Google because of abysmal search on 99% of sites.



