How Algolia Built Their Realtime Search as a Service

jsty · on May 3, 2018

Algolia had a great 8-part series of 'under the hood' blog posts:

Part 1: https://blog.algolia.com/inside-the-algolia-engine-part-1-in...

No affiliation, just thought they were really interesting.

amelius · on May 3, 2018

I just skimmed through those posts. Indeed interesting, but I couldn't find what type of data-structure they use for their main search (as opposed to instant search suggestions).

ddorian43 · on May 4, 2018

An inverted index? From part 2:

    For each document, we extract the list of words and build a hash-table that associates words to documents
    When all documents are processed, we compute an on-disk binary data-structure containing the mapping of words to documents. This data-structure is the index we will use to process queries.
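To make the quoted description concrete, here is a minimal in-memory sketch of an inverted index in Python. It only illustrates the general technique (a hash table from words to document IDs); the tokenizer, the function names `build_inverted_index` and `search`, and the AND query semantics are assumptions for illustration, and it says nothing about Algolia's actual on-disk binary format.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Hypothetical tokenizer: split on runs of word characters.
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    # Freeze the postings into sorted lists once all documents are processed.
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return IDs of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


docs = {
    1: "Realtime search as a service",
    2: "Search engines use an inverted index",
    3: "Inverted index on disk",
}
index = build_inverted_index(docs)
print(search(index, "inverted index"))
```

A real engine would also store positions, handle prefixes and typos, and serialize the postings to disk, none of which is shown here.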

thomasfromcdnjs · on May 3, 2018

If you have never used Algolia, I'd recommend doing so. For any reason.

They are one of those great companies that you want to emulate, their entire setup is polished and perfect.

flatlander · on May 3, 2018

>They are one of those great companies that you want to emulate

I created an Algolia account once with my personal GitHub account, just for testing purposes. Starting the next day and continuing for a few weeks, I received a stream of emails from their sales people requesting times to schedule phone calls, and telling me how much they can help my company. I had no idea how they knew which company I worked for until I checked my LinkedIn and found Algolia people were visiting my profile. They have some seriously stalkerish sales practices.

redox_ · on May 3, 2018

(I'm leading engineering at Algolia) Very sorry to read that the trial period you've experienced was that stalkerish. In order to provide the best customer & support experience to our users, we do ask our teams to do some research before reaching out to our users. I assume this went a little bit far in your case, most probably because you've listed the company you were working for on your GH profile? Please feel free to reach out to me personally if there is anything I can do or anything you cannot share on HN, my email is sylvain@alg..

sky_rw · on May 3, 2018

The irony of a person commenting on a company's aggressive sales tactics and having the company immediately respond with sales tactics. Love it. I bet you definitely alleviated any sense of being stalked.

TheSpiceIsLife · on May 3, 2018

Ultimately, though, if a sale is made in this space we’d probably be right in assuming the deal is mutually beneficial.

manigandham · on May 3, 2018

Step 1 in sales is knowing who you're talking to. If your LinkedIn profile is public, what's the problem? They're a business selling a service so it's reasonable they reach out.

edraferi · on May 3, 2018

Why on earth are you offended that a company responded positively when you created a relationship with them? If you think visiting your LinkedIn profile is stalking, you should delete it.

Totally agree that sales teams can overreach though. Perhaps there should be a little “why are you signing up? Personal/work/etc” quiz when you sign up, then emails reduced accordingly.

ASalazarMX · on May 3, 2018

I don't get why people expect privacy on information they made publicly available. It's hardly stalking if your information is just a few clicks away.

vonseel · on May 3, 2018

Don't they provide the HN search? I have never found it particularly effective.

vvoyer · on May 3, 2018

Hi vonseel, sorry to hear that. I work at Algolia and can directly impact how this search works. Do you have any particular, specific issue or feedback for us to make it better? Thanks!

petepete · on May 3, 2018

Really? It's accurate, fast and easy to navigate. Compared to the search options embedded on other sites it's a joy to use.

donmatito · on May 3, 2018

The HN search implementation is a bit sad, because it lacks the most magic part of Algolia: the real-time search.

Using Algolia in a full-page-load HTML request is like using a Ferrari to drive to the grocery store. You can, but what a waste.

louis-paul · on May 3, 2018

Have you tried https://hn.algolia.com?

donmatito · on May 4, 2018

Yes I had, but I forgot about it, thanks. Much better than HN's own implementation.

Algolia also implemented a LinkedIn contact search infinitely superior to LinkedIn's own search.

I remember thinking that both demos were a really brilliant growth strategy because they showed clearly, by contrast, how painful the status quo was.

tangue · on May 3, 2018

I miss the old search... and was never at ease with Algolia. It's relevant but I don't like the UX.

discussedbefore · on May 3, 2018

I use HN search all the time; it's useful.

There is no way to search across multiple comments on the same "story"; I use Google site search for this.

It only counts replies/"comments" for stories, not comments (which would probably be a lot more work).

It treats plural and singular search terms as very different searches.

The only ranking options are by popularity and date (would like to be able to sort results by number of replies too).

MrBuddyCasino · on May 3, 2018

"The problem with Zookeeper is that a change in the topology can take a lot of time to be detected. For example, if a host is down, it can take up to several seconds to be detected. This is way too long for us, as in one second will have potentially thousands of indexing jobs to process on the cluster and we need to have a master that attributes an ID to be able to handle them. So we have built our own election algorithm based on RAFT."

That's a bold move. Afaik the timeouts in Zookeeper can be tuned, no?

karterk · on May 3, 2018

Shameless plug: if you are looking for a simple, fast, fuzzy search engine that you want to host yourself, take a look at Typesense: https://github.com/typesense/typesense

Antrikshy · on May 3, 2018

Only tangentially related, but...

If you're looking for a front-end library for autocompletion, Twitter's typeahead.js is a nice one: https://github.com/twitter/typeahead.js

If you want one that works seamlessly with your React setup, React Autosuggest is pretty neat: https://github.com/moroshko/react-autosuggest

vvoyer · on May 3, 2018

Careful though: typeahead.js is not maintained at all, even though the GitHub repo doesn't say so clearly.

rawoke083600 · on May 3, 2018

Only more or less related... if you need to access any of the search servers on the internet... take a look at www.fibretiger.co.za for your fibre pricing ! :P

jazoom · on May 3, 2018

That looks good. According to your FAQ the main disadvantage this has compared to Elasticsearch is not being able to scale horizontally.

Have you considered using FoundationDB as a storage layer to match that feature?

Disclaimer: I have no idea how to build a distributed search engine.

karterk · on May 3, 2018

I might not be 100% correct on this, but it seems like even Algolia keeps all records in a single host and uses the other 2 hosts for only high availability.

Having said that, exploring a FoundationDB integration is definitely an interesting idea. However, quite a lot of use cases can be served perfectly well with a simple master+slave set-up, so my primary focus is on that until there is enough demand for horizontal scalability. For example, Typesense is not a great fit for things like log data that typically need large amounts of storage.

nl · on May 3, 2018

I noticed that Google (who know a bit about search I think) link to Algolia as the recommended way to do search if you are using Cloud Firestore.

That seems a good recommendation.

exclusiv · on May 3, 2018

Yeah it's solid and easy to set up by using cloud functions on Firestore events.

Querying on Firestore is really limited though and quite surprising really. They even have geopoint data types but you can't query on them.

I'm sure they'll add a lot more soon but Algolia has been great for filling in those gaps and adding robust search.

tootie · on May 3, 2018

Straying OT, but I find Google Cloud's array of database options to be hopelessly confusing. Particularly with the Fire* offerings that are all really cool, but seemingly limited to mobile use cases.

donmatito · on May 3, 2018

Every time I need to use Algolia in a project, there is this sense of "wow". It's such a magical feeling. Every single time.

curiousgal · on May 3, 2018

Well honestly, when searching through HN posts, their fuzzy search feature can be annoying.

DanBC · on May 3, 2018

It can be turned off.

amelius · on May 3, 2018

Offtopic: isn't it about time that Linux distros get a serious "built-in" search engine?

I mean, there's the "locate" command, but many people disable it because it's a performance hog. Shouldn't "search" be an integral part of OS and/or filesystem architecture?

e12e · on May 6, 2018

I actually use mlocate a bit, one benefit is that it's a simple system, easy to understand. And it's seen some use, so it tries to keep permissions consistent between the index/search and the filesystem.

There's also been a few attempts at integrating search with the various desktop projects, generally backed by xapian or some other full-text search library. I'm not sure what are currently the best/best maintained options.

I seem to recall gnome "tracker" had the most traction last I checked, not sure if eg the kde project has something similar. Looks like canonical booted tracker from the default install in 18.04 lts:

https://community.ubuntu.com/t/install-tracker-by-default-in...

Microsoft had its aborted attempt at a new fs built on top of sql server, for "proper" search at the fs level. I'm not aware of any real file systems that do full search out of the box. And I guess it's not clear that it'd be any better than initial indexing+index refresh on change/on a schedule.

emilsedgh · on May 6, 2018

KDE has Baloo [0]

[0] https://community.kde.org/Baloo

serguzest · on May 3, 2018

I've always thought they were using elasticsearch!

dvirsky · on May 3, 2018

The speed was a telltale that it's not the case ;)

amelius · on May 3, 2018

Does anybody know where statistics/benchmarks are posted for elasticsearch? This would be useful for performance comparisons with other products, and to see if anything is wrong with the configuration in case of slow queries.

ddorian43 · on May 4, 2018

https://benchmarks.elastic.co/index.html

https://github.com/elastic/rally

RyanShook · on May 3, 2018

Can the concept of distributed consensus be seen as an alternative to block chain? https://raft.github.io

zapita · on May 3, 2018

No, blockchains are built on top of distributed consensus. You can have distributed consensus without a blockchain, but not the other way around.