How Algolia Built Their Realtime Search as a Service (stackshare.io)
160 points by 0x4542 on May 2, 2018 | 41 comments



Algolia had a great 8-part series of 'under the hood' blog posts:

Part 1: https://blog.algolia.com/inside-the-algolia-engine-part-1-in...

No affiliation, just thought they were really interesting.


I just skimmed through those posts. Indeed interesting, but I couldn't find what type of data-structure they use for their main search (as opposed to instant search suggestions).


An inverted index? From part 2:

    For each document, we extract the list of words and build a hash-table that associates words to documents
    When all documents are processed, we compute an on-disk binary data-structure containing the mapping of words to documents. This data-structure is the index we will use to process queries.
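For anyone unfamiliar with the idea, here is a minimal in-memory sketch of a word-to-documents mapping. It is not Algolia's implementation (theirs is an on-disk binary structure per the quote); the function names and sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def search(index, query):
    """Return the IDs of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "Realtime search as a service",
    2: "Search engines use an inverted index",
    3: "An index maps words to documents",
}
index = build_inverted_index(docs)
```

A query is answered by intersecting the posting lists of its words, so `search(index, "inverted index")` only has to touch two small lists instead of scanning every document.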


If you have never used Algolia, I'd recommend doing so. For any reason.

They are one of those great companies that you want to emulate, their entire setup is polished and perfect.


>They are one of those great companies that you want to emulate

I created an Algolia account once with my personal GitHub account, just for testing purposes. Starting the next day and continuing for a few weeks, I received a stream of emails from their sales people requesting times to schedule phone calls, and telling me how much they could help my company. I had no idea how they knew which company I worked for until I checked my LinkedIn and found Algolia people were visiting my profile. They have some seriously stalkerish sales practices.


(I'm leading engineering at Algolia) Very sorry to read that the trial period you've experienced was that stalkerish. In order to provide the best customer & support experience to our users, we do ask our teams to do some research before reaching out to our users. I assume this went a little bit far in your case, most probably because you've listed the company you were working for on your GH profile? Please feel free to reach out to me personally if there is anything I can do or anything you cannot share on HN, my email is sylvain@alg..


The irony of a person commenting on a company's aggressive sales tactics and having the company immediately respond with sales tactics. Love it. I bet you definitely alleviated any sense of being stalked.


Ultimately, though, if a sale is made in this space we’d probably be right in assuming the deal is mutually beneficial.


Step 1 in sales is knowing who you're talking to. If your LinkedIn profile is public, what's the problem? They're a business selling a service so it's reasonable they reach out.


Why on earth are you offended that a company responded positively when you created a relationship with them? If you think visiting your LinkedIn profile is stalking, you should delete it.

Totally agree that sales teams can overreach though. Perhaps there should be a little “why are you signing up? Personal/work/etc” quiz when you sign up, then emails reduced accordingly.


I don't get why people expect privacy on information they made publicly available. It's hardly stalking if your information is just a few clicks away.


Don't they provide the HN search? I have never found it particularly effective.


Hi vonseel, sorry to hear that. I work at Algolia and can directly impact how this search works. Do you have any particular, specific issue or feedback for us to make it better? Thanks!


Really? It's accurate, fast and easy to navigate. Compared to the search options embedded on other sites it's a joy to use.


The HN search implementation is a bit sad, because it lacks the most magical part of Algolia: the real-time search.

Using Algolia in a full-page-load HTML request is like using a Ferrari to drive to the grocery store. You can, but what a waste.


Have you tried https://hn.algolia.com?


Yes I had, but I forgot about it, thanks. Much better than HN's own implementation.

Algolia also implemented a LinkedIn contact search infinitely superior to LinkedIn's own search.

I remember thinking that both demos were a really brilliant growth strategy because they showed clearly, by contrast, how painful the status quo was.


I miss the old search... and was never at ease with Algolia. It's relevant but I don't like the UX.


I use HN search all the time; it's useful.

There is no way to search across multiple comments on the same "story"; I use Google site search for this.

It only counts replies/"comments" for stories, not comments (which would probably be a lot more work).

It differentiates between plural and singular search terms as very different searches.

The only ranking options are by popularity and date (would like to be able to sort results by number of replies too).


"The problem with Zookeeper is that a change in the topology can take a lot of time to be detected. For example, if a host is down, it can take up to several seconds to be detected. This is way too long for us, as in one second we will have potentially thousands of indexing jobs to process on the cluster and we need to have a master that attributes an ID to be able to handle them. So we have built our own election algorithm based on RAFT."

That's a bold move. Afaik the timeouts in Zookeeper can be tuned, no?


Shameless plug: if you are looking for a simple, fast, fuzzy search engine that you want to host yourself, take a look at Typesense: https://github.com/typesense/typesense


Only tangentially related, but...

If you're looking for a front-end library for autocompletion, Twitter's typeahead.js is a nice one: https://github.com/twitter/typeahead.js

If you want one that works seamlessly with your React setup, React Autosuggest is pretty neat: https://github.com/moroshko/react-autosuggest


Careful though: typeahead.js is not maintained at all, even though the GitHub repo doesn't say so clearly.


Only more or less related... if you need to access any of the search servers on the internet... take a look at www.fibretiger.co.za for your fibre pricing! :P


That looks good. According to your FAQ the main disadvantage this has compared to Elasticsearch is not being able to scale horizontally.

Have you considered using FoundationDB as a storage layer to match that feature?

Disclaimer: I have no idea how to build a distributed search engine.


I might not be 100% correct on this, but it seems like even Algolia keeps all records in a single host and uses the other 2 hosts for only high availability.

Having said that, exploring a FoundationDB integration is definitely an interesting idea. However, quite a lot of use cases can be served perfectly well with a simple master+slave set-up, so my primary focus is on that until there is enough demand for horizontal scalability. For example, Typesense is not a great fit for things like log data that typically need large amounts of storage.


I noticed that Google (who know a bit about search I think) link to Algolia as the recommended way to do search if you are using Cloud Firestore.

That seems a good recommendation.


Yeah it's solid and easy to set up by using cloud functions on Firestore events.

Querying on Firestore is really limited though and quite surprising really. They even have geopoint data types but you can't query on them.

I'm sure they'll add a lot more soon but Algolia has been great for filling in those gaps and adding robust search.


Straying OT, but I find Google Cloud's array of database options to be hopelessly confusing. Particularly with the Fire* offerings that are all really cool, but seemingly limited to mobile use cases.


Every time I need to use Algolia in a project, there is this sense of "wow". It's such a magical feeling. Every single time.


Well honestly, when searching through HN posts, their fuzzy search feature can be annoying.


It can be turned off.


Offtopic: isn't it about time that Linux distros get a serious "built-in" search engine?

I mean, there's the "locate" command, but many people disable it because it's a performance hog. Shouldn't "search" be an integral part of OS and/or filesystem architecture?


I actually use mlocate a bit; one benefit is that it's a simple system, easy to understand. And it's seen some use, so it tries to keep permissions consistent between the index/search and the filesystem.
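To show how simple the basic idea is, here is a rough sketch of a locate-style path index: walk the tree once, then answer queries from the stored list. It is not how mlocate is implemented. The function names are made up, matching is on the file name only, and the access check at query time is just my stand-in for the permission handling mentioned above.

```python
import os
import tempfile


def build_path_index(root):
    """Walk the tree once and return a sorted list of every file path."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def locate(index, pattern):
    """Return indexed paths whose file name contains `pattern`.

    Entries that no longer exist or are not readable by the current
    user are filtered out at query time.
    """
    return [
        path for path in index
        if pattern in os.path.basename(path) and os.access(path, os.R_OK)
    ]


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "docs"))
    for rel in (os.path.join("docs", "notes.txt"),
                os.path.join("docs", "todo.txt"),
                "readme.md"):
        with open(os.path.join(root, rel), "w") as handle:
            handle.write("x")
    index = build_path_index(root)
    hits = [os.path.relpath(p, root) for p in locate(index, "notes")]

# The tree is gone now, so the stale index entries are filtered out.
stale_hits = locate(index, "notes")
```

The index goes stale as soon as files change, which is why tools of this kind rebuild it on a schedule.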

There have also been a few attempts at integrating search with the various desktop projects, generally backed by Xapian or some other full-text search library. I'm not sure which are currently the best or best-maintained options.

I seem to recall GNOME "tracker" had the most traction last I checked; not sure if e.g. the KDE project has something similar. Looks like Canonical booted tracker from the default install in 18.04 LTS:

https://community.ubuntu.com/t/install-tracker-by-default-in...

Microsoft had its aborted attempt at a new filesystem built on top of SQL Server, for "proper" search at the filesystem level. I'm not aware of any real file systems that do full search out of the box. And I guess it's not clear that it'd be any better than initial indexing plus an index refresh on change or on a schedule.


KDE has Baloo [0]

[0] https://community.kde.org/Baloo


I've always thought they were using elasticsearch!


The speed was a telltale that it's not the case ;)


Does anybody know where statistics/benchmarks are posted for elasticsearch? This would be useful for performance comparisons with other products, and to see if anything is wrong with the configuration in case of slow queries.



Can the concept of distributed consensus be seen as an alternative to blockchain? https://raft.github.io


No, blockchains are built on top of distributed consensus. You can have distributed consensus without a blockchain, but not the other way around.



