I just skimmed through those posts. Indeed interesting, but I couldn't find what type of data-structure they use for their main search (as opposed to instant search suggestions).
For each document, we extract the list of words and build a hash-table that associates words to documents.
When all documents are processed, we compute an on-disk binary data-structure containing the mapping of words to documents. This data-structure is the index we will use to process queries.
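To make the two quoted steps concrete, here is a minimal sketch of the general idea: an in-memory hash table from words to document IDs, then a flat binary blob that can be written to disk and queried. The function names and the byte layout are my own assumptions for illustration, not Algolia's actual format.

```python
import re
import struct
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    # Step 1: hash table mapping each word to the set of document IDs containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def serialize(index):
    # Step 2: hypothetical binary layout (little-endian):
    # [word count][word length][word bytes][posting count][doc IDs...] per word.
    out = bytearray(struct.pack("<I", len(index)))
    for word in sorted(index):
        encoded = word.encode("utf-8")
        postings = sorted(index[word])
        out += struct.pack("<H", len(encoded)) + encoded
        out += struct.pack("<I", len(postings))
        out += struct.pack(f"<{len(postings)}I", *postings)
    return bytes(out)


def deserialize(blob):
    # Read the blob back into a plain dict of word -> sorted list of doc IDs.
    index = {}
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    for _ in range(count):
        (word_len,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        word = blob[offset:offset + word_len].decode("utf-8")
        offset += word_len
        (n,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        index[word] = list(struct.unpack_from(f"<{n}I", blob, offset))
        offset += 4 * n
    return index


def search(index, query):
    # Return IDs of documents that contain every word of the query.
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)
```

A real engine would memory-map the file and look words up without loading everything, but the word-to-documents mapping is the same.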
>They are one of those great companies that you want to emulate
I created an Algolia account once with my personal GitHub account, just for testing purposes. Starting the next day and continuing for a few weeks, I received a stream of emails from their sales people requesting times to schedule phone calls, and telling me how much they can help my company. I had no idea how they knew which company I worked for until I checked my LinkedIn and found Algolia people were visiting my profile. They have some seriously stalkerish sales practices.
(I'm leading engineering at Algolia) Very sorry to read that the trial period you've experienced was that stalkerish. In order to provide the best customer & support experience to our users, we do ask our teams to do some research before reaching out to our users. I assume this went a little bit far in your case, most probably because you've listed the company you were working for on your GH profile?
Please feel free to reach out to me personally if there is anything I can do or anything you cannot share on HN, my email is sylvain@alg..
The irony of a person commenting on a company's aggressive sales tactics and having the company immediately respond with sales tactics. Love it. I bet you definitely alleviated any sense of being stalked.
Step 1 in sales is knowing who you're talking to. If your LinkedIn profile is public, what's the problem? They're a business selling a service so it's reasonable they reach out.
Why on earth are you offended that a company responded positively when you created a relationship with them? If you think visiting your LinkedIn profile is stalking, you should delete it.
Totally agree that sales teams can overreach though. Perhaps there should be a little “why are you signing up? Personal/work/etc” quiz when you sign up, then emails reduced accordingly.
Hi vonseel, sorry to hear that. I work at Algolia and can directly impact how this search works. Do you have any particular, specific issue or feedback for us to make it better? Thanks!
"The problem with Zookeeper is that a change in the topology can take a lot of time to be detected. For example, if a host is down, it can take up to several seconds to be detected. This is way too long for us, as in one second will have potentially thousands of indexing jobs to process on the cluster and we need to have a master that attributes an ID to be able to handle them. So we have built our own election algorithm based on RAFT."
That's a bold move. Afaik the timeouts in Zookeeper can be tuned, no?
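For context on the tuning: as far as I know, the server clamps the session timeout a client asks for between minSessionTimeout and maxSessionTimeout, which default to 2x and 20x tickTime. The function below is only a rough model of that rule, with a hypothetical name and a 2000 ms tickTime assumed as in the sample config. It is not ZooKeeper's own code.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_ms=None, max_ms=None):
    # Assumed defaults: minSessionTimeout = 2 * tickTime,
    # maxSessionTimeout = 20 * tickTime.
    low = 2 * tick_time_ms if min_ms is None else min_ms
    high = 20 * tick_time_ms if max_ms is None else max_ms
    # The server clamps the client's requested timeout into [low, high].
    return max(low, min(requested_ms, high))
```

So with a 2000 ms tick, asking for a 500 ms session still gets you 4000 ms, which matches the "several seconds" they mention. You would have to lower tickTime or minSessionTimeout to detect a dead host faster, at the cost of more false positives.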
Shameless plug: if you are looking for a simple, fast, fuzzy search engine that you want to host yourself, take a look at Typesense: https://github.com/typesense/typesense
Only more or less related... if you need to access any of the search servers on the internet... take a look at www.fibretiger.co.za for your fibre pricing! :P
I might not be 100% correct on this, but it seems like even Algolia keeps all records in a single host and uses the other 2 hosts for only high availability.
Having said that, exploring a FoundationDB integration is definitely an interesting idea. However, quite a lot of use cases can be served perfectly well with a simple master+slave set-up, so my primary focus is on that until there is enough demand for horizontal scalability. For example, Typesense is not a great fit for things like log data that typically need large amounts of storage.
Straying OT, but I find Google Cloud's array of database options to be hopelessly confusing. Particularly with the Fire* offerings that are all really cool, but seemingly limited to mobile use cases.
Offtopic: isn't it about time that Linux distros get a serious "built-in" search engine?
I mean, there's the "locate" command, but many people disable it because it's a performance hog. Shouldn't "search" be an integral part of OS and/or filesystem architecture?
I actually use mlocate a bit; one benefit is that it's a simple system, easy to understand. And it's seen some use, so it tries to keep permissions consistent between the index/search and the filesystem.
There have also been a few attempts at integrating search with the various desktop projects, generally backed by xapian or some other full-text search library. I'm not sure which are currently the best/best maintained options.
I seem to recall GNOME "tracker" had the most traction last I checked, not sure if e.g. the KDE project has something similar. Looks like Canonical booted tracker from the default install in 18.04 LTS.
Microsoft had its aborted attempt at a new fs built on top of sql server, for "proper" search at the fs level. I'm not aware of any real file systems that do full search out of the box. And I guess it's not clear that it'd be any better than initial indexing+index refresh on change/on a schedule.
Does anybody know where statistics/benchmarks are posted for elasticsearch? This would be useful for performance comparisons with other products, and to see if anything is wrong with the configuration in case of slow queries.
Part 1: https://blog.algolia.com/inside-the-algolia-engine-part-1-in...
No affiliation, just thought they were really interesting.