I've been using Elasticsearch since 2012, and Solr and Lucene before that. It's an awesome product. The current version has evolved a lot from the early days, but it still works pretty much the same way; it has just gotten faster, more robust, and more capable. And of course it continues to have a great symbiotic relationship with Apache Lucene, which keeps improving with every release as well.
I think that's actually the key to its success. Elasticsearch is still open source but it also has a few closed source components. Lucene, however, is truly 100% open source, and through that project Elastic stays embedded in an open community of researchers and engineers. That gives them a constant stream of new ideas and feedback they can turn into new features and added value.
This is perhaps something where some other OSS companies struggle more: because of their code ownership requirements (copyright assignment), they effectively end up employing, dominating, and monopolizing most of their respective communities. Elastic is at risk of this too, but through Lucene it keeps a connection to the outside.
IMHO open communities like this are essential to the long-term success of OSS software. Databases like MySQL/MariaDB and PostgreSQL, or the Linux kernel, are good examples of projects with thriving communities that continue to evolve regardless of what happens to the companies working on or depending on them. MySQL, for example, has had a colorful history of ownership changes, forks, competing vendors, etc. Some of those companies are long gone but the code lives on.
An interesting point about Lucene (I'd highly recommend reading the code base, if you're a Java dev, for some inspiration) is that its creator and still the project lead, Mike McCandless, works for A9 at Amazon.
Edit: TIL Doug Cutting originally created Apache Lucene.
I can't wait until someone "Snowflake"s Elasticsearch and separates compute from storage.
There's not a great reason for there to be "shards" and "clusters". In 2021, I'd appreciate search running from object storage (e.g. S3), with a dynamic compute layer on top providing all of the compute, especially since indexing is such a source of variable compute load.
Also, why do we have to re-index all of the time? Most textual customization can happen at query time, with only a mild performance hit.
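As a concrete example of query-time customization: Elasticsearch's match query accepts an `analyzer` parameter that overrides the search-time analyzer without touching the index. A minimal sketch of the request body (the index and the `body` field name here are hypothetical):

```python
# Sketch: swapping text analysis at query time instead of reindexing.
# The "analyzer" option on a match query is a real Elasticsearch feature;
# the field name "body" is just an assumed example.

def build_query(text, analyzer="standard"):
    """Build a match query applying the given analyzer at search time."""
    return {
        "query": {
            "match": {
                "body": {
                    "query": text,
                    "analyzer": analyzer,  # per-query override, no reindex
                }
            }
        }
    }

# Same indexed documents, two tokenization behaviors at query time:
q_stemmed = build_query("running shoes", analyzer="english")
q_literal = build_query("running shoes", analyzer="whitespace")
```

Index-time choices like which fields exist, or how terms were stemmed when written, still require a reindex, which is where the "mild performance hit" trade-off ends.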
I'm curious if we ever truly get a cloud-native, easy-to-maintain search thing, or if we have to wait for ML to eat search in order to get nextgen tech.
The reason is that object storage is slow and not built for high-performance access, which usually matters for large databases.
For your S3 example, and ignoring IOPS, you're comparing roughly 13ms of latency for a local spinning disk against 10s-100s of ms from S3. SSD is faster still, averaging 1ms of latency or less.
Adding IOPS to the equation, you’re likely going to slam your object store if you have a high volume of traffic, where your block storage likely wouldn’t even break a sweat.
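Back-of-envelope math with the numbers above shows how per-read latency compounds. Assume a query that needs 50 dependent (serial) random reads, and take 50ms as a middle-of-the-road S3 first-byte latency (assumed values, not benchmarks):

```python
# Rough latency comparison: 50 serial random reads per query.
# Per-read latencies are the approximate figures quoted above;
# the S3 number is an assumption within the stated 10s-100s of ms range.

reads = 50
latency_ms = {"SSD": 1, "HDD": 13, "S3": 50}

totals = {tier: reads * ms for tier, ms in latency_ms.items()}
for tier, total in totals.items():
    print(f"{tier}: {total} ms per query")
# SSD: 50 ms, HDD: 650 ms, S3: 2500 ms -- the gap compounds on every hop
```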
I think that might work well, at least until the "hot" part of your dataset exceeds the available memory in the cache, unless you made the cache distributed and sharded.
This doesn't solve writes. I guess a writer could write to a memory buffer and only flush to S3 when a block is complete, but that wouldn't work in a multiprocess/multi-node environment where writers can't share memory buffers.
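The buffered-writer idea above can be sketched as follows. `put_object` here is a stand-in for an object-store client call (e.g. an S3 PUT); the block naming is hypothetical:

```python
# Sketch: buffer documents in memory, flush an immutable block to object
# storage once the block is "complete". Single-writer only, as noted above:
# two independent processes would each hold private, mutually invisible
# buffers, so unflushed docs are lost to readers until flush.

class BlockWriter:
    def __init__(self, block_size, put_object):
        self.block_size = block_size
        self.put_object = put_object  # stand-in for an S3 client PUT
        self.buffer = []
        self.block_id = 0

    def write(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.block_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Blocks are immutable once written, like Lucene segments.
        self.put_object(f"block-{self.block_id}", list(self.buffer))
        self.block_id += 1
        self.buffer.clear()
```

Coordinating multiple such writers is exactly the hard part: you either shard the write stream (back to shards) or serialize through a single ingest node.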
How do you think S3 works? It's a distributed cluster with shards, replication, and everything. Elasticsearch implements its own version of that with optimizations that suit its use case, like running algorithms close to the data instead of first fetching the data over a network.
You'd have a hard time matching performance, throughput, and (particularly) cost by rebuilding that on top of S3. It would probably work well enough for smallish data sets but would get expensive pretty quickly beyond that. For the same reason, network-attached storage is not a thing with Elasticsearch: bare metal, preferably SSD, is what you want. Technically it works (though NFS is not recommended), but the performance sucks.
Elastic is doing plenty with ML but fundamentally search requires digging through heaps of data, whether you use ML or not. That data needs to live somewhere and that somewhere is indeed a distributed object store, which is exactly what Elastic implemented already.
As for reindexing all the time: you index only once, until you change the schema or the data. That kind of is the whole point: indexing is CPU-intensive and you don't want to redo that work over and over again at query time. There are systems like Presto that do scan at query time, of course, but they tend not to be used for real-time, interactive use cases (like search). The whole point of having an index is not having to scan heaps of data when you run your queries. Systems like Presto are good at delegating that scanning work to gazillions of cluster nodes, which makes it seem cheaper than it actually is.
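The index-once, query-many trade-off can be shown with a toy inverted index (the kind of structure Lucene builds, vastly simplified):

```python
# Toy inverted index: pay the tokenization/indexing cost once at write
# time; afterwards queries are cheap term lookups and set intersections
# instead of a scan over every document.

from collections import defaultdict

docs = {1: "quick brown fox", 2: "lazy brown dog", 3: "quick red dog"}

# Index once -- the CPU-intensive step.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Query many times -- no per-query document scan.
def search(*terms):
    results = set(docs)
    for t in terms:
        results &= index.get(t, set())
    return sorted(results)

print(search("brown"))         # docs 1 and 2
print(search("quick", "dog"))  # doc 3
```

A scan-at-query-time system does the tokenize-and-match loop for every query instead; that's fine when queries are rare and batchy, and ruinous when they're interactive.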
Have you looked at Presto [1]? I'm currently using it for just that reason. Our Elasticsearch instances contain only recent data, which eventually expires but continues to live in S3. Presto can search across both, and more.
ChaosSearch is pretty rad, but it lacks some aggregation and integration support vs vanilla Elastic (last I checked).
Elastic is in beta with its own warm, cold, and frozen data tier system, including "Searchable Snapshots" that work with plain object stores to provide a seamless query experience with lower costs for large volumes of data:
https://www.elastic.co/blog/introducing-elasticsearch-search...
Can someone educate me about which of the two approaches mentioned here is better, Snowflake's separation of compute from storage or Elasticsearch's aggressive sharding? Thanks in advance!