I've been using Elasticsearch since 2012, and Solr and Lucene before that. It's an awesome product. The current version has evolved a lot from the early days, but it still works pretty much the same way; it has just gotten faster, more robust, and more capable. And of course it continues to have a great symbiotic relationship with Apache Lucene, which keeps improving with every release as well.
I think that's actually the key to its success. Elasticsearch is still open source but it also has a few closed source components. Lucene, however, is truly 100% open source, and through that project Elastic stays embedded in an open community of researchers and engineers. That gives them a constant stream of new ideas and feedback they can turn into new features and added value.
This is perhaps something where some other OSS companies struggle more: because of their code ownership requirements (copyright assignment), they effectively end up employing, dominating, and monopolizing most of their respective communities. Elastic is at risk of this too, but through Lucene it keeps a connection to the outside.
IMHO open communities like this are essential to the long-term success of OSS software. Databases like MySQL/MariaDB and PostgreSQL, or the Linux kernel, are good examples of projects with thriving communities that continue to evolve regardless of what happens to the companies working on or depending on them. MySQL, for example, has had a colorful history of ownership changes, forks, competing vendors, etc. Some of those companies are long gone but the code lives on.
An interesting point about Lucene (I'd highly recommend reading the code base, if you're a Java dev, for some inspiration) is that its creator and still the project lead, Mike McCandless, works for A9 at Amazon.
Edit: TIL Doug Cutting originally created Apache Lucene.
I can't wait until someone "Snowflake"s Elasticsearch and separates compute from storage.
There's not a great reason for there to be "shards" and "clusters". In 2021, I'd appreciate search running from object storage (e.g. S3), with a dynamic compute layer on top providing all of the compute, especially since indexing is such a source of variable compute load.
Also, why do we have to re-index all of the time? Most textual customization can happen at query time, with only a mild performance hit.
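As a concrete example of query-time customization: Elasticsearch's match query accepts an `analyzer` parameter that overrides the search-time analyzer without touching the index. A minimal sketch of the request body (the index and the `body` field name here are hypothetical):

```python
# Sketch: swapping text analysis at query time instead of reindexing.
# The "analyzer" option on a match query is a real Elasticsearch feature;
# the field name "body" is just an assumed example.

def build_query(text, analyzer="standard"):
    """Build a match query applying the given analyzer at search time."""
    return {
        "query": {
            "match": {
                "body": {
                    "query": text,
                    "analyzer": analyzer,  # per-query override, no reindex
                }
            }
        }
    }

# Same indexed documents, two tokenization behaviors at query time:
q_stemmed = build_query("running shoes", analyzer="english")
q_literal = build_query("running shoes", analyzer="whitespace")
```

Index-time choices like which fields exist, or how terms were stemmed when written, still require a reindex, which is where the "mild performance hit" trade-off ends.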
I'm curious if we ever truly get a cloud-native, easy-to-maintain search thing, or if we have to wait for ML to eat search in order to get nextgen tech.
The reason is that object storage is slow and not built for high-performance access, which usually matters for large databases.
For your S3 example, and ignoring IOPS, you're comparing roughly 13ms of latency for a local spinning disk against 10s-100s of ms from S3. SSD is faster still, averaging 1ms of latency or less.
Adding IOPS to the equation, you’re likely going to slam your object store if you have a high volume of traffic, where your block storage likely wouldn’t even break a sweat.
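Back-of-envelope math with the numbers above shows how per-read latency compounds. Assume a query that needs 50 dependent (serial) random reads, and take 50ms as a middle-of-the-road S3 first-byte latency (assumed values, not benchmarks):

```python
# Rough latency comparison: 50 serial random reads per query.
# Per-read latencies are the approximate figures quoted above;
# the S3 number is an assumption within the stated 10s-100s of ms range.

reads = 50
latency_ms = {"SSD": 1, "HDD": 13, "S3": 50}

totals = {tier: reads * ms for tier, ms in latency_ms.items()}
for tier, total in totals.items():
    print(f"{tier}: {total} ms per query")
# SSD: 50 ms, HDD: 650 ms, S3: 2500 ms -- the gap compounds on every hop
```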
I think that might work well, at least until the "hot" part of your dataset exceeds the available memory in the cache, unless you made the cache distributed and sharded.
This doesn't solve writes. I guess a writer could write to a memory buffer and only flush to S3 when a block is complete, but that wouldn't work in a multiprocess/multi-node environment where writers can't share memory buffers.
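The buffered-writer idea above can be sketched as follows. `put_object` here is a stand-in for an object-store client call (e.g. an S3 PUT); the block naming is hypothetical:

```python
# Sketch: buffer documents in memory, flush an immutable block to object
# storage once the block is "complete". Single-writer only, as noted above:
# two independent processes would each hold private, mutually invisible
# buffers, so unflushed docs are lost to readers until flush.

class BlockWriter:
    def __init__(self, block_size, put_object):
        self.block_size = block_size
        self.put_object = put_object  # stand-in for an S3 client PUT
        self.buffer = []
        self.block_id = 0

    def write(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.block_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Blocks are immutable once written, like Lucene segments.
        self.put_object(f"block-{self.block_id}", list(self.buffer))
        self.block_id += 1
        self.buffer.clear()
```

Coordinating multiple such writers is exactly the hard part: you either shard the write stream (back to shards) or serialize through a single ingest node.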
How do you think S3 works? It's a distributed cluster with shards, replication, and everything. Elasticsearch implements its own version of that with optimizations that suit its use case, like running algorithms close to the data instead of first fetching the data over a network.
You'd have a hard time matching performance, throughput, and (particularly) cost by rebuilding that on top of S3. It would probably work well enough for smallish data sets but would get expensive pretty quickly beyond that. For the same reason, network-attached storage is not a thing with Elasticsearch: bare metal, preferably SSD, is what you want. Technically it works (though NFS is not recommended), but the performance sucks.
Elastic is doing plenty with ML but fundamentally search requires digging through heaps of data, whether you use ML or not. That data needs to live somewhere and that somewhere is indeed a distributed object store, which is exactly what Elastic implemented already.
As for reindexing all the time: you index only once, until you change the schema or the data. That kind of is the whole point: indexing is CPU-intensive and you don't want to redo that work over and over again at query time. There are systems like Presto that do scan at query time, of course, but they tend not to be used for real-time, interactive use cases (like search). The whole point of having an index is not having to scan heaps of data when you run your queries. Systems like Presto are good at delegating that scanning work to gazillions of cluster nodes, which makes it seem cheaper than it actually is.
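The index-once, query-many trade-off can be shown with a toy inverted index (the kind of structure Lucene builds, vastly simplified):

```python
# Toy inverted index: pay the tokenization/indexing cost once at write
# time; afterwards queries are cheap term lookups and set intersections
# instead of a scan over every document.

from collections import defaultdict

docs = {1: "quick brown fox", 2: "lazy brown dog", 3: "quick red dog"}

# Index once -- the CPU-intensive step.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Query many times -- no per-query document scan.
def search(*terms):
    results = set(docs)
    for t in terms:
        results &= index.get(t, set())
    return sorted(results)

print(search("brown"))         # docs 1 and 2
print(search("quick", "dog"))  # doc 3
```

A scan-at-query-time system does the tokenize-and-match loop for every query instead; that's fine when queries are rare and batchy, and ruinous when they're interactive.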
Have you looked at Presto [1]? I'm currently using it for just that reason. Our Elasticsearch instances contain only recent data, which eventually expires but continues to live in S3. Presto can search across both, and more.
ChaosSearch is pretty rad, but it lacks some aggregation and integration support vs vanilla Elastic (last I checked).
Elastic is in beta with its own warm, cold, and frozen data tier system, including "Searchable Snapshots" that work with plain object stores to provide a seamless query experience with lower costs for large volumes of data:
https://www.elastic.co/blog/introducing-elasticsearch-search...
Can someone educate me about which of the two approaches mentioned here is better, Snowflake's separation of compute from storage or Elasticsearch's aggressive sharding? Thanks in advance!