
In the beginning was Lucene...



It’s still Lucene underneath. Just like Solr.

The bread and butter of Elasticsearch is its aggressive sharding, shard allocation, replication, and promotion. It essentially “clusterises” Lucene.

I can’t wait for something similar to what Elastic did for Lucene to happen to Redis or Postgres.


I can't wait until someone "Snowflake"s Elasticsearch and separates compute from storage.

There's not a great reason for there to be "shards" and "clusters". In 2021, I'd appreciate search running from object storage (e.g. S3), with a dynamic compute layer on top providing all of the compute--especially since indexing is such a variable source of load.

Also, why do we have to re-index all of the time? Most textual customization can happen at query time, with only a mild performance hit.

I'm curious if we ever truly get a cloud-native, easy-to-maintain search thing, or if we have to wait for ML to eat search in order to get nextgen tech.


The reason is that object storage is slow and not meant for high performance, which usually matters for large databases.

For your S3 example, and ignoring IOPS, you're comparing roughly 13 ms of latency on a local spinning disk versus tens to hundreds of ms from S3. SSDs are faster still, averaging 1 ms of latency or less.

Adding IOPS to the equation, you're likely going to slam your object store if you have a high volume of traffic, whereas your block storage likely wouldn't even break a sweat.


It's ~200 LOC to implement a Lucene block cache over S3, and because Lucene uses log-structured storage, that block cache will be quite effective.
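
Roughly, the read path could look like this. This is only a sketch, not the actual ~200 LOC: the bucket/key layout, block size, LRU policy, and use of the AWS SDK v2 are my own assumptions, and a real version would sit behind a Lucene Directory/IndexInput implementation.

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class S3BlockCache {
        private static final int BLOCK_SIZE = 1 << 20;  // 1 MiB blocks
        private static final int MAX_BLOCKS = 256;      // keep ~256 MiB hot in memory

        private final S3Client s3 = S3Client.create();
        private final String bucket;

        // LRU eviction via an access-ordered LinkedHashMap.
        private final Map<String, byte[]> cache =
            new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                    return size() > MAX_BLOCKS;
                }
            };

        public S3BlockCache(String bucket) {
            this.bucket = bucket;
        }

        // Return the cached block containing byte offset `pos` of a segment file,
        // fetching it from S3 with a ranged GET on a miss.
        public synchronized byte[] readBlock(String key, long pos) {
            long blockStart = (pos / BLOCK_SIZE) * BLOCK_SIZE;
            return cache.computeIfAbsent(key + "@" + blockStart, k -> fetch(key, blockStart));
        }

        private byte[] fetch(String key, long start) {
            GetObjectRequest req = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .range("bytes=" + start + "-" + (start + BLOCK_SIZE - 1))
                .build();
            return s3.getObjectAsBytes(req).asByteArray();
        }
    }

Because Lucene segment files are immutable once written, cached blocks never go stale, which is why a cache this simple can work at all.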


I think it might work well, at least until the "hot" part of your dataset exceeds the available memory in the cache, unless you made the cache distributed and sharded.

This doesn't solve writes. I guess a writer would write to a memory buffer and only flush to S3 once a block is complete, but that wouldn't work in a multiprocess/multi-node environment if the processes can't share memory buffers.
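
For a single writer the buffering could be as dumb as this (a sketch under the same assumptions as the read-side one above; names are placeholders, and nothing here coordinates multiple writers):

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    import java.io.ByteArrayOutputStream;

    public class BufferedS3Writer {
        private static final int BLOCK_SIZE = 1 << 20;  // flush threshold, 1 MiB

        private final S3Client s3 = S3Client.create();
        private final String bucket;
        private final String keyPrefix;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private int blockNumber = 0;

        public BufferedS3Writer(String bucket, String keyPrefix) {
            this.bucket = bucket;
            this.keyPrefix = keyPrefix;
        }

        // Lucene segment files are written append-only, so a plain append
        // buffer is enough for a single writer.
        public void write(byte[] data) {
            buffer.writeBytes(data);
            if (buffer.size() >= BLOCK_SIZE) {
                flush();
            }
        }

        // Upload the completed block as its own object and start a new one.
        public void flush() {
            if (buffer.size() == 0) return;
            String key = keyPrefix + "/block-" + blockNumber++;
            s3.putObject(
                PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.fromBytes(buffer.toByteArray()));
            buffer.reset();
        }
    }
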


How do you think S3 works? It's a distributed cluster with shards, replication, and everything. Elasticsearch implements its own version of that with some optimizations that suit its use case, like running algorithms close to the data instead of fetching the data over the network and then running the algorithm.

You'd have a hard time matching that performance, throughput and (particularly) cost by rebuilding it on top of S3. It would probably work well enough for smallish data sets but would get expensive pretty quickly beyond that. For the same reason, network-attached storage is not a thing with Elasticsearch. Bare metal, preferably SSD, is what you want. Technically it works (though NFS is not recommended) but the performance sucks.

Elastic is doing plenty with ML, but fundamentally search requires digging through heaps of data, whether you use ML or not. That data needs to live somewhere, and that somewhere is indeed a distributed object store, which is exactly what Elastic already implemented.

As for reindexing all the time: you index only once, until you change the schema or the data. That's kind of the whole point: indexing is CPU intensive and you don't want to do it over and over again at query time. There are systems like Presto that do scan at query time, of course, but they tend not to be used for real-time, interactive use cases (like search). The whole point of having an index is not having to scan heaps of data when you run your queries. Systems like Presto are good at delegating the work of scanning that data to gazillions of cluster nodes, making it seem cheaper than it actually is.
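
To make the "index once" point concrete, here's a minimal Lucene sketch (my own illustration, assuming Lucene 8+ on the classpath; ByteBuffersDirectory stands in for real storage). The analysis cost is paid once in addDocument, and every later query just walks the prebuilt inverted index instead of scanning the data.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class IndexOnce {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Pay the indexing (tokenization/analysis) cost once, at write time.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("body", "in the beginning was lucene", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Every subsequent query reuses that work via the inverted index.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("lucene"), 10);
                System.out.println("hits: " + hits.totalHits);
            }
        }
    }
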


Have you looked at Presto [1]? I'm currently using it for just that reason. Our Elasticsearch instances contain only recent data, which eventually expires but continues to live in S3. Presto can search across both, and more.

1. https://prestodb.io/


Look up ChaosSearch. They do this.


Oh yeah, I forgot about them. Thanks for the pointer! Still looking for something OSS though :/


Fair enough. They are the closest I know of, outside of UltraWarm from AWS.


ChaosSearch is pretty rad, but it lacks some aggregation and integration support vs vanilla Elastic (last I checked).

Elastic is in beta with its own warm, cold, and frozen data tier system, including "Searchable Snapshots" that work with plain object stores to provide a seamless query experience with lower costs for large volumes of data: https://www.elastic.co/blog/introducing-elasticsearch-search...


Isn't that basically what Citus does for Postgres?


Postgres would give us actual Boolean logic whenever we used Boolean operators:

https://lucidworks.com/post/why-not-and-or-and-not/


Did you look at ZomboDB? It allows you to use Elasticsearch as a native Postgres index type.

https://github.com/zombodb/zombodb



