How we get high availability with Elasticsearch and Ruby on Rails (gsa.gov)
69 points by konklone on April 8, 2016 | 29 comments



I have no idea why the phrase "high availability" is in this post.


>We call these extensions "high availability" because this approach means that re-indexing a production system can happen much faster, reducing downtime for our users.

Agree with their use of the term or not, they give you their reasoning at the end of the article.


That's crazy misleading. This is just a post saying, "hey, this is a way to sync data faster." Awesome! Much kudos.

But stale data isn't "downtime." This is tech marketing at like, MongoDB level.


Except it's not marketing from the vendor in question. This is a page by the US government.


I think it's just an honest misunderstanding of what the term "high availability" means, in general tech usage.


"27 reports per second" what?

I use the bulk API with the .NET NEST client.

I can easily index 1,000 documents per second (each of which also includes 3-10 nested documents) on a 4-core i5 machine. Serialization is the cheapest operation in my case. I would blame Ruby in your workflow.


It seriously depends on the structure of the documents and your analyzer setup. I agree that it should be in the hundreds of records per second, though.


Elasticsearch is awesome. It may be a good idea to use the bulk API that is built into Elasticsearch, use some joins in your SQL query, and index more than one record at a time. In my implementation, I batched my query to 50,000 records at a time and then indexed those into Elasticsearch. For the 2.7 million records I indexed this week, it took a total of 54 queries to the database (50,000 records returned at a time). Just one more idea to streamline your indexing without slamming your DB quite so hard.
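A minimal sketch of that pattern, assuming the official elasticsearch-py client and a hypothetical fetch_batch helper on the database side (host, index name, and batch size are all illustrative):

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")
    BATCH = 50_000  # rows pulled from the DB per query

    def fetch_batch(offset, limit):
        # Hypothetical: one SQL query (with joins) returning up to `limit` rows as dicts.
        ...

    offset = 0
    while True:
        rows = fetch_batch(offset, BATCH)
        if not rows:
            break
        actions = ({"_index": "reports", "_id": r["id"], "_source": r} for r in rows)
        helpers.bulk(es, actions)  # splits the actions into chunks, one _bulk call per chunk
        offset += BATCH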


FWIW the perf gains of doing large bulks are usually not worth the risk of blowing up your memory usage. Bulks in the hundreds are usually just as efficient as bulks in the thousands (YMMV), and don't carry the additional memory risk.


Yeah, I don't disagree with you there. I tested the memory consumption on one chunk before I went whole hog and did the whole data set. You certainly don't need to do 50k at a time, but it might be worth pulling at least 1k at a time and indexing those (depending on how large your objects are, of course). Mostly I was implying that in most cases it shouldn't take 4 queries to the DB per record; you can likely set things up to pull a decently large number of records at a time and then pass them with the bulk API, so the number of network requests needed, both for the round trip to the DB and for the trip to Elastic, is significantly reduced.
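If memory is the worry, the chunk size can be capped explicitly. A sketch using elasticsearch-py's streaming_bulk (the index name, documents, and chunk size are illustrative):

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def actions():
        for i in range(100_000):
            yield {"_index": "reports", "_id": i, "_source": {"n": i}}

    # Bulks of ~1,000 actions; the generator is consumed lazily, so the
    # full data set is never held in memory at once.
    for ok, item in helpers.streaming_bulk(es, actions(), chunk_size=1000):
        if not ok:
            print("failed:", item)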


I started using elasticsearch recently and I was wondering, does the indexing happen in real time during the index request? How do you know how long the indexing process takes?


It happens pretty much in real time. When using cURL or the .NET client, the call doesn't return a response until the indexing is finished. You can verify this by querying your index for some basic info. For example, you could even just put the following in the browser:

http://hostname:9200/_cat/indices

The sixth column gives you the number of documents indexed. Send your request and refresh the browser.
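The same check from the Python client, if that's your stack (host and index name are assumptions):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    print(es.cat.indices(v=True))     # same table as the _cat/indices URL
    print(es.count(index="reports"))  # exact document count for one index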


There's an index.refresh_interval setting. It defaults to 1s, so by default your data will be available for querying within one second after being indexed.
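That setting can be changed per index; a sketch with the Python client (index name is illustrative):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    # Shorter interval = fresher searches; longer (or -1) = faster bulk ingest.
    es.indices.put_settings(index="reports", body={"index": {"refresh_interval": "1s"}})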


In general, yes. But keep in mind that the Elasticsearch JVM GC could fire up right after the document is indexed and possibly run for a few seconds if there is a lot of memory pressure. When the GC is done, Elasticsearch will continue to process queries, but it may be that the indexed data hasn't been refreshed yet. So a query run "1 second" after the index operation may not in fact return the document. However, this would be a very rare case.
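If you need a read-your-own-writes guarantee at a specific point, you can force a refresh rather than waiting out the interval. A sketch with the Python client (index name and document are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.index(index="reports", id="1", body={"status": "done"})
    es.indices.refresh(index="reports")  # make the doc searchable immediately
    es.search(index="reports", body={"query": {"match": {"status": "done"}}})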


I get that this reduces the load on the SQL DB but does it make a measurable difference on the ES side of things?

Edit: reason I ask is we index about 35m records per hour.


The bulk index API is very fast. I initially tried converting all the records into 7 separate JSON files with ~400k records each and passing them to Elasticsearch using cURL and the bulk API. Once a file finished uploading to the server where Elastic was hosted, indexing 400k records only took about 20 seconds on a pretty trivial dev VM. It's about the same using the .NET client; the majority of the time is spent on the round trip to the DB and organizing the data the way you want before passing it to the bulk API. The indexing is super fast once you've got that all together. I don't know for sure whether it is actually optimized on the ES side of things, but ES definitely isn't a bottleneck from what I've observed.

Edit: According to the official ES documentation: "The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed."

They're fairly vague in their documentation about what's going on under the hood. I'm sure you could go dig through the source and find out more about whether it would really speed things up or not for a specific use case.
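For reference, shipping a pre-built bulk file like that takes only a few lines; a sketch using Python's requests library (file name and host are assumptions):

    import requests

    # reports.ndjson holds alternating action and document lines, newline-terminated.
    with open("reports.ndjson", "rb") as f:
        r = requests.post(
            "http://localhost:9200/_bulk",
            data=f,
            headers={"Content-Type": "application/x-ndjson"},
        )
    print(r.json()["errors"])  # False if every action succeeded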


Hey, ES dev here. :)

The bulk API mainly helps by amortizing various overheads across many documents at once. There isn't anything "special" about the bulk indexing...it doesn't side-step the normal indexing pathway to dump the data directly onto disk, for example.

What happens is that the bulk request arrives at a node (which we'll call the coordinating node, since it is coordinating this request). The coordinator will then create "mini-bulk" requests which are fired off to the individual nodes that need to perform the actual indexing operations. If your bulk touches five different shards around the cluster, the coordinating node will construct five mini-bulks.

Bulk requests have an (annoying) newline-delimited format. This allows the coordinator to read up to the first newline, parse that JSON, and identify the action (index, delete, update) and destination (index/type/id). Since the actual document is irrelevant to the coordinator, it can slice the buffer up to the next newline straight into the "mini-bulk" for the destination node. This lets the coordinator avoid parsing the document's JSON, which could be quite large (think 15 MB PDFs, or docs with 5k fields).
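For anyone who hasn't seen it, the body of a bulk request looks like this, built here as a Python string (index and field names are invented):

    # Action metadata on one line, then the document on the next; the trailing
    # newline is required. The coordinator only parses the metadata lines.
    bulk_body = (
        '{"index": {"_index": "reports", "_id": "1"}}\n'
        '{"title": "first doc"}\n'
        '{"delete": {"_index": "reports", "_id": "2"}}\n'
    )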

Once the coordinator is done assembling the mini-bulks, it sends them to the various nodes around the cluster and waits for responses to come back. Finally, the coordinator merges the responses and sends a response to the client.

So a bulk reduces overhead by amortizing the two network hops (one to the coordinator, one to the node) over thousands of documents. There are also side benefits like allowing concurrent bulks to share translog fsyncs, giving Lucene larger indexing buffers to work with (which helps create larger segments for less churn), less context switching, etc.

Generally, smaller documents gain more benefit from bulk, since more of the time is dominated by overhead. Super large documents, or documents with very complex analysis chains, are less likely to see a huge improvement, since the latency there is dominated by IO or CPU. But you'll still see an improvement due to fsyncs, context switching, giving Lucene batches, etc.

Hope that helps! Lemme know if you have questions :)


Very helpful! Thanks a lot! So it sounds like in our case with lots of small documents going into ES, we could reduce IO overhead with some batching.


Do you mind sharing how long it took for the indexing portion?


It took me about 20 minutes to index 2.7M records. It was pulling from 3 tables using joins, via the .NET NEST client.


If you are using MSSQL (or another RDBMS with XML support), I highly suggest XML subqueries instead of joins. You can actually return the whole object graph in one row, which is impossible with traditional SQL. It is not too complicated either.
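A rough illustration of the idea, assuming SQL Server's FOR XML and the pyodbc driver (table and column names are invented):

    import pyodbc

    # Each outer row carries its child records as an XML fragment,
    # so a single result row holds the whole object graph.
    SQL = """
    SELECT p.id, p.title,
           (SELECT c.body AS body
            FROM comments c
            WHERE c.post_id = p.id
            FOR XML PATH('comment'), TYPE) AS comments_xml
    FROM posts p
    """

    conn = pyodbc.connect("DSN=mydb")  # connection string is a placeholder
    for row in conn.cursor().execute(SQL):
        print(row.id, row.comments_xml)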


I've indexed ~1 million docs a second, but with proper routing could probably even 5x that. Total cluster size was 50 terabytes at the end.


How many machines did you have on the cluster?


100 data nodes

Basically, if you want fast ingress, keep shards small; once they get past ~5-10 GB, ingress slows down significantly. Also, this was on ES 1.5; I have not tested the latest 2.0+ builds.


I assume you are also replicating your nodes... how does replication impact ingress? And what happens when shards exceed 10 GB, do you split them?


If you want the fastest ingress, disable replicas until your ingress is done; it's faster to create the replicas at the end of the ETL for a given index. You also want to disable auto allocation, which prevents shard movement during ingress; re-enable it afterwards.
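Roughly, with the Python client (index name and values are illustrative; both settings get restored once the load finishes):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Before the bulk load: no replicas, no shard reallocation.
    es.indices.put_settings(index="events", body={"index": {"number_of_replicas": 0}})
    es.cluster.put_settings(body={"transient": {"cluster.routing.allocation.enable": "none"}})

    # ... run the ingress ...

    # Afterwards: bring replicas back and let shards move again.
    es.indices.put_settings(index="events", body={"index": {"number_of_replicas": 1}})
    es.cluster.put_settings(body={"transient": {"cluster.routing.allocation.enable": "all"}})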

On a 100-node cluster I had roughly 500 GB on each node. This was not a single index but multiple indexes, with roughly 8 shards per index per node. Shard count is pretty important to get correct.

I did not manually control document routing (it was hard given the type of data I was ingressing), so it was set to auto, and during the load I observed hotspots in the cluster (you have to look at the BULK thread/queue length): some nodes were getting bursts of docs while others were idle. Roughly 40-50% of the nodes in the cluster were underutilized, and maybe 5-10% had hotspots from time to time.

Also, depending on what you use to push data in (I used the ES Hadoop plugin), you have to account for shard segment merges, which briefly pause ingress while segments in a given shard are merged. You have to set the retry count to -1 (infinite) and the retry delay to something like a second or two, otherwise you will end up with dropped documents.
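If I remember the es-hadoop knobs correctly, those are es.batch.write.retry.count and es.batch.write.retry.wait; a PySpark-style sketch (df, the host, and the index name are placeholders):

    # df is an existing Spark DataFrame; setting names per the es-hadoop docs.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "localhost:9200")
       .option("es.batch.write.retry.count", "-1")  # retry rejected bulk actions indefinitely
       .option("es.batch.write.retry.wait", "2s")   # pause between retries
       .save("events/doc"))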


This is brilliant! If you have your ES and Hadoop config posted somewhere, that would be awesome.


Is this sort of parallelism doable with Solr as well?


I don't see why it wouldn't be. The main differentiator between Solr and Elasticsearch is that ES handles constant incoming data more consistently, so it's a much better fit for realtime scenarios.

Just batch loading the data one time shouldn't create much of a problem for Solr either.




