Hacker News | acehyzer's comments

Coincidentally, it is also a short name for the home assistant software in English.


Does not support net neutrality. Game over for me.


But does support undoing NSA surveillance. If I'm going to pick tech issues to vote on, I care more about that one than net neutrality.


Elasticsearch is awesome. It may be a good idea to use the bulk API that is built into Elasticsearch, use some joins in your SQL query, and index more than one record at a time. In my implementation, I batched my query to 50,000 records at a time and then indexed each batch into Elasticsearch. For the 2.7 million records I indexed this week, it took a total of 54 queries to the database (50,000 records returned at a time). Just one more idea to streamline your indexing without slamming your DB quite so hard.
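For illustration, the batching logic might look something like this (a minimal Python sketch with made-up names; the real implementation paged via SQL joins and the .NET client):

```python
# Hypothetical sketch of the batching described above. Each batch would
# be one query to the database and one bulk request to Elasticsearch,
# instead of one round trip per record.

def chunked(items, size):
    """Yield successive batches of `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

record_ids = list(range(2_700_000))          # stand-in for the DB rows
batches = list(chunked(record_ids, 50_000))
len(batches)  # 2.7M records / 50k per batch -> 54 round trips to the DB
```

Each batch would then be handed to the bulk API in one request rather than indexed record by record.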


FWIW the perf gains of doing large bulks are usually not worth the risk of blowing up your memory usage. Bulks in the hundreds are usually just as efficient as bulks in the thousands (YMMV), and don't carry the additional memory risk.


Yeah, I don't disagree with you there. I tested the memory consumption on one chunk before I went whole hog and did the whole data set. You certainly don't need to do 50k at a time, but it might be worth pulling at least 1k at a time and indexing (depending on how large your objects are, of course). Mostly I was implying that in most cases it shouldn't take 4 queries to the DB per record; you can likely set things up to pull a decently large number of records at a time and then pass them to the bulk API, so the number of network requests needed, both for the round trip to the DB and for the trip to Elasticsearch, is significantly reduced.


I started using elasticsearch recently and I was wondering, does the indexing happen in real time during the index request? How do you know how long the indexing process takes?


It happens pretty much in real time. When using cURL or the .NET client, the call doesn't return a response until the indexing is finished. You can verify this by querying your index for some basic info. For example, you could even just put the following in the browser:

http://hostname:9200/_cat/indices

The 6th column gives you the number of records indexed. Send your request and refresh the browser.


There's an index.refresh_interval setting. It defaults to 1s, so newly indexed data becomes available for querying within about one second.


In general, yes. But keep in mind that the Elasticsearch JVM GC could fire up right after the document is indexed and possibly run for a few seconds if there is a lot of memory pressure. When the GC is done, Elasticsearch will continue to process queries, but the indexed data may not have been refreshed yet. So a query run "1 second" after the index operation may not in fact return the document. However, this would be a very rare case.


I get that this reduces the load on the SQL DB but does it make a measurable difference on the ES side of things?

Edit: reason I ask is we index about 35m records per hour.


The bulk index API is very fast. I initially tried converting all the records into 7 separate JSON files with ~400k records each and passed them to Elasticsearch using cURL and the bulk API. Once a file finished uploading to the server where Elasticsearch was hosted, indexing its 400k records only took about 20 seconds on a pretty trivial dev VM. It's about the same using the .NET client; the majority of the time is spent on the round trip to the DB and on organizing the data the way you want before passing it to the bulk API. The indexing itself is super fast once you've got that all together. I don't know for sure whether it is actually optimized on the ES side of things, but ES definitely isn't a bottleneck from what I've observed.

Edit: According to the official ES documentation: "The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed."

They're fairly vague in their documentation about what's going on under the hood. I'm sure you could go dig through the source and find out more about whether it would really speed things up or not for a specific use case.


Hey, ES dev here. :)

The bulk API mainly helps by amortizing various overheads across many documents at once. There isn't anything "special" about bulk indexing... it doesn't side-step the normal indexing pathway to dump the data directly onto disk, for example.

What happens is that the bulk request arrives at a node (which we'll call the coordinating node, since it is coordinating this request). The coordinator will then create "mini-bulk" requests which are fired off to the individual nodes that need to perform the actual indexing operations. If your bulk touches five different shards around the cluster, the coordinating node will construct five mini-bulks.

Bulk requests have an (annoying) newline-delimited format. This allows the coordinator to extract up to the first newline, parse the JSON, and identify the action (index, delete, update) and destination (index/type/id). Since the actual document is irrelevant to the coordinator, it can slice the buffer up to the next newline straight into the "mini-bulk" for the destination node. This allows the coordinator to avoid parsing the document's JSON, which could be quite large (think 15mb PDFs, or docs with 5k fields).
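That newline-delimited format can be sketched in a few lines of Python (the index name and documents here are made up for illustration):

```python
import json

docs = [{"_id": 1, "title": "foo"}, {"_id": 2, "title": "bar"}]

lines = []
for d in docs:
    doc = dict(d)
    # Action line: tells the coordinator what to do and where to route it.
    action = {"index": {"_index": "myindex", "_id": doc.pop("_id")}}
    lines.append(json.dumps(action))
    # Source line: the document itself, which the coordinator never parses.
    lines.append(json.dumps(doc))

body = "\n".join(lines) + "\n"  # a bulk body must end with a newline
```

The coordinator only ever parses the action lines; the source lines are sliced through as opaque bytes.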

Once the coordinator is done assembling the mini-bulks, it sends them to the various nodes around the cluster and waits for responses to come back. Finally, the coordinator merges the responses and sends a response to the client.

So a bulk request reduces overhead by amortizing the two network hops (one to the coordinator, one to the node) over thousands of documents. There are also side benefits like allowing concurrent bulks to share translog fsyncs, giving Lucene larger indexing buffers to work with (which helps create larger segments, for less churn), less context switching, etc.

Generally, smaller documents gain more benefit from bulk, since more of the time is dominated by overhead. Super large documents, or documents with very complex analysis chains, are less likely to see a huge improvement, since the latency there is dominated by IO or CPU. But you'll still see an improvement due to fsyncs, context switching, giving Lucene batches, etc.

Hope that helps! Lemme know if you have questions :)


Very helpful! Thanks a lot! So it sounds like in our case with lots of small documents going into ES, we could reduce IO overhead with some batching.


Do you mind sharing how long it took for the indexing portion?


It took me about 20 minutes to index the 2.7M records. It was pulling from 3 tables using joins, and using the .NET NEST client.


If you are using MSSQL (or another RDBMS with XML support), I highly suggest XML subqueries instead of joins. You can actually return a whole object graph in one row, which is impossible with traditional SQL joins. It is not too complicated either.
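To illustrate the idea in Python rather than T-SQL: the XML string below stands in for what a hypothetical FOR XML subquery would return in a single column, letting one row carry a parent record plus its child collection with no join fan-out.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-row result: parent columns, plus the child table
# serialized into one XML column by the subquery. With a plain join,
# the parent columns would be duplicated once per child row instead.
row = {
    "id": 1,
    "name": "Widget",
    "tags_xml": "<tags><tag>red</tag><tag>sale</tag></tags>",
}

# Rebuild the nested object graph from the one flat row.
doc = {
    "id": row["id"],
    "name": row["name"],
    "tags": [t.text for t in ET.fromstring(row["tags_xml"])],
}
```

The resulting `doc` is already in the nested shape Elasticsearch expects, with no post-join deduplication step.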


If I put this into my company's tests, we'd end up with no users... I have a lot of work ahead of me. :/


Yeah, the other exploit strings do innocuous stuff like putting up javascript alerts or touching files, but the SQL injection ones aren't innocuous at all. I wonder if there's something better to replace those with. Something like `1'; CREATE TABLE blns ...--` would be more akin to what the shell exploits do.

