Is this similar to Apache Drill[0]/Google Dremel[1] (BigQuery[2])? One difference is that it seems to be able to do mutations to data, not just simple appends.
Yes, it is similar to BigQuery with a couple of differences, as you pointed out. The big ones are that it's fully open source and self-hostable.
It's less similar to Apache Drill - Drill is "only" a query engine and doesn't handle the actual data storage. EventQL combines a bigtable-like storage engine (optimized for the analytics use case) with a dremel/bigquery-like query engine.
Looks great! I've just gotten a single instance up and running. Really simple to set up. It seems to be almost exactly what we've been looking for!
Some background: we have been evaluating different time series databases for use with sensor readings (we probably need to ingest something on the order of 100,000-200,000 samples/s per cluster). The data is from physical sensors measuring temperature, level, pressure, etc., where every sample is in the form tag(string)|datetime(ms)|value(decimal).
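To make the shape of the data concrete, a single sample could be represented roughly like this (the field names are just placeholders for illustration):

    from decimal import Decimal
    from typing import NamedTuple

    class SensorSample(NamedTuple):
        tag: str           # sensor identifier, e.g. "site1/boiler3/temperature"
        timestamp_ms: int  # reading time, milliseconds since epoch
        value: Decimal     # the measured value

    sample = SensorSample(tag="site1/boiler3/temperature",
                          timestamp_ms=1466380800000,
                          value=Decimal("73.4"))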
Have you done any comparisons to other similar software like InfluxDB, Cassandra, etc.? Especially ingestion rate and disk usage.
What kind of pricing can we expect on Managed Hosting?
We are currently leaning towards InfluxDB, but the cluster licensing stuff they are doing really made us think twice.
At my work we use OSI PI, which is an enterprise historian for storing this kind of physical sensor data. It has very good support for time series logging and integrates well with control systems (Citect, ABB, etc.) as well as our LIMS system.
It's not free software so that might be a deal breaker for you.
Yeah, that is one of the biggest issues besides the licensing stuff. But RAM is pretty cheap and it seems 200,000 different series/tags fit easily in 32 GB.
Why EventQL over other distributed columnar-storage based databases like Redshift, Vertica, and Citus, or Hadoop-based ones like Impala and Presto? What does EventQL do better?
Why not build a bigtable-cassandra fusion row-store too?
Since updates are async and all nodes are the same (cassandra), while the data model is sorted-by-primary-key (bigtable), the schema is fixed (low cost in storing tuples), and SQL is available (easier for devs).
How can we import a CSV dataset? The HTTP API for insert returns this error message: expected JSON_OBJECT_BEGIN, got: JSON_OBJECT_BEGIN. The GROUP BY query on the homepage scans 1.8B rows and only takes 1.5 seconds, which is great, but how many nodes were used in that setup?
We do have a CSV import util (the API expects JSON), but it's not in the current distribution/release build. I'll add it and update this comment once it's live.
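In the meantime, a rough Python sketch along these lines should work against the JSON insert API. The port, endpoint path, and payload shape below are assumptions for illustration, so please double-check them against the HTTP API docs:

    import csv, json, urllib.request

    # Assumed endpoint and payload shape -- verify against the EventQL HTTP API docs.
    INSERT_URL = "http://localhost:9175/api/v1/tables/insert"

    def import_csv(path, table):
        with open(path, newline="") as f:
            payload = [{"table": table,
                        "data": {"tag": row["tag"],
                                 "time": int(row["time"]),
                                 "value": float(row["value"])}}
                       for row in csv.DictReader(f)]
        req = urllib.request.Request(INSERT_URL,
                                     data=json.dumps(payload).encode("utf-8"),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return resp.status

    print(import_csv("readings.csv", "sensor_readings"))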
Queries are mostly limited by IO if running on regular hard disks. The number of rows/second mainly depends on the number and types of columns that are accessed. For example, if we scan 1.8B rows and only load a single integer column from disk (and the integers are small), we only have to load about 1 byte per row from disk (using an idealized model that excludes some overheads for illustration purposes). If we want to complete the query in 1.5 seconds, that works out to a total IO load of roughly 1144 MB/s. So (depending on disk speed) around ~15 machines would suffice.
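Spelling out the same back-of-the-envelope calculation (the per-disk throughput used here is an assumption, not a measured number):

    rows = 1_800_000_000      # rows scanned
    bytes_per_row = 1         # one small integer column, idealized
    seconds = 1.5             # target query time

    total_mb_per_s = rows * bytes_per_row / seconds / 2**20
    print(round(total_mb_per_s))       # ~1144 MB/s aggregate IO, as above

    disk_mb_per_s = 80                 # assumed sequential read speed of one spinning disk
    print(round(total_mb_per_s / disk_mb_per_s))   # ~14, i.e. around 15 machines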
[0] https://drill.apache.org/
[1] http://research.google.com/pubs/pub36632.html
[2] https://cloud.google.com/bigquery/