Is this similar to Apache Drill[0]/Google Dremel[1] (BigQuery[2])? One difference is that it seems to be able to do mutations to data, not just simple appends.
Yes, it is similar to BigQuery with a couple of differences, as you pointed out. The big ones are that it's fully open source and self-hostable.
It's less similar to Apache Drill - Drill is "only" a query engine and doesn't handle the actual data storage. EventQL combines a bigtable-like storage engine (optimized for the analytics use case) with a dremel/bigquery-like query engine.
Looks great! I've just gotten a single instance up and running. Really simple to set up. It seems to be almost exactly what we've been looking for!
Some background: we have been evaluating different time series databases for use with sensor readings (we probably need to ingest something on the order of 100,000-200,000 samples/s per cluster). The data is from physical sensors measuring temperature, level, pressure, etc., where every sample is in the form tag(string)|datetime(ms)|value(decimal).
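To make the shape of the data concrete, a single sample could be represented roughly like this (the field names are just placeholders for illustration):

    from decimal import Decimal
    from typing import NamedTuple

    class SensorSample(NamedTuple):
        tag: str           # sensor identifier, e.g. "site1/boiler3/temperature"
        timestamp_ms: int  # reading time, milliseconds since epoch
        value: Decimal     # the measured value

    sample = SensorSample(tag="site1/boiler3/temperature",
                          timestamp_ms=1466380800000,
                          value=Decimal("73.4"))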
Have you done any comparisons to other similar software like InfluxDB, Cassandra, etc.? Especially ingestion rate and disk usage.
What kind of pricing can we expect on Managed Hosting?
We are currently leaning towards InfluxDB, but the cluster licensing stuff they are doing really made us think twice.
At my work we use OSI PI, which is an enterprise historian for storing this kind of physical sensor data. It has very good support for time series logging and integrates well with control systems (Citect, ABB, etc.) as well as our LIMS system.
It's not free software so that might be a deal breaker for you.
Yeah, that is one of the biggest issues besides the licensing stuff. But RAM is pretty cheap and it seems 200,000 different series/tags fit easily in 32 GB.
Why EventQL over other distributed columnar-storage based databases like Redshift, Vertica, and Citus, or Hadoop-based ones like Impala and Presto? What does EventQL do better?
Why not build a bigtable-cassandra fusion row-store too?
Since updates are async and all nodes are the same (cassandra), while the data model is sorted-by-primary-key (bigtable), the schema is fixed (low cost in storing tuples), and SQL is available (easier for devs).
How can we import a CSV dataset? The HTTP API for insert returns this error message: expected JSON_OBJECT_BEGIN, got: JSON_OBJECT_BEGIN. The GROUP BY query on the homepage scans 1.8B rows and only takes 1.5 seconds, which is great, but how many nodes were used in that setup?
We do have a CSV import util (the API expects JSON), but it's not in the current distribution/release build. I'll add it and update this comment once it's live.
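In the meantime, a rough Python sketch along these lines should work against the JSON insert API. The port, endpoint path, and payload shape below are assumptions for illustration, so please double-check them against the HTTP API docs:

    import csv, json, urllib.request

    # Assumed endpoint and payload shape -- verify against the EventQL HTTP API docs.
    INSERT_URL = "http://localhost:9175/api/v1/tables/insert"

    def import_csv(path, table):
        with open(path, newline="") as f:
            payload = [{"table": table,
                        "data": {"tag": row["tag"],
                                 "time": int(row["time"]),
                                 "value": float(row["value"])}}
                       for row in csv.DictReader(f)]
        req = urllib.request.Request(INSERT_URL,
                                     data=json.dumps(payload).encode("utf-8"),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return resp.status

    print(import_csv("readings.csv", "sensor_readings"))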
Queries are mostly limited by IO if running on regular hard disks. The number of rows/second mainly depends on the number and types of columns that are accessed. For example, if we scan 1.8B rows and only load a single integer column from disk (and the integers are small), we only have to load about 1 byte per row from disk (using an idealized model that excludes some overheads for illustration purposes). If we want to complete the query in 1.5 seconds, that works out to a total IO load of roughly 1144 MB/s. So (depending on disk speed) around ~15 machines would suffice.
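Spelling out the same back-of-the-envelope calculation (the per-disk throughput used here is an assumption, not a measured number):

    rows = 1_800_000_000      # rows scanned
    bytes_per_row = 1         # one small integer column, idealized
    seconds = 1.5             # target query time

    total_mb_per_s = rows * bytes_per_row / seconds / 2**20
    print(round(total_mb_per_s))       # ~1144 MB/s aggregate IO, as above

    disk_mb_per_s = 80                 # assumed sequential read speed of one spinning disk
    print(round(total_mb_per_s / disk_mb_per_s))   # ~14, i.e. around 15 machines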
[0] https://drill.apache.org/
[1] http://research.google.com/pubs/pub36632.html
[2] https://cloud.google.com/bigquery/