Great explanation of HLL in here - I hadn't fully understood how you can combine HLLs together, which is key to understanding why they can help distribute count distinct over multiple shards without needing to copy vast amounts of data around to check for uniqueness across multiple shards.
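The mergeability falls out of the sketch structure: an HLL is just an array of registers, and the union of two sketches is the element-wise max, so shards only ever ship a few KB of registers instead of raw values. A minimal Python sketch of the idea (precision, hash function, and bias constants here are illustrative choices, not any particular implementation):

```python
import hashlib
import math

P = 14                       # precision bits -> 2**14 = 16384 registers
M = 1 << P

def hll_sketch(values):
    """Build a sketch: each register keeps the max 'rank' seen in its bucket."""
    regs = [0] * M
    for v in values:
        h = int(hashlib.sha1(str(v).encode()).hexdigest()[:16], 16)  # 64-bit hash
        idx = h & (M - 1)                     # low P bits pick the register
        w = h >> P                            # remaining 64-P bits
        rank = (64 - P) - w.bit_length() + 1  # leading zeros + 1
        regs[idx] = max(regs[idx], rank)
    return regs

def hll_merge(a, b):
    """Union of two sketches is just the element-wise max of registers."""
    return [max(x, y) for x, y in zip(a, b)]

def hll_estimate(regs):
    """Standard HLL estimator with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw
```

Because max is associative and commutative, merging the per-shard sketches gives exactly the sketch you would have built from the combined data, which is why no cross-shard deduplication is needed.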
Is citus + cstore_fdw a good choice for space-efficient archiving and occasionally querying large amounts of (high cardinality) log and (lower cardinality) time series data?
Elasticsearch works great but is unsuitable for long-term storage for performance and storage reasons.
(2) Citus + cstore_fdw if you're cool with doing some manual work. In particular, our customers who follow this approach shard on one column (automatic) and then manually partition by time. We're looking to leverage upcoming partitioning improvements in Postgres 10 and further automate this setup: https://github.com/citusdata/citus/issues/183
Internal analytics; it may accidentally become a B2B product at some point (who knows), but for now it's non-business-critical with no availability requirements. About 30+ GB per day, but that's uncompressed and unprocessed.
Thanks for the pointers! Manual sharding sounds like an option.
Note that BigQuery exposes partial HLL functions like HLL_COUNT.INIT, HLL_COUNT.MERGE, HLL_COUNT.MERGE_PARTIAL, and HLL_COUNT.EXTRACT.
"These functions operate on sketches that compress an arbitrary set into a fixed-memory representation. The sketches can be merged to produce a new sketch that represents the union, before a final estimate is extracted from the sketch."
For now we're trying to do it "on premise" since we're a large-ish service provider with operations and hardware teams. As long as the operational burden isn't too high, we do it ourselves to gain experience with different technologies.
However, BigQuery is really tempting since it'd probably solve all of the issues we're currently having. Will definitely do some benchmarking. The pricing is hard to beat.
No. ClickHouse is an actual column store. Cassandra just stores every column of a row as a separate row. Cassandra is for OLTP, column stores are for OLAP.
HLL is great for a lot of reasons. He gives the primary reason as getting uniques from randomly sharded data in a distributed system.
If your distributed system supports hash- or range-based sharding, for example by user_id, then you can compute an exact count(distinct user_id) across the system without reshuffling any data, knowing that all the rows for a particular user live on the same node.
Yup, good point. This example shards the github_events table on user_id and then shows running count(distinct user_id). Since the sharding column and the count(distinct) column are the same, Citus can push down the count(distinct user_id) to each shard and then sum up the results from those shards. In this case, reshuffles don't come into the picture.
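To make the pushdown concrete, here's a toy simulation in Python (the shard count, hash choice, and event data are made up for illustration, not Citus internals): because rows are hash-partitioned on user_id, each user's rows live on exactly one shard, so the per-shard distinct sets are disjoint and summing their sizes is exact.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key):
    # Stable hash so the same user_id always lands on the same shard
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NUM_SHARDS

# Hypothetical (user_id, repo_id) event rows; users 0-499 appear twice
events = [(u, u % 7) for u in range(1000)] + [(u, 3) for u in range(500)]

shards = [[] for _ in range(NUM_SHARDS)]
for user_id, repo_id in events:
    shards[shard_for(user_id)].append((user_id, repo_id))

# count(distinct user_id): each shard counts its own users, the coordinator sums.
# Exact, because hash-sharding on user_id makes the per-shard user sets disjoint.
per_shard = [len({u for u, _ in s}) for s in shards]
assert sum(per_shard) == len({u for u, _ in events})  # both are 1000
```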
In this example, HLLs would be most useful if the user then issued count(distinct repo_id) for example. Citus would then ask for HLL sketches from each shard and add them up on the coordinator node.
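Conversely, for a column that isn't the shard key, summing per-shard distinct counts overcounts, which is exactly where the sketches come in. A toy Python illustration (sets stand in for HLL sketches here, and the shard count and data are made up):

```python
import hashlib

NUM_SHARDS = 4
# Hypothetical rows: 1000 users spread across 7 repos
events = [(u, u % 7) for u in range(1000)]

def shard_for(key):
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NUM_SHARDS

shards = [[] for _ in range(NUM_SHARDS)]
for user_id, repo_id in events:
    shards[shard_for(user_id)].append(repo_id)

# Sharded on user_id, so the same repo_id shows up on many shards:
# summing per-shard distinct repo counts overcounts.
naive = sum(len(set(s)) for s in shards)

# Merging per-shard sketches gives the cardinality of the union instead;
# with real HLLs this costs a few KB per shard rather than the full value sets.
merged = len(set().union(*(set(s) for s in shards)))
assert merged == 7 and naive > merged
```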