Distributed count(distinct) with HyperLogLog on Postgres (citusdata.com)
80 points by anarazel on April 6, 2017 | 14 comments



I still think the best explanation of HyperLogLog is the one by antirez (the Redis developer) here: http://antirez.com/news/75


Great explanation of HLL in here - I hadn't fully understood how you can combine HLLs, which is key to seeing why they help distribute count(distinct) over multiple shards without needing to copy vast amounts of data around to check for uniqueness across them.
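To make the merge property concrete, here's a rough sketch using the postgresql-hll extension the article builds on (the tables events_shard_1 and per_shard_sketches are hypothetical):

    -- one sketch per shard; hll_add_agg folds hashed values into an HLL
    SELECT hll_add_agg(hll_hash_integer(user_id)) AS sketch
    FROM events_shard_1;

    -- merging two HLLs is just a register-wise max, so you can union
    -- the per-shard sketches and extract one estimate without ever
    -- moving the raw rows between shards
    SELECT hll_cardinality(hll_union_agg(sketch))
    FROM per_shard_sketches;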


Is citus + cstore_fdw a good choice for space-efficient archiving and occasionally querying large amounts of (high cardinality) log and (lower cardinality) time series data?

ElasticSearch works great, but it's unsuitable for long-term storage for performance and storage reasons.

Great writeup by the 18F guys on this topic: https://github.com/18F/api.data.gov/issues/235

Unfortunately, they did not really consider Citus and they seem to have lots of issues with the solution they chose (Kylin).


TLDR: This depends on your requirements. We find that users with similar requirements follow one of two approaches:

(1) You keep raw data for a certain time period and regularly roll it up (rough sketch below): https://www.citusdata.com/blog/2016/11/29/event-aggregation-...

(2) Citus + cstore_fdw if you're cool with doing some manual work. In particular, our customers who follow this approach shard on one column (automatic) and then manually partition by time. We're looking to leverage upcoming partitioning improvements in Postgres 10 and further automate this setup: https://github.com/citusdata/citus/issues/183
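As a rough sketch of approach (1), with hypothetical table and column names (a real pipeline would also track which rows have already been rolled up):

    -- roll one minute of raw events up into a single row, keeping an
    -- HLL sketch so count(distinct) still works after the rollup
    INSERT INTO events_rollup_1min (minute, event_count, user_sketch)
    SELECT date_trunc('minute', created_at) AS minute,
           count(*),
           hll_add_agg(hll_hash_bigint(user_id))
    FROM events_raw
    WHERE created_at >= now() - interval '1 minute'
    GROUP BY minute;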

For more on Citus + cstore, this recent HN comment could also be helpful: https://news.ycombinator.com/item?id=14039001

Happy to chat more if you drop us a line. How much data did you have in mind? Are you looking to power a B2B or B2C application?


Internal analytics, may accidentally become a B2B at some point (who knows), but for now it's non-business-critical with no availability requirements. About 30+ GB per day, but that's uncompressed and unprocessed.

Thanks for the pointers! Manual sharding sounds like an option.


Since you asked about long term storage and occasional queries, have you considered offloading to the cloud?

(Disclosure: I work for Google)

With BigQuery you can:

- Store as much relational data as you want ($20 per terabyte per month, dropping to $10/TB for long-term storage).

- Query it whenever, and get charged only by query (no servers needed)

- To stay on topic, use HyperLogLog:

https://cloud.google.com/bigquery/docs/reference/standard-sq...

Note that BigQuery exposes partial HLL functions like HLL_COUNT.INIT, HLL_COUNT.MERGE, HLL_COUNT.MERGE_PARTIAL, and HLL_COUNT.EXTRACT.

"These functions operate on sketches that compress an arbitrary set into a fixed-memory representation. The sketches can be merged to produce a new sketch that represents the union, before a final estimate is extracted from the sketch."


For now we're trying to do it "on premise" since we're a large-ish service provider with operations and hardware teams. As long as the operational burden isn't too high, we do it ourselves to gain experience with different technologies.

However, BigQuery is really tempting since it'd probably solve all of the issues we're currently having. Will definitely do some benchmarking. The pricing is hard to beat.


BigQuery is ALWAYS the right choice :)

(a very biased Googler perspective)

In all seriousness, BigQuery is amazing - hit me up if you have any questions!


I think ClickHouse (an open-source distributed column-oriented DBMS) would fit your needs better.


Thanks, wasn't aware of that one. After a cursory look, it sounds like a better Cassandra?


No. ClickHouse is an actual column store. Cassandra just stores every column of a row as a separate row. Cassandra is for OLTP; column stores are for OLAP.


Ah, I see. Thanks! Seems like it outperforms all other column stores, too. The feature list sounds amazing.

Will definitely include it in the next benchmarking round.


HLL is great for a lot of reasons. The article gives the primary one as getting uniques from randomly sharded data in a distributed system.

If your distributed system lets you do hash- or range-based sharding, for example by user_id, then you can do an exact count(distinct user_id) across the system without reshuffling the data, since all the data for a particular user lives on the same node.


(Ozgun from Citus Data)

Yup, good point. This example shards the github_events table on user_id and then shows running count(distinct user_id). Since the sharding column and the count(distinct) column are the same, Citus can push the count(distinct user_id) down to each shard and then sum up the results from those shards. In this case, reshuffles don't come into the picture.

HLLs would be most useful if the user then issued, say, count(distinct repo_id). Citus would then ask for HLL sketches from each shard and merge them on the coordinator node.
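Concretely, here's a rough sketch of both cases (create_distributed_table is our standard API; citus.count_distinct_error_rate is the setting that enables the HLL path, assuming the hll extension is installed on all nodes):

    -- shard github_events on user_id
    SELECT create_distributed_table('github_events', 'user_id');

    -- exact: the distinct column matches the shard column, so each
    -- shard counts its own users and the coordinator sums the counts
    SELECT count(DISTINCT user_id) FROM github_events;

    -- approximate: repo_id is not the shard column, so Citus pulls
    -- an HLL sketch from each shard and merges them on the coordinator
    SET citus.count_distinct_error_rate TO 0.005;
    SELECT count(DISTINCT repo_id) FROM github_events;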



