
Just wanted to say I am super impressed with the work TimescaleDB has been doing.

Previously at NGINX I was part of a team that built out a sharded timeseries database using Postgres 9.4. When I left, it was ingesting ~2 TB of monitoring data a day (so not super large, but not trivial either).

Currently I have built out a data warehouse using Postgres 11 and Citus. The only reason I didn't use TimescaleDB was the lack of multi-node support in October of last year.
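
For anyone who hasn't used Citus, the core of the setup really is just plain Postgres plus distributing the big tables by a shard key. A rough sketch of what that looks like (the table, columns, and connection details below are made up for illustration, not our actual schema):

    # Rough sketch of the Citus side -- hypothetical table and connection string.
    import psycopg2

    conn = psycopg2.connect("dbname=warehouse host=coordinator")
    conn.autocommit = True
    cur = conn.cursor()

    # Plain Postgres DDL, issued against the Citus coordinator node.
    cur.execute("""
        CREATE TABLE events (
            customer_id bigint      NOT NULL,
            occurred_at timestamptz NOT NULL,
            payload     jsonb
        )
    """)

    # Citus shards the table across worker nodes by customer_id; inserts and
    # queries that filter on customer_id get routed to the right shard.
    cur.execute("SELECT create_distributed_table('events', 'customer_id')")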

I sort of view TimescaleDB as the next evolution of this style of Postgres scaling. I think in a year or so I will be very seriously looking at migrating to TimescaleDB, but for now Citus is adequate (with some rough edges) for our needs.




If you're looking at scaling monitoring timeseries data and want a more availability-leaning architecture (in the CAP theorem sense) with respect to replication (i.e. quorum write/read replication, rather than a leader/follower, active/passive architecture), then you might also want to check out the Apache 2-licensed projects M3 and M3DB at m3db.io.

I am biased obviously as a contributor. Having said that, I think it's always worth understanding active/passive-style replication and its implications, and seeing how other solutions handle this scaling and reliability problem, to better understand the underlying challenges you'll face with instance upgrades, failover, and failures in a cluster.
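
To make the quorum point concrete, here's a toy sketch of the idea (deliberately simplified, and not M3DB's actual implementation): with a replication factor of 3 and majority quorums for both writes and reads, any single node can be down or mid-upgrade and the cluster keeps serving consistent results, with no leader election in the critical path.

    # Toy illustration of quorum replication -- just the shape of the idea.
    # With replication factor RF and quorums chosen so that W + R > RF, every
    # read quorum overlaps every write quorum, so a read sees the latest write
    # even when some replicas are down, and there is no single leader to lose.
    RF = 3
    W = R = RF // 2 + 1  # majority: 2 of 3

    def write(replicas, key, value, version):
        acks = 0
        for r in replicas:          # replicas are hypothetical client stubs
            try:
                r.store(key, value, version)
                acks += 1
            except ConnectionError:
                pass                # a dead replica doesn't fail the write...
        return acks >= W            # ...as long as a quorum acknowledges it

    def read(replicas, key):
        got = []
        for r in replicas:
            try:
                got.append(r.load(key))   # (value, version) pairs
            except ConnectionError:
                pass
        if len(got) < R:
            raise RuntimeError("quorum not reached")
        return max(got, key=lambda v: v[1])[0]  # newest version wins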


Neat! Hadn't heard of M3DB before, but a cursory poke around the docs suggests it's a pretty solid solution/approach.

My current use case isn't monitoring, or even time series anymore, but I'll keep M3DB in mind next time I have to seriously push a time series/monitoring solution.


Did you try ClickHouse? [1]

We were successfully ingesting hundreds of billions of ad-serving events per day into it. It is much faster at query time than any Postgres-based database (for instance, it can scan tens of billions of rows per second on a single node). And it scales to many nodes.

While it is possible to store monitoring data in ClickHouse, it may be non-trivial to set up. So we decided to create VictoriaMetrics [2]. It is built on design ideas from ClickHouse, so it offers high performance in addition to ease of setup and operation. This is borne out by publicly available case studies [3].

[1] https://clickhouse.tech/

[2] https://github.com/VictoriaMetrics/VictoriaMetrics/

[3] https://victoriametrics.github.io/CaseStudies.html
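
To give a feel for the "non-trivial to set up" part: storing monitoring data directly in ClickHouse means designing your own schema, partitioning, and sort key, roughly along these lines (the table below is a made-up illustration, not a recommendation):

    # Hypothetical hand-rolled metrics table in ClickHouse, via the
    # clickhouse-driver Python client. Schema and names are illustrative only.
    from datetime import datetime
    from clickhouse_driver import Client

    client = Client("localhost")  # assumes a ClickHouse server on the default port

    # You pick the engine, partitioning, and sort key yourself, and those
    # choices largely determine which queries will be fast.
    client.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            metric LowCardinality(String),
            host   LowCardinality(String),
            ts     DateTime,
            value  Float64
        )
        ENGINE = MergeTree
        PARTITION BY toYYYYMMDD(ts)
        ORDER BY (metric, host, ts)
    """)

    client.execute(
        "INSERT INTO metrics (metric, host, ts, value) VALUES",
        [("cpu_usage", "web-1", datetime(2020, 1, 1), 42.5)],
    )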


ClickHouse's initial release was circa 2016, IIRC. The work I was doing at NGINX predates ClickHouse's initial release by 1-2 years.

ClickHouse was certainly something we evaluated later on when we were looking at moving to a true columnar storage approach, but like most columnar systems it comes with trade-offs:

* Partial SQL support.

* No transactions (not ACID).

* Certain workloads, like updates and deletes or single-key lookups, are less efficient.

None of these are unique to ClickHouse; they are fairly well-known trade-offs most columnar stores make to improve write throughput and prioritize highly scalable sequential read performance. As I mentioned before, the amount of data we were ingesting never really reached the limits of even Postgres 9.4, so we didn't feel like we had to make those trade-offs...yet.
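
To make the update/delete and single-key point concrete, here's roughly what those look like in ClickHouse (hypothetical table, just to illustrate the trade-off):

    # Sketch of the update/delete and point-lookup trade-off in ClickHouse,
    # against a hypothetical "events" table (clickhouse-driver again).
    from clickhouse_driver import Client

    client = Client("localhost")

    # Deletes are asynchronous "mutations": this returns once the mutation is
    # queued, parts get rewritten in the background, and there's no rollback.
    client.execute("ALTER TABLE events DELETE WHERE user_id = 42")

    # A single-key lookup is really a range scan over the table's sort order --
    # fine for analytics, but far slower per row than a B-tree hit in Postgres.
    rows = client.execute("SELECT * FROM events WHERE user_id = 42 LIMIT 1")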

I would imagine that serving ad events is several factors larger in scale than what we were dealing with.



