TimescaleDB confuses me. Postgres is an OLTP database and their disk storage format is uncompressed and not particularly effective.
By clever sharding, you can work around the performance issues somewhat but it'll never be as efficient as an OLAP column store like ClickHouse or MemSQL:
- Timestamps and metric values compress very nicely using delta-of-delta encoding.
- Aligning data by columns means much faster aggregation. A typical time series query does min/max/avg aggregations by timestamp. You can load data straight from disk into memory, use SSE/AVX instructions and only the small subset of data you aggregate on will have to be read from disk.
So what's the use case for TimescaleDB? Complex queries that OLAP databases can't handle? Small amounts of metrics where storage cost is irrelevant, but PostgreSQL compatibility matters?
Storing time series data in TimescaleDB takes at least 10x (if not more) space compared to, say, ClickHouse or the Prometheus TSDB.
TimescaleDB is more performant that you may think. We've benchmarked this extensively: eg outperforming vs InfluxDB [1] [2], vs Cassandra [3], vs Mongo [4].
We've also open-sourced the benchmarking suite so others can run these themselves and verify our results. [5]
We also beat MemSQL regularly for enterprise engagements (unfortunately can't share those results publicly).
I think the scalability of ClickHouse is quite compelling, and if you need more than 1-2M inserts a second and 100TBs of storage, then that would be one reason where I'd recommend another database over our own. But horizontal scalability is something we have been working on for nearly a year, so we expect this to be a less of an issue in the near future (will have more to share later this month).
You are correct however that TimescaleDB requires more storage than some of these other options. If storage is the most important criteria for you (ie more important than usability or performance), then again I would recommend you to one of the other databases that are more optimized for compression. However, you can get 6-8x compression by running TimescaleDB on ZFS today, and we are also currently working on additional techniques for achieving higher compression rates.
As far as I know we've only faced off against TimeScaleDB on one small account in the IoT space.
You can't really compare columnstore storage (MemSQL) to rowstore storage (Timescale) for scanning and filtering large amounts of data for analytics use cases (of which time series use cases are a subset). I think this fact is reasonably well established at this point (the idea was popularized by the CStore project a decade ago[1]). Even at the small end scanning compressed data in columnstore format is so much faster then rowstore [2] (the data fits nicely into CPU caches and is well suited for SIMD instructions)
I would be happy to compare public customer references with timescale though. MemSQL is well established in the fortune 100 at this point:
At my previous job we implemented custom sharding and aggregation on top of Postgres 9.4 for timeseries for a monitoring product. We did it to simplify operations as we built a new product (team <4) and we knew it would be years before our scale motivated us to adopt a specialized store.
We were pleasantly surprised, however, with how far this solution took us. 3 years later we were pushing ~30 TB every two weeks and Postgres was handing it well with predictable performance characteristics. We still didn't feel a pressing need to replace Postgres (although we were moving that direction).
It's also worth mentioning that this was 9.4 Postgres which is prior to partitioning and parallelization improvements which have been landing since 9.6. So I would expect even vanilla Postgres to handle even better.
Anyway, I'm a fan of Timescale's work and share your sentiments here almost exactly.
> You are correct however that TimescaleDB requires more storage than some of these other options. If storage is the most important criteria for you (ie more important than usability or performance), then again I would recommend you to one of the other databases that are more optimized for compression. However, you can get 6-8x compression by running TimescaleDB on ZFS today, and we are also currently working on additional techniques for achieving higher compression rates.
This is a weird answer since compression is used by columnar databases like MemSQL and Clickhouse to both save on storage and accelerate queries. Compare this to using a generic a filesystem compression which would both compress worse and make the system slower.
We haven't really found it to be the case that the system is slower with ZFS. As the sibling mentions, you are trading some CPU for better I/O. We usually see better insert performance and similar/better query latency.
Compression may or may not be worse with ZFS defaults, but performance will almost certainly be _better_ with the default ZFS compression settings than an uncompressed filesystem. You're trading a small amount of CPU for IO, and that's usually a really good trade.
We use TimescaleDB with databases between 1-100 million rows (small by some standards, but certainly not tiny) - I love it!
- we use Postgres as our main database, so being able to keep out time-series data in the same place is a big win
- perhaps because because it's a Postgres extension, the learning curve is small
- it keeps timerange-constrained queries over our event data super fast, because it knows which chunks to search across
- deleting old data (e.g. for a data retention policy) is instantaneous, as TimescaleDB just deletes the physical files that back the timerange being deleted
- it has some nice functions built-in, like `time_bucket_gapfill`. Yes, you could write your own functions to do this, but it's nice to have maintained, tested functions available OOTB
There's an interesting benchmark against TimescaleDB and InfluxDB from VictoriaMetrics that seem to do better on performance and disk space than both. I consider using it as a remote storage of Prometheus.
I can't say for sure, but shouldn't insertions be way quicker in Timescale, because the index-changes are limited to the most-recent subtable only, and it's still row-based?
We're considering a move from OpenTSDB to Timescale currently, and something that stands out in Timescale is the wide-table format; we get bundles of metrics at each tick, and having them aligned makes usage easier, and perhaps also saved us some space over having the timestamps repeated per metric.
My understanding is that it just gives you a bit of extra room if you have a small to mild timeseries problem. That's still a lot of use cases, but you're right it will never work for larger use cases.
They said they didn't want to reinvent a database engine to solve the timeseries problem, so you have what you pay for.
We are actually doing a fair bit to address larger use cases. We also had customers who are doing 100s of billions of points successfully, so I guess it depends on what you mean by larger.
As our CEO mentioned in a sibling, we are working on a horizontal/scale-out solution for even higher ingest rates, as well as sharding. We're also doing some work for better compression to reduce our disk footprint.
Also since 1.2, we have support for automatic retention policies that help keep the disk usage in check. Yesterday we released 1.3, which contains our first iteration of continuous aggregations that let's you materialize aggregates over the raw data for faster querying. In a future iteration, we'll also allow you to remove the underlying/raw data but keep the aggregates -- another way to improve the disk usage of your data.
All that is to say we do consider ourselves useful for larger use cases, and have a lot of features coming down the pipe to make it even better.
By clever sharding, you can work around the performance issues somewhat but it'll never be as efficient as an OLAP column store like ClickHouse or MemSQL:
- Timestamps and metric values compress very nicely using delta-of-delta encoding.
- Compression dramatically improves scan performance.
- Aligning data by columns means much faster aggregation. A typical time series query does min/max/avg aggregations by timestamp. You can load data straight from disk into memory, use SSE/AVX instructions and only the small subset of data you aggregate on will have to be read from disk.
So what's the use case for TimescaleDB? Complex queries that OLAP databases can't handle? Small amounts of metrics where storage cost is irrelevant, but PostgreSQL compatibility matters?
Storing time series data in TimescaleDB takes at least 10x (if not more) space compared to, say, ClickHouse or the Prometheus TSDB.