Another Timescale engineer here. As previously pointed out, zheap should work as a drop-in in TimescaleDB. In fact, I just tried it and it works. However, it currently requires an unmerged PR to work properly: https://github.com/timescale/timescaledb/pull/2082, as well as further testing.
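For anyone wondering what "drop-in" means here: zheap is exposed as a regular PostgreSQL table access method, so, assuming a PostgreSQL build that includes zheap and the chunk-related changes in the PR above, the setup would roughly follow the standard PostgreSQL mechanism. Table and column names below are just examples, not from the PR:

  -- Sketch only: assumes zheap is available as a table access method
  -- named 'zheap' in your PostgreSQL build, and that chunks pick up the
  -- default access method once the PR above is merged.
  SET default_table_access_method = 'zheap';

  CREATE TABLE conditions (
    time        timestamptz NOT NULL,
    device_id   int,
    temperature double precision
  );
  SELECT create_hypertable('conditions', 'time');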
We talk about sharding vs. chunking in the blog post, and I would put CitusDB in the former category. More specifically, TimescaleDB focuses on time-series workloads; to handle those, CitusDB suggests combining their extension with a third-party extension, pg_partman (see their docs).
I have no experience with this combination myself, so don't want to speculate about performance, etc., but when reading the docs it really seems like an afterthought.
It basically boils down to deleting a bunch of files on disk. The fact that it is distributed doesn't affect efficiency too much; it is basically a delete sent to all nodes, followed by a two-phase commit.
The upside of deleting entire tables (chunks) like this is that you don't pay the same PostgreSQL vacuuming cost normally associated with row-by-row deletes.
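As a concrete example, a retention cut-off is a single call that removes whole chunks rather than individual rows (the exact signature differs a bit between TimescaleDB versions; the hypertable name is illustrative):

  -- Drop all chunks of 'conditions' that only contain data older than
  -- three months; this removes the chunk tables (files on disk) rather
  -- than deleting row by row, so there is no VACUUM debt afterwards.
  SELECT drop_chunks('conditions', older_than => INTERVAL '3 months');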
Thanks for the advice. FWIW, though, TimescaleDB supports multi-dimensional partitioning, so a specific "hot" time interval is actually typically split across many chunks, and thus server instances. We are also working on native chunk replication, which allows serving copies of the same chunk out of different server instances.
Apart from these things to mitigate the hot-partition problem, it's usually better to serve the same data to many requests from a warm cache than to have many random reads thrashing the cache.
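To make the multi-dimensional partitioning mentioned above concrete, here is a rough sketch (table/column names are made up; check the docs for the exact arguments in your version):

  -- Partition on time plus a hash-partitioned "space" dimension, so a
  -- given "hot" time interval is split across several chunks (and, with
  -- multi-node, across several server instances).
  SELECT create_hypertable('metrics', 'time',
                           partitioning_column => 'device_id',
                           number_partitions   => 4);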
Hey Erik, thanks for the post. In this vision, would this cluster of servers be reserved exclusively for timeseries data, or do you imagine it containing other ordinary tables as well?
We're using postgres presently for some IoT, B2B applications, and the timeseries tables are a half dozen orders of magnitude larger than the other tables in our application. Certain database operations, like updates, take a very long time because of this. I've wondered if by splitting the timeseries tables onto their own server I could handle updates independently, with the main app gracefully handling the timeseries DB being offline for some period of time.
It's about more than just downtime, though. If the timeseries db is overloaded through poor querying or other issues, the customer impact of the slowdown would be limited.
We commonly see hypertables (time-series tables) deployed alongside relational tables, often because there is a relation between them: the relational metadata provides information about the user, sensor, server, or security instrument that is referenced by id/name in the hypertable.
So joins between these time-series and relational tables are common, and together they serve the applications one often builds on top of the data.
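A typical join of that kind looks something like this (the schema is purely illustrative):

  -- 'measurements' is a hypertable, 'sensors' an ordinary relational table.
  SELECT s.location,
         time_bucket(INTERVAL '15 minutes', m.time) AS bucket,
         avg(m.temperature) AS avg_temp
  FROM measurements m
  JOIN sensors s ON s.id = m.sensor_id
  WHERE m.time > now() - INTERVAL '1 day'
  GROUP BY s.location, bucket
  ORDER BY bucket;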
Now, TimescaleDB can be installed on a PG server that is also handling tables that have nothing to do with its workload, in which case one does get performance interference between the two workloads. We generally wouldn't recommend this for most production deployments, but the decision here is always a tradeoff between resource isolation and cost.
Timescale engineer here. Just want to point out that you can also attach additional disks using tablespaces, which are fully supported on hypertables. With a few simple commands, this allows you to add new disks and move old disks out of rotation while still being able to query the old data on them.
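For reference, the commands look roughly like this (tablespace and hypertable names are examples; see the docs for the current signatures):

  -- Create a tablespace on the newly attached disk and tell the
  -- hypertable to place new chunks there.
  CREATE TABLESPACE disk2 LOCATION '/mnt/disk2/pgdata';
  SELECT attach_tablespace('disk2', 'conditions');

  -- Later, stop placing new chunks on that disk while keeping the
  -- existing chunks on it fully queryable.
  SELECT detach_tablespace('disk2', 'conditions');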
Timescaler here. We're not blaming "PG devs". We have great respect for the PostgreSQL developers and what they are doing; so much, in fact, that we chose to base our product on PostgreSQL. And, TimescaleDB is not a fork--it is an extension to PostgreSQL that can be loaded in existing PostgreSQL installations.
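Concretely, loading it into an existing installation follows the standard extension workflow (package installation aside):

  -- In postgresql.conf (requires a restart):
  --   shared_preload_libraries = 'timescaledb'
  -- Then, in each database that should use it:
  CREATE EXTENSION IF NOT EXISTS timescaledb;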
We would be happy to contribute to PostgreSQL, but I think the issue here is that, as a business focusing on a very particular use case, we are not perfectly aligned with the PostgreSQL roadmap. We want to be able to move quickly and adapt to customers' needs, focusing on the pain points and issues they have. This simply isn't compatible with the more conservative development pace that main PostgreSQL understandably has.
From another perspective, I think one strength of PostgreSQL is, in fact, its support for extensions, enabling innovation alongside main PostgreSQL while the core developers can focus on a rock-solid and extensible foundation. So, from where I'm standing, this is a feature and not a bug.
Fellow Timescaler here. Thanks for the feedback. While we do not directly compare ingestion protocols and specific features, like continuous queries and retention policies (something I guess we could add), we do compare ecosystems and third-party tool support, including ingestion (e.g., Kafka, Hibernate) and visualization (e.g., Tableau, Grafana). In fact, the developer behind the Grafana PostgreSQL data source is also a Timescaler, and an upcoming version of the data source will have an improved query builder and first-class TimescaleDB support.
Finally, I can assure you that this is more than a few input scripts. In fact, the project is thousands of lines of C code (if that matters) that implement automatic partitioning, query optimizations, and much more. Our code is open source here: https://github.com/timescale/timescaledb
Timescaler here. Although we haven't made any official announcements w.r.t. clustering, our plan is for this to be open source, like the single-node version.
We wouldn't do this if we didn't believe we could make a business out of it. That said, we are at a pretty early stage at this moment and are looking at many different options. As you may know, and if you've been following the discussions around business models for open-source projects, there are many approaches as well as challenges. All I can say is that we have been following the discussions and we are drawing lessons from past failures and successes.
That point about irregularly spaced (sparse) data is a very insightful observation. I'd just add that a user can to some extent address that through normalization, i.e., by splitting incoming data across multiple TimescaleDB (hyper)tables, like in any SQL database. However, there are clear trade-offs here. The upside is that users can balance these trade-offs themselves.
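For example (purely illustrative schema), instead of one wide hypertable where most columns are NULL on any given row, you can keep one narrow hypertable per measurement type:

  -- Narrow, dense tables instead of a single wide, sparse one; join on
  -- (device_id, time) when a combined view is needed.
  CREATE TABLE cpu (time timestamptz NOT NULL, device_id int, cpu_usage double precision);
  CREATE TABLE mem (time timestamptz NOT NULL, device_id int, used bigint);
  SELECT create_hypertable('cpu', 'time');
  SELECT create_hypertable('mem', 'time');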