The value proposition appears to be a much better time-series solution for the PostgreSQL environment, and in that regard it looks like it is well-executed. But I don't follow the assertion that it is targeted at machine-generated sensor data per se for a couple reasons. I'm not sure it is possible to hit the operation metrics/characteristics typically required for these workloads from within the PostgreSQL ecosystem. Interesting database implementation but I am not clear on the core technical differentiation and value proposition compared to other databases. Maybe someone from TimescaleDB here can fill in?
For example, what is the online insertion rate story? As a rough rule of thumb, I expect an inexpensive Linux server to drive a million records/second through parsing, processing, indexing, and disk storage because you need that kind of throughput for efficiency when dealing with sensor data. But I would not expect that from PostgreSQL -- it is optimized for a different workload. Does this bypass the normal PostgreSQL I/O path some how?
What is the scale-out story? One of the most difficult parts of dealing with sensor data models is that you need to continuously shift data and load between machines to get consistent performance. TimescaleDB's sharding model looks like it might actually be able to support this kind of load shifting but PostgreSQL has no such architectural concept and I don't see anything written about it in a cursory read.
What is the geospatial analytics story? This is a huge part of analyzing sensor data for many applications. PostGIS uses GEOS, which is the only high-quality geospatial geometry engine in open source, but is that integrated into TimescaleDB? Is there proper support for high-performance geospatial sharding?
Looks like a solid architecture and implementation though.
Thanks for the detailed comments. Let me try to answer a couple of them.
- We don't bypass the Postgres I/O in general, but have rewritten a lot of the insert path as it relates to managing inserts across many chunks (child tables in Postgres). You can see a detailed explanation and performance comparison of PG10 declarative partitioning vs. TimescaleDB here: https://blog.timescale.com/time-series-data-postgresql-10-vs...
- What we find is that in many sensor settings, at least when deployed at centralized collection points, you'll actually place your DB/storage infrastructure behind a front-end anyway (web interface, MQTT, SSL termination point, etc.), and then batch data through a data pipeline like Kafka. This means that when the data hits TimescaleDB, you can perform larger batch inserts (even 10K rows / batch), which is how a single-node Timescale can support 100,000s of rows/sec (millions of metrics/sec). More benchmarks: https://blog.timescale.com/timescaledb-vs-6a696248104e
- So while this isn't the native I/O speed of Linux, it's not just about raw write rate; you also get those other things that are critical for a robust data product: reliability, flexible and optimized query performance, etc.
- Scaling out OLTP workloads has in general been challenging, but there are some nice characteristics of time-series workloads that make this better. We're working on scale-out clustering right now, and hope to be able to announce some things later this year.
- We fully integrate with PostGIS and other Postgres extensions. Some of our users customers are very much employing TimescaleDB + PostGIS for temporal+spatial analysis: doing things like asset tracking or logistics over time.
It's more a marketing spin than anything. It's an interesting time-series focused solution extending Postgres to become a more scalable database with built-in smarter time dimension handling. A distributed column-store database would do things just as well for queries and ingest, if not better, but it's always good to have more choices.
For example, what is the online insertion rate story? As a rough rule of thumb, I expect an inexpensive Linux server to drive a million records/second through parsing, processing, indexing, and disk storage because you need that kind of throughput for efficiency when dealing with sensor data. But I would not expect that from PostgreSQL -- it is optimized for a different workload. Does this bypass the normal PostgreSQL I/O path some how?
What is the scale-out story? One of the most difficult parts of dealing with sensor data models is that you need to continuously shift data and load between machines to get consistent performance. TimescaleDB's sharding model looks like it might actually be able to support this kind of load shifting but PostgreSQL has no such architectural concept and I don't see anything written about it in a cursory read.
What is the geospatial analytics story? This is a huge part of analyzing sensor data for many applications. PostGIS uses GEOS, which is the only high-quality geospatial geometry engine in open source, but is that integrated into TimescaleDB? Is there proper support for high-performance geospatial sharding?
Looks like a solid architecture and implementation though.