Interesting database implementation, but I'm not clear on the core technical differentiation and value proposition compared to other databases. The value proposition appears to be a much better time-series solution for the PostgreSQL environment, and in that regard it looks well-executed. But I don't follow the assertion that it is targeted at machine-generated sensor data per se, for a couple of reasons: I'm not sure it is possible to hit the operational metrics/characteristics typically required for these workloads from within the PostgreSQL ecosystem. Maybe someone from TimescaleDB here can fill in?
For example, what is the online insertion rate story? As a rough rule of thumb, I expect an inexpensive Linux server to drive a million records/second through parsing, processing, indexing, and disk storage, because you need that kind of throughput for efficiency when dealing with sensor data. But I would not expect that from PostgreSQL -- it is optimized for a different workload. Does this bypass the normal PostgreSQL I/O path somehow?
What is the scale-out story? One of the most difficult parts of dealing with sensor data models is that you need to continuously shift data and load between machines to get consistent performance. TimescaleDB's sharding model looks like it might actually be able to support this kind of load shifting but PostgreSQL has no such architectural concept and I don't see anything written about it in a cursory read.
What is the geospatial analytics story? This is a huge part of analyzing sensor data for many applications. PostGIS uses GEOS, which is the only high-quality geospatial geometry engine in open source, but is that integrated into TimescaleDB? Is there proper support for high-performance geospatial sharding?
Looks like a solid architecture and implementation though.
Thanks for the detailed comments. Let me try to answer a couple of them.
- We don't bypass the Postgres I/O in general, but have rewritten a lot of the insert path as it relates to managing inserts across many chunks (child tables in Postgres). You can see a detailed explanation and performance comparison of PG10 declarative partitioning vs. TimescaleDB here: https://blog.timescale.com/time-series-data-postgresql-10-vs...
- What we find is that in many sensor settings, at least when deployed at centralized collection points, you'll actually place your DB/storage infrastructure behind a front-end anyway (web interface, MQTT, SSL termination point, etc.), and then batch data through a data pipeline like Kafka. This means that when the data hits TimescaleDB, you can perform larger batch inserts (even 10K rows / batch), which is how a single-node Timescale can support 100,000s of rows/sec (millions of metrics/sec). More benchmarks: https://blog.timescale.com/timescaledb-vs-6a696248104e
- So while this isn't the native I/O speed of Linux, it's not just about raw write rate; you also get those other things that are critical for a robust data product: reliability, flexible and optimized query performance, etc.
- Scaling out OLTP workloads has in general been challenging, but there are some nice characteristics of time-series workloads that make this better. We're working on scale-out clustering right now, and hope to be able to announce some things later this year.
- We fully integrate with PostGIS and other Postgres extensions. Some of our users and customers are very much employing TimescaleDB + PostGIS for temporal+spatial analysis: doing things like asset tracking or logistics over time.
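As a rough sketch of what that can look like (all table and column names here are made up, not from an actual deployment):

    CREATE EXTENSION IF NOT EXISTS postgis;

    CREATE TABLE vehicle_locations (
      time       timestamptz NOT NULL,
      vehicle_id integer,
      geom       geometry(Point, 4326)
    );
    SELECT create_hypertable('vehicle_locations', 'time');

    -- distance traveled per vehicle per hour, using a PostGIS line built from the ordered points
    SELECT time_bucket('1 hour', time) AS hour,
           vehicle_id,
           ST_Length(ST_MakeLine(geom ORDER BY time)::geography) AS meters_traveled
    FROM vehicle_locations
    GROUP BY hour, vehicle_id
    ORDER BY hour, vehicle_id;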
It's more marketing spin than anything. It's an interesting time-series-focused solution that extends Postgres into a more scalable database with smarter built-in handling of the time dimension. A distributed column-store database would do just as well for queries and ingest, if not better, but it's always good to have more choices.
We recently integrated the TimescaleDB extension into our PostgreSQL database at https://nightwatch.io. For performance reasons, we decided to implement time-based table partitioning. We were evaluating native PG 10 partitioning vs. TimescaleDB, and concluded the latter would spare us a lot of work, maintenance and headaches. We were right - we got table partitioning basically "for free", query performance improved, and there is no partitioning maintenance involved; everything is automatic behind the scenes with their concept of hypertables. Support is really good too, they are very helpful on the Timescale Slack channel.
I had the same experience. I manually partitioned a pg10 db and it was quite a lot of work to make sure I had all the partitions I needed. Then I switched to Timescale and it was just a breeze. I could partition a dataset both on time and on a foreign key with a single command, and performance (especially huge inserts) jumped up. Or to be more accurate, the performance degradation of my table, which made each subsequent insert a bit slower than the previous, was completely gone. All in all, very happy with TimescaleDB.
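For reference, that single command looks roughly like this (my table/column names here are just illustrative); it partitions on time and additionally hash-partitions on the foreign-key column as a second "space" dimension:

    -- chunks are created automatically; no manual child tables to maintain
    SELECT create_hypertable('readings', 'time', 'device_id', 4);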
I'm wondering if there is a way to use Timescale in combination with Citus.
We are well adjusted to Citus, as we've been using it for almost 2 years now. For most data it's great, but we do have some time series data where we have to do a lot manually to achieve desirable results. I'm wondering if there is a way to take advantage of both, but I'm not sure how the query distribution would work when combining the two.
Craig from Citus here. It's a question we get quite often. Unfortunately, I'm actually not sure. There is a chance they would play just fine together, and from a quick glance they likely would, but we've not explored this ourselves yet. Since Citus would be running on the coordinator, you'd place Timescale on each distributed node, and at that point what Citus passes down is standard SQL, which Timescale can act on. All that said, we've not heavily tested it ourselves, and there would be some open questions on designing the right experience.
We do have a number of users that are using Citus alongside the native time partitioning that was released in Postgres 10 with good results.
I set up a time-series Postgres 9.6 database last year with traditional table partitioning (based on entity id, not time). I looked at TimescaleDB but it was about a month old at the time so I wasn't comfortable recommending it. Also I got the impression that it was little more than a wrapper around traditional partitioning, just to make it easier to use. Looking at their Github now it seems to be more than that though. I will have to give it a closer look! But also you might want to clarify your literature about what's going on. Is it something like cstore_fdw with its own files (but not built on FDW)?
I have a Postgres extension of my own for fast timeseries inserts that gives you back the data as an array (https://github.com/pjungwir/floatfile), but it is much less developed and comes with a lot of caveats. My own approach is very simple: just make a file of 8-byte floats and keep appending to it. Is timescaledb something like that (but better)? One of the drawbacks with my approach is that you have to re-implement a lot of things (like histograms). Even though you say you support "all of SQL" it looks like you are similarly re-implementing things that would be slow otherwise (histograms).
Also, I see a lot about insert performance, but I keep getting projects where inserts aren't challenging but getting fast analytic queries is. I'd love to see some examples/numbers there.
I really like the idea of keeping timeseries data quasi-connected to your main RDBMS, so I'm thrilled to see successful projects in that space. It can definitely be challenging. I wonder if something like influx_fdw is possible and would make sense? (I've never used InfluxDB.)
Thanks for the comments, and it sounds like you want a deeper dive as to some of the design decisions around TimescaleDB.
What might be useful to start is two of our technical blog posts around how we design internal partitioning [1], as well as how this compares against declarative partitioning in PG10 [2].
We aren't rewriting the lowest storage layers like cstore_fdw, so we can continue to use Postgres' heavily-tested storage layers...as well as work with its existing streaming replication and recovery mechanisms. We also aren't packing individual writes into files (like it sounds you are doing with floatfile), arrays (like in tgres), or other TOASTED data structures, although we might be compatible with those approaches. We certainly do support the full range of types (arrays, JSON/B, etc.) in our data rows.
Fully agree that projects differ on their needs of inserts vs. queries time, but in addition to much higher insert rates, our general data model and aggressive constraint exclusion across chunks provides meaningful time-based query improvements. We also have all sorts of planner optimizations, e.g., we taught the planner that querying for the data by time rounded-to-some-time-bucket preserves the sort on time --- which is useful for all kinds of ORDER BY/GROUP BY queries. We've published various benchmarks around this [3].
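As a (hypothetical) example of the kind of query that benefits from those optimizations:

    -- hourly averages over the last week; the planner knows that bucketing on time preserves
    -- the time ordering, so the ORDER BY can reuse the time index instead of re-sorting
    SELECT time_bucket('1 hour', time) AS hour,
           device_id,
           avg(temperature) AS avg_temp
    FROM conditions
    WHERE time > now() - interval '7 days'
    GROUP BY hour, device_id
    ORDER BY hour DESC;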
Finally, there are a lot of changes needed to "just make things work", which also includes support for constraints, foreign keys, upserts, triggers, etc., which enable capabilities like data retention and continuous aggregation policies.
There is one limitation mentioned there ("TimescaleDB does not yet support using ON CONFLICT ON CONSTRAINT with a named key"), and a tip that mentions requirements for the UNIQUE constraint, but these are not usually dealbreakers.
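Concretely (assuming a unique index on (time, device_id), which has to include the time partitioning column), the column-inference form of an upsert works; it's only the named-constraint form that doesn't:

    -- works: conflict target inferred from the columns
    INSERT INTO conditions (time, device_id, temperature)
    VALUES ('2018-01-24 10:00:00+00', 1, 70.1)
    ON CONFLICT (time, device_id) DO UPDATE SET temperature = EXCLUDED.temperature;

    -- not yet supported on hypertables: ON CONFLICT ON CONSTRAINT <constraint_name>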
We were an early adopter of InfluxDB and ran it for almost two years.
Some of our challenges:
- Not being able to change your DB structure (like indexed columns) after the fact.
- Indexes on high-cardinality data blowing up the DB (OOM, slowdowns), even after they 'fixed' the cardinality problem.
- Performance writing data while also querying against it for analytics.
We solved all these problems with Timescale, use far fewer resources, are a long way away from needing a cluster, have lots of room for growth, and development has sped up with devs who are already familiar with PostgreSQL and can easily make changes.
At a high-level, we've taken a unique approach to the time-series data problem. While every other time-series db has taken a non-relational approach (which also means no or poor SQL support), we've built ours on Postgres. As a result, not only do we support full SQL (JOINs, complex WHEREs/GROUPBYs, window functions, etc), but we also inherit the reliability and vast ecosystem of Postgres.
At a more detailed level, comparisons between TimescaleDB and others vary depending on the database. Typically though, if you are most concerned with data compression or simple column scans, some of the other options might be more suitable than ours. But otherwise you should see similar (if not better) insert and query performance, while maintaining all the other benefits of a relational DB / Postgres.
And with your relational data living alongside time-series data, many people have been able to simplify their stack (from multiple dbs into one db), which makes things easier to manage operationally and at the application level.
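For example (schema is illustrative), plain relational metadata can be joined directly against the time-series hypertable in a single query:

    SELECT d.location,
           time_bucket('1 day', c.time) AS day,
           avg(c.temperature) AS avg_temp
    FROM conditions c
    JOIN devices d ON d.id = c.device_id
    WHERE d.location = 'warehouse-7'
    GROUP BY d.location, day;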
Re: OpenTSDB, we hear that managing an HDFS cluster / Zookeeper / etc can be a headache. You obviously don't have that problem with TimescaleDB.
Re: Prometheus, I should add that we've open-sourced `pg_prometheus` [1], which allows you to write Prometheus data into Postgres / TimescaleDB. So with regards to Prometheus, we are more of a complement than a replacement.
Fred from IRONdb (http://irondb.io) here. First off, it's worth mentioning that "scalability" is a very relative term. Some people need to process a billion measurements per second.. there are very few databases out there that can handle that sort of task. If you're only idling along at something small that a single machine can easily handle (like a million data points per second), then the product doesn't need to scale very much at all. At that point you'd be looking for operability and manageability... how does it behave when everything else in the world aims to ruin your day?
When we started scaling Circonus, we realized we needed to solve this billion measurements per second problem on a system that scales near-linear with added nodes and has manageable failure and recovery scenarios... if we didn't solve that, we'd be worried about data all the time instead of building value on data. That's IRONdb.
What all of these TSDBs you mentioned have in common is that they tend to handle increasing workloads by just throwing a bigger box at the problem, but run into technical challenges when adding more nodes. How do they ensure data consistency and availability when multiple nodes are involved? Most of them accomplish this via consensus algorithms (paxos/raft/etc).
This introduces a number of technical problems that readers of the 'Call Me Maybe' blog series by Aphyr (https://aphyr.com/tags/jepsen) are familiar with. Essentially you end up with data inconsistency and corruption at high workloads to some degree for almost all of them, and sacrifice performance for consensus. At serious scale with petabytes of data, how does your TSDB detect and recover from the inevitable fallibility of disks (bit error rates being non-zero)? These are manageability questions that you run into at scale.
This was an issue we recognized was best avoided when we designed IRONdb, so we approached the problem by avoiding the need for consensus altogether through use of commutative operations. This means that it doesn't matter what order the end time series data is applied in; the result is the same. It allowed us to focus our efforts on read vs write performance, operational simplicity and the other points you mentioned.
As always, I'd recommend taking each for a spin and seeing for yourself. None are perfect and each will surprise and disappoint you in unique ways when you use it in anger. This coming from someone with ample anger.
Disclaimer; I've been a Postgres user for 18 years.
"We realized that this problem needed a new kind of database, and so we built TimescaleDB: the first open-source time-series database designed for scalability and complex analysis."
I'm not a DB expert, but what have financial firms been using for decades? Likely dealing with much larger securities datasets...
Or, is it just that TimescaleDB is 'opensource'...
Yes, they have. kdb+ has been the standard bearer within finance for more real-time trading applications. Still, q as a query language tends to limit how much of the organization can access the data.
TimescaleDB offers full-SQL at scale, opening the database up to more people. It's used to power longer-term trading platforms (for e.g., commodities trading) and interactive analytical dashboards (for e.g., research analysts). We also see applications in compliance and risk management.
And yes, TimescaleDB is open-source. So we see TimescaleDB as complementary to kdb+, which has been around for much longer as a proprietary system, is certainly full-featured and powerful, but is used more in real-time trading applications.
kdb+ from Kx Systems is popular among users who have mostly tick data (billions of trade/bid/offer updates). It has bare-bones SQL, and its own query language "q", which in many cases simplifies the queries immensely (with two features, "foreign key chasing" and "as-of joins", which are somewhat at odds with Codd's original pure relational model, so are unlikely to get into the SQL standard).
Oh how I wish AWS RDS would add TimescaleDB as an extension. Not sure if we could migrate at this point (we are using InfluxDB for time series and metrics and Postgres for related relational data), but if TimescaleDB were available on RDS I'd seriously consider it, as it would simplify our approach tenfold. Congrats on the funding!
Thanks! We're continuing to try and get that accomplished ourselves. It always helps for them to hear from potential users to let them know there is demand for it.
Thanks for asking. For some of the larger companies, it takes a little bit of time to work with their marketing / legal teams :)
We do list some of them within the blog post:
- STE Energy, to back operational dashboards used to monitor 47 power plants in Italy, Albania, Colombia, Perú, Honduras, Costa Rica, and Chile, replacing Redis
- InTraffic, which monitors all of the roughly 5,000 remote controllable railroad switches of the Dutch railways, for use with Tableau as a Business Intelligence solution
- One of the largest American telecom companies, for storing server health data
- A top financial services firm, to replace Oracle for powering interactive analytics of market data
- A public lighting and semiconductor company, for deployment in a small-footprint IoT edge device collecting and analyzing sensor readings
Influx is a NoSQL, non-relational database, written from scratch, that optimizes for properties like higher compression and column scans for specific label sets. While it supports something SQL-ish, the query language is fairly limited, and I think they are moving even further away from SQL toward a more declarative, non-SQL language.
On the other hand, TimescaleDB has figured out how to re-engineer PostgreSQL to make it scale for time-series data. In doing so, it:
- Supports full SQL, including complex WHERE predicates, secondary indexes, non-time based aggregates, sub-queries, window functions, CTEs, etc.
- Allows you to store regular relational tables alongside your time-series data and perform arbitrary JOINS.
- Supports both full schema-tized and JSON/JSONB data models.
- Inherits the full ecosystem of Postgres (visualization tools like Tableau or Grafana, ORMs for Ruby and Django, connectors like Kafka, etc.). If it speaks to Postgres, it speaks to TimescaleDB.
- Also works with other Postgres extensions, including things like postgis for geo-spatial analysis.
- Enjoys the reliability of Postgres, including streaming replication for HA, clustered read replicas, point-in-time recovery, incremental backups, etc.
Our open source version already supports some forms of scaling/clustering today: e.g., read-only replicas for hot standbys and increasing query throughput. Timescale also supports hundreds of thousands of row inserts (millions of metrics) per second on a single node, and allows users to elastically add disks to scale up single-node capacity to ~100TB.
For scaling beyond those insert and storage limits, we're currently working on full, scale-out clustering. We'll have more to say about that soon.
I _really_ love that you all have built something that works really well when scaled vertically, and not kneecapped the whole thing in search of horizontal scale chasing the 1%-ers.
Thanks! I didn't know time-series databases existed. I had to rummage through examples to understand what it even means because it wasn't clearly explained in the FAQ. This makes things so much easier.
Hi, thanks for the feedback on our FAQ. We did write a blog post that goes into what time-series data is and why purpose-built time-series databases should exist / are necessary.
Have you guys had any interest from the existing SCADA process historian vendors (Wonderware Historian, OSISoft PI, etc) or historian users about using Timescale?
We use process historians, and it's unlikely we will ever get off them. Industrial process historians are hard to replace because of the multitude of corner conditions that they handle, a result of being hardened over 3-4 decades of being deployed in industry.
That said, most process historians are really not optimized for analytics use cases, which are becoming increasingly important for extracting value. They also don't scale too well horizontally, and are unwieldy to work with at scale. (They're typically used for visualization, and you're only able to work with small subsets of the data at a time, like when you're troubleshooting operations at a particular plant, as opposed to finding patterns across all plants, which is more easily achieved with something like Timescale.)
I think an ideal setup for me would be to get historian collectors to tee their data to both the historian database, and to something like Timescale. I do a lot of large scale time-series modeling and would really prefer to work with a SQL database than a historian API.
wenc's comments are spot on. Just to add: we have found ourselves being used alongside historians, as an easier way to extract and analyze the actual data.
In other words, instead of "rip and replace" companies are using TimescaleDB as a more flexible, powerful and easier-to-use operational analytical engine alongside historians, for easier access to their data via full-SQL and in a relational model that lends itself to more useful, complex analysis.
Always intrigued to learn more about where we can help here, so if you have any insight, we'd love to hear from you: hello@timescale.com
I think the scada vendors have already solved the same set of problems that timescale solves.
Edit: pretty sure the SCADA vendors partition the data into tables so each table is a day or week of data. They do rollups on the data, so the detail is available if you do a query with a 1-minute timespan, but the data is already aggregated into representative values for a 1-minute interval if the query timespan is 1 day.
Sounds like Timescale's clustering story might be better, but with OSI or Wonderware your clustering story is MS SQL or Oracle.
Except that most process historians store data in a non-relational way which makes it difficult to integrate with other types of data.
Process historians, like most NoSQL time-series offerings, are traditionally optimized for fast data retrieval (for HMIs, real-time monitoring dashboards, etc.) but are really weak when it comes to analytics. There are new products like Seeq (a kind of a Tableau for process historians) that attempt to address this, but somehow miss the fact that for analytics, you really need your data to exist in a dataframe-like shape.
Today, if you want to do any kind of non-trivial analytics, you typically have to first ETL data from the historian to a relational database via REST or OLE DB API. TimescaleDB on the other hand already stores data in a native relational format and lets you run full SQL on it.
> pretty sure the SCADA vendors partition the data into tables so each table is a day or week of data. They do rollups on the data, so the detail is available if you do a query with a 1-minute timespan, but the data is already aggregated into representative values for a 1-minute interval if the query timespan is 1 day.
I have not seen this, but if this is the case, and if these aggregates are stored in a proper relational database, that solves a big chunk of the problems people have with doing analytics on historian data.
As far as I know GE Proficy does not do this. Which is a pain.
One more thing: If this is the kind of thing that excites you, please come join us: http://www.timescale.com/careers. We have open positions in engineering (core database, R&D, customer success and Frontend) as well as content marketing, evangelism, recruitment, and partnership.
Cool. So where's the value here? How are you making money? Didn't understand that from the first 2 or 3 paragraphs. Seems like you have a lot of GitHub stars, but how about revenue?
We're following the classic "open core" model (similar to MongoDB, Elastic, Confluent), and we already have customers paying for features related to enterprise-grade reliability, performance, and support. We'll have more to share in the upcoming months.
We touched on this a bit elsewhere when someone asked about InfluxDB. One of the biggest advantages is that by being built on Postgres, we are able to support the full range of SQL/RDBMS features including secondary indices, complex WHERE predicates, CTEs, and more. Time-series data can also be easily stored alongside relational data, which makes it easier to simplify your stack.
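To make that concrete, a quick (hypothetical) example of the kind of query this enables -- hour-over-hour change in average temperature per device, using a CTE plus a window function:

    WITH hourly AS (
      SELECT time_bucket('1 hour', time) AS hour,
             device_id,
             avg(temperature) AS avg_temp
      FROM conditions
      GROUP BY hour, device_id
    )
    SELECT hour, device_id, avg_temp,
           avg_temp - lag(avg_temp) OVER (PARTITION BY device_id ORDER BY hour) AS change
    FROM hourly
    ORDER BY device_id, hour;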
Column store data warehouses have all those features, except indices which don’t make sense in the context of a columnar architecture. It seems to me that if you create a table in a conventional data warehouse with a partition key on a time column, you have all the advantages of TimescaleDB, plus all the advantages of an MPP columnar data warehouse. Am I missing something? Why should anyone use TimescaleDB instead of just a specific configuration of a data warehouse?
A lot of time-series data has complex relations and cross-correlations that make it unsuitable for column stores. Take for example a device that reports temperature and humidity. You may want to enforce foreign key relations between devices and readings, or you may want to do analysis for devices in a particular location (which may be part of the readings or devices table); secondary indexes help here. Or you may want easy access to all data where temperature is above 90 degrees (partial indexes help here). You often want this in a real-time dashboard where the kind of latencies you see with MPP are problematic. Pretty much the columnar/MPP stuff is good for simple aggregates for ad-hoc non-real-time queries, but TimescaleDB is geared to more complex analysis to power dashboards.
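To make the index part concrete (table and column names are hypothetical):

    -- secondary index for per-device, time-ordered lookups
    CREATE INDEX ON readings (device_id, time DESC);

    -- partial index: only rows above 90 degrees are indexed, keeping the index small
    CREATE INDEX ON readings (time DESC) WHERE temperature > 90;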
Plus, when you know which particular aggregates you want to regularly query, TimescaleDB makes it easy to perform continuous roll-ups that you can store in a separate (hyper)table.
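A minimal sketch of such a roll-up, assuming a readings hypertable and a separate readings_hourly table for the aggregates (today it's an ordinary INSERT ... SELECT that you schedule yourself):

    INSERT INTO readings_hourly (hour, device_id, avg_temp, max_temp)
    SELECT time_bucket('1 hour', time) AS hour,
           device_id,
           avg(temperature),
           max(temperature)
    FROM readings
    WHERE time >= now() - interval '1 hour'
    GROUP BY hour, device_id;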
> Pretty much the columnar/mpp stuff is good for simple aggregates for ad-hoc non-real-time queries
How is a column-oriented database which is primarily designed for fast performance across large data not good for real-time queries? Memsql, clickhouse, vertica, druid, etc are all fast systems that can scan terabytes in milliseconds with complex queries.
It's great that timescale provides new options and works with postgres but let's keep things accurate in a field that has plenty of confusion already.
Clickhouse is a very cool project but is more analytical than OLTP. Percona had a nice writeup[1] (the comparison to mysql is especially apt -- note the lack of real-time updates and deletes in Clickhouse).