So I added an index on (poll_id,time) since thats a more likely index for variou...

atombender · on Oct 8, 2015

Creating a time-series table is easy. However, you will quickly experience several challenges:

(1) Table scans to extract just a few columns will be extremely inefficient. It's better to create one table per column, each being a pair [time, value].

(2) B-tree indexes will be inefficient you for large extracts. For example, aggregating by month for one year will basically not use the indexes. The index will only mostly useful for looking up individual metrics, which is not useful. The new BRIN indexes in 9.5 should help, however.

(3) You'll end up reaching for something like Postgres' partitioned tables, which allows you to create child tables that are visible from the top-level table. Each table needs to be sharded by something like the current day, or week, or month.

(4) You say you need a proxy to batch data packets. So it's queuing, and needs persistence because you don't want to lose data if Postgres is lagging behind — your proxy is now a time series database! However, if Postgres is lagging behind, you have a problem. If you're writing at 600k/s and Postgres can only handle 500k/s, you have a never-ending queue.

(5) PostgreSQL can only have one master, and its write capability has a fixed ceiling (for HDDs you can calculate the max possible tps from the rotational speed, not sure about SSDs). At some point you'll end up sharding (by column, probably) the master into multiple concurrent masters.

Much of the point of something like InfluxDB is that there's no single bottleneck. You scale linearly by adding more nodes, each of which can write data at top speed. You don't need a "proxy" because it's designed to perform that very same function.

yxhuvud · on Oct 7, 2015

Both only reading or only writing is really simple, compared to when simultaneous high load on both reads and writes are needed, especially if at the same time as things are flushed to disk in a way that doesn't lose accepted data on power down.

That doesn't mean most people can't do well on relational databases as most people simply doesn't need both at the same time.

ciconia · on Oct 7, 2015

Try running that benchmark again on a table that already contains 500M records. I'm quite sure you'll see different numbers. Another problem you'll run into is reading back those records, and performing aggregations and window functions on your data. In my experience Postgresql is not a bad solution for time series data, but it's certainly not a perfect fit.

dkhenry · on Oct 7, 2015

If I take the index off the benchmark will run just about as fast for 500M rows as it did for 0 rows, at 500M rows MySQL's BTREE index starts to fall over, but I have plenty of options to mitigate that. I don't think anything is a perfect fit for time series data, but again my point was that the traditional relational database's are _really_ good, and they do in fact scale up well beyond what most people think they will.