
Even at 300k points per second, PostgreSQL or MySQL can easily service your needs. The scheme that this article presents is really about batching writes to disk to get high throughput.

    mysql> CREATE TABLE test.polls_memory ( id INT(11) PRIMARY KEY AUTO_INCREMENT, poll_id INT(11), time DOUBLE, value INT(11)) ENGINE MEMORY;
    mysql> CREATE TABLE test.polls_disk (id INT(11) PRIMARY KEY AUTO_INCREMENT, poll_id INT(11), time DOUBLE, value INT(11)) ENGINE MyISAM;
     
    #> for i in `seq 1 500000`; do echo "INSERT INTO test.polls_memory (poll_id,time,value) VALUES ($i,UNIX_TIMESTAMP(),$RANDOM);" ; done | mysql -uroot
   
    mysql> INSERT INTO test.polls_disk SELECT * from test.polls_memory;
    Query OK, 510001 rows affected (0.95 sec)
    Records: 510001  Duplicates: 0  Warnings: 0
And this was just on my laptop. I know that on enterprise-grade hardware I can get that write rate up into the millions per second. The question isn't getting it to disk in batches. It's the read patterns you can support once you decide you are going to batch your writes.



Yet your measurement includes no networking, no parsing of statements, no concurrency...

Repeat the experiment with 200 different clients hitting the server over the network, each issuing 1,500 insert statements per second. Then you can compare MySQL with the InfluxDB use case cited as an example in the article.
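
Something along these lines would be a rough approximation (just a sketch, reusing the tables from the grandparent comment; each backgrounded subshell is one client pushing 1,500 inserts over its own connection, though it still only exercises a local socket and fires each batch once rather than once per second):

    #> for c in `seq 1 200`; do ( for i in `seq 1 1500`; do echo "INSERT INTO test.polls_disk (poll_id,time,value) VALUES ($i,UNIX_TIMESTAMP(),$RANDOM);"; done | mysql -uroot ) & done; wait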

And don't forget the index on the timestamp.


The whole point was that if you are going to batch writes to disk to get high throughput, then PostgreSQL or MySQL can easily sustain the same rates as InfluxDB (which my little example shows). I wouldn't have MySQL parse 500,000 insert statements per second, but I would have a proxy in front of it that could take in 500,000 data packets per second and then at some defined interval write them to MySQL.
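
To make that concrete, here's roughly what such a (hypothetical) proxy would flush to MySQL on each interval: one multi-row INSERT per batch, using the table from the example above. In practice each statement would carry thousands of rows, not three; this is just to show the shape of it.

    -- one batch flushed by the proxy at the end of an interval
    INSERT INTO test.polls_disk (poll_id, time, value) VALUES
      (1, UNIX_TIMESTAMP(), 17533),
      (2, UNIX_TIMESTAMP(), 2201),
      (3, UNIX_TIMESTAMP(), 30944);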

I know it's not as sexy as InfluxDB, but the old workhorses are tried and true, and if you use them right they are as good as, if not better than, some of the shiny new toys.


> I would have a proxy in front of it that could take in 500,000 data packets per second

Which is pretty much exactly what timeriver is :) (That, and it can also do Graphite-like functions, e.g. movingAverage(), etc.)


Can you time the copy again, with an index on the timestamp this time? Just out of curiosity.

Also, I'm not sure it's wise to build system software (a proxy) just to log data instead of using what already exists, because that proxy you write will not be an old workhorse either.


So I added an index on (poll_id,time), since that's a more likely index for various reasons.

    mysql> ALTER TABLE polls_disk ADD INDEX `timestamp` (poll_id,time);
    mysql> INSERT INTO test.polls_disk SELECT * from test.polls_memory;
    Query OK, 510001 rows affected (1.78 sec)
    Records: 510001  Duplicates: 0  Warnings: 0
So the index takes me down to about 286k points per second. Again, this is just on my laptop, but it illustrates a key point: I can write stuff down to disk super fast (especially if I can batch it), but reading it back out is where the problems are.
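
For what it's worth, the kind of read that (poll_id,time) index serves well is a range scan over one series; a cross-series aggregate for a time window gets little help from it. Two hypothetical queries against the benchmark table, just to show the access patterns:

    -- served well by the (poll_id,time) index: one series, one time range
    SELECT time, value
    FROM test.polls_disk
    WHERE poll_id = 42
      AND time BETWEEN UNIX_TIMESTAMP() - 3600 AND UNIX_TIMESTAMP();

    -- not served well by it: an aggregate across all series for a window
    SELECT poll_id, AVG(value)
    FROM test.polls_disk
    WHERE time BETWEEN UNIX_TIMESTAMP() - 3600 AND UNIX_TIMESTAMP()
    GROUP BY poll_id;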

As far as a proxy in front of the datastore goes, batching up data packets and handing them off to the data store is several orders of magnitude less complex than building a really well-designed data store, so the question is where I want to custom-roll software and where I want to leverage the broader community. I'm not saying there isn't a place for the new tools, but I think we jump to them much more quickly than we should in a lot of situations.


Creating a time-series table is easy. However, you will quickly experience several challenges:

(1) Table scans to extract just a few columns will be extremely inefficient. It's better to create one table per column, each being a pair [time, value].

(2) B-tree indexes will be inefficient for large extracts. For example, aggregating by month for one year will basically not use the indexes. The index will mostly only be useful for looking up individual data points, which is not what you usually need. The new BRIN indexes in 9.5 should help, however.

(3) You'll end up reaching for something like Postgres' partitioned tables, which let you create child tables that are visible from the top-level table (see the sketch after this list). Each table needs to be sharded by something like the current day, week, or month.

(4) You say you need a proxy to batch data packets. So it's a queue, and it needs persistence because you don't want to lose data if Postgres is lagging behind; at that point your proxy is itself becoming a time series database. And if Postgres is consistently lagging behind, you have a problem: if you're writing at 600k/s and Postgres can only handle 500k/s, you have a never-ending queue.

(5) PostgreSQL can only have one master, and its write capability has a fixed ceiling (for HDDs you can calculate the max possible tps from the rotational speed, not sure about SSDs). At some point you'll end up sharding (by column, probably) the master into multiple concurrent masters.
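
As a rough sketch of what (2) and (3) look like in practice in PostgreSQL, using the inheritance style of child tables plus a BRIN index (the table and column names just mirror the MySQL example upthread):

    -- top-level table
    CREATE TABLE polls (poll_id int, time double precision, value int);

    -- one child table per month, visible through the parent
    CREATE TABLE polls_2015_09 (
        CHECK (time >= 1441065600 AND time < 1443657600)  -- September 2015 in unix time
    ) INHERITS (polls);

    -- BRIN index (PostgreSQL 9.5+) on the child: tiny, and well suited to
    -- data that arrives in roughly time order
    CREATE INDEX polls_2015_09_time_brin ON polls_2015_09 USING brin (time);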

Much of the point of something like InfluxDB is that there's no single bottleneck. You scale linearly by adding more nodes, each of which can write data at top speed. You don't need a "proxy" because it's designed to perform that very same function.


Only reading or only writing is really simple compared to sustaining a simultaneous high load of both reads and writes, especially while flushing to disk in a way that doesn't lose accepted data on power loss.

That doesn't mean most people can't do well with relational databases, since most people simply don't need both at the same time.


Try running that benchmark again on a table that already contains 500M records. I'm quite sure you'll see different numbers. Another problem you'll run into is reading back those records, and performing aggregations and window functions on your data. In my experience PostgreSQL is not a bad solution for time series data, but it's certainly not a perfect fit.
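
For example (purely illustrative, against the sample table from the benchmark upthread), a Graphite-style moving average comes out as a window function (PostgreSQL, or MySQL 8.0+), and over hundreds of millions of rows that means a large sort or scan unless the layout and indexes cooperate:

    -- 10-point moving average per poll_id
    SELECT poll_id, time,
           AVG(value) OVER (PARTITION BY poll_id
                            ORDER BY time
                            ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) AS moving_avg
    FROM polls_disk;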


If I take the index off, the benchmark will run just about as fast at 500M rows as it did at 0 rows. With the index on, MySQL's B-tree starts to fall over around 500M rows, but I have plenty of options to mitigate that. I don't think anything is a perfect fit for time series data, but again, my point was that the traditional relational databases are _really_ good, and they do in fact scale up well beyond what most people think they will.
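
One of those options (a sketch only, not something I've benchmarked here) is MySQL's range partitioning, so each partition's B-tree stays small. Note the partition key has to be an integer expression and part of the primary key, so `time` becomes an INT in this version of the table:

    CREATE TABLE test.polls_part (
        id INT NOT NULL AUTO_INCREMENT,
        poll_id INT,
        time INT,
        value INT,
        PRIMARY KEY (id, time),
        KEY poll_time (poll_id, time)
    ) ENGINE=MyISAM
    PARTITION BY RANGE (time) (
        PARTITION p2015_09 VALUES LESS THAN (1443657600),  -- before Oct 2015
        PARTITION p2015_10 VALUES LESS THAN (1446336000),  -- before Nov 2015
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );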



