The T in TPS means Transactions, right? What kind of transaction is one that uses no persistent storage?)

1 million network-to-memory writes, well, that is quite possible, but please do not call this a transaction in the sense meant by TPS.)

What was meant by a transaction in the old days was an atomic operation that completes only after the data has been stored in persistent (usually direct-access, meaning no buffering by the OS kernel) storage, so it can be read back without corruption even if power is lost the very next second.
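
A tiny sketch (mine, not from the comment above) of that contract in Go: the write only counts as a committed transaction once fsync has pushed it past the kernel's buffers to stable storage, so the acknowledgement happens strictly after Sync returns. The file name and record contents are made up for illustration.

    package main

    import (
        "fmt"
        "os"
    )

    // commit appends one record and returns only after it is durable on disk.
    func commit(f *os.File, record []byte) error {
        if _, err := f.Write(record); err != nil {
            return err
        }
        // Without this, the record may still be sitting in kernel buffers and
        // would be lost on power failure even though Write already returned.
        return f.Sync()
    }

    func main() {
        f, err := os.OpenFile("txn.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
        if err != nil {
            panic(err)
        }
        defer f.Close()
        if err := commit(f, []byte("debit=100 credit=100\n")); err != nil {
            panic(err)
        }
        fmt.Println("acknowledged only after the data is durable")
    }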




You can achieve durability with very high performance using write-ahead logging, lots of concurrent writers doing group commit for the log fsyncs, and a write-optimized data structure like an LSM tree or a fractal tree. Maybe not 1 million on a $5k server, but you can get a lot closer than I imagine you're picturing right now.
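
For illustration, here is a rough Go sketch of group commit over a write-ahead log along the lines described above: many concurrent writers block on a shared batch, and a single fsync makes the whole batch durable at once. The structure, names, and timing are my own assumptions, not any particular engine's implementation.

    package main

    import (
        "fmt"
        "os"
        "sync"
        "time"
    )

    // walLog batches records from many writers and fsyncs them together.
    type walLog struct {
        mu      sync.Mutex
        pending [][]byte      // records waiting for the next fsync
        done    chan struct{} // closed once the current batch is durable
        file    *os.File
    }

    func openWAL(path string) (*walLog, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
        if err != nil {
            return nil, err
        }
        w := &walLog{done: make(chan struct{}), file: f}
        go w.committer()
        return w, nil
    }

    // Commit enqueues one record and returns only after it is on disk.
    func (w *walLog) Commit(rec []byte) {
        w.mu.Lock()
        w.pending = append(w.pending, rec)
        done := w.done // the channel tied to the batch this record joined
        w.mu.Unlock()
        <-done // many concurrent callers share one fsync
    }

    // committer flushes whatever has accumulated, paying one fsync per batch.
    func (w *walLog) committer() {
        for {
            time.Sleep(time.Millisecond) // small window to let a batch form
            w.mu.Lock()
            if len(w.pending) == 0 {
                w.mu.Unlock()
                continue
            }
            batch, done := w.pending, w.done
            w.pending, w.done = nil, make(chan struct{})
            w.mu.Unlock()

            for _, r := range batch {
                w.file.Write(r) // error handling omitted in this sketch
            }
            w.file.Sync() // one fsync covers every record in the batch
            close(done)   // release all writers waiting on this batch
        }
    }

    func main() {
        w, err := openWAL("wal.log")
        if err != nil {
            panic(err)
        }
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                w.Commit([]byte(fmt.Sprintf("txn %d\n", i)))
            }(i)
        }
        wg.Wait()
    }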

In any case, what they seem to be measuring is a read-only, in-memory workload, so this is not that impressive. IIRC, InnoDB has no trouble pulling off something like this.


So, disk writes are necessary, after all?

Yes, there are lots of tricks, like placing that append-only physical transaction log on a different controller with a distinct storage device, etc. Data partitioning is another big idea. Having indexes in memory to avoid unnecessary reads, using collected statistics in a query optimizer, etc. But nothing beats partitioning based on actual workloads and separating tablespaces onto distinct hardware, including decoupling indexes from the tables - this is what DBAs were for.
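
As a toy illustration of that last idea, here is a sketch (invented paths, key ranges, and names, not any real schema) of range-partitioning one table across tablespaces on separate devices, with the index decoupled onto its own device - the kind of layout a DBA would tune by hand from the observed workload.

    package main

    import "fmt"

    type partition struct {
        maxKey     int    // rows with key <= maxKey land in this partition
        tablespace string // each tablespace lives on its own controller/device
    }

    // Ranges and paths are invented; hot recent rows get fast storage,
    // cold history goes elsewhere.
    var orderParts = []partition{
        {maxKey: 1_000_000, tablespace: "/data/ssd0/orders_p0"},
        {maxKey: 5_000_000, tablespace: "/data/hdd1/orders_p1"},
        {maxKey: 50_000_000, tablespace: "/data/hdd2/orders_p2"},
    }

    // The index is decoupled from the table onto separate hardware.
    const orderIndexSpace = "/data/ssd1/orders_idx"

    func tablespaceFor(key int) string {
        for _, p := range orderParts {
            if key <= p.maxKey {
                return p.tablespace
            }
        }
        return orderParts[len(orderParts)-1].tablespace // overflow goes to the last partition
    }

    func main() {
        for _, k := range []int{42, 2_500_000, 49_000_000} {
            fmt.Printf("order %d -> %s (index on %s)\n", k, tablespaceFor(k), orderIndexSpace)
        }
    }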

I used to be an Informix DBA in the good old days, so I can't help but smile when I look at MySQL (well, they added lots of partitioning options in recent InnoDB - things Informix could do out of the box 12 years ago), let alone modern NoFsync "databases".)

Btw, not all NoSQL guys are insane.) Riak with the LevelDB storage backend is a very sane approach, one that cares about and counts writes.


Of course they are; the trick is to get the most utility out of each one. I work at Tokutek, where we use a data structure that does this, in the sense that when your working set is larger than RAM, we still don't incur very many I/Os for writes. If you want durability, there's nothing you can do about the logging fsyncs except buy yourself a nice battery-backed disk controller.

Partitioning is often a bad idea because it messes with your queries. I don't know what's new about partitioning in InnoDB, but I think it's generally a symptom of the overuse of B-trees, which don't try to do anything smart about random writes. The change buffer is a decent idea, but it's just a stopgap; when you have enough data it doesn't make a dent any more. A better idea is to use a data structure that can handle lots of writes.

I have just started learning about Riak, and from what I understand, they need to do a query (so, a disk seek) on every insert (to calculate something with vector clocks), so they aren't actually getting the write optimization that LevelDB's LSM trees can provide. I don't actually think it should deliver such fantastic performance, but I admit I haven't run it yet. Maybe they're more interested in the compression LevelDB gives them.
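
To make the contrast concrete, here is a toy sketch (my own, not Riak's or LevelDB's code) of a blind LSM-style insert versus a read-before-write insert that must fetch the old object to carry version metadata forward. The map stands in for the on-disk structure; in a real engine that extra lookup on the write path is what costs the seek.

    package main

    import "fmt"

    type versioned struct {
        value   []byte
        version int // stand-in for a vector clock
    }

    type store map[string]versioned

    // blindPut writes without reading anything first -- the case an LSM tree
    // optimizes for, since it needs no lookup on the write path.
    func blindPut(s store, key string, val []byte) {
        s[key] = versioned{value: val, version: 1}
    }

    // readMergePut must read the old object before writing, to merge its
    // version -- one extra lookup (potentially one disk seek) per insert.
    func readMergePut(s store, key string, val []byte) {
        old, ok := s[key] // the read that defeats the write optimization
        v := 1
        if ok {
            v = old.version + 1
        }
        s[key] = versioned{value: val, version: v}
    }

    func main() {
        s := store{}
        blindPut(s, "a", []byte("x"))
        readMergePut(s, "a", []byte("y"))
        fmt.Printf("%+v\n", s["a"])
    }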

Shameless plug time! http://www.tokutek.com/2011/09/write-optimization-myths-comp...


There's also RethinkDB, which seems to be focused on the D in ACID while being a non-relational database. When you really need performance, relationships/joins generally need to go out the window as much as possible, and often one or more of the letters in ACID are compromised.

It should get very interesting in the next couple of years... of course, MOST environments don't need the kind of performance or scale that these systems are really offering.

IIRC StackOverflow ran for a very long time on a single server, under some pretty serious demand. In some cases SQL with a caching system for mostly-read data can be better... other scenarios tend to fit a document (non-relational) data store better... it just depends.
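
For the mostly-read case, a minimal cache-aside sketch (illustrative only; loadFromDB stands in for the real SQL query): reads are served from memory after the first miss, and writes just invalidate the cached entry.

    package main

    import (
        "fmt"
        "sync"
    )

    type cache struct {
        mu   sync.RWMutex
        data map[string]string
    }

    func newCache() *cache { return &cache{data: map[string]string{}} }

    // Get returns the cached value, falling back to the database on a miss
    // and filling the cache so later readers are served from memory.
    func (c *cache) Get(key string, loadFromDB func(string) string) string {
        c.mu.RLock()
        v, ok := c.data[key]
        c.mu.RUnlock()
        if ok {
            return v
        }
        v = loadFromDB(key) // only misses hit the database
        c.mu.Lock()
        c.data[key] = v
        c.mu.Unlock()
        return v
    }

    // Invalidate drops a key after a write so readers don't see stale data.
    func (c *cache) Invalidate(key string) {
        c.mu.Lock()
        delete(c.data, key)
        c.mu.Unlock()
    }

    func main() {
        c := newCache()
        db := func(k string) string { return "value-for-" + k }
        fmt.Println(c.Get("question:42", db)) // miss: goes to the database
        fmt.Println(c.Get("question:42", db)) // hit: served from memory
    }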


I see RethinkDB as being focused more on the data model and language, and on the cluster administration experience. The performance doesn't seem compelling yet, though they are admirably durable by default.

You're right: with a reliably performant engine you can get a lot more out of a single machine than a lot of people these days seem to think. That's part of our vision for TokuMX: to bring back a little bit of "scale up" potential to the NoSQL space.



