Russ’ 10 Ingredient Recipe for Making 1 Million TPS on $5K Hardware (highscalability.com)
93 points by antman on July 28, 2013 | 42 comments



> context switches passing tcp-packets back and forth from the operating system were taking up 35%

You could get rid of this (and in doing so, double your TPS) by switching to a memory-polling-based network driver like PF_RING [1] (and obviously, keeping the kernel on its own core like you are doing).
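For the keep-the-kernel-on-its-own-core part, a minimal sketch of one way to do it (core numbering and binary name are assumed for illustration):

    # boot with isolcpus=1-7 on the kernel command line, then:
    taskset -c 1-7 ./server   # pin the app to the isolated cores, leaving core 0 to the kernel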

> Lookup data in memory (this is fast enough to happen in-thread)

> I knew I had it right when I watched the output of the “top” command,

Does all your data live in cache? If not, have you tried using perf [2] to measure load stalls? They are typically the bottleneck once you get past context switches. Hyperthreading should help at least somewhat here (do you have it enabled?).
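Concretely, a first look might be something like this (event names vary by CPU, and the PID here is made up):

    # count stalls and cache misses on the running server for 10 seconds
    perf stat -e cycles,instructions,cache-references,cache-misses,stalled-cycles-backend -p 1234 sleep 10

If stalled-cycles-backend is a large fraction of cycles, you're memory-bound rather than syscall-bound.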

[1] http://www.ntop.org/products/pf_ring/

[2] https://perf.wiki.kernel.org/


Of course, if you use Netmap/DPDK/PF_RING you have to bring your own TCP stack, which is more than many app developers are comfortable with.


Excellent point which I missed :) Can you tell I live in an L3 bubble?

I am surprised (from a quick Google) there is no open-source user-space PF_RING-aware TCP stack. Am I missing something?


When you're doing packet capture that normally costs one system call per packet, Netmap/DPDK/PF_RING are a clear win. But with TCP a single system call can send or receive many packets and TSO/LRO offloads help even more. I haven't seen any numbers about the benefits of kernel bypass for TCP; maybe SolarFlare has some.


I was about to point out Solarflare's OpenOnload. It's only compatible with an SFC NIC, though. I've seen about a 20% throughput increase with onload turned on.


No, I don't think you are, at least not for anything open source and production grade. I intend to do some work on this when I get a chance. There is a talk at EuroBSDcon this year on using the FreeBSD network stack in userspace, and the NetBSD stack is also an option; fixing locking and interrupt handling is the main issue with getting good performance from these stacks outside the kernel, as the environment is rather different.


>grafting AlchemyDB’s flexible data-modeling capabilities onto Aerospike’s high-velocity horizontally-scalable key-value data-fabric.

It reached an awesome KPS (keywords per sentence) count as well.


The T in TPS means transactions, right? What kind of transaction is one which uses no persistent storage?)

1 million network-to-memory writes, well, that is quite possible, but please do not call this a transaction in the sense meant by TPS.)

What was meant by transaction in the old days was an atomic operation which completes only after storing the data in persistent (usually direct-access, meaning no buffering by the OS kernel) storage, so that it can be read without corruption if a power loss occurs the very next second.


You can achieve durability with very high performance using write-ahead logging, lots of concurrent writers doing group commit for the log fsyncs, and a write-optimized data structure like an LSM tree or a fractal tree. Maybe not 1 million on a $5k server, but you can get a lot closer than I imagine you're picturing right now.
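To put rough, assumed numbers on it: if each log fsync takes 1 ms and 200 concurrent writers share every group commit, a single log device sustains on the order of 200,000 durable commits/sec; a battery-backed controller or SSD log pushes that further.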

In any case, what they seem to be measuring is a read-only, in-memory workload, so this is not that impressive. IIRC, InnoDB has no trouble pulling off something like this.


So, disk writes are necessary, after all?

Yes, there are lots of tricks, like placing that append-only physical transaction log on a different controller with a distinct storage device, etc. Data partitioning is another big idea. Having indexes in memory to avoid unnecessary reads, using collected statistics in a query optimizer, etc. But nothing beats partitioning based on actual workloads and separating tablespaces onto distinct hardware, including decoupling indexes from the tables - this is what DBAs were for.

I used to be an Informix DBA in the good old days, so I can't help but smile when I look at MySQL (well, they added lots of partitioning options in recent InnoDB - things Informix could do out of the box 12 years ago), let alone modern NoFsync "databases".)

Btw, not all NoSQL guys are insane.) Riak with the LevelDB storage backend is a very sane approach, one which cares about and counts writes.


Of course they are; the trick is to get the most utility out of each one. I work at Tokutek, where we use a data structure that does this, in the sense that when your working set is larger than RAM, we still don't incur very many I/Os for writes. If you want durability, there's nothing you can do about the logging fsyncs except buy yourself a nice battery-backed disk controller.

Partitioning is often a bad idea because it messes with your queries. I don't know what's new about partitioning in InnoDB, but I think it's generally a symptom of the over-use of B-trees, which don't try to do anything smart about random writes. The change buffer is a decent idea, but it's just a stopgap: when you have enough data it doesn't make a dent any more. A better idea is to use a data structure that can handle lots of writes.

I have just started learning about Riak, and from what I understand, they need to do a query (so, a disk seek) on every insert (to calculate something with vector clocks), so they aren't actually using the write optimization that LevelDB's LSM-trees can provide. I don't actually expect it to provide that fantastic performance, but I should admit I haven't run it yet. Maybe they're more interested in the compression LevelDB gives them.

Shameless plug time! http://www.tokutek.com/2011/09/write-optimization-myths-comp...


There's also RethinkDB, which seems to be focused on the D in ACID while being a non-relational database. When you really need performance, relationships/joins generally need to go out the window as much as possible, and often one or more of the letters in ACID get compromised.

It should get very interesting in the next couple of years... of course, most environments don't need the kind of performance or scale that these systems are really offering.

IIRC StackOverflow ran for a very long time on a single server, under some pretty serious demand. In some cases SQL with a caching layer for mostly-read data can be better... other scenarios tend to fit a document (non-relational) data store better... it just depends.


I see RethinkDB as being focused more on the data model and language, and the cluster administration experience. The performance doesn't seem compelling yet, though they are admirably durable by default.

You're right, with a reliably performant engine you can get a lot more out of a single machine than a lot of people these days seem to think. That's part of our vision for TokuMX, to bring back a little bit of "scale up" potential to the NoSQL space.


Ingredient 0: redefine “transaction” to mean nothing remotely like what it means to everyone else.


Which part of the article makes you think that the operations being made are not ACID?


> 100% in-memory workload. Don’t even think about hitting disk for 0.0001% of these requests.

They also discuss how these are 1M read requests.


Regarding point 9, "IRQ affinity from the NIC".

If one would like to delve a bit deeper into this, there is some info on the techniques introduced in Linux here: http://lxr.linux.no/linux/Documentation/networking/scaling.t...
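As a minimal sketch of manual IRQ pinning (NIC name, IRQ number, and CPU mask are all assumed here; stop irqbalance first or it will rewrite the mask):

    # find the IRQ(s) assigned to the NIC's queues
    grep eth0 /proc/interrupts
    # pin IRQ 77 to CPU 2 (hex bitmask 0x4)
    echo 4 > /proc/irq/77/smp_affinity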


I'd have to guess that they're limiting themselves to a specific platform (hardware & software). Probably selling an appliance would be the best way for them, instead of dealing with sysadmins who don't understand how to properly configure the database on their machines.


Somewhat off topic, but this article perfectly shows how difficult taking full advantage of a multi-core environment truly is, and how lacking a lot of our tools are. Granted, this is an extreme example.


Our tools for something like this aren't great because almost nobody needs to do 1M reads per second, and of those who do, only a tiny fraction have a working set that can fit economically in memory and is read-only, like the system in this article.


I was one of the people who commented on the original post. Since then I've had little success in getting any of these recommendations to work.

Does anyone have some actual configuration examples to provide for things like setting IRQ affinity?


The author has moved to Aerospike. Here are some benchmarks from a few months back: http://www.infoq.com/news/2013/04/NoSQL-Benchmark


Is there a domain where this is applicable outside of high performance trading? I'm trying to think of another use case that would legitimately generate 1M queries/sec.


Some of the larger battles (5000+ ships) in EVE Online might do that.

(I single EVE out because other MMOs generally shard by user-cohort, so having that number of people on one shard is impossible. EVE, meanwhile, shards by location within the virtual world (each star system is a shard), so the entire player-base can "gather" on a single shard for a confrontation.)


> In February 2013, EVE Online reached over 500,000 subscribers.

So what, the entire game userbase needs to be awake and in the same spot? :0)




I've never played EVE Online, but from what I've read about it, the engineering behind it seems interesting. Have any suggestions for dev blog posts from their team?


Intrusion prevention systems must legitimately handle upwards of 15M queries/sec when under attack (on a 10 Gbps link).


I'm curious, can you elaborate on this? What would they be querying? What kind of intrusion and what kind of attack?


TCP flow state, IP shunning & reputation, client rate limiting, others I'm probably missing.

Specifically during DDoS attacks, an IPS must usefully distinguish bad from good traffic at packet rates saturating a link. This inevitably involves maintaining lots of per-client and per-IP state. Caching is of little help precisely due to the distributed nature of the attack, and you can't shed load since that only helps the attacker.

In this situation, memory stalls become your biggest bottleneck – each one can eat on the order of 10% of your processing budget in a run-to-completion (RTC) design. The only solution (beyond trickier data layouts) is memory latency hiding via micro- or hyperthreading (kernel context switches are just too slow). Rearchitecting an RTC design into a micro-threaded model is a lot of work, and bug-prone. Hyperthreading gives you latency hiding "for free", if the silicon supports it.
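To make the ~10% figure concrete with assumed numbers: at 15M packets/sec spread across 8 cores at 3 GHz, each packet gets a budget of roughly 1,600 cycles, so a single DRAM miss costing ~200 cycles eats more than a tenth of it.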

Generally you want between two and four micro- or hyperthreads. A second micro/hyperthread will generally just help keep the pipeline busy across memory stalls; hence you can eke out extra performance with a third or a fourth. Intel chips support only two hyperthreads per core (when they support it at all, and the OS does too). Some more specialized processors support more.


A few of the points are applicable at much lower QPS than 1 million, especially in a virtualized environment such as EC2.

For example, I've had some issues with CPU (on core 0) saturation for soft interrupts on EC2, related to processing network packets.
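If you can't move the IRQs themselves (common on EC2), receive packet steering (RPS, from the scaling.txt doc linked elsewhere in this thread) can at least spread the softirq work off core 0. A sketch, with device name and CPU mask assumed:

    # steer receive processing for eth0's first rx queue onto CPUs 1-3 (mask 0xe)
    echo e > /sys/class/net/eth0/queues/rx-0/rps_cpus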


Ad networks.


TPS as in "Torsonic Polarity Syndrome"?


Transactions per second, I think.


How often is OS/network/context-switch overhead the bottleneck? In my experience, most of the time it's the DB. Even if it fits in RAM, complex queries always take most of the time. (Web dev here.)


That's kind of the key argument for NoSQL data stores... in many scenarios a key/value store is better: a single key looks up a record and all the necessary related data.

Some hybrid stores like MongoDB, RethinkDB and others offer characteristics more similar to a traditional SQL RDBMS, while offering horizontal scaling. It's when you have several join operations that performance really takes a hit under significant load. You can't really scale a relational database in the same way.

That said, as much as I like non-relational databases, they aren't the best fit for every use case. Beyond this, most situations don't need that type of performance scaling.


Pretty often, if you're doing simple queries. From there, you can either increase the size of the query or make it more complex (which increases its computational cost) to amortize away the overhead. It always depends on the workload, but cases like this, where the overhead is significant, are fairly common.
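For instance, with assumed numbers: if per-query overhead is 50 µs and execution is 10 µs, overhead is ~83% of each round trip; batch 10 lookups per query and the overhead per lookup drops to 5 µs, about a third of the total.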


Gee, that's spectacular. Now...

If only they would do the reader the simple, and most gracious service of defining precisely what they mean by this obscure "TPS" acronym.

...and before you downvote this comment (because I can smell your itchy little fingers all the way from the other side of the internet), yes, I can assure you that I did actually Google for the answer. And yes, I did discern what is meant by TPS.

The point isn't that I'm a lazy slacker and/or an ignorant yokel because I didn't already know the meaning of the abbreviation innately, and feel inconvenienced by having to open another browser window and search for some clue.

The point is that the author is assuming everyone will immediately know and understand that acronym, but meanwhile, when I conduct my search, I am forced to assume that my chosen definition is correct, without actually knowing for sure.

And for that reason, I'm going to leave out the meaning I've chosen as the author's intended definition for TPS. I have no way of knowing whether my assumption was accurate. So the mystery persists. What does TPS actually mean? Go search for it, you lazy, ignorant slacker.


>The point is that the author is assuming everyone will immediately know and understand that acronym, but meanwhile, when I conduct my search, I am forced to assume that my chosen definition is correct, without actually knowing for sure.

If you don't know what the acronym means, then the article is not meant for you. The author did not write it nor intend it as a general introduction for newcomers to learn the basics.

It's about some specific techniques in a specific field.

By your logic, why stop at explaining TPS? He would also have to spell out IRQ, explain what IRQ interrupts are, what taskset is, what it means to 'pin' a process, what NoSQL means, what this Redis thing he mentions is, etc.

(That the author of TFA abuses the acronym is beside the point).


There's little excuse for laziness when pursuing new knowledge. A two second Google search and a new tab is not a burden. If you didn't get the answer first time, you should qualify the search with one of the other keywords from the article.

A certain level of reader sophistication is assumed, and not everything is explained with a high level of hand-holding. It's a technical tutorial, not a beginner's tutorial.

TPS = transactions per second.


In general you'd be right, but in this specific case there's a fair bit of ambiguity as to what "transaction" is taken to mean. A definition or two wouldn't have gone amiss here.




