Fast Servers (sdf1.org)
146 points by mr_tyzic on Jan 9, 2016 | 53 comments



Okay, this is a really cool post, but I have a small bit of criticism - the code samples are really hard to read. I'd recommend, at minimum, adding a bit more whitespace so you don't end up with lines like this:

    if(e[i].events&(EPOLLRDHUP|EPOLLHUP))close(e[i].data.fd);
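
For example, even a little whitespace makes the same line far easier to scan:

    if (e[i].events & (EPOLLRDHUP | EPOLLHUP))
        close(e[i].data.fd);
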
Despite that minor criticism - pretty cool stuff!


I thought it was misformatting, but wow, look at his code: https://github.com/geocar/dash/blob/master/d.c


I've been known to write things for other people in other ways at times[1][2], but I'll usually start out writing this way[3], then expand it out, simply to avoid the recoil that you just experienced.

I actually find working this way very helpful for finding bugs and noticing redundancies, and if you're curious about techniques that improve the programmer, this method is worth some study.

[1]: https://github.com/geocar/mmap/blob/master/mmap.cpp

[2]: https://github.com/geocar/cdp-tools/blob/master/cdp-listen.c

[3]: https://github.com/geocar/con/blob/master/con.c


But what is that method? To my untrained eye it just looks obfuscated.


I am still working on a good explanation. Here's what I've come up with so far:

I have noticed every page I scroll causes a comprehension loss of around 90%, so in reading something that is 10 pagefuls long, I might only be able to reproduce a tiny part of the program.

I find that by not scrolling, and just moving my eyes, I rapidly absorb the program, and I find most bugs just by reading the code. This practice is absolutely impossible for me if I have to scroll very far, and is made difficult by any scrolling at all.

A lot of J/K/Q/KDB programmers do this because it is similar to APL, but I don't have an APL background. I simply do not know how to do this except to write horizontally.

I learned a lot about this method by studying Arthur's code (example[1]). He frequently recommends Iverson's paper[2]. I just try to make every character count because it's still taking up space.

Maybe keeping that in mind will help you read each word carefully -- maybe even out loud, if it helps.

[1]: http://www.nsl.com/papers/origins.htm

[2]: http://www.jsoftware.com/papers/tot.htm


To me it looks machine generated, and it would be hard for me to mentally keep track and tick off "okay, I know what the lines above do" - but if it works for you, and the people you work with are fine with it, then good for you, and I don't mean that sarcastically.

The way I work is similar to most people I know, but perhaps with less screen real estate: I can fit a hundred vertical lines of one text buffer comfortably on my monitors (plural), and I can have about a dozen buffers up at the same time. If needed, that can be the same document scrolled to different areas. I also use a tool like cscope, or for some languages an IDE, to get additional context (type info, available operations), and that's the only way for me to quickly comprehend truly large bodies of code (kernels, etc.). I don't think your strategy would work well for the projects I work on.


We will see.

One of the fastest databases in the world was written this way (kdb), and I was able to make an operating system kernel this way (kparc), and now a dynamic web server, and so on.

Maybe it would help you keep an open mind to know that it took some time to learn how to read and write code this way?

Perhaps it would also help to know it has actually improved my ability to work on code that looks more like what you are used to.

One of the hardest things to do is to look at something so alien, and suppress the recoil and discomfort and really be critical:

Here is something you cannot read; something others can do that you cannot.

Don't you want to figure out how to do something that others can do?

Maybe you will find other limitations: its alien nature is certainly a limitation, but maybe you're right and there are others.

But in the meantime, it produces small, fast, and correct code quickly, and that's pretty amazing. Maybe there are other benefits as well.

We will see.


I guess it seems like I'm bashing you by expressing shock but I don't mean that. Seriously, more power to you for expressing yourself in less space than I can or desire to.


I don't think you're bashing me.

I'm telling you the shock is normal, and that this is something you can learn to do.

However, the lack of desire surprises me: if someone does something I cannot do, and it is beneficial, I want to learn it.

Here I point to smaller, faster, and more-correct programs, written in less time. Why would any programmer not want to be able to do that?

I never understood Paul Graham's Blub essay either, though.


Seems like the author is a q programmer so I'm not surprised everything is mashed together.


ClangFormat is your friend ( http://clang.llvm.org/docs/ClangFormat.html )


I cannot second this hard enough.


I can't completely understand the main message of what is proposed here.

Using multiple threads, where one thread accepts connections and others process them, is already quite standard (e.g. look at how Netty for Java works, with a boss thread and worker threads).

However, the pattern won't work with blocking IO as suggested on the referenced page if your worker thread has to handle multiple connections. Even if poll tells you the connection is readable, there might not be enough data for a complete request - so you need a state machine for reading again. Or you block until you have read a complete request, and thereby block other connections that should be served by the same thread (pool). And if you block on writing responses, then one slow connection will block the processing of the others.
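
A minimal sketch of that per-connection read state machine (names are hypothetical; request_complete() stands in for whatever protocol-specific check decides a full request has been buffered):

    /* hypothetical per-connection state for nonblocking reads          */
    /* needs <unistd.h>, <errno.h>                                      */
    struct conn {
        int    fd;
        char   buf[4096];
        size_t got;                                   /* bytes buffered so far */
    };

    /* 1 = complete request buffered, 0 = wait for more data, -1 = error/EOF */
    static int feed(struct conn *c)
    {
        ssize_t n = read(c->fd, c->buf + c->got, sizeof(c->buf) - c->got);
        if (n == 0) return -1;                        /* peer closed */
        if (n < 0)  return errno == EAGAIN ? 0 : -1;  /* would block vs. real error */
        c->got += (size_t)n;
        return request_complete(c->buf, c->got);      /* protocol-specific, hypothetical */
    }
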

It should also be considered that not all network protocols follow a pure request -> response model. If the server may send responses asynchronously (out of order), or if there is multicast/broadcast support in the protocol, the requirements on the software architecture look different.


There is nothing fundamentally new described in the blog post, although pinning threads to cores is almost never the default operation mode (though many popular servers expose an option for turning this on).

As someone else stated in another post, SO_REUSEPORT and accept4() are, all things considered, the best way to accept connections across multiple threads. Again, most modern servers support this by default if the underlying operating system supports it (e.g. nginx).

By the way, Aerospike accepts new connections in a single thread and then adds the socket FD directly to one of the I/O threads (round-robin selection scheme) using e.g. epoll_ctl(fd, EPOLL_CTL_ADD, ..).
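
Roughly that shape, as a sketch (listenfd, worker_epfd[] and nworkers are hypothetical names, not Aerospike's actual code; error handling omitted):

    /* sketch: accept in one thread, hand each socket to a worker's epoll round-robin */
    int next = 0;
    for (;;) {
        int fd = accept4(listenfd, NULL, NULL, SOCK_NONBLOCK);
        if (fd < 0)
            continue;
        struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP, .data.fd = fd };
        epoll_ctl(worker_epfd[next], EPOLL_CTL_ADD, fd, &ev);  /* assign to a worker */
        next = (next + 1) % nworkers;                          /* round-robin */
    }
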

See http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-ma... for costs of context switching, and performance improvement when pinning threads to cores (it's quite impressive). Also, note that according to the author, on average, a context switch is 2.5-3x more expensive when using virtualization.
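
For reference, pinning the calling thread to a core on Linux is only a few lines; a sketch (Linux-specific, requires _GNU_SOURCE):

    /* sketch: pin the calling thread to one core */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }
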

You may also want to read https://medium.com/software-development-2/high-performance-s... -- it's been a long time since I wrote this, but it describes how one can deal with asynchronous I/O and other operations that may block a thread.


I'm glad Aerospike uses this technique.

Yes, it is not new. I've been using this method in one form or another (e.g. SIGIO/F_SETOWN) for more than a decade; however, nginx (relatively modern?) doesn't use this technique and could really benefit from it on some workloads. I wrote about it after I casually mentioned it to a roomful of quant programmers and found out they didn't know about it; I do not think most servers use this technique.

I've added a note about accept4(). When I have some time I intend to repeat my benchmarks with it to see how it does. Context switching is much cheaper than memory access, so I do not expect much.

You should publish benchmarks that are easy for other people to reproduce. I tried running Aerospike on my laptop and was unable to get above 25,000 writes/sec. I'm certain I'm doing something wrong, because the same hardware easily gets above 60,000 writes/sec on KDB per CPU.


I am not using AeroSpike, nor am I affiliated with them. I just studied their code base (I study codebases as a hobby). There is a _lot_ of room for improvement there IMO, but that's true for the majority of codebases I've gone through :)


This is not an ideal design; I'm surprised it has this many upvotes.

Transferring the accepted connections this way involves an extra system call to epoll/kqueue to add the event to the particular thread's poller, and the accepting thread can become a bottleneck under high load.

A better design would be to share the listening socket across threads and have each thread accept at its own pace. At the very least this avoids the additional kqueue/epoll system call needed to add the new fd to the thread's poller, though it does cause lock contention in the OS, which is still less expensive than a system call. Even better: if you're on a newer Linux version, or on BSD, consider using SO_REUSEPORT, which allows each thread to bind/accept on the same port and avoids the lock contention issue.

Also, you should consider using accept4() to set the non-blocking flag during accept, instead of making an additional system call to set it to non-blocking afterwards.
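
A sketch of that combination (per-thread listening socket with SO_REUSEPORT, plus accept4() to get a non-blocking socket in one call; addr and handle() are hypothetical, and error checks are omitted):

    /* sketch: each worker thread runs this with its own listening socket   */
    /* needs <sys/socket.h>, <netinet/in.h>                                 */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)); /* all threads share the port */
    bind(s, (struct sockaddr *)&addr, sizeof(addr));
    listen(s, SOMAXCONN);
    for (;;) {
        int fd = accept4(s, NULL, NULL, SOCK_NONBLOCK);         /* non-blocking, no extra fcntl */
        if (fd >= 0)
            handle(fd);                                         /* hypothetical per-connection work */
    }
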


One overlooked benefit of this design is that you can have several dedicated accepting threads. The benefit is magnified when you have something like HTTP, where the client speaks first.

Another overlooked benefit of this design is that individual cores may never need to read main memory -- the entire task can run in L1 or L2. If a single worker becomes too complicated this benefit is lost, and memory is much, much slower than cache.

SO_REUSEPORT is not a panacea, and I do not agree that you should have the same threads accept and handle requests, but I do think SO_REUSEPORT is useful for making upgrades easier. I have (historically) simply blocked outgoing ICMP port-unreachable messages for ports that I know are supposed to be running, since that's more portable than SO_REUSEPORT.

accept4() is a good idea. I've added a note about that. Thank you.


Backpressure between threads and utilization could be hard here. Balance between the speed of the accept, request, and worker threads is something I'm curious about. In theory, you could create pools of each if you find that one set bottlenecks the other. Also, workload isolation is important--I'm curious how the author deals with (or avoids) transferring ownership of large amounts of memory and cache between cores without incurring significant transfer overhead.


Try not to transfer memory.

On small examples, I arrange memory access so that once the memory has been migrated, a lack of thrashing will be good enough. If I need a hot start, I can use another trick:

On Linux, you can create separate processes that share fds but not memory. On modern UNIX you could pass the epoll/kqueue fds around using sendmsg. This allows you to get everything ready before hitting bind() and listen().
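
A sketch of the fd-passing (SCM_RIGHTS over a unix-domain socket, send side only; the receiver picks the descriptor up with recvmsg()):

    /* sketch: hand a file descriptor to another process over a unix socket */
    /* needs <string.h>, <sys/socket.h>                                     */
    static int send_fd(int unix_sock, int fd)
    {
        char byte = 0, ctl[CMSG_SPACE(sizeof(int))];
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = ctl, .msg_controllen = sizeof(ctl) };
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type  = SCM_RIGHTS;
        c->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(c), &fd, sizeof(int));
        return sendmsg(unix_sock, &msg, 0);
    }
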


This is pretty old, right? These days 'Fast Servers' means bare-metal network interfaces that bypass the kernel, which can achieve far higher throughput than passing everything through the kernel/epoll.


Direct memory mapping from NIC to userspace with a userspace TCP/IP stack can significantly accelerate ultra low latency processing.

The effect basically disappears when you're looking at webpages that take 10-1000ms to generate.


Agreed, as soon as per-request processing becomes a large portion of the work it will dwarf these types of optimizations.

The article is talking specifically about "simple one-page performant servers that easily get into the 100k requests/second territory", which can see very large throughput increases with direct NIC access.


dash[1] is a dynamic webserver that uses the logic described in the article (content logic is in Q/KDB).

I have considered writing a userspace TCP HTTP server like the one you suggest, but it would be a demonstration of how to write a state machine and not how to better use the facilities that the operating system provides for writing a server.

[1]: https://github.com/geocar/dash


This pattern is called "I/O completion ports" or more generally "overlapped I/O" in Windows and has been there since the very first version of Windows NT.


I believe it was present in Windows NT from the start because it was present in DEC's VMS operating system, of which Dave Cutler was also a key developer. It may date to before VMS for all I know. It's not at all new.


Very good pattern description, which demonstrates nicely that when basic system assumptions change, new patterns need to be embraced.

I'd love to see more patterns like that.


I think the pattern has already been the norm for a while, a.k.a. async programming.


"One worker per core" is too simplistic. Yes, as I wrote in a pretty well-known article a dozen years ago, it's a good starting point. Yes, it can avoid some context switching. On the other hand, when you have to account for threads doing background stuff it can be too many, leaving you with context thrashing between oversubscribed cores. When you account for threads blocking for reasons beyond your control (e.g. in libraries or due to page faults) it might be too few. The right number of worker threads is usually somewhere around the number of cores, but depends on your exact situation and can even vary over time.

Those who do not know the lessons of history...


Good job of explaining their paradigm, but I did not see any explanation as to why it is better than traditional epoll.

What advantage do you gain by separating accepting connections from handling request/response?


Consider the case of 1000 clients trying to establish an SSL connection all at once.


If the request/response handlers are much slower (large computation or IO) than the connection acceptor, the acceptor will necessarily be slowed to the rate of your handlers, limiting the number of connections you can accept.

Also, connection acceptance and request/response handling are two logically separate things, which are always nice to keep separate when possible.


> Fast Servers

Fast in what sense? Throughput, or latency?

> One thread per core

Oh, I guess the answer to the previous question is "throughput".


Could you explain why one thread per core translates to targeting throughput and not latency?

Each core can only handle a fixed number of instructions per second regardless of number of threads assigned to that core. If a core receives 100 requests, it could handle them all sequentially with 1 thread, or utilize 2-100 threads to handle them.

When using 1 thread, the first request will be completed within the minimum amount of time the core is able to do the work. The second request will complete in the same amount of time after the first.

When using 2 threads, the core will carry out the work in an interleaved fashion between the 2 threads. Both requests will complete at the same time as the 2nd request completed in our 1 thread example above.

I'll try to make it clearer - assume a request takes 5 ms:

  1 Thread
    Request 1: start processing at 0ms, complete at 5ms
    Request 2: start processing at 5ms, complete at 10ms

  2 Threads
    Request 1: start processing at 0ms, complete at 10ms
    Request 2: start processing at 0ms, complete at 10ms
As can be seen, the average latency is better with only 1 thread - avg 7.5ms latency when 2 requests arrive at the same time. 2 threads gives us avg 10ms latency.

Throughput should theoretically remain the same if you disregard context switching time and cache invalidation. In the presence of both, the single thread method should outperform in terms of throughput as well.


Suggestion: Try with 1 job of X ms, and N jobs of 1ms, and figure out for different N where the break-even point of X lies. That might give some more insight.


This is all about achieving a dataflow architecture by exploiting CPU cache hierarchy, right?

We're basically talking about going from a design where 10k little "tasklet" state machines are each scheduled onto "scheduler" threads; where during its turn of the loop, each tasklet might run whatever possibly-"cold" logic it likes...

...and turning it into a collection of cores that each have a thread for evaluating a particular state-machine state pinned, where each "tasklet" that gets scheduled onto a given core will always be doing the same logic, so that logic can stay "hot" in that core's cache—in other words, so that each core can function closer to a SIMD model.

Effectively, this is the same thing done in game programming when you lift everything that's about to do a certain calculation (i.e. is in a certain FSM state) up into a VRAM matrix and run a GPGPU kernel over it†. This kernel is your separate "core" processing everything in the same state.

Either way, it adds up to dataflow architecture: the method of eliminating the overhead of context-switching and scheduling on general-purpose CPUs, by having specific components (like "I/O coprocessors" on mainframes, or (de)muxers in backbone switches) for each step of a pipeline, where that component can "stay hot" by doing exactly and only what it does in a synchronous manner.

The difference here is that, instead of throwing your own microcontrollers or ASICs at the problem, you're getting 80% of the same benefit from just using a regular CPU core but making it avoid executing any non-local jumps: which is to say, not just eliminating OS-level scheduling, but eliminating any sort of top-level per-event-loop switch that might jump to an arbitrary point in your program.

This is way more of a win for CPU programming than just what you'd expect by subtracting the nominal time an OS context-switch takes. Rewriting your logic to run as a collection of these "CPU kernels"—effectively, restricting your code the same way GPGPU kernels are restricted, and then just throwing it onto a CPU core—keeps any of the kernel's cache-lines from being evicted, and builds up (and never throws away) an excellent stream of branch-prediction metadata for the CPU to use.

The interesting thing, to me, is that a compiler, or an interpreter JIT, could (theoretically) do this "kernel hoisting" for you. As long as there was a facility in your language to make it clear to the compiler that a particular function is an FSM state transition-function, then you can code regular event-loop/actor-modelled code, and the compiler can transform it into a collection of pinned-core kernels like this as an optimization.

The compiler can even take a hybrid approach, where you have some cores doing classical scheduling for all the "miscellaneous, IO-heavy" tasklets, and the rest of the cores being special schedulers that will only be passed a tasklet when it's ready to run in that scheduler's preferred state. With an advanced scheduler system (e.g. the Erlang VM's), the profiling JIT could even notice when your runtime workload has changed to now have 10k of the same transition-function running all the time, and generate and start up a CPU-kernel-thread for it (temporarily displacing one of its misc-work classical scheduler threads), descheduling it again if the workload shifts so that it's no longer a win.

Personally, I've been considering this approach with GPGPU kernels as the "special" schedulers, rather than CPU cores, but they're effectively equivalent in architecture, and perhaps in performance as well: while the GPU is faster because it gets to run your specified kernel in true SIMD parallel, your (non-NUMA) CPU cores get to pass your tasklets' state around "for free", which often balances out—ramming data into and out of VRAM is expensive, and the fact that you're buffering tasklets to run on the GPGPU as a group potentially introduces a high-latency sync-point for your tasklets. Soft-realtime guarantees might be more important than throughput.

---

† A fun tangent for a third model: if your GPGPU kernel outputs a separate dataset for each new FSM state each of the data members was found to transition to, and you have other GPGPU kernels for each of the other state-transition functions of your FSM waiting to take those datasets and run them, then effectively you can make your whole FSM live entirely on the GPU as a collection of kernel "cores" passing tasklets back and forth, the same way we're talking about CPU cores above.

While this architecture probably wins over both the entirely-CPU and CPU-passing-to-GPGPU models for pure-computational workloads (which is, after all, what GPGPUs are supposed to be for), I imagine it would fall over pretty fast if you wanted to do much IO.

Does anyone know if GPUs marketed specifically as GPGPUs, like the Tesla cards, have a means for low-latency access to regular virtual memory from within the GPGPU kernel? If they did, staying entirely on the GPGPU would definitely be the dominant strategy. At that point, it might even make sense to have, for example, a GPGPU Erlang VM, or entirely-in-GPGPU console emulators (imagine MAME's approach to emulated chips, but with each chip as a GPGPU kernel.)

If you can get that, then effectively what you've got at that point is less a GPU, and more a meta-FPGA co-processor with "elastic allocation" of gate-arrays to your VHDL files. System architecture would likely change a lot if we ever got that in regular desktop PCs.


Woah.

I think that many "servers" today aren't going to benefit, though, from the level of optimization that you're talking about. Network bandwidth is at least an order of magnitude slower than memory and PCI throughput. Only for the most heavyweight computational tasks (i.e. gaming, finance, scientific computing, etc.) is this kind of multiplexing optimization very helpful. If you need to get that kind of performance, you would map the NIC straight into userspace.

With how fast modern processors (CPUs and GPUs) are, what's much more important nowadays is not stalling on the flow from the NIC. The solution of the original author seems to me like a very simple fix for the old naive paradigm of state-machine switching from a single thread.

I think you might be (impressively) too close to the metal, and guilty of premature optimization. For example, a simple database application would benefit much more from keeping the slow I/O bottlenecks flowing than from the compute multiplexing.


Ah, true; this whole thing makes very little sense for the general case. Most workloads aren't like this.

The real goal here, in my mind, was to allow people in those particular high-performance domains (gaming especially, though the other ones would probably benefit too) to code at a higher abstraction level.

In my ideal world, a physics engine, for example, would be implemented in code as actors sending messages—with one actor for each physical (e.g. collide-able) object. Not only is that really expensive; the naive approach wouldn't even work! (You need "tick" events spammed to every physics-actor once per frame, which is just absolutely ridiculous.) But then the compiler transparently takes that and spits out something that isn't actor-modelled at all, but rather is a modern-day physics engine doing GPGPU SIMD to arrays of quaternions.

Or, in short: I want to code games in Elixir, without an impedance mismatch in the places where one of my actors' components turns out to need groupwise synchronous shared-memory evaluation rather than piecewise asynchronous reduction-scheduled evaluation. (That sounds like something that should be followed up with "and a pony!", but—given that the Erlang BEAM VM already has the infrastructure of "dirty" schedulers for CPU-bound tasks, and an LLVM IR step in its JIT which could be shunted over to a PTX-ISA target—this actually shouldn't even be that hard an extension to make.)


Check out http://dpdk.org. I just delivered a session on DPDK and SR-IOV. There's a whole set of libraries, classifications, and frameworks to tune/tweak Linux systems on x86 in user space.


It's hard to say from the article, but is the first example single threaded? If so, then yes, adding more threads is of course going to speed it up.

However, they don't each need their own queue; epoll supports being polled by >1 thread. In such a setup _any_ available thread can handle any request; in the author's setup, you're going to need to make sure any particular thread doesn't get too bogged down if the requests are not equal. (That pick function is important.) I'd be more curious how those two compare. (The author's is certainly slightly easier to write, I think.)
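
A minimal sketch of that shared-queue arrangement (handle_event() is hypothetical; in practice you would likely add EPOLLONESHOT or similar so two threads don't wake up for the same event):

    /* sketch: several identical worker threads blocking on one shared epoll fd */
    /* needs <sys/epoll.h>, <pthread.h>                                          */
    static void *worker(void *arg)
    {
        int epfd = *(int *)arg;
        struct epoll_event e[64];
        for (;;) {
            int n = epoll_wait(epfd, e, 64, -1);   /* whichever thread is free wakes up */
            for (int i = 0; i < n; i++)
                handle_event(&e[i]);               /* hypothetical request handler */
        }
        return NULL;
    }
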


If you can get into cache and stay there, you can enjoy 100x or even 1000x speedup.

Having your CPU do a lot of different things is a good way to get out of cache :)


I'd love to see benchmark numbers comparing this approach to others.


My dash webserver uses this trick, and I've published[1] benchmarks comparing it to KDB's built-in webserver and NodeJS (which uses libuv).

[1]: https://github.com/geocar/dash#performance


There's also libev, which claims to solve problems with libevent, and which the OP doesn't mention. (I haven't worked with either.)

http://software.schmorp.de/pkg/libev.html


Why aren't servers using the proposed pattern already?


My web server uses the same pattern. There's an article explaining it in detail here: http://tia.mat.br/posts/2014/10/06/life_of_a_http_request.ht...


Apache has done this for more than 15 years (since the MPM module was introduced).


Has everyone forgotten the c10k site, which covers all this?


Are you sure this technique is documented on C10k[1]?

I think most people use a single epollfd/kqueue instead of one per thread.

[1]: http://www.kegel.com/c10k.html


The listed strategies are:

1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification.

2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification.

3. Serve many clients with each thread, and use asynchronous I/O and completion notification.

4. Serve one client with each server thread.

5. Build the server code into the kernel.

6. Bring the TCP stack into userspace.

You may notice a lot of them are "with each thread".


The trick isn't "use a bunch of threads" but to have each CPU do a separate task (N1 threads for accept, N2 threads for read, N3 threads for write, N4 threads for close, etc.), using a separate kqueue/epoll fd for each thread.

The tasks can still use blocking calls (or nonblocking, as they like), which makes it different from strategies 1 and 2. It shards by task, which is different from 3 and 4; completion notification is not explicit. It is clearly different from 5 and 6.
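
The shape of it, as a sketch (one epoll fd and one fixed task per thread; the names are illustrative, not taken from the article's code):

    /* sketch: each thread owns one epoll fd and runs exactly one kind of task */
    /* needs <sys/epoll.h>, <pthread.h>                                        */
    struct worker {
        int epfd;                                  /* this thread's private epoll fd     */
        void (*task)(struct epoll_event *);        /* accept, read, write, or close work */
    };

    static void *run(void *arg)
    {
        struct worker *w = arg;
        struct epoll_event e[64];
        for (;;) {
            int n = epoll_wait(w->epfd, e, 64, -1);
            for (int i = 0; i < n; i++)
                w->task(&e[i]);                    /* the same small code path, every time */
        }
        return NULL;
    }
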

I think you should read it carefully. Popular webservers like nginx and varnish do not implement this method, so I do not believe it is common, and there are a few neat features of this method:

1. Tasks will usually fit into cache. If the kernel hotpaths to IP they usually won't fall out of cache either. This can produce 100x or even 1000x speed improvements for complex tasks.

2. It may be easier to convert an existing program to this design than to replace everything with nonblocking IO, or to move it into kernel space, or to attach it to a userland TCP stack.

These things are important if you want to write faster servers.


What if you have to map accept, read, write & close (so four types of tasks) to six cores? ;-)

Yes, nginx & varnish have different approaches. Their approaches tend to work better for their type of application.

One thread or process per core, with its own event loop, is indeed a very standard pattern. The process-per-core approaches tend to shard per fd (e.g., nginx), while the thread-per-core approaches tend to either shard by IO request or not shard at all (each thread has an independent poll of all outstanding IOs). Particularly if you use a continuation-style paradigm, it is really easy to convert existing code over to the design.

In practice, you will typically have better cache locality with the fd than your position in the state machine (and the model you are describing isn't quite where you are in the state machine, but you can adjust it to fit). If your state machine is simple enough, your approach might get better instruction cache hit rate, but you are basically guaranteeing that data has to move between the processor caches in very rapid succession. In practice, not only are you forcing the fd's to be passed between processor caches, but there tends to be data locality with that fd. While it is easy to have the instruction cache replicated across all processors, the data mutates quite rapidly. Again, this stuff is covered in the discussions on the c10k site.


Check out dpdk.org.



