Scalable Network Programming (scribd.com)
41 points by kirubakaran on June 8, 2008 | 27 comments



http://www.kegel.com/c10k.html is more concise and has better coverage of the available options. It is the canonical reference for anyone asking "how to handle 10,000 clients".


It's a great read, although it hasn't been updated since 2006, so it should be read with care.


oldie but a goldie. obligatory non-scribd link (http://bulk.fefe.de/scalable-networking.pdf). if you are in this business get 'network algorithmics'. read it, be one with it! it is the sicp of networking.


"sicp of networking" is a bit over the top. IMHO it gives undue emphasis to Varghese's research interest in packet classifiers, and is filled with typos.


It's been a while since I've checked, but I think he's mistaken that Apache is one process per connection. I think it fires up multiple processes simply because that's the best approach if each request takes a while to serve. He doesn't really cover that aspect of things much. Serving static files quickly is, if not quite a solved problem, not terribly hard. Doing applications, with significant processing overhead, is a bit more difficult.


So is anyone interested in an essay about that? I've got a few topics on the hook. I used to run a search engine that served 600 million queries per month.


Now that sounds interesting. Why did you quit or did you sell your search engine? I'm doing some search stuff myself (I haven't launched so far) and it would be nice to be able to get some tips.

However I do not expect to reach 600 million queries per month, at least not in the first month ;)


I only ran the initial architecture & buildout and then moved to sysop as they hired people. The rest of the business was not my business. :) After a good few years we hit financial trouble and were bought up by Yahoo for pennies.


Must have been an interesting job though. :)


Should be a mandatory read for anyone who wants to talk about "scaling". There's a lot of armchair discussion of this topic from people who have never looked at UNIX networking from below the surface (or don't even know about select(), epoll, and the various UNIX IPC mechanisms).


Executive summary of the first 53 pages: epoll is great, the other solutions range from not-so-good to terrible.


Interesting read, but I didn't see the case where you just don't use processes or threads covered. Did I miss it?


Yeah, I thought for a minute I had skipped ahead a few pages by mistake. One minute it's all about thousands of processes or threads, and the next it talks about select/poll/epoll/kqueue without really introducing the idea properly (but other than that I thought it was great!).

Incidentally, the justin.tv chat server uses one thread per CPU and epoll for its network I/O, and currently scales to a little over 10,000 connections per CPU (see http://irc.justin.tv/ for other numbers). It's the most scalable IRC server that I'm aware of. I keep meaning to write something up about it...
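
For the curious, here is a minimal sketch of that pattern, not justin.tv's actual code: one event-loop thread per CPU core, each with its own epoll instance, all accepting from one shared non-blocking listening socket. The port number and the echo behaviour are placeholders.

    /* One event-loop thread per CPU, each with its own epoll instance, all
     * accepting from a shared non-blocking listening socket. Linux-specific. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    #define MAX_EVENTS 64

    static int listen_fd;              /* shared listener (port is a placeholder) */

    static void *event_loop(void *arg)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        struct epoll_event events[MAX_EVENTS];
        char buf[4096];

        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {
                    /* another thread may have won the race to accept; EAGAIN is fine */
                    int c = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
                    if (c >= 0) {
                        struct epoll_event cev = { .events = EPOLLIN, .data.fd = c };
                        epoll_ctl(epfd, EPOLL_CTL_ADD, c, &cev);
                    }
                } else {
                    /* toy "protocol": echo whatever arrived, close on EOF or error
                     * (closing the fd also removes it from this epoll set) */
                    ssize_t r = read(fd, buf, sizeof buf);
                    if (r <= 0) close(fd);
                    else        write(fd, buf, r);
                }
            }
        }
        return NULL;
    }

    int main(void)
    {
        struct sockaddr_in addr = { .sin_family      = AF_INET,
                                    .sin_port        = htons(6667),
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t tid;

        listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        bind(listen_fd, (struct sockaddr *)&addr, sizeof addr);
        listen(listen_fd, SOMAXCONN);

        for (long i = 1; i < ncpu; i++)    /* one event loop per CPU... */
            pthread_create(&tid, NULL, event_loop, NULL);
        event_loop(NULL);                  /* ...with the main thread running one too */
        return 0;
    }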


The discussion of epoll/select does cover the event-driven scenario, and there's the discussion of the simplest web server (one process for all connections).

Memcached is one example of a purely event driven application. Another example (also by Brad Fitzpatrick) is perlbal (written in Perl and using epoll).


Sure, I just expected the doc to go a little more like:

1 process per connection (bad) -> 1 thread per connection (better?) -> 1 process (bingo!)


~1 process per core, that is (unless the daemon is completely I/O bound).


This is only true if you think your application can do a better job of scheduling than the underlying OS kernel. In many (if not most) cases, this is false.

For HTTP servers hosting static content, you may be able to out-perform the OS thread scheduler. For most non-trivial apps, you're probably wrong.

Small fork()-ed processes can still compete with clever poll(), /dev/epoll, or kqueue()-based servers, if you can keep each instance lightweight.
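
For what it's worth, a bare-bones sketch of that fork()-per-connection model, where handle_client() is a hypothetical stand-in for the real per-request work and the port is a placeholder. The scaling question then comes down to how lightweight you can keep each child:

    /* Classic fork()-per-connection: the parent only accepts, each child
     * handles exactly one connection and exits. */
    #include <signal.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    static void handle_client(int fd)      /* hypothetical per-request work */
    {
        char buf[1024];
        ssize_t r = read(fd, buf, sizeof buf);
        if (r > 0)
            write(fd, buf, r);             /* placeholder: echo the request back */
    }

    int main(void)
    {
        struct sockaddr_in addr = { .sin_family      = AF_INET,
                                    .sin_port        = htons(8080),
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        int listen_fd = socket(AF_INET, SOCK_STREAM, 0);

        signal(SIGCHLD, SIG_IGN);          /* let the kernel reap exited children */
        bind(listen_fd, (struct sockaddr *)&addr, sizeof addr);
        listen(listen_fd, SOMAXCONN);

        for (;;) {
            int client = accept(listen_fd, NULL, NULL);
            if (client < 0)
                continue;
            if (fork() == 0) {             /* child: serve this one connection */
                close(listen_fd);
                handle_client(client);
                close(client);
                _exit(0);
            }
            close(client);                 /* parent: drop its copy and loop */
        }
    }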


This is only true if you think your application can do a better job of scheduling than the underlying OS kernel. In many (if not most) cases, this is false.

Why is it false? As the other reply notes, you have more domain knowledge than the kernel's scheduler does. You also don't pay the overhead of entering the kernel to context switch (which is why context switches between userspace threads are cheaper than between kernel threads). In the case of fork()-based servers, you also need to flush the TLB (depending on the CPU architecture).

I'd be curious to see links that support your last claim: AFAIK, it is fairly well-known that event-based daemons using epoll/kqueue are the most performant technique for writing scalable network servers. See C10K, etc.


The problem is that, in the real world, monolithic polling network servers have to manage all the transaction-specific context for your application in a single binary.

I'm not challenging the fact that the C10K architecture wins if we're talking about static content. Real webapps don't live and die by their static content performance, though: the critical path is through the dynamic content generation, which means that select()/poll()/et al. don't buy you much, unless you can hook your database client events and application thread scheduling inside that loop as well.

Green (a.k.a. userspace) threads are one good solution, but fork() isn't the performance-killer that people seem to think it is, either.


> This is only true if you think your application can do a better job of scheduling than the underlying OS kernel. In many (if not most) cases, this is false.

Why wouldn't you be able to virtually always do a better job of scheduling events (when you control all the code competing for resources) than a generic OS scheduler?

> Small fork()-ed processes can still compete with clever poll(), /dev/epoll, or kqueue()-based servers, if you can keep each instance lightweight.

Do you know of any examples of ultra highly scalable fork()ing servers?


> Why wouldn't you be able to virtually always do a better job of scheduling events (when you control all the code competing for resources) than a generic OS scheduler?

Simply put, I would say that "you" (where "you" is an average webapp developer) probably don't understand scheduling, event-driven programming, or memory management as well as the average kernel developer. I of course don't mean this to extend to anyone in this thread. Imagine, though, giving your average PHP or ASP developer, who may struggle to implement a basic sort algorithm, the problem of implementing cooperative multitasking in a scalable way.

> Do you know of any examples of ultra highly scalable fork()ing servers?

Under real-world workloads, I still consider Apache to be "highly scalable." In my experience, given 1:1 investment in hardware for front-end and database servers, the RDBMS craps out long before the webapp, so being able to accept another 5000 incoming connections is only going to hurt you.


If you use long-lived connections, Apache simply does not cut it any more, IMHO.


You're thinking that having the OS save everything in a context switch is better than you just calling a different function to do a piece of work for one of your connections? Not really.


Forgive the reductio ad absurdum, but it seems that by your argument, we'd all be better off simply implementing our webapps within the init daemon and handling all scheduling and resource protection within our application logic.

Personally, I like having my orthogonal processes running in isolation from each other, mediated through a tested, scalable kernel.

Obviously, there's a balance to be found between needless context switches and dangerous coupling of function, but I've found that, as often as not, a process context per user is actually a pretty good point to come down on within that continuum.


I guess it depends how many users you want to scale to :)


I'm less concerned with cramming as many users onto a single box as possible than I am with ensuring a high quality of service for each active user.

If you have 5000 concurrent sessions in a single OS-level process, you better have some pretty damn good fault-recovery mechanisms built into that process.

When you get right down to it, this is the same basic argument that gets used to support Erlang: namely, that it supports a large number of cooperating, isolated processes, rather than shoving everything into a single monolithic server.

I trust that the Linux (or BSD, Solaris, or other reasonable OS) kernel can handle scheduling a few thousand processes smoothly, so long as those processes fit into available RAM.

What kills big Apache (or other fork()-driven) server environments isn't context switching between backends, it's swapping.

That being said, you're right about it being all about scale. If you're like Google, or Facebook, and can afford to hire engineers who do nothing but worry about scaling to 10K sessions on each HTTP host by moving your critical-path code into hand-tuned C++ or Java inside the polling server, more power to you. If you're in the normal world most of us live in, where scalability via hardware is more economical than via developer time, then fault isolation may prove more useful than raw throughput.


Good point. In practice, though, I like to keep one networking process/thread and use the other cores for other tasks, such as database access and other operations that might block, or just long operations that need to get done without holding up other connections.

After all, the actual networking code isn't CPU-heavy.

That way you don't have to deal with any synchronization or concurrency issues in the networking code at all.
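
A rough sketch of that split, where blocking_work() is a hypothetical stand-in for a database query or similar: the networking thread pushes jobs onto a locked queue, and a small pool of worker threads drains it.

    /* One networking thread, a pool of workers for anything that might block.
     * Only this queue needs locking; the networking code stays single-threaded. */
    #include <pthread.h>

    #define QUEUE_SIZE 1024                /* toy queue: no overflow handling */

    struct job { int client_fd; /* plus whatever the request needs */ };

    static struct job jobs[QUEUE_SIZE];
    static int q_head, q_tail;
    static pthread_mutex_t q_lock     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;

    /* called from the networking thread: grabs the lock only briefly */
    void submit_job(int client_fd)
    {
        pthread_mutex_lock(&q_lock);
        jobs[q_tail++ % QUEUE_SIZE] = (struct job){ .client_fd = client_fd };
        pthread_cond_signal(&q_nonempty);
        pthread_mutex_unlock(&q_lock);
    }

    /* hypothetical stand-in for a database query, disk I/O, heavy CPU work... */
    static void blocking_work(struct job j)
    {
        (void)j;
    }

    static void *worker(void *arg)
    {
        for (;;) {
            pthread_mutex_lock(&q_lock);
            while (q_head == q_tail)
                pthread_cond_wait(&q_nonempty, &q_lock);
            struct job j = jobs[q_head++ % QUEUE_SIZE];
            pthread_mutex_unlock(&q_lock);

            blocking_work(j);              /* may take as long as it likes; the
                                              event loop keeps serving other fds */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        for (int i = 0; i < 4; i++)        /* e.g. one worker per spare core */
            pthread_create(&tid, NULL, worker, NULL);

        /* the single networking thread would run its select/epoll loop here,
           calling submit_job() whenever a request needs blocking work */
        return 0;
    }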



