Thread Pools in Nginx Boost Performance 9x (2015) (nginx.com)
249 points by ithacadream on Aug 14, 2017 | 56 comments



This message should probably be repeated at the bottom of the article:

"The page cache works pretty well and allows NGINX to demonstrate great performance in almost all common use cases...So if you have a reasonable amount of RAM and your working data set isn’t very big, then NGINX already works in the most optimal way without using thread pools"

So using thread pools in NGINX isn't a general recommendation. The article suggests "a heavily loaded NGINX-based streaming media server" as the type of situation where it makes sense.
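
For those who do need it, enabling it is just a couple of directives (a minimal sketch based on the article; the pool name and sizes are illustrative):

    # main context: a pool of 32 threads with a queue of
    # up to 65536 pending operations
    thread_pool default threads=32 max_queue=65536;

    http {
        server {
            location /video/ {
                # offload blocking file reads to the pool
                aio threads=default;
                sendfile on;
            }
        }
    }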


The moral is that threads and asynchronous programming are not mutually exclusive.

No matter how much you can do with a single thread, in a highly parallel environment (such as a web server), on a CPU with multiple cores (the norm), you can (almost always) do more with multiple threads.


Sure, for thread counts up to cores / 2. Past that it's likely a throughput and latency loss.


Completely depends on whether these are actual "worker threads" or just dumb threads that have been over-scheduled.

If these are "worker threads", new workers are scheduled when all other workers are blocked and work is available. This guarantees that an idle CPU will always be scheduled with work. When workers unblock, they continue executing their work and start picking up new work. The worker scheduler then stops scheduling new workers, and the extra workers go back to the pool of inactive workers. This is the idea behind a worker or thread pool: workers get scheduled whenever it's possible to do more work.

A poor implementation of this idea will just hand work off to threads and then expect them all to execute in parallel, which will lead to a negative performance impact.

The problem with (some implementations of) async workers is that just scheduling more of them results in the latter problem: work split across too many workers. This is especially true for workers implemented as entire processes. The problem this post brings up is that a 1-to-1 mapping of workers to cores suffers from idle CPU due to unanticipated blocking. The general solution is to smartly distribute work over a pool of workers, not to blindly schedule a fixed number of workers, as sketched below.
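
A minimal sketch of that idea in C (a fixed pool of pthreads pulling from a shared queue; error handling omitted, names are mine):

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct task {
        void (*fn)(void *);
        void *arg;
        struct task *next;
    } task;

    static task *head, *tail;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

    /* Workers sleep only when there is no work at all; as long as
       the queue is non-empty, every idle CPU picks something up. */
    static void *worker(void *unused) {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (head == NULL)
                pthread_cond_wait(&nonempty, &lock);
            task *t = head;
            head = t->next;
            if (head == NULL)
                tail = NULL;
            pthread_mutex_unlock(&lock);
            t->fn(t->arg);  /* may block on IO; other workers keep going */
            free(t);
        }
        return NULL;
    }

    void submit(void (*fn)(void *), void *arg) {
        task *t = malloc(sizeof *t);
        t->fn = fn;
        t->arg = arg;
        t->next = NULL;
        pthread_mutex_lock(&lock);
        if (tail) tail->next = t; else head = t;
        tail = t;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
    }

    /* start a pool sized to the machine */
    void start_pool(int nthreads) {
        for (int i = 0; i < nthreads; i++) {
            pthread_t tid;
            pthread_create(&tid, NULL, worker, NULL);
        }
    }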


Does anyone have numbers to back this claim? Seems like for IO focused software you'll have a lot of dead CPU time. If you have more threads than cores, you can take advantage of that.


"Seems like for IO focused software you'll have a lot of dead CPU time. If you have more threads than cores, you can take advantage of that."

Yes, see here for more discussion on this:

https://github.com/ronomon/crypto-async#adjust-threadpool-si...


No dead CPU time in well-designed software. If you care about I/O performance then you are managing that in user space asynchronously. No thread is waiting for I/O to complete or anything like that unless there is literally nothing else to do (in which case more threads won't help).

There is no performance benefit to adding more threads; they would only increase CPU state/cache thrashing, which slows things down. The main benefit of using a lot of threads is that more programmers know how to design software that way, because it was taught for so many years; it isn't good for performance on modern hardware.


Asynchronous programming also solves that problem, by using an event loop.
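
i.e. something like this, in C with epoll (a bare-bones sketch; `handle()` is a hypothetical handler defined elsewhere):

    #include <sys/epoll.h>

    void handle(int fd);  /* hypothetical handler, defined elsewhere */

    void event_loop(int sock) {
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock };
        epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);

        struct epoll_event events[64];
        for (;;) {
            /* one thread sleeps until *any* fd is ready, then
               dispatches; it never parks on a single fd */
            int n = epoll_wait(ep, events, 64, -1);
            for (int i = 0; i < n; i++)
                handle(events[i].data.fd);
        }
    }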


I think it should make a difference whether the threads end up being IO-limited or CPU-limited.


More like <cores> * 2


Not for busy-spinning threads.

Source: spent a year performance-tuning a matching engine.


Busy-spinning threads are a bug that needs to be fixed first.


Can you elaborate on why cores / 2 is the "ideal" number for throughput and latency?


Well, HT is generally crap. In most cases, if you go up to a random server, it is going to have HT enabled. So cores / 2 will get you in the ballpark. You really should go into the BIOS and shut HT down if you care about latency at all. Then you can just run cores or cores - 2 threads.

So once you are there, the next step is to busy-spin at a Thread Per Core (TPC). You have 10 cores, you run 10 threads (or more realistically 8-9, to leave the OS some spare cores to muck with), and busy-spin them at 100% doing work. You never let the Linux scheduler touch them.

If this stuff interests you, there are some cool papers from LMAX on the Disruptor, and talks by Gil Tene, that cover this stuff.

The amount of work you can do with a single 10-20 core CPU is amazing.
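
Pinning is straightforward on Linux, by the way (a sketch using the GNU pthread_setaffinity_np extension; the queue-polling body is a placeholder):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>

    /* pin the calling thread to one core so the scheduler
       never migrates it */
    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *spin_worker(void *arg) {
        pin_to_core((int)(intptr_t)arg);
        for (;;) {
            /* busy-spin: poll a lock-free queue here
               instead of ever sleeping */
        }
        return NULL;
    }

You'd typically pair this with isolcpus on the kernel command line so the OS never schedules anything else on those cores.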


As long as you're not using Apache in pre-fork mode, the rest boils down to a few percent in terms of HTTP server overhead. Unless you're trying to serve tons of little static objects, either with scant resources or at big scale, there are better places to spend your optimization time.


And even then it only matters if you don't have something in front of it handling the connection pooling/spoon-feeding/static stuff, like a load balancer or a CDN or Varnish or pretty much anything.


I think the main takeaway is that Linux is broken as far as async file reading goes.


It seems quite similar to libuv's work queue [1]

[1] http://docs.libuv.org/en/v1.x/guide/threads.html#id1
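
For reference, the libuv API looks like this (a minimal sketch): the work callback runs on a threadpool thread and may block; the after-callback runs back on the loop thread.

    #include <stdio.h>
    #include <uv.h>

    /* runs on a threadpool thread -- blocking is fine here */
    static void work_cb(uv_work_t *req) {
        (void)req;
        /* e.g. a blocking file read */
    }

    /* runs back on the event-loop thread once work_cb returns */
    static void after_cb(uv_work_t *req, int status) {
        (void)req;
        printf("done (status=%d)\n", status);
    }

    int main(void) {
        uv_work_t req;
        uv_queue_work(uv_default_loop(), &req, work_cb, after_cb);
        return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
    }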


So the main problem seems to be that libaio on Linux does not use the filesystem cache, because it forces the use of O_DIRECT. Any reason Linux couldn't support AIO on regular files like FreeBSD does?


TLDR, the Linux non-blocking filesystem API requires you to bypass the OS file cache (!!!)

So the whole webserver serving clients might occasionally block waiting for disk if you're serving online videos, or some other application where your whole website is too big to fit in the OS cache. But if you have multiple threads, then one of them blocking isn't a big deal, as the other threads will continue to dequeue requests.

> The asynchronous interface requires the O_DIRECT flag to be set on the file descriptor, which means that any access to the file will bypass the cache in memory and increase load on the hard disks


Looks similar to Apache's worker MPM (released ~2005)


Different paradigms.

Worker does use process pools with their own thread pools, but they're still not evented/async.

Apache 2.4 (2012) saw the release of 'event', an async variant of worker, which is like the new nginx model.

https://httpd.apache.org/docs/2.4/mod/event.html


Ok, I understand why you might "offload" read() on file descriptors backed by disk files. But why "sendfile"? I thought sendfile was properly async on Linux... unless I'm getting something wrong.

"aio_write" -> wasn't the whole idea of AIO operations to be truly async? Maybe it's because it requires XFS. Anyway, I'm confused.


Sendfile isn't async; it just avoids the user-space copy, and in some cases the copy from the fd to the socket. AIO doesn't require XFS, just O_DIRECT, which can be tricky to get right and is more work than userspace folks want to deal with.
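
For reference, the call itself (a sketch; the wrapper function is mine):

    #include <sys/types.h>
    #include <sys/sendfile.h>

    /* send `count` bytes of file_fd to sock_fd entirely in kernel
       space -- no user-space buffer -- but the call can still
       block while pages are fetched from disk */
    ssize_t send_file_chunk(int sock_fd, int file_fd, size_t count) {
        off_t off = 0;
        return sendfile(sock_fd, file_fd, &off, count);
    }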


From the article

"Although Linux provides a kind of asynchronous interface for reading files, it has a couple of significant drawbacks. One of them is alignment requirements for file access and buffers, but NGINX handles that well. But the second problem is worse. The asynchronous interface requires the O_DIRECT flag to be set on the file descriptor, which means that any access to the file will bypass the cache in memory and increase load on the hard disks. That definitely doesn’t make it optimal for many cases."


From what I can see, sendfile isn't really async; it still returns the number of bytes written or an error status, and nginx probably needs to wait for that to e.g. log the results, handle short writes, or proceed with whatever operation must be done next on the socket (like closing it). No idea why aio_write needs offloading, though.


They say why in the article

"The asynchronous interface requires the O_DIRECT flag to be set on the file descriptor, which means that any access to the file will bypass the cache in memory and increase load on the hard disks"

They even go into how no interface has been surfaced yet that allows determining whether a file is in the cache or not, and how FreeBSD's aio interface doesn't have the same limitations.


I never tried, but mmap and mincore could work (or not, as one of the many special cases in this area). I expect that's not a viable approach for this application anyway :)
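
Something like this, presumably (a sketch; as I say, untested for this purpose):

    #include <sys/mman.h>
    #include <unistd.h>

    /* returns 1 if the first page of fd's contents is resident
       in the page cache (so a read of it shouldn't hit the disk),
       0 otherwise */
    int first_page_cached(int fd) {
        long page = sysconf(_SC_PAGESIZE);
        unsigned char vec[1];
        void *p = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED || mincore(p, page, vec) != 0)
            return 0;
        munmap(p, page);
        return vec[0] & 1;  /* low bit = page resident */
    }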


No, sendfile is not necessarily non-blocking (note the difference between async and non-blocking IO); it still blocks the caller if the reading side is a disk device. The advantage of using sendfile is that you avoid copying the data from kernel space (source) to user space (program) and then back to kernel space.


It would be interesting to compare this with coroutines/fibers, and it would also be interesting to know how latency is affected. In my experience, thread pools introduce latency.


The trick is that nginx uses thread pools to get non-blocking reads from the "fast" devices - you can't use coroutines/fibers here, because on Linux it is not reliably possible to read a regular file in a non-blocking way, as far as I'm aware, so a `read` in a single fiber would block the entire thread until the data is fetched from the disk.

In the storage engine I'm developing/maintaining, I had been using fibers for both network and disk IO, until people started noticing that, as the number of clients grows and they ask for a lot of data (generating a lot of disk traffic), performance drops severely. I've moved to thread pools, dispatching my disk IO operations there (although it's mostly a problem for reads, as writes would be performed by the kernel anyway; `fsync` and `close` are operations that you still want to do in a separate thread): https://github.com/sociomantic-tsunami/dlsnode/blob/master/s...

edit: used right link for the freshly opensourced repo.


A slightly off-topic question tinged with a touch of envy that you get to work on D.

Many HN readers are apprehensive about the use of D because it is garbage collected, and conjecture that it is not appropriate for low-latency and/or high-throughput workloads. Correct me if I am wrong, but that is exactly the kind of load Sociomantic needs to address. I have thoroughly enjoyed the relevant DConf videos, but it would be great to hear a first-hand account.

BTW, is the move to D2 done?


Oh, I'm aware this is a late reply, but it's better than nothing.

Yes, what you describe is exactly the kind of load we're addressing, and we're not letting the GC get in the way, simply by making use of reusable memory buffers, which are allocated in the first few requests and then always recycled and fetched from the pool, so the GC never gets a chance to kick in (as explained in the great blog post series, in D a GC mark & sweep will kick in only on allocations - completely deterministic: https://dlang.org/blog/category/gc/). Ocean provides some help here: https://github.com/sociomantic-tsunami/ocean/blob/v2.x.x/src... https://github.com/sociomantic-tsunami/ocean/blob/v2.x.x/src..., etc. This makes D completely suitable for these kinds of applications, and in my experience none of D's features or misfeatures makes such implementations hard.

The D2 move is not yet completely done. We're still writing code that's compatible with both D1 and D2. For example, the DHT node project (https://github.com/sociomantic-tsunami/dhtnode) and the libraries it depends on, such as ocean (https://github.com/sociomantic-tsunami/ocean) and swarm (https://github.com/sociomantic-tsunami/swarm), are able to run in D2 with no performance penalties, but they still don't use D2-only constructs.


Appreciate your reply and welcome to HN.


Thanks!


You can move the co-routine onto another thread. The Go scheduler does this: https://morsmachine.dk/go-scheduler

There isn't really a "comparison" here. Co-routines/fibers could behave exactly like nginx does for IO or not at all. It all depends on the implementation.


Yes, sorry, I've missed the point that you can move them across the threads (was sidetracked by the usual notion of them running in the same thread).


Why can't files be read without blocking on Linux? There's certainly an API for this.


Which API are you referring to? If you're referring to `aio_`, that's just glibc's user-space thread-pool emulation of POSIX asynchronous IO, and many operations needed for truly async IO are not supported (such as `stat` and `open`). There's the `io_` syscall family, but that's not in a ready-to-use state yet, judging by the manpages and the various patches being submitted every month.


I'm referring to `fcntl(fd, F_SETFL, flags | O_NONBLOCK)`. `stat` and `open` are blocking, yes - I didn't think about that. Also, according to [1], file operations will always block even in non-blocking mode, so I was wrong, sorry.

1: https://www.remlab.net/op/nonblock.shtml


> ... Also according to [1] file operations will always block ...

not quite :) look at aio(7)


> Why files can't be read without blocking with Linux? There's certainly an API for this.

i guess you are referring to aio(7)? posix aio on linux is implemented in glibc, and doesn't scale in the presence of multiple threads, which is probably what is being alluded to...


io_ syscalls only really work on XFS and require O_DIRECT, which means the kernel is not caching anything. Unsuitable for web servers.
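
For anyone curious what that path looks like, here's a sketch of a single O_DIRECT read via the raw io_* syscalls (error handling omitted; the filename is made up):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/aio_abi.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data", O_RDONLY | O_DIRECT);  /* cache bypassed */

        void *buf;
        posix_memalign(&buf, 512, 4096);  /* O_DIRECT wants aligned buffers */

        aio_context_t ctx = 0;
        syscall(SYS_io_setup, 128, &ctx);  /* create a submission context */

        struct iocb cb = { 0 };
        cb.aio_fildes = fd;
        cb.aio_lio_opcode = IOCB_CMD_PREAD;
        cb.aio_buf = (uint64_t)(uintptr_t)buf;
        cb.aio_nbytes = 4096;

        struct iocb *list[1] = { &cb };
        syscall(SYS_io_submit, ctx, 1, list);             /* queue the read */

        struct io_event ev;
        syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);  /* wait for completion */
        return 0;
    }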


Fibers and thread pools are not exclusive - you can schedule fibers onto threads. Nginx already uses an event-based state-machine approach, so a context switch is even cheaper than with fibers (at the price of ugly code).

What they talk about in the article is offloading potentially blocking operations to thread pools. That's not about networking, which is already done asynchronously, but about e.g. reading from a file not present in the page cache. Note that detecting that the needed bytes are already in the page cache - and thus that the read wouldn't block and should not be offloaded - is tricky. As far as I can see they haven't done that yet, so this new feature (edit: not so new, the post is from 2015) is not without downsides.


Why can Linux do async network IO but not async file IO? Seems like they would be similar. What's so special about file IO?


When non-blocking socket I/O was implemented, the disk was considered too fast to warrant the same kind of treatment. At the API level, the difference is that sockets, pipes, and ttys have an additional "ready for use" state, while files are always ready, because the kernel can always do something to access them - even if it means blocking the thread while the kernel waits for the disks to respond.


Yes, the trick is that a disk device is considered "fast", so it's not selectable - it's always ready to read/write, which is a lie. I _feel_ (and think - I don't have any data to prove this) that the main problem here is the file system layer, which may or may not need to block _after_ the operation is started; this complexity doesn't occur with sockets/pipes.

Having an always-ready state on regular files is a problem, since Linux's non-blocking IO is built around readiness. On Windows (and I believe on Solaris/FreeBSD), a completion model is in place instead, so you schedule an operation, the kernel does _everything_, and then you're resumed upon completion of the IO.


The most general layer in Linux (VFS) doesn't really support asynchronicity, so changing this would require a bunch of changes to every in-tree and out-of-tree FS - not viable. However, the usual suspects (ext, xfs) actually use a bunch of other APIs as well, where the FS itself is often not involved in simple stuff like a read(2). This would theoretically allow async IO on files in many cases, and I believe there is ongoing work in this direction. For now, only raw / O_DIRECT IO is async.


> In my experience, thread pools introduce latency.

The headline uses the word "performance", which is rather ambiguous.

Performance has many aspects: latency, throughput, resource usage (doing more with less), etc.

A more specific headline wouldn't hurt.


In the benchmarks reported in the article, both throughput and latency were significantly improved by the new thread-pool system.


Being specific about the area being improved (blocking operations, with the example of file reads) would be nice.


This approach sounds very similar to SEDA (staged event-driven architecture), which was done a while back - iirc, circa 2001.


Interesting read; however, a 2015 tag would be nice, since the post is from June 19, 2015.


must need some monads =P


This is not the right place, but I failed to find an answer elsewhere, so would somebody answer?

Is it possible to use lighttpd as a proxy but at the same time serve static pages for a subset of URLs?


I don't know, but you can definitely do that with nginx.
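
The nginx version would look roughly like this (a sketch; paths and upstream are made up):

    server {
        # serve this subset of URLs as static files
        location /static/ {
            root /var/www;
        }
        # proxy everything else
        location / {
            proxy_pass http://127.0.0.1:8080;
        }
    }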

You should ask here: https://redmine.lighttpd.net/projects/lighttpd/boards



