The Unscalable, Deadlock-Prone, Thread Pool (pvk.ca)
122 points by dmit on Feb 26, 2019 | 45 comments



I wrote a parallel image processing tool (in Swift) that essentially uses Grand Central Dispatch on macOS to process all files in a directory.

I hit the problem that GCD would happily let me enqueue tasks for processing, but each task needed quite a large amount of memory and GPU resources (because it would use CoreFilters that support GPU acceleration "when available"), which made the computer hit swap and slow to a crawl once the available RAM was exhausted.

I had to create an Admission Control component that would let me enqueue tasks only if there was enough memory left to process the image without risking swap. That part isn't as trivial as it seems: getting the amount of available memory for a process is vague on UNIX/macOS/Linux, since many OSes of this kind assume you can allocate an effectively unlimited amount of memory (see Linux's overcommit system) and that the virtual memory system will do the right thing for you. Nobody can guess whether physical RAM is about to be claimed by some other process, either, so I used a fairly conservative heuristic that probably left a small amount of resources unused most of the time, but would almost never slow to a crawl. Not perfect, but "good enough".
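
For illustration, a minimal Swift sketch of that kind of admission control, assuming a fixed per-task memory estimate rather than the real heuristic (the class, names, and budget below are made up for the example):

    import Foundation

    // Minimal admission-control sketch: admit only as many concurrent tasks
    // as fit in a conservative memory budget. Real code would estimate
    // available RAM dynamically instead of hard-coding a budget.
    final class AdmissionController {
        private let slots: DispatchSemaphore
        private let queue = DispatchQueue(label: "image.processing", attributes: .concurrent)

        init(memoryBudget: Int, bytesPerTask: Int) {
            slots = DispatchSemaphore(value: max(1, memoryBudget / bytesPerTask))
        }

        func submit(_ work: @escaping () -> Void) {
            slots.wait()                       // block enqueueing until a slot frees up
            queue.async {
                defer { self.slots.signal() }  // return the slot when the task finishes
                work()
            }
        }
    }

    // e.g. ~8 GiB budget, ~1 GiB per image
    let controller = AdmissionController(memoryBudget: 8 << 30, bytesPerTask: 1 << 30)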


It seems you have a producer-consumer problem, where feeding in work to do faster than it can be processed only leads to lengthy queues, and running more tasks in parallel than you have cores only increases memory usage and task switching overhead.

Such processes rarely are memory-bound, and from what you write (it would use CoreFilters that support GPU acceleration "when available"), yours aren’t either. So why would you want to limit the number of consumers by looking at memory usage?

ioquatix’s suggestion to measure latency is better, but more work, so my first go would be to manage queue length, as that is simpler to do and often effective enough.

Determine how many consumer processes your system can really handle in parallel and limit the number of queued GCD tasks to, say, twice that number. The right factor depends on how bursty the producer’s output is and on how CPU-bound the consumers are.

If memory or I/O isn’t your bottleneck, you want to minimize the number of items ‘in flight’ needed to keep your CPU or GPU at 100% usage.
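
For what it's worth, a rough GCD sketch of that bound (the names and the 2x factor are illustrative):

    import Foundation

    // Cap the number of work items "in flight" at roughly 2x the core count,
    // so the producer blocks instead of building an unbounded queue.
    let maxInFlight = 2 * ProcessInfo.processInfo.activeProcessorCount
    let inFlight = DispatchSemaphore(value: maxInFlight)
    let consumers = DispatchQueue(label: "consumers", attributes: .concurrent)

    func enqueue(_ item: @escaping () -> Void) {
        inFlight.wait()                  // producer waits once the bound is reached
        consumers.async {
            defer { inFlight.signal() }  // free a slot when the item completes
            item()
        }
    }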


> Such processes rarely are memory-bound, and from what you write (it would use CoreFilters that support GPU acceleration "when available"), yours aren’t either. So why would you want to limit the number of consumers by looking at memory usage?

In my case it was clearly memory bound. I demonstrated it through instrumentation and profiling, then addressed it by using an admission control scheme exclusively built on memory.


If you can build a system which measures the latency of task processing and dynamically adjusts the resource allocation, I think you can be robust in the face of changing underlying conditions. It's non-trivial, but easier to look for spikes in latency than trying to estimate resource allocation/utilisation.


Right - we need holistic scheduling systems that balance all resources and tasks: for example, scheduling an IO-heavy task alongside a compute-heavy task and letting both run well, rather than picking two compute-heavy tasks at the same time, and only scheduling as many tasks as there is RAM, IO, and other resource capacity to use productively. Lots of people are trying, because at data-centre scale, and with serverless letting providers run your code wherever they want, this could be a huge advantage... not sure how much success anyone has had yet.


I wonder if something as simple as tagging would give a huge benefit towards that. I should know if my job uses lots of memory (often I know exactly how much), or IO, or CPU.


This is something I have pondered for the Bitbake build system. It compiles a Linux "distribution" for embedded systems and uses a lot of CPU, RAM, and I/O. Often one is starving the other. I use a spinning disk for build dirs currently because they use a lot of space. Even with an SSD there is room for better scheduling.


That’s quite an interesting problem. The mlock [1] system call allows you to lock chosen virtual memory into RAM. What about using that in combination with a memory pool which you manage yourself?

[1] https://developer.apple.com/library/archive/documentation/Sy...
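
Something like this, perhaps (a sketch only; real code needs error handling and enough RLIMIT_MEMLOCK, and the pool size is arbitrary):

    import Darwin

    // Lock a self-managed buffer into RAM with mlock(2) so the pool's pages
    // can't be swapped out, then hand slices of it to image-processing tasks.
    let poolSize = 512 * 1024 * 1024    // 512 MiB, chosen arbitrarily for the sketch
    let pool = UnsafeMutableRawPointer.allocate(byteCount: poolSize,
                                                alignment: Int(getpagesize()))
    if mlock(pool, poolSize) != 0 {
        perror("mlock")                 // e.g. RLIMIT_MEMLOCK too low
    }
    // ... allocate image buffers out of `pool` here ...
    munlock(pool, poolSize)
    pool.deallocate()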


If I had all places where memory is allocated under my "jurisdiction", I think this would be a very applicable solution.

Unfortunately when calling APIs to process images, some opaque code that you don't see ends up doing allocations on your behalf, and you can rarely pass your own pool to allocate memory from.


For the record, mlockall on Linux can lock all future allocations, including those made by shared libraries. You don’t need to manage a memory pool yourself.

https://linux.die.net/man/2/mlock
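
Roughly (Linux-only, and it needs sufficient RLIMIT_MEMLOCK or privileges):

    import Glibc

    // Pin current and all future allocations into RAM, including
    // allocations made by opaque libraries on your behalf.
    if mlockall(MCL_CURRENT | MCL_FUTURE) != 0 {
        perror("mlockall")
    }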


That's very cool!

It looks like that API doesn't exist on macOS, though... :-(


That’s true. I didn’t consider that. I’d love to see a solution to it, though. Maybe there’s some sort of OS-level solution, like how you can set the priority of processes.


Node suffers similar problems, although I would describe them differently to the author.

Essentially:

1. All Node's async IO is lumped together into the same threadpool.

2. There is no distinction between the nature of each async IO task.

3. Async CPU tasks (fs.stat hitting the fs cache, async multi-core crypto, async native addons) complete orders of magnitude faster than async disk tasks (SSD or HDD), and these can be orders of magnitude faster than async network tasks (dns requests to a broken dns server).

4. There are three basic async performance profiles, fast (CPU), slow (disk), very slow (dns), but Node has no concept of this.

5. This leads to the Convoy effect. Imagine what happens when you race trucks, cars, and F1... all on the same race track.

6. The threadpool has a default size of only 4 threads, on the assumption that this reflects the typical number of CPU cores (and reduces context switches).

7. 4 threads is a bad default because it leads to surprising behavior (4 slow dns requests to untrusted servers are enough to DoS the process).

8. 4 threads is a bad default because libuv's memory cost of 128 threads is cheap.

9. 4 threads is a bad default because it prevents the CPU scheduler from running async CPU tasks while slow disk and slower DNS tasks are running. Concurrent CPU tasks should rather be limited to the number of cores available, while concurrent disk and DNS tasks should be given more than the number of cores available (context switches are better amortized for these).

10. Because everything is conflated, hard concurrency limits can't be enforced on fast, slow or slower tasks. It's all or nothing.

There are efforts underway to support multiple threadpools in Node (a threadpool for fast tasks sized to the number of cores, a threadpool for slow tasks sized larger, and a threadpool for slower tasks also sized larger). The goal is to get to the point where we can have separate race tracks in Node, with F1, cars and trucks on separate race tracks, controlled and raced to their full potential:

https://github.com/libuv/libuv/pull/1726


This is a common misconception. Node (libuv) only uses the thread pool as a last resort. Many operations nowadays have a native async OS API, so they don't need to use (and therefore don't block/fill) the thread pool.

https://medium.com/the-node-js-collection/what-you-should-kn...

https://stackoverflow.com/a/22644735/4208018


"Node (libuv) only uses the thread pool as a last resort."

When did you last read deps/uv/src/win/fs.c or deps/uv/src/unix/fs.c?

If by "last resort" you mean 99.9% of all async fs tasks, then I would agree with you.


isn't that because OSes have broken async filesystem APIs that can't be trusted?

But network requests (like dns) have well defined APIs for async, and should never be required to block a thread, and therefore do not need a threadpool.


Title is misleading. The author seems to mask what they're saying in a way that makes it really easy to conclude that their point is "Thread pools are bad, so I'll show you a better way." In fact, the proposed better way is ThreadPools(tm).

In fact, "The Unscalable, Deadlock-Prone, Thread Pool" only refers to a badly used ThreadPool. Like, really badly: the strawman that forms the crux of the argument is that you use a single ThreadPool for all of your disparate forms of work.

> My favourite approach assigns one global thread pool (queue) to each function or processing step.

Yes, creating different thread pools for different kinds of work is a much better way to use them...


The .NET libraries make it easy to use the default thread pool (a single pool). If other libraries are similar, then I wouldn't consider this a complete straw man.


Yes, but the .NET thread pool is auto-tuning, although very badly, with at least a second of latency before it responds to increased workload. For bursty requests that could be done within a few hundred milliseconds it is a bad fit. But you can't DoS your batch processing server so easily.


Nonetheless the single thread pool in .NET Core seems to scale really well.


> Like, really badly: the strawman

Maybe so, but I've seen this pattern a LOT.


Fair point. I haven't seen it much, but I don't doubt that it happens.

I just wanted to emphasize the point, which wasn't clear, that ThreadPools aren't bad, but rather that bad uses of them are bad.


Agreed. I feel that he should engage with the actor model in this article or a subsequent one because it seems to me like the type of abstraction he’s searching for. E.g. where each actor generally performs one type of work, each has a mailbox with a variety of parallelism strategies, message passing semantics discourage deadlock scenarios... Not a perfect solution but one to discuss when talking about the problems he’s trying to solve.


This feels language specific. Java has excellent thread support, for example; it's quite hard to build an application that deadlocks.

Thread-per-request is how everything used to be done. Most languages switched to pools because of huge fork/threading overhead.

Other languages have embraced async programming to deal with these issues, and use thread-per-core. But that introduces some fairly awkward coding constructions itself.

I believe the best approach is transparent fiber support. Java's Project Loom is a very ambitious attempt at this. And if you're okay with a "hacky" solution, you can already use fibers with Quasar's code generation.

In Java at least, the biggest barrier to fiber support is existing libraries not supporting it (looking at you, JDBC/JPA). Hopefully Project Loom will make existing code work with little or no modification.


The issue the OP is addressing covers your comment quite well. Async continuations, coroutines, and fibers can all race ahead and exhaust memory/DB connections/network sockets, or just saturate the IO subsystem. In a multi-step process this can easily lead to starvation in downstream steps.

The point of structuring an application this way is to prevent any one stage of your pipeline from consuming too many resources or stalling other stages (or just hammering some other system with requests that will inevitably timeout).

I don’t see how your comment about fiber support in Java applies.


I'd love to learn how to design this way. Any reading?


I wrote a full stack implementation for Ruby: https://github.com/socketry/async

Come and chat here: https://gitter.im/socketry/async

TruffleRuby folks are trying to support us using native fibers in the JVM, but I'm not sure when that will be working.


> I believe the best approach is transparent fiber support. Java's Project Loom is a very ambitious attempt at this.

I guess it was about time for us C# developers to have something to be envious of in the Java world :)

I'm really disappointed that the .NET world went the async/await route. This blog post best describes why: http://journal.stuffwithstuff.com/2015/02/01/what-color-is-y...


I agree. C# is wonderfully designed; the only reason I haven't switched is the huge number of wonderful libraries available in Java.

Also sad that C# went the async/await route. I mean, it works great, and it's better than what Java has now, but it's not backwards compatible the way I expect fibers to be. With how many third-party libraries we pull in, that is a gigantic selling point.


It is certainly the plan that existing code should work with little or no modification. Alan Bateman has been doing great work fixing the standard library to allow this, though there will be some limitations around JNI and other areas.

If you need to stop too many fibers running at once, this should be very simple to implement using any of java.util.concurrent’s mechanisms: if they block the fiber, it will be parked and another can be run.


The author spends a lot of time describing the problem, and not a lot on the solution.

The solution proposed seems to be some variant on dataflow programming: <https://en.wikipedia.org/wiki/Dataflow_programming> Am I misunderstanding?


> Once the problem has been identified, the solution becomes obvious: make sure the work we push to thread pools describes the resources to acquire before running the code in a dedicated thread.

> My favourite approach assigns one global thread pool (queue) to each function or processing step. The arguments to the functions will change, but the code is always the same, so the resource requirements are also well understood.

I think I get what he's saying. To avoid one stage in the multi-step computation acquiring too many resources and starving other stages:

- Cap the resources used by each stage.

- Create one threadpool per stage.

- Include sufficient information in pending jobs that the threadpool can calculate the resources needed to process each job.

- Only start a job running when it can be run without exceeding resource cap for the stage.

Conceptually, you can think of it as pre-allocating a bundle of resources to each processing stage: threads, memory, etc. Each processing stage then runs as many jobs concurrently as it can with the allocated bundle of resources. That's the idea, but in reality each job allocates its own resources from a common pool that is shared among all the processing stages, so the processing stage has to calculate how much memory, etc. each job will allocate before giving it a thread to run on.
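
A rough Swift sketch of that shape, one queue per stage with admission based on a declared memory estimate (the types, names, and budgets are illustrative, not the article's code):

    import Foundation

    // One pool (queue) per processing stage; a job declares its resource
    // needs up front and is only given a thread when the stage's budget allows.
    struct Job {
        let estimatedBytes: Int          // declared by the submitter
        let run: () -> Void
    }

    final class Stage {
        private let queue: DispatchQueue
        private let budget = NSCondition()
        private var availableBytes: Int

        init(name: String, memoryBudget: Int) {
            queue = DispatchQueue(label: name, attributes: .concurrent)
            availableBytes = memoryBudget
        }

        func submit(_ job: Job) {
            budget.lock()
            while availableBytes < job.estimatedBytes { budget.wait() }  // admission control
            availableBytes -= job.estimatedBytes
            budget.unlock()

            queue.async {
                job.run()
                self.budget.lock()
                self.availableBytes += job.estimatedBytes                // return the budget
                self.budget.broadcast()
                self.budget.unlock()
            }
        }
    }

    // e.g. a decode stage capped at 4 GiB and a resize stage capped at 1 GiB
    let decode = Stage(name: "decode", memoryBudget: 4 << 30)
    let resize = Stage(name: "resize", memoryBudget: 1 << 30)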


So does it boil down to "grab all your locks upfront, if you can't then yield them all back"?


Pretty much. Plus, if your code is even running, you have high confidence that lock acquisition will succeed, because the entity that gave you a thread to run on has checked that all the resources you need are available.


It sounded to me like a pipelining architecture similar to https://en.m.wikipedia.org/wiki/Instruction_pipelining with a thread pool per stage?


These are not the same thing. The major difference is that the hardware for each instruction stage can do only one thing, so it can be thought of more like a physical process.

For a software pipeline, you don't need to hand data off to a different thread; you just need to make sure the data is processed by the right code. That can happen on any thread, and the same execution/function can run on any number of threads at the same time, as long as the data is kept separate.


For some use cases, I like the approach of having a fixed number of main, long-lived threads/processes at runtime (matching the number of CPU cores on the machine to minimize context switching). Where possible, the threads/processes are themselves responsible for picking up the work that must be done. Each thread/process can use hashing/sharding to figure out which portion of the state it is responsible for and only operate on those parts. These kinds of threads should be handling the vast majority of the work. But of course, for this to be possible, the work needs to be parallelizable (servers/sockets make a good use case).

If you have occasional I/O tasks which don't require much CPU (e.g. disk I/O), you can use a separate threadpool for those, since the context-switching penalty will be minimal. Context switching is mostly a problem when all the threads/processes are using a lot (and an equal amount) of CPU. If a very heavy thread/process has to share CPU with a very lightweight thread/process, the context-switching penalty is not significant.
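
A minimal sketch of that sharding idea (the names and the hash-based routing are just for illustration):

    import Foundation

    // One long-lived serial queue per core; work is routed by hashing a key,
    // so each shard owns a disjoint slice of the state and needs no locking.
    let shardCount = ProcessInfo.processInfo.activeProcessorCount
    let shards = (0..<shardCount).map { DispatchQueue(label: "shard.\($0)") }

    func submit(key: String, work: @escaping () -> Void) {
        let shard = Int(UInt(bitPattern: key.hashValue) % UInt(shardCount))
        shards[shard].async { work() }
    }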


My experience (on Linux) is that:

- if you have lots of short-lived cpu-intensive tasks then a pool with one or two threads per cpu core (depending on how SMT works for you on your hardware) works well

- if you have lots of long-lived cpu-intensive tasks then just give each task a thread and let the OS schedule them

- if you have lots of io tasks, don't use threads; async io is a massive win

- if you have lots of io tasks on aws then you have to have high core counts even if they all sit idle because of the way credits are divvied up; even with massive amounts of purchased iops you don't get good io on aws compared to, say, the cheap laptop you do dev on ;)

I am so so so tempted to go into a particular mysql storage engine that my day job often relies upon and move it from one-thread-per-core to async io. Obviously that's a pipe-dream but the wins would be massive (on linux, on aws blah blah)

Having made this list I can see there are so many caveats that I'm not sure generalizations get anyone very far, sadly. It's like how the same server software has really different performance characteristics on different cloud providers vs dedicated hardware etc etc.


The exact terminology probably depends on your programming environment as well. I'm using Node.js now, and async IO (e.g. for disk access) uses a threadpool behind the scenes, but those threads use very little CPU; they're mostly idle, in fact. Their only two responsibilities are to start the IO operation and then send the data back to the parent process when the IO completes, so they're not a problem, though I guess it depends on how heavily they are used. Node.js is not really designed to be a DB engine, though, so this design works well.


> Once the problem has been identified, the solution becomes obvious: make sure the work we push to thread pools describes the resources to acquire before running the code in a dedicated thread. ... My favourite approach assigns one global thread pool (queue) to each function or processing step.

The author seems to do a remarkable job of basically describing actors without mentioning the word "actor". I use GPars [1] with Groovy which explicitly supports exactly what they talk about in a very elegant manner. You create a variety of actors and allocate them to different parallel groups so they cannot starve each other of resources, and most of these issues become controllable.

Perhaps it is the Lisp context that resists lumping state into the equation, since Actors usually carry state?

[1] http://www.gpars.org/


Well yeah, Go channels/goroutines too, et al.

The problem isn't just partition of opaque work on threads though, but how to schedule work so that resources other than threads are used efficiently.

GPars etc don't cover that.


> The moment we perform I/O… both the futures and the generic thread pool models fall short.

On Windows they usually work fine. OS kernel support is the key. The OS-provided thread pool scales really well. I’m talking about the modern one from threadpoolapiset.h, the StartThreadpoolIo / SubmitThreadpoolWork API functions.

> we might want to limit network requests going to individual hosts, independently from disk reads or writes, or from database transactions.

There are APIs like ReleaseSemaphoreWhenCallbackReturns and SetEventWhenCallbackReturns.


This sounds an awful lot like SEDA, but my memory of it is too fuzzy to see what the differences might be.


The author does reference SEDA in one of the footnotes.


Does he even know Rx observables?



