The Unscalable, Deadlock-Prone, Thread Pool (pvk.ca)
122 points by dmit on Feb 26, 2019 | 45 comments



I wrote a parallel image processing tool (in Swift) that essentially uses Grand Central Dispatch on macOS to process all files in a directory.

I hit the problem that GCD would happily let me enqueue tasks for processing, but each task needed quite a large amount of memory and GPU resources (because it would use CoreFilters that support GPU acceleration "when available"), which made the computer hit swap and slow to a crawl once the available RAM was exhausted.

I had to create an Admission Control component that would let me enqueue tasks only if there was enough memory left to process the image without risking swap. That part isn't as trivial as it seems: getting the amount of available memory for a process is vague on UNIX/macOS/Linux, since many OSes of this kind assume you can allocate an effectively unlimited amount of memory (see Linux's overcommit system) and that the virtual memory system will do the right thing for you. Nobody can guess whether physical RAM is about to be claimed by some other process, either, so I used a fairly conservative heuristic that probably left a small amount of resources unused most of the time, but would almost never slow to a crawl. Not perfect, but "good enough".
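
For illustration, a minimal Swift sketch of that kind of admission control, assuming a fixed per-task memory estimate rather than the real heuristic (the class, names, and budget below are made up for the example):

    import Foundation

    // Minimal admission-control sketch: admit only as many concurrent tasks
    // as fit in a conservative memory budget. Real code would estimate
    // available RAM dynamically instead of hard-coding a budget.
    final class AdmissionController {
        private let slots: DispatchSemaphore
        private let queue = DispatchQueue(label: "image.processing", attributes: .concurrent)

        init(memoryBudget: Int, bytesPerTask: Int) {
            slots = DispatchSemaphore(value: max(1, memoryBudget / bytesPerTask))
        }

        func submit(_ work: @escaping () -> Void) {
            slots.wait()                       // block enqueueing until a slot frees up
            queue.async {
                defer { self.slots.signal() }  // return the slot when the task finishes
                work()
            }
        }
    }

    // e.g. ~8 GiB budget, ~1 GiB per image
    let controller = AdmissionController(memoryBudget: 8 << 30, bytesPerTask: 1 << 30)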


It seems you have a producer-consumer problem, where feeding in work to do faster than it can be processed only leads to lengthy queues, and running more tasks in parallel than you have cores only increases memory usage and task switching overhead.

Such processes rarely are memory-bound, and from what you write (it would use CoreFilters that support GPU acceleration "when available"), yours aren’t either. So why would you want to limit the number of consumers by looking at memory usage?

ioquatix’s suggestion to measure latency is better, but more work, so my first go would be to manage queue length, as that is simpler to do and often effective enough.

Determine how many consumer processes your system can really handle in parallel and limit the number of queued GCD tasks to, say, twice that number. The right factor depends on how bursty the producer’s output is and on how CPU-bound the consumers are.

If memory or I/O isn’t your bottleneck, you want to minimize the number of items ‘in flight’ needed to keep your CPU or GPU at 100% usage.
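
For what it's worth, a rough GCD sketch of that bound (the names and the 2x factor are illustrative):

    import Foundation

    // Cap the number of work items "in flight" at roughly 2x the core count,
    // so the producer blocks instead of building an unbounded queue.
    let maxInFlight = 2 * ProcessInfo.processInfo.activeProcessorCount
    let inFlight = DispatchSemaphore(value: maxInFlight)
    let consumers = DispatchQueue(label: "consumers", attributes: .concurrent)

    func enqueue(_ item: @escaping () -> Void) {
        inFlight.wait()                  // producer waits once the bound is reached
        consumers.async {
            defer { inFlight.signal() }  // free a slot when the item completes
            item()
        }
    }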


> Such processes rarely are memory-bound, and from what you write (it would use CoreFilters that support GPU acceleration "when available"), yours aren’t either. So why would you want to limit the number of consumers by looking at memory usage?

In my case it was clearly memory bound. I demonstrated it through instrumentation and profiling, then addressed it by using an admission control scheme exclusively built on memory.


If you can build a system which measures the latency of task processing and dynamically adjusts the resource allocation, I think you can be robust in the face of changing underlying conditions. It's non-trivial, but easier to look for spikes in latency than trying to estimate resource allocation/utilisation.


Right - we need holistic scheduling systems that balance all resources and tasks: for example, scheduling an IO-heavy task alongside a compute-heavy task and letting both run well, rather than picking two compute-heavy tasks at the same time, and only scheduling as many tasks as there is RAM, IO, and other resource capacity to use productively. Lots of people are trying, because at data-centre scale, and with serverless letting providers run your code wherever they want, this could be a huge advantage... not sure how much success anyone has had yet.


I wonder if something as simple as tagging would give a huge benefit towards that. I should know if my job uses lots of memory (often I know exactly how much), or IO, or CPU.


This is something I have pondered for the Bitbake build system. It compiles a Linux "distribution" for embedded systems and uses a lot of CPU, RAM, and I/O. Often one is starving the other. I use a spinning disk for build dirs currently because they use a lot of space. Even with an SSD there is room for better scheduling.


That’s quite an interesting problem. The mlock [1] system call allows you to lock chosen virtual memory into RAM. What about using that in combination with a memory pool which you manage yourself?

[1] https://developer.apple.com/library/archive/documentation/Sy...
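
Something like this, perhaps (a sketch only; real code needs error handling and enough RLIMIT_MEMLOCK, and the pool size is arbitrary):

    import Darwin

    // Lock a self-managed buffer into RAM with mlock(2) so the pool's pages
    // can't be swapped out, then hand slices of it to image-processing tasks.
    let poolSize = 512 * 1024 * 1024    // 512 MiB, chosen arbitrarily for the sketch
    let pool = UnsafeMutableRawPointer.allocate(byteCount: poolSize,
                                                alignment: Int(getpagesize()))
    if mlock(pool, poolSize) != 0 {
        perror("mlock")                 // e.g. RLIMIT_MEMLOCK too low
    }
    // ... allocate image buffers out of `pool` here ...
    munlock(pool, poolSize)
    pool.deallocate()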


If I had all places where memory is allocated under my "jurisdiction", I think this would be a very applicable solution.

Unfortunately when calling APIs to process images, some opaque code that you don't see ends up doing allocations on your behalf, and you can rarely pass your own pool to allocate memory from.


For the record, mlockall on Linux can lock all future allocations, including those made by shared libraries. You don’t need to manage a memory pool yourself.

https://linux.die.net/man/2/mlock
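
Roughly (Linux-only, and it needs sufficient RLIMIT_MEMLOCK or privileges):

    import Glibc

    // Pin current and all future allocations into RAM, including
    // allocations made by opaque libraries on your behalf.
    if mlockall(MCL_CURRENT | MCL_FUTURE) != 0 {
        perror("mlockall")
    }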


That's very cool!

It looks like that API doesn't exist on macOS, though... :-(


That’s true. I didn’t consider that. I’d love to see a solution to it, though. Maybe there’s some sort of OS-level solution, like how you can set the priority of processes.


Node suffers similar problems, although I would describe them differently to the author.

Essentially:

1. All Node's async IO is lumped together into the same threadpool.

2. There is no distinction between the nature of each async IO task.

3. Async CPU tasks (fs.stat hitting the fs cache, async multi-core crypto, async native addons) complete orders of magnitude faster than async disk tasks (SSD or HDD), and these can be orders of magnitude faster than async network tasks (dns requests to a broken dns server).

4. There are three basic async performance profiles, fast (CPU), slow (disk), very slow (dns), but Node has no concept of this.

5. This leads to the Convoy effect. Imagine what happens when you race trucks, cars, and F1... all on the same race track.

6. The threadpool has a default size of only 4 threads, on the assumption that this reflects the typical number of CPU cores (and reduces context switches).

7. 4 threads is a bad default because it leads to surprising behavior (4 slow dns requests to untrusted servers are enough to DoS the process).

8. 4 threads is a bad default because libuv's memory cost of 128 threads is cheap.

9. 4 threads is a bad default because it prevents the CPU scheduler from running async CPU tasks while slow disk and slower DNS tasks are running. Concurrent CPU tasks should rather be limited to the number of cores available, while concurrent disk and DNS tasks should be given more than the number of cores available (context switches are better amortized for these).

10. Because everything is conflated, hard concurrency limits can't be enforced on fast, slow or slower tasks. It's all or nothing.

There are efforts underway to support multiple threadpools in Node (a threadpool for fast tasks sized to the number of cores, a threadpool for slow tasks sized larger, and a threadpool for slower tasks also sized larger). The goal is to get to the point where we can have separate race tracks in Node, with F1, cars and trucks on separate race tracks, controlled and raced to their full potential:

https://github.com/libuv/libuv/pull/1726


This is a common misconception. Node (libuv) only uses the thread pool as a last resort. Many operations nowadays have a native async OS API, so they don't need to use (and therefore don't block/fill) the thread pool.

https://medium.com/the-node-js-collection/what-you-should-kn...

https://stackoverflow.com/a/22644735/4208018


"Node (libuv) only uses the thread pool as a last resort."

When did you last read deps/uv/src/win/fs.c or deps/uv/src/unix/fs.c?

If by "last resort" you mean 99.9% of all async fs tasks, then I would agree with you.


isn't that because OSes have broken async filesystem APIs that can't be trusted?

But network requests (like dns) have well defined APIs for async, and should never be required to block a thread, and therefore do not need a threadpool.


Title is misleading. The author seems to mask what they're saying in a way that makes it really easy to conclude that their point is "Thread pools are bad, so I'll show you a better way." In fact, the proposed better way is ThreadPools(tm).

In fact, "The Unscalable, Deadlock-Prone, Thread Pool" only refers to a badly used ThreadPool. Like, really badly: the strawman that forms the crux of the argument is that you use a single ThreadPool for all of your disparate forms of work.

> My favourite approach assigns one global thread pool (queue) to each function or processing step.

Yes, creating different thread pools for different kinds of work is a much better way to use them...


The .NET libraries make it easy to use the default thread pool (a single pool). If other libraries are similar, then I wouldn't consider this a complete straw man.


Yes, but the .NET thread pool is auto-tuning, although very badly, with at least a second of latency before it responds to increased workload. For bursty requests that could be done within a few hundred milliseconds it is a bad fit. But you can't DoS your batch processing server so easily.


Nonetheless the single thread pool in .NET Core seems to scale really well.


> Like, really badly: the strawman

Maybe so, but I've seen this pattern a LOT.


Fair point. I haven't seen it much, but I don't doubt that it happens.

I just wanted to emphasize the point, which wasn't clear, that ThreadPools aren't bad, but rather that bad uses of them are bad.


Agreed. I feel that he should engage with the actor model in this article or a subsequent one because it seems to me like the type of abstraction he’s searching for. E.g. where each actor generally performs one type of work, each has a mailbox with a variety of parallelism strategies, message passing semantics discourage deadlock scenarios... Not a perfect solution but one to discuss when talking about the problems he’s trying to solve.


This feels language specific. Java has excellent thread support, for example; it's quite hard to build an application that deadlocks.

Thread-per-request is how everything used to be done. Most languages switched to pools because of huge fork/threading overhead.

Other languages have embraced async programming to deal with these issues, and use thread-per-core. But that introduces some fairly awkward coding constructions itself.

I believe the best approach is transparent fiber support. Java's Project Loom is a very ambitious attempt at this. And if you're okay with a "hacky" solution, you can already use fibers with Quasar's code generation.

In Java at least, the biggest barrier to fiber support is existing libraries not supporting it (looking at you, JDBC/JPA). Hopefully Project Loom will make existing code work with little or no modification.


The issue the OP is addressing covers your comment quite well. Async continuations, coroutines, and fibers can all race ahead and exhaust memory/DB connections/network sockets, or just saturate the IO subsystem. In a multi-step process this can easily lead to starvation in downstream steps.

The point of structuring an application this way is to prevent any one stage of your pipeline from consuming too many resources or stalling other stages (or just hammering some other system with requests that will inevitably timeout).

I don’t see how your comment about fiber support in Java applies.


I'd love to learn how to design this way. Any reading?


I wrote a full stack implementation for Ruby: https://github.com/socketry/async

Come and chat here: https://gitter.im/socketry/async

TruffleRuby folks are trying to support us using native fibers in the JVM, but I'm not sure when that will be working.


> I believe the best approach is transparent fiber support. Java's Project Loom is a very ambitious attempt at this.

I guess it was about time for us C# developers to have something to be envious of in the Java world :)

I'm really disappointed that the .NET world went the async/await route. This blog post best describes why: http://journal.stuffwithstuff.com/2015/02/01/what-color-is-y...


I agree. C# is wonderfully designed; the only reason I haven't switched is the huge number of wonderful libraries available in Java.

Also sad that C# went the async/await route. I mean, it works great, and it's better than what Java has now, but it's not backwards compatible the way I expect fibers to be. With how many third-party libraries we pull in, that is a gigantic selling point.


It is certainly the plan that existing code should work with little or no modification. Alan Bateman has been doing great work fixing the standard library to allow this, though there will be some limitations around JNI and other areas.

If you need to stop too many fibers running at once, this should be very simple to implement using any of java.util.concurrent’s mechanisms: if they block the fiber, it will be parked and another can be run.


The author spends a lot of time describing the problem, and not a lot on the solution.

The solution proposed seems to be some variant on dataflow programming: <https://en.wikipedia.org/wiki/Dataflow_programming> Am I misunderstanding?


> Once the problem has been identified, the solution becomes obvious: make sure the work we push to thread pools describes the resources to acquire before running the code in a dedicated thread.

> My favourite approach assigns one global thread pool (queue) to each function or processing step. The arguments to the functions will change, but the code is always the same, so the resource requirements are also well understood.

I think I get what he's saying. To avoid one stage in the multi-step computation acquiring too many resources and starving other stages:

- Cap the resources used by each stage.

- Create one threadpool per stage.

- Include sufficient information in pending jobs that the threadpool can calculate the resources needed to process each job.

- Only start a job running when it can be run without exceeding resource cap for the stage.

Conceptually, you can think of it as pre-allocating a bundle of resources to each processing stage: threads, memory, etc. Each processing stage then runs as many jobs concurrently as it can with the allocated bundle of resources. That's the idea, but in reality each job allocates its own resources from a common pool that is shared among all the processing stages, so the processing stage has to calculate how much memory, etc. each job will allocate before giving it a thread to run on.
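
A rough Swift sketch of that shape, one queue per stage with admission based on a declared memory estimate (the types, names, and budgets are illustrative, not the article's code):

    import Foundation

    // One pool (queue) per processing stage; a job declares its resource
    // needs up front and is only given a thread when the stage's budget allows.
    struct Job {
        let estimatedBytes: Int          // declared by the submitter
        let run: () -> Void
    }

    final class Stage {
        private let queue: DispatchQueue
        private let budget = NSCondition()
        private var availableBytes: Int

        init(name: String, memoryBudget: Int) {
            queue = DispatchQueue(label: name, attributes: .concurrent)
            availableBytes = memoryBudget
        }

        func submit(_ job: Job) {
            budget.lock()
            while availableBytes < job.estimatedBytes { budget.wait() }  // admission control
            availableBytes -= job.estimatedBytes
            budget.unlock()

            queue.async {
                job.run()
                self.budget.lock()
                self.availableBytes += job.estimatedBytes                // return the budget
                self.budget.broadcast()
                self.budget.unlock()
            }
        }
    }

    // e.g. a decode stage capped at 4 GiB and a resize stage capped at 1 GiB
    let decode = Stage(name: "decode", memoryBudget: 4 << 30)
    let resize = Stage(name: "resize", memoryBudget: 1 << 30)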


So does it boil down to "grab all your locks upfront, if you can't then yield them all back"?


Pretty much. Plus, if your code is even running, you have high confidence that lock acquisition will succeed, because the entity that gave you a thread to run on has checked that all the resources you need are available.


It sounded to me like a pipelining architecture similar to https://en.m.wikipedia.org/wiki/Instruction_pipelining with a thread pool per stage?


These are not the same thing. The major difference is that the hardware for each instruction stage can do only one thing, so it can be thought of more like a physical process.

For a software pipeline, you don't need to hand data off to a different thread; you just need to make sure the data is processed by the right code. That can happen on any thread, and the same execution/function can run on any number of threads at the same time, as long as the data is kept separate.


For some use cases, I like the approach of having a fixed number of main, long-lived threads/processes at runtime (matching the number of CPU cores on the machine to minimize context switching). Where possible, the threads/processes are themselves responsible for picking up the work that must be done. Each thread/process can use hashing/sharding to figure out which portion of the state it is responsible for and only operate on those parts. These kinds of threads should be handling the vast majority of the work. But of course, for this to be possible, the work needs to be parallelizable (servers/sockets make a good use case).

If you have occasional I/O tasks which don't require much CPU (e.g. disk I/O), you can use a separate threadpool for those, since the context-switching penalty will be minimal. Context switching is mostly a problem when all the threads/processes are using a lot (and an equal amount) of CPU. If a very heavy thread/process has to share CPU with a very lightweight thread/process, the context-switching penalty is not significant.
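
A minimal sketch of that sharding idea (the names and the hash-based routing are just for illustration):

    import Foundation

    // One long-lived serial queue per core; work is routed by hashing a key,
    // so each shard owns a disjoint slice of the state and needs no locking.
    let shardCount = ProcessInfo.processInfo.activeProcessorCount
    let shards = (0..<shardCount).map { DispatchQueue(label: "shard.\($0)") }

    func submit(key: String, work: @escaping () -> Void) {
        let shard = Int(UInt(bitPattern: key.hashValue) % UInt(shardCount))
        shards[shard].async { work() }
    }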


My experience (on Linux) is that:

- if you have lots of short-lived cpu-intensive tasks then a pool with one or two threads per cpu core (depending on how SMT works for you on your hardware) works well

- if you have lots of long-lived cpu-intensive tasks then just give each task a thread and let the OS schedule them

- if you have lots of io tasks, don't use threads; async io is a massive win

- if you have lots of io tasks on aws then you have to have high core counts even if they all sit idle because of the way credits are divvied up; even with massive amounts of purchased iops you don't get good io on aws compared to, say, the cheap laptop you do dev on ;)

I am so so so tempted to go into a particular mysql storage engine that my day job often relies upon and move it from one-thread-per-core to async io. Obviously that's a pipe-dream but the wins would be massive (on linux, on aws blah blah)

Having made this list I can see there are so many caveats that I'm not sure generalizations get anyone very far, sadly. It's like how the same server software has really different performance characteristics on different cloud providers vs dedicated hardware etc etc.


The exact terminology probably depends on your programming environment as well. I'm using Node.js now, and async IO (e.g. for disk access) uses a threadpool behind the scenes, but those threads use very little CPU; they're mostly idle, in fact. Their only two responsibilities are to start the IO operation and then send the data back to the parent process when the IO completes, so they're not a problem, though I guess it depends on how heavily they are used. Node.js is not really designed to be a DB engine, though, so this design works well.


> Once the problem has been identified, the solution becomes obvious: make sure the work we push to thread pools describes the resources to acquire before running the code in a dedicated thread. ... My favourite approach assigns one global thread pool (queue) to each function or processing step.

The author seems to do a remarkable job of basically describing actors without mentioning the word "actor". I use GPars [1] with Groovy which explicitly supports exactly what they talk about in a very elegant manner. You create a variety of actors and allocate them to different parallel groups so they cannot starve each other of resources, and most of these issues become controllable.

Perhaps it is the Lisp context that resists lumping state into the equation, since Actors usually carry state?

[1] http://www.gpars.org/


Well yeah, Go channels/goroutines too, et al.

The problem isn't just partition of opaque work on threads though, but how to schedule work so that resources other than threads are used efficiently.

GPars etc don't cover that.


> The moment we perform I/O… both the futures and the generic thread pool models fall short.

On Windows they usually work fine. OS kernel support is the key. The OS-provided thread pool scales really well. I’m talking about the modern one from threadpoolapiset.h, the StartThreadpoolIo / SubmitThreadpoolWork API functions.

> we might want to limit network requests going to individual hosts, independently from disk reads or writes, or from database transactions.

There are APIs like ReleaseSemaphoreWhenCallbackReturns and SetEventWhenCallbackReturns.


This sounds an awful lot like SEDA, but my memory of it is too fuzzy to see what the differences might be.


The author does reference SEDA in one of the footnotes.


Does he even know Rx observables?



