It's always great to see a robust new threadpool. Most of the stock ones I've used have horrible characteristics - for example on my Threadripper lots of production apps will spawn 48 or more threadpool threads and the scheduling/throughput characteristics on that are generally pretty dire because they needlessly compete for resources, etc. They tend to allocate garbage for every work item too, which makes them a bad choice for realtime scenarios. It's genuinely frustrating that most processes on my machine have 128 threads or more due to unintelligently scaling off ProcessorCount without an upper limit. For example, GIMP's bad threading policy means that operations like resizing an image take seconds instead of 100ms or less.
I've been using a custom threadpool for a while now but writing your own without the expertise necessary can be a real pain.
I don't think that it's the threadpool's fault that an application uses it incorrectly. Also, I think there are a lot of developers who have not considered that on today's machines, just spawning as many threads as there are cores is the optimal amount of threads in a thread pool for every use case. I wouldn't say it's the case of bad software but rather software that was written for the CPUs of 5 years ago. And generally, I don't think this will be an easy problem to solve due to the variety of heterogeneous and non-heterogeneous topology of modern SoCs. In fact, I don't see this specific threadpool doing anything to optimize for the disparate core clusters of your threadripper, or to acknowledge the disparity between core clusters on the M1.
> I don't see this specific threadpool doing anything to optimize for the disparate core clusters of your threadripper, or to acknowledge the disparity between core clusters on the M1.
To do that efficiently you need to pin thread groups to cores based on having information of data usage.
This smells like over-optimizing architectures to me, if you want to go beyond separating stuff like hyper-threads on io.
Additional annoyance: There is no POSIX way to get hyperthreads and physical ones.
I think a general purpose threadpool should work well on general purpose hardware, and it seems like the most popular SoCs on consumer devices will have heterogeneous cores et al, so a good implementation would schedule the threadpool appropriately. I agree that there is no POSIX way to distinguish between hyper threads and regular threads, and this is something that should be improved. I'm not saying that the achievements made by the threadpool implementation are lackluster or that any of the other solutions solve the issues I outline any better. What I am saying that the comment I was originally referring was somewhat mistaken about the benefits of a more optimal, yet naive threadpool library.
This isn't just about hyperthreads, by the way. As long as the workload isn't compute heavy and often stalls on memory, hyperthreads are just as good as regular threads. On a hardware level, there is no distinction between a regular and a hyperthread core. Either you multiplex a single physical core or you don't. Anyway, there is more to it than slower threads and faster threads - accessing memory between threads will be slower depending on which core is trying to access which bits of memory - a core stealing work from a sibling core on the same chiplet will probably be able to do that quicker than stealing work from a core across the cluster if the data prefetcher has been doing it's job correctly. Spawning more threads than necessary might force a CPU to power up more cores than necessary, resulting in slower performance per core performance and worse power efficiency, especially if a fast or slow cluster needs to be turned on, where a more optimal scheduling of threads might not force that to happen.
I think a general purpose thread pool by default should no longer spawn as many threads as there are _cores_, whatever that term even means, with optional toggles to inform whether the work that'll be scheduled will be compute heavy or not.
> just spawning as many threads as there are cores is the optimal amount of threads in a thread pool for every use case
Absolutely not.
Any task with any amount of I/O will have a significant amount of blocking. A GPU kernel may take microseconds or milliseconds to respond. RDMA (a memory-access over Ethernet) may take many microseconds.
Having multiple threads per core would be more efficient: it gives the cores something to do while waiting for SSD, RDMA, or GPUs. Remember: even the earliest single-core systems from the 1970s had multiple threads on one core: its just more efficient to have multiple terminals to read/write to at a time.
--------
One hardware-thread per thread (since SMT8 machines exist like POWER9 / POWER10) is only efficient in the most computationally expensive situations. Which is in fact, a rarity in today's world. Your typical programs will be waiting on the network interface or SSD.
IIRC: there's professional thread-pools out there that are 1.5x threads per hardware thread as a default option, and then scale up/down depending on how computationally expensive things look. That is: a 64-core/128-thread Threadripper would be overloaded with 192 threads, under the assumption that at least 33% of them would be waiting on I/O at any given time.
>Any task with any amount of I/O will have a significant amount of blocking. A GPU kernel may take microseconds or milliseconds to respond. RDMA (a memory-access over Ethernet) may take many microseconds.
The argument I failed to make was that with heterogeneous distribution of memory bandwidth and compute resources, most user applications would benefit from spawning less threads than all available cores. In I/O heavy workloads, the correct thing to do is to do asynchronous I/O. This can be done for SSDs and GPUs. On contemporary systems where there's heavy costs associated with context switching ,avoiding them and servicing multiple tasks without blocking will always be superior to spawning more threads to do more blocking.
When it comes to hyperthreading, I assume that the threads are cores - because from the OS perspective, they are, and you cannot distinguish two hyperthreads running on a single core anyway.
Also, I apologize, but the first sentence you're quoting is not not what I intended to write - my point is that most application developers might still think that spawning as many threads as there are cores is a reasonable thing to do in all cases - but with CPUs that comprise of 4 core clusters with 16 cores each, it's often better to spawn far less than the total amount of cores available.
> In I/O heavy workloads, the correct thing to do is to do asynchronous I/O
You can't async mmap into memory reads (or the GPU-equivalent: cudaMallocManaged).
Today's I/O is looking more-and-more like a memory read/write. As such, your typical "node = node->next" pointer traversals could very well be an incredible string of I/O. That indirection could be in RAM, over SSD (mmap), over RDMA (ethernet pretending to be RAM), or into GPU-RAM (cudaMallocManaged)... with an appropriate PCIe command (possibly atomic-PCIe) to boot.
Async only works on the simplest of reading/writing patterns.
EDIT: Consider the "persistent kernel" pattern on GPGPU (which starts off with a lot of the similar thought process you have on CPU-land. Launch only X wavefronts, where X is the size of the GPU). You primarily communicate with a persistent kernel over RAM / atomics. You don't use the classic cuda <<<blocks, threads>>>() (which admittingly has an async interface)... Instead: you read/write to magic managed-memory locations that will be eventually sync'd over PCIe to the GPU. This is because the "persistent kernel" is always executing. You launched it upon program start, there's no event to async-trigger on. You just pass data to the GPU. The GPU operates on the data at its leisure. Then it eventually returns data to a buffer elsewhere (probably with atomic compare-and-swaps, which traverse PCIe these days, to ensure coherence)
I recently spent some time reading the kernel code for io-uring - however only for older revisions (5.4 - 5.7). There I found out that a lot of it is actually implemented on top of existing kernel functions for blocking IO and polled IO, and does not replace it and make the system fully asynchronous. Pratically that means with a 5.4 kernel doing lots of disk IO you will still have lots of threads being blocked on IO - only in this case those will be threads inside a kernel threadpool instead of in userspace. With 5.7 that model changed, and the threadpool is no longer necessary for read/write operations on sockets. Maybe also for files - but I don't really understand the kernel code well enough to confirm or deny that. And things might obviously also have changed for newer Kernel versions.
CUDA-streams probably already use io_uring under the hood.
The issue is that a CUDA-stream based execution kernel will spend lots of time pulling from the queue and spinning up threads (sound like a familiar problem?).
Instead: you sidestep this issue with persistent kernels: you launch exactly the number of kernels that matches your GPU-size, and then pass data to those kernels through an alternative means.
-------
The future of I/O is going to be atomic-operations across PCIe with stronger-and-stronger coherency models. PCIe 3.0 introduced atomic-PCIe commands. PCIe 4.0 strengthened them. PCIe 5.0 is rumored to have even stronger coherency rules and "coherent grids" are being seriously discussed / implemented in high-speed computers (see Compute eXpress Link, or CXL)
Any serious I/O (and GPUs will be part of that I/O future), will likely use PCIe atomics in some regards, quite similar to atomic-compare-and-swaps that the CPU does already between their cores.
I/O is becoming indistinguishable from memcpy + atomics.
-------
EDIT: In effect, I'm saying that GPU-programmers are more concerned about GPU-efficiency rather than CPU-efficiency. Sure, the CPU is less efficient spinning on these memory-reads/writes. But the GPU is where the bulk of the calculations are happening. Therefore, it behooves the architect to optimize the GPU (even if the CPU ends up slightly less efficient).
Whatever argument you have about "queues being more efficient in hypothetical architecture", the GPU has more calculations and more cores to feed (aka: far more overhead when it comes to Amdahl's law). That means you want to apply those principles first to the GPU, and then the CPU "makes up for it", so to speak.
I was under the impression that PCI-E was perfectly capable of sending notifications from one device to another in a somewhat efficient manner. Having said that, this is not my area of expertise - and I do see that if your main concern is to feed the GPU then blocking a thread might be the optimal solution. I assume that MSI would be too much overhead and might involve some context switching to service the interrupt from the kernel etc to allow for asynchronous completion? Also, is it possible to have overlapping memory regions between a high speed networking card and the input buffer from the GPU, which in effect just means that the CPU just has to tell the GPU to start reading once the network card is done receiving?
Having said that, I don't believe that for most application developers this is a major concern - in cases where you flood the GPU with a firehose of data to compute on you probably also don't care about what other processes run on the machine and whether your architectural decisions end up making people's laps uncomfortably hot. I also do not believe that the future of all I/O is just memcpy and atomics - we can already do that today. It doesn't really bring you any advantages for speed in the general case. I think the future of I/O is memcpy, atomics and a good signaling mechanism to signal I/O task completion without costly context switches with as little extraneous memory allocation as possible. Moreover, the future of consumer computing will probably not rely on PCI-E at all and instead have the GPU and the CPU share all of it's memory. And hey, maybe Nvidia will add some cool ARM companion cores to their biggest chips, slap on some DDR5 slots on their cards and sell self-contained solutions, sidestepping PCI-E entirely, at least for feeding the data from the CPU to the GPU.
Recent datacenter network controlers (Mellanox, marvell) have 'gpu direct' capabilities, so direct interactions with devices with no cpu interaction. I've also seen fpga+network boards do that with success. And with libraries like nccl and 200gbe eth links you could almost forget you have CPUs or network links between.
What I miss is a simple but efficient data queue between cpu and gpu. Everyone's doing manual memory reservation and cudamemcpy, I want an async send (gpu->cpu) with an mpi or socket-like interface. I've seen someone posting stuff on io_uring from gpu code, but just bragging, no code.
Buying Mellanox, and their bluefield dpu (integrated 8 or 16 arm cores in the NIC) stuff I feel, nvidia could probably go the way you're seeing. Haven't seen any Mellanox/NVIDIA tech convergence yet.
On windows, no. You can use affinity to prevent the application from running on some of your cores but it will still spin up a ton of threads (which makes the situation even worse sometimes)
I've been using a custom threadpool for a while now but writing your own without the expertise necessary can be a real pain.