> I don't see this specific threadpool doing anything to optimize for the disparate core clusters of your threadripper, or to acknowledge the disparity between core clusters on the M1.
To do that efficiently you need to pin thread groups to cores based on information about data usage.
This smells like over-optimizing for specific architectures to me, if you want to go beyond separating stuff like hyper-threads for I/O.
Additional annoyance: there is no POSIX way to distinguish hyperthreads from physical cores.
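For what it's worth, both halves of that end up non-portable in practice: on Linux the SMT topology comes from sysfs and pinning goes through a glibc extension rather than anything POSIX. A minimal sketch, assuming Linux and glibc (the helper names are illustrative):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one logical CPU (glibc extension, not POSIX). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Read the SMT siblings of a logical CPU from sysfs (Linux-specific).
   A result like "0,16" means CPUs 0 and 16 share one physical core. */
static int read_smt_siblings(int cpu, char *buf, size_t len)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fgets(buf, (int)len, f) != NULL;
    fclose(f);
    return ok ? 0 : -1;
}
```

On macOS you'd be reaching for sysctl instead (hw.physicalcpu vs hw.logicalcpu), and explicit pinning isn't really available there at all, which is exactly the portability complaint.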
I think a general purpose threadpool should work well on general purpose hardware, and it seems like the most popular SoCs in consumer devices will have heterogeneous cores and the like, so a good implementation would schedule the threadpool appropriately. I agree that there is no POSIX way to distinguish between hyperthreads and physical cores, and that is something that should be improved. I'm not saying that the achievements of this threadpool implementation are lackluster, or that any of the other solutions handle the issues I outline any better. What I am saying is that the comment I was originally referring to was somewhat mistaken about the benefits of a more optimal, yet still naive, threadpool library.
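On the heterogeneous-core point, there is no portable query for that either. Some ARM Linux systems expose a relative per-CPU capacity through sysfs, which a pool could at least use to weight its workers toward the big cluster; a rough sketch, assuming a kernel and devicetree that actually populate the file (it's absent on x86 and many older setups):

```c
#include <stdio.h>

/* Relative CPU capacity (typically scaled to 1024 for the fastest core) on
   ARM Linux systems that expose it; returns -1 if the file is missing. */
static long cpu_capacity(int cpu)
{
    char path[96];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpu_capacity", cpu);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    long cap = -1;
    if (fscanf(f, "%ld", &cap) != 1)
        cap = -1;
    fclose(f);
    return cap;
}
```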
This isn't just about hyperthreads, by the way. As long as the workload isn't compute heavy and often stalls on memory, hyperthreads are just as good as regular threads. At the hardware level there is no distinction between a "regular" core and a "hyperthread" core; either a physical core is multiplexed or it isn't.

Anyway, there is more to it than slower threads and faster threads. Memory access costs depend on which core is touching which bits of memory: a core stealing work from a sibling on the same chiplet will probably manage it quicker than stealing from a core across the cluster, assuming the data prefetcher has been doing its job. And spawning more threads than necessary might force the CPU to power up more cores than necessary, resulting in lower per-core performance and worse power efficiency, especially if an extra fast or slow cluster has to be woken up where more careful scheduling of threads might have avoided it.
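The chiplet part can at least be approximated without knowing anything about data usage: Linux reports which logical CPUs share a last-level cache, so a work-stealing pool could try same-L3 victims before crossing to another CCX or cluster. A sketch of reading that grouping (Linux-only, and index3 being the L3 is the common case rather than a guarantee):

```c
#include <stdio.h>
#include <string.h>

/* CPUs sharing the last-level cache with `cpu`, e.g. "0-7" for one CCX.
   Linux-specific; cache/index3 is usually the L3 but not guaranteed to be. */
static int shared_l3_cpus(int cpu, char *buf, size_t len)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list", cpu);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fgets(buf, (int)len, f) != NULL;
    fclose(f);
    if (ok)
        buf[strcspn(buf, "\n")] = '\0';
    return ok ? 0 : -1;
}
```

A stealing worker could then walk victims in two passes: first the queues of CPUs in its own shared list, then everyone else, which roughly matches "steal from the sibling on the same chiplet before going across the cluster".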
I think a general purpose thread pool should no longer default to spawning as many threads as there are _cores_, whatever that term even means, and should instead offer optional toggles to indicate whether the scheduled work will be compute heavy or not.
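Concretely, that could just be a sizing hint instead of the usual one-thread-per-core default; the actual split below is an arbitrary placeholder policy, not a recommendation:

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical default-sizing policy: compute-heavy pools get one worker per
   online logical CPU, everything else starts at half and can grow on demand. */
static unsigned default_worker_count(bool compute_heavy)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    if (online < 1)
        online = 1;
    if (compute_heavy)
        return (unsigned)online;
    return (unsigned)((online + 1) / 2);
}
```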