This article is clickbait and in no way has the kernel been hardcoded to a maximum of 8 cores.
If you read the commit [0], you can see that a /certain/ scaling factor for scheduling can scale linearly or logarithmically with the number of cores, and that for calculating this scaling factor the CPU count is capped at 8. This has nothing to do with the number of cores that can actually be used.
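My reading of that scaling, rewritten as a small userspace C sketch (a paraphrase, not the kernel source verbatim; the 0.75 ms base value is my assumption for the default min_granularity, the cap and the logarithmic scaling are the point):

```c
/* Paraphrase of the capped scaling: the CPU count that feeds the scaling
 * factor is limited to 8, and with the default logarithmic policy the
 * factor stops growing at 1 + log2(8) = 4. Not the actual kernel code. */
#include <stdio.h>

static unsigned int ilog2u(unsigned int n)      /* integer log2, n >= 1 */
{
    unsigned int r = 0;
    while (n >>= 1)
        r++;
    return r;
}

static unsigned int scaling_factor(unsigned int online_cpus)
{
    unsigned int cpus = online_cpus < 8 ? online_cpus : 8;   /* the cap */
    return 1 + ilog2u(cpus);             /* default logarithmic policy */
}

int main(void)
{
    unsigned int cpus[] = { 1, 2, 4, 8, 16, 64, 128 };
    /* 0.75 ms base for min_granularity is an assumption for illustration. */
    for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
        printf("%3u cpus -> factor %u -> min_granularity ~%.2f ms\n",
               cpus[i], scaling_factor(cpus[i]),
               0.75 * scaling_factor(cpus[i]));
    return 0;
}
```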
If I've understood this article right (possibly not, as it's not so clear):
* On single core machines the scheduler interval is quite fine grained to ensure responsiveness.
* As the number of cores grows, the scheduler interval increases (by default), presumably because there's greater cost to task switching across more cores and the greater number of cores inherently increases the responsiveness anyway.
* BUT – and this is the point of the article – above 8 cores, the scheduler interval remains constant rather than increasing further.
If I have read it right, then surely that's exactly what you want? If you have some enormously beefy server with hundreds of cores then you don't want a CPU-bound task persisting for several seconds! You almost certainly have proportionally more tasks to run and need to keep that interval down. There's presumably also a diminishing increase in the cost of switching across more cores (admittedly a logarithmically increasing interval, as mentioned in the article, would also cope with that). And, in any case, a huge server is more likely to have this sort of setting manually tuned.
If there is a bug here at all then it's a minor one, nothing like the title suggests.
I'm not well enough versed in the Linux kernel to comment on the post, but as a funny anecdote, 15 years ago, we were developing a system that relied heavily on using as many cores as possible simultaneously.
Performance was critical, and we had beefy developer machines (for the time), all with 8 cores. Development was done in C++, and as the project progressed the system performed very well, and more than exceeded our performance goals.
Fast forward a couple of years and it became time to put the thing into production on a massive 128 core Windows server. Much to our surprise the performance completely tanked when scaled to all 128 cores.
After much debugging, it turned out that the system spent most of its time context switching instead of doing actual work. Because we used atomic operations (compare & swap) for the message-queue functionality, every core working with/on that piece of heap memory kept having its cache line invalidated, so every time a task passed a message to something else, it effectively reset the CPU cache, which then had to be refetched from RAM. This was not as big a problem on the developer machines, as there were fewer cores and each task had more work queued up for it, but with 16 times as many cores to work with, it simply ran out of tasks to do faster.
The "cure" was to limit the system to run on 8 cores just like the developer machines, and AFAIK it still runs in that way all these years later.
It's amazing how much "optimization" you can achieve on modern systems by simply fencing process(es) to run within a cache region, or within a NUMA node. Even my homelab server with a P-core cluster and two E-core clusters benefits massively from some simple cpusets to keep each process running within a cluster to mitigate context switching. Each P-core has its own L2 cache, but E-core clusters share L2. Unfortunately all three clusters share an LLC, so there's only so much you can do.
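For reference, this kind of fencing can be done with cpusets or `taskset -c`, or from inside the program itself; a minimal sketch, assuming a hypothetical cluster on CPUs 0-5 (check lscpu/lstopo for your real layout):

```c
/* Pin the current process to one made-up core cluster (CPUs 0-5).
 * cpusets or `taskset -c 0-5` achieve the same thing from the shell. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu <= 5; cpu++)   /* hypothetical cluster: CPUs 0-5 */
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    puts("pinned to CPUs 0-5");
    return 0;
}
```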
> It’s problematic that the kernel was hardcoded to a maximum of 8 cores
Why? The article shows no evidence of this being problematic.
> It can’t be good to reschedule hundreds of tasks every few milliseconds, maybe on a different core, maybe on a different die. It can’t be good for performance and cache locality.
Is this an actual bug, or intentional behavior? I mean, someone must have put the min in there for a reason…? I can imagine scenarios where it would be nice to have something scale with cores up to a certain point; maybe this is one of those? I didn’t see much analysis beyond “oh hey, there is an 8 here, that means the kernel people are dumb for 15 years”. Did this get fixed?
> The Linux kernel has been accidentally hardcoded to a maximum of 8 cores for the past 15 years and nobody noticed
I can understand having a bug like that, but unnoticed for 15 years? More than 8 cores was rare 15 years ago, and as a percentage of chips sold it's still rare, but presumably people with Threadrippers ran benchmarks? Optimized? Etc.? It just doesn't seem possible.
The article is extremely confusingly worded, sometimes close to nonsensical (epyc as baseline? In what damn world?), and definitely clickbaity.
From what I understand, what was limited to 8 cores is the scaling of the preemption delay (min_granularity / min_slice). Again from what I understand, this is the window during which a process can not be preempted, so it is only relevant when the scheduler has more tasks to run than available slices (the system is heavily loaded or overloaded).
I would assume well-administered systems where this would be relevant:
1. Are not overloaded
2. Have the important tasks pinned to avoid migrations
3. Have priorities configured to avoid preempting / descheduling their primary workloads
As such, on a well-administered system this would mostly translate to possibly over-pre-empting low-priority tasks (and most likely not pre-empting anything because the machine is configured with capacity for those ancillary / transient low priority tasks). This may show up during transient overloads, and worsen an already bad situation, but it probably wouldn’t show up during normal operations.
It also doesn’t seem accidental: the maintainers literally slapped a `min(8, …)` on there, so they explicitly designed the scaling to have an upper bound. Maybe it’s a mistake, maybe it’s too low, maybe it should be a tunable, but I’d think it makes sense to not allow the preemption delay to grow indefinitely.
Exactly. The code is adjusting for responsiveness. With fewer CPUs you need a smaller minimum slice. As you have more CPUs you can increase the slice and still schedule the same number of processes per second.
E.g. 1 ms slice with 1 core = 1000 process switches per second. With 2 cores you can increase the slice to 2 ms and still maintain the same number of switches per second for the system, but reducing the switches per second on each core to 500. This reduces the overhead for the scheduler.
It seems like at around 8 times the slice, efficiency starts to go the other way, so they’ve limited it. Seems reasonable, but scheduler math is crazy.
Note that this has nothing to do with the scheduler assignments per core, which have clearly been working, or people would’ve noticed!
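To make the slice arithmetic above concrete (illustrative numbers, not kernel defaults):

```c
/* If the slice grows in step with the core count, the per-core switch rate
 * drops while the system-wide rate stays constant. Numbers are illustrative. */
#include <stdio.h>

int main(void)
{
    for (int cores = 1; cores <= 8; cores *= 2) {
        double slice_ms    = 1.0 * cores;            /* slice scales with cores */
        double per_core    = 1000.0 / slice_ms;      /* switches/s on one core */
        double system_wide = per_core * cores;       /* switches/s overall */
        printf("%d cores, %4.1f ms slice: %6.0f /core, %6.0f system-wide\n",
               cores, slice_ms, per_core, system_wide);
    }
    return 0;
}
```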
It's possible: the kernel itself not using more than 8 cores would be imperceptible except on things like network devices where there's no userland<->kernel barrier. In those kinds of high-throughput applications you usually offload the things the kernel does to hardware anyway.
You'll lose your mind once you realise that Windows NT handles a lot of things single-threaded. I had a situation once where I was handling a few million packets per second of TCP on Windows and it only pins a single core.
Though: that's not what TFA is looking at: in this case you're not actually limited to 8 cores, you're limited to slicing your executions into 8 parts per cpu per "tick".
This is a known way of scheduling, you only get 1/8th of a tick with a fair scheduler.
Because the framing is wrong and clickbait... As anyone with many cores can tell you: those are used.
The issue is more subtle: "[the minimum granularity] is supposed to allow tasks to run for a minimum amount of [3ms] when the system is overloaded".
That's supposed to scale with the number of cores, but the scaling is limited to 8 cores.
However, imho that's not even necessarily a bad thing. It's a trade off between responsiveness and throughput in overload situations. You don't want slices to become too tiny/large...
If I skimmed this correctly then it's a malus on performance, not a complete cliff. I guess people just thought "ho-hum, there's gotta be some overhead in scheduling".
Yes, when you run in parallel and e.g. see all >8 cores nicely ramping up to 100%, why assume something is wrong?
Even after rereading the article I still don't get what the malus actually is; it must be small, by that logic? Because you definitely saw linear scaling with parallelizable problems on >8 cores, otherwise people would have noticed.
What happens is that the min_slice stops scaling up above 8 cores. The article misrepresents that as “limit to 8 cores”, but my admittedly shallow understanding is that min_slice is a preemption protection: if the kernel has tasks to schedule and no free core, it will try to preempt an existing task; a process within its min_slice is protected from that preemption.
So this is only relevant for an overloaded system, and furthermore just means that processes may be preempted after 3ms (instead of that protection delay keeping on increasing), ignoring all other tunables e.g. priority and stuff.
Not only that but it’s a log2, so if this was relaxed on a 128 cores system you’d have a preemption delay of 7ms instead. I don’t think that would save you if you’re overloading a 128 cores system honestly, although it does beg the question as to why the kernel devs felt the need to cap the scaling. Even assuming it scales per thread, and you have a dual socket epyc, log2(512) = 9.
Ok, thanks a lot. Agreed, one might even argue about whether it is better to scale this number with the number of cores at all, in what relationship, or whether the initial values chosen are the optimal ones; certainly not for every situation. With the right comment one might even have claimed it was intentional, lol.
Clickbaity headline from a technical person makes me sad :(
At first I thought this was a big bug, but the commits the author linked do not even back up his point, and the argumentation around them doesn't follow through. So basically we are looking at big claims with zero evidence.
Either the author is deranged or this was written by AI.
If this was true, many of us would have noticed, including myself. Who hasn't run `make -j 32` and then `make -j 64` to see if it is faster? Many times on a 64-core machine I've seen large compile tasks scale as expected up to 64 cores, then get a bit faster up to 128 threads (because the threads are bottlenecking on I/O so there is some CPU left over), and then get slower again past 128.
I used to work on a supercomputer with 128 cores on it that ran Linux (I actually seem to remember it had 256 cores, but someone said the kernel had a limit of 128). This was less than 15 years ago. There were surely many systems just like that one. Does that mean the kernel had been patched? But nobody thought to push that patch upstream?
Reading the other comments here it seems the title is stupid and wrong and my suspicion that this can't be right was correct.
There’s probably nothing to patch: from my understanding this does not actually affect the scaling of the OS, it affects the throughput on heavily loaded systems, as min_slice is the delay before which a task can’t be preempted.
So it’s only relevant if the system has more tasks than cores, and if you ignore priorities and pinning. I assume these are the sort of mistakes people working with supercomputers would not be making, and residency would be very carefully monitored to ensure the system is not thrashing.
You didn't even have to have a supercomputer as common Opteron systems pretty much had more than 8 cores. So, someone would definitely notice that. Especially AMD. :)
The statement looks very misleading or even fraudulent. I often used a system with 192 cores, and with GNU parallel it did not stagnate at 8 parallel tasks for simple demonstrations. If we're talking about a case where 8 is intentionally the maximum (it's possible that some tasks should not parallelize beyond 8), then the statement is misleading as well, since it gives the wrong impression. I have the service tag, the output from nproc, and the exact version of everything for the machine where I used 192 CPUs. I suppose pseudoscience will always come back and claim that the statement is true anyway, no matter what observations the rest of the world can offer. There will always be pseudoscience insisting that the rest of the world has simply misunderstood.
The minimum run time of a thread before it's potentially preempted when load is high is computed based on the number of cores available, giving more time when there are more cores, since rescheduling latency is apparently longer with more cores.
And the function that does this uses the same value for 8 or more cores.
I'm annoyed that the author knows perfectly well what the code does, and yet he repeatedly draws the clickbaity wrong conclusion that "the kernel has been hardcoded to use only 8 cores for 15 years", which is patently wrong, as the discussion is only about pre-emption timing and NOTHING about how many cores are used.
i’m doing high performance gamedev on a 7950x with smt off. 16 cores.
i can only use 8 to run the game.
mutex and other synchronization gets highly variable latency if the first 8 cores try to talk to the second 8.
i hadn’t heard of this before i started this game, but apparently it’s well known, and nobody makes a chip with more than 8 cores that can have low variability synchronization.
linux running on 8 cores seems like a potentially good idea. one wants the kernel to have low latency with low variability.
This all comes down to the cache coherency protocols. And it's not surprising that you see increased latency with zen4 microarchitecture because, as one of the parent commenters already said, it's almost as if you're running a NUMA architecture within a single physical chip.
When dealing with NUMA we know that cross-socket, or in this case cross-CCD, latencies are always higher than the ones within the same socket or same CCD. Usually multi-fold.
So, if you're able to somehow take advantage of this knowledge in your code (e.g. by scheduling less latency-sensitive tasks to the other CCD), you may be able to improve your overall performance.
It's not possible for every workload, and sometimes the necessary effort makes it infeasible. Getting a Xeon might be the cheaper option then ;-)
Edit: Though I'd still recommend heeding menaerus' excellent sibling answer. Maybe not for this project, but it is great knowledge to have in your domain and I'd expect it to be relevant in the future.
I have the 7900 (non-x), I find it helpful to think of it as a dual socket 6 core system. For latency sensitive things like many games, it's best to pin it to a single socket/half the cores. Once you do that the other half can handle everything else. That way jitter stays low, performance is consistent, caches are warm, etc.
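If you want to see the cross-CCD penalty for yourself, a rough sketch that ping-pongs an atomic flag between two pinned threads; the CPU ids are placeholders, so pick one pair on the same CCD and one pair across CCDs and compare:

```c
/* Measure core-to-core round-trip latency by bouncing an atomic flag
 * between two pinned threads. CPU ids 0 and 8 below are placeholders. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000
static _Atomic int flag;          /* 0: ping's turn, 1: pong's turn */

static void pin(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg)
{
    pin(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load(&flag) != 1) ;   /* wait for ping */
        atomic_store(&flag, 0);             /* reply */
    }
    return NULL;
}

int main(void)
{
    int pong_cpu = 8;                       /* placeholder: pick your own */
    pthread_t t;
    struct timespec a, b;

    pthread_create(&t, NULL, pong, &pong_cpu);
    pin(0);                                 /* placeholder ping CPU */

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store(&flag, 1);             /* ping */
        while (atomic_load(&flag) != 0) ;   /* wait for the reply */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("avg round trip: %.0f ns\n", ns / ROUNDS);
    return 0;
}
```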
One thing I noticed is that min/max functions are tricky to use, and it's easy to accidentally do the wrong thing, like in this case.
Because you think "I want the minimum of those numbers to be 8 (that is, the number should be at least 8)", you slap a min() on it. And you got it wrong: you should have used max().
After reading the article quickly, I'm not even sure there's a bug or not, the only thing that seems obviously wrong to me is the headline.
But that aside, I feel your pain around min() and max(). In my case, I feel the problem comes from language (I mean, the real life one): I speak Italian and in Italian the words for "at most" are "al massimo" (i.e. "at max") and the words for "at least" are "come minimo" (i.e. "as min") that is, the exact opposite of their mathematical meaning.
I've made a habit of reviewing any piece of code where I write "min" or "max" three times, for this reason.
It would have been more of a bug to set this value to a minimum of 8 on a system with fewer than 8 cores.
It appears the author has misunderstood exactly what its purpose was; letting this scaling factor grow unbounded beyond 8 cores would have detrimental effects even if there are more cores available.
I don't think that's the case here, but yeah min and max can be a bit confusing to read. If your language allows adding methods or infix operators without performance overhead, I like to make something like this:
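Roughly this shape, sketched here in plain C (which can't add methods or infix operators, but descriptive wrappers give most of the benefit):

```c
/* Name the operations for the bound they enforce, instead of remembering
 * which of min()/max() does it. Illustrative helpers, not a library API. */
#include <stdio.h>

static inline int at_least(int x, int lo) { return x < lo ? lo : x; }   /* i.e. max(x, lo) */
static inline int at_most(int x, int hi)  { return x > hi ? hi : x; }   /* i.e. min(x, hi) */
static inline int clamped(int x, int lo, int hi) { return at_most(at_least(x, lo), hi); }

int main(void)
{
    printf("%d %d %d\n", at_least(3, 8), at_most(600, 512), clamped(-5, 0, 100));
    /* prints: 8 512 0 */
    return 0;
}
```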
I find the word “clamp” extremely satisfying for this concept, especially with optional named parameters in order to leave off one bound e.g. `x.clamp(low=10)` but it still works fine with both required and anonymous, it’s just a bit less convenient.
[0] https://github.com/torvalds/linux/commit/acb4a848da821a095ae...