Hacker News new | past | comments | ask | show | jobs | submit login
The long road to lazy preemption in the Linux CPU scheduler (lwn.net)
220 points by chmaynard 79 days ago | hide | past | favorite | 50 comments



> It all adds up to a lot to be done still, but the end result of the lazy-preemption work should be a kernel that is a bit smaller and simpler while delivering predictable latencies without the need to sprinkle scheduler-related calls throughout the code. That seems like a better solution, but getting there is going to take some time.

Sounds promising. Just like EEVDF, this both simplifies and improves the status quo. Does not get better than that.


> A higher level of preemption enables the system to respond more quickly to events; whether an event is the movement of a mouse or an "imminent meltdown" signal from a nuclear reactor, faster response tends to be more gratifying. But a higher level of preemption can hurt the overall throughput of the system; workloads with a lot of long-running, CPU-intensive tasks tend to benefit from being disturbed as little as possible. More frequent preemption can also lead to higher lock contention. That is why the different modes exist; the optimal preemption mode will vary for different workloads.

Why isn't the level of preemption a property of the specific event, rather than of some global mode? Some events need to be handled with less latency than others.


You need CPU time to evaluate the priority of the event. This can't happen until after you've interrupted whatever process is currently on the CPU. And so the highest possible priority an event can happen is limited by how short a time slice a program gets before it has to go through a context switch.

To stand ready to reliably respond to any one kind of event with low latency, every CPU intensive program must suffer a performance penalty all the time. And this is true no matter how rare those events may be.


That is not true of quite a few multi-core systems. A lot of them, especially those that really care about performance will strap all interrupts to core 0 and only interrupt other cores via IPI when necessary.


I learned this when I pegged core 0 with an intensive process on a little quad core arm device, and all of my interrupts started behaving erratically.


This strategy minimizes the impact by making one core less necessary. But it does not eliminate it.


Sure which is a perfectly fine trade off; almost all recent CPUs have enough multicore capacity that make this trade favorable.


> You need CPU time to evaluate the priority of the event.

Not necessarily. The CPU can do it in hardware. As a simple example, the 6502 had separate “interrupt request” (IRQ) and “non-maskable interrupts (NMI) pins, supporting two interrupt levels. The former could be disabled; the latter could not.

A programmable interrupt controller (https://en.wikipedia.org/wiki/Programmable_interrupt_control...) also could ‘know’ that it need not immediately handle some interrupts.


The user you replied to likely means something different: The priority of the event often depends on the exact contents on the event and not the hardware event source. For example, say you receive a "read request completed" interrupt from a storage device. The kernel now needs to pass on the data to the process which originally requested it. In order to know how urgent the original request and thus the handling of the interrupt is, the kernel needs to check which sector was read and associate it with a process. Merely knowing that it came from a specific storage device is not sufficient.

By the way, NMI still exist on x86 to this day, but AFAIK they're only used for serious machine-level issues and watchdog timeouts.


This, too, can be done in hardware (if nothing else, with a small coprocessor).


This doesn't shed light

Generally, any given software can be done in hardware.

Specifically, we could attach small custom coprocessors to everything for the Linux kernel, and Linux could require them to do any sort of multitasking.

In practice, software allows us to customize these things and upgrade them and change them without tightly coupling us to a specific kernel and hardware design.


Exactly the point. We can compile any piece of software that we want into hardware, but after that it is easier to change in software. Given the variety of unexpected ways in which hardware is used, in practice we went up moving some of what we expected to do in hardware, back into software.

This doesn't mean that moving logic into hardware can't be a win. It often is. But we should also expect that what has tended to wind up in software, will continue to do so in the future. And that includes complex decisions about the priority of interrupts.


We already have specialised hardware for register mapping (which could be done in software, by the compiler, but generally isn't) and resolving instruction dependency graphs (which again, could be done by a compiler). Mapping interrupts to a hardware priority level feels like the same sort of task, to me.


> We already have specialised hardware for register mapping (which could be done in software, by the compiler, but generally isn't)

Wait, what? I’ve been out of compiler design for a couple decades, but that definitely used to be a thing.


They're probably referring to AMD Zen's speculative lifting of stack slots into physical registers (due to x86, phased out with Zen3 though), and more generally to OoO cores with far more physical than architectural registers.


We do register allocation in compilers, yes, but that has surprisingly little bearing on the actual microarchitectural register allocation. The priority when allocating registers these days is, iirc, avoiding false dependencies, not anything else.


Linux runs actual C code when an event occurs — this is how it queues up a wake up of the target task and optionally triggers preemption.


> Why isn't the level of preemption a property of the specific event, rather than of some global mode?

There are two different notions which are easy to get confused about here: when a process can be preempted and when a process will actually be preempted.

Potential preemption point is a property of the scheduler and is what is being discussed with the global mode here. More preemption points mean more chances for processes to be preempted at inconvenient time obviously but it also means more chances to properly prioritise.

What you call level of preemption, which is to say priority given by the scheduler, absolutely is a property of the process and can definitely be set. The Linux default scheduler will indeed do its best to allocate more time slices and preempt less processes which have priority.


Arguably PREEMPT_VOLUNTARY, as described in the article is an attempt in this direction which is being deprecated.


It's sort of what this patch does, from https://lwn.net/ml/all/20241008144829.GG14587@noisy.programm... :

> SCHED_IDLE, SCHED_BATCH and SCHED_NORMAL/OTHER get the lazy thing, FIFO, RR and DEADLINE get the traditional Full behaviour.


Mostly because such a system would install in fighting among programs that all will want to be prioritized as important. tbf it will mostly be larger companies who will take advantage of it for "better" user experience. Which is kind of important to either reduce to a minimal amount of running applications or simply control it manually for the short burst most users will experience. If anything cpu intensive tasks are more likely to be bad code than some really effective use of resources.

Though when it comes to gaming, there is a delicate balance as game performance should be prioritized but not be allowed to cause the system to lock up for multitasking purposes.

Either way, considering this is mostly for idle tasks. It has little importance to allow it to be automated beyond giving users a simple command for scripting purposes that users can use for toggling various behaviors.


You're talking about user-space preemption. The person you're replying to, and the article are about kernel preemption.


Games run in a tight loop, they don’t (typically) yield execution. If you don’t have preemption, a game will use 100% of all the resources all the time, if given the chance.


Games absolutely yield, even if the rendering thread tries to go 100% you'll likely still be sleeping in the GPU driver as it waits for back buffers to free up.

Even for non-rendering systems those still usually run at game tick-rates since running those full-tilt can starve adjacent cores depending on false sharing, cache misses, bus bandwidth limits and the like.

I can't think of a single title I worked on that did what you describe, embedded stuff for sure but that's a whole different class that is likely not even running a kernel.


Maybe on DOS. Doing any kind of IO usually implies “yielding”, which most interactive programs do very often. Exhausting its quantum without any IO decreases the task’s priority in a classic multilevel feedback queue scheduler, but that’s not typical for programs.


Games run in user space. They don't have to yield (that's cooperative multitasking), they are preempted by the kernel. And don't have a say about it.


Make a syscall for io. Now the kernel takes over and runs whatever it likes for as long as it likes.

Do no syscalls. Timer tick. Kernel takes over and does whatever as well.

No_HZ_FULL, isolated cpu cores, interrupts on some other core and you can spin using 100% cpu forever on a core. Do games do anything like this?


Pinning on a core like this is done in areas like HPC and HFT. In general you want a good assurance that your hardware matches your expectations and some kernel tuning.

I haven't heard of it being done with PC games. I doubt the environment would be predictable enough. On consoles tho..?


We absolutely pinned on consoles, anywhere where you have fixed known hardware tuning for that specific hardware usually nets you some decent benefits.

From what I recall we mostly did it for predictability so that things that may go long wouldn't interrupt deadline sensitive things(audio, physics, etc).


Nice, thank you


Thinking about it the threads in a game that normally need more CPU time are the ones that are doing lots of sys calls. You'd have to use a fair bit of async and atomics, to split the work into compute and chatting with the kernal. Might as well figure out how to do it 'right' and use 2+ threads so it can scale. Side note the compute heavy low sys call freqency stuff like terrain gen belongs in the pool of back ground threads, normaly.


Yeah you are right, however some of what I said does have some merit as there are plenty of things I talked about that apply to why you would need dynamic preemption. However, the other person who mentioned the issue with needing to take cpu cycles on the dynamic system that checks and might apply a new preemptive config is more overhead. The kernel can't always know how long the tasks will take so it is possible that the overhead for dynamically changing for new tasks that have short runtime will be worse than just preemptively setting the preemptive configuration.

But yeah thanks for making that distinction. Forgot to touch on the differences


> Why isn't the level of preemption a property of the specific event, rather than of some global mode? Some events need to be handled with less latency than others.

How do you know which thread is needed to "handle" this particular "event" though? I mean, maybe you're about to start a high priority video with low latency requirements[1]. And due to a design mess your video player needs to contact some random auth server to get a DRM cookie for the stream.

How does the KERNEL know that the auth server is on the critical path for the backup camera? That's a human-space design issue, not a scheduler algorithm.

[1] A backup camera in a vehicle, say.


Aren't we in the massively multi-core era?

I guess it's nice to keep Linux relevant to older single CPU architectures, especially with regards to embedded systems.

But if Linux is going to be targeted towards modern cpu architectures primarily, accidentally basically assume that there is a a single CPU available to evaluate priority and leave the CPU intensive task bound to other cores?

I mean this has to be what high low is for, outside of mobile efficiency.


As I understand it, kernel preemption of a user thread happens (and has to happen) on the core that's running the thread. The kernel is not a separate process, but rather a part of every process. What you're describing sounds more like a hypervisor than a kernel. That distinction isn't purely semantic; the two operate at different security levels on the CPU and use different instructions and mechanisms to do their respective jobs.

edit: That having been said, I may be misinterpreting what you described; there's a comment in another thread by @zeusk which says to me that more or less this (single core used/reserved for making priority decisions) is already the case on many multi-core systems anyway, thanks to IPI (inter-processor interrupts). So, presumably, the prioritization core handles the preemption interrupts, then runs decision logic on what threads actually need to be preempted, and sends those decisions out to the respective core(s) using IPI, which causes the kernel code on those cores to unconditionally preempt the running thread.

However, I'd wonder still about the risk of memory barriers or locks starving out the kernel scheduler in this kind of architecture. Maybe the CPU can arbitrate the priority for these in hardware? Or maybe the kernel scheduler always runs for a small portion of every time slice, but only takes action if an interrupt handler has set a flag?


Massive multi-core is subjective. Most industrial computers are still low core. You are more likely to find dual or quad core in these environments. Multi-core costs more money and increases the cost of automation. Look at Advantech computers specifications for this economic area.

Software PLCs will bind to a core which is not exposed to the OS environment and will show a dual core is a single or a quad core as a tri core.


"Current kernels have four different modes that regulate when one task can be preempted in favor of another"

Is this about kernel tasks, user tasks or both?


Kernel code, user-space code is always preemptible.


Not true when the user-space thread has RT priority.


RT threads can be prempted by higher prio RT, and IIRC some kernel threads run at the highest prio. Plus you can be prempted by SMI, an hypervisor, etc


Can't find any numbers in the linked thread with the patches. Surely some preliminary benchmarking must have been performed that could tell us something about the real world potential of the change?


From the article, second last paragraph:

> There is also, of course, the need for extensive performance testing; Mike Galbraith has made an early start on that work, showing that throughput with lazy preemption falls just short of that with PREEMPT_VOLUNTARY.


How would you benchmark something like this? Run multiple processes concurrently and then sort by total run time? Or measure individual process wait time?


I guess both make sense, and a lot of other things (synthetical benchmarks, microbenchmarks, real-world benchmarks, best/average/worst case latency comparison, best/average/worst case throughput comparison...)


How tight is scheduler coupled to the rest of kernel code?

If one wanted to drastically simplify scheduler, for example for some scientific application which doesn't care about preemption at all, can it be done in clean, modular way? And will be any benefit?


If you want to run a set of processes with as little preemption as possible, for example in HPC setting, your most powerful option is to reboot the system with a selection of cores (exact amount will differ on your needs) set as isolated cpus and manually put your task there with taskset - but then you need to really manually allocate tasks to CPUs, it's trivial to end up with all tasks on wrong CPU.

The standard way is to set interrupt masks so they don't go to "work" cpus and use cpusets to only allow specific cgroup to execute on given cpuset.


You can get 95% of the way there by running a clean system with nearly no daemons, and your application setup to run with one os thread per cpu thread, with cpu pinning so they don't move.

Whatever the scheduler does should be pretty low impact, because the runlist will be very short. If your application doesn't do much I/O, you won't get many interrupts either. If you can run a tickless kernel (is that still a thing, or is it normal now?), you might not get any interrupts for large periods.


Last time I looked, it was surprisingly decoupled.

But the reason for drastically simplifying it would be to avoid bugs, there isn't much performance to gain compared to a well-set default one (there are plenty of settings tough). And there haven't been many bugs there. On most naive simplifications you will lose performance, not gain it.

If you are running a non-interactive system, the easiest change to make is to increase the size of the process time quantum.


I'd just use RT Linux. That has its own basic scheduler with the kernel scheduler running as the idle task. Real time tasks get priority over everything else.


Yeah, it’d be cool if preemption could adapt based on the event, but managing that for all events might mess with system stability. It’s like using tools like Tomba Finder for lead gen

you gotta balance precision (targeted leads) with efficiency so everything runs smoothly overall.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: