The Linux Scheduler: A Decade of Wasted Cores (2016) [pdf] (ubc.ca)
174 points by PaulHoule 9 months ago | 66 comments



Same goes for I/O scheduling. The virtual memory subsystem seems to only have two modes: buffering and write-out. The system will buffer writes until some threshold is reached, then it switches to write-out mode and flushes all dirty pages regardless of their age. The problem is that new pages are locked during write-out, so pending writes will stall, regardless of the memory pressure on the rest of the system.

A better design would be to only lock a subset of dirty pages and let new writes to virtual memory continue while the write-out is happening, but it doesn't seem like the system can accommodate that.

A simple test to see this in action is to use netcat to transfer a large file across the network, and monitor device activity with sar or atop (or just watch the disk/network activity lights). What you'll see is that while the disk is writing, network activity drops to zero, and when network activity resumes, the disk goes idle again for seconds. It doesn't matter how much smaller you make vm.dirty_background_{bytes,ratio} relative to vm.dirty_{bytes,ratio}; the network traffic will block as soon as the "background" disk write starts. The only effect a low value for vm.dirty_background_{bytes,ratio} has is to increase the frequency at which the network and disk activity alternate.
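
Another way to watch it, from the writing process's point of view: time each buffered write() and print the calls that stall. A minimal sketch (hypothetical and untuned; the file name and sizes are arbitrary):

  /* Time each 1 MiB buffered write() to a file and report the ones that
   * stall. Buffered writes should only copy into the page cache, yet
   * once the kernel switches to write-out they block for long stretches. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  static double now_ms(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
  }

  int main(int argc, char **argv) {
      const char *path = argc > 1 ? argv[1] : "testfile.bin";
      int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0) { perror("open"); return 1; }

      static char buf[1 << 20];            /* 1 MiB of dummy data */
      memset(buf, 'x', sizeof buf);

      for (int i = 0; i < 8192; i++) {     /* 8 GiB total */
          double t0 = now_ms();
          if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) { perror("write"); break; }
          double dt = now_ms() - t0;
          if (dt > 100.0)                  /* a page-cache copy should be near-instant */
              printf("write %d stalled for %.0f ms\n", i, dt);
      }
      close(fd);
      return 0;
  }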


A complete nightmare. In the past I had to work on an IO-heavy system with 1-5 ms read-latency guarantees, and the only working solution was to completely manage all writes in userspace: manual buffering, manual throttling, manual fallocate, then either O_DIRECT or mmap+sync_file_range depending on whether you need the written data to be readily available for reads or not.
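
For the non-O_DIRECT route, the general shape is a write-behind loop. A hedged sketch (illustrative only, not the actual system; it uses plain buffered write() rather than mmap, and the 8 MiB window is arbitrary):

  /* Write-behind: every CHUNK bytes, kick off async writeback of the
   * newest chunk, wait for the previous one, and drop it from the page
   * cache, so dirty pages never accumulate to the global thresholds. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  #define CHUNK (8L << 20)                 /* 8 MiB write-behind window */

  int main(void) {
      int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0) { perror("open"); return 1; }

      static char buf[1 << 20];
      memset(buf, 'x', sizeof buf);
      off_t written = 0, synced = 0;

      for (int i = 0; i < 4096; i++) {     /* 4 GiB total */
          if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) { perror("write"); break; }
          written += sizeof buf;
          if (written - synced >= CHUNK) {
              /* start async writeback of the chunk we just finished */
              sync_file_range(fd, synced, CHUNK, SYNC_FILE_RANGE_WRITE);
              if (synced >= CHUNK) {
                  /* wait for the chunk before it, then evict it */
                  sync_file_range(fd, synced - CHUNK, CHUNK,
                                  SYNC_FILE_RANGE_WAIT_BEFORE |
                                  SYNC_FILE_RANGE_WRITE |
                                  SYNC_FILE_RANGE_WAIT_AFTER);
                  posix_fadvise(fd, synced - CHUNK, CHUNK, POSIX_FADV_DONTNEED);
              }
              synced += CHUNK;
          }
      }
      close(fd);
      return 0;
  }

The point is that the application, not the kernel, decides how much dirty data is outstanding at any moment, which is what makes a latency bound enforceable at all.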

There was a kernel patch, though, that solved around 95% of the issues with no userspace changes: vm.dirty_write_behind. Unfortunately it never made it into the mainline kernel. [1] For strong latency guarantees it was insufficient, but it greatly improved the typical network/IO spike alternation described above.

I'm surprised it was never fixed upstream, despite the fact that even the most basic and simple scenarios, like "nginx writes a log file to the disk", sometimes explode into seconds-long lockups on memory-fat machines.

[1] https://lore.kernel.org/linux-fsdevel/156896493723.4334.1334...


> Unfortunately it never made it into the mainline kernel.

That is unfortunate, and a bit surprising given how positively both Linus Torvalds and Jens Axboe responded to the patch.


Ran into similar pathological behavior on servers with huge amounts of RAM (>1 TiB). Negative dentries build up until the kernel decides it is time to reclaim and locks up the server for >60 seconds scanning pages. :-/


I also ran into spooky negative dentry issues. On a fleet of seemingly identical physical machines, one machine would constantly check for files that didn't exist (IIRC it was somehow part of NSS, and a quick Google search seems to confirm it: https://lwn.net/Articles/894098/ ).

I'm okayish at systems-level debugging but I never figured that one out. It caused the kernel to use a lot of memory. It's arguable whether or not it impacted performance, since it was a cache.


Yes, indeed. The more RAM available, the more pages are available as buffer cache, and the more pronounced the system stalls are. Though in my example the determining factor is available_memory/disk_write_throughput, while in your example it is available_memory/single_core_throughput (assuming the page scan is a single-threaded process).


Does disabling transparent huge pages help any here?


Have not tried it, but I would not expect any difference. AFAIK, transparent huge pages are only about organizing free/unused/allocated memory; in both my example and the parent's, the issue isn't about memory management per se but about synchronous operations on data taking too long (I/O writes in my case, cache cleanup for the GP).


Why do you think it's still this way? Why hasn't somebody implemented something better? Is it too core that we can't afford any instability?

Edit: apparently it has been done, but only very recently and hasn't had a major distro ship it yet


Oh? That sounds important and interesting. Do you have a URL or commit hash with more information?


I'm very confused now, so I have no idea what to think. I need to take some time to read through everything, but in the meantime here are some relevant links I've seen:

https://lwn.net/Articles/925371/

https://lkml.org/lkml/2016/4/23/135


Hm yes, maybe you are confused. The articles you posted seem to be about CPU, not I/O.


Would a solution to this include some kind of new queue? Is the fsync barrier enough for synchronising the writes? I'm thinking of situations where you try to write a subset of blocks, but you end up writing some new ones before old ones.


I'm not sure how the current vm system tracks dirty pages, so I wouldn't know where to start suggesting realistic solutions. But yes, assuming the dirty pages are organized in some sort of specialized data structure (queue or otherwise), you could do this without needing new barriers: it would suffice to create a new dirty-page root structure before starting writeout, and redirect freshly dirtied pages to the new root. The writer process can then take its leisurely time committing the pending writes to disk without obstructing other processes issuing new writes. New writes to pending pages can be reparented to the new root via an RCU mechanism (to not violate previous write barriers), and incomplete writes can be reparented by the writer process in a similar way. When the writeout process completes, the old root can be completely discarded; the writer could even consume and free the old structure while traversing it.

But really, this is all fantasy given that I know nothing about the virtual memory internal data structures.


The kernel page cache is an XArray [1] (data structure here [2]) and dirty page pointers are annotated with the dirty tag when the page is dirtied [3]. There is a per-filesystem callback that usually ends up here [4], where you can see the entry for that page in the above data structure [2] being tagged as dirty.

Edit: fixed typo

[1] https://docs.kernel.org/core-api/xarray.html

[2] https://elixir.bootlin.com/linux/latest/source/include/linux...

[3] https://elixir.bootlin.com/linux/latest/source/mm/page-write...

[4] https://elixir.bootlin.com/linux/latest/source/mm/page-write...


Thanks. So there is no "specialized data structure" for tracking dirty pages, the page dirty status is tracked within the regular page cache hierarchy. And the obvious in-line solution (use a wraparound counter for tracking dirty state, increase dirty values for new pages while writing out lower values) isn't easy because the tag is a bitfield. So the easiest solution would be to have two separate dirty markers (e.g. dirty_a and dirty_b), and have the writeout process toggle the new-dirty marker from A to B so it can safely write dirty_a pages without interference. And vice-versa for the next writeout.

There is just one caveat with regards to the previous solution: pages that are already pending writeout still need to be locked. In the previous design, a write to a page already pending writeout could be duplicated into the new dirty queue. But since it seems that a page can exist only once in the page cache, it cannot be written to without possibly violating write barriers. So the write must block until the writeout process has un-dirtied the page.
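
To make that toggle concrete, a toy single-threaded sketch (purely illustrative, nothing like real kernel code or data structures):

  /* A/B dirty markers: flip the marker used for new dirtying before a
   * writeout, so the writer only ever walks a frozen set of pages. */
  #include <stdio.h>

  #define NPAGES 8

  enum mark { CLEAN, DIRTY_A, DIRTY_B };

  static enum mark page_mark[NPAGES];
  static enum mark active = DIRTY_A;       /* marker used for new dirtying */

  static void dirty_page(int i) { page_mark[i] = active; }

  static void writeout(void) {
      enum mark frozen = active;
      active = (frozen == DIRTY_A) ? DIRTY_B : DIRTY_A;   /* flip first */
      for (int i = 0; i < NPAGES; i++) {
          if (page_mark[i] == frozen) {
              printf("writing page %d\n", i);              /* pretend I/O */
              page_mark[i] = CLEAN;
          }
      }
  }

  int main(void) {
      dirty_page(0); dirty_page(3);
      writeout();        /* writes 0 and 3 */
      dirty_page(1);     /* marked DIRTY_B: a still-running DIRTY_A writeout would skip it */
      writeout();        /* writes 1 */
      return 0;
  }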

Armchair programming is fun, isn't it? :)


Related. Others?

The Linux scheduler: A decade of wasted cores (2016) - https://news.ycombinator.com/item?id=33462345 - Nov 2022 (26 comments)

The Linux Scheduler: A Decade of Wasted Cores (2016) - https://news.ycombinator.com/item?id=15531332 - Oct 2017 (35 comments)

The Linux Scheduler: A Decade of Wasted Cores - https://news.ycombinator.com/item?id=11570606 - April 2016 (38 comments)

The Linux Scheduler: A Decade of Wasted Cores [pdf] - https://news.ycombinator.com/item?id=11501493 - April 2016 (142 comments)


The Linux 6.6 kernel ships with a new default scheduler[1]. Is the testing methodology in this paper still relevant, and has it been used to assess the current (EEVDF) scheduler?

[1] - https://lwn.net/Articles/925371/


I'd wager that EEVDF has been tested less methodically than this paper's investigation of CFS. The primary author of EEVDF and maintainer of the subsystem has been dismissing alternative approaches and a plethora of robustly tested patches from Google and Facebook over the years, with replies mostly boiling down to "meh, I don't like it".

I'd take a Google-patched CFS, with its millions of broken knobs, over the newly released EEVDF any day, because I trust scheduler A/B testing by Google across millions of machines and every scheduling pattern under the sun far more than whatever synthetic micro-benchmarks a single kernel dev (as competent as they might be) ran.

If you're interested in quantitative analysis of schedulers & tooling around it, these 2 projects are very interesting:

https://github.com/google/schedviz

https://fuchsia.dev/fuchsia-src/concepts/kernel/fair_schedul...


With the pretty stunning performance improvements EEVDF posts (over CFS), I’m not so sure where your hate for it comes from.


Where can I find benchmarks of EEVDF vs CFS that were 1) not run by Zijlstra and 2) not synthetic? i.e. A/B tested on a large fleet of computers running heterogeneous processes.

I have nothing against the EEVDF algorithm itself (in fact I like it) and I dislike CFS very much. But I dislike the current development process of the Linux scheduler even more. Proper quantitative benchmarks of CPU schedulers are missing, which is why CFS ended up in the sad state it did, with hundreds of patches submitted over the years to fix random edge cases. What makes you confident that the initial EEVDF Linux implementation won't suffer the same fate, given that the development process hasn't changed (a single kernel dev implementing it and running micro-benchmarks)?


I mean, I only know of the test at the time of the 6.4RC: https://openbenchmarking.org/result/2305210-NE-2305205NE89

An increase in performance of 10-30% across "real" workloads. Just for changing one line (well, two lines including disabling p-state drivers)? I'll take it.

I would say it's not a strange position to assume it has been improved further since then.

What would be interesting, although niche, is checking how iGPUs perform alongside it. I know that on Intel, "thermald" lowers iGPU performance because it improves CPU utilization and thus leaves fewer milliwatts for the iGPU. Perhaps something similar will happen with EEVDF.


There's nothing in that article that talks about how work stealing is managed, so there's no way of knowing if this scheduler fixes the problem or makes it worse.


And something reasonably like it existed for... more than a decade, but mingo was not a fan, threw a hissy fit and originated CFS, and, well, Torvalds didn't care too much, so he deferred to mingo.


Having experimented with a lot of CFS knobs, my high-level conclusion is that every single non-default behavior is broken to various degrees, and default behavior isn't immune to pathological edge cases either.


Ultimately -- the thing is, if anyone is both capable and willing, they can, and sometimes do, fix it.

Granted, this combination is rather rare. Most people aren't capable. Those who are have better things to do, and they probably have very well-paying jobs they could be focusing on instead.

With that being said, Linux is _still_ more efficient than Windows.

I don't want to say Linux is free, in practice it's not, those who are running the big powerful machines are using RHEL and paying hefty licenses.

Which are still better than any other alternative.


> I don't want to say Linux is free, in practice it's not, those who are running the big powerful machines are using RHEL and paying hefty licenses.

Google/Amazon/MS aren't paying for RHEL licenses.

The reason for using RHEL is basically "we don't want to hire experts", which makes a lot of sense in a small/midsized company, but in a big one it's probably mostly the ability to blame someone else if something is fucked up, plus the fact that some software basically says "run it on this distro or it is unsupported".


That's basically it. As a sysadmin, I supported hundreds of Linux servers (mostly virtual) pretty much by myself. We paid the license so that if the shit hit the fan (and we really had no way of fixing it) we could call on Red Hat. At least I would be able to tell my bosses that things were being handled.

This never happened, of course. But it's CYA.


I've used RHEL support too, believe it or not, while working as a DevOps engineer at Red Hat on www.redhat.com. And the support specialist assigned to my issue was very sharp and solved the issue quickly. Easy in hindsight but I was stuck so I reached out, and I was glad I did. It involved a segfault and grabbing a core dump and analyzing it. ++Red Hat Support


A product I worked on got really popular, so scaling up the support department became a real pain: the devs were running all kinds of different distros, so we had CI and package support for Arch, Debian, Ubuntu, CentOS... except that all of those come in a bunch of different variants and customers expected the product to run on them. At one point more than 90% of the tickets were dealing with distribution-specific weirdness. There was an epic internal wiki page that must have weighed in at a couple megabytes of raw text; it was an encyclopedia of run-time linker caveats for hundreds of distributions, complete with troubleshooting flowcharts and links to our internal test farm for one-click access to a VM reproducing the scenario described. I still wish I had that; it would have been so handy for some of the Rust and MIPS porting I'm doing now.

We couldn't containerize it (it's a client for a security monitor) and we couldn't ship it as a VM, so eventually we all agreed to ditch everything except RHEL and leave everyone else unsupported (we grandfathered all the existing ones in, and it took a year and a half before everything worked).


There are well-paying jobs that allow smart people to focus on exactly this.


> capable and willing

This is the sort of research that scientific grants are _supposed_ to be targeting for the public good. Supposed to be.


When we replaced a venerable Tru64 cluster with Linux about 15 years ago, the head of the department (who was skeptical of Linux) came to me with a problem:

he had scheduled 9 "use 100% of a core" processes on an 8 core machine and expected each process to get 8/9ths of a core on the machine, with each process proceeding at the same rate. (If I got that math wrong, basically I mean that all processes should receive the same amount of CPU time over an interval, with the expectation that the CPUs will be at 100%).

Instead, he found that one of the 9 processes always went slower than all the others. Tru64 did what he expected. I believe at this point we were running a kernel from before the Completely Fair Scheduler, but I've always wondered if there was a right answer here: I don't think UNIX or Linux truly 'guarantees' fully predictable and evenly divided CPU performance when you have n+1 (or n+whatever) processes running on n cores.

(also, he used "yes > /dev/null" as the CPU-generating task and I had to spend a lot of time explaining to him that wasn't really 100% cpu)


> that wasn't really 100% cpu

Can you explain?


That's probably referring to the fact that running 'yes' is going to spend virtually all of its time on syscall and context switch overhead, and minimal real work or IO. So conclusions from such a test may not generalize to any kind of realistic workload.


The yes command, writing to /dev/null, is making I/O calls, which interfere with predictable scheduling.

If you look at the source code for yes, https://github.com/coreutils/coreutils/blob/master/src/yes.c

it builds a buffer of output and then writes it in a loop:

  while (full_write (STDOUT_FILENO, buf, bufused) == bufused)
    continue;
Inside the kernel, /dev/null is handled by doing nothing. So basically when this runs, the majority of the CPU time it consumes is just context switches from process to kernel and back. It consumes 100% of a core on an idle machine, but what you really want for a basic scheduling test is a pure CPU hog with no kernel transitions (I've seen variations: increment a register, do some light flops that don't hit memory, or even just a nop loop, which is basically "increment register, branch if less than"). If you get the expected performance on the CPU hog, then you can proceed to the yes hog.
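
For reference, a pure CPU hog is about as short as programs get. Something like this (illustrative) spends essentially all of its time in userspace:

  /* Pure CPU hog for scheduler experiments: no syscalls inside the loop.
   * The volatile counter keeps the compiler from deleting the loop. */
  #include <stdint.h>

  int main(void) {
      volatile uint64_t counter = 0;
      for (;;)
          counter++;       /* increment, branch: ~100% user time */
      return 0;
  }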


This paper is from 2016. Anyone know the status of this now?


It was discussed at the time, and the consensus was that the patches were unusable, the testing methodology dubious, and the machine configuration rare.

https://lkml.org/lkml/2016/4/23/135


This paper caused me to start doubting the Linux scheduler on medium-sized boxes (36 physical cores in 2016) and helped me find an annoying bug in the RHEL 7 kernel which prevented some processes from being scheduled at all in some cases (it was an issue with NUMA load balancing, if I remember correctly). Note my bug was most likely unrelated to the paper, but it marked in my mind a specific class of bugs as possible.

So it helped me at least :-)


The Linux kernel received a new scheduler not too long ago (https://lwn.net/Articles/925371/), so I'm not sure how relevant the critiques about CFS are.


Looks great! According to one article, those changes just landed last month in kernel 6.6, so they're likely not in most distributions yet.

https://tuxcare.com/blog/linux-kernel-6-6-is-here-find-out-w....


FWIW, Pop OS just updated to the 6.6 kernel.


Interesting! I've used Linux Mint for the last 5+ years (and am using the Xanmod kernel, which is on 6.6), but have always been Pop OS curious. This increases my curiosity, I may have to give it a go on my laptop.

https://xanmod.org/


As did Alpine in 3.19


In Fedora 38 and 39, just update.


I'm surprised that nobody has pointed out that scheduling, being NP-complete, is all about finding heuristics to work around that ultimate algorithmic hard truth.

And multicore scheduling is job-shop scheduling [1], the even harder, NP-hard version (no pun intended). It is literally taught in senior-year uni CS classes that there is no perfect solution to the scheduling problem unless one of the NP-complete problems (SAT from the Cook-Levin theorem, or any of the others) is solved efficiently [2].

[1]: https://optimization.cbe.cornell.edu/index.php?title=Job_sho...

[2]: https://www.cse.cuhk.edu.hk/~siuon/csci3130-f19/slides/lec22...


This is technically true but practically irrelevant. Case in point: I deployed at Netflix a real-time combinatorial solver specifically for CPU scheduling [1].

Most real-world combinatorial problems can be solved relatively well. This is why Gurobi is in business, why AlphaGo exists, and why Amazon and United Airlines still manage to practically solve their resource allocation problems.

[1]: https://netflixtechblog.com/predictive-cpu-isolation-of-cont...


First time hearing about Gurobi; is it some kind of SMT solver? I tried looking at their website but could only find high-level marketing descriptions.


I tried to read the article, but eventually gave up. There’s just too much bad science in it.

By too much I mean:

1. A clickbait title

2. A conclusion containing the phrase "scheduling was thought to be a solved problem"


> As a central part of resource management, the OS thread scheduler must maintain the following, simple, invariant: make sure that ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues.

This oddly lines up with an observation I've made about traffic lights. It is very surprising how often you'll be stopped at an intersection where everybody is waiting and nobody is allowed to go, sometimes for what feels like 10-30 seconds.

It seems like the lowest-hanging fruit for improving traffic is to aim never to have degenerate intersections. If someone is waiting and nobody is going, the light should change quickly.


That sounds like a protected left turn but no one is turning so it looks stalled.


Isn’t that usually to give pedestrians a head start on crossing?


In some places the lights work like this, but you need to install relatively expensive sensors.

The sensors help gather high-quality data for traffic models, though. That should help traffic engineers do their jobs.


And yet Linux has consistently outperformed macOS in terms of throughput.


Yes. It turns out one thing being bad does not make a completely different thing good.

(You'd think this was obvious, but I've been on the Internet a while and it sure isn't.)


Is their LU benchmark just LU factorization and possibly a solve? That seems like an odd candidate for this kind of problem, because of course anybody running that kind of problem who cares about performance just doesn't oversubscribe their system... the scheduler has a much easier job when there's one big chunky thread per core.

I mean, I get that they are testing a general fix for the sort of problems that, like, non-scientific-computing users get. So it isn’t to say the work is useless. It just seems like a funny example.

I think coming up with benchmarks for schedulers must actually be very difficult, because you have to figure out exactly how silly a thing you want the imagined user to do.


Further to this, if you are running a program where compute is mainly a sparse linear solve, you are memory-bandwidth limited, and on modern systems like AMD Epyc you will very likely see improved performance if you undersubscribe the cores by 2x, e.g. using at most 64 threads on 128 cores.
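
If anyone wants to try this on their own box, a hedged sketch of the usual way to see it: a STREAM-style triad under OpenMP, run with different OMP_NUM_THREADS values (say 64 vs 128 on a 128-core Epyc) to compare half vs. full subscription:

  /* Memory-bandwidth-bound triad; build with: gcc -O2 -fopenmp triad.c */
  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define N (1L << 26)      /* 64M doubles per array, ~1.5 GiB total */

  int main(void) {
      double *a = malloc(N * sizeof *a);
      double *b = malloc(N * sizeof *b);
      double *c = malloc(N * sizeof *c);
      if (!a || !b || !c) return 1;

      #pragma omp parallel for
      for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      double t0 = omp_get_wtime();
      #pragma omp parallel for
      for (long i = 0; i < N; i++)
          a[i] = b[i] + 3.0 * c[i];      /* reads b and c, writes a */
      double dt = omp_get_wtime() - t0;

      printf("%d threads: %.1f GB/s (check %.1f)\n", omp_get_max_threads(),
             3.0 * N * sizeof(double) / dt / 1e9, a[0] + a[N - 1]);
      free(a); free(b); free(c);
      return 0;
  }

If the two runs give roughly the same GB/s, the extra threads are just burning cores for nothing.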


That’s interesting. I think it must be the case but just to triple check my intuition, I guess you want the threads spread out, so you can get every chiplet in on the action?


No, I think it's more about how cores may share infrastructure (like M cache etc)


> Abstract: As a central part of resource management, the OS thread scheduler must maintain the following, simple, invariant: make sure that ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues. In our experiments, these performance bugs caused many-fold performance degradation for synchronization-heavy scientific applications, 13% higher latency for kernel make, and a 14-23% decrease in TPC-H throughput for a widely used commercial database. The main contribution of this work is the discovery and analysis of these bugs and providing the fixes. Conventional testing techniques and debugging tools are ineffective at confirming or understanding this kind of bugs, because their symptoms are often evasive. To drive our investigation, we built new tools that check for violation of the invariant online and visualize scheduling activity. They are simple, easily portable across kernel versions, and run with a negligible overhead. We believe that making these tools part of the kernel developers' tool belt can help keep this type of bug at bay.


Didn't read the paper, but the Linux kernel and userspace have accumulated a lot of tech debt while Linux became the most common OS.

I’d love to see something new built with modern practices and ideas come and dethrone it. Maybe Theseus OS (not affiliated).


I hesitate to call "the shit that keeps stuff older than a year working on that kernel" tech debt. It's a necessity in any general-use OS.

As for the stuff that can be changed without breaking compatibility: well, it was the developers' best idea at the time, and some of those ideas turned out to be good, some turned out to be bad.

Going "but on paper that idea would be better, let's build an OS around it" rarely ends up being plain better. It's not one-dimensional.

For example, if your CPU scheduler makes all jobs run faster (less cache thrashing while keeping CPUs busy, etc.) but is unfair, that might be great for people running batch jobs but shit for desktop users.

A scheduler that wastes cycles but makes sure interactive tasks get the right level of priority might be an improvement for desktop users but not for server use, etc.

Or you figure out that the reason something was "unnecessarily complex" was actually necessary complexity that wasn't obvious at first.

Also, Linux is no stranger to taking a whole bunch of code and replacing it with something better; I think we're on the third firewall stack now (ipchains -> iptables -> nftables).


Something I would love to find is a practical, succinct guide on Linux kernel performance tuning. I'd love it to give examples of sysctls you'd adjust for specific use cases like:

- a real-time kernel

- a single user mode kernel

- an embedded device kernel (like an esp 32 or rpi)

- general purpose desktop use (surely there are gems in here)

- to use the linux kernel as a hypervisor hosting others

- tuning the linux kernel for within a vm (guest to another hypervisor)

- tuning for gaming performance

I myself do not know enough about sysctls, and I'm sure it's a goldmine.
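
One thing that helps demystify them: every sysctl is just a file under /proc/sys, so you can read (and, as root, write) the knobs programmatically too. A hedged little sketch that only reads a couple of the vm.dirty_* values discussed upthread:

  /* Print two of the vm.* writeback knobs by reading /proc/sys. */
  #include <stdio.h>

  static void print_sysctl(const char *path) {
      char buf[64];
      FILE *f = fopen(path, "r");
      if (!f) { perror(path); return; }
      if (fgets(buf, sizeof buf, f))
          printf("%s = %s", path, buf);
      fclose(f);
  }

  int main(void) {
      print_sysctl("/proc/sys/vm/dirty_ratio");
      print_sysctl("/proc/sys/vm/dirty_background_ratio");
      return 0;
  }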


There was another one before ipchains - ipfw.


I've been keeping an eye on Redox for several years now, as well. It'll be interesting to see what floats and what sinks in the decades to come.


Thanks for reminding me to bookmark that project! I'd forgotten about it.



