In defense of swap: common misconceptions (2018) (chrisdown.name)
182 points by linsomniac on April 20, 2022 | 165 comments



(The article is from 2018.)

I think the article is strongly from the point of view of what to do "in production". If you have a bunch of servers with known specialized workloads, I can believe that enabling swap is good for efficiency. You can run more tasks and get closer to the memory limit while remaining safe. Coming from a Facebook employee, this makes sense.

However, for individual development machines, the pragmatic situation is completely different. The truth is that in practice running out of memory stems from two types of human errors:

1. You wrote some code that leaks lots of memory quickly.

2. You used Google Chrome.

In either situation, still in 2022, Linux reacts in the following way:

A. With swap, your system becomes completely unresponsive for longer than you have patience for, so you power-cycle it.

B. Without swap, the culprit is immediately identified and killed. Your system is perfectly usable, aside from a killed process.

Note that depending on your distro, the good case B. may have become a bit harder to achieve recently: systemd-oomd now tries to eagerly detect OOM situations before the kernel, and when it does, it kills the offending process' whole cgroup! If you have GNOME or KDE, you're good, but otherwise this may terminate your session. This can be fixed with "systemctl mask systemd-oomd.service".


> B. Without swap, the culprit is immediately identified and killed. Your system is perfectly usable, aside from a killed process.

This is not my experience. I ran my desktop for a few years on this theory, and in practice I found the system behaviour far worse when the system ran out of memory: it would lock up completely, as you mention in case A. With swap enabled I would eventually reach an unresponsive state, but not immediately: there would be a gradual slowdown during which I could take action to resolve the problem. I believe this is because of thrashing of the file-backed pages of the running executables (point 3 mentioned in the article). In practice, if you want behaviour B you need to run something like earlyoom, which kills processes before the kernel starts thrashing the disk.


There's also B-2. The kernel kills an unrelated process that happened to request a bit of RAM at the moment of OOM. The system becomes responsive for a while, and then the kernel starts looking for a new scapegoat. Which might be the same as the old scapegoat that got automatically respawned by your monitoring tool. This poor program keeps crashing for no reason and you're tearing your hair out trying to find out why. :(


Yes, in my experience those are the actual tradeoffs. With swap things slow down and that alerts you to the problem and you can go manually OOM kill the right thing. Without swap the kernel "randomly" kills the wrong thing without fail, often leading to a system that you have to reboot to get back into a sane state, and leading to a small industry of trying to tune the OOM killer to never do that.

Back in my day (which was a long time ago) we did tune the OOM killer in prod to hit only the right processes first (e.g. apache httpd or whatever software processes were deployed on the server) and that would usually lead to self-recovering behavior where one bad request that caused an OOM in one proc would be killed and the server would recover. That was only in prod though where we understood exactly what software ran on which instances.
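
For anyone wanting to reproduce that kind of prod tuning today, a rough sketch using oom_score_adj (range -1000..1000, higher means "kill me first"); the service name and values here are purely illustrative:

  # Bias the kernel OOM killer towards the workload processes, so it sacrifices
  # httpd workers before system daemons:
  $ for pid in $(pgrep -x httpd); do
  >   echo 800 | sudo tee /proc/$pid/oom_score_adj >/dev/null
  > done
  # Or, for a systemd-managed service, the same thing in a unit drop-in:
  #   [Service]
  #   OOMScoreAdjust=800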

So, I'd tend to suggest running with little to no swap in prod and tuning the OOM killer, because you know what processes are likely to be the issue in prod, while on a desktop/laptop or something it's better to have swap because the random workloads you throw at it are going to make tuning the OOM killer impossible.

I appreciate the argument that you should still have some swap in prod for paging out underutilized anon memory, but I'd like to see some solid numbers about that vs. running swapless, and to weigh that against the fiddliness of managing the swap files. I suspect in the majority of cases you're going to see that it doesn't make any difference in actual performance numbers, and that you shouldn't be running so close to the edge that it would matter. But measure for your own situation.

I was also running in a mostly HDD era, not SSD, so things may have changed. That probably suggests less of a penalty for swapping, though, and that the right answer for desktop/laptop loads is to use swap to avoid the randomness of the OOM killer. That may also lead to using more swap in prod, since it may degrade and recover more gracefully these days instead of the absolute catastrophe that swapping to HDD was back in the day.


There are lots of times I have run out of memory on modern desktops. Even with ones that have relatively large amounts of RAM, 16/32 GB.

The argument for swap is even stronger on desktops than on Facebook servers because, unlike servers, you lack a predictable workload.

And you are going to run into more situations where you have memory leaks or processes running that you don't use for extended periods of time.

Not having swap on the desktop leaves you with an unoptimized system.

> B. Without swap, the culprit is immediately identified and killed. Your system is perfectly usable, aside from a killed process.

No. This has never been my experience.

The normal experience is:

OOM killer kicks in, something strange happens, and user is left confused as to why what happened happened. It requires sysadmin skills to go in and interpret dmesg to understand what happened, and even then it is likely to be very unclear without additional information.

> Note that depending on your distro, the good case B. may have become a bit harder to achieve recently:

It's always been shitty. It hasn't "gotten worse". It's always been bad.

OOM is an emergency situation and still leaves your system in an unknown state.


> No. This has never been my experience. The normal experience is: OOM killer kicks in, something strange happens, and user is left confused

For linux servers and junior devs, yes, but for my parents on mac desktop, getting a "you are almost out of memory" dialog is simple and actionable while paging hell is mysterious and not actionable.


You are right. I wonder if it would make sense for distros to put processes that are critical for interacting with your computer in a cgroup with 0 swappiness, so your computer remains responsive even if some processes get dog slow.

Some candidate processes:

* ssh-server

* X/wayland

* Desktop environments

* Maybe some terminal emulators and shells?

The tricky part is that you probably want at least a shell that remains responsive, but you probably don't want to set their child processes' swappiness to 0. There could be some whitelisted system utilities that are known to not consume a lot of resources, so you can list processes conveniently and kill the appropriate ones.

I assume Windows also treats some processes specially so it can remain responsive, and I think now with appropriately set cgroups Linux could have the same capability.
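
A rough sketch of how a distro could approximate this today with systemd on cgroup v2; the unit name is just an example, and using MemorySwapMax= (rather than a per-cgroup swappiness, which only cgroup v1's memory.swappiness offers) is my assumption about the closest available knob:

  # /etc/systemd/system/ssh.service.d/noswap.conf
  # Keep this unit's anonymous memory resident by forbidding swap for it:
  [Service]
  MemorySwapMax=0

  $ sudo systemctl daemon-reload && sudo systemctl restart ssh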


> With swap, your system becomes completely unresponsive for longer than you have patience for, so you power-cycle it.

In my entire life using Linux (>15 years), I've never had a situation where, once swap caused the system to slow down, the system eventually got back on its feet. Usually, swap causes other apps to slow down and swap as well, and puts the system in a state it can't recover from.

Maybe I'm mistaken but my rule of thumb is to never use swap unless I specifically need it. This means, for a default desktop OS, I would never enable swap. If we run out of memory, apps should be killed.


I think case B is not so cut-and-dried. If the code is leaking slowly enough, you will encounter case C, where you don't have swap and the kernel starts paging out your code pages, and then your computer slows to completely unusable levels for far longer than in case A.

Especially if you can swap out to a relatively low-latency disk (SSD), swap is much better than risking case C.


A much better way to handle desktop memory is to have a generous amount of swap on fast storage, both to allow hibernation and to allow a bit of wiggle room if things fill up too fast, and also to install and enable earlyoom(1) with a high but effective threshold to prevent a lockup. Sure, it can cause things to break when using lots of memory, but it never lets you completely stall out.

There are now configurations within systemd for this, but I'm simply more familiar with the earlyoom daemon.

https://manpages.debian.org/bullseye/earlyoom/earlyoom.1.en....

https://wiki.archlinux.org/title/Improving_performance#Impro...
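
An illustrative earlyoom invocation along those lines (thresholds and regexes are examples to tune, not recommendations):

  # Start killing when available RAM drops below 5% and free swap below 10%;
  # prefer killing browsers, avoid the display server and sshd:
  $ earlyoom -m 5 -s 10 \
      --prefer '(^|/)(chrome|chromium|firefox)$' \
      --avoid '(^|/)(Xorg|Xwayland|gnome-shell|sshd)$'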


For me it works well with swap. I'm playing an indie game that suffers from a slight memory leak. After 4 or 5 hours the oom killer comes along and puts an end to it. Everything else gets pushed to swap before the game is killed. No unresponsiveness.


The only real solution (at least until and unless the kernel OOM killer is tuned to be massively more aggressive, which I doubt will happen) is to run a userspace OOM killer. If you don't like systemd-oomd, there are many alternatives, some of which even show a desktop notification when you're dangerously low on memory and when they actually kill processes.

Maybe it would be interesting to see if there could be better kernel APIs for things like userspace OOM killers; ideally, we'd want to guarantee that the userspace OOM killer is always prioritized in low-memory situations, and ideally it'd be possible to install low memory event listeners into the kernel rather than to poll.
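
For reference, the closest thing today is PSI (pressure stall information, kernel 4.20+), which the thread mentions elsewhere: /proc/pressure/memory reports how long tasks have stalled waiting for memory, and as I understand the API, userspace can register a threshold by writing to the file and then poll() the fd instead of busy-polling. Example output (values made up):

  $ cat /proc/pressure/memory
  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
  # Writing e.g. "some 150000 1000000" sets up a trigger that fires when tasks
  # stall for more than 150 ms within a 1 s window.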


I disagree. The real solution is to do away with the need for OoM killer in the first place by turning off overcommit (in its current form anyway) and fixing the broken programs that rely on its behavior.


In practice, programs crash when they receive a null pointer from malloc (either through throwing an uncaught exception, or through `if (!ptr) { abort(); }`, or through dereferencing a null pointer). So even if your solution were realistic, it would still amount to killing an essentially random process. When you reach OOM situations and need to kill processes, there are probably better heuristics than "kill whichever process happened to allocate memory after we ran out".


So we're covering up bad programmer behavior by lying to them and then shooting other programs in the head when everything goes south. By default. Sorry if I feel like we should be able to do better than that.

Windows, for instance, doesn't have this kind of overcommit. Allocations need to be backed by RAM+pagefile (though Windows will grow the page file if it can and needs to).


Things go very south in Windows if you ever use a lot of swap. Had this with a CAM process using a lot of swap.

It barely moves the barrier until the backing pagefile cannot grow anymore (which, with a fast but small NVMe drive, can be reached within a few seconds). After that you get stuff like a Task Manager without fonts, as there is no memory left to load them...


Yeah, so, worst case it is exactly like Linux. However, at least with Windows properly behaving software isn't being lied to about its allocations.


My argument was that _even without overcommit or swap_, we would probably want some kind of OOM killer, because the heuristic of "kill the process which happens to try to allocate first when we have run out" is probably one of the worse heuristics for which processes should be killed.


Except that's up to the application to behave that way instead of some mysterious heuristic. If the programmer decides that terminating is the appropriate behavior, or is too lazy to do otherwise, then that is on the program. If programs have broken behavior then fix them, otherwise what is the point of all this open source software anyway? I find it incredible that Linux developers are expected to routinely deal with API breakages on library updates but would rather have random processes be terminated because the OS lies about memory than fix the badly behaving software!


And instead force every application which needs to store large amounts of temporary data to implement its own swapping mechanism? Don't you think we will end up with a lot of even less optimized swapping systems that way?


If you need swap, you need swap.... You shouldn't borrow it from other programs executable pages (without proper accounting).


IMO, a good start would be to fix the OOMKiller to kill all processes that overcommit first ordered by size of overcommit (maybe match the uid first).


Usually in B my computer will lock up before the OOM killer is run. I spam Alt+SysRq+F and then wait 30 minutes hoping that the OOM killer will eventually kill a process, and I pray it doesn't take down X and close everything I had open.

I am not running systemd-oomd or any other userspace OOM killer, though.
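
A small aside in case it saves someone a 30-minute wait: the key combo only works if magic SysRq is enabled, and the same OOM-kill request can be issued directly as root (assuming the kernel was built with SysRq support):

  $ sudo sysctl kernel.sysrq=1              # enable all SysRq functions (Alt+SysRq+F)
  $ echo f | sudo tee /proc/sysrq-trigger   # ask the kernel to run the OOM killer now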


Articles like this further cement my belief that Linux's low memory handling is deeply flawed. (macOS's too, btw.)

My experience with Linux machines that have HDDs and swap is that once they start swapping, they slow down not only until memory is freed (ctrl-C gets through, OOM killer runs, whatever) but actually until reboot. My theory is that processes' VM space is now a minefield. Both anonymous memory and file-backed mmaps now have many gaps which get paged in on demand with inadequate readahead. With each seek taking ~10 ms, it doesn't take too much of that before you should give up and just reboot.

Once memory is available, I'd like the system to proactively page back in both swap and file-backed pages to prevent this. If it does need to happen reactively, it should be in big blocks (each 10 ms of seeking should be followed by at least 10 ms of reading => ~1 MiB, not ~4 KiB or whatever it's doing now). By default.
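
There is at least one knob in that direction for the swap side (nothing equivalent for file-backed pages, as far as I know): vm.page-cluster controls how many pages are read per swap-in, as a power of two. A sketch:

  $ sysctl vm.page-cluster
  vm.page-cluster = 3               # default: 2^3 = 8 pages, i.e. 32 KiB per swap-in
  $ sudo sysctl vm.page-cluster=8   # roughly 1 MiB per swap-in with 4 KiB pages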

I'm not convinced the decision to use swap is ever a good one. E.g., from the article:

> With swap: We can choose to swap out rarely-used anonymous memory that may only be used during a small part of the process lifecycle, allowing us to use this memory to improve cache hit rate, or do other optimisations.

This is short-sighted. Programs can be written to tolerate their filesystem I/O being slow. They often take care to do I/O on a particular thread which is not handling network or GUI operations. They can't realistically be written to tolerate arbitrary memory accesses (including program text, stack, and heap) being that slow. It's a stupid idea to potentially slow down the latter to (try to) speed up the former. And programs shouldn't all have to mlockall, which iirc has other bad consequences. (maybe it allocates backing pages for all mappings immediately? even stuff like stack guard pages that's pure waste?) I have written programs to mlock (just) their program text; I'd prefer that weren't necessary either.
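
For reference, a minimal sketch of the mlockall escape hatch mentioned above. MCL_ONFAULT (Linux 4.4+) is the variant meant to avoid pre-faulting backing pages for every mapping, locking pages only as they are faulted in; the _GNU_SOURCE define and the privilege note are assumptions about a typical glibc setup:

  // Pin this process's pages in RAM so they are never paged out.
  #define _GNU_SOURCE          // may be needed for MCL_ONFAULT on some libcs
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
      if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) != 0) {
          perror("mlockall");  // typically needs CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK
          return 1;
      }
      // ... the rest of the program now runs with its pages locked ...
      return 0;
  }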

I prefer that under memory pressure, a task dies promptly. Ideally the "right" task, but really it just has to be better than making the system unusable for a long time. Even if it kills the display server or pid 1 inappropriately, I was probably going to have to reboot anyway, so...


I agree. Another issue with Linux is that it provides no way to choose which tasks to kill when you start heavily swapping, and the OOM killer won't do anything because it still thinks everything is fine. You can't just use the normal UI to kill tasks at that point because it will have slowed to a crawl due to all the swapping.

Back when I used Linux with spinning disks it was much better to disable swap and hope the OOM killer did something sane than to wait for 10 minutes for a terminal to open and eventually give up and hard restart.

Windows handles it much better. You just press ctrl-alt-delete and because the GUI is properly integrated into the kernel it can always react and pause other tasks even if the system is heavily loaded.


>no way to choose which tasks to kill when you start heavily swapping

This is false, of course. oom_score_adj, or just read on oom_killer.


That's not a choice, it's a predetermined bias.

Anyway, the "real solution" is oomd + PSI (pressure stall information), which can detect when memory is running low and the system is affected.


> I prefer that under memory pressure, a task dies promptly.

This is exactly what memory pressure indicators can be used for. Linux also supports volatile page ranges, that get wiped out under memory pressure even with no swap support.


The OS will drop program text if it encounters memory pressure even if swap is disabled. Anything that isn't anonymous memory is fair game, and it does it even for well-behaved programs on the system.


I mentioned that, yes. It's also problematic, particularly on HDD systems. (SSDs are fast enough that it's mostly okay.) This is why I said "Linux's low memory handling is deeply flawed" rather than just "turning off swap fixes everything".


Got it. Totally agreed that it's deeply flawed.

Without swap, even with SSDs, the system gets into a very bad state when a cgroup is close to running out of memory.

The behavior ends up being that the kernel constantly pages out everything but anonymous memory until it can't anymore, and then it finally terminates the process when it can't page anything else out. That can unfortunately impact other processes on the system during that time.


It seems like there are enough failure modes that everyone has a different story about bad behavior, and unfortunately I don't think there's one solution a sysadmin can adopt (e.g., swap or not, sysctl settings, cgroup settings) that reliably prevents them all. I've been guilty in the past of telling people turning off swap dramatically improves things, but in your scenario it sounds like the opposite is true. /shruggie

The PSI-based oom daemon seems like an improvement, but I imagine there are still gaps there if it doesn't react quickly enough. Adding more parts can often paper over problems but rarely completely solve them...I think what's needed is a rethink: don't page these vital things out. Instead kill a process (chosen by userspace if possible, anything if not). Reserve enough for the killing rather than doing more reclamation to make it happen.


Also, more quickly understand the system isn’t making progress with dropping memory pages and kill the program before it becomes too late.


A mini version of this was presented in the wonderful & amazing "Linux Memory Management At Scale" video, also by Chris Down: https://www.youtube.com/watch?v=cSJFLBJusVY&t=11m42s

You should have swap. You should have swap!!

This video is one of the most amazing, useful, gem-after-gem explanations of Linux you could possibly ever encounter. There are so so few who have the in depth knowledge, across so many broad experiences, at such scale, to understand Linux's memory management.

Side topic #1: this video's advice on cgroups just makes me sad, because I love Kubernetes, but Kubernetes building its own scheduler, in contest with, not in tandem with, the kernel's cgroup system, is super super super f-ing sad. Kubernetes running all its shit in a single cgroup is some decrepit sad sad shit. I still love & use Kubernetes, but god damn, for what basically amounts to a cloud OS, it's using the underlying host OS extremely extremely poorly. This video was key in making me think & consider how much we've lost, how sad it is we don't take advantage of what the kernel is trying to do for us with trading off workloads. Leaving the Kubernetes scheduler to try to fit all the workloads feels like we're letting a bunch of ignorant elementary schoolers run the show. It's embarrassingly superficial.

Side topic #2: it would be really interesting to try to go a bit deeper on the topic. I'd love a world where we can easily ask: what are these anonymous rarely used memory pages we get to swap out? That's the main advantage, under low & medium pressure: the system has some flexibility to move out things not really used, whereas it has to keep everything resident without swap. But answering what these unused things actually are probably requires bpftrace foo to figure out what regions are getting swapped, plus deep skills to figure out what those addresses correspond to. I doubt Linux will just tell us, oh, I evicted this text block from the binary. (It likely doesn't know much about what blocks it's evicting.)


Depends?

If you have a build farm / CI machines, don't use swap. With swap, if a user schedules too many compiles at once, the machine will slow to a halt and become kinda-dead, not quite tripping the dead timer, but not making any progress either. Instead, set up the OOM priority on the users' processes so they are killed first. If OOM hits, clang is killed, the build process fails, and we can go on.

If you have a latency-critical server, don't use swap. With swap, some data will get swapped out, and you will have sudden latency trying to access it. Without swap, it may still swap out the code, but the code is usually much smaller than the data, plus it is all pretty hot.

If occasionally you have to process very large datasets, and you are willing to wait a minute or twenty, you may enable swap. But be patient, and make sure your ssh has no timeout which can be hit.

If you have a small amount of RAM, enable swap. It would be slow, but at least you would be able to do something without crashing.


> If you have a build farm / CI machines, don't use swap. With swap, if a user schedules too many compiles at once, the machine will slow to a halt and become kinda-dead, not quite tripping the dead timer, but not making any progress either. Instead, set up the OOM priority on the users' processes so they are killed first. If OOM hits, clang is killed, the build process fails, and we can go on.

This doesn't really work that well. It's true that if you enable swap and have significant memory pressure for any extended period your machine will grind to a halt, but this will _also_ happen if you don't use swap and rely on the Linux OOM killer.

Indeed, despite the lack of swap, as part of trying to avoid OOM killing applications, Linux will grind the hell out of your disk - because it will drop executable pages out of RAM to free up space, then read them back in again on demand. As memory pressure increases, the period of time between dropping the page and reading it back in again becomes very short, and all your applications run super slowly.

An easy solution to this is a userspace OOM-kill daemon like https://facebookmicrosites.github.io/oomd/ . This works on pressure stall information, so it knows when your system is genuinely struggling to free up memory.

On the historical fleets I've worked on pre-OOMD/PSI, a reasonable solution was to enable swap (along with appropriate cgroups), but target only allowing brief periods of swapin/out. This gives you two advantages:

* allows you to ride out brief periods of memory overconsumption

* allows genuinely rarely accessed memory to be swapped out, giving you more working space compared to having no swap


Eh, I’ve never seen a machine actually use any notable amount of swap and not be functionally death spiraling.

I’m sure someone somewhere is able to use swap and not have the machine death spiral, but from desktop to servers? It’s never been me.

I always disable swap for this reason, and it’s always been the better choice. Not killing something off when you get to that point ASAP is a losing bargain.


FreeBSD isn't Linux, but I've had FreeBSD machines fill their swap and work just fine for months. I had one machine that had a ram issue and started up with a comically small amount of ram (maybe 4 mb instead of 256 mb... It was a while ago) and just ran a little slow, but it was lightly loaded.

I've also had plenty of machines that fill the swap and then processes either crash when malloc fails, or the kernel kills some stuff (sometimes the wrong thing), or sometimes things just hang. Measuring memory pressure is tricky; a small swap partition (I like 512 MB, but limit to 2x RAM if you're running vintage/exotic hardware that's got less than 256MB) gives you some room to monitor and react to memory usage spikes without instantly falling over, but without thrashing for long.

You should monitor (or at least look at) both swap used % and also pages/second. If the pages/second is low, you're probably fine even with a high % use, you can take your time to figure out the issue; if pages/second is high, you better find it quick.
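
On Linux, vmstat is a convenient way to watch exactly that: si/so are the swap-in/swap-out rates. The numbers below are made up for illustration:

  $ vmstat 5
  procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
   1  0 524288  81234  10240 912345    0    2    15    40  310  600  7  2 90  1  0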


The issue is specific to Linux. I’ve had Solaris and SunOS boxes (years ago) also do fine.


Don't mistake "every machine I've seen death spiraling was using swap" for "every machine using swap is death spiraling." Notably, how many machines did you not have to look at, because the swap was doing just fine?


That I’ve administered? None under any significant load!

I even disabled it on the lab Raspberry Pis eventually, and on an SBC I use to rclone 20+ TB NVR archives, due to performance problems it was causing.

It’s a pretty consistent signal actually - if I look at a machine and it’s using any swap, it’s probably gotten wonky in the recent past.


Apologies. I forgot I had posted something. :(

I am a little surprised that every machine you admin has had issues related to swap. Feels high.

For the ones that are now using swap and likely went wonky before, how many would have crashed due to said wonkiness?


There are plenty of workloads which sometimes just spike.

Batch processing, for example.

With proper monitoring you can actually act on it yourself instead of just restarting, which just leads to an OOM loop.


If you pushed something to swap, you didn’t have enough RAM to run everything at once. Or you have some serious memory leaks or the like.

If you can take the latency hit to load what was swapped out back in, and don’t care that it wasn’t ready when you did the batch process, then hey, that’s cool.

What I’ve had happen way too many times is something like the ‘colder’ data paths on a database server get pushed out under memory pressure, but the memory pressure doesn’t abate (and rarely will it push those pages back out of swap for no reason) before those cold paths get called again, leading to slowness, leading to bigger queues of work and more memory pressure, leading to doom loops of maxed out I/O, super high latency, and ‘it would have been better dead’.

These death spirals are particularly problematic because, since they're not 'dead yet' and may never be so dead they won't, for instance, accept TCP connections, they de facto kill services in ways that are harder to detect and repair, and take way longer to do so, than if they'd just flat out died.

Certainly won’t happen every time, and if your machine never gets so loaded and always has time to recover before having to do something else, then hey maybe it never doom spirals.


I try to avoid swap for latency critical things.

I do a lot of CI/CD where we just have weird load, and it would be a waste of money/resources to just shell out for the max memory.

Another example would be something like Prometheus: when it crashes and reads the WAL, memory spikes.

Also, it's probably an unsolved issue to tell applications how much memory they actually are allowed to consume. Java has some direct buffers and heap, etc.

I have plenty of workloads where I prefer to get an alert warning and act on that instead of handling broken builds, etc.


I think the key here is what you mean by using swap. Having a lot of data swapped out is not bad in and of itself - if the machine genuinely isn't using those pages much, then now you have more space available for everything else.

What's bad is a high frequency of moving pages in and out of swap. This is something that can cause your machine to be functionally unavailable. But it is important to note that you can easily trigger somewhat-similar behaviour even with swap disabled, per my previous comment. I've seen machines without swap go functionally unavailable for > 10 minutes when they get low on RAM - with the primary issue being that they were grinding on disk reloading dropped executable pages.

I agree that in low memory situations killing off something ASAP is often the best approach, my main point here is that relying on the Linux OOM killer is not a good way to kill something off ASAP. It kills things off as a last resort after trashing your machine's performance - userspace OOM killers in concert with swap typically give a much better availability profile.


100% agree.

In a situation where a bunch of memory is being used by something that is literally not needed and won’t be needed in a hurry, then it’s not a big deal.

In my experience though, it’s just a landmine waiting to explode, and someone will touch it and bam useless and often difficult to fix machine, usually at the most inconvenient time. But I also don’t keep things running that aren’t necessary.

If someone puts swap on something with sufficiently high performance, then obviously this is less of a concern too. Have a handful of extra NVMe or fast SSD lying around? Then ok.

I tend to be using those already though for other things (and sometimes maxing those out, and if I am, almost always when I have max memory pressure), so meh.

I’ve had better experience having it fail early and often so I can fix the underlying issue.


When I reenabled swap on my desktop (after running without swap for years assuming it would avoid the death spiral, only to find out it was almost always worse because there was no spiral: it just froze the whole system almost immediately), it would frequently hold about 25% of my RAM capacity with the system working perfectly fine (this is probably an indication of the amount of memory many desktop apps hold onto without actually using more than anything else, but it was useful). In my experience if you want a quick kill in low memory you need to run something like earlyoom to kill the offending process before the kernel desperately tries to keep things running by swapping out code pages and slowing the system to a crawl.


It's only one datapoint, but at this very moment a server at work is using a notable amount of swap, 1.5 GiB to be more precise, while functioning perfectly normally.

    $ free -h
                  total        used        free      shared  buff/cache   available
    Mem:          3.9Gi       1.7Gi       573Mi       180Mi       1.6Gi       1.7Gi
    Swap:         4.0Gi       1.5Gi       2.5Gi


I wish you luck! Only time that’s happened before was memory leaks for me, and it didn’t go very long before death spiraling. But if you’re comfortable with it, enjoy.


It's still working just fine, with still the same amount of swap in use (approximately).


> Eh, I’ve never seen a machine actually use any notable amount of swap and not be functionally death spiraling.

For my low-end notebook with solid-state storage I set the kernel's swappiness setting to 100 percent and this problem got magically fixed. It's rock-solid now.

I don't know how it works but it does.
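
For anyone wanting to try the same thing, that's the vm.swappiness sysctl; on newer kernels values above 100 (up to 200) bias even harder towards swapping anonymous pages instead of dropping file cache:

  $ sudo sysctl vm.swappiness=100
  # make it persistent (the path is the usual convention, adjust per distro):
  $ echo 'vm.swappiness=100' | sudo tee /etc/sysctl.d/99-swappiness.conf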


It's pretty common for me to see a gig or two in swap, never really wanted back, and that RAM used for disk caching instead.


I think "Linux drops will drop executable pages without swap" is a symptom of machines with small amount of memory, say 4G or less. So it is pretty outdated for regular servers, and probably only relevant when you are saving money by buying tiny VMS.

Those build servers had at least 64GB of RAM, while executables were less than 1GB (our entire SDK install was ~2.5GB and it had much more stuff than just clang). So a machine would need to balance finely on memory pressure: high enough to cause clang to be evicted, but low enough to avoid the OOM killer's wrath.

I don't think this is very likely on machines with a decent amount of memory.


Fair enough - I've seen it more commonly in smaller machines, but they're also more common in the fleets I've observed (and the ones that are more likely to run close to the edge memory-wise). I have also seen it in systems up to 32GB RAM, so it's by no means a non-issue in systems that are at least somewhat larger. The general observation that oomd/earlyoom + swap is a better solution than no swap still generally applies.


There are CI/CD builds out there which consume far more resources and time, where just killing one part of the build would destroy hours of work.

Not sure why you wouldn't want swap for it?

It will allow you to fine-tune the build later and give that build a realistic chance to finish.


Because once swap activates, the build now takes hours instead of tens of minutes. So it would time out anyway, but only after wasting lots of resources. And even if you increase the timeout a lot instead, your machine now has a bunch of things swapped out, so now your tests time out, which is even worse.

Yes, killing that part of the build did destroy the work of hours. It was still better to disable the swap than try to "ride it out".


Things don't just take hours longer just because the Linux kernel throws out a few pages which haven't been used for a while.

And it also totally depends on how much memory is missing.

I still prefer to have something taking 20 minutes longer instead of failing and fine-tuning the resources after it.


I don't understand your paragraph on Kubernetes. Say you have 1000 machines with 4 TiB each, and you want to schedule processes that will cumulatively allocate 4 EiB. How would you do this without letting Kubernetes handle the scheduling?

Once Kubernetes decides which processes should run on each machine, it allocates separate cgroups to each process group, which it uses for enforcing CPU and RAM resource limits. You can see this by examining the cgroupfs control filesystem. How would you propose improving this?


I’m a little confused by your #1; I was under the impression that each container had its own cgroup.

Note that Kubernetes now has code to work with swap, although some work remains. Tracking issue: https://github.com/kubernetes/enhancements/issues/2400

There is a lengthy conversation in this issue, with some deep insights (and a lot of misconceptions).

https://github.com/kubernetes/kubernetes/issues/53533


Keep in mind that positive memory pressure is something that, if done by more than a single application at a time, will completely break your system, and if done by only one application at a time will severely reduce your performance predictability (but can indeed increase throughput).

So, what he talks about on that video is a very specialized performance optimization that may work on some (certainly not all) systems that you have complete control of. It isn't general advice in any way.


Although the article claims "Disabling swap doesn't prevent pathological behaviour at near-OOM," that hasn't been my experience. My desktop Linux machine has no swap, and when I've run into memory pressure the OOM killer just swiftly terminated the program that was using all my RAM (I later bought more).

When I've used machines with swap, my experience has been the OOM killer was much less aggressive. As a result you're left with an unresponsive system blocked on disk I/O.


That's interesting, I've had just the opposite experience.

I used to run with no swap; when compiling with many parallel processes, my system would sometimes live-lock. The OOM killer never triggered, and as far as I could tell, the issue actually was I/O saturation caused by thrashing on the few pages that the kernel _could_ evict. After enabling swap, I've never encountered this issue again. The machine remains interactive, if slow, under high load.

Anecdotally, SIGSTOP-ing some processes when memory pressure is high, even without swap, seems to have some benefit. I'd resume those batch processes later.


And I had the exact opposite experience to yours just last week, on multiple machines with swap enabled: compiling and going over the available memory made them entirely inoperable and required hard resets. I had to use another computer to be able to browse the internet at the same time, until earlyoom could be installed there.


Sounds like the cause is orthogonal to the presence of swap. What kind of persistent storage do you have on those computers? For me it's been an M.2 in both cases.


I had this happen with various kind of storage - HDD, SSD (both SATA and NVMe)


I have no swap on my laptop, and due to an accidental bug in a program I was developing my machine ran out of memory a few months ago (twice... my first attempt didn't fix it, oops!) In both cases the system locked up for well over a minute before the process got killed by OOM. It was the "classic Linux experience" regarding OOM.

Back in the day I used a lot of FreeBSD on fairly cheap/constrained machines (even for the time) and somewhat frequently-ish ran out of memory, and it was a much "better" experience. I mean, OOM killing stuff is never a "good" experience, but it didn't lock up for over a minute.


I think the key is near OOM. In some cases (e.g. runtimes with configurable heap sizes that trigger garbage collection when the max size is reached) it's easy to set up a system that operates near OOM.


I read that assertion as "there still exist cases where this can happen", not "this has no effect on those outcomes ever".


You do need swap if you're running Linux on a laptop!

If you want Hibernate to work on your Linux laptop, you'll need to have a swap partition with space equal to (or bigger than) your RAM. Apparently, while there is a way to avoid setting aside a partition by using a swap file instead, it doesn't seem to work as well as using a partition.
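
For completeness, hibernating to a swap file does work, it just needs more plumbing: the kernel has to be told both which device holds the file and the file's physical offset on disk. A rough sketch (paths and values are illustrative, and Btrfs needs extra care):

  $ sudo filefrag -v /swapfile | head -n 4   # note the first physical_offset
  # then add to the kernel command line:
  #   resume=UUID=<uuid of the filesystem holding /swapfile> resume_offset=<that offset>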

Hibernate seems to work OK on my current laptop running Linux * and using a swap partition. I've had poorer experiences with previous laptops though. Now that I'm sure it's possible, having functioning "Hibernate" is a requirement for me when buying Linux laptops.

* - A LaptopWithLinux laptop running up-to-date Manjaro.


Hibernate is not a requirement for everyone. First time I set up Arch Linux, I just never got round to setting up swap, and never felt any need for it either. Over the course of three years, I ran into OOM trouble four times (8GB of RAM was a lot in those days…).

On my current laptop, I have a swap file and hibernation works fine (because this time I wanted to be able to switch over to Windows occasionally briefly without losing my session). It’s definitely more fiddly to set up in the Arch DIY style, but it’s still just following a set of instructions, so a distro that didn’t have the DIY philosophy could set it up easily enough.


Do you use multiple drives on your laptop? Because a single EFI system partition might become corrupted if you boot a system while the other is hibernated. Please refer to https://wiki.archlinux.org/title/Dual_boot_with_Windows#Fast...


I think this is a little overblown. The old kernel NTFS driver, which nobody uses and which is read-only anyway, ignores the fact that a drive is mounted by Windows, so you could read corrupted data from an NTFS drive mounted by Windows if Windows hasn't fully completed writes that are hibernated, but you can't fuck up the partition because Linux won't write anything.

The userspace ntfs-3g, which is the one everyone actually uses, will refuse to mount a hibernated Windows drive. The same appears to be true for the new kernel ntfs3 driver, since its mount options include an option to force-mount a dirty drive.

If you use the options to force mount a dirty drive read/write, then yes, all bets are off, but you have to have dug pretty deep into your config to find out how to do that and at that point you have to know what you're doing.


Sorry for the late reply.

Okay, so I'm aware of the conundrums surrounding mounting the NTFS volume of a hibernated Windows installation detailed on the wiki page. However, the same section clearly states that booting another OS even while a Linux system is hibernating can result in a broken EFI System Partition.

This clearly does not relate to NTFS. I'm not knowledgeable enough to understand the means by which this occurs, but I felt like I needed to give the person a heads-up before they brick their install.


Hibernate is just a workaround for having too long power-on or power-off times.

I have been using Linux on all my laptops for the last 20 years. I have always taken care to configure them such that the boot time does not exceed some tens of seconds at worst.

The result is that I have never felt any need to use either hibernate or a swap.


It's also related to the amount of volatile state that exists in applications: it doesn't matter in the slightest to me that I can boot my laptop to the desktop in 10 seconds, it'll take 10 minutes to actually open all my applications and restore them to the state they were in when I turned it off.


All the applications where the volatile state matters, e.g. browsers like Chrome or Firefox, source code editors, office applications, etc. already restore their tabs/working documents on restart, in a second or so on my computers.

If there are such applications which do not restore their previously open documents upon restart, it is they who should be improved, instead of using hibernate.

More frequently than wanting to close the entire computer, I just happen to stop using a certain application. In that case I want to be able to close it and restart from the same state at some later time.

So this is a feature that I want in any application. If the feature already exists at the lower granularity of each application, it also removes the need for a hibernate feature of the entire computer.


The biggest issue is restoring terminal states. I think there are some tmux plugins for this (but I never looked too much into it), and maybe some terminals support it natively too, but getting it properly restored is not so easy (including environment, shell variables, the shell's "local" history, and that kind of thing). Maybe there's some zsh plugin for it though.

Now, if you don't use terminals much then that's not much of an issue, but I tend to do most things there, with Firefox being the only actual GUI I'm running in Xorg.


You can get away with swap smaller than RAM just fine:

https://wiki.archlinux.org/title/Power_management/Suspend_an...


Most people don't really need hibernate. If you use your computer every couple days sleep is good enough. And if you make sure the apps you use save their state often enough that you won't lose things, it's just as fast to boot normally.


> If you want Hibernate to work

If you need swap enabled to have it, why would anyone want this? Swap makes everything worse.


Ahh now I understand why people disable swap on Linux boxes.

I came from a SunOS/Solaris & HP-UX background (late 90s) where we routinely configured swap for the machines.

Later on I encountered Linux admins who refused to configure even a small amount of swap and they couldn't explain why. It seemed very odd to me. Nobody ever said the simple "swap sucks in the Linux kernel" to me.


> I came from a SunOS/Solaris & HP-UX background (late 90s) where we routinely configured swap for the machines.

Most non-Linux Unixes also do/did not allow memory (RAM+swap) over-commit by default, which may have coloured behaviours.


Indeed. At a past job we had this million dollar cluster of HP machines (T520s), and one of the managers wanted to do some end of year spending, so they upgraded the memory in the cluster from 2GB to 4GB (back when that was an insane amount of memory, ~1995).

But, we had no storage to back that new memory. They purchased 2GB per machine for god knows what, but didn't add 2GB discs. So we literally couldn't use it (because of how storage was allocated on that cluster, we couldn't just re-carve it to get an extra 2GB of swap; it had dedicated 2GB drives for the swap).


Why couldn't you use memory without corresponding swap space?


The HP-UX kernel required at least as much swap as RAM, due to the way it handled memory and swap. I used to know the details of those algorithms, but have freed those portions of my memory for other uses. It was something like: The algorithms HP-UX uses for virtual memory consider that memory based on swap, and the RAM is kind of like a cache layer in front of the swap. This makes the handling of allocation and swapping easier because you know where everything would go before you start trying to swap.


Back in the 90s memory was small compared to disk, so it was fine.

There's no way you'd enable swap equal to memory size these days, for large memory machines.


Disks are still much bigger than memory on the majority of systems.

I can buy >16TB disks: I don't think I've deployed a machine with more than 1TB of RAM. (Maybe 2-4 (out of hundreds) in an HPC environment at previous employer had 1.5T?)


It's more about the speed; if you have 1 TB of swap, mostly used, then the situation is so serious it's basically fatal.


This is an area where everyone's anecdotal evidence is different, because people are running all sorts of different workloads on all sorts of different RAM/swap/disk configurations.

Naturally, some people will have had consistently positive experiences with swap, while others will have come out with consistently negative impressions. 32GB RAM + 512MB swap backed by a fast SSD is very different from 2GB RAM + 512MB swap backed by spinning rust. JVM is different from MySQL, which again uses memory differently than PostgreSQL.

And yet we are so eager to hurl sweeping generalizations at each other, holding religiously onto our own subjective cargo cults. :(


For Linux on desktop, my experience with swap used to be extremely bad before systemd-oomd. I got those system lockups easily in various workloads that for the longest time I went without swap, and a crashing program was a better experience.

On the server, sprinkling in a bit of swap was standard practice in some orgs. As under load I could still log into a slow server and do something about it.

My current problem with swap in Linux is that the default choices are bad. While a swapfile isn't optimal, allocating a slice of the disk for swap complicates my life when I do RAM upgrades. With hibernate being tied to configured swap on a slice of disk (ugh), I just didn't bother with it after upgrading RAM.


You are right. My own anecdotal experience:

Rails, Django, Elixir/Phoenix web development, sometimes with docker. No swap on my 32 GB laptop both when I used it with a 750 GB HDD in its early years and with two 1 TB SSDs now. I'm usually using about 20 GB RAM because I'm leaving every single project open in its own virtual desktop, to get there quickly if a customer needs something done.

No problem whatsoever that I remember. I'm never close to filling up RAM, that could be the reason. I'd buy extra RAM instead of enabling swap. The price would be extra battery depletion during the night.


> Disabling swap does not prevent disk I/O from becoming a problem under memory contention, it simply shifts the disk I/O thrashing from anonymous pages to file pages.

This is my pet peeve when using linux as desktop. Does anyone know if this can be disabled entirely? Something like mlockall(MCL_ONFAULT) but for all the processes. I wish there was a knob for this (e.g. vm.swappiness=-1) but I haven't found one. :(

On my desktop the most common problem I have is a runaway binary or a runaway chrome tab. 90% of the time I simply want that killed, so swap is not being helpful. Unfortunately it takes ages for the OOM killer to activate. I have indicators for how much free memory I have left, and I can manage accordingly. I'd prefer a snappier desktop to pretending I have infinite memory. If I feel I have too little memory, I can just buy more, but that still doesn't protect against Linux locking up if you accidentally use up everything.

Here's a simple repro:

  // gcc -std=c99 -Wall -Wextra -Werror -g -o eatmem eatmem.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  
  int main(int argc, char** argv)
  {
      int limit = 123456789;
      if (argc >= 2) limit = atoi(argv[1]);
      setbuf(stdout, NULL);
      for (int i = 1; i <= limit; i++) {
          // Touch every byte so the kernel actually backs each 1 MiB
          // allocation with physical pages, not just address space.
          memset(malloc(1 << 20), 1, 1 << 20);
          printf("\rAllocated %5d MiB.", i);
      }
      sleep(10000);  // hold the allocations so the near-OOM state persists
      return 0;
  }
And here is how you use it:

  $ gcc -std=c99 -Wall -Wextra -Werror -g -o eatmem eatmem.c
  $ ./eatmem
  Allocated 31118 MiB.Killed
  $ ./eatmem 31110
  Allocated 31110 MiB.
Without args it keeps allocating RAM until the kernel kills it; basically it tells you how much free RAM you have. The second time, you ask it to allocate a little less, to create a near-OOM condition on your machine.

This pretty much locks up my machine (e.g. the mouse pointer doesn't budge) and I have to invoke the kernel OOM killer to recover (Alt+SysRQ+f). I don't use swap. If I could somehow disable the file paging then I could avoid this problem. (I really don't care how "efficient" file paging is, I want a snappy system instead!)


> This is my pet peeve when using linux as desktop. Does anyone know if this can be disabled entirely? Something like mlockall(MCL_ONFAULT) but for all the processes. I wish there was a knob for this (e.g. vm.swappiness=-1) but I haven't found one. :(

Looks like the supported way of doing this is to use the vmtouch(8) utility to ensure that the files (or file pages) you care about are resident in RAM. There's also a fadvise utility, packaged within fcoretools, that seems to be similar.
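
For example, something like this (the directory is illustrative) keeps a set of binaries and libraries resident even under memory pressure:

  $ vmtouch -v /usr/lib/firefox/         # report how much of it is already in the page cache
  $ sudo vmtouch -dl /usr/lib/firefox/   # daemonize and mlock those files into RAM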


I've looked into these in the past but I deemed them too much of a hassle. I don't know what files I care about. The solution I'm proposing is simpler and doesn't require any daemons or figuring out the files I need.


Install earlyoom. It's more aggressive than the default OOM killer, and you can tell it which processes to prefer (so you can have it kill your browser, for example, rather than anything else).

Your test program got killed both times without any slowdown on my machine.


Thanks, I guess that could work. But still... I'd prefer a solution that doesn't require a daemon that has to continuously wake up and check for this. I wish for a cleaner solution. Besides, it doesn't fully solve my "snappiness" desire: it doesn't prevent the kernel from paging out pages. I guess the answer for that is to get faster disks but that can get tricky on my old raspberry pi machines.


I find swap and memory usage in general confusing. On my Windows machine, Chrome is listed as using about 1GB of RAM and I have maybe 50 tabs open. Seems like a lot. I think it’s possible that Chrome creates 50 processes and that the memory of DLLs loaded get over counted 50x. So 20MB of actual RAM DLL usage could be listed as 1GB. However maybe the DLLs are say 10MB and then Chrome would be using 500MB of data of 500MB of code. This might help with switching tabs and instantly seeing a page. For me though none of the tabs data need caching. I would prefer they just reload when needed. So I have this confusion. How much RAM is Chrome using? How much is truly needed vs hopefully fortuitous caching. Can some data be marked more important or is it all just LRU? Should Chrome free all memory on non-visible tabs? How do I know how much RAM to buy when I don’t know what anything really needs. Process Manager shows 8GB used of 16GB, but I currently have very little running. Chrome, Outlook, Ring Central. How would one know if this could run with 4GB or 8GB? Do you just have to try it and see if it seems sluggish. How does one distinguish memory used by code vs data?


To answer your Chrome question, it’s open source and the tab discarding code is pretty straightforward.

GetSortedLifecycleUnits sorts using GetSortKey, which for tabs is their last used time. So it’s just LRU. But some comments hint that maybe something smarter will happen eventually.

As for your overall question, there are no easy answers. Deciding if memory is being used well, or could be repurposed to do something more useful is a difficult question to answer.

https://source.chromium.org/chromium/chromium/src/+/main:chr...

Edit: also, it’s really hard to reason about how much memory you need unless it’s very well defined (you know the size of data, exactly what processes to run), once you get to desktop operating systems, it’s more run and find out than reasoning from first principals.

Disclosure: I work on ChromeOS.


1GB for 50 tabs seems reasonable to me considering for me it sucks around the same amount of memory for like 15-20 tabs.


Swap is exceptionally evil with any managed language (e.g. Java/C#) - running the GC over swap is effectively indistinguishable from a hard crash... except it takes forever.

I have not used swap for over 20y, and never "missed" it.


To be fair, concurrent GC running in a separate thread should be enough to address this issue.


I never liked the OOM reaper. What it didn't do, was ask me what I wanted to kill.

What it decided to kill, was usually something I didn't want it to kill, like X


Use `earlyoom`. It comes with a sane whitelist and blacklist.

https://news.ycombinator.com/item?id=27218036


Yep. sshd was another favorite target. I'd rather it just crashed the system and rebooted than doing what it does.

Ideally I'd want to tell it: first, kill java. Next, python. I think this is possible but I never was able to understand how to do it.


kill java, and keep on killing java with fire. Then move on to the printer daemon, the 200 vtys you forked on console I never use, the mouse daemon, and maybe in the end, sendmail.


There's a more configurable userspace alternative: https://github.com/facebookincubator/oomd


"Under no/low memory contention...

With swap: We can choose to swap out rarely-used anonymous memory that may only be used during a small part of the process lifecycle, allowing us to use this memory to improve cache hit rate, or do other optimisations."

How do you improve cache hit rate when there's no memory contention? And what other optimizations?

"Without swap: We cannot swap out rarely-used anonymous memory, as it's locked in memory. While this may not immediately present as a problem, on some workloads this may represent a non-trivial drop in performance due to stale, anonymous pages taking space away from more important use."

When there's no memory contention, what's the "more important use"?

These are honest questions - I'm not an expert on the topic.


By cache, I assume the writer means disk cache. Disks are slow! (Yes, even your PCIe 4.0 x4 NVMe SSD is.) So the kernel (usually) uses RAM to cache reads and writes. (This can be bypassed, but let's ignore that.) Under heavy disk IO, using fast RAM for this cache is more beneficial than reserving it on behalf of a program that may use it very rarely, or maybe even not at all (in the case of memory leaks)!

If on a hypothetical 8GiB RAM system, programs require 6 but 2 of those are rarely used, there is no memory contention but the extra space for disk cache is still a helpful boost.

I do think program memory is more valuable on most desktop systems, since commonly memory access is expected to be fast while disk access is expected to be slow. This means that interactive software tends to mitigate for slow IO, while it can't do much against being swapped out.

On servers however I'd probably rather let the kernel do its job (at least if I'm focusing on throughput), unexpected slowness may worsen a worst case but the average case is helped enough.

Luckily, the Linux kernel, for all of its memory management flaws (don't get me started on overcommit), allows configuring this using the vm.swappiness parameter that tells it what to prioritize.


I think the point is that, the more free RAM you have, the more data the OS can speculatively read from disk when there's otherwise unutilized bandwidth. If a program happens to need the data, then it's already in RAM and the access latency is almost zero, vs. the non-zero latency of waiting for a hard disk platter to spin around.

The claim being made is that there are certain situations where processes allocate a lot of physical memory but then very rarely accesses it. If it gets paged out to swap, then there's a latency hit on the next access. If the OS can avoid more latency hits total using the same physical RAM for caching, then the computer could do more overall calculations per second.


Yep! The idea of "free memory" largely doesn't exist in modern operating systems. We haven't gotten to the point of power gating DRAM (have we?), so we're paying the cost of the DRAM all the time; not using it means you're just throwing energy away. Keeping lots of purgeable data in that otherwise unallocated space, like (as you said) disk cache, not only speeds up access but also saves a ton of power, since a hit to DRAM costs leagues less than going out to external storage. While this is a non-trivial engineering problem, it's almost always a good trade-off, since it boosts performance and saves power at no extra cost beyond the effort of speculative loads (which, when linear, are much cheaper than random one-off loads) and managing the cache.


With the concerns stated in the article, I can't see any benefit to adding swap vs. adding an equivalent amount of RAM. The author mentions that "not needing swap because you have more than enough RAM" is a common misconception, but I cannot see how that can be. For any given workload, you could replace the useful amount of swap with an equivalent amount of additional RAM. Sure, some pages cannot be reclaimed, but the amount of anonymous memory that can be reclaimed is limited by swap size anyway. In an alternative system with more RAM instead, the pages that would have been in swap are locked in memory, but the swap scenario does not end up with more useful memory. More RAM would also be more flexible, as it allows more cache than the swap scenario whenever swap isn't full of pages. Thus we can conclude that there will always be some amount of RAM that is sufficient, and therefore, if your system has enough memory that you don't face memory contention, you don't need swap.

If there's a better explanation for why swap is useful, I'd like to hear it, but the article is unconvincing.


I also completely agree with everything you said.

I have stopped using swap on all my servers, desktops and laptops about 20 years ago.

I have never encountered any situation when swap could have been useful, but I have always installed generous quantities of RAM on all computers.

As long as the quantity of used memory does not exceed the physical memory, swapping pages cannot improve the performance in any way.

The only exception, which is mentioned in the parent article, is when some very seldom-used memory pages would be swapped out and the freed memory used as disk cache, hopefully increasing the performance of an I/O-bound application.

Whether this strategy pays off, i.e. whether a larger disk cache filling all the available physical memory improves performance by more than is lost when the prediction that the swapped pages will not be needed soon turns out to be wrong, is extremely application dependent.

This hypothesis, that a significant performance improvement can be enabled by a slightly larger disk cache, might happen to be true in certain cases on computers which still use slow HDDs.

Except for backup/archiving, where the disk cache matters very little, I have not used HDDs for more than a decade.

On computers with SSDs, the chances of finding an application where a small increase in disk cache size causes a noticeable performance increase, larger than what is lost by swapping out the seldom-used memory pages, are negligible, in my opinion.


What about when you can't add any more RAM?


I disagree with the position in this post -- in my experience swap has universally been a contributor to system instability and performance issues. I run all my Linux servers without swap, and for workstations try to restrict its use to suspend/hibernate only.

1. Allowing a process's mapped pages to be flushed to disk will cause performance to become unpredictable. This applies to both anonymous pages written to swap, and file-backed pages. The first thing every server process should do is mlockall(MCL_FUTURE) so the kernel can't decide to swap out parts of your RPC handlers. More sophisticated implementations (e.g. databases) can selectively mlock() specific pages.

2. Using swap to mitigate memory overcommit isn't useful because the process that overcommitted its memory should just be killed instead. This is where cgroups are useful: you can load test to understand the allocation curve, then tell the kernel to limit your service's process to 64 GiB or whatever. If it tries to go wild and take up more than its share, it gets a SIGKILL instead of taking the entire machine out of service. (A minimal sketch of this follows after point 3.)

3. Swap will destroy SSDs. You thought the write load from logging is bad? Try putting a consumer-grade SSD into a machine with 512 GiB RAM and let the kernel swap to it -- that SSD will be dead in a year.
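
Re point 2, a minimal sketch of that kind of cgroup limit using systemd (./my-service and the 64G figure are placeholders for whatever your load test says):

  # run the process in its own cgroup with a hard memory ceiling and no swap;
  # exceeding MemoryMax triggers the OOM killer for this cgroup only
  sudo systemd-run --scope -p MemoryMax=64G -p MemorySwapMax=0 ./my-service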


My reading of your comments above is that you are speaking about cases where swap is being used as a replacement for memory. I take that from, among others, "Swap will destroy SSDs". Swap used only for paging unused data out is an infrequent occurrence; there's no way it's going to kill an SSD.

My reading of this article is that he is specifically advocating NOT using it as a replacement for having adequate RAM.

My experiences, around 5 years ago, of running ~100 VMs without swap, basically confirmed the assertions in the article. Mostly the systems would run ok, but when they got into memory pressure they would live-lock. I had hoped for just OOM killing, which happened sometimes, but also sometimes the machine was just locked.

I switched over to having some small amount of swap, maybe 0.5-1GB on these boxes and haven't had a livelock since.

Plus, it's better to get unused pages out of RAM, and swap gives them somewhere to go.

The one thing that HAS killed an SSD, in this case an Intel (320 series sticks in my head) is ZFS slog+L2ARC. Swap hasn't been a problem in my experience, but I also am not using swap to account for not having enough RAM.


I've never seen those livelocks before -- but I saw plenty of swap thrashing. How much memory did those VMs have?


I don't recall exactly, but probably 1-4GB. Not particularly low memory, but when something would come along that needed more memory than the system had, say a load burst, I was quite surprised to find the systems without swap just stop responding. I was expecting OOM killer to kill the offending process, the load balancer to remove it from the load, monitoring to say what process had gotten killed, but instead, even after waiting half an hour, the system was just wedged.


That seems pretty low, especially if this was 1GB of RAM. I'd use swap on such systems. I don't think there are any reasons today to buy physical machines with that little RAM, but I can see how you might have such VMs.

That said, I agree with other commenters who said this sounds like a kernel bug. I am wondering what was going on to wedge the system that badly. Would putting executables on tmpfs (so they cannot be paged out) have helped?


Is "live lock" here page cache thrashing?


I think it's usually code page thrashing, since executables and shared libraries can always be swapped out to disk, only to be swapped back in when the process gets its next quantum of time to execute. If people say instruction cache misses are costly, wait till you need to read the next instruction from a 5400 RPM HDD...


Thanks, understood. However, file-backed pages such as executables and shared libs are not written to swap space, since the pages are just dropped and read back in from the regular (non-swap) filesystem, no? Not that that causes any less disk I/O, of course, but they are not anonymous pages, which is generally what gets written to swap. When I think swap I think "written to swap", but maybe you mean swap in the general sense of paged out?


Yes, I meant "swapped out" in the general sense of "paged out". On the other hand, if you have actual swap space, hot code pages will be much less likely to be paged out than colder anonymous pages.


> Try putting a consumer-grade SSD into a machine with 512 GiB RAM and let the kernel swap to it -- that SSD will be dead in a year.

Are you sure? I've heard the argument before that swap is one of the better cases for SSD longevity, because it's a few big writes instead of many tiny ones, and found that convincing. Also, on which workloads would a machine with that much RAM touch swap at all? My workstation has "only" 64 GiB and has swap configured (to enable suspend), yet even during heavy use with some databases running it never touches it.


The alternative to the few large writes swap does is not a lot of smaller writes; it's no writes at all.

Also:

> on which workloads would a machine with that much RAM touch swap at all

On most of them? You seem to be using a flawed model of the Linux swapping algorithm. Writes aren't caused by lack of memory.


I know you're not the GP, but please give some specific examples of such workloads, I honestly want to know.

To be clear, yes, it's well-known that the kernel will page out rarely used stuff, but those are (almost by definition) pages that are basically never written to. The claim was that writes to swap occur all the time, to the point where they wear out the SSD, that seems to be something different.

At least I've never seen anything like that with any of my (workstation) workloads, like large C++ builds, TensorFlow experiments, or running lots of virtual machines. And that was only with 64 GiB RAM, not even the 512 GiB that the post I replied to mentioned.


The kernel writes the pages into swap long before it decides they are rarely used; otherwise, writing them out only at reclaim time would be way too slow to be of any use. Any normal use will fill some swap space, and if the memory pages keep getting dirtied (and you have spare disk IO), your computer will keep writing them to swap.

How much RAM you have is not really relevant.


I had good success using an Intel Optane drive for swap. Of course, that was because the task I was doing required hundreds of gigs of memory and my poor little machine only had 64; not exactly common in a business setting.

Performance was great. Normally when swapping the performance of the machine is terrible, as you say. But those Optanes are so fast that it just didn’t matter. I was doing several hundred thousand operations per second, and they were all taking <4µs. It was a wonderful upgrade.


> Swap will destroy SSDs. You thought the write load from logging is bad? Try putting a consumer-grade SSD into a machine with 512 GiB RAM and let the kernel swap to it -- that SSD will be dead in a year.

I have an OCZ Vertex LE in a 32-bit laptop that was my daily driver for somewhat-serious development for the better part of a decade and sees somewhat regular use as a bedside media consumption machine. It ran (and still runs) Gentoo Linux, so -while it was my daily driver- its drive saw _frequent_ writes due to weekly system-updating-build activity.

This machine has 4GiB of RAM installed, of which only 3.2 GiB is available due to weird BIOS limitations. It has _always_ had swap enabled, and has _often_ made heavy use of it.

This SSD has 93,560 power-on hours, and has written 48,384 GiB. As far as I (and SMART) can tell, it's as good as the day I flashed the v1.1 firmware on it.

I don't believe your claim is generally true. Other folks who have intentionally run drives _way_ past their advertised wear-out points have similar stories to mine.


Running a light workload on a consumer SSD is fine. That's what they're designed for. Your ~48 TiB of writes over a decade is well within its design parameters.

If you used it as swap in a server, you could expect something closer to its max write rate, which is on the order of 10-20 TiB per day[1]. Run that for a year and you're at 3+ PiB total writes.

[1] Contemporary reviews say the OCZ Vertex LE had a maximum write throughput of 250 MB/s. 250 MB/s × 3,600 s/hour × 24 hours ≈ 21.6 TB, which in IEC notation is about 20,116 GiB per day.


If your swap is constantly used.

Which you should avoid by having proper monitoring.


So I run servers used to process images; 99% of the time it's typical ~5 megapixel photos, but occasionally it will be ~150 megapixels. We use swap because, for that 99% of the time, everything fits in 4GB of RAM. The occasional big ones understandably take longer to process.


> Swap will destroy SSDs. You thought the write load from logging is bad? Try putting a consumer-grade SSD into a machine with 512 GiB RAM and let the kernel swap to it -- that SSD will be dead in a year.

You can run with a low swappiness value and the kernel will not proactively swap out process memory unless required. Even light use of swap is probably fine, as it likely involves rarely-used process pages that will never be written to after being swapped out, so there's no thrashing on the SSD, only a few large writes.


I consider swap very necessary unless Linux is used as a platform for something like Kubernetes, in which case you need to obey its rules. I add enough swap on all the servers I manage, and so far they are rock stable. The good thing is that even when there's enough memory, some unused memory will migrate to swap over time, and that frees some RAM for disk buffers or consumption spikes. This behaviour seems somewhat broken on some modern distributions, so I prefer old distributions; I guess some defaults were changed, which leads to less stable and less performant systems in my observations.


The one thing that used to drive me crazy, especially about Windows, was "why the hell are you using swap and why are you freezing to do stuff with the hard drive on each click when you have a third of the physical RAM free right here according to your own task manager". I still don't understand that. I still feel like modern desktop OSes are terrible with memory management when you run out of physical memory (which I did way too often on all my previous machines).


1. https://news.ycombinator.com/item?id=26244093 | M1 Mac high SSD writes

Not sure if the above case has been solved on M1 Macs.

On x86 Macs, if programs exceed the available memory, the system will start swapping to disk too, though that is not how Windows does swapping. On Windows, if I don't enable the pagefile, some programs will crash even when there is plenty of memory available.

Anyway, I monitor memory usage on any OS that I use. Out-of-memory situations are mostly due to negligence; it is what it is. If people are unhappy because memory monitoring would waste their focus energy, just buy more computers.


Part of the problem, and perhaps the main justification for swapping today, is all the programs which run infrequently but don't use the OS features to get themselves started on demand or periodically. On Linux, do "ps axl"; on Windows, use Task Manager. Look at all that stuff.

There has to be some way to evict those memory hogs.


> Swap is a storage area for these seemingly "unreclaimable" [anonymous] pages that allows us to page them out to a storage device on demand. This means that they can now be considered as equally eligible for reclaim as their more trivially reclaimable friends, like clean file pages, allowing more efficient use of available physical memory.

I don't see that argument. Why should anonymous pages be treated equally?

And executable pages / program text: maybe those should never be swapped out either? Is this configurable? This is usually more the behavior I would want on my system. I want the OOM killer to just kill something directly.

I remember that Google also has some custom handling for OOM on Linux, although I can't find much information on it now, except: https://lwn.net/Articles/432223/


> And executable pages / program text, maybe those also never should be swapped out? Is this configurable? This is usually more the behavior I would want on my system.

Aren't they? It's always been my belief that when such pages are reclaimed, they are not swapped out because they can be read back from disk anyway. Don't these pages fall under what the author means by "clean file pages"?


I'd like to know how swap-on-zram changes things. Fedora has started to enable zram on new installations (https://fedoraproject.org/wiki/Changes/SwapOnZRAM) and disabled swap partitions.


In my experience swap to zram works really well.

I have 16GB of RAM in my Linux laptop running Ubuntu and it isn't quite enough: Chrome, VirtualBox, gopls and Thunderbird regularly push it over the edge into swap death, needing a reset to recover.

My first fix was to install earlyoom, a user-space OOM killer. It has nice desktop notifications, and when things got tight I'd get notifications about it killing Chrome processes.

That made things liveable, but not perfect: I couldn't run Zoom without earlyoom killing it.

I finally added swap on zram, and that has helped enormously, meaning I have enough RAM to run everything I need without earlyoom kicking in regularly.

Anyway, I'm just about to upgrade to a new laptop I can put more than 16GB of RAM in, but I think I'll leave both earlyoom and the compressed swap in place.
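
For anyone who wants to try swap-on-zram by hand rather than through a distro's zram-generator, a rough sketch (size, algorithm and priority are just examples; algorithm availability depends on your kernel):

  sudo modprobe zram
  sudo zramctl --algorithm zstd --size 8G /dev/zram0
  sudo mkswap /dev/zram0
  # higher priority than any disk-backed swap, so the compressed swap fills first
  sudo swapon --priority 100 /dev/zram0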


See also "Do we really need swap on modern systems?" (2017):

* https://news.ycombinator.com/item?id=13715249


My practical experience is that Linux absolutely needs swap to work well. It doesn't need a lot, but it needs some. I had a ton of OOMs even though the memory wasn't "full" on the machine running my search engine. The machine had dozens of gigabytes in "buff/cache" that could be reclaimed, but for some reason it failed to do so and let loose the OOM killer instead. I added 512 MB of swap and the problem just went away overnight. I've since increased the allocated memory by a lot more than 512 MB.


While I would not remove the swap on a production machine, just to be safe, I've been running my home computers exclusively on Linux without swap for ~5 years. I've never encountered any of those issues, for some reason.


I usually don't have swap on my workstations either, which is why I initially thought I could get away with it on my server.

Probably different usage patterns. I routinely memory-map something like half a terabyte of data on this machine. Might also be due to huge pages; that feature seems a bit janky, to be honest.


Two sidenotes:

- current desktops, and some entry-level servers, are designed with much more CPU than their usage needs, and chronically too little RAM. I'm curious whether that's a classic "my CPU is big" thing or an old reminiscence of a past where software was far less resource-hungry and CPUs were very slow...

- storage is "cheap enough" to "waste" some in exchange for a buffer for peak loads; however, until GNU/Linux does something less ugly than the OOM killer, there may still be issues...


Articles like this make me feel really out of touch. Honest questions: has it become standard practice to write applications that don't handle allocation failure at all because it's meant to be handled at the kernel level by specialists such as the author? As an alternative to using an OOM killer, why doesn't the kernel stop granting memory when it runs out and let the application that requested it sink or swim?


Other OSes such as Windows do the latter. Linux "overcommit" is sometimes problematic, but has been the lay of the Linux land for quite some time. I don't love it.


Opinions differ widely here.

My observations:

- I can get away with about half the memory on Linux vs Windows (a significant contribution comes from not having to run VMs for Docker and WSL, but it also seems applications and services use less memory on Linux).

- I always used to use swap. These days it seems the defaults in distro setups push one away from swap.

- For some reason I feel it worked better before (always w/swap).

- These days I feel I see some problems whether I use swap or not.


I agree with this post, though there's another factor to consider: SSD endurance. A system that swaps a lot will write more, decreasing endurance. You might need to buy a new SSD every 5 years.


Sounds like both using swap and not using swap are common workarounds for some problem the kernel couldn't handle properly.


Swap has always been so slow for me that I just disable it on all of my machines. I would rather the OOM reaper just SIGKILL whatever is using all my RAM than deal with slowness (which often persists after the OOM situation is gone).


Yes, but that's not quite how it works. As mentioned in this article, without swap, the system can live-lock before the OOM killer can take care of things. This has been my experience as well.

I had hoped that getting rid of swap would prevent thrashing, but instead the system would live-lock.


I feel like the kernel live locking in low memory environments should be treated as a bug and not something we try to solve using swap. Like...if the OOM killer can't free memory under low memory constraints, something has gone seriously wrong


I've tried twice to update Lakka (a custom Linux image that runs RetroArch and emulators) on an RPi 3B with 1 gigabyte of memory (and I think no swap). Updating Lakka from the Pi itself invokes a script which downloads a file from the network to SD card storage. This should not take up unbounded amounts of memory, but it seems to trigger OOM livelock or something anyway (I can't tell what it is, since the GUI and SSH session hang, and Lakka doesn't enable Alt+SysRq or TTYs for some godforsaken reason). Enabling SSH and running `watch -n1 free -h` shows a concerningly low free memory amount, and IIRC the available memory number crept downwards before the GUI and SSH session both hung simultaneously.

Terminating the GUI and performing an offline update from one SSH session while running `watch -n1 sync` avoids the hang. I haven't tried sync while leaving the GUI running.

Will Linux OOM handling ever be fixed?


The problem is that the OOM killer only runs if no pages can be freed. But, on any system, there is some amount of memory that can be swapped out: code pages can always be dropped and later re-read from their executable/library files on disk. On a system only running a few processes, this is unlikely to matter.

But if you have a lot of processes, you may end up in a situation where the kernel can always free just enough pages by paging them all out to disk. Now you get no OOM kill, but the next time a process gets to run, it stalls until its pages are read back in from disk. Then the next process is scheduled, and it too needs to be paged back in, causing another stall, and so on. The machine will probably end up executing, at best, a few instructions per millisecond for hours...


I can't for the life of me find where the article talks about live-lock and the OOM killer. I searched for "lock" and "OOM" and none of them related to that.


He doesn't use the term live-lock, but see "6" at the top of the article, where he talks about "pathological behavior at near-OOM". Also "3": "Disabling swap does not prevent disk I/O from becoming a problem". Those, in my experience, have been the situations I was in that I called live-lock.


4GB of swap on a 4GB RAM VPS lets me do an occasional yarn build that would otherwise crash the server, on a box that doesn't need 8GB of RAM the rest of the time.
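
In case it helps anyone with a similar VPS, a minimal sketch of adding a 4G swap file (path and size are examples; on filesystems where fallocate doesn't work for swap, use dd instead):

  sudo fallocate -l 4G /swapfile
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile
  # persist across reboots
  echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab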


I just use zram for this use case.


I will look into that, thanks!


Putting swap on a ramdisk is faster than both no swap and disk-backed swap. Yes, it seems weird; yes, it seems broken.


Swap implementations have gotten a lot more sophisticated and complicated over time. But fundamentally it's an undesirable thing to be dependent on, as it is just a lot slower than normal memory.

Part of the problem is that there is a lot of magical thinking rooted in decades of half-truths about swapping on various operating systems with completely different implementations. Most of those preconceptions, if they were ever true at all, are now obsolete.

For example, Windows XP used to have swapping. It commonly sucked the life out of any PC it was on. For no good reason whatsoever. Symptoms: the system performance would degrade over time and people would routinely reboot to fix that. Even with plenty of memory on a system, it would spend lots of time swapping things in and out of memory. Very infuriating if you just spent a lot of money on RAM. It was optimized for people with cheap hardware. Computers with enough memory were an afterthought. So it pretended you were running out of memory even when you weren't.

Easy solution: disable swap completely; especially on machines with more memory than the total addressable memory per process on 32 bit windows XP (2 GB). I remember having a particularly slow windows laptop with 4GB of RAM. My life improved massively after I disabled swap on that thing. Alt tab was instantaneous; always. I could get individual processes to crash with an out of memory error but only if I deliberately opened way more processes than fit in memory (so, don't do that). Exactly the kind of thing that would make the laptop completely unusable with swap on in any case. I actually preferred the quick crash. Trying to close a non responsive application to free up some memory while the system is 100% engaged with grinding your disk to fine dust is not fun.

Linux is way more complicated, and it depends on the use case. A swapping server is a dead server, so you should be conservative with any kind of swap there. Basically, if you have enough memory and no memory leaks, there should be no need for it to degrade response times (aka swap), ever. If you do have memory leaks, swapping is actively harmful because it degrades response times. Better to reboot the server. Or rely on the Docker OOM killer. Or assign more memory. Spin up more servers. Etc. And of course you monitor memory usage on servers to preempt any kind of issues. Swapping is the punishment that happens on a poorly configured, unmonitored server. You should have no need for it.

For a laptop with not enough memory because the buyer tried to save some cash and bought something with a pitiful amount of memory, definitely have swap space available. Performance will suck but at least it will be able to run some software. And Linux isn't as bad at this as Windows XP used to be. People actually care.

Otherwise, you probably want it to be able to hibernate and you need swap space for that. But perhaps with a very low swappiness value so that it won't actually swap. Or in case you don't need hibernate, turn it off completely and save some disk space.


> For example, Windows XP used to have swapping. It commonly sucked the life out of any PC it was on. For no good reason whatsoever.

Win XP-era machines had tiny amounts of RAM, so it makes total sense that swap would be heavily relied on.


The Windows XP era was quite long; I used it between 2001 and 2008. The laptop I mentioned had more memory than the 32-bit version could technically allocate for a single process, and most processes used far less than the maximum. So I could run one big process like that and have all the usual stuff in the background without much issue.

The 64-bit version was released later and of course allowed for more memory, but they never really addressed the swapping issues. So it continued to make sense to upgrade the memory and switch off swapping rather than replace the entire laptop because it was "slow". Most instances of people complaining about their laptop's performance were basically running into this and never realized the fix was a $100 memory upgrade plus a simple configuration change. A lot of those laptops are still fine for running Linux today.



