
I disagree with the position in this post -- in my experience swap has universally been a contributor to system instability and performance issues. I run all my Linux servers without swap, and on workstations I try to restrict its use to suspend/hibernate only.

1. Allowing a process's mapped pages to be flushed to disk makes performance unpredictable. This applies both to anonymous pages written to swap and to file-backed pages. The first thing every server process should do is call mlockall(MCL_CURRENT | MCL_FUTURE) so the kernel can't decide to swap out parts of your RPC handlers (a minimal sketch follows after this list). More sophisticated implementations (e.g. databases) can selectively mlock() specific pages.

2. Using swap to mitigate memory overcommit isn't useful, because the process that overcommitted its memory should just be killed instead. This is where cgroups are useful: you can load test to understand the allocation curve, then tell the kernel to limit your service's process to 64 GiB or whatever (a second sketch of this follows after the list). If it tries to go wild and take up more than its share, it gets a SIGKILL instead of taking the entire machine out of service.

3. Swap will destroy SSDs. You thought the write load from logging is bad? Try putting a consumer-grade SSD into a machine with 512 GiB RAM and let the kernel swap to it -- that SSD will be dead in a year.
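
For point 1, here's a minimal sketch of that mlockall() call, assuming Linux/glibc; a real server would also need a sufficient RLIMIT_MEMLOCK or CAP_IPC_LOCK, which is omitted here:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* MCL_CURRENT pins everything already mapped (code, libraries, the
           initial heap); MCL_FUTURE pins mappings created later (heap
           growth, thread stacks, new mmaps) */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* ... run the actual server here; its pages now stay resident ... */
        return 0;
    }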
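
And for point 2, a rough sketch of the cgroup side, assuming cgroup v2 mounted at /sys/fs/cgroup with the memory controller enabled for the subtree; the group name "myservice" and doing the setup from inside the process are illustrative choices only -- in practice you'd usually let systemd or your init system write these files:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *value) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fputs(value, f);
        return fclose(f);
    }

    int main(void) {
        /* create the group (harmless if it already exists) */
        mkdir("/sys/fs/cgroup/myservice", 0755);

        /* hard cap at 64 GiB: past this, reclaim kicks in and, failing
           that, the OOM killer acts inside this cgroup only */
        write_str("/sys/fs/cgroup/myservice/memory.max", "68719476736\n");

        /* move the current process into the group before starting the
           actual service */
        char pid[32];
        snprintf(pid, sizeof pid, "%d\n", (int)getpid());
        write_str("/sys/fs/cgroup/myservice/cgroup.procs", pid);

        return 0;
    }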




My reading of your comments above is that you are speaking about cases where swap is being used as a replacement for memory. I take that from, among others, "Swap will destroy SSDs". Swap used only for paging unused data out sees only infrequent writes; there's no way it's going to kill an SSD.

My reading of this article is that he is specifically advocating NOT using it as a replacement for having adequate RAM.

My experience, around 5 years ago, of running ~100 VMs without swap basically confirmed the assertions in the article. Mostly the systems would run OK, but when they got into memory pressure they would live-lock. I had hoped for a simple OOM kill, which happened sometimes, but sometimes the machine just locked up.

I switched over to having a small amount of swap, maybe 0.5-1 GB, on these boxes and haven't had a livelock since.

Plus, it's better to get unused pages out of RAM by writing them to swap.

The one thing that HAS killed an SSD for me, in this case an Intel (320 series sticks in my head), is ZFS slog+L2ARC. Swap hasn't been a problem in my experience, but I'm also not using swap to make up for not having enough RAM.


I've never seen those livelocks before -- but I saw plenty of swap thrashing. How much memory did those VMs have?


I don't recall exactly, but probably 1-4GB. Not particularly low memory, but when something came along that needed more memory than the system had, say a load burst, I was quite surprised to find the systems without swap just stop responding. I was expecting the OOM killer to kill the offending process, the load balancer to remove it from rotation, and monitoring to say what process had been killed, but instead, even after waiting half an hour, the system was just wedged.


That seems pretty low, especially if this was 1GB of RAM. I'd use swap on such systems. I don't think there is any reason today to buy physical machines with that little RAM, but I can see how you might have such VMs.

That said, I agree with the other commenters who said this sounds like a kernel bug. I'm wondering what was going on to wedge the system that badly. Would putting executables on tmpfs (so they cannot be paged out) have helped?


Is "live lock" here page cache thrashing?


I think it's usually code page thrashing, since executables and shared libraries can always be swapped out to disk, only to be swapped back in when the process gets its next quantum of time to execute. If people say instruction cache misses are costly, wait till you need to read the next instruction from a 5400 RPM HDD...


Thanks, understood. However, file-backed pages such as executables and shared libs are not written to swap space, since those pages are just dropped and read back in from the regular (non-swap) filesystem, no? Not that that causes any less disk I/O, of course, but they are not anonymous pages, which is generally what gets written to swap. When I think swap I think of "written to swap", but maybe you mean swap in the general sense of paged out?


Yes, I meant "swapped out" in the general sense of "paged out". On the other hand, if you have actual swap space, hot code pages will be much less likely to be paged out than colder anonymous pages.


> Try putting a consumer-grade SSD into a machine with 512 GiB RAM and let the kernel swap to it -- that SSD will be dead in a year.

Are you sure? I’ve heard the argument before that swap is one of the better cases for SSD longevity, because it’s a few big writes instead of many tiny ones, and I found that convincing. Also, on which workloads would a machine with that much RAM touch swap at all? My workstation has “only” 64 GiB and has swap configured (to enable suspend), yet even during heavy use with some databases running it never touches it.


The alternative to the large writes swap does isn't a lot of smaller writes; it's no writes at all.

Also:

> on which workloads would a machine with that much RAM touch swap at all

On most of them? You seem to be using a flawed model of the Linux swapping algorithm. Writes aren't caused by lack of memory.


I know you're not the GP, but please give some specific examples of such workloads; I honestly want to know.

To be clear, yes, it's well-known that the kernel will page out rarely used stuff, but those are (almost by definition) pages that are basically never written to. The claim was that writes to swap occur all the time, to the point where they wear out the SSD; that seems to be something different.

At least I've never seen anything like that with any of my (workstation) workloads, like large C++ builds, TensorFlow experiments, or running lots of virtual machines. And that was only with 64 GiB RAM, not even the 512 GiB that the post I replied to mentioned.


The kernel writes pages to swap long before it decides they are rarely used. Otherwise, writing them out at eviction time would be way too slow to be of any use. Any normal use will fill some swap space, and if the memory pages keep getting dirty (and you have spare disk IO), your computer will keep writing them to swap.

How much RAM you have is not really relevant.


I had good success using an Intel Optane drive for swap. Of course, that was because the task I was doing required hundreds of gigs of memory and my poor little machine only had 64; not exactly common in a business setting.

Performance was great. Normally when swapping the performance of the machine is terrible, as you say. But those Optanes are so fast that it just didn’t matter. I was doing several hundred thousand operations per second, and they were all taking <4µs. It was a wonderful upgrade.


> Swap will destroy SSDs. You thought the write load from logging is bad? Try putting a consumer-grade SSD into a machine with 512 GiB RAM and let the kernel swap to it -- that SSD will be dead in a year.

I have an OCZ Vertex LE in a 32-bit laptop that was my daily driver for somewhat-serious development for the better part of a decade and sees somewhat regular use as a bedside media consumption machine. It ran (and still runs) Gentoo Linux, so -- while it was my daily driver -- its drive saw _frequent_ writes from the weekly system-update build activity.

This machine has 4GiB of RAM installed, of which only 3.2 GiB is available due to weird BIOS limitations. It has _always_ had swap enabled, and has _often_ made heavy use of it.

This SSD has 93,560 power-on hours, and has written 48,384 GiB. As far as I (and SMART) can tell, it's as good as the day I flashed the v1.1 firmware on it.

I don't believe your claim is generally true. Other folks who have intentionally run drives _way_ past their advertised wear-out points have similar stories to mine.


Running a light workload on a consumer SSD is fine. That's what they're designed for. Your ~48 TiB of writes over a decade is well within its design parameters.

If you used it as swap in a server, you could expect something closer to its max write rate, which is on the order of 10-20 TiB per day[1]. Run that for a year and you're at 3+ PiB total writes.

[1] Contemporary reviews say the OCZ Vertex LE had a maximum write throughput of 250 MB/s. Times 3,600 seconds per hour, times 24 hours, converted to IEC notation, is roughly 20,116 GiB per day.


If your swap is constantly used.

Which you should avoid by having proper monitoring.


So I run servers used to process images. 99% of the time it's typical ~5 megapixel photos, but occasionally it will be ~150 megapixels. We use swap because 99% of the time the job fits in 4 GB of RAM. The occasional big ones understandably take longer to process.


> Swap will destroy SSDs. You thought the write load from logging is bad? Try putting a consumer-grade SSD into a machine with 512 GiB RAM and let the kernel swap to it -- that SSD will be dead in a year.

You can run with a low swappiness value and the kernel will not proactively swap out process memory unless required. Even light use of swap is probably fine, as it likely involves rarely used process pages that will never be written to after being swapped out, so there's no thrashing on the SSD, only a few large writes.
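
A minimal sketch of turning that knob programmatically, assuming the standard Linux /proc/sys interface; the value 10 here is just an example, and the usual route is simply sysctl vm.swappiness=10 or an /etc/sysctl.conf entry:

    /* lower vm.swappiness by writing /proc/sys/vm/swappiness (needs root);
       lower values bias reclaim toward dropping page cache rather than
       swapping out anonymous pages */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/sys/vm/swappiness", "w");
        if (!f) {
            perror("fopen /proc/sys/vm/swappiness");
            return 1;
        }
        fprintf(f, "10\n");   /* example value; the default is typically 60 */
        return fclose(f) == 0 ? 0 : 1;
    }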



