The Day “/Proc” Died (dev.to)
137 points by rkeene2 on July 2, 2019 | 77 comments



Some things never change. Is there a single UNIX variant that doesn't fail in some way when it can no longer write to disk? It's unacceptable for most programs to fail if they can no longer write to disk; Firefox also fails, but of course gives no indication as to what's happening.

Now, of course, no UNIX variant I'm aware of has a real notion of system programs, so there are no programs that get special privileges that would protect them from this. It would be preferable for the system to die rather than permit this debugging nonsense, since the problem might actually get fixed if the machine died in this case.

This is an asinine failure case mired in 1970s malpractice. Don't you agree this is damning and unacceptable? No one has any right to be proud of this mess. There are millions of lines of code, and yet basic failure cases aren't truly accounted for, or are handled in the most asinine of ways, such as with the "Out of Memory Killer".


Did you know that Linux has this special notion of “root reserved space” specifically for situations like this? In the same way, there is a per-process “oom_adj” which can be used to control OOM killer priority, to spare the system processes.

This problem has been solved, multiple times. Of course, many distributions fail to mark the appropriate processes as system processes, so they fail anyway, but that is just a bug, not a glaring design omission.
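For illustration, a rough sketch of both knobs (device and service names are placeholders; current kernels expose the tunable as oom_score_adj rather than the older oom_adj):

  # Reserve 5% of an ext4 filesystem's blocks for root-owned processes
  # (5% is the mke2fs default; pick whatever margin you want)
  tune2fs -m 5 /dev/sda1

  # Exempt a critical daemon from the OOM killer: -1000 means "never kill",
  # while positive values make it a more likely victim
  echo -1000 > /proc/$(pidof -s sshd)/oom_score_adj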


In the twelve years I've used Linux, the OOM killer has never triggered before my system grinds to a halt and never recovers. This problem has not been solved.


Pro tip: you can use Alt + SysRq + F to trigger the OOM killer immediately. It has helped me avoid pulling the plug on my desktops on multiple occasions over the years when I accidentally started RAM-eating programs. Just make sure SysRq is enabled in sysctl.
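For reference, a quick sketch of checking and enabling it (the drop-in file name is just an example):

  sysctl kernel.sysrq                              # 1 = all SysRq functions; some distros ship a restrictive bitmask
  echo "kernel.sysrq = 1" > /etc/sysctl.d/90-sysrq.conf
  sysctl --system                                  # reload sysctl configuration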


When the machine is busy -- and even when it's not -- it's a very uncertain thing to try SysRq combinations.


Yes, it's risky. But when the machine is swapping to death and pulling the plug is the only option, SysRq is a better alternative.


In ops, I've seen this happen very many times: a Linux server running happy and free because the kernel OOM killer murdered the reason that server existed, leaving alone some side process with a memory leak (usually some external service agent or maintenance service run amok). I learned well how to fix that over the years.

(spoiler alert: it was much more often about becoming ornery when devs insisted that their JVM app could have a 12 GB heap on a machine with 12 GB of memory)


That's pretty much why we use vm.panic_on_oom = 1: you have no idea what's going to be killed...
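Roughly what that policy looks like as a sysctl drop-in (example values; without kernel.panic set, the box just sits at the panic instead of rebooting):

  # /etc/sysctl.d/oom.conf
  vm.panic_on_oom = 1   # panic instead of picking an OOM victim
  kernel.panic = 10     # reboot 10 seconds after any panic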


I am in the camp of having systemd (or your service manager of choice) restart any killed service, or any service failing a health check, and emitting an ERROR-or-higher log message.

I want my systems to heal themselves. More often than not these memory problems end up being slow leaks which can be effectively permanently resolved with periodic restarts, and the engineering time to fix them is appropriately not prioritized.

I want to know that there has been a problem but I would rather not be forced to do anything about it unless absolutely necessary.
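A minimal sketch of that policy as a systemd drop-in (unit name and values are hypothetical; a SIGKILL from the OOM killer counts as an unclean exit, so Restart=on-failure covers it):

  # /etc/systemd/system/myapp.service.d/restart.conf
  [Service]
  Restart=on-failure
  RestartSec=5s
  # optionally cap restart storms with StartLimitIntervalSec/StartLimitBurst in [Unit]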


That is probably being caused by your use of swap space, not an OOM issue. I've had multiple cases of the OOM killer kicking in on my system, all without it slowing way down.


Lacking swap space causes more severe symptoms in an OOM situation, not less, in my experience. I think this is because everything that can be evicted from RAM is evicted before the OOM killer gets invoked, which means every disk access slows to a crawl.


No. This happens even without swap; the OOM killer might as well not exist.


Yep, when Chromium ate up all the memory, it just hung the whole OS. I waited for the OOM killer for 10+ minutes, then the cursor could be moved again, then it froze again...


If you want hilarious fun: make a GL shader that takes ~30 seconds to run. GPUs have only very recently become preemptible (if they are at all yet? I lose track of what is “planned” vs. released).

Make it run that shader in a loop.

See how well your system appears to respond.

IIRC macOS has a 60s-or-so watchdog that hard-resets the GPU; while the GPU is hung, the screen is not updated. Everything is running fine, the CPU isn't pinned or anything, but the GPU is blocked, so there's no compositing and thus no screen updating.

I'm not sure what Linux does in that case, and I think Windows may be able to paint because the DirectX driver interfaces let it do ... something? I've always assumed some way to DMA straight to the framebuffer, but no real idea :)



I'm on 5.0.0 and I legit haven't noticed a difference. If I run out of RAM without swap the system freezes and if I have swap then the system freezes when both are full. The only reliable solution is having a RAM+swap usage graph on my screen at all times and then closing stuff manually.


You probably won't until the distro you use (or you, manually) sets up something other than the default OOM killer.

You probably have too much swap. More than ~10 seconds * your I/O speed (so let's say 512M-1G) is probably the maximum, for the reasons you mentioned.


The system freezes also without any swap enabled but much more suddenly (there's no slowdown before dying). It's really just that the OOM killer triggers way way way too late.


The OOM killer is a terrible hack that no self respecting system should have ever employed.


The OOM killer is a “solution” to a very real problem created by a sensible design choice: not committing physical memory and swap whenever address space is mapped. There are very good (and noticeable) reasons for not eagerly committing, but fundamentally, once you have done so you have to decide what to do when you end up needing more physical space than is available.

Linux went down the “if a process is trying to do this, it must be important, so I'll prioritise it and kill something else” route; the alternative is to kill that process when the commit fails.

Either is a valid option. The OOM killer ran against a regular desktop user's idea of the correct course of action, but for a server it might not have.
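For context, the commit policy itself is tunable; a sketch of the strict mode that refuses allocations up front instead of invoking the OOM killer later (the ratio is an example value):

  # 0 = heuristic overcommit (default), 1 = always overcommit,
  # 2 = strict accounting: commit limit = swap + overcommit_ratio% of RAM
  sysctl vm.overcommit_memory=2
  sysctl vm.overcommit_ratio=80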


The problem is that the something else might actually be performing a critical task.


The assumption being made is that the app going mad for memory is the critical one. Something has to die, and deciding which is hard.


It triggered on one of my systems yesterday, and killed the runaway process.

When we were running tests of a new distributed system on our development (slightly underspecced) cluster, it would kill the distributed system processes when they took too much RAM.

As others write, having slow or "too much" swap can delay the OOM killer from running in a reasonable time.


Sounds like you're starting to swap heavily. Adjusting swappiness to 0 may help there.
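i.e., something like this one-off (persist it via sysctl.d if it helps):

  sysctl vm.swappiness=0   # strongly prefer reclaiming page cache over swapping out anonymous pages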


This also happens without swap.


I fill up my tiny 250GB SSD on my ThinkPad all the time (well, it's mostly logs from a certain terrible database topping it up now and then).

Everything including my usual desktop environment keeps working. I have /home on a separate partition though.

Most of the time I only notice that it filled up again when pacman -Syu fails because it can't save anything.

It probably really depends on your setup and how stuff is configured. It appears the various systemd tools keep working just fine though, otherwise my system wouldn't even boot.

If I could make a wish, I'd want my system to just buffer writes (transparently) to RAM once the SSD filled up, then fire some events that could be intercepted by a GUI, so the user could be informed that he'll lose data if he shuts the system down now. As a last resort, it would abort any shutdown the user initiated and drop him into a shell so he can fix the mess and allow the system to save his data.


Same here only I don't have /home on a separate partition. I don't usually notice it's full until I go to download something and it fails.


We have a legacy platform running on VMware with an NFS SAN. If/when connectivity is lost between the cluster and the SAN, we get to experience resolving disk issues across several hundred VMs. Yanking access to the underlying storage device creates the most frustrating array of issues in Linux, depending largely on the workload on the server. By far the most common is that the mount becomes read-only, forcing a reboot, which then halts in order to fsck. Windows, on the other hand, just resumes from where it left off, or sometimes reboots, but is usually clean.

This platform is currently being retrofitted with vsan.


Since ext4 became the default, my most common cause of bizarre behaviors has been running out of inodes.

   df -i
will show this, but a bare df command will not.

Getting off topic now, but does anybody know if the ext4 utilities changed the calculation of inodes when formatting compared to ext3/ext2 utilities? Running out of inodes on those filesystems was fairly unusual, but I've seen it happen a dozen times in the past 5 years on ext4.


Across the approximately 2000 VMs we look after as an MSP, we see this at least once a week. It's almost always badly cleaned up session files from a php app or similar.


The default ratio of inodes is configured in /etc/mke2fs.conf. You should either change that or override it with the -i argument when you create the filesystem.

My desktop's / filesystem is around 238,000,000,000 bytes, and with the default ratio of one inode per 16384 bytes, I have around 14,500,000 inodes.

Note that with a default blocksize of 4096, you're limited to 4× as many inodes as you have at present, so if you're seeing this weekly I recommend monitoring the number of remaining inodes (df -i), or changing the app to store sessions in a database.
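A sketch of both options (device path is a placeholder; bytes-per-inode can't go below the block size, hence the 4× limit mentioned above):

  # at mkfs time: one inode per 4096 bytes instead of the default 16384
  mkfs.ext4 -i 4096 /dev/sdb1

  # or change the default for future filesystems in /etc/mke2fs.conf:
  #   [defaults]
  #     inode_ratio = 4096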


We're talking about preexisting systems. Inodes are already tweaked up to the highest setting during build, and we monitor, similar to what you suggest, via a checkmk script. If the apps running were ours, we could change them to clean up better or store sessions through other means; alas, this is just something you have to handle when doing break-fix response for hundreds of clients making their own decisions.


So raise the inode allocation? It's a small amount of additional space to fend off a time-consuming failure cause.


A rookie mistake a lot of admins make is never creating a separate partition for /var/log. I've lost count of how many times servers went into a weird mode when the root filesystem filled to 100%.


I broke a machine today by deploying a docker image to it. I couldn’t even ssh in because the disk was too full.

Which probably means all of those machines have the same fucked up bad configuration. If that team had something you could mistake for humility, it wouldn’t be so bad.


I always rm the dirs inside /var/log and ln them back from another disk, because I'm too lazy to change the default configs scattered across /etc for the many installed programs.
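Presumably something along these lines (paths are examples; stop whatever is writing there first):

  mv /var/log/myapp /data/logs/myapp        # move the logs to the bigger disk
  ln -s /data/logs/myapp /var/log/myapp     # leave a symlink where programs expect to write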


Most programs should use syslog by default, so you shouldn't really have to configure much beyond enabling syslog output. Then you just have to configure your syslog implementation to write wherever you desire, rotate the files, or forward the messages to other machines for storage, etc.


> Most of the programs should by default use syslog

Is that still true today? Docker wants you to log to stdout, so that's what most newer applications do. systemd also wants you to log to stdout, and will redirect stdout to journald/syslog automatically. In fact, an application that only logs to syslog can turn into a minor headache when you want to dockerize it. Which is why stuff like https://github.com/sapcc/syslog-stdout exists.

I'm having a related problem with another project that I'm working on where I wrap an OpenLDAP server. It would be much easier to properly wrap it if OpenLDAP would just log to stdout instead of bypassing me and going for the syslog. Maybe at some point I'll set up a separate mount namespace for it to pass a /dev/syslog shim into it. But this shows that the log-to-stdout pattern is much more Unix-y because it composes better.


> Is that still true today?

I think it is.

> Docker wants you to log to stdout, so that's what most newer applications do.

According to Debian popularity contest [1], which matches my observations, Docker itself isn't a particularly common package to be found in a system.

> systemd also wants you to log to stdout

While that's an option, there is sd-journal(3), which allows proper logging with priorities and custom fields.

[1] https://popcon.debian.org/by_inst

Edit: Perhaps worth mentioning that even with systemd and logging to stdout, syslogd (and maybe journald) configuration should be sufficient to sort out the log files, as mentioned in the grandparent comment.


If your dockerized application logs to a file you can always symlink that file to /dev/stdout as a last resort.
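The usual trick, e.g. in a Dockerfile RUN step (the path is a placeholder; the official nginx image does the same for its access and error logs):

  ln -sf /dev/stdout /var/log/myapp/app.log   # writes to the "file" now land on the container's stdout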


one word: logrotate
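e.g., a minimal drop-in under /etc/logrotate.d/ (path and retention are examples):

  /var/log/myapp/*.log {
      # rotate daily, keep a week of compressed history,
      # and don't complain if the file is missing or empty
      daily
      rotate 7
      compress
      missingok
      notifempty
  }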


Interesting. I'm legitimately surprised the author put so much work into this research (happy they did!). By far one of the biggest culprits for "really odd behavior" is a full disk, or a disk in some failing/failed state, to the point where `df -k` is one of the first commands I'll run when troubleshooting.

Does this company not have disk monitoring?


I think it's because once you know how to look at internals, you are tempted to do it that way :) And you learn a lot that way too. But yeah, probably not the most effective way.


Did you not see that the date was 2011? I think this is a lessons-learned sort of article. When things go wonky, look for simple reasons. At least he didn't think it was a hardware bug.


Nope, I didn't, actually. Which is funny and makes a bit more sense, since I haven't seen a full disk cause bad locking like that since around the date of the article. Still, this never should have been an issue, since disk space monitoring should have prevented the problem from ever happening in the first place.


This failure mode has existed for as long as storage has; the article being from 2011 doesn't mean it was novel. I was a minor back then and I had already hit this issue. It was just an oversight, not a consequence of it being a different age.


Wow, that's almost exactly the same problem I encountered in Linux last week, where simply reading /proc/$pid/cmdline would block the process attempting to read the process info. It appears to have been related to this issue:

  https://lkml.org/lkml/2018/2/20/576
And much like the Solaris issue, as best I could tell the original processes (this occurred multiple times on two different nodes) would seem to be blocked either in the filesystem or memory management layers flushing pages.


Please don't put things that are not code in codeblocks. In this case, it makes the link unclickable for no good reason. Working link: https://lkml.org/lkml/2018/2/20/576


This is part of why I resisted requests to make the Linux ps command sort processes by default. Half a process listing is a useful hint.

It also saves memory, which is great on the insane 10240-core boxes.


Wow - a cool example of live kernel debugging. It is very cool that the kernel was still usable while this issue was occurring.

Also lucky that someone was logged into the zone without a /proc dependency. Usually people have complex shell prompts that might require /proc lookups.

It is concerning, though, that a less privileged zone could affect the entire system.


Slightly tangential, but I find it a bit hard to accept that systems do not have a more graceful failure mode when their disks are full.

I keep a few GB free on my / but when I inadvertently fill it, the system becomes almost impossible to use. Would it be so hard to keep the last few MB as reserved space for debugging purposes and refuse any space allocation that isn't devoted to an 'ls' or a 'baobab' process?



You can reboot as root from a terminal, but really, there is no good reason to fall back to this mode instead of more gracefully refusing, from userland, to use up more space.


Yep, things like desktop environments don't even start up when / is full even if /tmp or /home isn't. Awful.


I've never used mdb on Solaris, but it's where I first learned of strace (and ptrace), which unlocked soooo much of how Unix worked for me and led me down the path to Unix wizardry (well, that and access to open source code bases). In fact, when I started reading this, that's where I thought it was headed (and I think I could have come to the conclusion quicker than trying to read mdb).


I've never heard of mdb before. I mean, I've never actually used Solaris (OpenSolaris/illumos, to be specific) for anything other than a few experiments here and there, and nothing in production. But I've been intrigued by it, mostly from watching talks by Bryan Cantrill. It seems that at least Joyent bets heavily on it, and it seems that's mostly due to ZFS. There's also DTrace in this area; again, I haven't used it, but I've been itching to use bpftrace on Linux as an equivalent. Anyway, what I really came to ask is: who uses Solaris or illumos? Does it have a future? How relevant is it?


I've been using illumos on and off since it was called "OpenSolaris", and illumos is rapidly becoming less relevant just due to the tiny ecosystem. DTrace and mdb are both great tools, but it's getting to the point where ZFS on Linux (which possibly can't even be shipped as a binary without violating the Linux EULA) is seeing more usage than all other downstream consumers (and ZFS on FreeBSD has probably had wider usage than illumos for a long while now).

Ultimately, being a libre *nix that is better than Linux in a few ways seems to be a long-term losing proposition, as Linux will eventually check all the boxes you need (even if it's not quite as nice), thus steadily shrinking your niche.


> the Linux EULA

What are you talking about? This isn't a thing that exists as far as I'm aware.



And that's why you must monitor your servers' health. You know, for when they start acting funny.

I'm not related to them in any way, but as a personal favorite (I even use it at home) I'd recommend Zabbix, as it's open source and quite straightforward to install and deploy its agents; once configured, you can even forget about it.

Gosh, its default alerts will give you hints about things you never considered checking before, while integration of new/bespoke software can be done in a matter of minutes.


Heh. I've yet to witness a system that survived having no disk space. Luckily the whole thing didn't lock up; I had a remote mail server die like that and it required an on-site intervention.


Hm. I am impressed by the casual kernel debugging; I couldn't do that. However, why would I have to, if running check_disk on all important filesystems tells me when filesystems are >80% or >90% full?

There will be bugs I cannot debug, because I cannot casually debug my Linux kernel. Yes. I have yet to encounter them personally, but I'll get there. This one just seems weird, though, because it's a very, very basic problem to monitor for.
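For example, the stock plugin invocation looks roughly like this (the plugin path varies by distro; thresholds are percent of free space and free inodes):

  check_disk -w 20% -c 10% -W 20% -K 10% -p /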


Nobody thought to run "df"? Disk being full is a common source of all sorts of weird server problems.


> Nobody thought to run "df"? Disk being full is a common source of all sorts of weird server problems.

Yep, to the point where `df -h` has become one of the first things I run when a server starts acting funny or things stop working.

Disk being full is far too common - runaway logging, weird temp files, etc., or sometimes just a box that nobody maintained for years and years.

The fun part starts after you've identified which partition or drive is full - now you have to identify the problem files!


First things you run on login?

It ought to be monitored with alerting by default - and inode use, too.


True. Unfortunately good monitoring is an afterthought in most organizations.


The great thing about Solaris is all the debugging tools, the bad thing about Solaris is you have to use all the debugging tools.

As to why you might want to do this -- it wasn't obvious that the filesystem being full was related to "echo *" in /proc hanging, until after the debugging was done.


Mh, I fully agree that I don't expect '(cd /proc; echo *)' to hang because of the disk being full.

However, with my current setup, the alerting checks all disks of all monitored systems every minute, and once a single one of them exceeds 80%, we get alerts. That's no smart setup; that's stock Nagios/Icinga/Icinga2 with stock NRPE checks. This is considered a very cheap and basic setup when running servers. We had this on our Solaris systems out of the box.

And in practice, this is an alert with a very high true-positive rate and very few false positives. Most of the false positives we've had to deal with were on systems with more than 3-4 TB of storage. And systems filling up more than 5%-10% of their storage per minute tend to trigger other stock alerts as well.


No production monitoring type alert went “blip” when the disk(s) started filling past a certain percentage?


I've encountered a ton of these at my previous and current job, in Linux.

The primary problem is that forms of ps output that read the full command line (from /proc/$pid/cmdline) require reading memory from the process. This requires, at least, a read-lock on the process's memory map semaphore (mmap_sem), and lots of other things like to access mmap_sem, including other memory allocation (write lock), a page fault (read lock, so you can figure out what to fault in), etc. In particular, if the process is in the middle of mapping or faulting a mapped page from a slow filesystem - such as NFS or a network-backed block device provided by a hypervisor - then it can sit around with mmap_sem for arbitrarily long.

Usually the process taking a read lock on its own mmap_sem, or someone else taking a read lock, is harmless, since it's a reader-writer lock and there can be multiple readers. But as soon as a writer declares an intent to take a write lock, further readers are blocked to avoid writer starvation, which means a single slow reader will prevent all further readers. See http://blog.nelhage.com/post/rwlock-contention/ for some excitement there.

You can generally read /proc/$pid/comm (short command line) and /proc/$pid/status, which both just reference info in the kernel's task_struct and don't require taking a lock on the userspace memory map. You can also read /proc/$pid/syscall, which will tell you what syscall it's in and the numeric arguments, and usually you can read /proc/$pid/stack, which tells you the kernel stack of the process. (Though I have recently found that that one also takes a lock, fortunately one that's much less frequently contended.) If you're trying to make sense of why a system is stuck, and ps aux is unresponsive, my go-to is grep 'disk sleep' /proc/*/status, followed by reading the corresponding /proc/$pid/stack. If you're lucky, you'll see which module is slow (filesystem / block I/O? networked filesystem? FUSE? etc.) and can try to address that. Or perhaps you'll see several processes trying to get a lock on something and one that looks like it's holding the lock and stuck doing work; if you can address (perhaps kill) that process, the system might make progress.
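The same triage, spelled out ($pid is a placeholder; reading /proc/$pid/stack usually requires root):

  # list processes stuck in uninterruptible ("disk") sleep without touching their memory maps
  grep -l 'disk sleep' /proc/[0-9]*/status

  # then see where each one is blocked in the kernel
  cat /proc/$pid/syscall /proc/$pid/stack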

lsof likes to read /proc/$pid/maps, the list of mapped files, which of course requires an mmap_sem read lock. It does this so that it can list files that are mapped but no longer have a file descriptor (e.g., shared libraries get opened, mmapped, and closed). If you know that you're only interested in files with file descriptors - e.g., you're looking for a socket or something - you can do this with less contention by looking at /proc/$pid/fd/, which is a directory of magical nodes that show up as symlinks to open files. (They're not really symlinks; for instance, they'll work even if the actual file is deleted. But you can ls -l them as if they were symlinks, so ls -l /proc/*/fd/* | grep is a pretty decent alternative to lsof.)
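That lsof substitute as a one-liner (the grep pattern is a placeholder; run as root to see other users' processes):

  # enumerate open file descriptors without needing mmap_sem;
  # deleted files still show up, with their targets suffixed "(deleted)"
  ls -l /proc/[0-9]*/fd/* 2>/dev/null | grep myapp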



Sounds like an excellent [edit: struck "Linux"] kernel bug report


It's Solaris, not Linux.


And there was no point in reporting kernel bugs to Sun/Oracle at the time, because they did not care.


This is more confirmation of my anecdotal experience that when things fail, it's 90% of the time a filesystem problem (out of space, out of inodes, etc.). Thanks :)


Just df -h, dude.


Right, but how do you know to check that? What if you're out of inodes - would you know to check for that, just randomly, because a process is acting weird? You learn something every day.


The experience came from the days when I wasn't able to debug yet, and when full disks were frequent for similar reasons.



