Killing processes that don't want to die (lwn.net)
354 points by ingve on Sept 25, 2018 | 100 comments



  Step 1: Install a kernel driver.  
  Step 2: Do blocking IO in the kernel that will never finish.  
  Step 3: Your process is now literally impossible to kill in any way. 
This works in Windows too (for similar drivers).

This happens fairly regularly with devices with crappy drivers. If something gets wedged in kernel IO, there is literally nothing you can do to kill it, bar rebooting. IIRC, the kernel won't do anything until the IO finishes.

When shitty <hardware driver tool> craps out this way, it's extremely annoying. I have semi-fond memories of trying all sorts of interesting tools that attach to a process and inject faults to kill a wedged TV-Tuner application when I was in high school. I never found an option that worked.

None of the examples here for preventing unkillable processes would work, either. If you can load kernel drivers, you're hoseable.


I was just about to suggest, get your process into `TASK_UNINTERRUPTIBLE` state and watch people squirm [1] [2]. Only reboot to recover.

You may think this is an edge case, but I mounted an FTP drive, ran a normal `cp` into it and the internet connection broke temporarily. You literally cannot kill the process without rebooting. Remove the drive, `kill -9`, `sudo` everything - nothing. Caused by a normal user no less!

[1] https://stackoverflow.com/questions/20423521/process-permane...

[2] https://web.archive.org/web/20131229000010/http://linuxgazet...
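A quick way to spot a process stuck like this from a shell (a sketch; the `ps` format flags assume a procps-style `ps`):

```shell
# Sketch: list processes in uninterruptible ("D") sleep, plus the kernel
# symbol they're blocked in (wchan). A cp wedged on a dead network mount
# shows up here. The header line is printed even when nothing is stuck.
ps -eo pid,stat,wchan:30,comm | awk 'NR == 1 || $2 ~ /^D/'
```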


I agree with a comment from the SO thread: "And thus a simple, solvable hardware or locking problem will escalate to a major problem, needing reboot. And the kernel people doesn't even understand, that it is a… a… suboptimal handling of the things"

This is an example of the kind of thing that really irks me about developers, where they insist that they know better than the user and artificially constrain them "for their own good". It's one thing to make something difficult to do accidentally, it's another to prevent it entirely. I've run into unkillable task problems many times, and every time it happens on Linux I think back to the 90s when annoying linux evangelists would joke about how Windows required a reboot to fix things.


I disagree. The defect is not that you can't kill these tasks, the defect is that the I/O wasn't terminated by a timeout somewhere in the kernel/driver.

> where they insist that they know better than the user and artificially constrain them "for their own good".

I'm not sure where this comes from but if it's the intended design of linux (to not be able to kill tasks in the uninterruptible state), then that's their design. You're free to design your own operating system without this limitation -- though it likely means just taking the other side of the tradeoffs involved here.

It's not as if linux designers think "tasks holding resources that can't be killed by a sysadmin are a good thing"; rather, they think "letting tasks currently executing kernel code be interrupted is a bad thing." And you could imagine other solutions that might liberate you from the dilemma entirely, but then that wouldn't be linux anymore.


> I disagree. The defect is not that you can't kill these tasks, the defect is that the I/O wasn't terminated by a timeout somewhere in the kernel/driver.

Defects happen. This is sorta like saying you don't need ASLR because the real defect is that some code allows buffer overflows. Things don't always work perfectly and when they don't you still need tools to deal with it.

> You're free to design your own operating system without this limitation

Such an OSS response to criticism. This attitude is awful and a big part of the reason software sucks.

> It's not as if linux designers think "tasks holding resources that can't be killed by a sysadmin are a good thing"; rather, they think "letting tasks currently executing kernel code be interrupted is a bad thing."

The problem is that they fail to recognize that there are potentially cases where even though it is a bad thing, it is still the least-bad option. It comes down to a fundamental philosophical disagreement, I guess. I think the computer belongs to the user and they should be able to do things even if they are dangerous as long as they acknowledge and accept that danger. But there is this -- unfortunately very prolific -- idea among developers out there that the computer belongs to them and users are beneath them, a lower life form incapable of making their own decisions that needs to have their hands tied.


> The problem is that they fail to recognize that there are potentially cases where even though it is a bad thing, it is still the least-bad option. It comes down to a fundamental philosophical disagreement, I guess. I think the computer belongs to the user and they should be able to do things even if they are dangerous as long as they acknowledge and accept that danger. But there is this -- unfortunately very prolific -- idea among developers out there that the computer belongs to them and users are beneath them, a lower life form incapable of making their own decisions that needs to have their hands tied.

I don't think you really recognize the complexity here. Even if you could conceivably 'kill' the task, the kernel would still be in a broken state. It wouldn't fix the problem, it would just create a bigger one - now you have absolutely no hope of recovering as the task that could do so is gone. Keep in mind a task in `TASK_UNINTERRUPTIBLE` is already sleeping, it's not doing anything. Killing it doesn't change anything or unblock the kernel, it just removes the `task` from the scheduler.

On that note though, it's worth noting the kernel people do (or rather, did) recognize this is a problem, hence why it has already been addressed in several different ways. As mentioned above, timeouts solve the problem in most cases (and if there isn't a timeout, that's probably a bug), but the `TASK_KILLABLE` state was added around 10 years ago to solve this problem - a properly written kernel task that sleeps in that state can be killed like normal and will leave the kernel in a safe state when killed.


> Keep in mind a task in `TASK_UNINTERRUPTIBLE` is already sleeping, it's not doing anything. Killing it doesn't change anything or unblock the kernel, it just removes the `task` from the scheduler.

...so why can't I force that? When you encounter this problem, recovery isn't ever going to happen anyway.


It's already in a state where it can't get back on the run list. The problem isn't that it's going to eat your CPU cycles, the problem is that it's still holding resources that you want it to give back. Taking it out of the process list won't give back those resources, it would just make it so it doesn't show up under `ps`.

> When you encounter this problem recovery isn't ever going to happen anyway.

The kernel's design can't distinguish between this unrecoverable situation and the normal functionality. Or rather it could, if it had a sane timeout. It doesn't, that's the defect.


And there's really no way around that? I find this difficult to believe, but whatever. Guess I just have to live with stuck mounts and other crap.


> The problem is that they fail to recognize that there are potentially cases where even though it is a bad thing, it is still the least-bad option

You are way oversimplifying the problem. Linux designers want a nice, stable OS with good performance. Introducing a new mode where signals can asynchronously interrupt a task means that you have to account for your kernel code to not only be safe when CPU interrupts occur, but now when user space signal handlers execute too. To be safe means that you'd need to sprinkle fences/barriers over lots of code. This flushes pipelines and unnecessarily forfeits performance. Performance aside, it's really hard to imagine the problems caused by asynchronous interrupts until they actually occur. As a designer, it's much simpler to just bar them for the critical region. Now, it's totally reasonable for us to discuss 'hey can we minimize these critical regions somehow, to make these problems less likely?'

If you did a `kill -9 1234` and then your computer suddenly hit an NMI, whose bug would that be? The kernel's! You'd say "hey, kernel, wtf, I was just trying to reap this process and then my computer rebooted." And you'd be right.

There's lots of design decisions made early on in linux that are likely baked in and hard to change without impacting user space.

For you to declare it "the least-bad option" you have to be able to evaluate it from the designer's perspective. From the user's perspective, I agree that it's clearly not ideal. But there's much more at stake here.

> idea among developers out there that the computer belongs to them and users are beneath them

You're introducing some motives that just aren't there. Linux is all about enabling people. Perhaps we can agree that the death of general purpose computing is nigh [1], but it's not linux's fault.

>> You're free to design your own operating system without this limitation

> Such an OSS response to criticism. This attitude is awful

I don't agree. But I will express regret that I didn't say instead "you're free to use another OS without this limitation" -- that's a more realistic option for a user.

For a user to casually say "you made the wrong design decision here" without acknowledging the existence of the tradeoff is unfair.

[1] https://boingboing.net/2012/08/23/civilwar.html


Why does such a feature have to be safe? It's for an edge case where something is misbehaving badly, and your only alternative is a reboot. Why shouldn't the user be able to do this as long as they accept the risk?

> Linux is all about enabling people

Since we're talking about the kernel, I agree for the most part. Which makes it all the more annoying that "reboot" (and sometimes even that doesn't work and you have to smother the system with a pillow) is the only option in this circumstance. I can overwrite disks and memory at will, destroying who knows what data, but I can't kill a useless hung process because it might be unsafe?

Admittedly I am ignorant of the details of the kernel implementation, but I have difficulty believing that there is nothing that can be done.


Say your kernel panics 10 minutes later, due to an unplanned state caused by ripping the process out of the table.

What then?

Stuck mount, or unplanned, immediate restart?

See the dilemma?

This way, you can keep working, save what may be super important, and then restart.


In reply to "AnIdiotOnTheNet"'s: >> idea among developers out there that the computer belongs to them and users are beneath them

"wyldfire" wrote: > You're introducing some motives that just aren't there. Linux is all about enabling people.

Yeah, that would be nice. But perhaps one would be more inclined to believe it were true, if you hadn't in your paragraph above that written:

> For you to declare it "the least-bad option" you have to be able to evaluate it from the designer's perspective.

Huh? No, isn't it the designer(s) of the operating system who should evaluate things from the perspective of the actual owner of the computer???

Especially if what they're doing is "all about enabling people"?


>users are beneath them

More like they do not want to support all cases. So they don't.

Allowing a least bad type option that gets abused, or not well understood, may well result in a different, negative, "why allow this?" dialog.

Liabilities, obligations all can expand on something like this. Not always an easy call.


Well, in the 90s Windows usually required a reboot after modifying a config setting, like the host IP address...


i’m told it still does if it’s a domain controller


Couldn't it be argued that a design without these uninterruptible states artificially constrains the user (of the driver) by preventing them from entering a state where all signals are blocked?

Not saying these should be the primary considerations, but I don't really understand how "don't artificially constrain the user" can be a consistent design principle...


[deleted]


You need to build your own hardware.


Yes, this is the most common way of accidentally getting into this state. Certain combinations of NFS options can also cause this, or even worse produce a filesystem where any process that looks at it gets permanently stuck in "D" state.


Haha, had exactly this happen to me, which is why I switched to CIFS / Samba. I don't know what I was doing wrong, but each time the LAN connection to my file server broke, NFS would enter a state where any attempt to touch the file system - including umount - would just freeze the process and make it uninterruptable. This made even turning the computer off safely impossible, since init would try to unmount the NFS file system and hang.

Samba seems to have saner defaults, and does not hang if the connection is interrupted. I guess NFS was written with C and A but no P of the CAP theorem in mind - or had I just misconfigured it?


Nah, NFS cares neither for C nor A nor P; it's almost the simplest possible protocol, with primitives of "read/write the block at offset X from inode N". There are all sorts of hacks due to the only reliable atomic primitive on NFS being "rename".

Technically it makes no effort to solve either A or P - all the data is stored in one place, with maybe a small local dirent and block cache. There's only one server (so no "P") and no failover (no "A").


To learn the basics about NFS, read chapter 14 of “The Unix-Haters Handbook” (http://simson.net/ref/ugh.pdf):

“By design, NFS is connectionless and stateless. In practice, it is neither. This conflict between design and implementation is at the root of most NFS problems.

File systems, by their very nature, have state. You can only delete a file once, and then it’s gone. That’s why, if you look inside the NFS code, you’ll see lots of hacks and kludges—all designed to impose state on a stateless protocol.”

It goes on to mention some issues that NFS wouldn’t handle if it were truly stateless and connectionless:

“NFS is stateless, but many programs designed for Unix systems require record locking in order to guarantee database consistency.

[…]

NFS is based on UDP; if a client request isn’t answered, the client resends the request until it gets an answer. If the server is doing something time-consuming for one client, all of the other clients who want file service will continue to hammer away at the server with duplicate and triplicate NFS requests, rather than patiently putting them into a queue and waiting for the reply.

[…]

If you delete a file in Unix that is still open, the file’s name is removed from its directory, but the disk blocks associated with the file are not deleted until the file is closed. This gross hack allows programs to create temporary files that can’t be accessed by other programs. (This is the second way that Unix uses to create temporary files; the other technique is to use the mktmp() function and create a temporary file in the /tmp directory that has the process ID in the filename. Deciding which method is the grosser of the two is an exercise left to the reader.) But this hack doesn’t work over NFS. The stateless protocol doesn't know that the file is “opened” — as soon as the file is deleted, it's gone.”


Nope, you didn't misconfigure it, NFS behaves like that by default. In fact it's really hard to configure it not to do that.


It can be annoyingly difficult to even configure NFS to enable soft-interrupts that actually work.

On one hand it is nice because if your fileserver reboots for some reason you can be reasonably sure that you won't lose data or have truncated/corrupted files, but it's annoying that if your fileserver goes down hard it brings down everything else on the network with it.


If you are able to load kernel drivers you could just nop out the kill syscall though.


For the mere mortals amongst us, how would you actually do that, practically? I.e. what commands?


There are many options.

One new way would be to install a kprobe. You can read more about kprobes here: https://www.kernel.org/doc/Documentation/kprobes.txt (note, they're not meant for modifying syscall arguments in general, but in the case of kill it will be possible :)

Effectively, they let you write code that will run before syscalls and can mutate the state of the kernel and registers and such. You could write a kprobe which changed "kill"'s pid argument to some random pid that didn't exist each time the given pid was being killed.

You could also just patch the kernel in memory yourself easily enough (which is even easier for a kernel module; look up modifying the sys_call_table; that table exists for a reason). Modifying the sys_call_table is probably what the parent poster was thinking of, and is definitely the "right" way to do this.

You could compile a new kernel with the "kill" syscall modified by editing the C code, and then do something akin to what ksplice did to patch the running kernel with that. kpatch/livepatch are the modern tools to do that, though neither of them is as good as ksplice was... not that ksplice is actually something you can use without working at oracle on that team.

You could also modify the kill binary under the assumption the user is running "/usr/bin/kill", not making the syscall in some other way, and modifying that binary is pretty trivial.

These last few are more like what rootkits do and are harder to detect. kprobes and kernel modules are both meant to be visible to users in general (though they can ofc hide themselves with work), but patching binaries on disk and the kernel memory live is a bit less transparent.

You could do various other strategies rootkits do, though I'd rather not get too deep into those since those can get relatively complicated (with the ideas presented in the Blue Pill rootkit paper still being some of my favorites).

> what commands

Asking it like that is less productive. The ideas I mention above should be enough for you to research and do it yourself. The exact details are complicated because this isn't a simple "out of the box" sort of thing to do. Modifying one call in one or two processes with gdb / LD_PRELOAD hacks is easy, but still isn't just "out of the box" since you'll have to write some gdb scripts / a preload shim. Doing it system wide in the kernel is even more complicated and even less amenable to a list of commands instead of concepts, and asking "what commands" doesn't really make the answer more meaningful. Me saying "sys_call_table + gcc + insmod" isn't really any more helpful than just saying "sys_call_table".

Hopefully the above helped! I recommend doing more research, especially on the sys_call_table, kprobes, and ksplice, as those are the most relevant techniques, though realistically only the first two are things you could reasonably do without becoming an expert in patching kernels in memory.


> Asking it like that is less productive.

Less productive only if there's no command :)

Thanks a lot for the info, it should provide quite a few hours of hacking around!


A syscall zombie is pretty easy to find, I often run into them when making NFS mistakes (like leaving the local net without properly ending some dep)... is there a general way to see what syscall the undead process is waiting for? Is there a simple way to make it return ERRx?

Can I get a list of pending syscalls by age?


It's very easy to find the current stack of running syscalls and how long they take using perf [0]. In fact, a flame graph will give you a wonderful way to visualize this [1].

I could go on and on about perf, but there's no reason since Brendan Gregg's page is so darned good. Just read that, play around with the perf cli, find the answers to all your questions... Well, all of them except how to cause that syscall to error out.

There's no simple way to force an in-progress syscall to error out as far as I know. If any one else knows of a simple way, I'd be curious to hear of it.

[0]: http://www.brendangregg.com/perf.html [1]: http://www.brendangregg.com/perf.html#FlameGraphs


Very cool. I'm trying to start with:

perf stat -e 'syscalls'

but haven't figured out how to specify all events... if that's even the right route. The examples that are close don't work for me: https://bpaste.net/raw/455ebe2dd400


If you have the process that is currently blocked (note that "zombie" doesn't apply here -- zombies are processes that are dead but are waiting for someone to collect their exit code) you can look in /proc/$pid/stack to see what the current kernel stack is (the bottom-most entries tell you what syscall the process started by doing). This is the easiest way to start debugging a kernel problem (what is the function it's blocked in -- then go read that function and see whether the pathology is obvious).
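A sketch of that inspection routine (the sleep stands in for a genuinely stuck task; /proc/$pid/stack usually needs root, while wchan and syscall are readable by the process owner on most setups):

```shell
# Where is a blocked process sleeping in the kernel?
sleep 30 &
pid=$!
sleep 1      # give it time to enter its nanosleep
echo "wchan:   $(cat /proc/$pid/wchan)"
echo "syscall: $(cat /proc/$pid/syscall 2>/dev/null || echo '(not readable)')"
cat /proc/$pid/stack 2>/dev/null || echo "stack:   (needs root)"
kill "$pid"
```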


> You could also modify the kill binary under the assumption the user is running "/usr/bin/kill"

POSIX requires kill to be a shell builtin, so the (/usr)/bin/kill binary is rarely used in practice.


Step 2 needs some clarification. It is possible to implement "blocking I/O" in the kernel with the wait_event_interruptible() family of functions, and many drivers and subsystems do. These have the fairly straightforward behavior that they can return early with an error code instead of only when the given wait queue has been tickled. In the kernel, in C, this is pretty obvious stuff.

But the side effect of not using the _interruptible variants is that there's no way for the kernel to return except from the intended event, which means that other things that you'd want to produce a return from the kernel context can't happen either, including signal delivery. So to userspace the process becomes unkillable and it looks like voodoo.

But at the bottom level, all it is is a C API that made promises that weren't kept at runtime.


Interestingly enough, Windows has gotten a little smart about kernel drivers misbehaving. I found out on my Windows 7 machine that if the AMD video driver becomes unhappy in certain ways, Windows will be able to recover it with no blue screen or loss of functionality, other than the program whose access to the Direct3D device caused the problem no longer having access until it attempts to re-establish a connection or is restarted.

An incredibly limited set of circumstances, but I was awestruck to see Windows catch itself hiccuping in kernel space and not explode.


and yet, user-space applications can still bring down the whole thing; and just try bringing up Task Manager to kill a task when you need to.


Killing the "task" hasn't worked since Windows 95, but killing the process itself still unwedges nearly anything userspace in my experience


Still need task manager though


If you have a kernel driver that is giving you this problem, it should be fairly easy in most cases to replace its use of TASK_UNINTERRUPTIBLE with TASK_KILLABLE.


Why does TASK_UNINTERRUPTIBLE exist? 'I'm sorry Dave'?


It's because for a long time, there was only "interruptible" and "uninterruptible", where "interruptible" meant that any signal could interrupt it.

Historically, many operations were specified not to be interrupted by signals - reads and writes on disk files (as opposed to "slow devices" like ttys and sockets), mkdir(2), fsync(2), etc. They're not allowed to return EINTR. This meant using an uninterruptible sleep, which has the consequence that SIGKILL can't interrupt it either. That's why the code in top or ps for uninterruptible sleep is 'D' - historically it meant "Disk Wait".

TASK_KILLABLE is a more recent invention, which can be interrupted only by fatal signals. The code doing the wait still has to unwind as if it were going to exit from the syscall with an error, but userspace won't see that error because it will die from the signal instead.


Because the call doesn't support rolling back partial progress. Imagine if you interrupted `mount`, and it had written to the directory table but hadn't yet initialized the underlying filesystem and now the kernel reads uninitialized memory whenever you look at that directory (or, less disastrously, the mutex that protects that table is left locked and you can never mount or unmount anything ever again).


Obviously for the app developers that write rocket flight computers. They also don't use free(); their garbage collection is implemented by a physical mechanism.


This happened to me once in ~2006 on Linux with a scratched CD. I hope it's been fixed.


I think you're out of luck. I once accidentally scratched a CD and ten years later it only got worse.


You made me laugh a lot. Thank you


It can also happen if you run Linux on a crappy flash drive.


Fun article!

A modern way to do this on Linux (specifically) is to use PID-namespaces for isolation.

You can pass CLONE_NEWPID to the flags argument of clone(2) so that the new process is PID 1 in a new PID-namespace. One of the special properties of PID 1 is that when it dies, all of the other processes in the PID-namespace will receive SIGKILL.

So, create a new "shim" process in a new PID-namespace and have that process fork&exec the rogue program (it's probably better to fork&exec than just exec because of the special properties of PID 1, which the rogue program you're running may not be built to handle.)

If you want to kill the rogue program, just kill your shim.

You need to be root (have CAP_SYS_ADMIN) to create a new PID-namespace. If you're not (or more likely don't want to be) you can first create a new user-namespace with the CLONE_NEWUSER flag. You'll be root in that user namespace and can create a new PID-namespace. You'll probably also want to launch the rogue program as a non-root user though, so make sure to figure that out.

Docker does this sort of stuff.

man pages:

* http://man7.org/linux/man-pages/man7/pid_namespaces.7.html

* http://man7.org/linux/man-pages/man7/user_namespaces.7.html

* http://man7.org/linux/man-pages/man7/namespaces.7.html

* http://man7.org/linux/man-pages/man2/clone.2.html
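A shell sketch of the same recipe using util-linux unshare(1), which wraps those clone(2) flags (note: unprivileged user namespaces are disabled on some kernels/distros, in which case this fails):

```shell
# -U new user namespace (so no root needed), -r map ourselves to root in
# it, -p new PID namespace, -f fork so the child (not unshare itself)
# becomes PID 1 in there. Kill that shim and everything inside gets
# SIGKILL.
unshare -Upfr sh -c 'echo "inside the namespace, my pid is $$"' \
    || echo "unprivileged user namespaces are disabled here"
```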


In a shell you can do this with `unshare -prf` which will create an unprivileged user namespace (mapping only yourself -- if you want more complicated stuff you'll need to use rootlesskit[1] or a shell script). Note that most programs really don't like being in an unprivileged user namespace (something the rootless-containers project has been working on fixing so you can have completely unprivileged containers in every sense of the word "just work").

[1]: https://github.com/rootless-containers/rootlesskit


You might also want to have a look at bwrap [1], a handy little sandboxing tool for unprivileged users.

[1]: https://github.com/projectatomic/bubblewrap


Windows has job objects that work nicely for this purpose. :)


Isn't their purpose basically the same as process groups?


I think they're similar, yeah, though they seem more general/orthogonal... e.g. they allow a process to be associated with multiple jobs (or so I think -- I've never needed to associate 1 process with multiple jobs), and this is independent of whether the process is detached and starts a new hierarchy or not.


I increasingly find myself looking at pledge...

https://man.openbsd.org/pledge.2

It's objectively the right model, if my assumptions are wrong, the program fails. I like that.

Maybe it could be taken further... to lie to the app and report back about its sandbox... or halt and log its image.


pledge is objectively the wrong model, because it's an ad-hoc set of fixed-function rules. By contrast, Linux seccomp-bpf allows much more flexibility, because you can actually write an arbitrary program to implement whatever sandboxing policy you want.

In fact, your wish list highlights the problems of pledge precisely. You can implement "pretend to succeed" and "dump core for inspection by a trusted supervisor", but only on Linux with seccomp-bpf, not with pledge.
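The "pretend it failed" case can be sketched in a few lines of seccomp-bpf (the file name nokill.c is mine; a real filter would also validate seccomp_data.arch, omitted here for brevity):

```shell
# A seccomp-bpf filter that turns kill(2) into an EPERM error instead of
# delivering anything -- the kind of policy pledge's fixed rules can't
# express.
cat > nokill.c <<'EOF'
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void) {
    struct sock_filter filter[] = {
        /* load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if it's kill(2), fall through to the ERRNO return */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_kill, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);  /* required for unprivileged use */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("seccomp");
        return 1;
    }
    if (kill(getpid(), SIGTERM) == -1 && errno == EPERM)
        puts("kill(2) was denied with EPERM");  /* instead of dying */
    return 0;
}
EOF
cc -o nokill nokill.c && ./nokill
```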


It's objectively the right model, because it's simple enough to implement that the vast majority of OpenBSD is pledged, including many ports like Firefox. The vast majority of other systems do not have the equivalent, in spite of having mechanisms that are in theory more powerful than pledge.

Flexible sandboxes don't help if your programs don't run in them.


Maybe it's objectively the right model for right now, and the wrong model for tomorrow, which would make you both right. It's hard to get programmers and packagers to add new OS-specific features, at least until most OSes support some variant of that feature, and then they either bite the bullet or they use some library to abstract it.

Pledge is simple, and that helps people adopt it and start using it. When most operating systems are providing something equivalent and most programs are making some use of it, it then becomes a selling point to do it better, because the status quo has moved. At that point, pledge might not compare well, unless it has evolved.


> Maybe it's objectively the right model for right now, and the wrong model for tomorrow.

I'm using my computer right now. And pledge is evolving over time, as we gain experience in what real programs need by implementing protection for them.


Sure, but some things are hard to evolve after a while, and it may be that a clean (or older but more complex) implementation has advantages at some point. I have no idea if pledge will fall under this or not, but it's a common theme in computing.


Firefox is sandboxed via seccomp-bpf on Linux.


Awesome. I just checked, and it appears one FF process is, and one is not: https://bpaste.net/show/d8b3d6ada513

https://wiki.mozilla.org/Security/Sandbox/Seccomp

I would like much more general control, like disable write, or prevent forks, or net access, right from the executing shell. I have selinux experience, and it's too low level.


I think the idea behind pledge was that it's simple enough to be actually used whereas the Linux solution is technically superior but too complicated to be widely used.


I was not referring to the implementation. The first , should be a ;. I agree, bpf looks sweet. It might be so useful that it becomes optional only in spirit.


How many programs are running on your computer right now with seccomp-bpf policies?


As a complete aside, I'm happy that they added "execpromises" to pledge. It was one of the main things that really concerned me about its usage as a sandboxing system (seccomp I think is still more powerful but I will be the first to admit it's ridiculously complicated to get right).


Wonder how hard it would be to provide a Pledge API which simply passes through to seccomp.


It's possible, but it's fairly hard. You'd need a second process to supervise the seccomp'ed process, watching the paths that it accesses with ptrace, and allowing/disallowing the actions based on inspection of the process state.

The problem is that seccomp, as far as I'm aware, does not provide you with easy access to the actual values behind system call arguments when they are pointers. Those pointed-at values (e.g. path strings) are required for emulating pledge.


The need for ptrace is soon going to change with the new seccomp userspace notification API (still undergoing review) which allows a userspace process to get notifications when seccomp is tripped and decide what the return code of the syscall is (you can also give it an fd which is copied to the process unix socket style). This would allow you to use process_vm_{read,write}v without the need to mess around with PTRACE_SEIZE.

The main usecase for this is container runtimes (emulating privileged operations like auto-loading of kernel modules), but it could also be used for this and a whole host of other really neat features.

(Huge shout-outs to Tycho Anderson for working on this for quite a while!)


Yeah, the seccomp limitation that it can't deref pointer arguments makes things a lot less elegant than pledge.

But at least unveil can be implemented via mount namespaces.


Most old-school shutdown init scripts called the following at some point before shutting down or rebooting:

   kill -TERM -1 # sending all processes the TERM signal...
   # ...
   sleep $SOME_SECS
   # ...
   kill -9 -1 # sending all processes the KILL signal...
   # ...
   sleep $SOME_MORE_SECS
Also old-school local DoS from root:

   # used to be very bad
   rm -rf / ; kill -9 -1
Another classic:

   # bash fork bomb
   :(){ :|:& };:
   # defines a function named ":" and then calls it;
   # each call pipes one copy of itself into another, both in the
   # background, so every invocation spawns two more: exponential growth


Why are the pipes useful for the fork bomb? Or is it just a visual thing?


The pipe is what makes it explode exponentially. Without it, each call would background only one new process, so the bomb would grow linearly.


FWIW, pipes aren't necessary to achieve explosion. Here's an example fork bomb without pipes:

    :(){ :&:& }; :


During "it's bad, but not that bad" situations over the last couple of decades, I've had positive results with dropping to single-user mode (init level "1") and then going back to normal multi-user mode, which historically could be any of "2", "3", or "5".

Obviously this doesn't help with processes stuck in a kernel call. But for other sorts of malfunctions, or simply for testing init scripts and their dependency tree, there's really no reason to reinitialize the kernel and check the filesystem and all that, and it can be very fast compared to a complete boot.

This is not exactly a "gui-generation" solution for problems, but it has occasionally helped very quickly clean things up.


It's absolutely infuriating that you can, with a simple NFS mount, cause an unkillable, unremovable, always-there process to exist. There is simply no reason to require this.


For non-vital filesystems, I use "-o intr,soft,timeo=5" at NFS mount time. This causes syscalls to return EINTR or time out instead of hard-locking. It can sometimes confuse userspace, but at least you don't get a stuck process.
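For persistent mounts, the same idea goes in /etc/fstab (the server name and paths below are made-up examples; note that on Linux the `intr` option has been ignored since kernel 2.6.25, when fatal signals became able to interrupt NFS waits anyway):

```
# Soft-mount a non-vital NFS share so stuck I/O eventually errors out
# instead of leaving processes in uninterruptible sleep forever.
fileserver:/export/scratch  /mnt/scratch  nfs  soft,timeo=5,retrans=2  0  0
```

timeo is in tenths of a second, so timeo=5 retries after half a second; retrans caps how many retries happen before the operation fails with an error.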


I always preferred "intr,hard", because having filesystem operations fail really stresses out a lot of userspace applications. Most developers haven't thoroughly tested their handling of "what if this disk operation fails?", because it's a pain to do and a rare case.


Articles like this give me a better understanding of computers and how they fundamentally work. Thank you.


I cannot get the first simple example working.

  int main()
  {
      while (1) {
          if (fork())
              _exit(0);
      }   
      return 0;
  }
If I run `htop` immediately after typing `./a.out`, it shows an error,

  Could not create child process - exiting
  fork: Resource temporarily unavailable
However, if I wait a second or more, I cannot find `a.out` in the process list.


You can apply a filter in htop before running your binary, it'll only show matching processes.

I think the hotkey is F4.


Our product occasionally does this at the moment. We've set up asserts in our application to launch a "phone-home" process (on the theory that launching a new process is more likely to succeed than trying to phone home from the borked process). However, the phone-home program uses the same assert library, so the rare assertion failure in the phone-home will launch a new phone-home.


How is that not a fork bomb? ("The author disagrees with this characterization, which was added by an editor late in the publication process.")


Because the number of live processes never exceeds 2, and fork bombs typically refer to DoS attacks by exhausting the PID space.


The parent always exits (fork() returns the nonzero child pid in the parent, and 0 in the child), meaning that at any point there’s only one process that is actively forking. A typical fork bomb, on the other hand, floods the system with processes that are all attempting to call fork, resulting in an exponentially growing set of processes until the system runs out of resources.


I propose the term "forkworm", or perhaps "forxolotl"[1] for these processes that continuously reproduce but don't grow exponentially.

[1] https://en.wikipedia.org/wiki/Axolotl#Regeneration


It already has a name - processes like this were historically called "comets".


This is a fork bomb:

   while (1) { 
       fork(); 
   }
This will fork forever, creating children that fork forever, exhausting the system's resources. The example in the article only ever has at most 2 processes, and they don't exhaust anything.


Note yours is an exponential fork bomb: both parent and child processes keep forking.


Hacker News paydirt!

I love an interesting programming article, this is a nice read. Kudos to the author.


A namespace freeze option for halting an app's processes might be a helpful feature in these cases.

However, this still doesn't solve the uninterruptible-sleep problem, where something has blocked in the kernel on I/O or another resource. Failure should absolutely always be an option.



A comment in the article has it right -- use SIGSTOP not SIGKILL.

In a tight loop.


Can't say I took much away from that article.

Next time my RabbitMQ broker goes nuts and refuses to die (thank you Erlang), I'm still going to go ahead and reboot the whole machine.


   sudo dmesg -w -H --level=err 
displays the exact problem


* 'Kill Dash Nine' by Monzy starts playing


Can it kill a stuck `sync` process?


Is there really no cross-platform way to do this that works on both *nix and Windows?


There isn't even a cross-platform way to start a process, never mind kill one.

On Windows you can launch the process in a suspended state, then assign it to a job object, which all subprocesses will inherit. The job object then has a function to kill every process in it.


I'm not sure you fully appreciate that the context of the article is the Linux kernel, on LWN.net (a Linux news site). The article has no reason to talk about Windows.

If you're really that keen on Windows and need to implement process-management functionality in a cross-platform tool, take a look at the Python psutil library, or the all-but-obsoleted SIGAR. That's not really what this article is about, though.
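For what it's worth, a kill-the-whole-tree helper on top of psutil can be sketched like this (the escalation policy and the `timeout` value are my choices, not anything psutil mandates):

```python
import psutil


def kill_tree(pid, timeout=3):
    """Terminate a process and all of its descendants, escalating to
    kill() (SIGKILL on POSIX, TerminateProcess on Windows) for any
    process still alive after `timeout` seconds."""
    parent = psutil.Process(pid)
    procs = parent.children(recursive=True) + [parent]
    for p in procs:
        try:
            p.terminate()          # polite request first
        except psutil.NoSuchProcess:
            pass                   # already gone; nothing to do
    gone, alive = psutil.wait_procs(procs, timeout=timeout)
    for p in alive:
        p.kill()                   # the hammer, for stubborn survivors
```

Of course, none of this helps against a process stuck in uninterruptible sleep; psutil can only deliver the same signals the article discusses.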


I dunno what you are going for - processes function pretty differently on those systems, so, no, there is no cross-platform way to handle that case, since process handling is inherently system-specific.

If you are asking whether there is some library to do this type of thing - maybe? I dunno. But even if there is, the article is talking about how such functionality can be implemented on Linux, which is exactly what such a library would have to implement under the hood.



