
This is why cgroups were invented. They solve this problem. Start a process in its own cgroup, and you can later confidently kill the process and all of its descendants. Container "technologies" use cgroups extensively, as does systemd service management.
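
For the curious, a minimal sketch of what that looks like on a cgroup v2 machine (the paths and the "myjob" name are made up, and cgroup.kill needs Linux 5.14 or newer):

    # assumes root and a cgroup v2 mount at /sys/fs/cgroup; "myjob" is a made-up name
    mkdir /sys/fs/cgroup/myjob
    echo $$ > /sys/fs/cgroup/myjob/cgroup.procs   # move this shell in; its children inherit the cgroup
    ./start_the_job &                             # hypothetical workload
    # later, from outside the group: one write kills every member (Linux 5.14+)
    echo 1 > /sys/fs/cgroup/myjob/cgroup.kill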



[CGroups original developer]

Yes, for tracking processes and reliable resource control. Prior to cgroups, in Google's Borg cluster management daemon the best strategy I was able to come up with for reliably and efficiently tracking all the processes in a job was:

- assign each job a supplementary group id from a range reserved for Borg, and tag any processes that were forked into that job with that group id

- use a kernel netlink connector socket to follow PROC_EVENT_FORK events to find new processes/threads, and assign them to a job based on the parent process; if the parent process wasn't found for some reason then query the process' groups in /proc to find the Borg-added group id to determine which job it's a part of.

- if the state gets out of sync (due to a netlink queue overflow, or a daemon restart) do a full scan of /proc (generally avoided since the overhead for continually scanning /proc got really high on a busy machine).

That way we always have the full list of pids for a given group. To kill a job, nuke all the known processes and mark the group id as invalid, so any racy forks will cause the new processes to show up with a stale Borg group id, which will cause them to be killed immediately.
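
(Roughly, that /proc fallback looks like the sketch below; 60321 is a made-up stand-in for a Borg-reserved gid.)

    # hedged sketch of the /proc fallback described above
    BORG_GID=60321
    for status in /proc/[0-9]*/status; do
        pid=${status#/proc/}; pid=${pid%/status}
        # the "Groups:" line in /proc/<pid>/status lists the supplementary group ids
        if grep -Eq "^Groups:.*[[:space:]]${BORG_GID}([[:space:]]|$)" "$status" 2>/dev/null; then
            echo "pid $pid carries the job's tag gid $BORG_GID"
        fi
    done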

This approach would have had trouble keeping up with a really energetic fork bomb, but fortunately Borg didn't generally have to deal with actively malicious jobs, just greedy/misconfigured ones.

Once we'd developed cgroups this got a lot simpler.


cgroups was extremely useful for a system I built that ran on Borg, Exacycle, which needed to reliably "kill all child processes, recursively, below this process". I remember seeing the old /proc scanner and the new cgroups approach, being able to just get the list of pids below a process, and realizing, belatedly, that UNIX had never really made this easy.


Was giving each job its own UID not an option? Users are the original privilege separation, after all, and kill -1 respects that.


No, because multiple jobs being run by the same end-user could share data files on the machine, in which case they needed to share the same uid. (Or alternatively we could have used the extra-gid trick to give shared group access to files, but that would have involved more on-disk state and hence been harder to change, versus the job tracking, which was more ephemeral.) It's been a while now, but I have a hazy memory that in the case where a job was the only one with that uid running on a particular machine, we could make use of that and avoid needing to check the extra groups.


That's definitely the correct way to do this today. But even then `kill -9 $(< /sys/fs/cgroup/systemd/tasks)` is not enough if your goal is to reliably kill all processes because that's not atomic. Instead you'll have to freeze all processes, send SIGKILL and then unfreeze.
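
Roughly (a sketch against the cgroup v1 freezer controller; the "myjob" freezer cgroup is made up):

    # freeze, SIGKILL every member, then thaw
    CG=/sys/fs/cgroup/freezer/myjob
    echo FROZEN > "$CG/freezer.state"   # stop every task in the group
    while read -r pid; do
        kill -9 "$pid"                  # queue SIGKILL for each member
    done < "$CG/tasks"
    echo THAWED > "$CG/freezer.state"   # on thaw, the pending SIGKILLs are delivered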


Can freezing be done atomically?


Not sure to be honest. From the documentation: "Writing "FROZEN" to the state file will freeze all tasks in the cgroup". Even if not, it should still be sufficient once all tasks are frozen: If you then send SIGKILL to all processes in the group, no fork bomb or similar process kerfuffle will be able to avoid being killed once they get unfrozen.


Unfortunately in cgroupv1, the freezer cgroup could put the processes into an unkillable state while frozen. This is fixed in cgroupv2 (which very recently got freezer support) but distros have yet to switch wholesale to cgroupv2 due to lack of adoption outside systemd.


Is that really an issue though? Unkillable is fine so long as it immediately handles the kill -9 as soon as it's unfrozen without running any additional syscalls.


There are cases where signals might be dropped (though I'm not sure if SIGKILL has this problem off the top of my head -- some signals have special treatment and SIGKILL is probably one of them). And to be fair this is a more generic "signals have fundamental problems" issue than it is specifically tied to cgroups.

It depends what you need. If you don't care that the kill operation might not complete without unfreezing the cgroup, then you're right that it's not an issue. But if the signal was lost (assuming this can happen with SIGKILL), unfreezing means that the number of processes might not decrease over time and you'll have to retry several times. Yeah, it'd be hard to hit this race more than ~5 times in a row but it still makes userspace programs more complicated than they need to be.


Yes, the freezer cgroup can be used to "atomically" put an entire cgroup tree into a frozen mode. However, unless you're using cgroupv2, the process might be stopped in an unkillable state (defeating the purpose). So this is not an ideal solution.

Really the best way to do it is to put it inside a PID namespace and then kill its pid 1. Unfortunately, most processes don't act correctly as a pid 1 (signal handling is different for pid 1: signals with no handler installed are ignored, so the default "exit on signal" behaviour breaks for most programs). You could run a separate pid 1 that just forwards signals (this is what Docker does with "docker run --init", and similar runtimes do the same thing). But now the solution has gotten significantly more complicated than "use PID namespaces".
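
(As a rough sketch of the PID namespace route, using util-linux's unshare; "./the_job" is a placeholder:)

    # run the job as pid 1 of a new PID namespace (needs root or a user namespace)
    unshare --pid --fork --mount-proc --kill-child sh -c './the_job' &
    WRAPPER=$!
    # later: when the wrapper dies, --kill-child SIGKILLs the namespace's pid 1,
    # and the kernel then SIGKILLs everything still left in that namespace
    kill -9 "$WRAPPER"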

Arguably the most trivial and workable solution is process groups and using a negative pid argument to kill(2), but that requires the processes to be compliant and not also require their own process groups. (I also haven't yet read TFA, it might say that this approach is also broken for reasons I'm not familiar with.)
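
(Something like this from a script, assuming the job doesn't call setsid/setpgid itself:)

    # give the job its own process group, then signal the whole group
    setsid ./the_job &      # in a non-interactive shell, $! becomes the leader of a new group
    PGID=$!
    # ... later ...
    kill -9 -- "-$PGID"     # a negative pid delivers the signal to every member of the group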


Wait, what does cgroupv2 do with unkillable processes?

Maybe I'm misreading - is it that cgroupv1's freezer puts processes in an unkillable state? Or does cgroupv2's freezer have a way of rescuing processes already in uninterruptible sleep?


If you freeze a cgroupv1 freezer cgroup, the processes may be frozen at a point within their in-kernel execution such that they are in an uninterruptible sleep. The reason is that the cgroupv1 freezer basically tried to freeze the process immediately, without regard to its in-kernel state.

Fixing this, and making the freezer cgroup more like SIGSTOP on steroids (where the processes were put into a killable state upon being frozen, if possible) was the main reason why cgroupv2 support for freezer was delayed for so many years.

So the answer is "both, kinda". I'm not sure how it'd deal with legit uninterruptible sleep (dead-or-live locked) processes but I'll look into it.
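
(The v2 behaviour is easy to poke at; the "demo" cgroup below is made up and assumes a unified hierarchy:)

    # with the cgroup v2 freezer, frozen tasks stay killable
    CG=/sys/fs/cgroup/demo
    echo 1 > "$CG/cgroup.freeze"
    grep frozen "$CG/cgroup.events"            # reports "frozen 1" once the freeze has settled
    kill -9 "$(head -n1 "$CG/cgroup.procs")"   # SIGKILL is honoured even while frozen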


Semantically it shouldn't be necessary I think?


I think the freezer cgroup does this, but I don't think systemd uses it.


Exactly. Solaris implemented this as "contracts" to support its service management framework (SMF, which is similar to systemd, but came out first and is superior in many ways).


systemd uses cgroups, correct? Just wondering what the options are for learning more about this. Would it be enough, assuming you'd only be working with systemd operating systems, to learn the systemd concepts of slices etc.?


Slices generally map 1:1 with cgroups. Try running systemd-cgtop and you can see the resource usage of each of the cgroups.
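
A quick way to see it in action (run as root; the unit name is made up and the path assumes a cgroup v2 unified hierarchy):

    # put a command in its own transient scope (= its own cgroup) and inspect it
    systemd-run --unit=demo-job --scope sleep 300 &
    systemd-cgls      # browse the slice/scope/service tree systemd manages
    systemd-cgtop     # live per-cgroup resource usage
    cat /sys/fs/cgroup/system.slice/demo-job.scope/cgroup.procs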


systemd uses cgroups, yes.



