[CGroups original developer]

Yes, for tracking processes and reliable resource control. Prior to cgroups, the best strategy I was able to come up with in Google's Borg cluster management daemon for reliably and efficiently tracking all the processes in a job was:

- assign each job a supplementary group id from a range reserved for Borg, and tag any processes that were forked into that job with that group id

- use a kernel netlink connector socket to follow PROC_EVENT_FORK events to find new processes/threads, and assign them to a job based on the parent process; if the parent process wasn't found for some reason, then query the process's groups in /proc to find the Borg-added group id and determine which job it's part of (a sketch of such a listener follows this list)

- if the state gets out of sync (due to a netlink queue overflow, or a daemon restart), do a full scan of /proc; this was generally avoided, since the overhead of continually scanning /proc got really high on a busy machine (the scan is sketched after the next paragraph).
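
For the curious, here's a minimal sketch of what subscribing to fork events via the netlink proc connector looks like. This is not Borg's actual code, just the standard cn_proc pattern; it needs CAP_NET_ADMIN and a kernel built with CONFIG_PROC_EVENTS:

    /* Minimal proc-connector listener: print every fork event. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/connector.h>
    #include <linux/cn_proc.h>

    int main(void) {
        int sk = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
        if (sk < 0) return 1;

        struct sockaddr_nl sa = {
            .nl_family = AF_NETLINK,
            .nl_groups = CN_IDX_PROC,
            .nl_pid    = getpid(),
        };
        bind(sk, (struct sockaddr *)&sa, sizeof(sa));

        /* Ask the kernel to start multicasting process events to us. */
        struct {
            struct nlmsghdr nlh;
            struct cn_msg cn;
            enum proc_cn_mcast_op op;
        } __attribute__((packed)) req = {
            .nlh = { .nlmsg_len = sizeof(req), .nlmsg_type = NLMSG_DONE,
                     .nlmsg_pid = getpid() },
            .cn  = { .id = { .idx = CN_IDX_PROC, .val = CN_VAL_PROC },
                     .len = sizeof(enum proc_cn_mcast_op) },
            .op  = PROC_CN_MCAST_LISTEN,
        };
        send(sk, &req, sizeof(req), 0);

        for (;;) {
            char buf[4096];
            ssize_t n = recv(sk, buf, sizeof(buf), 0);
            if (n <= 0) break;   /* ENOBUFS here = queue overflow: rescan /proc */
            struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
            struct cn_msg *cn = NLMSG_DATA(nlh);
            struct proc_event *ev = (struct proc_event *)cn->data;
            if (ev->what == PROC_EVENT_FORK)
                printf("fork: parent %d -> child %d\n",
                       ev->event_data.fork.parent_pid,
                       ev->event_data.fork.child_pid);
        }
        return 0;
    }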

That way we always had the full list of pids for a given job. To kill a job, nuke all the known processes and mark the group id as invalid, so any racy forks would cause the new processes to show up with a stale Borg group id, which would get them killed immediately.
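
Both the recovery scan and the stale-group-id check boil down to reading the Groups: line of /proc/<pid>/status. A rough sketch; the reserved gid range here is made up, since Borg's real range isn't public:

    /* Fallback scan: walk /proc and recover each process's job gid. */
    #include <ctype.h>
    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define JOB_GID_MIN 60000   /* hypothetical reserved range */
    #define JOB_GID_MAX 61000

    int main(void) {
        DIR *proc = opendir("/proc");
        struct dirent *de;
        while (proc && (de = readdir(proc)) != NULL) {
            if (!isdigit((unsigned char)de->d_name[0]))
                continue;                       /* not a pid directory */
            char path[64], line[512];
            snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);
            FILE *f = fopen(path, "r");
            if (!f) continue;                   /* process already exited */
            while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "Groups:", 7) != 0)
                    continue;
                /* Look for a supplementary gid inside the job range. */
                char *p = line + 7, *end;
                for (;; p = end) {
                    long gid = strtol(p, &end, 10);
                    if (end == p) break;        /* no more gids on the line */
                    if (gid >= JOB_GID_MIN && gid <= JOB_GID_MAX)
                        printf("pid %s -> job gid %ld\n", de->d_name, gid);
                }
                break;
            }
            fclose(f);
        }
        if (proc) closedir(proc);
        return 0;
    }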

This approach might have had trouble keeping up with a really energetic fork bomb, but fortunately Borg didn't generally have to deal with actively malicious jobs, just greedy/misconfigured ones.

Once we'd developed cgroups this got a lot simpler.




cgroups was extremely useful for a system I built that ran on Borg, Exacycle, which needed to reliably "kill all child processes, recursively, below this process". I remember seeing the old /proc scanner and the new cgroups approach to getting the list of pids below a process, and realizing, belatedly, that UNIX had never really made this easy.
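
For contrast, a sketch of the cgroups version, where the job's pid list is just a file read (the cgroup path here is hypothetical):

    /* Enumerate and kill every process in a job's cgroup. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    int main(void) {
        FILE *f = fopen("/sys/fs/cgroup/job-1234/cgroup.procs", "r");
        if (!f) { perror("cgroup.procs"); return 1; }
        pid_t pid;
        while (fscanf(f, "%d", &pid) == 1) {
            printf("killing %d\n", pid);
            kill(pid, SIGKILL);   /* still racy vs. fork; re-read until empty */
        }
        fclose(f);
        return 0;
    }

On cgroup v2 (Linux 5.14+) even the re-read loop goes away: writing "1" to the cgroup's cgroup.kill file kills the whole subtree atomically.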


Was giving each job its own UID not an option? Users are the original privilege separation, after all, and kill -1 respects that.


No, because multiple jobs being run by the same end-user could share data files on the machine, in which case they needed to share the same uid. (Alternatively, we could have used the extra-gid trick to give shared group access to files, but that would have involved more on-disk state and hence been harder to change, versus the job tracking, which was more ephemeral.) It's been a while now, but I have a hazy memory that in the case where a job was the only one with that uid running on a particular machine, we could make use of that and avoid needing to check the extra groups.



