The history of sending signals to Unix process groups (utcc.utoronto.ca)
129 points by r4um on Sept 6, 2022 | 60 comments



If anyone can suggest an update to https://github.com/oconnor663/duct.py/blob/master/gotchas.md..., please let me know. As far as I'm aware, there's still no reliable way to kill a tree of processes on Unix that's suitable for library code.


Thank you for documenting those gotchas.

I know not being a library has different considerations, but some ideas I used in timeout(1) to kill a process group may be useful. Tricky things like using sigsuspend() to avoid signal handling races.

https://github.com/coreutils/coreutils/blob/master/src/timeo...
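
(For anyone unfamiliar with the pattern, here is a minimal sketch of the sigsuspend() idea, not the actual coreutils code: the signal is blocked while the flag is tested, and sigsuspend() unblocks and waits atomically, so the signal can't slip in between the check and the wait.)

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_chld = 0;

    static void on_chld(int sig) { (void)sig; got_chld = 1; }

    int main(void)
    {
        struct sigaction sa;
        sigset_t block, orig;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_chld;
        sigaction(SIGCHLD, &sa, NULL);

        /* Block SIGCHLD so it can't fire between the flag check and the wait. */
        sigemptyset(&block);
        sigaddset(&block, SIGCHLD);
        sigprocmask(SIG_BLOCK, &block, &orig);

        if (fork() == 0) {          /* child: exit after a moment */
            sleep(1);
            _exit(0);
        }

        while (!got_chld)
            sigsuspend(&orig);      /* atomically unblock SIGCHLD and wait */

        puts("child exited; no window for a lost wakeup");
        return 0;
    }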

cgroups might be another avenue to explore, being more modern and so having less compatibility baggage.


I wrote a 'subreaper' program to capture and kill such things some time ago. It's pretty basic, with options to pass signals on, to start killing subprocesses when it receives a signal, or to kill subprocesses after the initial subprocess dies. There is an undocumented --all that prints logs, kills on stats and doesn't start killing when the main program dies. I use it for killing test subprocesses, or for capturing things in terminals that would otherwise double-fork out of them, allowing me to kill them and all their subprocesses with a Ctrl-C.

https://github.com/knome/pj

>Linux is in the middle of adding new APIs like pidfd_send_signal, but none of them are aimed at improving the situation with grandchildren.

My subreaper is vulnerable to PID reuse, but I think that could be fixed by having it do this:

    loop
        children = scan-for-children
        for each child
            if no pidfd for given child
                pidfd open child ( dropping if error )
        still-children = scan-for-children
        close-and-drop any pidfd's that are no longer children
        safely-kill-children-via-pidfds
If you kill the direct children, the grandchildren become direct children (since you are a subreaper), and so on.

In summary, a helper process using subreaper+pidfd should be able to properly and safely contain and kill grandchild processes.
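
A rough C sketch of that loop, assuming a Linux kernel with pidfd_open()/pidfd_send_signal() (5.3+) and that the process has already marked itself a subreaper with prctl(PR_SET_CHILD_SUBREAPER, 1). This is not the pj code, just an illustration; the re-scan step from the pseudocode above is reduced to a comment:

    #define _GNU_SOURCE
    #include <ctype.h>
    #include <dirent.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Parse the PPID (field 4) out of /proc/<pid>/stat. */
    static pid_t ppid_of(pid_t pid)
    {
        char path[64], buf[512];
        int ppid = -1;
        snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
        FILE *f = fopen(path, "r");
        if (!f)
            return -1;
        if (fgets(buf, sizeof buf, f)) {
            char *p = strrchr(buf, ')');      /* comm is parenthesised */
            if (p)
                sscanf(p + 2, "%*c %d", &ppid);
        }
        fclose(f);
        return (pid_t)ppid;
    }

    /* Kill everything below us.  Because we are a subreaper, killing a
     * direct child promotes its children to direct children of ours, so
     * we just loop until a scan finds nothing left. */
    void kill_descendants(void)
    {
        for (;;) {
            int found = 0;
            DIR *d = opendir("/proc");
            struct dirent *e;
            while (d && (e = readdir(d)) != NULL) {
                if (!isdigit((unsigned char)e->d_name[0]))
                    continue;
                pid_t pid = (pid_t)atoi(e->d_name);
                if (ppid_of(pid) != getpid())
                    continue;
                found = 1;
                int fd = (int)syscall(SYS_pidfd_open, pid, 0);
                if (fd < 0)
                    continue;                 /* already gone */
                /* (A stricter version would re-check parentage here,
                 *  as in the pseudocode above, before signalling.) */
                syscall(SYS_pidfd_send_signal, fd, SIGKILL, NULL, 0);
                close(fd);
            }
            if (d)
                closedir(d);
            while (waitpid(-1, NULL, WNOHANG) > 0)
                ;                             /* reap what we killed */
            if (!found)
                break;
        }
    }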



I can suggest an update. jdebp.eu is replaceable by jdebp.info. 17 million people voted to deprive me of a domain name some years ago. I'd suggest using jdebp.uk as an alternative, but Scottish Independence and the Northern Ireland border problem are real things and once bitten, twice shy and all that. (-:


When killing an entire cgroup, you might avoid the PID reuse race by getting the list of PIDs in the cgroup, then, for each PID, first opening a PIDFD to that PID, checking via that PIDFD whether it's a member of the correct cgroup, and then signalling it through the PIDFD using pidfd_send_signal(). This way, no process can be killed inadvertently by PID reuse.

There may be a better way to do this, but one way might be to open() /proc/<PID> and use the resulting FD both to check /proc/<PID>/cgroup (using openat(FD, "cgroup", O_RDONLY|O_NOCTTY)) and as a PIDFD when calling pidfd_send_signal().

ISTM that systemd should use this method if its current method is just looping over PIDs.
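
A sketch of that idea, assuming cgroup v2 and a kernel that accepts a /proc/<PID> directory FD where pidfd_send_signal() expects a PIDFD (which is what the parent comment relies on); the cgroup path here is a made-up placeholder:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Placeholder cgroup v2 path, for illustration only. */
    static const char *TARGET_CGROUP = "0::/my/target/cgroup";

    /* Read "cgroup" relative to the pinned /proc/<PID> directory FD. */
    static int in_target_cgroup(int procdirfd)
    {
        char buf[256] = { 0 };
        int fd = openat(procdirfd, "cgroup", O_RDONLY | O_NOCTTY);
        if (fd < 0)
            return 0;
        read(fd, buf, sizeof buf - 1);
        close(fd);
        buf[strcspn(buf, "\n")] = '\0';
        return strcmp(buf, TARGET_CGROUP) == 0;
    }

    /* Open /proc/<pid> once, verify membership through that FD, and then
     * signal through the same FD, so a recycled PID can never be hit. */
    int kill_if_member(pid_t pid, int sig)
    {
        char path[64];
        int rc = -1;
        snprintf(path, sizeof path, "/proc/%d", (int)pid);
        int dirfd = open(path, O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
            return -1;                        /* process already exited */
        if (in_target_cgroup(dirfd))
            rc = (int)syscall(SYS_pidfd_send_signal, dirfd, sig, NULL, 0);
        close(dirfd);
        return rc;
    }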


Might be easier to freeze the cgroup, send the signals, then unfreeze. The only danger there is that it needs to not be reentrant.

Edit: Indeed both your and my methods were proposed to systemd: https://github.com/systemd/systemd/issues/13101 It's just waiting for someone to implement it.
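
A sketch of the freeze-kill-thaw order, assuming cgroup v2 (cgroup.freeze), a placeholder cgroup path, and that the caller owns the cgroup; once frozen, the members can neither fork nor exit on their own, so the cgroup.procs list can't go stale under you:

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Placeholder path for illustration. */
    #define CG "/sys/fs/cgroup/mygroup"

    static int write_str(const char *path, const char *s)
    {
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, s, strlen(s));
        close(fd);
        return n < 0 ? -1 : 0;
    }

    int kill_frozen_cgroup(int sig)
    {
        if (write_str(CG "/cgroup.freeze", "1") < 0)   /* freeze everyone */
            return -1;

        FILE *f = fopen(CG "/cgroup.procs", "r");      /* membership is pinned now */
        if (f) {
            int pid;
            while (fscanf(f, "%d", &pid) == 1)
                kill((pid_t)pid, sig);
            fclose(f);
        }

        return write_str(CG "/cgroup.freeze", "0");    /* thaw the group */
    }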


This won't solve your problem, especially for library code, but prctl(2) with PR_SET_CHILD_SUBREAPER is worth mentioning: by setting the flag, your process becomes a subreaper and will become the parent of any orphaned descendant processes. Unlike cgroups, PR_SET_CHILD_SUBREAPER can be used without special privileges on commonly deployed Linux versions.
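
For reference, the call itself is a one-liner (Linux 3.4+); a minimal sketch:

    #include <stdio.h>
    #include <sys/prctl.h>

    /* Mark the calling process as a subreaper: orphaned descendants get
     * reparented to it (instead of to init), so it can wait() on them. */
    int become_subreaper(void)
    {
        if (prctl(PR_SET_CHILD_SUBREAPER, 1) == -1) {
            perror("prctl(PR_SET_CHILD_SUBREAPER)");   /* e.g. kernel < 3.4 */
            return -1;
        }
        return 0;
    }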


Dropping them into a cgroup so that new PIDs can't run away is, I think, the only reliable approach.


Solid approach for desktop Linux. Not possible on Unix in a general sense, because cgroups are a Linux-specific thing. They also depend on having certain kernel options enabled.

So if you want to write software that can target macOS or any of the BSDs, you can't use cgroups.


As explained in the link, although constrained processes cannot escape the cgroup, they can keep forking off new PIDs indefinitely. Also, the old PID the forking leaves behind might be reused by some other new process in another cgroup, and now you’ve killed the wrong process. Oops!


I think it's a non-problem, TBH. You can use process groups (yes, like shells do). Alternatively, you can track all child process IDs and kill them one by one explicitly, and do that recursively down your process tree. Each sub-process in the tree that starts child processes is expected to also do this reliably.


The problem is, the child programs you launch may also use process groups to implement job control for their children, and process groups don't nest. That means it's all very brittle, and e.g. sending SIGKILL will actually leave lots of orphans around instead of cleaning everything up.


OK, yeah, I get that. Back in the day we used to treat processes like containers. The system programmer would write the process executables, or at least a process manager. If a process ever called setpgrp itself, it did so for a controlled reason and because it was supposed to. You were in control of that because you wrote it, so it's a non-problem.

In modern times we're doing more things like containerizing a bunch of weakly related processes instead of using a VM for that, so the use case has become more pressing. There are cgroups for managing such a group of processes. I'm surprised to learn there is no way to reliably send a signal to all processes in a cgroup, though.

> SIGKILL will actually leave lots of orphans around instead of cleaning everything.

Well, actually, orphan processes exist for a reason. You're supposed to "wait" on / reap them to make them non-orphans.


Yeah and similarly, if you're a terminal program, using process groups for your children means that they'll keep running if the user presses Ctrl-C. Process groups really just weren't designed for anything that's not like a system shell.


> Linux is in the middle of adding new APIs like pidfd_send_signal, but none of them are aimed at improving the situation with grandchildren.

Maybe Linux should implement something like pidfd_getpfd() (returning the PID FD of the parent process) and pidfd_fork() (returning the PID FD of the child process).


It's not possible at all from arbitrary userspace, for the simple reason that you aren't any more privileged than the processes you're trying to kill. They can preempt you and race against you while creating new processes to kill.

You need to use a kernel feature intended for this purpose, which is what things like process groups (linked article), jobs (80's BSD feature) and cgroups (modern generalization of the idea) were designed for.


Use the freezer cgroup to freeze everything, then kill it all off however you like.


All I ever wanted was to type ^C (or perhaps ^\) in my terminal and kill in one fell swoop all the shite that my previous shell-typed command spawned directly or indirectly.

This, IIRC, used to be the standard behavior back in the day, but in recent years, for some reason, things have not been so simple.

Looking at the man page of ps(1) sort of starts to elucidate the effing mess that the name "group" summons in the Unix process world.

Are we talking about the real group ID (RGID), the effective group ID (EGID), the controlling terminal process group ID (TPGID), the control group (CGROUP), the textual effective group ID (EGROUP), the filesystem group ID (FGID), the textual filesystem group ID (FGROUP), the process ID of the process group leader (PGID), the saved group ID (SGID), the session ID (SID), the supplementary group IDs (SUPGID), or the thread group ID (TGID)?

I'm pretty sure I'm missing some.

What an effing mess.



I rest my case.


> This, IIRC, used to be the standard behavior back in the day,

Catching SIGINT and detaching from the controlling terminal are pretty old concepts.


> kill [...] shell-typed command spawned directly or indirectly

Are you sure? Let's assume you typed "service httpd start". It starts the init script, which starts httpd in the background. IMO, in most use-cases you don't want to kill everything ever forked from the current shell.


As well as getting how "service" commands operate wrong, you are conflating a "background" process associated with a TTY with a daemon. Some "service" commands work the way that you describe. But that is actually a bug. A properly designed "service" command does not execute as some direct or indirect parent process of the eventual "httpd" process. The "service" command runs in a login session. A daemon has (at least as an intention) no association with a TTY. The daemon process runs in a context that has never been tainted by any of the various things associated with login sessions.

* http://jdebp.info./Softwares/nosh/bsd-service-command.html


The word "group" smells of an anti-pattern in any context to me.


I agree in spirit, but here's one of the most successful abstractions of all time: https://en.m.wikipedia.org/wiki/Group_(mathematics)


If one is familiar with POSIX and why it is how it is, one knows that the reason that POSIX has both sessions and process groups is that "sessions" pretty much descend from AT&T System 3 process groups and "process groups" descend from BSD process groups; so one would indeed have expected that there were two separate ideas of "process group".

Things that this history does not cover include System 5 Release 2 "shell layers". Shell layers multiplexed virtual terminals onto a single physical terminal, preceding its obvious successors, screen, by somewhere around three years, and tmux by about a quarter of a century. The popularity of tmux and screen nowadays shows that the BSD job control mechanism did not entirely kill off the vision of shell layers, as I recall people thinking it had at the time.


Process groups and job control are a huge fucking mess on Unix. Process groups, job control, and signals are probably the biggest mistakes of Unix's design. Arguably fork(2) as well, but that'll probably raise some eyebrows. The blame for Berkeley sockets and ioctls can't entirely be laid at Unix's feet, but they're also quite the mess.


Huh, I can understand labeling all of those as messes, but what's wrong with Berkeley sockets? The API seems fairly straightforward, and has managed to withstand the test of time pretty well. I'm curious what grievances you have with it?


The fact that strict aliasing has to be broken to use the API shows that, amongst other things, the API is from a different era of practicality over correctness.

In the past, people have popped up to say “but that’s only in the bind() implementation side” (when talking about `sockaddr`, for instance) but I find this argument amusing. At the end of the day, if something breaks because of this, that nuance is lost amidst the mass of frustration.


Are you sure? My impression from e.g. https://stackoverflow.com/questions/13210239/can-a-c-compile... is that it’s alright to cast between a pointer to a struct and a pointer to its first member.


You can also say that strict aliasing is incompatible with the original design of C (i.e. it's a huge mess). We should have never had undefined behaviour in C to begin with, that should have been a separate language (possibly one that can still #include C headers and talk the same ABI).


Ah, I hadn't considered the whole sockaddr/sockaddr_storage/sockaddr_in/sockaddr_un... thing, yeah that's a mess. I guess a more modern API would either use opaque pointers or a union (though that has its own share of disadvantages). Thanks for bringing this up!


Note that it wouldn’t be a strict aliasing violation, for example, if the argument to bind() were typed const sa_family_t *. The annoying cast would still be required, but very much legal (as long as the other side hasn’t screwed up).


I cannot speak for M. DeVault, but one of the interesting things about the API is that accept() embodies two distinct things. A hypothetical alternative API would have enabled such things as filtering and rejecting undesirable incoming connections before a handshake completes with application-mode mechanisms instead of with kernel-mode mechanisms.

People often point to the function calling conventions of the API, such as the "sockaddr" structures and the "errno" mechanism (that has historically been tricky for non-Unix operating systems and for systems where there are multiple C compilers with multiple C runtime libraries). But there are actual architectural decisions that had fairly reasonable alternative paths. I wouldn't say that it's a "mess" because of these alternatives, but it is definitely not the sole and unequivocal way of approaching things; and there were and are tradeoffs to be had.


The most obvious design wart to me is this:

If you want to establish bi-directional communication with some process on the same host, that process should create two rendezvous points with mkfifo(2). Your process opens the read FIFO for reading, and the write FIFO for writing, and you're done. If you finish writing and want to just read, you close(2) the write file descriptor, and keep reading the read file-descriptor until you hit EOF.

If you want to establish bi-directional communication with some process on a remote host over TCP/IP, that process needs to listen(2). Your process connects to it with connect(2), and gets back a single file descriptor that (unlike any other file descriptor) is both writable and readable. If you finish writing and want to just read, you have to use the special shutdown(2) function, a quirk that only exists to work around the quirk of TCP sockets being both readable and writable.

Some might argue that this is a pretty minor wart, all things considered, and sure, it's nowhere near as confusing as the mess of terminal job control. But it also seems like they could have implemented it in a non-quirky way if they'd just spent thirty seconds thinking about it beforehand.
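
To make the asymmetry concrete, here is a small sketch (just an illustration of the two half-close styles described above, with hypothetical function names):

    #include <sys/socket.h>
    #include <unistd.h>

    /* FIFO pair: two FDs, so "done writing" is just closing the write side. */
    void done_writing_fifo(int read_fd, int write_fd)
    {
        close(write_fd);              /* peer sees EOF on its read side */
        (void)read_fd;                /* keep reading read_fd until EOF */
    }

    /* TCP socket: one bidirectional FD, so you half-close it instead. */
    void done_writing_tcp(int sock_fd)
    {
        shutdown(sock_fd, SHUT_WR);   /* peer sees EOF; we can still read */
    }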


FDs to files are both readable and writeable, although in fairness the underlying stream is shared between the read and write side.


It's very un-Unix like. In Unix, everything is a file, right? But sockets are entirely managed via syscalls. There are also some very nasty skeletons lying in wait in more obscure parts of the API, such as cmsg.

Consider an alternative design (Plan 9 is something like this, it's been a while so I'm short on details):

1. Open /net/dns and write the desired hostname, then read back the resolved IP address

2. Open /net/tcp/clone and write the address and port, then read back a connection number

3. Open /net/tcp/$conn_id and the file is now a full duplex TCP stream

Compare this to BSD sockets: use getaddrinfo for the DNS lookup (gross!), create a socket with a syscall, then connect the socket with another syscall using sockaddr (gross!). Much worse and much less Unixy.
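
For comparison, the BSD-sockets side of that flow looks roughly like this as a sketch (error handling trimmed, only the first getaddrinfo result tried):

    #include <netdb.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Resolve, create, connect: the three-step dance being compared above. */
    int dial_tcp(const char *host, const char *port)
    {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo(host, port, &hints, &res) != 0)   /* DNS lookup */
            return -1;

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;                     /* readable and writable stream fd */
    }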


In unix, everything is a file descriptor [1]. The existence of an actual file entry on the filesystem namespace is not really needed.

[1] that's the ideal of course, reality is different.


Not really. Modern Unixes have evolved towards everything is a file descriptor, but the original design very much called for everything to be a file. Bell Labs started Plan 9 in part as a response to this very issue.


There was no original design for Unix. It never was one thing, but a bunch of loose ideas wired (and rewired) together pragmatically over time by different people, usually scratching an itch. I highly recommend Kernighan’s UNIX: A History and a Memoir to get a glimpse of the dynamic.

The “it’s all a file” idea was added as an ideal along the way because the interface was so dang convenient. But I’ve never read it in any documentation.

Even pipes were added as a “that’s a cool idea.” There was a conversation, a late night coding session and pipes came into existence.

Plan 9 was a response to the evolved (vs designed) Unix. Took “the best” from it and made those design principles. It was an “if we could do it all over again, knowing what we know now” project.

What’s interesting to me is that evolved systems seem to dominate design-first systems in adoption. Maybe it’s that they’re pragmatic. I don’t know. Or maybe my observation is just wrong.


Eyebrows raised, please elaborate.


Unix's original design dates back from the late 1960s, where processor speed was measured in megahertz and memory size was measured in kilowords. Design choices that were reasonable then do not necessarily translate well to the 21st century.

One example among many warts: in traditional Unix, user programs manipulate process IDs in a shared PID address space, not process handles private to one process (as in file descriptors). Therefore, nearly any operation done on a process through a PID is inherently racy. That's because between acquiring the PID of a process you want to interact with and performing an operation by PID, the PID could no longer refer to the process you were interested in (for example, the process could die and the kernel could recycle the PID for a new, unrelated process). There are ways to fix that (pidfd_open on Linux for example), but the sheer amount of legacy code out there means that these warts will stay around in Unix-like systems for a very, very long time.
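
A small sketch of the pidfd fix mentioned there, using raw syscall() since the glibc wrappers are relatively recent; the caveat is that you still need to obtain the pidfd while the PID is known to be valid (e.g. as the parent, before reaping):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Turn a PID into a stable handle, then signal through the handle.
     * Once opened, the fd keeps referring to that exact process; if the
     * process dies and its PID is recycled, the fd never follows the
     * newcomer, so the signal can't hit an unrelated process. */
    int signal_via_pidfd(pid_t pid, int sig)
    {
        int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);   /* Linux 5.3+ */
        if (pidfd < 0)
            return -1;                                      /* already gone */
        int rc = (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
        close(pidfd);
        return rc;
    }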

Even if you somehow fix every single design issue with Unix and remove all the warts, the end result would not really look like Unix anymore. One example of a legacy-free, capability-based operating system with no implicit ambient authority is Fuchsia, whose kernel interfaces do not resemble traditional Unix syscalls at all.


Everything with Unix sits on the foundation that multiple users from various terminals (requiring different driver implementations) are typing text on them simultaneously. The system consists of little programs that users combine in creative ways to run a process hierarchy. And this is temporally-confined within a session, so nothing of that “playing” should remain running once the session is over, unless explicitly stated with setsid() and friends.


Copying the entire address space into a second process just so it can be overwritten by exec a few instructions later is a big mess, and the implementation problems raised by "what if the program doesn't call exec soon?" are severe. A spawn model is much better for creating new processes.


This is a deliberate design decision. Shells do important stuff in the child process before calling exec - mostly redirecting file descriptors to pipes or files, and setting up environment variables. Exec replaces the program image, but retains these. The beauty of this is that for the exec’d program, IO is still stdin and stdout, no matter if it’s a pipe or a file.
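
Roughly what a shell does for `prog > out.txt`, as a sketch: the child rearranges its own file descriptors between fork() and exec(), and the exec'd image inherits them.

    #include <fcntl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    void run_redirected(const char *prog, const char *outfile)
    {
        pid_t pid = fork();
        if (pid == 0) {                               /* child */
            int fd = open(outfile, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            dup2(fd, STDOUT_FILENO);                  /* stdout -> file */
            close(fd);
            execlp(prog, prog, (char *)NULL);         /* image replaced, FDs kept */
            _exit(127);                               /* exec failed */
        }
        waitpid(pid, NULL, 0);                        /* parent waits */
    }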


You can do all of that stuff in a spawn model as well.


You can, but the interface of spawn(…) would grow enormous and complex. In the fork/exec model, you can execute arbitrary syscalls affecting the environment of the child, using all the inherited data from the parent.
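
For comparison, here is roughly how the same redirection looks with POSIX's actual spawn-style API, posix_spawn(), where each child-side tweak becomes an entry in a "file actions" object rather than arbitrary code between fork and exec (a sketch, not a recommendation):

    #include <fcntl.h>
    #include <spawn.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    /* The same `prog > out.txt` redirection, expressed as a file action. */
    int spawn_redirected(const char *prog, const char *outfile)
    {
        posix_spawn_file_actions_t fa;
        posix_spawn_file_actions_init(&fa);
        posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, outfile,
                                         O_WRONLY | O_CREAT | O_TRUNC, 0644);

        char *argv[] = { (char *)prog, NULL };
        pid_t pid;
        int rc = posix_spawnp(&pid, prog, &fa, NULL, argv, environ);
        posix_spawn_file_actions_destroy(&fa);
        if (rc == 0)
            waitpid(pid, NULL, 0);
        return rc;
    }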


There's a third possible API model, which is sort of between the spawn() model (everything for the new process configured in a single call) and the fork/exec model (the child process can run arbitrary code to set up the new process): the process creation API could create the new process in a "suspended" state (like CREATE_SUSPENDED does on Windows), the parent process could then manipulate the new process as desired, and finally tell the kernel to start it.


The vfork syscall first appeared in BSD 3.0 and did it the other way around. When calling vfork, the parent is suspended until the child terminates or calls exec. The child is given read-only access to the parent's memory space, or it traps on an attempt to write.

This is how they optimised it in the days before COW, by fully preserving the intended behavioural semantics of fork/exec. You shouldn't modify memory in the child and the OS therefore shouldn't copy the memory space.
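
The canonical safe usage the comment describes, as a sketch: the child modifies nothing and goes straight to exec (or _exit on failure), so there is nothing for the suspended parent to see clobbered when it resumes.

    #include <unistd.h>

    pid_t spawn_ls(void)
    {
        pid_t pid = vfork();           /* parent suspended until exec/_exit */
        if (pid == 0) {
            execlp("ls", "ls", "-l", (char *)NULL);
            _exit(127);                /* must be _exit, never exit() here */
        }
        return pid;
    }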


You could "spawn" a separate minimal binary that only has to set up the proper environment, then have that binary "exec" into the intended process.


But that "minimal" binary would be a static and independent piece of code nevertheless. It wouldn't have access to the memory space of the parent so as to work with the already open descriptors without probing for them, it cannot use data already processed by the parent in the form of paths, etc.

Essentially you run into the same problem - now the complexity of the spawn(...) interface is shifted to that binary.


> It wouldn't have access to the memory space of the parent so as to work with the already open descriptors without probing for them, it cannot use data already processed by the parent in the form of paths, etc.

You could simply use ordinary IPC for these things, though. They need not be implemented as part of the OS.


Job control was designed before SMP was generally available.

Processor affinity, sequential execution on available/shared resources, hold queues for non-fatal errors, and policy enforcement are missing from job control.


It’s a mess that originates from the dated concept of terminals and that users are necessarily bound with interactive sessions from them. Without the notion of terminals and special control sequences, process hierarchy control can be much simpler and more orthogonal. A “signal” would be sent to the leaf processes in the hierarchy first, then they terminate, notifying their parent and so on.


> It’s a mess that originates from the dated concept of terminals and that users are necessarily bound with interactive sessions from them.

The concept has been revived (in a notoriously clunky way that's likely not directly compatible with previous facilities) via multiseat support in more recent Linux distributions.


Question: can we control a Linux process (with all its threads, signal handlers etc.) in a completely deterministic way, e.g. by using ptrace() on it?


Yes. Though I'm not sure I see the connection to the OP...?

The example I'm most familiar with, because I work on it, is Shadow. We used ptrace for a bit but now use seccomp.

https://github.com/shadow/shadow/


Sorry for no references, this is off the top of my head. There were past attempts at confining / "jailing" processes based on ptrace; they ended up being racy and/or escapable. If you're relying on cooperation for isolation, you're back to square one.


killpg is my favorite syscall



