In my copious free time, I have a vague idea of designing and implementing a mechanism called kpid1.
Basically, a task running as pid 1 (including in a container) could call a new kpid1() syscall, which would cause the kernel to completely take it over. The kernel would take care of all the usual init work, and it would expose the minimal API (presumably using a new kind of fd) to allow a different task to give it instructions and manage zombies as needed. And that’s it.
It’s worth noting that the entire concept of pid 1 is very unixy, but not in a good way. Reasonable modern designs (e.g. all the Windows variants) don’t have any real equivalent.
Zombie reaping could have a reasonable API. (Signals are miserable.)
PID 1 is magic in problematic ways. In particular, if PID 1 crashes, the whole system goes down with it. And having PID 1 be a normal program running from a normal ELF file means that that ELF file is pinned for the life of the system or at least until it execs something else. So handoff from initramfs to a real fs either involves PID 1 calling execve() or involves leaving the init process around. Upgrading the package containing PID 1 requires execve(). Running PID 1 from a network filesystem or an unreliable device risks a kernel panic for no good reason.
With PID 1 moved to the kernel, the actual service management job is no longer coupled to PID 1’s legacy. A service manager could hand off to another one by saving its state to disk and exiting, by running the new one and moving its state after the new one starts, or by any other ordinary means. And if it crashes, you can ssh in, read logs, save work, and then restart it or the whole system as appropriate.
As a minor additional benefit, having PID 1 in the kernel could enable some optimizations. Right now, a process must enter the zombie state when it exits, and it must stay in that state until its parent wakes up and reaps it. So a service exiting fundamentally involves some complex bookkeeping and a context switch to a single, unrelated process. If the kernel knew that kpid1 was in use and that nothing in the system actually needs to be notified of exiting children of pid 1, then a child of pid 1 that exits could simply go away, as it would on a sensible system like Windows.
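To make the zombie state concrete, here is a minimal sketch, assuming Linux with /proc mounted. The helper names `proc_state` and `observe_zombie` are made up for illustration. The child exits at once, yet its PID stays listed in state `Z` until the parent reaps it.

```python
import os

def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field comes right after the parenthesised command name.
    return data[data.rindex(")") + 2]

def observe_zombie():
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before = proc_state(pid)  # exited but not yet reaped

    os.waitpid(pid, 0)  # reap: only now is the PID released
    still_listed = os.path.exists(f"/proc/{pid}")
    return state_before, still_listed

if __name__ == "__main__":
    print(observe_zombie())
```

The interesting part is the gap between the two wait calls: the process is already gone in every useful sense, but its entry has to hang around until the parent gets scheduled and asks for it.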
(Yes, it's okay to admit that, in some respects, Windows is substantially better than Linux/Unix.)
Not OP, but the whole business of PID 1 having to reap orphan PIDs seems like something the kernel should have to do. Is there a good reason that, when a process exits that no other process is waiting on, a user-mode PID 1 process has to observe that exit?
My understanding is that reaping children is a normal thing for most processes to do, and it's only orphans that fall through to PID 1, at which point it's easier to deal with it there rather than need to do anything special in ring zero.
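For reference, the "normal" reaping described here looks roughly like the sketch below. `spawn_and_reap` is a hypothetical helper, not anything a real init uses; an actual PID 1 would typically run a similar `waitpid(-1, ...)` loop in response to SIGCHLD rather than blocking like this.

```python
import os

def spawn_and_reap(exit_codes):
    """Fork one child per exit code, then reap them all with waitpid(-1)."""
    pending = set()
    for code in exit_codes:
        pid = os.fork()
        if pid == 0:
            os._exit(code)  # child: exit with the requested code
        pending.add(pid)

    reaped = {}
    while pending:
        # -1 means "any child", which is how an init-style reaper waits.
        pid, status = os.waitpid(-1, 0)
        if pid in pending:
            pending.remove(pid)
            reaped[pid] = os.WEXITSTATUS(status)
    return sorted(reaped.values())

if __name__ == "__main__":
    print(spawn_and_reap([3, 1, 2]))
```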
Reaping children is "normal" in a universe where processes have numeric ids that can't be reused for unrelated processes until some handshake occurs that frees the id for reuse.
If you take anything resembling a fresh look at this concept, it's absurd. Imagine if every open file had a systemwide unique id, and one specific process owned that id and would continue to own it until it released it.
Reasonable designs use weak references that don't have values that can be compared across processes. These are usually called "handles" or "file descriptors", and they don't have this problem at all. Nothing reaps sockets, for example, and nothing needs to.
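A small sketch of the socket case, for contrast with processes. `close_without_reaper` is a name invented for this example. One end of a socket pair is closed, the peer just sees end-of-file, and no third party has to acknowledge anything before the resource goes away.

```python
import socket

def close_without_reaper():
    a, b = socket.socketpair()
    a.sendall(b"bye")
    a.close()          # the handle is released right here, no handshake

    data = b.recv(16)  # data queued before the close is still delivered
    eof = b.recv(16)   # then the peer observes EOF as an empty read
    b.close()
    return data, eof

if __name__ == "__main__":
    print(close_without_reaper())
```

The descriptor numbers behind `a` and `b` only mean something inside this one process, so there is no systemwide id that has to be kept reserved after the close.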