Why threads can't fork (thorstenball.com)
147 points by akerl_ on Oct 13, 2014 | 68 comments



Sun did try multi-thread fork semantics, but that didn't help.

UNIX "fork" started as a hack. The reason UNIX originally used a fork/exec approach to program launch was to conserve memory on the PDP-11. "fork" originally worked by swapping the process out to disk. Then, at the moment there was a good copy in both memory and on disk, the process table entry was duplicated, with one copy pointing to the swapped-out process image and one pointed to the in-memory copy. The regular swapping system took it from there.

Then, as machines got bigger, the Berkeley BSD crowd, rather than simply introducing a "run" primitive, hacked on a number of variants of "fork" to make program launch more efficient. That's what we mostly have today. Plan 9 supported more variants in a more rational way; you could choose whether to share or copy code, data, stack, and I think file opens. The Plan 9 paper says that most of the variants proved useful. But that approach ended with Plan 9. UCLA Locus just had a "run" primitive; Locus ran on multiple machines without shared memory, so "fork" wasn't too helpful there.

Threads came long after "fork" in the UNIX world. (Threads originated with UNIVAC 1108 Exec 8 (called OS 2200 today), first released in 1967, where they were called "activities"). Exec 8 had "run", and an activity could fork off another activity, but not process-level fork. Activities were available both inside and outside the OS kernel, decades before UNIX.

That's why UNIX thread semantics have always been troublesome, especially where signals are involved. They were added on, not designed in.


> share or copy code, data, stack, and I think file opens

Sounds like Linux's "clone" system call, which is the underlying syscall that libc's fork() uses.

You can do just about anything imaginable with it: http://linux.die.net/man/2/clone

For example: you could create a child process-like-thing which shares nothing but the signal handler table. No idea what that would be good for.


Not all combinations are allowed. In this specific case, if you specify CLONE_SIGHAND then you must also specify CLONE_VM (so the processes share a virtual memory space, and are essentially threads).
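
For illustration, a minimal sketch of driving clone() directly to create a thread-like child that shares the address space and signal handler table (the flag choice and names here are just an example, not a recommendation):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child_fn(void *arg) {
        printf("child sees: %s\n", (const char *)arg);
        return 0;
    }

    int main(void) {
        size_t stack_size = 1024 * 1024;
        char *stack = malloc(stack_size);        /* child needs its own stack */

        /* CLONE_SIGHAND requires CLONE_VM: a shared signal handler table only
           makes sense with a shared address space. SIGCHLD is the signal the
           parent gets when the child exits, so waitpid() works as usual. */
        int flags = CLONE_VM | CLONE_SIGHAND | SIGCHLD;
        pid_t pid = clone(child_fn, stack + stack_size, flags, "hello");
        if (pid == -1) { perror("clone"); return 1; }

        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }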


Ah good catch, sorry I just skimmed the man page for an interesting sounding feature.


For that reason, clone(2) always felt like an overgenerality -- an attempt to decompose something into orthogonal parts, most combinations of which actually don't make sense.


Even when you have a more formal "run" primitive, you still want a lot of inheritance: file handles (for console programs and redirection), security identity and authorizations, and the current working directory.

There's also the matter of API complexity. If you want to configure the context in which a program runs, there are roughly two different API approaches you can take. You can deeply parameterize the 'run' command, or you can configure the "current" process and then hand that off to the new process. But you usually need the second API (configuring the current process) anyway.

So I don't see fork(), per se, as a particularly crufty API. Slightly more configurability, along the lines of clone() or Plan 9's rfork(), would be even better.

Multi-threaded signals are pretty dreadful, though. In a multi-threaded system, signals should be delivered on a separate thread, not via an existing thread getting hijacked.


> Multi-threaded signals are pretty dreadful, though. In a multi-threaded system, signals should be delivered on a separate thread, not via an existing thread getting hijacked.

That is basically the recommended way to handle signals in a pthreaded program - have one dedicated signal-handling thread that calls `sigwait()` in a loop, and block all signals in the signal masks of the other threads.
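
A minimal sketch of that pattern, assuming nothing beyond plain POSIX (names are illustrative): block the signals of interest before creating any threads, then dedicate one thread to collecting them synchronously.

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>

    static sigset_t sigs;

    static void *signal_thread(void *arg) {
        (void)arg;
        for (;;) {
            int signo;
            if (sigwait(&sigs, &signo) == 0)
                printf("got signal %d\n", signo);  /* ordinary code, not a handler */
        }
        return NULL;
    }

    int main(void) {
        sigemptyset(&sigs);
        sigaddset(&sigs, SIGINT);
        sigaddset(&sigs, SIGTERM);

        /* Block these before creating any threads; every thread inherits the
           mask, so no thread ever gets hijacked to run an async handler. */
        pthread_sigmask(SIG_BLOCK, &sigs, NULL);

        pthread_t t;
        pthread_create(&t, NULL, signal_thread, NULL);

        /* ... the rest of the program runs in other threads ... */
        pthread_join(t, NULL);
        return 0;
    }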


"fork()" didn't start as a hack. There is no documentation to suggest that one return from fork() was a copy now on-disk while the other remained in ram.

http://cm.bell-labs.com/cm/cs/who/dmr/man21.pdf http://www3.alcatel-lucent.com/bstj/vol57-1978/articles/bstj...


Yes, there is.

John Lion's "A commentary on the Sixth Edition UNIX Operating System", goes through the PDP-11 UNIX kernel line by line.

http://www.lemis.com/grog/Documentation/Lions/book.pdf

Page 37, Section 7.13, "newproc":

1906: Call "xswap" (4368) to copy the data segments into the disk swap area. Because the second parameter is zero, the main memory area will not be released.

1907: Mark the new process as "swapped out".

1908: Return the current process to its normal state.


From what I can tell, that code is in a conditional which checks whether the new process can fit in main memory. If it can fit, it jumps to line 1913.

Anyhow, there's a difference between the fork interface being a hack, and the fork implementation being a hack. Unix is a cornucopia of implementation hacks. That doesn't mean the interfaces weren't deliberately and thoughtfully designed.

Much like C, what makes Unix unique and still relevant is that the deliberate design took into account practical implementation considerations. Unix and C are most elegant from an engineer's perspective. It's an interesting balance of interface complexity and implementation complexity. This is why some people claim that the Unix design philosophy epitomizes "worse is better".


Plan 9's rfork() never shares the stack segment. The other segments can be shared or copied depending on the RFMEM flag. There's no special voodoo needed for "thread"-local storage; it's just memory reserved at the top of the stack.


In my opinion this is fundamentally a problem of mixing coprocessing metaphors. The differences between threads, processes, and containers are tied up in leakage between the permissions model (processes) and the computation model (threads and containers). The thread equivalent of 'fork' would be some form of promotion to a 'root' thread, which is to say one where instance data about computation limits can be changed independently. Processes, which were the traditional collection point of resources under an identity, ideally sit apart from computation constraints. If you follow that path, you realize that resource allocation (one of the three key parts of OS management) then needs to be both computation-aware and identity-aware. In the example of the article, malloc would break the lock such that its validity would be tied to the identity it was associated with, so if you promoted a thread to the 'identity' level you would invalidate any locks visible to it that were attached to that identity.

There have been discussions about this in OS design for almost forever.


The fish shell http://fishshell.com is multithreaded and calls fork, so it can be done. But it is difficult, even if all the child does is call execve().

An example of a problem we encountered: what if execve() fails? Then we want to print an error message, based on the value of errno. perror() is the preferred way to do this. But on Linux, perror() calls into gettext to show a localized message, and gettext takes a lock, and then you deadlock.

This is a hard problem to solve because it requires knowing the status of all locks. This breaks the abstractions that library authors present, where locks are internal and not exposed.


Isn't it easier to have the parent print the error message by passing yet another pipe to the child that closes on exec and can be used to pass up errno?
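
A rough sketch of that trick, assuming only standard POSIX calls (the program name is a placeholder): mark the write end close-on-exec, so a successful execve() closes it and the parent's read() returns 0; on failure the child writes errno down the pipe.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        pipe(fds);
        fcntl(fds[1], F_SETFD, FD_CLOEXEC);        /* write end vanishes on exec */

        pid_t pid = fork();
        if (pid == 0) {
            close(fds[0]);
            execlp("some-program", "some-program", (char *)NULL);
            int err = errno;                       /* only reached if exec failed */
            write(fds[1], &err, sizeof err);
            _exit(127);
        }

        close(fds[1]);
        int err = 0;
        if (read(fds[0], &err, sizeof err) == (ssize_t)sizeof err)
            fprintf(stderr, "exec failed: %s\n", strerror(err));
        close(fds[0]);
        waitpid(pid, NULL, 0);
        return 0;
    }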


That's a good idea! I'm not sure if it's easier but it would effectively sidestep this class of issues.


You could also use an inherited shared memory segment (an anonymous mmap with MAP_SHARED) to report failure codes. Since you only need to set this up once, it could result in lower overhead than creating pipes for every exec.
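
A sketch of that variant, assuming a Linux/BSD-style MAP_ANONYMOUS (the program name is again a placeholder); the shared mapping survives both fork() and a failed exec, so the child can report errno through it:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* Set up once; the shared page is inherited across every later fork(). */
        int *exec_errno = mmap(NULL, sizeof *exec_errno, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *exec_errno = 0;

        pid_t pid = fork();
        if (pid == 0) {
            execlp("some-program", "some-program", (char *)NULL);
            *exec_errno = errno;                   /* visible to the parent */
            _exit(127);
        }

        waitpid(pid, NULL, 0);
        if (*exec_errno != 0)
            fprintf(stderr, "exec failed: %s\n", strerror(*exec_errno));
        return 0;
    }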


You give people an amazing implementation of M:N concurrency and they want to fork. You give people concurrency via forking, and they want M:N concurrency.


Did Go ever fix M:N threading? From what I've read, there's significant performance degradation when you use M:N threading as opposed to 1:1 threading.

If Go figured out how to efficiently detect data dependencies and relationships, and then automatically move goroutines around, that would be exceptionally noteworthy. Everybody starts out thinking they can do this, which is why Solaris, NetBSD, Linux, Java, et al all started out with M:N threading. But then when they figure out that it's a Really Hard(tm) problem, they invariably shift to 1:1.

I've found that it's better to leave it to the developer to choose whether to run an OS thread or coroutine, just as the developer chooses between a process and OS thread. So in my project[1] I don't spend much time trying to automate that.

[1] http://25thandclement.com/~william/projects/cqueues.html


Google is working on some kernel support for userspace threads; there was an article on lwn.net a while ago. Someone told me yesterday that he was seeing a lot of spurious wakeups but hadn't debugged them yet. So I think there is room for improvement.


In other words: Whenever people get a nice tool for something, they want to use the tool for other things as well.


The pthread_atfork() function shall declare fork handlers to be called before and after fork(), in the context of the thread that called fork(). The prepare fork handler shall be called before fork() processing commences. The parent fork handler shall be called after fork() processing completes in the parent process. The child fork handler shall be called after fork() processing completes in the child process. If no handling is desired at one or more of these three points, the corresponding fork handler address(es) may be set to NULL.

The order of calls to pthread_atfork() is significant. The parent and child fork handlers shall be called in the order in which they were established by calls to pthread_atfork(). The prepare fork handlers shall be called in the opposite order.

I'm not sure if that's the best approach, but it's an attempt at least.
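
For what it's worth, typical use looks something like this sketch (the lock and function names are made up): the prepare handler acquires the library's lock so nobody else holds it at fork time, and the parent and child handlers release it on both sides.

    #include <pthread.h>

    static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

    static void prepare(void) { pthread_mutex_lock(&lib_lock); }    /* before fork */
    static void parent(void)  { pthread_mutex_unlock(&lib_lock); }  /* after, in parent */
    static void child(void)   { pthread_mutex_unlock(&lib_lock); }  /* after, in child */

    /* Called once from the library's initialization path. */
    void lib_init(void) {
        pthread_atfork(prepare, parent, child);
    }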


I just had a look at IEEE Std 1003.1, 2013 Edition for pthread_atfork().

It has a few corner cases across POSIX systems. I wouldn't bet it works 100% the same way in all UNIXes.


Of course it doesn't, very few things do. Cross platform is hard, but that doesn't mean you don't use features.


Of course, but sometimes it is a huge pain.

I used to do cross-platform work across AIX, HP-UX, Solaris, GNU/Linux, FreeBSD and Windows NT/2000 back in the .COM days.


It's much easier in 2014 than it was in 2004. POSIX has evolved, and POSIX conformance has substantially improved. Most systems are, in practice, nearly 100% conformant to POSIX-2001.

Excepting Windows, I rarely run into difficult portability problems except when I deliberately use non-POSIX functionality or newer POSIX functionality.

I target Linux, OS X, OpenBSD, NetBSD, FreeBSD, Solaris, and AIX. The biggest laggard was OpenBSD, particularly wrt threading, signal handling, and real-time extensions. But in the past couple of years that's been substantially addressed.

One of my biggest headaches now is OS X. They appear to have stopped trying to track POSIX, so while everybody else is busily implementing POSIX-2004, POSIX-2008, and tentative POSIX features, OS X is nearly at a standstill. OS X hasn't fixed any significant conformance issues, adopted real-time extensions, or adopted any POSIX-2008 features for several years now.


Thanks for the update; however, it seems there's still a long way to go until most systems reach UNIX V7 X1201 compliance.


This is an issue in Common Lisp as well; you generally have the choice of either fork or threads. In fact ClozureCL, a popular Lisp implementation, launches an extra thread at startup (for I/O, IIRC), and for a while someone maintained a fork of it that did not do so, to allow use of the fork syscall.


Interesting. I noticed that with RESTAS (a CL web server/API server library) they make a big deal of daemonization and not requiring the usual hacks to daemonize a CL server (tmux/screen/dtach/etc).

Was this a common (no pun intended) problem among CL implementations, and is that why server daemonization is an issue? I am just learning CL, and noticed RESTAS only really supports this daemonization feature on SBCL, if I believe the documentation.

I guess I am going to have to dive into the source this weekend and check it out.


I use daemontools to monitor my daemons, so I've never tried to daemonize a lisp daemon. With SBCL you can still fork so long as you do it before you spawn any threads. I'm guessing that RESTAS uses the sb-posix:fork function to daemonize, which would explain why it only works on sbcl.


Neat. Thanks for the explanation. I had installed RESTAS inside of a Clozure CL image but had not gotten as far as daemonization and forking.

Judging from the one sentence on the landing page, I supposed it would blow up. Haha.


That's quite a copout with forkall. Presumably you have some idea what your threads are doing if you want to fork the entire process, so don't do it in the middle of writing to a file.


How many libraries do you use? Are you sure they don't create their own threads?

Also, just imagine something like this:

    file.write(blah);
    if (has_forked) thread_exit();
    file.write(blah2);
    if (has_forked) thread_exit();
    file.write(blah3);
    if (has_forked) thread_exit();
Note the race conditions in the above code... forkall() is a disaster. Presumably you'd combine it with pthread_atfork(), but it's still a total mess.


How could you know what your threads are doing at the time you call fork? That doesn't sound safe. I write such code in a way that threads communicate with each other when they need to, but if I start assuming what those threads are doing that seems like it could get really complicated really fast.


Sure it can get complicated, but there are also some safe ways to fork. Imagine you fork at the very beginning of your program before starting any threads. No danger in that. So the language could at least allow it (but provide the proper warnings around it).


The issue arises when the programming language doesn't let you run code before the threads are spawned, as in Go.


I don't know exactly what they're doing, but being able to reason that they are not performing file IO is a vastly easier problem than figuring out a completely safe point to kill them all.


Does forkall exist on Linux? I was under the impression it's a nonstandard Solaris extension.


That's a much better reason. But such a problem isn't set in stone.


I've always thought fork() was a hack anyway. 99% of the time, fork() is followed by a call to an exec() variant. The actual process forking behavior seems useful in only a small set of cases that could be handled by spawning a process and passing state explicitly. The benefits of fork() don't seem to offset its costs or justify its widespread use.

fork() is one of those things that's conceptually quite elegant but in practice has too many problematic edge cases. (Signals seem to fall into the same group.)


Spawning a process by passing state explicitly is what Windows does. The CreateProcess call takes 10 parameters, several of which are structures containing even more parameters - http://msdn.microsoft.com/en-us/library/windows/desktop/ms68...

I do like that fork, then setup functions, then exec keeps the code simpler. If everything is done as a single call like CreateProcess then it needs to have a way of passing what the setup functions need as data - rather a lot of it.


That's conflating separate issues. Windows could have a CreateProcessSimple() call that spawns a new process without so many options. CreateProcess is complex not so much because it is one call instead of two, but because it has all options jumbled together.

Meanwhile the Linux/Unix decision was to have about a dozen different APIs to do variants of the same thing (fork, vfork, and clone plus execl, execlp, execle, execv, execvp, execvpe, fexecve, posix_spawn, posix_spawnp, and probably more). This isn't really any less complex.

The argument complexity of CreateProcess could not possibly derive from merging fork() and exec() because fork() has no arguments.


CreateProcess is complex because spawning a new process is essentially complex.

The brilliance of fork/exec is in the realization that the tasks you do during process creation are the same as the tasks you do during normal execution, and therefore you can just re-use those functions. For example, CreateProcess has a parameter for the current directory; fork/exec doesn't need one because you just use chdir on the child side of fork.
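
A tiny sketch of that point (the helper name and arguments are invented for illustration): whatever you can configure for the current process, you can configure for the child on its side of the fork, with no extra parameters in the API.

    #include <unistd.h>

    pid_t spawn_in_dir(const char *dir, char *const argv[]) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child side: reuse the ordinary calls instead of threading every
               option through a CreateProcess-style parameter list. */
            if (chdir(dir) == -1) _exit(127);
            execvp(argv[0], argv);
            _exit(127);                            /* exec failed */
        }
        return pid;                                /* parent waits (or doesn't) */
    }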

The proliferation of exec* is indeed baffling, but these are just dumb variants trying to make the exec call itself "easier." They don't reflect any essential complexity.


It's not enough to merge fork() and execve(), because what actually happens is some variant of:

fork(); open(); close(); dup2(); fcntl(); sigprocmask(); sigaction(); setrlimit(); prctl(); chdir(); chroot(); capset(); execve();

(and perhaps more besides, with new ones potentially added with new features, for example seccomp();)


That seems like compelling evidence that the Linux method is more complex than the Windows method (or at least no less complex). Calling a bunch of complex functions is not simpler than setting up a bunch of complex params and then calling one function.


Approximately zero programs will call all of those functions, but each will call its own required subset of them. If you don't care about running the child with an altered root directory, you don't need to call chroot().


Of course. Most programs also don't need all the esoteric options that can be passed into CreateProcess either.


I don't think anyone who's thought about it for a moment disagrees that starting a process has a large degree of potential complexity, regardless of the operating system.

The question is whether it's better (more elegant) to load that complexity onto one complex function call, or farm it out to multiple simple function calls. It is apparent that this is a question in which personal taste is involved.


In the motivating Go example, implementing a fork doesn't seem like it would be particularly challenging, since the runtime can pause all executing goroutines and reschedule them in the same place on the other side. A bit easier than pthreads.


fork() is a dumb interface, and non-portable anyway. I've yet to see a use case that couldn't be handled with either threads, or spawning another process - after all, those are the only APIs you get elsewhere.

If you need to use a language with a runtime (not just Go by any means, the likes of Python also suffer from this issue) from two processes that need to be separate but communicate with each other, do the fork first, then start the language runtime (i.e. embed the language in your parent program).


"Spawning another process" is a very complicated task. There's tons of process state that you need to set before the child starts: process group, file descriptors, signal mask, foreground status, etc.

You can try to bundle it all up into a single API like posix_spawn, but the API becomes large and it's hard to cover everything. posix_spawn sure doesn't.

fork is an elegant solution to this problem: you can run code as the child, before the child gets to start. I am not aware of any alternative that's as flexible.


> I've yet to see a use case that couldn't be handled with either threads, or spawning another process - after all, those are the only APIs you get elsewhere.

Android uses this to pre-load framework resources and code in a way that lets all applications share the backing memory. And when applications crash, they don't bring down the initial process that preloaded everything (so it can continue spawning new apps).

How would you handle that with only threads or spawning another process?


It's possible to share memory between processes that weren't originally forks - consider e.g. X clients using XShm to communicate with the server, or jk for fast communication between apache and tomcat. I guess forking lets you do "share everything, COW", which is kind of handy, but it's also a very lazy way of programming; you get access to the whole address space, so it relies on the other processes to not reuse data that doesn't make sense when shared. Better to only share memory that processes explicitly want to share, and make it clear which one owns any given region of memory.


Shared memory also means that any modifications are also shared, which is really not good in this case. We could mark the sections RO I suppose, but then we have them occasionally copying things out of the RO pages to modify them which just bloats the address space (though doesn't change the number of backing pages). It's also slightly more brittle because you have to be careful about marking everything shared RO.

> Better to only share memory that processes explicitly want to share, and make it clear which one owns any given region of memory.

We are only sharing memory that we explicitly want to share: we load only what we care about, then fork.


Uh, dynamic linking?


That requires that any shared data be included as static data inside the compiled object file, right? Very often, you want to load "code" that is really data in an application-specific format, eg. DEX files on Android, .class files for Java, script text for Python, or templates for a webserver.


It's not dumb, it's just low level. Just like threads. Applications should be programmed in application-level programming languages providing higher-level notions such as monitors or actors, not threads and fork(2). A forking operation, at the application level, is a semantic operation that must be processed explicitly and specifically by the application (by each object having underlying threads).

The proof that the problem with forking is not threads is that the worst example given in the referenced article, that of file I/O, occurs just as well in processes without threads (or "single-threaded" processes if you want): if you write to a file from both the parent and the child, you must take precautions at the application level. This has nothing to do with threads.


It's much less of a problem in a single-threaded program because no other threads can be running at the point where you call fork(). So there's no worry about a mutex being held (assuming you don't fork with a mutex held, but don't do that), and you can know exactly which files are open at that point in time.


> I've yet to see a use case that couldn't be handled with either threads, or spawning another process

fork() allows you to set up file descriptors such as stdin and stdout before execing another process. This is essential for pipelines.
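
For example, a sketch of how a shell might wire up `a | b` (error handling omitted, names illustrative): each child dup2()s the right pipe end onto stdin or stdout before exec, so the exec'd programs never know a pipe was involved.

    #include <sys/wait.h>
    #include <unistd.h>

    /* Run "left | right", roughly the way a shell would. */
    void run_pipeline(char *const left[], char *const right[]) {
        int fds[2];
        pipe(fds);

        if (fork() == 0) {                  /* left-hand command */
            dup2(fds[1], STDOUT_FILENO);    /* stdout -> pipe write end */
            close(fds[0]); close(fds[1]);
            execvp(left[0], left);
            _exit(127);
        }
        if (fork() == 0) {                  /* right-hand command */
            dup2(fds[0], STDIN_FILENO);     /* stdin <- pipe read end */
            close(fds[0]); close(fds[1]);
            execvp(right[0], right);
            _exit(127);
        }
        close(fds[0]); close(fds[1]);
        while (wait(NULL) > 0) {}           /* reap both children */
    }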


Really? Writing a wrapper around fork with a mutex doesn't seem like that huge of an issue. This combined with pthread_atfork() should provide a way to make this work, no?


A mutex around fork() buys you nothing. The problem is other locks, which often are not even part of your code.

The article has perhaps the most likely example. Imagine malloc() has a lock. Another thread happens to be inside malloc() at the moment that you fork(), and therefore owns the lock and might be in the middle of manipulating shared data structures. Now suddenly the child cannot malloc(), because that thread [suspended in the middle of its execution] isn't going to be carried over to it, and will never be able to clean up its intermediate state and release the lock.
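
A minimal sketch of that hazard (whether it actually hangs depends on timing and on the C library; glibc, for one, tries to reset its own malloc locks in internal fork handlers, but a lock inside some random third-party library has no such safety net):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Keep the allocator busy so a fork() is likely to land while this
       thread holds malloc's internal lock. */
    static void *churn(void *arg) {
        (void)arg;
        for (;;) {
            free(malloc(4096));
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, churn, NULL);

        pid_t pid = fork();
        if (pid == 0) {
            /* Only this thread exists in the child. If the churn thread held
               the allocator's lock at fork time, this can block forever. */
            void *p = malloc(16);
            printf("child allocated %p\n", p);
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }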


It buys you plenty if you prevent other threads from entering until you all meet in a rendezvous lock.


The worker thread that was kicked off by some random library dependency (acquiring a lock you didn't know about) isn't going to care about your rendezvous lock. Even if you do control all the threads, getting them to do what you're suggesting may be nontrivial and costly.

Edit: IMO it's better to just admit the programming model is thorny and move on. I feel similarly about signals. Restrict what you do after fork() and generally be cautious, the same way you'd be cautious reacting to a signal.


Wrap the thread create calls and make them care. :)


Threads don't work the same way across all UNIX systems; there are small differences in how signals are handled, and POSIX functions tend to have implementation-specific semantics.


Yes, cross-platform is hard. You can still use whatever primitives are available to cobble this together; pthreads support or equivalent isn't exactly crazy to expect.

Otherwise you wouldn't offer locks in languages either!

This is a feature that can be implemented.


Sometimes those primitives have such different sets of corner cases and implementation-specific semantics that they might as well have different names across "compatible" OSes.

I used to do cross-platform work across AIX, HP-UX, Solaris, GNU/Linux, FreeBSD and Windows NT/2000 back in the .COM days.


Yes of course. When you code the implementation for each platform you always have to keep that in mind. Same is true for locks or exposing any low level systems features.

See also: high performance event handling, etc. But does anyone suggest you just can't ever use lots of sockets? (It is a bad idea sometimes, to be fair!)

We write the code necessary to do it.

(Since we're apparently going into personal history, I used to be a platform manager for a ~million-line cross-platform codebase that ran on most of the things in your list plus IRIX and OS X. Most of my responsibilities were really about keeping the build environment working well enough to run one binary across all the Linux distributions, but I was also involved in issues that were specific to my platform. That is one of many projects where I've dealt with platform-specific issues on a variety of codebases, several of which I've written in their entirety.

Anyway.

I have a passing familiarity with some of these issues.)


Thanks for the overview.

I didn't want to do some kind of credentials call; it was more to explain where I got my experience from, as many in forums tend to think all UNIXes are 100% alike, or worse, that GNU/Linux === UNIX.




