It's not so much the fork as the memory cost. Each of those subprocesses has at least one call stack, i.e. ~2 megabytes of memory. 2 megabytes per connection is several orders of magnitude more than you would use in an asynchronous server.
1) That's virtual size, and most likely (depending on OS/config) copy-on-write (assuming no call to execve), so the physical cost per child is only the pages it actually dirties.
2) That's just a default; most systems let you tune it (e.g. via ulimit -s / RLIMIT_STACK).
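A minimal sketch of point 2, assuming POSIX semantics: the stack ceiling is just the RLIMIT_STACK soft limit (commonly 8 MiB on Linux), which a server can read and lower before forking, and which children inherit. The 256 KiB figure below is an arbitrary example, not a recommendation:

    /* Sketch: read the default, lower it, and fork; the child inherits
     * the tightened limit, and its inherited stack pages remain shared
     * copy-on-write with the parent until written (point 1). */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        struct rlimit rl;
        getrlimit(RLIMIT_STACK, &rl);
        printf("default stack soft limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);

        rl.rlim_cur = 256 * 1024;        /* arbitrary example: 256 KiB */
        if (setrlimit(RLIMIT_STACK, &rl) != 0)
            perror("setrlimit");

        pid_t pid = fork();
        if (pid == 0)
            _exit(0);                    /* child: inherits the new limit */
        if (pid > 0)
            waitpid(pid, NULL, 0);
        return 0;
    }

From a shell, the same knob is ulimit -s (in KiB).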
You can get pretty decent performance out of forking models if you 1) put an upper bound on the number of concurrent processes, 2) have an input queue, and 3) cache results and serve from the cache, even for very small time windows. Not execve'ing is also a major benefit, if your design allows it (e.g. no mixing of threads with forks): in forking models, execve plus runtime init is the largest overhead.
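A rough sketch of points 1 and 2, assuming POSIX sockets: a fork-per-connection echo server that uses the kernel's accept backlog as the input queue and sheds load once a placeholder bound of 64 workers is alive (caching, point 3, is omitted for brevity):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define MAX_CHILDREN 64             /* upper bound on live workers */

    int main(void) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);    /* placeholder port */
        if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
            listen(lfd, 128) != 0) {    /* backlog = the input queue */
            perror("bind/listen");
            return 1;
        }

        int active = 0;
        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            if (cfd < 0)
                continue;
            while (waitpid(-1, NULL, WNOHANG) > 0)
                active--;               /* reap finished workers */
            if (active >= MAX_CHILDREN) {
                close(cfd);             /* shed load at the bound */
                continue;
            }
            pid_t pid = fork();
            if (pid == 0) {             /* worker: no execve, so the COW */
                close(lfd);             /* image of the parent is kept   */
                char buf[512];
                ssize_t n;
                while ((n = read(cfd, buf, sizeof buf)) > 0)
                    write(cfd, buf, (size_t)n);
                _exit(0);
            }
            if (pid > 0)
                active++;
            close(cfd);                 /* parent drops its descriptor */
        }
    }

Rejecting at the bound keeps the worst case flat: memory stays capped at MAX_CHILDREN stacks' worth, and the backlog absorbs short bursts.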
It will not beat the other models on raw performance, but forked processes offer other benefits: memory protection, rlimits, namespace separation, Capsicum/seccomp-BPF based sandboxing, ...
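As a small sketch of that containment angle (the caps are illustrative, and a seccomp-BPF or Capsicum setup would slot in at the same point): right after fork(), the child tightens its own rlimits before touching request data, so a runaway request hits SIGXCPU or allocation failure instead of taking the whole server down.

    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void cap(int resource, rlim_t limit) {
        struct rlimit rl = { limit, limit };
        if (setrlimit(resource, &rl) != 0)
            perror("setrlimit");
    }

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Limits apply to this child only; the parent is unaffected. */
            cap(RLIMIT_CPU, 2);                 /* ~2 s of CPU time */
            cap(RLIMIT_NOFILE, 16);             /* few file descriptors */
            cap(RLIMIT_AS, 64 * 1024 * 1024);   /* new allocations past
                                                   ~64 MiB total fail */
            /* ... handle exactly one request here ... */
            _exit(0);
        }
        if (pid > 0)
            waitpid(pid, NULL, 0);
        return 0;
    }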
I think you guys are both right. Back in the day, when I measured UNIX performance, it was fork that was expensive due to memory allocation, but not the memory itself: it takes time to allocate all the page tables for the new address space when you are setting up for the context switch. But I should admit that it was a long time ago that I traced that code path.