Nice write-up. I suspect the poor state of async I/O implementations reflects how rarely it's actually used in practice. Signal handling definitely feels like the wrong design here, especially for a library author.
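For anyone who hasn't run into it, signal-driven completion with POSIX AIO looks roughly like this (untested sketch; the choice of SIGUSR1 and the file path are mine). The sigaction() call is the library-author problem in a nutshell: signal dispositions are process-global, so a library installing this handler can silently stomp on whatever the application registered:

```c
/* Untested sketch of POSIX AIO with signal-based completion
 * (may need -lrt on older glibc). */
#include <aio.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t done = 0;

static void on_aio_done(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    if (si->si_code == SI_ASYNCIO)
        done = 1;   /* only async-signal-safe work is allowed here */
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_aio_done;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);   /* process-wide side effect */

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256];

    struct aiocb cb = {0};
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
    cb.aio_sigevent.sigev_signo = SIGUSR1;

    aio_read(&cb);                   /* kicks off the async read */
    while (!done)
        pause();                     /* wait for the completion signal */

    printf("read %zd bytes\n", aio_return(&cb));
    close(fd);
    return 0;
}
```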
I'm also not surprised that Windows fared better here; with IOCP they had a chance to redo async I/O completely.
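The shape of the API is the big difference: completions get pulled from a queue you own instead of pushed at you asynchronously. A rough, untested sketch of the pattern (file name is arbitrary):

```c
/* Untested IOCP sketch: issue an overlapped read, then pull the
 * completion from a queue. A real server would have a pool of
 * threads all blocking in GetQueuedCompletionStatus(). */
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE file = CreateFileA("test.txt", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED, NULL);

    /* Create a completion port and associate the file handle with it. */
    HANDLE iocp = CreateIoCompletionPort(file, NULL, /*key=*/1, 0);

    char buf[4096];
    OVERLAPPED ov = {0};                          /* read at offset 0 */
    ReadFile(file, buf, sizeof buf, NULL, &ov);   /* returns immediately,
                                                     pending I/O expected */

    /* ... do other work while the read is in flight ... */

    DWORD bytes; ULONG_PTR key; OVERLAPPED *pov;
    if (GetQueuedCompletionStatus(iocp, &bytes, &key, &pov, INFINITE))
        printf("completion on key %llu: %lu bytes\n",
               (unsigned long long)key, (unsigned long)bytes);

    CloseHandle(file);
    CloseHandle(iocp);
    return 0;
}
```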
There's also the extremely limited use case: there aren't that many I/O requests you can have in flight to a disk at a time while still getting a speed-up. Having to fall back on a thread pool ends up not being much of a problem -- it doesn't really hurt your benchmarks. On the other hand, a system that talks on a network to thousands or millions of clients benefits greatly from avoiding 4-8KB of stack overhead per connection.
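Roughly what the event-loop alternative looks like on Linux (untested echo-server sketch; the port number is arbitrary) -- the per-connection cost is just a registered fd plus whatever state you keep, no dedicated stack:

```c
/* Untested sketch: one thread multiplexing many sockets with epoll. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    bind(lsock, (struct sockaddr *)&addr, sizeof addr);
    listen(lsock, SOMAXCONN);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lsock };
    epoll_ctl(ep, EPOLL_CTL_ADD, lsock, &ev);

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lsock) {            /* new connection: just register it */
                int c = accept(lsock, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = c };
                epoll_ctl(ep, EPOLL_CTL_ADD, c, &cev);
            } else {                      /* readable client: echo back */
                char buf[512];
                ssize_t r = read(fd, buf, sizeof buf);
                if (r <= 0) close(fd);    /* closed fds leave epoll automatically */
                else write(fd, buf, r);
            }
        }
    }
}
```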
4-8kB? Maybe physical memory overhead, if your code isn't too deep. But userspace thread stacks are anywhere from 128kB (FreeBSD) to 8MB (Linux) of virtual memory overhead.
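Easy to check for yourself (untested sketch; behavior varies by libc -- some report 0 from a fresh attr, meaning "use the process default"):

```c
/* Untested sketch: print the default pthread stack reservation and
 * start a thread with a deliberately small one. On glibc the default
 * usually tracks RLIMIT_STACK (often 8MB of virtual address space);
 * only pages actually touched become physical memory. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) { (void)arg; return NULL; }

int main(void) {
    pthread_attr_t attr;
    size_t size;

    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &size);
    printf("default stack reservation: %zu bytes\n", size);

    /* Ask for a much smaller stack; PTHREAD_STACK_MIN is the floor. */
    pthread_attr_setstacksize(&attr, 128 * 1024);

    pthread_t t;
    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```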
Depends on what libc (or other) routines you call... going over the end of the stack is no fun. Lots of code seems to be written to rely on deep stacks in userspace.