Tokio's focus is on low tail-latencies for networking applications (as mentioned). But it doesn't employs yield_now for waiting on a concurrent condition to occur, even as a backoff strategy, as that fundamentally kills tail-latency under the average OS scheduler.
> Green threads give you all the advantages of async
They require more memory over stackless coroutines as it stores the callstack instead of changing a single state. They also allow for recursion, but its undelimited meaning you either 1) overrun the guard page and potentially write to another Green thread's stack by just declaring a large local variable 2) enable some form of stack-probing to address that (?) or 3) Support growable stacks which requires a GC to fixup pointes (isn't available in a systems lang).
> green threads should run faster, as storing state on a stack is generally faster than malloc.
Stackless coroutines explicit don't malloc on each call. You only allocate the intial state machine (stack in GreenThread terms).
> The primary objection seems to be speed
It's compatibility. No way to properly set the stack-size at compile time for various platforms. No way to setup guard pages in a construct that's language-level so should support being used without an OS (i.e. embedded, wasm, kernel). The current async using stackless coroutines 1) knows the size upfront due to being a compiler-generated StateMachine 2) disallows recursion (as that's a recursive StateMachine type, so users must dynamically allocate those however appropriate) which works for all targets.
> They require more memory over stackless coroutines as it stores the callstack instead of changing a single state.
True, but in exchange you don't have to fight the borrow checker because things are being moved from the stack. And the memory is bounded by the number of connections you are serving. The overheads imposed by each connection (TCP Windows, TLS state, disk I/O buffers) are likely larger than the memory allocated to the stack. In practice on the machines likely to be serving 1000's of connections, it's not going to be a concern. Just do the arithmetic. If you allowed a generous 64KB for the stack, and were serving 16K connections, it's 1GB of RAM. A Raspberry PI could handle that, if it wasn't crushed by the 16K TCP connections.
> They also allow for recursion, but its undelimited meaning you either 1) overrun the guard page and potentially write to another Green thread's stack by just declaring a large local variable 2) enable some form of stack-probing to address that (?) or 3) Support growable stacks which requires a GC to fixup pointes (isn't available in a systems lang).
All true, but also true for the main stack. Linux solved it by using 1MB guard area. On other OS's gcc generates probes if the frame size exceeds the size of the guard area. Lets say the guard area is 16KB. Yes, that means any function having than 16KB of locals needs probes - but no function below that does. Which in practice means they are rarely generated. Where they are generated, the function will likely be running for a long time anyway because it takes a while to fill 16KB with data, so the relative impact is minimal. gcc allows you to turn such probes off for embedded applications - but anybody allocating 16KB on the stack in an embedded deserves what they get.
And again the reality is a machine that's serving 1000's of connections is going to be 64bit, and on a 64bit machine address space is effectively free so 1MB guard gaps, or even 1GB gaps aren't a problem.
> No way to properly set the stack-size at compile time for various platforms.
Yet, somehow Rust manages that for it's main stack. How does it manage that? Actually I know how - it doesn't. It just uses whatever the OS gives it. On Windows that's 1MB. 1000 1MB stacks is 1GB. That's 1GB of address range, not memory. Again, not a big deal on a modern server. On embedded systems memory is more constrained, of course. But on embedded systems the programmer expects to be responsible for the stack size and position. So it's unlikely to be a significant problem in the real world. But if does become a problem because your program is serving 10 of 100's of concurrent connections, I don't think many programmers would consider fine tuning the stack size to be a significant burden.
> No way to setup guard pages in a construct that's language-level so should support being used without an OS (i.e. embedded, wasm, kernel).
There is no way to set up the main stack without the kernel's help, and yet that isn't a problem? That aside are you really saying replacing a malloc() with mmap() with the right flags is beyond the ken of the Rust run time library authors? Because that is all it takes. I don't believe it.
> The current async using stackless coroutines 1) knows the size upfront due to being a compiler-generated StateMachine 2) disallows recursion (as that's a recursive StateMachine type, so users must dynamically allocate those however appropriate) which works for all targets.
All true. You can achieve a lot by moving the burden to the programmer. I say the squawks you see about async show that burden is considerable. Which would be fine I guess, if there was a large win in speed, or run time safety. But there isn't. The win is mainly saving on some address space for guard pages, for applications that typically run on 64bit machines where that address space address space is effectively an unlimited resource.
The funny thing is, as an embedded programmer myself who has fought for memory I can see the attraction of async being more frugal than green threads. A compiler that can do the static analysis to calculate the stack size a number of nested calls would use, set the required memory aside and then general code that so all the functions use it instead of the stack sounds like it could be really useful. It certainly sounds like an impressive technical achievement. But it's also true I've never had it before, and I've survived. And I struggle to see it being worth the additional effort it imposes outside of that environment.
Preemption simulates localized concurrency (running multiple distinct things logically at the same time) not parallelism (running them physically at the same time). You can have concurrency outside continuations. OS threads for example are not continuations, but still express concurrency to the kernel so that it can (not guaranteed) express concurrency to the physical CPU cores which hopefully execute the concurrent code in parallel (not guaranteed again, due to hyperthreading).
getaddrinfo() is a synchronous function that can do network requests to resolve DNS. The network property isn't reflected in its function signature becoming async. You can have an async_getaddrinfo() which does, but the former is just a practical example of network calls in particular being unrelated to function coloring.
> The main problem is that tokio futures are 'static
An important distinction to make is that tokio Futures aren't 'static, you can instead only spawn (take advantage of the runtime's concurrency) 'static Futures.
> This implies that you can't statically guarantee that a future is cleaned up properly.
Futures need to be Pin'd to be poll()'d. Any `T: !Unpin` that's pinned must eventually call Drop on it [0]. A type is `!Unpin` if it transitively contains a `PhantomPinned`. Futures generated by the compiler's `async` feature are such, and you can stick this in your own manually defined Futures. This lets you assume `mem::forget` shenanigans are UB once poll()'d and is what allows for intrusive/self-referential Future libraries [1]. The future can still be leaked from being kept alive by an Arc/Rc, but as a library developer I don't think you can/would-care-to reasonably distinguish that from normal use.
Tokio and glommio using interrupts is ironically another misconception. They're cooperatively scheduled so yes, a misbehaving blocking task can stall the scheduler. They can't really interrupt an arbitrary stackless coroutine like a Future due to having nowhere to store the OS thread context in a way that can be resumed (Each thread has its own stack, but now it's stackful with all the concerns of sizing and growing. Or you copy the stack to the task but now have somehow to fixup stack pointers in places the runtime is unaware).
The cost of switching goroutines, rust Futures, Zig async Frames, or fibers/userspace-tasks in general is on the other of a few nano-seconds whereas it's in the micro-second range for OS threads. This allows you to spawn tons of tasks and have them all communicate with each other very quickly (write to queue; push receiver to scheduler runqueue; switch out sender; switch to receiver) whereas doing so with OS threads would never scale (write to queue; syscall to wake receiver; syscall to switch out sender). Any highly concurrent application (think games, simulations, net services) uses userspace/custom task scheduling for similar reasons.