Re: async, Rust is a down-to-the-metal language, i.e. library yes, runtime no. Async implementations are all either runtimes (JavaScript) or libraries that implement a runtime (Python). Under the circumstances I think it's fair that Rust has less-than-ideal async.
> I still like threads, but then I'm old and uncool.
Hey, I resemble that remark.
But I disagree with it. Green threads give you all the advantages of async, but with fewer of the hairy bits. In particular, no special syntax or change of programming style is required. Yet underneath, green threads and async are just different styles of event-driven I/O, so both run at similar speeds and excel at the same tasks. (Actually, green threads should run faster, as storing state on a stack is generally faster than malloc.)
I have no idea why Rust abandoned green threads in favour of async. Actually, that's a partial lie - there have been far too many words spent on explaining why. The problem is that the reasons given look to be caused by design decisions made in their particular implementation. The primary objection seems to be speed. The current async is indeed faster than their old green thread implementation. But that was caused by their choosing to avoid coloured code in their green threads (maybe they were copying Go?). Other objections were similarly to do with the implementation they threw away, not with green threads themselves.
> Green threads give you all the advantages of async
They require more memory than stackless coroutines, as they store the call stack instead of updating a single state value. They also allow for recursion, but it's unbounded, meaning you either 1) overrun the guard page and potentially write to another green thread's stack just by declaring a large local variable, 2) enable some form of stack probing to address that (?), or 3) support growable stacks, which requires a GC to fix up pointers (not available in a systems language).
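To make 1) concrete, a rough sketch of the failure mode - the stack and guard sizes are invented for the example, and it assumes the compiler emits no stack probes:

    // Suppose this runs on a green thread with a 16KB fixed stack and a
    // single 4KB guard page below it (both sizes made up for illustration).
    fn handle_request() -> u8 {
        // Entering the function moves the stack pointer down by ~64KB in one
        // jump, landing well past a 4KB guard page.
        let mut buf = [0u8; 64 * 1024];
        // Without stack probes, this first store can land in whatever happens
        // to be mapped below the guard - possibly another green thread's
        // stack - instead of faulting.
        buf[0] = 1;
        buf[0]
    }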
> green threads should run faster, as storing state on a stack is generally faster than malloc.
Stackless coroutines explicitly don't malloc on each call. You only allocate the initial state machine (the stack, in green-thread terms).
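As a rough illustration of what "only the initial state machine" means in today's Rust (a minimal sketch; the exact size printed depends on the compiler):

    // An `async fn` compiles to an anonymous state-machine type. Locals that
    // live across an `.await` become fields of that type, so its size is
    // fixed at compile time.
    async fn step(x: u32) -> u32 {
        x + 1
    }

    async fn task() -> u32 {
        let a = step(1).await; // `a` is stored in the generated state machine
        let b = step(a).await;
        a + b
    }

    fn main() {
        // Creating the future allocates nothing; the whole "stack" of the
        // coroutine is this one value.
        let fut = task();
        println!("state machine size: {} bytes", std::mem::size_of_val(&fut));
        // An executor would poll `fut` to completion; spawning it onto a
        // multi-task runtime typically boxes it exactly once.
    }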
> The primary objection seems to be speed
It's compatibility. No way to properly set the stack size at compile time for various platforms. No way to set up guard pages in a construct that's language-level and so should support being used without an OS (i.e. embedded, wasm, kernel). The current async, using stackless coroutines, 1) knows the size upfront, due to being a compiler-generated state machine, and 2) disallows recursion (as that would be a recursive state machine type, so users must dynamically allocate those however is appropriate), which works for all targets.
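The recursion point in concrete terms (a sketch of the usual workaround): a directly recursive `async fn` would have to embed its own state machine inside itself, so the compiler rejects it unless the programmer boxes the recursive step explicitly:

    use std::future::Future;
    use std::pin::Pin;

    // A directly recursive `async fn countdown(...)` won't compile, because
    // the generated state machine would contain itself and have no finite
    // size. Boxing makes the allocation explicit and the size finite.
    fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()> + Send>> {
        Box::pin(async move {
            if n > 0 {
                countdown(n - 1).await; // the programmer chose where to allocate
            }
        })
    }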
> They require more memory than stackless coroutines, as they store the call stack instead of updating a single state value.
True, but in exchange you don't have to fight the borrow checker because things are being moved off the stack. And the memory is bounded by the number of connections you are serving. The overheads imposed by each connection (TCP windows, TLS state, disk I/O buffers) are likely larger than the memory allocated to the stack. In practice, on the machines likely to be serving 1000s of connections, it's not going to be a concern. Just do the arithmetic: if you allowed a generous 64KB for the stack and were serving 16K connections, that's 1GB of RAM. A Raspberry Pi could handle that, if it wasn't crushed by the 16K TCP connections.
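The arithmetic, spelled out with the same numbers:

    fn main() {
        let stack_per_connection: u64 = 64 * 1024; // a generous 64KB stack
        let connections: u64 = 16 * 1024;          // 16K concurrent connections
        let total = stack_per_connection * connections;
        // Prints 1073741824 bytes, i.e. exactly 1GB.
        println!("{} bytes = {} GB", total, total / (1024 * 1024 * 1024));
    }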
> They also allow for recursion, but it's unbounded, meaning you either 1) overrun the guard page and potentially write to another green thread's stack just by declaring a large local variable, 2) enable some form of stack probing to address that (?), or 3) support growable stacks, which requires a GC to fix up pointers (not available in a systems language).
All true, but also true for the main stack. Linux solved it by using a 1MB guard area. On other OSes, gcc generates probes if the frame size exceeds the size of the guard area. Let's say the guard area is 16KB. Yes, that means any function with more than 16KB of locals needs probes - but no function below that does, which in practice means they are rarely generated. Where they are generated, the function will likely be running for a long time anyway, because it takes a while to fill 16KB with data, so the relative impact is minimal. gcc allows you to turn such probes off for embedded applications - but anybody allocating 16KB on the stack in an embedded system deserves what they get.
And again, the reality is that a machine serving 1000s of connections is going to be 64-bit, and on a 64-bit machine address space is effectively free, so 1MB guard gaps, or even 1GB gaps, aren't a problem.
> No way to properly set the stack size at compile time for various platforms.
Yet somehow Rust manages that for its main stack. How does it manage that? Actually, I know how - it doesn't. It just uses whatever the OS gives it. On Windows that's 1MB. 1000 1MB stacks is 1GB. That's 1GB of address range, not memory. Again, not a big deal on a modern server. On embedded systems memory is more constrained, of course, but on embedded systems the programmer expects to be responsible for the stack size and position. So it's unlikely to be a significant problem in the real world. But if it does become a problem because your program is serving 10s or 100s of thousands of concurrent connections, I don't think many programmers would consider fine-tuning the stack size to be a significant burden.
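For the record, that fine-tuning already amounts to one builder call per thread in std (a sketch; the 64KB figure and the worker's job are placeholders):

    use std::thread;

    fn main() {
        let handle = thread::Builder::new()
            .name("connection-worker".into()) // placeholder name
            .stack_size(64 * 1024)            // 64KB instead of the OS default
            .spawn(|| {
                // ... serve one connection here ...
            })
            .expect("failed to spawn worker thread");
        handle.join().unwrap();
    }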
> No way to set up guard pages in a construct that's language-level and so should support being used without an OS (i.e. embedded, wasm, kernel).
There is no way to set up the main stack without the kernel's help, and yet that isn't a problem? That aside, are you really saying that replacing a malloc() with an mmap() with the right flags is beyond the ken of the Rust runtime library authors? Because that is all it takes. I don't believe it.
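For the doubters, roughly what "mmap with the right flags" looks like - a sketch only, assuming the `libc` crate, a POSIX target, and 4KB pages, with error handling reduced to asserts:

    /// Reserve `size` bytes of stack plus one guard page below it.
    fn alloc_guarded_stack(size: usize) -> *mut u8 {
        let page = 4096; // assumed page size for the sketch
        unsafe {
            let base = libc::mmap(
                std::ptr::null_mut(),
                size + page, // one extra page for the guard
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            );
            assert!(base != libc::MAP_FAILED, "mmap failed");
            // Strip all permissions from the lowest page: a stack overflow
            // faults here instead of silently scribbling on a neighbour.
            assert_eq!(libc::mprotect(base, page, libc::PROT_NONE), 0);
            // Usable stack memory starts just above the guard page.
            (base as *mut u8).add(page)
        }
    }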
> The current async, using stackless coroutines, 1) knows the size upfront, due to being a compiler-generated state machine, and 2) disallows recursion (as that would be a recursive state machine type, so users must dynamically allocate those however is appropriate), which works for all targets.
All true. You can achieve a lot by moving the burden to the programmer. I'd say the squawks you see about async show that burden is considerable. Which would be fine, I guess, if there was a large win in speed or run-time safety. But there isn't. The win is mainly saving some address space for guard pages, for applications that typically run on 64-bit machines where address space is effectively an unlimited resource.
The funny thing is, as an embedded programmer myself who has fought for memory, I can see the attraction of async being more frugal than green threads. A compiler that can do the static analysis to calculate the stack size a series of nested calls would use, set the required memory aside, and then generate code so all those functions use it instead of the stack sounds like it could be really useful. It certainly sounds like an impressive technical achievement. But it's also true that I've never had it before, and I've survived. And I struggle to see it being worth the additional effort it imposes outside of that environment.
I still like threads, but then I'm old and uncool.