Some notes: - C++20 coroutines are *stackless*, meaning that multiple coroutines...

_huayra_ · on Feb 22, 2021

To clarify what stackless means when you're writing code:

* calling into the coroutine from the caller uses the caller's stack (i.e. it just pushes on another stack frame). * the lack of a coroutine stack ("stackful" coroutine) means that the coroutine can only "yield" a value from a single method; it cannot call a function F, and then have F yield back to the original coroutine's caller. * In other words: you can think of it like a "coroutine with one stack frame for coroutine yielding semantics"

The compiler does some magic to slice up the coroutine into segments (between co_ statements) and stores state that must persist between these segments in a coroutine frame (in addition to the "slice" that the coroutine should continue at once it is called again).

The real tricky part is the lack of library support. From what I've seen, it seems like a naming convention is all that defines what various coroutine functions are, e.g. `promise` or `promise_type`. This is very much like iterators, which can be written via structural subtyping: anything that has the special static type values and/or methods can be used as one.

mazieres · on Feb 22, 2021

You are right that you are effectively stuck in a single coroutine (non-stack) frame. But you can chain multiple such coroutine frames, because one coroutine can co_await another.

_huayra_ · on Feb 22, 2021

Thanks for this great article! I've been circling my way around understanding coroutines and this really put it in place.

I think this co_awaiting is the most confusing part for folks: in most languages with stackful coroutines, it makes sense how one can

1. call "co_await" on a coroutine at a high-level API boundary 2. have that coroutine execute normal code all the way down to some expensive part (e.g. I/O) 3. at that particular point "deep" down in the expensive parts, that single point can just call "yield" and give control all the way back to the user thread that co_awaited at 1, usually with some special capability in the runtime.

I believe the way you can do this in C++20 concepts is to co_yield a promise all the way back to the originating "co_await" site, but I may be confused about this still...

It's totally clear to me why they didn't choose this for C++: keeping track of some heap-allocated stack frames could prove unwieldy.

I wish more effort went into explaining and promoting coroutines. Right now the advice seems to be "be brave and DIY, or just use cppcoro until we figure this out".

BenFrantzDale · on Feb 22, 2021

Also, if it’s not clear, within a coroutine, you can call any function (or coroutine) you want. It’s just that to co_yield, you have to be in the coroutine not deep in the stack.

captainmuon · on Feb 22, 2021

Isn't it like that is most languages? I'm thinking about Python, C#, JS. If you call a blocking function from an async function, you cannot yield from deep inside the blocking function.

Why is this a big deal in C++? Am I missing anything that makes c++ coroutines less powerful than other mainstream solutions? Or are people comparing its power with e.g. Lisp or go?

BenFrantzDale · on Feb 22, 2021

It's a big deal because, while it has some downsides, being stalkless means they can have next to no overhead, meaning it can be performant to use coroutines to write asynchronous code for even very fast operations. The example given https://www.youtube.com/watch?v=j9tlJAqMV7U&t=13m30s is that you can launch multiple coroutines to issue prefetch instructions and process the fetched data, so you can have clean code that issues multiple prefetches and process the results. Whereas in Python (don't get me wrong, I love Python) you might use a generator to "asynchronize" slow operations like requesting and processing data from remote servers, C++ coroutines can be fast enough to asynchronously process "slow" operations like requesting data from main memory.

mazieres · on Feb 23, 2021

Wow, that talk is a fantastic link. He actually gets negative overhead from using coroutines, because the compiler has more freedom to optimize when humans don't prematurely break the logic into multiple functions.

gpderetta · on Feb 22, 2021

All those languages have also stackless coroutines. Notably Lua and Go have stackful coroutines.

It is sort of a big deal because the discussion of wether adding stackful or stackless coroutines in C++ was an interminable and very visible one. Stackless won.

captainmuon · on Feb 22, 2021

I'm trying to wrap my head around what the implication of stackless coroutines is.

Can I use them like `yield` in Python+Twisted, i.e. to implement single-threaded, cooperative parallelism? It would not expect to be able to plainly call a function with a regular call, and have that function `yield` for me - but can I await a function, which awaits another, and so on?

As far as I understand, C++20 coroutines are just a basis and you can build a library on top of it. It is possible to build single-threaded async code (python or JS style) as well as multi-threaded async code (C# style where a coroutine can end up on a different context), right?

Is there anything like an event loop / Twisted's `reactor` already available yet?

I'm really looking forward to rewrite some Qt callback hell code into async...

nly · on Feb 22, 2021

Stackless means when you resume a coroutine you're still using the same OS thread stack. Coroutine contexts/activation records are conceptually heap allocated (although in some cases that can be optimized away).

You can use coroutines for what you say, but there are no execution contexts (like a thread pool or an event loop) in C++20 standard library in which to execute coroutines, so to do networking and async I/O you need to use a library or DIY. Hopefully standard execution will come later as part of the Executors proposal.

You can currently use C++ native coroutines with the ASIO library, but this is probably subject to quite a bit of API churn in the future:

https://www.boost.org/doc/libs/1_75_0/doc/html/boost_asio/ov...

You can also wrap things like ASIO yourself. I did this in 2018 when I was learning about C++ coroutines to create a simple telnet based chatroom:

https://github.com/heavenlake/coro-chat/blob/master/chat.cpp

Note that this code is likely garbage by todays standards. One thing i can't remember is why i needed to use the cppcoro library in the end.

mazieres · on Feb 22, 2021

> Stackless means when you resume a coroutine you're still using the same OS thread stack

This is confusing, because it begs the question "same as what?" In fact, you can migrate a coroutine across threads, or even create a thread specifically for the purposes of resuming a coroutine.

But I suppose it is true that from the point at which you resume a coroutine to the point at which it suspends itself, it will use a particular stack for any function calls made within the coroutine. That's why you can't suspend a coroutine from within a function call, because the function call will use the stack.

gpderetta · on Feb 22, 2021

C++, C#, Python coroutines are all stackless and pretty much equivalent. Lua, Go have stackful coroutines (i.e. one-shot continuations). There are libraries for stackful coroutines in C++ of course.

There is a proposal for executors that would add event loops to C++. In the meantime boost.Asio (also available outside of boost) is one of the best options (the standard proposal was originally based on Asio, and Asio is tracking the standard proposal closely).

hctaw · on Feb 22, 2021

> but can I await a function, which awaits another, and so on?

Whether a coroutine is stackful or stackless is largely an implementation detail that has some tradeoffs either way, in either case coroutine semantics can allow you to write efficient asynchronous code imperatively or do your callback-to-async transformation.

gpderetta · on Feb 23, 2021

it is not an implementation detail at all. The syntax and semantics are very different. Stackfull coroutines allow suspending more than one activation frame (i.e. a subset of the callstack) in one go. You can of course "emulate" it with stack less coroutines by converting each frame in the call stack to a coroutine, but it is a manual and intrusive process.

hctaw · on Feb 23, 2021

I think if your syntax and semantics imply the stack-ness of your coroutine implementation then you you have a language design problem. Coroutines are subroutines with two additional primitives (resume after call, and yield before return). Whether `yield`/`resume` imply stack swapping or a state transition between activation frames is just an implementation detail. Both have tradeoffs, to be sure.

gpderetta · on Feb 26, 2021

> I think if your syntax and semantics imply the stack-ness of your coroutine implementation then you you have a language design problem

A lot of languages, including C#, python, C++, rust and many others specify the stackless-ness of coroutines. So it is not just an implementation detail. You could argue that specifying these semantics is a problem, and I would actually agree, but obviously it is not something many, if not most, language designers agree.

jcelerier · on Feb 22, 2021

> I'm really looking forward to rewrite some Qt callback hell code into async...

last time I tried it was pretty easy to make C++ coroutines fit into Qt's event loop.

hctaw · on Feb 22, 2021

Are C++20 coroutines allowed to be recursive? Or does recursing require boxing?

For a stackless coroutine the compiler normally has to build a state machine to represent the possible execution contexts, but if the state machine includes itself then it has indeterminate size. Normally you solve this by boxing the state machine at yield points and using dynamic dispatch to call or resume a pending coroutine - which may be significantly slower than using a stackful coroutine to begin with (in which case, the stack is allocated on the heap up front, leading to a higher price to spawn the coroutine, but lower to resume or yield).

c-cube · on Feb 22, 2021

I'm curious. Intuitively, a non-recursive coroutine would yield a state machine that only goes "forward". If you add tail recursion into the mix, you could have cycles in the state machines (going back to the begin state), correct? Of course non-tail recursion would not work within a single frame.

hctaw · on Feb 22, 2021

Yes, a tail recursive coroutine could reuse its previous frame context across yields.

With a state machine transform the `resume()` method on a coroutine is a state transformation, it doesn't necessarily know what is "forward" or "backward" in the control flow graph. There are some tricky bits though, since tail recursive functions can have multiple exits but single entries. A recursive coroutine might have multiple exits and multiple entries, so it's not always clear what is "forward" and what is "backward."

BenFrantzDale · on Feb 23, 2021

The state is allowed to be heap-allocated but can be optimized onto the stack. But if it calls recursively without control flowing back out again, then I’d think the nested state could live on the stack so long as the compiler knows the nested state never has to be held across yield/resume.

gpderetta · on Feb 22, 2021

> These yield contexts are not stack frames though. Every coroutine you invoke from a coroutine has its own context that can outlive the caller.

Well, they are obviously not stack frames because they do not follow a stack discipline, but they certainly are activation frames. I guess that's the point you were trying to make?

nly · on Feb 22, 2021

dieters · on Feb 22, 2021

There's none in the standard library, but e.g. Seastar already has coroutine support implemented for its future<>s and it works really well - the code looks clearer and in many cases it reduces the number of allocations to 1 (for the whole coroutine frame).

snarfy · on Feb 22, 2021

It seems really dumb that they are stackless. If you are saving/restoring the stack pointer anyway in your yield routine it's trivial to set it to a block of memory you allocated in the initial coroutine creation.

Is there no setjmp/longjmp happening? Are C++ 20 coroutines all just compiler slight-of-hand similar to duff's device with no real context switching?

account4mypc · on Feb 22, 2021

> It seems really dumb that they are stackless

Why? C/C++ already has stackful coroutines. And that seems extremely wasteful unless you know you'll run the coroutines at the same time... with single threaded stackful coroutines, you'd get two stacks and only ever use one at a time. that wastes megabytes, requires heavier context switching, makes cache misses more likely, etc.

hctaw · on Feb 22, 2021

Modern stackful coroutines don't (or shouldn't) use context switching (at least not anything like ucontext_t, just use threads if you're going to spend the cycles to preserve the signal mask) or setjmp/longjmp. Those tricks are slow and hazardous, especially when code needs to cross language boundaries or remain exception safe.

gpderetta · on Feb 22, 2021

The compilation model is implementation defined, but yes, the expectation is that they compile down to a switch-like explicit state machine.

The advantage is that the worst case frame size is bounded, small and computable at compiletime.

Stackfull coroutines are of course more powerful and have, IMHO, better ergonomics.