Hacker News
C++ Coroutines (nigeltao.github.io)
160 points by todsacerdoti on Feb 22, 2023 | 120 comments



> Asynchronous code (e.g. callbacks) is more efficient (letting you do other work while waiting for things) but is also more complicated (manually saving and restoring state).

This is a persistent misunderstanding about async... it is not generally more efficient than non-async code. In fact, the overhead of async may slow things down. (There can be other downsides as well. E.g. debugging may be more difficult.)

For it to help, you need to have something better for the current thread to be doing than waiting for whatever it is you're waiting for.

Whatever that is, another worker thread could be doing it, so what we're really talking about is that async may let you reduce the number of threads you're using.

(Async is so useful in JS because it lets you reduce the number of threads to one, which lets you sidestep all kinds of hairy issues.)

That might be good enough, depending on your situation, to make overcoming the downsides of async worthwhile. But it may very well not be.

Anyway: don't use async because it's "more efficient".


You are leaving out an important aspect: context switches are expensive, especially with all the mitigations introduced after Spectre et al.

Doing more work in a single thread saves a lot of context switches, especially if IO makes up a large portion of the workload. And even more so with other mechanisms to minimize syscalls, like io_uring on Linux.

Async overhead is highly dependent on the way the async runtime is implemented.

It's true that async requires scheduling and task switching to be implemented in userspace, but assuming that async is used as an alternative to threads, then the OS would be doing a similar amount of work anyway.

The OS may be able to do that more efficiently because it has more information about the system. Or it may be the opposite, because the application has more information about the workloads.

"async = overhead" as a general statement is not correct.


Thread context switches are about scheduling threads on physical cores. Async is about executing in a scope on a thread. There's no direct conversion here.

Packing a thread with async tasks could let you reduce the number of threads you have, which will probably reduce thread context switching if there is contention for cores. Whether that's significantly better than the cost of async context switches really depends on specifics (even if async context switch cost is near zero, though we're no longer talking about C++ coroutines in that case).

But keep in mind: if your threads are mostly waiting, they aren't triggering a lot of context switches. More async means fewer, busier threads; less async means more, less busy threads. To run into a problem with less async, your threads need to be forcing switching at a pretty fine-grained level. (BTW, most blocking for IO isn't fine-grained from the perspective of instructions executing on a core.) That kind of thing happens, but it has solutions other than async.


Context switches happen regardless of whether you're using kernel mode ("threads") or user mode ("async") scheduling.

And kernel mode actually has much more efficient and performant context switches, because the kernel can access hardware directly.

(It is sometimes useful to be able to do custom user mode scheduling, but certainly not because of "context switches".)


Do you have a citation for kernel mode having more efficient context switches? What kind of direct hardware access are you referring to that would be better than pushing the register context onto the stack?

In my experience, the exact opposite is true, particularly in the era of CPU mitigations that require TLB flushes upon every kernel-mode context switch.


You're right, kernel-level context switching is much slower than user-level context switching.

User-level can also have the advantage of having more actual context about the task that is running, meaning that it's often able to avoid saving/restoring as much data as a kernel-level switch would. See Go's green threads for a great example of this kind of cooperation between runtime and language.

> Do you have a citation for kernel mode having more efficient context switches? What kind of direct hardware access are you referring to that would be better than pushing the register context onto the stack?

The closest thing to this that I can think of is 32-bit x86, which did have hardware-assisted context switching via the TSS (task-state segment).

As it happens, everybody stopped using it because it was too slow, and a bit painful unless you fully bought into x86's awful segmentation model. Early Linux kernels use it if you want to see it in action.


I don't quite follow your argument there.

This is unrelated to kernel threading.

If you have 1 thread handling 1000 requests with some async IO mechanism (epoll, io_uring, ...), instead of 1000 threads each handling one request, there are far fewer threads fighting over CPU cores and the 1 thread can stay active much longer, hence reducing the amount of context switches.

Especially with a mechanism like io_uring, which helps minimize syscalls (and hence switching to kernel threads).


False. Both epoll and thread switching use the exact same kernel scheduling mechanisms under the hood.

(When the kernel decides which thread gets to run it's doing the equivalent of an epoll call.)


Asynchronous execution is more efficient if and only if the time spent waiting for a call to complete is longer than the time spent checking if the call has completed plus the time spent dispatching to another asynchronous call (and waiting vs checking if it is completed, and so on).

Remember that concurrency is not parallelism, it's not about the number of threads being used.


> Anyway: don't use async because it's "more efficient".

I don't think efficiency has only one meaning. For instance, you can create millions of coroutines while you're limited to only thousands of threads. Doesn't that mean your program is making better use of your hardware and thus is more efficient?


No, your processor can only do n things at a time, where n is the number of cores (probably like 8) (simplifying hugely of course). It's pretty easy to fill out 8 cores with 1000s of threads, the coroutines only become useful when almost all of them are blocking on a socket or something, periodically switching on and off to handle it.


Yes, but contexts where you're creating a million coroutines are almost certainly I/O bound (like a web server).


Sure, they can be more efficient in particular situations.

But coroutines have overhead (so do threads). Your millions of coroutines might be worse than your thousands of threads in terms of making efficient use of the cpu cores your app or service has available.


You can actually create millions of threads. Threads are just coroutines managed by the kernel after all.


If you can create millions of threads with some hardware, then you can create billions of coroutines on that hardware. It's just a fact that you can create 2-3 orders of magnitude more coroutines than you can kernel threads on any given hardware.


Coroutines and threads are not that different.

Threads usually preallocate some stack space, while coroutines usually don't. But that isn't such a big deal.


That's not the only difference. Coroutines also typically don't have nearly as many concurrency hazards because you can control when you yield.


That's an implementation detail.

Pthreads work differently, but historically kernel-mode cooperative threads were also popular at one time.

(We decided to not go there because cooperative parallelism sucks balls.)


coroutines can be either cooperative or not, depending on implementation, so it depends


And so can threads.


Coroutines also don’t need to flush the TLB when they get scheduled. This is a pretty big deal for performance, when you have a lot of them.


Thread context switches don't flush the TLB either. This is an important performance optimisation in the kernel context switch code.


Agreed in general, that async code is generally less efficient than epoll/kqueue/poll/select/etc. non-blocking I/O. However, note that async code is often quite a bit more efficient than naive Java 1.3-style (pre-NIO) tons-of-blocked-threads code.

Now, io_uring async code is yet another story.

I was working at LimeWire when Apple finally shipped a 1.4 JVM as their system JVM. The whole thing got much lighter weight when we could use non-blocking I/O.


Eh? Async code almost always uses the APIs you describe.


The async runtime is almost always built on top of non-blocking I/O APIs, and the extra layer of abstraction adds overhead. You're rarely calling epoll in your async code, but rather your runtime is handling all of that for you.

That is, unless you're using io_uring, POSIX AIO, etc., in which case your async code might not have the extra layer of abstraction on top of the system APIs.


io_uring isn't typically used directly; you're most likely using liburing, which is already an abstraction. I don't really see your point.


But a single thread with async can be, and for many tasks is, more efficient than a single thread without async, because you aren't waiting on requests.

One point also is that until recently most personal computers did not have multiple cores, aka you could think at the lowest level that even multiple threads were just more dangerous async lol

Even making new threads in the JVM isn’t guaranteed to result in more than a single OS thread.


> One point also is that until recently most personal computers did not have multiple cores,

The Core 2 Duo was released 17 years ago.


Dual core started coming out ~2005. It took a few more years for it to dominate the product lineup. Computers tend to have a replacement cycle of ~5 years. Consequently, it's not until about 2012-ish that you could reasonably expect a personal computer to have a multicore processor.

(Yeah, that's still a decade ago, but that's a good deal more recent than 17 years.)


A decade is a long time for computers


There was plenty of async JavaScript code and multithreaded Java code out there in 2013, I promise


There is another use case for async in client-side JavaScript that often gets overlooked: modeling the flow of UI.

So, let's say you need to show a "dialog" which, upon exit, should execute a piece of code (e.g. to show another piece of UI). Maybe that dialog can be invoked from multiple places, and the code to execute (after the dialog) changes accordingly.

You could do that via callbacks, but that is not very composable, and is generally messy.

Better solution: a Promise which gets resolved when the user exits the dialog! Composing multiple pieces of UI then becomes equivalent to composing multiple (async) function calls, you can pass output from one as the input to another, use exceptions to "circuit break" the UI flow, etc.


I'm not convinced of the dialog use-case. Seems rather sketchy to await a user to click OK. What happens when you add another button or two to the dialog?


In this model, you wouldn't await the user clicking OK, but exiting the modal. The promise only resolves "user took a decision", not the actual choice they took.

The result then contains whether the user clicked OK, Cancel, [X] to close the modal, or any other state transition that you may have added to that dialog. The recipient (not necessarily the caller that initiated the dialog, in case you chain the responses) can then act accordingly and presumably triggers a different action on OK than on the other state transitions.


You return a different result (i.e. the Promise resolves to a different result). The calling code can then decide what to do with it.


Why implement coroutines when you have multithreading?

Coroutines are to me a brain-twisty way to make the most out of a single core, because your language doesn't support multithreading. For example: JavaScript with a single thread in the browser, Python with the GIL.

C++ has threads which ACTUALLY run in parallel on the CPU. Why bother complicating the language further?


Because native multithreading is expensive when it comes to (a) memory consumption per thread of execution and (b) context switching. Also, if your program spends a lot of time just waiting on outside resources (most notably slow/blocking IO), coroutine concurrency is both faster and more lightweight. If your program is CPU bound, then coroutines/async have little-to-no value over native multithreading.

That being said, yeah, most async workflows are obtuse and difficult to follow and they force all consumers of your interface to also use the async primitives if they want the benefits. Go seems to be one of the few languages that does async/coroutines right.

See: https://eli.thegreenplace.net/2018/go-hits-the-concurrency-n...


What I like about Go is that each co-routine is entirely synchronous, and the only concurrency primitives you're given are channels and mutexes. This makes your code so simple and easy to understand.

I just looked up Kotlin's co-routines, and there are a number of annoying hoops you need to jump through:

1. Co-routines can only be built by calling runBlocking or coroutineScope.

2. Functions that are async but should act like blocking calls in a co-routine need to be marked with `suspend`

3. There are a bunch of primitives you need to use like `launch` or `async`.

4. There are built-in options to make the co-routines lazy, but it makes the API feel sloppy.

5. Rather than locks, Kotlin encourages explicit thread dispatching. Sometimes you need to do this in Go too, like when using a C library, but in general it's rare.


The other consequence, however, is that Go coroutines only work with other Go code, and bespoke Go threads and stacks require the runtime to jump through a lot of hoops to do FFI correctly, with the result that the ecosystem tends to rewrite rather than reuse existing code in C and other languages.

In contrast, the approach with explicit promises and await can be mapped all the way down to a C struct with a function and a data pointer representing a callback. That is, it is C ABI compatible, and thus it can be easily composed across any boundary that respects that ABI.


The only runtime hoop I've ever had to jump through was to invoke runtime.LockOSThread() inside a go-routine. In a lot of ways:

    go func() {
        runtime.LockOSThread()
        // ...
    }()
Is simpler than creating a thread in other languages.


The runtime jumps through those hoops for you as needed, mostly, but there are consequences to that wrt performance.


`launch` is just the `go` keyword. `async` stores the result in a `Deferred` (or `Promise`), which is an abstraction for parallel decomposition.

Other than confining UI work to the main thread, I don't think I have seen thread confinement as an alternative to locks.


That's kind of my point. Go has the `go` keyword. Kotlin has runBlocking, coroutineScope, launch, async, and suspend. Go's simplicity feels nicer to me.


`runBlocking` and `suspend` is the coloured function point. Some hate it, I don't mind it. It has its benefits.

---

A quick search on SO gives this example of parallel decomposition. https://stackoverflow.com/questions/57662739/pattern-for-fet...

Starting several `async` tasks, then awaiting them is cleaner for me. But that's subjective.

---

Kotlin's structured concurrency handles the context stuff by default. When you handle those concerns (cancellation for example) in go, it's just as complex, but more verbose.


Go's context object is used for cancellation. I think this is pretty simple and has the benefit of avoiding any type of exception handling while still giving you the ability to clean up anything you were doing.

In general, if you like Kotlin, I can see why you'd like their approach to concurrency. Kotlin really likes a large standard library with generic blocks that act as syntactic sugar. I used to like that too, but as I've gotten older I've gotten lazier. Kotlin now has too much mental overhead for my old brain.


The coroutines support in the stdlib is here: https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.coroutin...

You can build exactly Go's coroutines API on top of these primitives, by encapsulating the Continuation object in Channel and Mutex types, and you would not have to touch the `suspend` keyword (which is sugar for CPS-transforming the function).


I don’t think you can build Go’s coroutine API in Kotlin because Go ensures you can never starve the scheduler. For instance, blocking system calls in Go spawn new OS threads so that they can be pre-empted by other running go routines, and most of the standard API is built with stuff like select, poll, and epoll so you never have to worry about blocking the scheduler. That’s not how Kotlin works, right?

So, while you can have a similar syntax in Kotlin, the overall systems will have different semantics.


Project Loom will get us that behavior for the JVM, as long as the default dispatcher is modified to support virtual threads. For now, we schedule on an unlimited thread pool using Dispatchers.IO, which is not as cool.


The cost of context switching between threads sharing the same address space is much lower than between tasks, because there's no need for a TLB flush. It just comes down to the cost of saving/restoring register state (more or less), and that's a nice known fixed cost on any given processor.

Context switching between tasks has a variable cost, because the implications of the TLB flush depend on the working set size of the switched-to task. If it doesn't touch much memory, the variable cost is low; if it touches a lot, the variable cost can be much, much higher than the register save/restore (fixed cost).


> It just comes down to the cost of saving/restoring register state (more or less), and that's a nice known fixed cost on any given processor.

There's also the cost of cache eviction. Not something you have to manually manage, but it's a cost you pay nonetheless. Maybe.


Indeed, I ought to have mentioned the cost of cache misses post-context switch too. It too is working-set-size dependent, but the relationship looks different from the cost of TLB misses. However, my understanding is that an intra-task thread switch would also not cause cache invalidation (there's no need; the address space remains the same).


Would you need to flush a TLB on a regular context switch? That seems inefficient as threads are being scheduled per quanta in Linux. TLB entries are only evicted on page faults right?


> C++ has threads which ACTUALLY run in parallel on the CPU. Why bother complicating the language further?

Actual parallelism is surprisingly slow and resource heavy in practice. When everything is on one core, you keep things local in L3, L2, L1.

However, add on a 2nd core, then you suddenly have "ping-ponging". Let's say core#0 has data in L1 cache, and now Core#1 needs to read-modify-write to it. That means the cores need to:

1. Core#1 begins to mark that cache-line as exclusive.

2. Core#0 needs to then respond by marking the line as invalid. Then Core#0 ejects the data from L1, and passes it to Core#1.

3. Core#1 can finally begin to work on the data.

4. If Core#0 uses the data again, Core#1 must do step#2 again.

--------

This is called "ping-ponging". L1 cache reads/writes take about 1 nanosecond, but a ping-pong can be 30 nanoseconds or worse: 30x slower for no reason.

You don't want to add another core to your problem unless you're _actually_ getting a speed benefit. Its more complex than it may seem at first glance. You very well could add a bunch of cores and then suddenly your program is way slower because of these kinds of issues.

Another problem: False sharing. Core#0 is working on "int x", and Core#1 is working on "int y", but x and y are on the same cacheline. So they have to ping-pong even though they never actually touch the same data.


Your example implies some combination of (a) excessive data sharing between threads (b) possible need for core pinning.

If thread A is going to need to read all the data touched by thread B, then it's unclear why you've split the task across threads. If there are still good reasons, then probably pin A and B to core N (never pin anything to core 0, unrelated story), and let them interleave there.

If that doesn't make sense, then yep, you'll have to face the cost of ping-ponging.


In my benchmark of a simple ledger I generate and do 25,021,365 (25 million) random transactions per second. (Withdraw and deposit over 80,000 accounts)

Paralellise it via sharding and I get 80,958,379 transctions per second.

If I remove the randomness, I get 134,600,233 transactions per second. If I parallelise with 12 threads I get 931,024,042 transactions per second.

My point being: if you parallelise properly and do not use shared data, then you can boost performance by a multiple of the number of threads.

https://github.com/samsquire/multiversion-concurrency-contro...

For the technique I use for this scalability, see my journal entries on sharded structs and sharded integers

https://github.com/samsquire/ideas4#565-sharded-struct-pseud... https://github.com/samsquire/ideas4#608-sharded-integers


This is not so likely in most workloads unless multiple threads are hitting very close or adjacent regions in memory. Most multithreading workloads work on memory not so close together. It's good to be aware of it, but to suggest it is the most likely outcome from multithreading is misleading.


Coroutines don't really have anything to do with threads. They are just a way for you to model suspending and resuming functions.

If you have ever tried to implement a complicated iterator, you will find that you model a for loop in your head, but the iteration of the loop is controlled from outside using the ++ operator. The ++ operator allows the iterator loop to go one more time round. Normally you have to manage this with state stored in some object. With coroutines you can actually write a loop and co_yield from the middle of the loop to implement the ++ operator.


Co-routines can be a great structured-programming tool for implementing state machines. A good article which explains this idea explicitly: https://eli.thegreenplace.net/2009/08/29/co-routines-as-an-a...

That said, overhead always needs to be considered; the overhead cost of using a CPP co-routine would be considerable for implementing a state machine that transitions rapidly.


This is actually really convenient.

In our code we have an abstraction that wraps a set using a variant. Instead of implementing an iterator/begin/end based on which data the variant holds, we have an iter() method that unwraps the variant and loops through and co_yields. Since you can consume a generator in a for-each loop, you use it the exact same way (except calling iter).

In our case it doesn't save that much complexity, but it's definitely some. I could imagine more complicated cases where it is even more convenient.


Everyone who ever programmed a bit in Python knows that. Generators simplify your code and remove unnecessary boilerplate.

However, async/await is real nice as well.


Real threads can be too expensive sometimes. A single coroutine in Go uses just a few KB of memory, so we can easily spawn hundreds of thousands of coroutines, which is not possible with threads.


Your point is correct but your magnitudes are off.

IIRC, a thread takes up ~32kB or so worth of data in pthreads/Linux, or at least... somewhere around there. So on a rather cheap server with 64GB of RAM, you can easily fit 1-million threads. More expensive servers have 512GB, 1TB or more RAM and can easily handle this kind of load.

Of course, coroutines are even smaller/more efficient and can go into 10-million, 100-million or higher.


A bigger issue is creating and destroying threads, which can be costly if you don’t do a lot of work per each. Of course, people have worked around this problem for ages with thread pools, but then you’re already breaking up ownership, code flow and screwing up the stack for debugging and panicking. So something like coroutines or green threads is still motivated, it is just extremely different across languages and is almost always a leaky abstraction.


A thread on Windows is MUCH heavier than on Linux: a 1 MB stack from what I remember, plus it takes a lot longer to create.


dwStackSize can be set as low as 64kB though. So if you're in the habit of making millions-of-threads, maybe make them smaller on Windows?

It does default to 1MB, but these sorts of things are configurable.


Ugh I really need to finish my blog post where we had async with only a few dozen bytes of overhead per task.

Cooperative async runtimes are obscenely efficient.


> Why implement coroutines when you have multithreading?

Because coroutines are easier to think about than threads, and avoid entire (large) classes of bugs.

Because coroutines (typically) have a lower overhead than threads.

Because if you are IO bound and not CPU bound, and therefore you don't need to spread work across cores, threads may not provide a given workload with much, or any, benefit.


I've benefited from cooperative userspace context switching with CPU-bound workloads as well. Imagine simulating a vehicle's worth of embedded software, with several dozen tasks originally intended to run at 100Hz asynchronously across multiple different computers. Now, imagine running them faster than real time, deterministically. Now imagine doing several thousand of those simulations. The overhead of OS threads, context switching and synchronization starts to add up, and is 10-100x slower for that sort of workload.


Cooperative multitasking systems having more deterministic performance profiles is often overlooked.

I'm working on a post right now covering this in detail, but I spent the last 2 hours being reminded that explaining concurrency stuff in simple terms is hard. :-D


It'd be nice if we had system-level "lightweight threads." The explanation I've heard before is that "user space knows more about the program." So people write clever runtimes that do kernel-like task scheduling multiplexed over CPUs, in user space.

I can buy that. Think about a system thread. It's pretty much its own process that shares an address space with others. Its interface is syscalls and memory. A user thread (coroutine, goroutine, greenlet, whatever) is... whatever you want it to be. It can be small and special-purpose.


I agree, but I think as you’re alluding to, the deeper issue is that the syscall boundary is too big of a fence to jump over for small operations. Hence, io_uring and similar, which pools multiple logical syscalls into one. The alternative way would be to have non-syscalls but kernel-aware libraries that share memory with the kernel, I guess? It’s a much bigger change, philosophically, so I think the io_uring model + user space scheduling will take the throne for the next decade or so at least. It’s a bit unfortunate because we’ll likely see a lot of bloated and fragmented language runtimes, but it’s still vastly superior to what we’ve seen over the last decade+.


Coroutines are about decoupling OS-level threads from logical threads (where OS-level threads are already decoupled from physical threads). The reasoning is actually very similar to why OS-level threads do not correspond directly to physical threads.

You may wonder why OS threads are not themselves sufficient, but it turns out there is a decent amount of overhead involved in managing huge numbers of OS threads. There are limits at which Linux does not perform well anymore, leading to problems like live locking. Also, if you can avoid context switching/cache invalidation by keeping a single OS-level thread busy, you get performance benefits even before you hit OS limits.

So the userspace scheduling and green threading that coroutines enable allow you to increase concurrency without hitting these limits.


Why use a hammer when you already have screwdrivers?


For an I/O heavy application (not in C++) I avoid threads for the main "loop" as there is a central cache that is accessed often. Synchronizing access to it is likely more expensive and error-prone than using just a single thread. It is also possible a few hundred or thousand tasks are running concurrently, which becomes very expensive very quickly with threads.

There are a few CPU-intensive tasks but those are easy to isolate and send to a threadpool.


Coroutines are for when you would normally use callbacks or a state machine, but handling the callback’s state or the state machine is complex enough that it becomes much simpler when expressed as a coroutine (simpler to reason about). It isn’t really an alternative to multithreading.


Co-routines (when used as an async primitive) are an an alternative to callbacks.

Callbacks exist to avoid the overhead of marshalling async results back to the original calling thread: instead of having the calling thread poll until the child thread(s) pass their results back, the calling function returns after kicking off the children, and one of the children directly resumes the work when the results are present. This is a significant performance gain.

However, callbacks have famously poor ergonomics, so many structured alternatives have been created: Promises/Futures, Async/Await, and CPP Coroutines.


Coroutines are when you don't want to run in parallel, because sequence matters. A typical example is iterators, you want the loop and iterator (routine and coroutine) to run in lockstep.

You could use threads instead, but then you risk getting in all sorts of concurrency problems. And if you synchronize everything, then great, you have reinvented coroutines. In fact, you can think of coroutines as threads with built-in synchronization, which may be how they are implemented.


To more elegantly write code that is async in nature like IO/Web requests? Let stuff happen and just await where needed?

Seems like different type of concern than what you need multiprocessing for.


Good for embedded devices (e.g. a single core ESP32).


even without any threads or async, coroutines are very nice for writing generators and super useful when you want generic code which can iterate, without leaking the container implementation to client code


Coroutines can be used as another way to manage object lifetime, and make some control flow much easier to write. This is another language construct, like lambdas. It is still possible to use threads if you have coroutines.


perhaps the take here is that many features of modern c++ are not intended for application writers, but more for library writers?


That's definitely a thing. A whole lot of the features added since C++11 have been specifically intended to enable library writers to do very complex things under the hood so they can present powerful yet simple interfaces to users.


Users of coroutine-friendly libraries will have to adopt the async function coloring (co_await, co_yield, co_return, Task<T>).

So saying that coroutines are low-level / not application code is misguided.


Not necessarily. Coroutines are invisible at the ABI level (that is coroutines are just functions).

It's pretty easy to write a library with them (using your own executor) and not reveal their use to the user.

Obviously sharing an executor between application and library is another matter.


> sharing an executor between application and library is another matter.

Especially when there is not one, but several libraries, and the application itself doesn't actually have or use any executors, now that's a fun challenge.


Good point.

Although doesn't that defeat the purpose of structured concurrency?

i.e. creating tasks that are plumbed to some background thread/executor.

By directly invoking the coroutine, lifetime issues become much easier to avoid.


They serve both to a great extent, as long as code owners are willing to refactor their codebases. The changes do actually make code significantly safer and more pleasant to work with. My favourite ones are concepts and class template argument deduction.


I believe this is correct. Anecdotally, at FB, there was quite a push to get coroutine-supporting-features into the standard because we had a C++ coroutine library built through non-standard means, which meant that compilers didn't fully support them (leading to a bunch of annoying things for all of those writing code on top of the library).

Now the folly/experimental/coro library can be compiled by anything that claims to support C++20, which is a pretty big win for FB if nothing else.


I can't speak to modern C++ specifically, but in many, many languages, more complex features are often for library writers than application writers.


The article is pretty good, but C++ coroutines sound like a contorted version of protothreads: not that useful in the first place, and made way more painful for the sake of enabling some hypothetical compiler optimizations that aren't done in practice. The two specific pain points are:

1) stacklessness, which means coroutines can only yield from the top-level function that is called by the coroutine, sort of like the "simple generators" introduced in Python 2. Some people would consider these to not be real coroutines at all, and Python 3 later introduced actual (stackful) coroutines aka async functions. The C++ devs say that if their coroutines had stacks they would be called fibers instead of coroutines, but I had always thought fiber (used that way) was a Windows-specific term.

2) Opaqueness of the coroutine frame object: you can't get at its size or contents. That lets the compiler optimize its layout or even in some cases eliminate it. But at least for now, real compilers don't seem able to do that, and some LLVM devs aren't sure it's even feasible. Rust doesn't attempt this. It just uses an ordinary struct to hold the frame data, so you can do normal struct things with it.

Above is my impression from reading the linked article and some related ones. I have not yet tried to use C++ coroutines, or looked into Rust's even slightly. Protothreads (a C header library) are cool though. They don't do that much, but they are very simple in implementation compared to this C++ mess.

I think these C++ coroutines don't reach the usefulness of a lightweight cooperative Forth multitasker from 50 years ago. Those were stackful (typically a few dozen cells in each stack, so a task on a 16-bit CPU might have cost 200 or so bytes including stacks, user variables, and other data) and ridiculously simple.

Goroutines aren't comparable. They are way more powerful but also way more expensive than C++ coroutines.

There were apparently some other C++ coroutine proposals that sounded a lot better, but that didn't have the backing to get through the standardization process.

I don't know how Boost coroutines compare. I tried to use them once but didn't get them working.


I recently asked how to use C++ thread pools with coroutines; there is code as an answer on my Stack Overflow question.

https://stackoverflow.com/questions/74520133/how-can-i-pass-...

I recently ported the coroutine code from

https://blog.dziban.net/coroutines/

to GNU Assembler.

https://GitHub.com/samsquire/assembly see coroutines.S

I think the best combination is threads + coroutines with an IO event loop. This gives throughput for CPU tasks and IO scalability.


C++20 coroutines didn't include the facilities to get a coroutine directly from writing a loop with the `yield` keyword like in Python. You can see all the boilerplate needed in this post's `Generator` class. For me, it made the feature not worth using.

I'm excited that C++23 is getting `std::generator`, so we can finally write those simple loops: https://en.cppreference.com/w/cpp/coroutine/generator


TBH, every time I read about coroutines and their example usages I get stuck on the "why" question. If the example is a function waiting for data, a std::future seems a lot easier. If the example is a generator, a range/transform/filter or just a class that returns a std::default_sentinel iterator seems more appropriate.

The only reason I can see is lowering the overhead of creating threads, at the cost of harder-to-read control flow.


One reason is to reduce callback hell, where you have to pass a continuation function to an asynchronous operation to run after it has completed. Coroutines can be used to make your code look more like blocking code. ASIO's coroutine support does this, and it makes async operations much easier to read.

Also, generators can be used to turn traversal functions into iterators, which is often tedious to do by hand.


EDIT: I think I follow how this works now. I didn't understand the sieve itself, so the coroutines were throwing me for a loop.

We get all the numbers from 2 to 40, print out 2, then eliminate all the entries divisible by 2. Now we're left with 3, 5, 7, 9, etc.

We print out 3, then remove all the entries divisible by 3.

This repeats until we've hit the end (40). The next number after a round of eliminations will always be a prime.


Yeah, the sieve use-case is a bit too clever IMO to be a good introductory example.


I'm only sitting here, waiting for my thread_locals to get broken.... or maybe they are handled magically somehow


In Win32 at least, there is "FLS", Fiber Local Storage. But Fibers (Windows speak for coroutines, but not sure how they relate to C++ coroutines) are generally considered to be not worth the complexity.


Fibers would be stackful coroutines (every coroutine gets its own stack memory, and local variables inside a coroutine are placed onto that stack). Whereas C++ coroutines are stackless - local data in those gets placed into a magic object that is created somewhere by the compiler.

Stackful vs stackless comes with a few pros and cons - e.g. it's often argued that stackless requires less memory. However stackful coroutines can allow for preemption at arbitrary points instead of just at yield points.


To expand on what you said, they recently decided to work on stackful coroutines for C++26 now that C++23 is final. You can just use one of the many libraries if you don't want to wait for the standard library version.

Stackful coroutines don't require any compiler support, unlike stackless, so they've been available for a long time (stackless coroutines have also been around since C++20).


Why would they be broken? Coroutines are not related to threads.


I think the point is that if the coroutine suspends it could be resumed on a different thread. So you don't really want to mix thread locals and coroutines.


I mean, for sure, but that's not a property of coroutines; any callback system can do that.

e.g. if you have this code with boost:

    #include <boost/asio.hpp>
    #include <chrono>
    #include <vector>
    #include <thread>
    #include <iostream>
    #include <mutex>
    std::mutex print_mutex;
          
    int main()
    {
      using namespace boost::asio;
      using namespace std::literals;
    
      io_context io_context(20);
      auto work_guard = make_work_guard(io_context); 
      for(int i = 0; i < 100; i++) {
        post(io_context, [&] {
          thread_local int x = 0;
          int z = x++;
    
          std::lock_guard _{print_mutex};
          std::cout << z << "\n";
        });
      }
    
      std::vector<std::jthread> v;
      for(int i = 0; i < 20; i++) 
        v.emplace_back([&] { io_context.run_for(2s); });
    } 
I hope you don't expect the output to be "0, 1, 2, ..., 99"


Some code relies on thread-local state, e.g. assuming that there is a single flow of operations, and that nothing gets swapped out (logically). This breaks with co-operative scheduling systems: a task's thread, while waiting on I/O (for example), gets reused for some other task, hence both tasks get to see the same thread_local, and somewhere someone assumed "thread_local" == "task_local" of sorts (but it's not).


My understanding is C++ coroutines are implemented as threads, which seems to take away the point of coroutines.


Where did you get this understanding? It is definitely, 100% wrong: C++ coroutines are not backed by threads at all (or by anything, really, since as of C++20 the language only provides the building blocks for making your own coroutine/generator types, not actual implementations).



coroutines (https://en.cppreference.com/w/cpp/language/coroutines) are unrelated to std::async


In other languages they are tightly related to async/await and I made the mistake of assuming they were connected.


I think even gccgo used threads as coroutines, not sure if it's still the case


I haven't read much about them, but it looks like they can be driven by any event-loopy thing. It's not like creating a coroutine creates a thread.

https://en.cppreference.com/w/cpp/language/coroutines


The fact that they went for stackless is a testament to how bad the committee-driven design process is. Just a bunch of bureaucrats playing important actors, similar to Stack Overflow mods.


Can you ELI5 why stackless is bad?


There are pros and cons, but here are a few things which favour stackful.

Function coloring is a disadvantage of stackless (assuming it's mixed with regular, non-async C++ code). Async and non-async libraries can't interoperate the way they can with stackful. Arbitrary functions cannot block the calling coroutine with stackless, but they can with stackful. Some people consider the distinction a feature (a bit like checked exceptions, it's debated as to whether it's helpful or adds brittleness), but it's often a problem in large codebases that weren't written to be entirely async (including libraries) from the start. Anyway, if you want coloring as a distinction in your type system you can still have it. But with stackless, you have no choice.

Memory allocation patterns are different with stackless, sometimes worse. While a stackful coroutine system requires stacks to be allocated for each new coroutine, obviously, in a stackless async/await system there is typically a higher rate of memory allocation, and with varying sizes, to hold the temporary states of each coroutine which can occur at each await site. There are usually many more of those sites than coroutines.

In addition, those temporary states being stored in heap-allocated memory are likely to have a lower CPU cache hit ratio than stack memory during the run of a particular coroutine.

Something like the Linux kernel is very difficult to write in an explicit stackless style. The Linux kernel design uses stackful coroutines pervasively. This is not theoretical: It has come up in practice. Years ago there were a number of attempts to change the Linux filesystem and block I/O code to have async code paths (i.e. stackless style, state machines), so that a proper async I/O ("AIO") could be offered to userspace. Every attempt failed because it was too much work or too difficult to make all the filesystem code and everything it calls, every path, fully async. Some of those changes went in, but the result was Linux AIO requests were not reliably async (and still are not), as they could sometimes block when they hit some less common paths in filesystem code, e.g. doing things like updating a block extent map, b-tree, or LVM/RAID corner case. In the end, the async I/O designs that worked well and reliably didn't block on request submission all ended up delegating I/O requests to scheduled stackful coroutines, i.e. kernel threads. These also turned out to perform well, which is not a coincidence, as stackful context switching is efficient. Of these designs, the one which ended up being adopted and well known is called io_uring.

For an example of less common paths that still need to be async, some operations have to allocate memory in all sorts of ways, including temporary memory when calling library functions or traversing data structures. In a kernel, memory allocation (i.e. malloc/new) has to be able to block the calling task temporarily, so that when there isn't enough free memory immediately available, the allocating task will wait for another task to free some, as that is preferable to failing an entire operation. Try doing that with C++ coroutines and idiomatic object allocation, and you will hit the function color problem: You can't async allocate a new object. You could of course write in a non-idiomatic style, not using new or constructors or anything which calls those, doing your own "await my_async_new" everywhere, but having to do that utterly consistently throughout a large codebase, and requiring every little function (including all libraries) to work that way as well, would be not really using C++ as it is meant to be used, and comes with its own risks. Alternatively you can block the thread the executor is running on, but that defeats the point of async coroutines.

With stackful coroutines, those kinds of operations Just Work(tm).

You can achieve the same thing with stackless by allowing such operations to block the executor thread, and spawn new executor threads which do work-stealing to ensure other coroutines are able to make progress while the first one is blocked. I believe that is what Go and Rust's Tokio do. The effect is to allow coroutines to be a mixture of stackless and stackful as needed, optimising for both worlds at the same time. It has the particular benefit of improving performance when the executor needs to call an operation which blocks in a library function or in the kernel. However, as with stackless, to ensure every coroutine can progress in an async manner without being stalled by coroutines that are blocked, this also needs every code path and library function to buy into doing that (if only by annotating "I may do something that will block now" regions). So it's also not suited for retrofitting to all codebases.


For coroutines to be really useful, they have to be stackful. The guy that originally did the proposal for C++ is also an author of multiple coro libraries and did research on this. You can emulate stackful coros with stackless via trampoline, but heck, you can emulate stackless coros with closures and trampoline too! The requirement of this extra trampoline makes their use extra convoluted and negates the very slight advantages that such functionality may really bring. Making stackful "right" was hard, so a useless compromise was made, which is basically like adding a syntactic sugar to the language.


It may be a compromise, but what exactly makes it "useless"? A similar style of async has been in use in e.g. C# for over a decade now, and it has been very successful there. Yes, it is syntactic sugar - but so are e.g. lambdas. What matters in practice is that it makes some kinds of code much shorter and easier to read.


Stackful coroutines are a feature on par with the power of delimited continuations (a proven fact in academia; you can easily find somewhat-easy-to-follow papers on this topic), while stackless coroutines are a stupid gimmick. You see the "compromise"? You argued the case for adding a car, but after long debate, politics and "compromise" you got a TOY car instead.


What kind of things does a stackful coroutine practically enable that would be impossible with a stackless design?


Theoretically everything is possible with a NAND gate, so the question is badly formulated from the beginning. There is a sibling comment from @jlokier who did a great job of providing a sane summary, and not just some incoherent rambling like I did.



