> Asynchronous code (e.g. callbacks) is more efficient (letting you do other work while waiting for things) but is also more complicated (manually saving and restoring state).
This is a persistent misunderstanding about async... it is not generally more efficient than non-async code. In fact, the overhead of async may slow things down. (There can be other downsides as well. E.g. debugging may be more difficult.)
For it to help, you need to have something better for the current thread to be doing than waiting for whatever it is you're waiting for.
Whatever that is, another worker thread could be doing it, so what we're really talking about is that async may let you reduce the number of threads you're using.
(Async is so useful in JS because it lets you reduce the number of threads to one, which lets you sidestep all kinds of hairy issues.)
That benefit might be big enough, depending on your situation, to make the downsides of async worthwhile. But it may very well not be.
Anyway: don't use async because it's "more efficient".
You are leaving out an important aspect: context switches are expensive, especially with all the mitigations introduced after Spectre et al.
Doing more work in a single thread saves a lot of context switches, especially if IO makes up a large portion of the workload. And even more so with other mechanisms to minimize syscalls, like io_uring on Linux.
Async overhead is highly dependent on the way the async runtime is implemented.
It's true that async requires scheduling and task switching to be implemented in userspace, but assuming that async is used as an alternative to threads, then the OS would be doing a similar amount of work anyway.
The OS may be able to do that more efficiently because it has more information about the system. Or it may be the opposite, because the application has more information about the workloads.
"async = overhead" as a general statement is not correct.
Thread context switches are about scheduling threads on physical cores. Async is about executing in a scope on a thread. There's no direct conversion here. Packing a thread with async tasks could let you reduce the number of threads you have, which will probably reduce thread context switching if there is contention for cores. Whether that's significantly better than the cost of async context switches really depends on specifics (even if async context switch cost is near zero, though we're no longer talking about C++ coroutines in that case).
But keep in mind: if your threads are mostly waiting, they aren't triggering a lot of context switches. More async means fewer, busier threads; less async means more, less-busy threads. To run into a problem with less async, your threads need to be forcing switches at a pretty fine-grained level. (BTW, most blocking for IO isn't fine-grained from the perspective of instructions executing on a core.)
That kind of thing happens, but has solutions other than async.
Do you have a citation for kernel mode having more efficient context switches? What kind of direct hardware access are you referring to that would be better than pushing the register context onto the stack?
In my experience, the exact opposite is true, particularly in the era of CPU mitigations that require TLB flushes upon every kernel-mode context switch.
You're right, kernel-level context switching is much slower than user-level context switching.
User-level can also have the advantage of having more actual context about the task that is running, meaning that it's often able to avoid saving/restoring as much data as a kernel-level switch would. See Go's green threads for a great example of this kind of cooperation between runtime and language.
> Do you have a citation for kernel mode having more efficient context switches? What kind of direct hardware access are you referring to that would be better than pushing the register context onto the stack?
The closest thing to this that I can think of is on 32-bit x86, which did have hardware-assisted context switching via TSSs (task state segments).
As it happens, everybody stopped using it because it was too slow, and a bit painful unless you fully bought into x86's awful segmentation model. Early Linux kernels used it if you want to see it in action.
If you have 1 thread handling 1000 requests with some async IO mechanism (epoll, io_uring, ...), instead of 1000 threads each handling one request, there are far fewer threads fighting over CPU cores, and the 1 thread can stay active much longer, hence reducing the number of context switches.
Especially with a mechanism like io_uring, which helps minimize syscalls (and hence transitions into kernel mode).
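The one-thread/many-connections pattern described above can be sketched in Python with the standard selectors module (which uses epoll under the hood on Linux). The 100 socket pairs here are stand-ins for client connections:

```python
import selectors
import socket

# One thread multiplexes 100 connections: the selectors module wraps
# epoll/kqueue/select depending on the platform.
sel = selectors.DefaultSelector()
pairs = [socket.socketpair() for _ in range(100)]  # stand-ins for 100 clients
results = []

def on_readable(conn):
    results.append(conn.recv(1024))   # handle the "request"
    sel.unregister(conn)

for server_side, client_side in pairs:
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ, on_readable)
    client_side.sendall(b"ping")      # simulate a request arriving

while len(results) < len(pairs):      # a single thread services everything
    for key, _ in sel.select():
        key.data(key.fileobj)         # key.data is the registered callback

for a, b in pairs:
    a.close()
    b.close()
```

One `select()` call hands back every ready socket at once, so the thread stays busy draining work instead of being switched out between requests.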
Asynchronous execution is more efficient if and only if the time spent waiting for a call to complete is longer than the time spent checking if the call has completed plus the time spent dispatching to another asynchronous call (and waiting vs checking if it is completed, and so on).
Remember that concurrency is not parallelism, it's not about the number of threads being used.
> Anyway: don't use async because it's "more efficient".
I don't think efficiency has only one meaning. For instance, you can create millions of coroutines while you're limited to only thousands of threads. Doesn't that mean your program is making better use of your hardware and thus is more efficient?
No, your processor can only do n things at a time, where n is the number of cores (probably like 8) (simplifying hugely of course). It's pretty easy to fill out 8 cores with 1000s of threads, the coroutines only become useful when almost all of them are blocking on a socket or something, periodically switching on and off to handle it.
Sure, they can be more efficient in particular situations.
But coroutines have overhead (so do threads). Your millions of coroutines might be worse than your thousands of threads in terms of making efficient use of the cpu cores your app or service has available.
If you can create millions of threads with some hardware, then you can create billions of coroutines on that hardware. It's just a fact that you can create 2-3 orders of magnitude more coroutines than you can kernel threads on any given hardware.
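As a rough illustration of the magnitude difference (in Python, since the claim is language-agnostic): spawning 100,000 coroutine tasks on one thread is unremarkable, while 100,000 OS threads would exhaust most default configurations.

```python
import asyncio

# Each coroutine is a small heap object, not an OS thread with its own
# kernel-managed stack, which is why the count can be so much higher.
async def worker(i):
    await asyncio.sleep(0)   # yield to the scheduler once
    return i

async def main():
    tasks = [worker(i) for i in range(100_000)]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
print(len(results))  # 100000
```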
Agreed in general, that async code is generally less efficient than epoll/kqueue/poll/select/etc. non-blocking I/O. However, note that async code is often quite a bit more efficient than naive Java 1.3-style (pre-NIO) tons-of-blocked-threads code.
Now, io_uring async code is yet another story.
I was working at LimeWire when Apple finally shipped a 1.4 JVM as their system JVM. The whole thing got much lighter weight when we could use non-blocking I/O.
The async runtime is almost always built on top of non-blocking I/O APIs, and the extra layer of abstraction adds overhead. You're rarely calling epoll in your async code, but rather your runtime is handling all of that for you.
That is, unless you're using io_uring, POSIX AIO, etc., in which case your async code might not have the extra layer of abstraction on top of the system APIs.
But a single thread with async can be, and for many tasks is, more efficient than a single thread without async, because you aren't sitting idle waiting on requests.
One point also is that until recently most personal computers did not have multiple cores, aka you could think at the lowest level that even multiple threads were just more dangerous async lol
Even making new threads in the JVM isn’t guaranteed to result in more than a single OS thread.
Dual core started coming out ~2005. It took a few more years to dominate the product lineup, and computers tend to have a replacement cycle of ~5 years. Consequently, it's not until about 2012-ish that you could reasonably expect a personal computer to have a multicore processor.
(Yeah, that's still a decade ago, but that's a good deal more recent than 17 years.)
There is another use case for async in client-side JavaScript that often gets overlooked: modeling the flow of UI.
So, let's say you need to show a "dialog" which, upon exit, should execute a piece of code (e.g. to show another piece of UI). Maybe that dialog can be invoked from multiple places, and the code to execute (after the dialog) changes accordingly.
You could do that via callbacks, but that is not very composable, and is generally messy.
Better solution: a Promise which gets resolved when the user exits the dialog! Composing multiple pieces of UI then becomes equivalent to composing multiple (async) function calls, you can pass output from one as the input to another, use exceptions to "circuit break" the UI flow, etc.
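The pattern above is originally a JavaScript one; here is a minimal transposition to Python's asyncio, where a Future plays the role of the Promise. `show_dialog` and the simulated "ok" click are made up for illustration:

```python
import asyncio

async def show_dialog(prompt):
    done = asyncio.get_running_loop().create_future()
    # In a real UI, the dialog's button handlers would call
    # done.set_result(...). Here we simulate the user clicking "ok".
    asyncio.get_running_loop().call_later(0.01, done.set_result, "ok")
    return await done   # suspends until the "user" exits the dialog

async def flow():
    choice = await show_dialog("Save changes?")
    # Composing UI steps is now just composing async calls.
    return "saved" if choice == "ok" else "discarded"

print(asyncio.run(flow()))  # saved
```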
I'm not convinced of the dialog use-case. Seems rather sketchy to await a user to click OK. What happens when you add another button or two to the dialog?
In this model, you wouldn't await the user clicking OK, but exiting the modal. The promise only resolves "user took a decision", not the actual choice they took.
The result then indicates whether the user clicked OK, Cancel, [X] to close the modal, or any other state transition that you may have added to that dialog. The recipient (not necessarily the caller that initiated the dialog, in case you chain the responses) can then act accordingly and presumably triggers a different action on OK than on the other state transitions.
Why implement coroutines when you have multithreading?
Coroutines are to me a brain-twisty way to make the most out of a single core, because your language doesn't support multithreading. For example: JavaScript with a single thread in the browser, Python with the GIL.
C++ has threads which ACTUALLY run in parallel on the CPU. Why bother complicating the language further?
Because native multithreading is expensive when it comes to (a) memory consumption per thread of execution and (b) context switching. Also, if your program spends a lot of time just waiting on outside resources (most notably slow/blocking IO), coroutine concurrency is both faster and more lightweight. If your program is CPU bound, then coroutines/async have little-to-no value over native multithreading.
That being said, yeah, most async workflows are obtuse and difficult to follow and they force all consumers of your interface to also use the async primitives if they want the benefits. Go seems to be one of the few languages that does async/coroutines right.
What I like about Go is that each co-routine is entirely synchronous, and the only concurrency primitives you're given are channels and mutexes. This makes your code so simple and easy to understand.
I just looked up Kotlin's co-routines, and there are a number of annoying hoops you need to jump through:
1. Co-routines can only be built by calling runBlocking or coroutineScope.
2. Functions that are async but should act like blocking calls in a co-routine need to be marked with `suspend`.
3. There are a bunch of primitives you need to use like `launch` or `async`.
4. There are built-in options to make the co-routines lazy, but it makes the API feel sloppy.
5. Rather than locks, Kotlin encourages explicit thread dispatching. Sometimes you need to do this in Go too, like when using a C library, but in general it's rare.
The other consequence, however, is that Go coroutines only work with other Go code, and bespoke Go threads and stacks require the runtime to jump through a lot of hoops to do FFI correctly, with the result that the ecosystem tends to rewrite rather than reuse existing code in C and other languages.
In contrast, the approach with explicit promises and await can be mapped all the way down to a C struct with a function and a data pointer representing a callback. That is, it is C ABI compatible, and thus it can be easily composed across any boundary that respects that ABI.
That's kind of my point. Go has the `go` keyword. Kotlin has runBlocking, coroutineScope, launch, async, and suspend. Go's simplicity feels nicer to me.
Starting several `async` tasks, then awaiting them is cleaner for me. But that's subjective.
---
Kotlin's structured concurrency handles the context stuff by default. When you handle those concerns (cancellation, for example) in Go, it's just as complex, but more verbose.
Go's context object is used for cancellation. I think this is pretty simple and has the benefit of avoiding any type of exception handling while still giving you the ability to clean up anything you were doing.
In general, if you like Kotlin, I can see why you'd like their approach to concurrency. Kotlin really likes a large standard library with generic blocks that act as syntactic sugar. I used to like that too, but as I've gotten older I've gotten lazier. Kotlin now has too much mental overhead for my old brain.
You can build exactly Go's coroutines API on top of these primitives, by encapsulating the Continuation object in Channel and Mutex types, and you would not have to touch the `suspend` keyword (which is sugar for CPS-transforming the function).
I don’t think you can build Go’s coroutine API in Kotlin because Go ensures you can never starve the scheduler. For instance, blocking system calls in Go spawn new OS threads so that they can be pre-empted by other running go routines, and most of the standard API is built with stuff like select, poll, and epoll so you never have to worry about blocking the scheduler. That’s not how Kotlin works, right?
So, while you can have a similar syntax in Kotlin, the overall systems will have different semantics.
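Go's trick of keeping blocking syscalls off the scheduler has an explicit analog in other runtimes. A Python sketch using asyncio.to_thread: the blocking call runs on a worker thread while the event loop keeps ticking, so other tasks are not starved.

```python
import asyncio
import time

def blocking_io():
    time.sleep(0.05)        # stands in for a blocking syscall
    return "done"

async def heartbeat(ticks):
    # Keeps running while blocking_io is off on its worker thread,
    # demonstrating that the scheduler is not starved.
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)

async def main():
    ticks = []
    result, _ = await asyncio.gather(asyncio.to_thread(blocking_io),
                                     heartbeat(ticks))
    return result, len(ticks)

print(asyncio.run(main()))  # ('done', 5)
```

The difference from Go is exactly the one described above: here the programmer must know the call blocks and wrap it, whereas Go's runtime does it transparently.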
Project Loom will get us that behavior for the JVM, as long as the default dispatcher is modified to support virtual threads. For now, we schedule on an unlimited thread pool using Dispatchers.IO, which is not as cool.
The cost of context switching between threads sharing the same address space is much lower than between tasks, because there's no need for a TLB flush. It just comes down to the cost of saving/restoring register state (more or less), and that's a nice known fixed cost on any given processor.
Context switching between tasks has a variable cost, because the implications of the TLB flush depend on the working set size of the switched-to task. If it doesn't touch much memory, the variable cost is low; if it touches a lot, the variable cost can be much, much higher than the register save/restore (fixed cost).
Indeed, I ought to have mentioned the cost of cache misses post-context switch too. It too is working set size dependent, but the relationship looks different from the cost of TLB misses. However, my understanding is that an inter-task thread switch would also not cause cache invalidation (there's no need; the address space remains the same).
Would you need to flush a TLB on a regular context switch? That seems inefficient as threads are being scheduled per quanta in Linux. TLB entries are only evicted on page faults right?
> C++ has threads which ACTUALLY run in parallel on the CPU. Why bother complicating the language further?
Actual parallelism is surprisingly slow and resource heavy in practice. When everything is on one core, you keep things local in L3, L2, L1.
However, add on a 2nd core, and you suddenly have "ping-ponging". Let's say Core#0 has data in L1 cache, and now Core#1 needs to read-modify-write to it. That means the cores need to:
1. Core#1 begins to mark that cache-line as exclusive.
2. Core#0 needs to then respond by marking the line as invalid. Then Core#0 ejects the data from L1, and passes it to Core#1.
3. Core#1 can finally begin to work on the data.
4. If Core#0 uses the data again, Core#1 must do step #2 again.
--------
This is called "ping-ponging". L1 cache reads/writes take about 1 nanosecond, but a ping-pong can take 30 nanoseconds or more: 30x slower for no reason.
You don't want to add another core to your problem unless you're _actually_ getting a speed benefit. It's more complex than it may seem at first glance. You could very well add a bunch of cores and then suddenly your program is way slower because of these kinds of issues.
Another problem: False sharing. Core#0 is working on "int x", and Core#1 is working on "int y", but x and y are on the same cacheline. So they have to ping-pong even though they never actually touch the same data.
Your example implies some combination of (a) excessive data sharing between threads (b) possible need for core pinning.
If thread A is going to need to read all the data touched by thread B, then it's unclear why you've split the task across threads. If there are still good reasons, then probably pin A and B to core N (never pin anything to core 0, unrelated story), and let them interleave there.
If that doesn't make sense, then yep, you'll have to face the cost of ping-ponging.
In my benchmark of a simple ledger I generate and do 25,021,365 (25 million) random transactions per second. (Withdraw and deposit over 80,000 accounts)
Parallelise it via sharding and I get 80,958,379 transactions per second.
If I remove the randomness, I get 134,600,233 transactions per second. If I parallelise with 12 threads I get 931,024,042 transactions per second.
My point being: if you parallelise properly and do not use shared data, then you can boost performance by a multiple of the number of threads.
This is not so likely in most workloads unless multiple threads are hitting very close or adjacent regions in memory. Most multithreading workloads work on memory not so close together. It's good to be aware of it, but to suggest it is the most likely outcome from multithreading is misleading.
Coroutines don't really have anything to do with threads. They are just a way for you to model suspending and resuming functions.
If you have every tried to implement a complicated iterator you will find that you model a for loop in your head but that the iteration of the loop is controlled from outside using the ++ operator. The ++ operator allows the iterator loop to go one more time round. Normally you have to manage this with state stored in some object. With coroutines you can actually write a loop and co_yield from the middle of the loop to implement the ++ operator.
That said, overhead always needs to be considered; the overhead cost of using a CPP co-routine would be considerable for implementing a state machine that transitions rapidly.
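The iterator-vs-coroutine contrast described above, sketched in Python (where `yield` plays the role of `co_yield`): the hand-rolled version must store its loop state in an object driven from outside, while the generator just writes the loop.

```python
class Countdown:                      # manual state machine
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):               # the "++ operator" drives the loop
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

def countdown(n):                     # same loop, written directly
    while n > 0:
        yield n                       # suspend here; resume on next()
        n -= 1

assert list(Countdown(3)) == list(countdown(3)) == [3, 2, 1]
```

With three lines of state this hardly matters, but the gap widens fast once the "loop in your head" has nested loops or branches that all need flattening into object fields.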
In our code we have an abstraction that wraps a set using a variant. Instead of implementing an iterator/begin/end based on which data the variant holds, we have an iter() method that unwraps the variant and loops through and co_yields. Since you can consume a generator in a for-each loop, you use it the exact same way (except calling iter).
In our case it doesn't save that much complexity, but it's definitely some. I could imagine more complicated cases where it is even more convenient.
Real threads can be too expensive sometimes. A single coroutine in Go uses just a few KB of memory, so we can easily spawn hundreds of thousands of coroutines, which isn't possible with threads.
Your point is correct but your magnitudes are off.
IIRC, a thread takes up ~32kB or so worth of data in pthreads/Linux, or at least... somewhere around there. So on a rather cheap server with 64GB of RAM, you can easily fit 1-million threads. More expensive servers have 512GB, 1TB or more RAM and can easily handle this kind of load.
Of course, coroutines are even smaller/more efficient and can go into 10-million, 100-million or higher.
A bigger issue is creating and destroying threads, which can be costly if you don't do a lot of work per thread. Of course, people have worked around this problem for ages with thread pools, but then you're already breaking up ownership and code flow and screwing up the stack for debugging and panicking. So something like coroutines or green threads is still motivated; it's just extremely different across languages and is almost always a leaky abstraction.
> Why implement coroutines when you have multithreading?
Because coroutines are easier to think about than threads, and avoid entire (large) classes of bugs.
Because coroutines (typically) have a lower overhead than threads.
Because if you are IO bound and not CPU bound, and therefore you don't need to spread work across cores, threads may not provide a given workload with much, or any, benefit.
I've benefited from cooperative userspace context switching with CPU bound workloads as well. Imagine simulating a vehicle's worth of embedded software, with several dozen tasks originally intended to run at 100Hz asynchronously across multiple different computers. Now, imagine running them faster than real time, deterministically. Now imagine doing several thousand of those simulations. The overhead of OS threads, context switching and synchronization starts to add up, and is 10-100x slower for that sort of workload.
Cooperative multitasking systems having more deterministic performance profiles is often overlooked.
I'm working on a post right now covering this in detail, but I spent the last 2 hours being reminded that explaining concurrency stuff in simple terms is hard. :-D
It'd be nice if we had system-level "lightweight threads." The explanation I've heard before is that "user space knows more about the program." So people write clever runtimes that do kernel-like task scheduling multiplexed over CPUs, in user space.
I can buy that. Think about a system thread. It's pretty much its own process that shares an address space with others. Its interface is syscalls and memory. A user thread (coroutine, goroutine, greenlet, whatever) is... whatever you want it to be. It can be small and special-purpose.
I agree, but I think as you’re alluding to, the deeper issue is that the syscall boundary is too big of a fence to jump over for small operations. Hence, io_uring and similar, which pools multiple logical syscalls into one. The alternative way would be to have non-syscalls but kernel-aware libraries that share memory with the kernel, I guess? It’s a much bigger change, philosophically, so I think the io_uring model + user space scheduling will take the throne for the next decade or so at least. It’s a bit unfortunate because we’ll likely see a lot of bloated and fragmented language runtimes, but it’s still vastly superior to what we’ve seen over the last decade+.
Coroutines are about decoupling OS-level threads from logical threads (where OS-level threads are already decoupled from physical threads). The reasoning is actually very similar to why OS-level threads do not correspond directly to physical threads.
You may wonder why OS threads are not themselves sufficient, but it turns out there is a decent amount of overhead involved in managing huge numbers of OS threads. There are limits at which Linux does not perform well anymore, leading to problems like live locking. Also, if you can avoid context switching/cache invalidation by keeping a single OS-level thread busy, you get performance benefits even before you hit OS limits.
So the userspace scheduling and green threading that coroutines enable allow you to increase concurrency without hitting these limits.
For an I/O heavy application (not in C++) I avoid threads for the main "loop" as there is a central cache that is accessed often. Synchronizing access to it is likely more expensive and error-prone than using just a single thread. It is also possible a few hundred or thousand tasks are running concurrently, which becomes very expensive very quickly with threads.
There are a few CPU-intensive tasks but those are easy to isolate and send to a threadpool.
Coroutines are for when you would normally use callbacks or a state machine, but handling the callback’s state or the state machine is complex enough that it becomes much simpler when expressed as a coroutine (simpler to reason about). It isn’t really an alternative to multithreading.
Co-routines (when used as an async primitive) are an alternative to callbacks.
Callbacks exist to avoid the overhead of marshalling async results back to the original calling thread: instead of having the calling thread poll until the child thread(s) pass their results back, the calling function returns after kicking off the children, and one of the children directly resumes the work when the results are present. This is a significant performance gain.
However, callbacks have famously poor ergonomics, so many structured alternatives have been created: Promises/Futures, Async/Await, and CPP Coroutines.
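The progression from raw callbacks to awaitable futures might look like this in Python (`fetch_value` is a stand-in operation; both paths produce the same result, but the await version reads like blocking code):

```python
import asyncio

def fetch_value_cb(loop, callback):
    # Callback style: the completing operation "resumes the work"
    # by invoking the continuation it was handed.
    loop.call_soon(callback, 42)

async def fetch_value():
    # Coroutine style: same behavior, but the continuation is implicit.
    await asyncio.sleep(0)
    return 42

async def main():
    loop = asyncio.get_running_loop()
    got = loop.create_future()
    fetch_value_cb(loop, got.set_result)   # plumbing lives in the caller
    via_callback = await got

    via_await = await fetch_value()        # plumbing handled by the runtime
    return via_callback, via_await

print(asyncio.run(main()))  # (42, 42)
```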
Coroutines are when you don't want to run in parallel, because sequence matters. A typical example is iterators, you want the loop and iterator (routine and coroutine) to run in lockstep.
You could use threads instead, but then you risk getting in all sorts of concurrency problems. And if you synchronize everything, then great, you have reinvented coroutines. In fact, you can think of coroutines as threads with built-in synchronization, which may be how they are implemented.
even without any threads or async, coroutines are very nice for writing generators and super useful when you want generic code which can iterate, without leaking the container implementation to client code
This is another language construct, like lambdas. Coroutines can be used as another way to manage object lifetime, and make some control flow much easier to write. It is still possible to use threads if you have coroutines.
That's definitely a thing. A whole lot of the features added since C++11 have been specifically intended to enable library writers to do very complex things under the hood so they can present powerful yet simple interfaces to users.
> sharing an executor between application and library is another matter.
Especially when there is not one, but several libraries, and the application itself doesn't actually have or use any executors; now that's a fun challenge.
They serve both to a great extent, as long as code owners are willing to refactor their codebases. The changes do actually make code significantly safer and more pleasant to work with. My favourite ones are the concepts and class template argument deductions.
I believe this is correct. Anecdotally, at FB, there was quite a push to get coroutine-supporting-features into the standard because we had a C++ coroutine library built through non-standard means, which meant that compilers didn't fully support them (leading to a bunch of annoying things for all of those writing code on top of the library).
Now the folly/experimental/coro library can be compiled by anything that claims to support C++20, which is a pretty big win for FB if nothing else.
Article is pretty good but C++ coroutines sound like a contorted version of protothreads. Not that useful in the first place, and made way more painful for the sake of enabling some hypothetical compiler optimizations that aren't done in practice. The two specific pain points are
1. Stacklessness, which means coroutines can only yield from the top-level function that is called by the coroutine, sort of like the "simple generators" introduced in Python 2. Some people would consider these to not be real coroutines at all, and Python 3 later introduced actual (stackful) coroutines aka async functions. The C++ devs say that if their coroutines had stacks they would be called fibers instead of coroutines, but I had always thought fiber (used that way) was a Windows-specific term.
2. Opaqueness of the coroutine frame object: you can't get at its size or contents. That lets the compiler optimize its layout or even in some cases eliminate it. But at least for now, real compilers don't seem able to do that, and some LLVM devs aren't sure it's even feasible. Rust doesn't attempt this. It just uses an ordinary struct to hold the frame data, so you can do normal struct things with it.
Above is my impression from reading the linked article and some related ones. I have not yet tried to use C++ coroutines, or looked into Rust's even slightly. Protothreads (a C header library) are cool though. They don't do that much, but they are very simple in implementation compared to this C++ mess.
I think these C++ coroutines don't reach the usefulness of a lightweight cooperative Forth multitasker from 50 years ago. Those were stackful (typically a few dozen cells in each stack, so a task on a 16-bit CPU might have cost 200 or so bytes including stacks, user variables, and other data) and ridiculously simple.
Goroutines aren't comparable. They are way more powerful but also way more expensive than C++ coroutines.
There were apparently some other C++ coroutine proposals that sounded a lot better, but that didn't have the backing to get through the standardization process.
I don't know how Boost coroutines compare. I tried to use them once but didn't get them working.
C++20 coroutines didn't include the facilities to get a coroutine directly from writing a loop with the `yield` keyword like in Python. You can see all the boilerplate needed in this post's `Generator` class. For me, it made the feature not worth using.
TBH, every time I read about coroutines and their example usages, I get stuck on the "why" question. If the example is a function waiting for data, a std::future seems a lot easier. If the example is a generator, a range/transform/filter or just a class that returns a std::default_sentinel iterator seems more appropriate.
The only reason I can see is lowering the overhead of creating threads, at the cost of harder-to-read control flow.
One reason is to reduce callback hell where you have to pass a continuation function to an asynchronous operation to run after it has completed. Coroutines can be used to make your code look more like blocking code. ASIO’s coroutine support does this and it makes async operations much easier to read
Also, generators can be used to turn traversal functions into iterators, which is often tedious to do by hand.
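For instance, a recursive in-order tree traversal becomes an iterator with no explicit stack bookkeeping. In Python (the tuple-based tree shape is made up for the example):

```python
def walk(tree):
    # tree: nested tuples (value, left, right), or None for an empty subtree
    if tree is None:
        return
    value, left, right = tree
    yield from walk(left)    # suspend at each value deep in the recursion
    yield value
    yield from walk(right)

t = (2, (1, None, None), (3, None, None))
assert list(walk(t)) == [1, 2, 3]
```

Doing the same by hand means maintaining an explicit stack of "where was I in each subtree", which is exactly the tedium being described.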
In Win32 at least, there is "FLS", Fiber Local Storage. But Fibers (Windows speak for coroutines, but not sure how they relate to C++ coroutines) are generally considered to be not worth the complexity.
Fibers would be stackful coroutines (every coroutine gets its own stack memory, and local variables inside a coroutine are placed onto that stack). Whereas C++ coroutines are stackless - local data in those gets placed into a magic object that is created somewhere by the compiler.
Stackful vs stackless comes with a few pros and cons - e.g. it's often argued that stackless requires less memory. However, stackful coroutines can allow for preemption at arbitrary points instead of just at yield points.
To expand on what you said, they recently decided to work on stackful coroutines for C++26 now that C++23 is final. You can just use one of the many libraries if you don't want to wait for the standard library version.
Stackful coroutines didn't require any compiler support, unlike stackless, so they've been available for a long time (stackless ones have been around since C++20).
I think the point is that if the coroutine suspends it could be resumed on a different thread. So you don't really want to mix thread locals and coroutines.
thread_local state makes assumptions, e.g. that there is a single flow of operations, and that nothing gets swapped out (logically). This breaks with co-operative scheduling systems - e.g. a task's thread, while waiting on I/O (for example), gets reused for some other task, so both tasks get to see the same thread_local, and somewhere someone assumed "thread_local" == "task_local" of sorts (but it's not).
where did you get this understanding? this is definitely, 100% wrong, C++ coroutines are not backed by threads at all (or by anything really since as of C++20, the language only provides the building blocks for making your own coroutine / generator types and not actual implementations).
The fact that they went for stackless is a testament to how bad the committee-driven design process is: just a bunch of bureaucrats playing important actors, similar to stackoverflow mods.
There are pros and cons, but here are a few things which favour stackful.
Function coloring is a disadvantage of stackless (assuming it's mixed with regular, non-async C++ code). Async and non-async libraries can't interoperate the way they can with stackful. Arbitrary functions cannot block the calling coroutine with stackless, but they can with stackful. Some people consider the distinction a feature (a bit like checked exceptions, it's debated as to whether it's helpful or adds brittleness), but it's often a problem in large codebases that weren't written to be entirely async (including libraries) from the start. Anyway, if you want coloring as a distinction in your type system you can still have it. But with stackless, you have no choice.
Memory allocation patterns are different with stackless, sometimes worse. While a stackful coroutine system requires stacks to be allocated for each new coroutine, obviously, in a stackless async/await system there is typically a higher rate of memory allocation, and with varying sizes, to hold the temporary states of each coroutine which can occur at each await site. There are usually many more of those sites than coroutines.
In addition, those temporary states being stored in heap-allocated memory are likely to have a lower CPU cache hit ratio than stack memory during the run of a particular coroutine.
Something like the Linux kernel is very difficult to write in an explicit stackless style. The Linux kernel design uses stackful coroutines pervasively. This is not theoretical: It has come up in practice. Years ago there were a number of attempts to change the Linux filesystem and block I/O code to have async code paths (i.e. stackless style, state machines), so that a proper async I/O ("AIO") could be offered to userspace. Every attempt failed because it was too much work or too difficult to make all the filesystems code and everything they call, every path, fully async. Some of those changes went in, but the result was Linux AIO requests were not reliably async (and still are not), as they could sometimes block when they hit some less common paths in filesystem code, e.g. doing things like updating a block extent map, b-tree, or LVM/RAID corner case. In the end, the async I/O designs that worked well and reliably didn't block on request submission all ended up delegating I/O requests to scheduled stackful coroutines, i.e. kernel threads. These also turned out to perform well, which is not a coincidence, as stackful context switching is efficient. Of these designs, the one which ended up being adopted and well known is called io_uring.
For an example of less common paths that still need to be async, some operations have to allocate memory in all sorts of ways, including temporary memory when calling library functions or traversing data structures. In a kernel, memory allocation (i.e. malloc/new) has to be able to block the calling task temporarily, so that when there isn't enough free memory immediately available, the allocating task will wait for another task to free some, as that is preferable to failing an entire operation. Try doing that with C++ coroutines and idiomatic object allocation, and you will hit the function color problem: You can't async allocate a new object. You could of course write in a non-idiomatic style, not using new or constructors or anything which calls those, doing your own "await my_async_new" everywhere, but having to do that utterly consistently throughout a large codebase, and requiring every little function (including all libraries) to work that way as well, would be not really using C++ as it is meant to be used, and comes with its own risks. Alternatively you can block the thread the executor is running on, but that defeats the point of async coroutines.
With stackful coroutines, those kinds of operations Just Work(tm).
You can achieve the same thing with stackless by allowing such operations to block the executor thread, and spawning new executor threads which do work-stealing to ensure other coroutines are able to make progress while the first one is blocked. I believe that is what Go and Rust's Tokio do. The effect is to allow coroutines to be a mixture of stackless and stackful as needed, optimising for both worlds at the same time. It has the particular benefit of improving performance when the executor needs to call an operation which blocks in a library function or in the kernel. However, as with stackless, to ensure every coroutine can progress in an async manner without being stalled by coroutines that are blocked, this also needs every code path and library function to buy into doing that (if only by annotating "I may do something that will block now" regions). So it's also not suited for retrofitting to all codebases.
For coroutines to be really useful, they have to be stackful. The guy that originally did the proposal for C++ is also an author of multiple coro libraries and did research on this. You can emulate stackful coros with stackless via trampoline, but heck, you can emulate stackless coros with closures and trampoline too! The requirement of this extra trampoline makes their use extra convoluted and negates the very slight advantages that such functionality may really bring.
Making stackful "right" was hard, so a useless compromise was made, which is basically like adding a syntactic sugar to the language.
It may be a compromise, but what exactly makes it "useless"? A similar style of async has been in use in e.g. C# for over a decade now, and it has been very successful there. Yes, it is syntactic sugar - but so are e.g. lambdas. What matters in practice is that it makes some kinds of code much shorter and easier to read.
Stackful coroutines are a feature on par with the power of delimited continuations (a proven fact in academia; you can easily find somewhat-easy-to-follow papers on this topic); stackless coroutines are a stupid gimmick. You see the "compromise"? You argued the case for adding a car, but after long debate, politics and "compromise" you got a TOY car instead.
Theoretically everything is possible with a NAND gate, so the question is badly formulated from the beginning. There is a sibling comment from @jlokier who did a great job of providing a sane summary, and not just some incoherent rambling like I did.