The cost of context switching in "async" code is very rarely smaller than the cost of switching OS threads. (The exception is when you're using a GC language with some sort of global lock.)
"Async" in native code is cargo cult, unless you're trying to run on bare metal without OS support.
The cost of switching goroutines, Rust futures, Zig async frames, or fibers/userspace tasks in general is on the order of a few nanoseconds, whereas it's in the microsecond range for OS threads. This lets you spawn tons of tasks and have them all communicate with each other very quickly (write to queue; push receiver onto the scheduler's run queue; switch out the sender; switch to the receiver), whereas doing the same with OS threads would never scale (write to queue; syscall to wake the receiver; syscall to switch out the sender). Any highly concurrent application (think games, simulations, network services) uses userspace/custom task scheduling for similar reasons.
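To make that handoff concrete, here's a minimal sketch in plain Go (standard library only, numbers are rough and machine-dependent): an unbuffered channel send parks the sender, readies the receiver, and the runtime switches to it, mostly without a per-message syscall.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const hops = 1_000_000
        ch := make(chan int) // unbuffered: every send is a direct handoff

        go func() {
            for v := range ch {
                _ = v // the receiver is "switched in" by the runtime on each send
            }
        }()

        start := time.Now()
        for i := 0; i < hops; i++ {
            ch <- i // enqueue + park sender + ready receiver, handled in userspace
        }
        close(ch)
        fmt.Printf("~%.0f ns per handoff\n",
            float64(time.Since(start).Nanoseconds())/hops)
    }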
Node.js is inherently asynchronous, and during its peak years JavaScript developers bragged about how it beat Java at web serving despite using only one core, because a classic JEE servlet container launches a new thread per request. Even if you don't count that as a "context switch" and reach for a thread pool instead, you are deluding yourself: a thread pool is applying the principles of async, with the caveat that tasks you submit to the pool are not allowed to create tasks of their own (a task that blocks on a child task inside a bounded pool risks deadlocking it).
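For reference, this is the thread-pool shape being described, sketched in Go with goroutines standing in for pool threads (names are made up for illustration): a fixed set of workers drains a shared queue, and none of them may block waiting on a task they themselves submitted, or a full pool stalls.

    package main

    import (
        "fmt"
        "sync"
    )

    type Task func()

    func main() {
        const workers = 4
        tasks := make(chan Task, 128) // the shared task queue
        var wg sync.WaitGroup

        // Fixed pool: each worker loops over the queue, reusing its
        // "thread" instead of spawning one per request.
        for w := 0; w < workers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for task := range tasks {
                    task()
                }
            }()
        }

        for i := 0; i < 10; i++ {
            i := i
            tasks <- func() { fmt.Println("handled request", i) }
        }
        close(tasks)
        wg.Wait()
    }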
There is a reason why so many developers have chosen to do application-level scheduling: no operating system has exposed async primitives viable enough to build this on. OS threads suck, so everyone reinvents the wheel. See Java's "virtual threads", Go's goroutines, Erlang's processes, Node.js's async model.
You don't seem to be aware of what a context switch at the application level is. It is often as simple as a function call. There is no way that entering the kernel, running a generic scheduler that has to handle any possible workload, saving all the registers, possibly flushing the TLB (if the OS makes the mistake of running a different process first), and then restoring all the registers can be faster than simply calling the next function in the same address space.
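As a toy illustration of "a context switch is just a function call" (purely illustrative, not how any real runtime is structured): a cooperative run queue in Go where switching tasks means popping the next function off a slice and calling it.

    package main

    import "fmt"

    // A task is just a function; returning a non-nil continuation reschedules it.
    type task func() task

    func run(runq []task) {
        for len(runq) > 0 {
            next := runq[0]
            runq = runq[1:]
            if cont := next(); cont != nil { // the "context switch" is this call
                runq = append(runq, cont)    // no registers saved by hand, no kernel entry
            }
        }
    }

    func main() {
        counter := 0
        var step task
        step = func() task {
            counter++
            fmt.Println("tick", counter)
            if counter < 3 {
                return step // yield: go to the back of the run queue
            }
            return nil
        }
        run([]task{step, func() task { fmt.Println("other task"); return nil }})
    }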
Developers of these systems brag about how you can have millions of tasks active at the same time without breaking a sweat.
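A quick way to check that claim yourself in Go (the exact memory figure will vary by machine and runtime version): park a million goroutines and look at how little memory they pin.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        const n = 1_000_000
        var wg sync.WaitGroup
        release := make(chan struct{})

        for i := 0; i < n; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                <-release // park; an idle goroutine costs roughly a few kB of stack
            }()
        }

        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("%d tasks parked, ~%d MB obtained from the OS\n", n, m.Sys/(1<<20))

        close(release) // wake them all
        wg.Wait()
    }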
"Async" in native code is cargo cult, unless you're trying to run on bare metal without OS support.