Hi - I am confident that 16 workers was the right number for that application deployed on that machine. The machine is described in the article. If you took this app and put it on a machine with 8 cores, clearly it would make sense to try 32 workers - but in practice I think few Python apps are as I/O-bound as this one. Most of the time, just over 2 * cpu count is about the right number.
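For what it's worth, the "just over 2 * cpu count" rule of thumb matches the heuristic Gunicorn's own docs suggest as a starting point. A minimal sketch (the `multiplier` parameter is my own addition, for tuning toward more I/O-bound apps):

```python
import os

def suggested_workers(multiplier: int = 2) -> int:
    """Gunicorn-style starting point: (multiplier * CPUs) + 1."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return multiplier * cpus + 1

print(suggested_workers())      # e.g. 9 on a 4-core box
print(suggested_workers(4))     # a more aggressive count for I/O-bound apps
```

This is a starting point to benchmark from, not a final answer - the whole point of the article is that the right number is found empirically.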
I suspect that scheduler overhead is not a realistic consideration for a Python program. My understanding is that switching the executing process takes microseconds at worst, which is too small to notice from the point of view of a Python programmer.
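You can get a rough feel for the order of magnitude yourself. A sketch that bounces a token between two threads and times the round trips (assumption: an OS-level thread switch is in the same ballpark as a process switch, so this only shows the order of magnitude, not an exact figure):

```python
import threading
import time

def switch_cost(rounds: int = 10_000) -> float:
    """Approximate seconds per context switch via thread ping-pong."""
    a, b = threading.Event(), threading.Event()

    def pong() -> None:
        for _ in range(rounds):
            a.wait()   # wait for the main thread's turn to finish
            a.clear()
            b.set()    # hand the token back

    t = threading.Thread(target=pong)
    t.start()
    start = time.perf_counter()
    for _ in range(rounds):
        a.set()        # hand the token to the pong thread
        b.wait()
        b.clear()
    elapsed = time.perf_counter() - start
    t.join()
    # each round trip is two switches
    return elapsed / (2 * rounds)

print(f"~{switch_cost() * 1e6:.1f} microseconds per switch")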
On "it would be the job of the programmer to yield after some time" - I'm always personally suspicious of any technique that rests on programmer diligence. My experience suggests not to require (or even expect!) programmer diligence, even from my own (I assure you, god like) programming abilities. Secondly, yielding more often probably would not help (and in fact I half-suspect part of the problem is the frequent yielding at every async/await keyword!).
Edit: I've been downvoted so I'll add a clarification: it is usually believed that async shines against other models once you reach a certain scale (https://en.wikipedia.org/wiki/C10k_problem). This benchmark shows that async app frameworks are slower than the sync ones when running at a given scale, and since the article doesn't give many details on the incoming traffic, I can only assume that it's low, since it saturates 4 cores.
I believe that your conclusion that "Async Python is not faster" is an over-generalization from your use case.
I'm not saying that the configuration in your benchmark is wrong; I am saying that this benchmark may not yield the same results if you scale it up on bigger hardware.
I believe that scheduler overhead can't be ruled out (not for Python, nor for any other program) on a server, since we've sometimes observed that the scheduler could be the bottleneck under some circumstances. For instance, some Linux schedulers used to perform poorly when using nested cgroups with resource quotas enabled.
Also, I'd like to state my first point again: you need to see how the number of workers will influence the memory usage on your system. Especially with Python, if you've got a lot of workers, you can expect some memory fragmentation that can impact the performance of your system.
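A quick way to size this: measure the peak RSS of a single worker process and multiply by the worker count you're considering. A sketch using the standard library's `resource` module (assumption: a Unix system - `resource` isn't available on Windows, and note `ru_maxrss` is reported in KiB on Linux but in bytes on macOS):

```python
import resource
import sys

# Peak resident set size of the current process.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
unit = "bytes" if sys.platform == "darwin" else "KiB"
print(f"peak RSS of this process: {peak} {unit}")
```

Run this (or check the master's view of its children via `RUSAGE_CHILDREN`) inside a worker under realistic load; 16 workers each holding a few hundred MB adds up fast on a small box.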