Hacker News new | past | comments | ask | show | jobs | submit login
How Python asyncio works: recreating it from scratch (jacobpadilla.com)
282 points by jpjacobpadilla 8 months ago | hide | past | favorite | 57 comments



Asyncio allows you to replace the event loop with an implementation of your own. For Temporal Python we represent workflows as a custom, durable asyncio event loops so things like asyncio.sleep are durable timers (i.e. code can resume on another machine, so you can sleep for weeks). Here is a post explaining how it's done: https://temporal.io/blog/durable-distributed-asyncio-event-l....

The biggest problem with asyncio is how easy/common it is in Python to be able to block the asyncio thread with synchronous calls, gumming up the whole system. Python sorely needs a static analysis tool that can build a call graph to help detect if a known thread-blocking call is called directly or directly from an async def.


Maybe this is a bad idea, but... What if, instead of the current way where every call (including native stuff) is sync by default, it were the other way around? You'd quickly whitelist basic things like arithmetic and data struct access as sync, then you could maybe detect other things that should be sync if the event loop is spinning suspiciously fast.


Temporal is seriously cool.

When I found out how it had implemented the asyncio event loop it was a real mind blown moment.


how? is there a source to read?


Why not use threads ? I'm still trying to understand if Python is really meant for concurrency. Asyncio always felt like it's barely hanging on, maybe it's me, by C# has a cleaner async implementation in my opinion.


We need hundreds or thousands and they need to be cheap and, in Temporal's case, the cooperative multitasking needs to be deterministic which native threads are not. Temporal's C# SDK uses .NET tasks the same way and tasks have a lot of parallels with asyncio (though they combine with threads in one static thread pool which can be confusing in itself).


Is C# generally faster or more efficient here?

Forgive my ignorance, but I always figured Python just doesn't scale well for this use case.


Massively so. You can spawn 1 million tasks waiting for a Task.Delay and threadpool will trivially survive the punishment. Not so much with Python, which has a lot of problems scaling to all cores and also deals with interpreter lock (I assume it is better now and no longer pure GIL style?). The advice for Python seems to be to use multi-processing and even then it is orders of magnitude of difference in CPU time spent per operation. One language compiles to codegen that is not far from what you see out of GCC or Clang, another one is purely interpreted which means significant difference in resource utilization.

Although top positions do a lot of "cheating", take a look where most Python entries sit: https://www.techempower.com/benchmarks/#hw=ph&test=fortune&s...

Fastest Python entry starts at 234th place while .NET one is at 16th (and that with still somewhat heavy impl., upper bracket of Fortunes is bottle-necked by DB driver as well which is what top entries win with).


What a great response.

I had the same feeling here, .net is just faster , but I think a lot of teams get into this workflow where they already have a Python solution and they don't want to rewrite it.

I'm not saying everything needs to be written in a systems language like Rust, but Python always strikes me as a weird choice where performance is a concern.

I'm pretty amazed to see a JavaScript runtime ranking so high here


> > I'm pretty amazed to see a JavaScript runtime ranking so high here

> The bodies of many, many PhD students have been hurled unto a big mound to optimize JavaScript within an inch of it's life.

https://youtu.be/v1CmGbOGb2I?si=8aGIDjI44pZGiTAU&t=234


Make sure to examine code of top entries if this interests you - a lot of shenanigans there. Though some are much more well-behaved than others (.NET's one does look sketchy unfortunately, even if it doesn't have to...). Also, Solid.js is pretty much a thin wrapper on top of syscalls, so it's best to look at other JS submissions.


Do you have a source with more realistic benchmarks ?

I think an argument could be made that without real life use cases, these metrics are useless.


These, technically, are realistic benchmarks as they execute your average web application code...or what used to remotely resemble one.

Comparing purely interpreted languages or interpreted + weak JIT compiler + dynamically typed (Python, JavaScript, Ruby, PHP, BEAM family) to the ones that get compiled in their entirety to machine code (C#, Java, Go, C/C++/Rust, etc.) is not rocket science.

There is a hard ceiling as to how fast an interpreter can go - it has to parse text (if it's purely scripting language) first and then interpret AST, or it has to interpret bytecode, but, either way, it means spending dozens, hundreds or even thousands of CPU cycles in a worst case scenario per each individual operation.

Consider addition, it can be encoded in bytecode as, let's say, single 0x20 byte followed by two numeric literals, each taking 4 bytes (i32). In order to execute such operation, an interpreter has to fetch opcode, its arguments, then it has to dispatch on a jump table (we assume it's an optimized interpreter like Python's) to specific opcode handler, which will then load these two numbers into CPU registers, do addition and store the evaluation result, and then jump back to fetch the next opcode, and then dispatch it, and so on and so forth.

Each individual step like this takes at least 10 cycles (or 25 or 100, depends on complexity) if it's a fancy new big core and can also have hidden latency due to how caches and memory prefetching works. Now, when CPU executes machine code, a modern core can execute multiple additions per cycle, like 4 or even 6 on the newest cores. This alone, means 20-60 times of difference (because IPC is never perfect), and this is with the simplest operation that has just two operands and no data dependencies or other complex conditions.

Once you know this, it becomes easier to reason about overhead of interpreted languages.


So I take it unless you're doing something which requires quick prototyping or a specific machine learning library one should always avoid interpreted languages when performance is a concern ?

I love C sharp and I'm really productive in it, but I've worked at so many places which have tried to get performance out of Python.


Yes, C# is faster just because it's a faster runtime, but Python does scale reasonably well for this use case.


Thanks for the info!


In the workflow example, is ` Purchaser.purchase` supposed to be `do_purchase`?


Fixed, thanks! (it had undergone revisions hence the function name mismatch)


Good luck with that. A simple read() might block or not, depending on what the descriptor is and how it's configured. How would you statically detect this?


Like many static analyzers in ambiguous situations, you have to decide whether the possibility of false positives is worth the safety. In this case I'd flag it knowing that it could be marked ignored as needed (same with open() if it was used).

I have toyed with a MyPy plugin to do this, but the arbitrary state storage/cache is a bit limited to do good call graph and function coloring (metadata can be on types but not functions). And there's just not many other good _extensible_ static analysis libraries. Language like Go have static analysis libraries that let you store/cache arbitrary state/facts on all constructs.


You must also detect "too long loops"… which is dangerously close to solving the halting problem.


This and other similar CPU-bound operations may be unreasonable to detect at analysis time. I would encourage users to run with asyncio debug mode in tests to help catch these: https://docs.python.org/3/library/asyncio-dev.html#debug-mod.... Granted that's runtime, but there are limitations on what analysis can perform, it's just meant to help catch obvious issues.


This seems to "busy wait" when sleeping, i.e. the event loop keeps running even if nothing is currently runnable. I remember reading about another toy implementation which handled it correctly (the way I believe it's actually done in asyncio) by tracking the next-runnable times of a task in sorted order, and if nothing is currently runnable you can sleep the event loop. And then this was extended so that instead of just next-runnable depending on wall-clock time, the task could have a dependency on a socket or something, so that you can use select w/timeout.


In asyncio itself you can have custom event loop implementations in addition to the out of the box one.

One popular implementation is uvloop (https://github.com/MagicStack/uvloop) which basically just implements the loop using libuv which takes care of doing stuff like `select` as you describe.


I've encountered bugs with it. Unix sockets not being created at all. So I went back to the regular loop.


This sounds a lot like SimPy [1] / Simpy.io [2], but SimPy predates asycio by many years.

[1] https://simpy.readthedocs.io/en/latest/

[2] https://gitlab.com/team-simpy/simpy.io


There's nothing really wrong with this on one level. Your event loop doesn't even have to loop, it can just start by running your main and end when that ends. Imagine starting a server which has a while true loop that waits over a socket, and on some shutdown condition or interrupt ends, ending the program. No busy waiting, no messing with sleeps or sockets from the event loops perspective. Run until complete vs run forever.

Better off, if making some toy loop, to not bother with the forever case.


how do you wait for a socket if the event loop doesn't support waiting for sockets?


Whats the best book to learn about things like these?


There is a brilliant talk by David Beazley on asyncio. I used it to write a Discrete Even Simulation tool. Pretty cool to be able to make your own implementation of asyncio and replace the system clock with a simulated time


Link to David Beazley’s asyncio talk: https://youtube.com/watch?v=Y4Gt3Xjd7G8


dabeaz's async talks are imo the most important things you can consume to understand events loops and async, followed closely by njsmith's analysis and criticism of different methods of async invocation and call structures: https://vorpus.org/blog/notes-on-structured-concurrency-or-g...


This is brilliant stuff, a very nice high level explanation that skips everything that would make it too boring for a first time reader.

Adding other resources at the end which explain how it _really_ works under the hood would be great.


This is a similar article that I thought was pretty good: https://tenthousandmeters.com/blog/python-behind-the-scenes-...


I would have preferred if the article did the same without using yield at all, because frankly that's where I feel the real magic is.

See here for a much more in-depth description of coroutines in python,

https://aosabook.org/en/500L/a-web-crawler-with-asyncio-coro...


No mention of poll()? It's not at all how asyncio works then.


Completely insane that python decided to hijack "def" to be used to create an object that very much is not a function. At the least they could have gone was create a different keyword


It's a function, it's just been transformed into a function with a different return type (that of a generator or coroutine, you can see the type signatures here [1]). You could do something similar in vanilla Python rather than language-level sugar, eg using decorators.

[1] https://docs.python.org/3/library/typing.html#typing.Generat...

As a caveat, I point out the type signature for academic interest, I prefer to use the simpler Iterable and Awaitable types in type signatures.


I think the type signature and example makes the parent's point, frankly. It has three types, on of which is contravariant, two different usage of the yield keyword, and two exit points.

Now granted most uses of generators are simpler than that, but still that's a lot of special syntax. And if you are writing Python without type-checking, that's a lot of bookkeeping you'd have to do when reading code.


That type signature is gnarly, I agree. I've seen people get tripped up because they think the function is the generator, and are confused about what behavior to expect from concurrent readers of different instances of the generator (because they don't realize that they are different instances in the first place). So my position is more about how to work with Python effectively/understand it accurately than in defense of it's syntax.


What's the actual problem with it? Or is this just a language "purity" thing?


What? There is no tragedy here. It sounds like it just put DX semantics ahead of…what? some sort of language purity? I can’t say that I’ve ever thought “gosh! that async function is…an async function, and that’s tripped me up!” before. Not even once.


The bigger problem with Python to begin with was that you have to use def for functions, meaning you can't inline them like with JS arrow funcs or C++ lambdas. And this isn't a language purity thing, it's just inconvenient.


Are you talking about decorators? Very confused trying to parse what you're saying in the context of python.


They somewhat fixed that with "async def" but the whole thing is a huge tragedy anyway.


Do you mean they fixed the hijacking by adding the "async" keyword?

By the whole thing do you mean the async functionality in Python in general? What is the tragedy?


Yes, I mean the async keyword fixes the hijacking although the hijacked version is still there for compatibility. So much for Python 3 breaking stuff for the purpose of fixing old cruft.

The tragedy is the whole Python 3 debacle, and if they were going to make a break like that, they should have introduced BEAM-like concurrency instead of async.

For reasons I don't understand, the Python community seems to have an intense phobia and hatred towards threads. I've always used threads instead of async though, and by writing in an Erlang-like style (message passing and no shared mutable data) you can keep things reliable. This has worked ok for me with 1000s of threads, though not millions or anything like that.

Python by now is mostly legacy cruft, it feels to me sometimes. Everything else of course has its own problems.


As someone who started learning to code with C++ followed by Python and then Rust, proper threading approaches weren't drilled into my head until I started working with Rust. I agree with you that threading with message passing is a safer approach. Threading does add a level of complexity that isn't always necessary.

AFAIK it's not outlined in a PEP, but my intuition is that Python was designed to be accessible and "simple". It wasn't built to be a systems programming language. So for all its faults with the GIL I don't think it should be a major point of contention. For any task it's always about choosing the right tool for the job.

There is considerable effort at the moment to remove the GIL in future versions and implement better threading support. At 3.11/3.12, it has come a long way in terms of performance.


The problem with threads in Python is they're 1:1 with OS threads in every major implementation. It's not like Golang where you have greenthreading, and you don't even get all the benefits of regular threads due to the GIL. It's been too long since I've used Erlang, but the way I remember it, the concurrency is almost like using RPCs. Which is great in some applications, but Python users want shared mutable data for simple things and will make microservices for larger systems.

For a simple language, Python has so many different ways to do concurrency, some of which are misleading to new users. Event loop ("async") would've been nice and simple from the start, but they were late to that party. Everything vaguely related to packaging is also confusing, from the __init__.py files to the PIP/Conda/whatever stuff. They had a whole breaking Py2->3 transition and still didn't fix that.


The GIL is an implementation artifact mostly due to the reference counting memory management. Besides phobia towards threads, the Python crowd also seems phobic towards tracing garbage collection. There is an ongoing project to make CPython run without the GIL yet still use reference counting. I wonder how many bugs this will surface or introduce.

Most concurrent Python code that I've written has not been cpu intensive, FWIW. Just an application talking to a bunch of i/o ports or sockets, but mostly idle. My CPU intensive stuff has tended to be single threaded so I can just launch instances in separate processes. YMMV.

Yeah the tooling situation is crazy enough that I don't begin to understand it.


Yeah, Python is made for non-CPU-intensive, IO-bound concurrent code. But that use case is a great fit for async/event-loop.


I'm also curious to know what the tragedy is.


Dunno if they meant this, but I feel that adding async to a language/ecosystem where it was earlier not used makes everything awkward for a long time. Especially if it's done liberally like python did, with both sync and async styles still supported.

It feels like the worst of both worlds at times. I think Rust suffers from this too, whereas Node.js has managed to make async the default mode of operation for enough things that it doesn't feel so bad. Especially with Typescript.


>> whereas Node.js has managed to make async the default mode of operation for enough things that it doesn't feel so bad

Javascript is inherently async


Well yeah. That's probably why it feels better there.


Obligatory video, NSFW (lots of swearing):

https://www.youtube.com/watch?v=bzkRVzciAZg

"Node.js is bad ass rock star tech".


I actually learned a lot from watching that video.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: