Show HN: Bringing multithreading to Python's async event loop (github.com/neilbotelho)
82 points by nbsande 3 months ago | 46 comments
This project explores the integration of multithreading into the asyncio event loop in Python.

While this was initially built with enhancing CPU utilization for FastAPI servers in mind, the approach can be used with more general async programs too.

If you’re interested in diving deeper into the details, I’ve written a blog post about it here: https://www.neilbotelho.com/blog/multithreaded-async.html




> It got me wondering if it was actually possible to make Python’s async event loop work with multiple threads.

There is built-in support for this. Take a look at loop.run_in_executor. You can await something scheduled in a separate Thread/ProcessPoolExecutor.

Granted, this is different from making the async library end-to-end multi-threaded as you seem to be trying to do, but it does seem worth mentioning in this context. You _can_ have async and multiple threads at the same time!
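
For anyone who hasn't used it, a minimal sketch (blocking_call here is just a made-up stand-in for blocking work):

    import asyncio
    import time
    from concurrent.futures import ThreadPoolExecutor

    def blocking_call():
        time.sleep(1)  # stand-in for blocking I/O or a GIL-releasing call
        return "done"

    async def main():
        loop = asyncio.get_running_loop()
        with ThreadPoolExecutor() as pool:
            # The coroutine suspends here; the blocking work runs on a
            # pool thread while the event loop stays free.
            result = await loop.run_in_executor(pool, blocking_call)
        print(result)

    asyncio.run(main())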


run_in_executor is pretty powerful for running sync code in async code, but my use case was more about making async code utilize the CPU better. I think just using run_in_executor would add a lot of complication and change how you use async/await. But great point nonetheless!


I've had plenty of success using run_in_executor in production. You basically just use the decorator, and it mostly just works.


Which decorator?


I went back to look at some of the old code as my memory was hazy. It was Tornado's run_on_executor method that we used, which is applied as a decorator.

https://www.tornadoweb.org/en/stable/concurrent.html#tornado...
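
Roughly how that looks per the Tornado docs (the Worker class and heavy method are invented for the example; run_on_executor picks up self.executor by default and returns an awaitable Future):

    from concurrent.futures import ThreadPoolExecutor
    from tornado.concurrent import run_on_executor

    class Worker:
        def __init__(self):
            # run_on_executor looks for an `executor` attribute by default
            self.executor = ThreadPoolExecutor(max_workers=4)

        @run_on_executor
        def heavy(self, n):
            # Runs on a pool thread; callers on the IOLoop can
            # simply `await worker.heavy(1000)`
            return sum(i * i for i in range(n))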


> but my use case was more making async code utlize the cpu better

But run_in_executor achieves that as well! If you use no-GIL Python and a thread pool (or GIL Python with a Process pool), you will utilize more CPU cores.


run_in_executor, which can also be spelled asyncio.to_thread(), won't help utilize the CPU for pure Python code. It may help with blocking I/O, or CPU-heavy C extensions that release the GIL. Otherwise, you have to use separate processes to take full advantage of your CPUs (there are also subinterpreters, i.e. same process, separate GILs, but those are exotic).


>run_in_executor that can be spelled as asyncio.to_thread() won't help utilizing cpu for pure Python code

That isn't true. If you use a ProcessPoolExecutor as the target instead of the default executor, you will use multiple processes in pure Python code.
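
A minimal sketch of that (fib is just a made-up CPU-bound stand-in):

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def fib(n):
        # Pure-Python CPU-bound work that would hold the GIL in a thread
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    async def main():
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            # Each call runs in its own process, so they use separate cores
            results = await asyncio.gather(
                *(loop.run_in_executor(pool, fib, 30) for _ in range(4)))
        print(results)

    if __name__ == "__main__":
        asyncio.run(main())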


The default executor must be a ThreadPoolExecutor, but you're right that you can pass a ProcessPoolExecutor too.

https://docs.python.org/3/library/asyncio-eventloop.html#asy...


I am fairly confident I will get some down votes for this, but here goes...

When I am trying to solve a technical problem, the problem is going to dictate my choice of tooling.

If I am doing some fast scripting or I need to write some glue code, python is my go-to. But if I have a need for resource efficiency, multithreading, non-blocking async I/O, and/or high performance, I would not consider python - I would probably use the JVM over the best python option.

Don't get me wrong, I think it's a worthwhile effort to explore, and I certainly do not think it's wasted effort (quite the opposite, this gets my upvote). I just don't think I would ever use it if I had a use case for perf and resource efficiency.


Surely your comment only really makes sense from the point of view of green field or hobbyist projects? If you were working for an organization with hundreds of thousands of lines of Python already doing something important, then your last sentence doesn't hold, right?


Sure, this works sometimes. But sometimes you have mountains of code and infrastructure dedicated to one platform, and it's worth the effort to round off the occasional square peg in the interest of operational simplicity/consistency.

I've been using ThreadPoolExecutors in Python for a while now. They seem to work pretty well for my use cases. Granted, my use cases don't require things like shared memory segments; I use the as_* functions in concurrent.futures to recombine the data as needed. Honestly, I prefer the futures functions as I don't need to think about deadlocks.
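
That pattern looks roughly like this (a sketch; fetch and the URLs are placeholders):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def fetch(url):
        # Placeholder for blocking I/O work
        with urllib.request.urlopen(url) as resp:
            return url, resp.status

    urls = ["https://example.com", "https://example.org"]
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        # as_completed yields futures as they finish, in completion order
        for fut in as_completed(futures):
            url, status = fut.result()
            print(url, status)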


> Sure, this works sometimes. But sometimes you have mountains of code and infrastructure dedicated to one platform, and it's worth the effort to round off the occasional square peg in the interest of operational simplicity/consistency.

I agree with this, this is a fair trade off, but not the direction I would go as a matter of preference.


What's the alternative, though?

1. Rewrite the whole thing

2. Carve out the high-perf component into a separate system and also deal with the overhead of marshalling data between two different systems?


The main problem is the in-between land, where you need 10x-50x the performance of Python, not 500x the performance on some parallelizable workload.

And in many teams, just having to worry about python makes it easier to keep team members productive if they're not expected to handle several different languages productively.


Fair point. Though I will push back on...

> And in many teams, just having to worry about python makes it easier to keep team members productive if they're not expected to handle several different languages productively.

I think this makes sense for individuals and teams, but for an org or company I think having specialist teams makes sense, where teams that require perf use the JVM and teams that build business-ware, devops tooling, or other non-perf software use python.


JVM based development has a place for some teams. Others need to go all the way to C++/C/Rust etc for the performance they need.

But plenty of tasks can be done beautifully in Python. That's especially true in a data processing or ML setting where most of the heavy lifting is done in libraries such as numpy, spark, pytorch etc. (Also Python is the industry standard for such teams).

Still, even for such teams there are times where you want to do SOME heavier compute tasks within the core language, and offloading this to some other dev team simply doesn't scale.

The solution is to use multiprocessing instead of multithreading. But this workaround is quite inflexible.

Some dev teams may have developers that can deliver this in scala (especially if spark is involved). Others may have the ability to build C++ (or CUDA) libraries to add to the python environment.

But the ability to run somewhat heavier processing than what a single process can achieve would often be much better.

Cost wise it also makes a lot of sense. Such steps may often find themselves on some large compute cluster where you have tens or hundreds of processors (or more) available. If a single step in a processing pipeline on such a cluster can be cut from 2 hours to 1 minute, it can be a large saving. Taking it from 1 minute to 10 seconds means a lot less.

Btw, and with all due respect, the part about teams that require perf using JVM doesn't really match my experience. Where I come from, the Java devs tend to produce the slowest code of all, mostly because every step of the processing is serialized/deserialized as microservices talk to each other for each data element.

Even python based code is often faster (sometimes by orders of magnitude). Partly because of cultural differences between teams (the python code, even when exposed as microservices, tends to work with larger blocks of code), and partly because the real work is processed in C++ based libraries within python, which still have a significant edge on JVM based code.

Don't get me wrong: Java has a lot of advantages for many types of business applications, where the business logic complexity can be abstracted in well organized and standardized ways. But it's not typically the go-to language when seeking maximum performance in heavy compute or massive data volume scenarios.


It leads to other problems. A lot of orgs have specifically moved away from specialist teams towards teams united by the business mission.


The problem is that your example is not the one most companies that use python are facing, which covers the majority of python code out there. They want some kind of performance uplift without rewriting the whole python code base. It's cheaper if python keeps getting some kind of upgrade.

An example is Facebook's PHP to Hack compiler.


The use case I wrote it in mind with is FastAPI. In that case, there wouldn't be any change to the Python code. You'd just use a different ASGI server that would use this sort of multithreaded event loop. So instead of running it with uvicorn main:app, you'd run it with alternateASGI main:app.

I have an example of a very basic ASGI server that does just that towards the end of the blog post.
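
For context, an ASGI app is just an async callable, so the application code stays the same no matter which server drives it. A minimal sketch:

    async def app(scope, receive, send):
        # The server (uvicorn or otherwise) calls this for each request;
        # the app itself doesn't care how the event loop is threaded.
        assert scope["type"] == "http"
        await send({
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/plain")],
        })
        await send({"type": "http.response.body", "body": b"Hello, world!"})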


Just playing devil's advocate here.

> They want some kind of performance uplift without rewriting the whole python code base.

In order to take advantage of multi-threading and/or async I/O, you would need to rewrite your code anyway, right? And at that point, wouldn't rewriting in a different language be an option?


> you would need re-write your code anyway, right?

Heavily restructure, sure. Rewrite? Probably not.


Not really. The engineering effort of rewriting the entire codebase into a different language is astronomical. Besides mapping the logic to the new language, think about all the quirks between languages that you need to deal with. In the worst cases, you have to come up with entirely new code. Nobody wants to pay for that.

Upgrading the language, however, is way easier, and you usually have an official upgrade guide about what to do. It's also much safer and easier to deploy and test.

Once we have a sane multithreading path in python, there will be even less incentive to rewrite the code.


> Not really. The engineering effort of rewriting the entire codebase into a different language is astronomical.

Not to mention that python is actually a good language choice for many types of environments, and basically the industry standard for fields like ML/AI and supporting data pipelines.

Wherever python is used for heavy duty number crunching or large data volumes, most processing is handled by libraries written in other languages, while python handles program flow and some small parts that need custom code. That last part can currently be quite expensive.

Migrating the whole codebase to another language for such setups would simply be absurd.

Still, for the small percentage of such codebases that DOES do semi-heavy data crunching, real multithreading would be nice so one can avoid resorting to multi-processing or implementing these parts as custom C++ libraries, or similar.


Hard agree. If you want resource efficiency and high performance you're probably better off looking at lower-level languages most of the time. In my experience FastAPI usually gets used by teams that need a server done quickly and simply, or that are constrained by a lack of experience in low-level languages. That being said, I do think it's worthwhile trying to improve efficiency slightly even for these cases.


Perhaps a silly question, but should an event loop actually be multithreaded?

My understanding was that tasks in an event loop should yield after they dispatch IO tasks, which means the event loop itself shouldn't be CPU-bound, right? If so, multithreading should not help much in theory?


If your workload is actually CPU-bound after you've deferred IO to the background, that's exactly when multithreading can help.

I've seen code that spends disproportionate CPU time on e.g. JSON (de)serializing large objects, or converting Postgres result sets into native data structures, but sometimes it's just plain ol' business logic. And with enough traffic, any app gets too busy for one core.

Single-threaded langs get around this by deploying multiple copies of the app on each server to use up the cores. But that's less efficient than a single, parallel runtime, and eliminates some architectural options.


You are correct, async-io is cooperative. This seems to be an attempt to make these async-io cooperative tasks work more like goroutines in golang. Golang can start threads if it "thinks" the workload needs more CPU.


This was exactly my question. Why do you even need an event loop? If awaits are just thread joins then what is the event loop actually doing? IO can just block, since other coroutines are on other threads and are unaffected.

Which is to say, why even bother with async if you want your code to be fully threaded? Async is an abstraction designed specifically to address the case where you're dealing with blocking IO on a single thread. If you're fully threaded, the problems async addresses don't exist anymore. So why bother?


Looking at the article, he's not implementing `Task` with `Thread` - he's round-robinning `Task`s through a simple `ThreadPool`. So instead of a single `Thread` making continuous progress on the work in the event loop, he has a set of `Thread`s making progress _in parallel_ on work in the event loop. This is very much Java 21's approach to virtual threads (as well as in-language task runners of the kind you find in Scala libraries like ZIO, Monix, Cats, and the venerable Scalaz).


How is that materially different than just making async invocations thread forks and awaits into joins? I understand what the code is doing, I just don't understand what the point is, when it seems like the net effect is the same as just writing threaded code.


The difference is that you can spin up only so many OS threads, while you can run several orders of magnitude more "green threads" / "tasks" like this that round-robin onto the system threads that comprise your event loop executor. The key thing to understand is that `await` doesn't block the backing thread; it simply suspends the current task (the backing thread moves on to picking up the next ready task from the queue and running it to its next await point).
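
To make that concrete, here's a toy M:N sketch (not the linked project's actual code): generators stand in for tasks, `yield` plays the role of an await point, and a few worker threads round-robin over a shared ready queue:

    import queue
    import threading

    tasks = queue.Queue()

    def countdown(n):
        while n:
            n -= 1
            yield  # "await point": hand the thread back to the scheduler

    def worker():
        while True:
            try:
                task = tasks.get_nowait()
            except queue.Empty:
                return  # no ready work left (good enough for a toy)
            try:
                next(task)       # run the task to its next yield point
                tasks.put(task)  # not finished: back into the ready queue
            except StopIteration:
                pass             # task completed

    for _ in range(10_000):      # far more tasks than threads
        tasks.put(countdown(5))

    workers = [threading.Thread(target=worker) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()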


If I understand correctly, it sounds like the idea is to map N tasks to M threads.

I suppose it’d only really be useful if you have more tasks than you can have OS threads (due to the memory overhead of an OS thread), then maybe 10,000 tasks can run in 16 OS threads.

If that’s the case, then is this useful in any application other than when you have way too many threads to feasibly make each task an OS thread?


The idea is to map N tasks to M threads. This is useful for more than just when you need more threads than the OS can spin up. As you scale up the number of threads, you increase context-switching and CPU-scheduling overhead. Being able to schedule a large number of tasks with a small number of threads can reduce this overhead.


Having too many threads all running at the same time can also cause a performance hit, and I don't mean hitting the OS limit on threads. The more threads you have running in parallel (remember, this is considering a GIL-less setup), the more you need to context switch between them. Having fewer threads, all running an event loop, allows you to manage many more events with only a few threads, for example by setting the number of event loop threads to the number of cores on the CPU.


At that point, why bother with asyncio? What we really want is something like Java virtual threads, something that doesn't have a code color.


gevent is exactly this! http://www.gevent.org/

My startup has been using it in production for years. It excels at I/O bound workflows where you have highly concurrent real-time usage of slow/unpredictable partner APIs. You just write normal (non-async) Python code and the patched system internals create yields to the event loop whenever you’d be waiting for I/O, giving you essentially unlimited concurrency (as long as all pending requests and their context fit in RAM).

https://github.com/gfmio/asyncio-gevent does exist to let you use asyncio code in a gevent context, but it’s far less battle-tested and we’ve avoided using it so far.
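
The basic pattern, for anyone unfamiliar (a sketch; the URL and counts are placeholders):

    from gevent import monkey
    monkey.patch_all()  # patch sockets, sleep, etc. to yield to the hub

    import gevent
    import urllib.request

    def fetch(url):
        # Plain blocking code: the patched socket yields to the event
        # loop while waiting on the network, so fetches run concurrently.
        with urllib.request.urlopen(url) as resp:
            return resp.status

    jobs = [gevent.spawn(fetch, "https://example.com") for _ in range(20)]
    gevent.joinall(jobs, timeout=10)
    print([job.value for job in jobs])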


I love gevent, but I never was 100% sure that nothing was secretly breaking, or that there wasn't some weird thread-safety issue. In a large SaaS app, all sorts of 3rd party libs do weird background threading stuff, or someone randomly starts doing threading.local and shared global context. After hitting some weird hanging redis-py client issues, I turned gevent off and they went away. Never really got around to spending the time to debug the issue (especially since it happened on prod and was hard to replicate on stage/local).

Does your app have a lot of dependencies that do background threads? Like LaunchDarkly (feature flags), redis, spyne (RPC), and so on.


We also heavily use gevent but this is indeed the greatest frustration. Random and difficult to diagnose issues in external libraries like sockets being closed prematurely or timing out.


Not for those of us stuck in Java 8... Is it stable already?


I hear that some Amish sects permit the use of technology that's older and proven not to be too worldly, like washing machines, chainsaws, and Java 11. Have you considered converting?


Virtual threads only became stable in Java 21, unfortunately (https://openjdk.org/jeps/444). If the issue is "I'm bound to Java 8" proposing running a research build of Java 11 in production is going to fly as well as a lead balloon on the moon.


Even so, it's pretty new, isn't it? I don't quite trust the claim it's completely transparent to applications and libraries...


Yeah... Unfortunately golden handcuffs are binding me to the financial sector and, specifically, to Hadoop


Hadoop >= 3.3 supports Java 11.


Hmmm. That would indeed be better. Seems like an interesting experiment to try and implement virtual threads for python!



