Faster CPython 3.12 Plan (github.com/faster-cpython)
305 points by bratao on Sept 20, 2022 | 139 comments



https://github.com/faster-cpython/ideas/wiki/Python-3.12-Goa... is interesting too.

> Python currently has a single global interpreter lock per process, which prevents multi-threaded parallelism. This work, described in PEP 684, is to make all global state thread safe and move to a global interpreter lock (GIL) per sub-interpreter. Additionally, PEP 554 will make it possible to create subinterpreters from Python (currently a C API-only feature), opening up true multi-threaded parallelism.

Very basic question: in a world where a Python program can spin up multiple subinterpreters, each of which can then execute on a separate CPU core (since they don't share a GIL), what will the best mechanisms be for passing data between those subinterpreters?


> There are a number of valid solutions, several of which may be appropriate to support in Python. This proposal provides a single basic solution: “channels”. Ultimately, any other solution will look similar to the proposed one, which will set the precedent. Note that the implementation of Interpreter.run() will be done in a way that allows for multiple solutions to coexist, but doing so is not technically a part of the proposal here.

> Regarding the proposed solution, “channels”, it is a basic, opt-in data sharing mechanism that draws inspiration from pipes, queues, and CSP’s channels.

> As simply described earlier by the API summary, channels have two operations: send and receive. A key characteristic of those operations is that channels transmit data derived from Python objects rather than the objects themselves. When objects are sent, their data is extracted. When the “object” is received in the other interpreter, the data is converted back into an object owned by that interpreter.

https://peps.python.org/pep-0554/#shared-data


As someone who uses channels all the time (in Nim) for cross-thread comms, this is pretty exciting. The deep-copy that Nim channels do makes things simpler at the cost of more memory allocations obviously, but even on an ESP32-S3 it's been a great abstraction. Of course I get to cheat and use actual shared memory with FreeRTOS semaphores/mutexes and such when it's really required, but having channels as the first-class easy-to-use mechanism is the right move in my opinion (which is worth about as much as you just paid for it, of course).


> Along those same lines, we will initially restrict the types that may be passed through channels to the following:

> * None

> * bytes

> * str

> * int

> * channels

> Limiting the initial shareable types is a practical matter, reducing the potential complexity of the initial implementation.

That's a really interesting detail - presumably channels can be passed so you can do callbacks ("reply on this other channel").

I wonder why floats aren't on that list? I know they're more complex than ints, but I would expect they would still end up with a relatively simple binary representation.


I don’t know why floats aren’t included, but any float can be easily represented by an int with the same bits, or a bytestring, using the struct module to convert between them, so there are clear workarounds.
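For instance, a quick sketch of the int-based workaround (assuming IEEE-754 doubles, which is what CPython floats are):

    import struct

    def float_to_int(f):
        # Reinterpret the 8 bytes of the double as an unsigned 64-bit int.
        return struct.unpack("<Q", struct.pack("<d", f))[0]

    def int_to_float(i):
        # Reverse the reinterpretation on the receiving side.
        return struct.unpack("<d", struct.pack("<Q", i))[0]

    assert int_to_float(float_to_int(3.5)) == 3.5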


Fun fact, floats are easier than `int` because `float` has a predictable size.


> That's a really interesting detail - presumably channels can be passed so you can do callbacks ("reply on this other channel").

Callbacks or notifications, yes. I use both patterns quite often.


Sounds like an oversight…


Passing channels over other channels used to be a reasonably standard trick in Go, IIRC.


Any idea what a "channel" actually is/how it is implemented?


Wild guess, a double-ended queue guarded by some kind of locking mechanism?
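Something like this toy version, maybe (purely an illustration of the guess, not how PEP 554 channels are actually implemented):

    import copy
    import threading
    from collections import deque

    class ToyChannel:
        def __init__(self):
            self._items = deque()
            self._cond = threading.Condition()

        def send(self, data):
            # Real channels transmit data derived from objects rather than the
            # objects themselves; a deep copy stands in for that here.
            with self._cond:
                self._items.append(copy.deepcopy(data))
                self._cond.notify()

        def recv(self):
            with self._cond:
                while not self._items:
                    self._cond.wait()
                return self._items.popleft()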


> Very basic question: [not basic at all question which has been the subject of decades of research and produced several specialized programming models]

(Brackets my own of course.)

Sharing data in concurrent programs is not trivial, especially in environments where data is mutable. The most trivial answer to the question is “message passing”, as in the Smalltalk notion of OOP or the Erlang/OTP Actor Model. Some solutions look much more like working with a database (Software Transactional Memory). Some models that seem entirely designed for a different problem space are also compelling (various state models common in UI and games, like reactivity and Entity Component Systems).


PyPy pursued the STM route many years ago but later abandoned it.

https://doc.pypy.org/en/latest/stm.html


I was gonna ask why it was abandoned but then I read through the caveats, and well… yikes!


Is there an explanation of why they abandoned it? It seemed very cool at that time...


This plan sounds very much like Ruby Ractors, which are essentially sub-interpreters, each with their own GVL.

Shareable data is basically immutable data + classes/modules, and unshareable data can be transmitted via push (send+receive) or pull (yield+take). Transmission implies either deep copying (which "forks" the instances) or moving with ownership change (sender then loses access)

See here for details: https://docs.ruby-lang.org/en/master/ractor_md.html


Depends on what criteria you use for "best".

If it's performance, then, since subinterpreters run in the same process, it would be global shared state. You can't use Python objects across subinterpreters, but raw byte arrays will work just fine, provided you do your own locking correctly around all that.
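Since subinterpreters aren't scriptable from Python yet, here is the rough idea with ordinary threads standing in for them (raw bytes plus your own lock, no Python objects crossing the boundary):

    import threading

    shared = bytearray(64)          # raw byte region visible to everyone
    shared_lock = threading.Lock()  # correctness of the locking is entirely on you

    def writer(payload):
        with shared_lock:
            shared[:len(payload)] = payload

    def reader(n):
        with shared_lock:
            return bytes(shared[:n])

    t = threading.Thread(target=writer, args=(b"hello",))
    t.start(); t.join()
    print(reader(5))  # b'hello'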


Python would need to implement a multiconsumer multiproducer ringbuffer or a non-blocking algorithm (I'm not sure if it is wait-free) such as the actor system I implemented below.

To apply this to Python, the subinterpreters could transfer ownership of the refcounts between subinterpreters as part of an enqueue and dequeue.

I believe the refcount locking approach has scalability problems between threads.

I implemented a multithreaded actor system with work stealing in Java and message passing can get to throughputs of around 50-60 million messages per second without blocking or mutexes. The only lock is not quite a spinlock. I use an algorithm I created but inspired by this whitepaper [1], which is simple but works. It's probably a known algorithm but I'm not sure of the name of it.

I have a multidimensional array of actor inboxes (each actor has multiple buffers for filling by other threads, to lower contention to 0), and then there is an integer stored for the thread that is trying to read or write to the critical section.

The threads all scan this multidimensional array forwards and backwards to see if another thread is in the critical section. If nobody is there, a thread marks the critical section. It then scans again to see if it is still valid. It's similar to going into a room and scanning the room left and scanning the room right. Surprisingly this leads to thread safety. I wrote a Python model checker to verify the algorithm is correct.

Without message generation within threads, it can communicate and sum 1 billion integers in 1 second due to parallelism (it takes 2 seconds to do this with one thread). It takes advantage of the idea that variable assignment can transfer any amount of data in an assignment.
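A rough Python analogy: handing a reference through a queue moves only the reference, so the size of the payload doesn't matter:

    import queue, threading

    q = queue.Queue()
    big = list(range(1_000_000))

    def consumer():
        data = q.get()       # the same list object arrives; only a reference moved
        print(sum(data))

    t = threading.Thread(target=consumer)
    t.start()
    q.put(big)               # enqueue the reference, not a copy
    t.join()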

See Actor2.java (1 billion sums a second, messages created in advance), Actor2MessageGeneration.java (20 million requests per second, messages created as we go), or Actor2ParallelMessageCreation.java (50-60 million requests per second, with parallel message creation).

There's also a Java multiconsumer/multiproducer ringbuffer in this repository [3], which I ported from Alexander Krizhanovsky [2].

[1]: https://lag.net/papers/content/leftright-extended.pdf

[2]: https://www.linuxjournal.com/content/lock-free-multi-produce...

[3]: https://github.com/samsquire/multiversion-concurrency-contro...


Redis or some hand-rolled message queue.


> move to a global interpreter lock (GIL) per sub-interpreter

hallelujah!

Global variables are evil. The fact that sub-interpreters aren’t currently possible in Python is one of my canonical examples of why they’re evil.


Does anyone know what came of Sam Gross’s proof of concept that removed the GIL entirely? Is that proposal effectively dead?


I think it's been decided that such a change was so large that it would require a major version change in Python. However, that may just be unauthoritative hearsay I picked up in another comment thread here on HN. But it stands to reason that removing the GIL will almost certainly change Python's memory model in ways that could break code, and that would warrant a major version bump.


> I think it's been decided that such a change was so large that it would require a major version change in Python.

Hah, I wonder what else Python 4 could have in it.

The Python 2 to 3 migration was hard enough and there were certain challenges along the way (mostly package availability and syntax changes, though the same is happening with new Vue versions), but it seems that by most metrics Python 3 was indeed an improvement, apart from the startup time.


Can we have block scope?


An improvement that was not worth the cost. I guess many people aren't looking forward to Python 4.


I tend to disagree on no other basis than I've found Python 3 to be a lot friendlier to use than Python 2. Also, a number of scripts I have operate quicker under Python 3, not by a lot, but it's still a small win.


Half the world wouldn't be using python if it wasn't for the typing module. The other half is probably waiting for speed improvements.


>Half the world wouldn't be using python if it wasn't for the typing module

Hardly.

The majority of Python users don't use the typing module.


Oh you went for number of programmers, that isn't what I meant (obviously). Think influence. Think dropbox, uber, amazon. And think stripe trying to add type annotations to ruby. This is what I meant.


Well, Dropbox, Uber and Amazon used Python way before the typing module... and many others, including Google.

They might want it in (and helped add it), but it's not what made them use Python.


Okay, I will spell it out: large companies require robust code and typing provides it.


Okay, my $0.02: this is mostly a periodic trend. Large companies had static codebases and switched many codebases to dynamic types circa 2000-2010.

And we've even been around this circle before: a lot of programming in the 60s and 70s was untyped, then everyone switched to typed C++, Delphi and then Java.


> that isn’t what I meant (obviously)

It was not at all obvious from your post.


Just so I understand: “half the world” is now a measure of influence, not people? And this is supposed to be obvious?


Being very generous to /u/nurettin, I think maybe they mean that the use of said module by a particularly influential group of developers has the byproduct of broader Python use by folks who might not use said module.

I see some mild sense in this argument given how TypeScript has taken off and dispersed into audiences who wouldn't ordinarily be interested in such a thing. I'm not sure it works in the Python world though, since Python's latter day upward trajectory is probably more oriented around heavy use in education, science, ML, PyTorch, et al?


There was nothing stopping them from adding the typing module and syntax to Python 2. The issue was more or less the forced, painful backwards compatibility break; in hindsight, that could have been avoided while still giving us a lot of new goodies.
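In fact, PEP 484's comment syntax already worked on 2.x code, which is how tools like mypy checked Python 2 codebases:

    def add(a, b):
        # type: (int, int) -> int
        return a + b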


Looks like it's still active to me:

https://github.com/colesbury/nogil/


Tested it this morning and it is actually slower in a combined benchmark than 3.11.


AFAIK it was implemented in 3.11. That is, all of it except for the GIL removal itself, which actually decreased performance for single-threaded code; the actual improvement was elsewhere.


I find that difficult to believe, as the "What's New in Python 3.11" release notes (https://docs.python.org/3.11/whatsnew/3.11.html) don't mention "GIL" or "Global Interpreter Lock" at all. A change of that magnitude would definitely get a mention there.


No, it has not been done. I published a post in Reddit related to this experiment that you can check here if you want: https://www.reddit.com/r/programming/comments/q8n508/prototy...


Isn't this exactly what I wrote?


The GIL removal was implemented, apart from the GIL removal part?


No. Sam Gross’s PoC included a number of optimisations besides the gilectomy. It was those optimisations that made it faster, while the GIL removal slowed it down again; so only the optimisations were implemented.


When performing gilectomization, it's better to undergilectomize than overgilectomize.


It's ok, you can get a reverse gilectomy


Some of "Sam Gross’s proof of concept" was implemented of which the GIL removal was only a part. The rest was performance improvements.


Yes.

The "GIL removal" umbrella proposal was two-fold. It included (a) removing the GIL, (b) several optimizations to handle some issues with GIL being removed and offset the GIL removal overhead (due to more frequent lock checks, etc).

The GIL-removal assisting changes and optimizations were merged, but the GIL removal was not.


And also the very first item of the 3.12 plan (the subject of this post) is about tweaking the GIL.


> Expose multiple interpreters to Python code

> Implement PEP 554

> PEP 554 - Multiple Interpreters in the Stdlib

That's going to be fun. Why fight the GIL when multithreading, when you can just get around it with more interpreters?


How is this different from multiprocessing? The examples look like a complete nightmare...

    interp = interpreters.create()
    interp.run(tw.dedent("""
        import some_lib
        import an_expensive_module
        some_lib.set_up()
        """))
    wait_for_request()
    interp.run(tw.dedent("""
        some_lib.handle_request()
        """))
I'm actually shocked this is even being contemplated. We've regressed to evaling?


There's a massive difference to multiprocessing: the different sub-interpreters can use a C(++)/Rust extension module to talk to shared state. In the current multi-processing world, the whole C++/Rust state needs to be duplicated for each process (in the case of our app, this means 5 GB memory usage per core); with subinterpreters, we can share the same C++/Rust state.

The `interpreters` API is just the starting point. Compare it with `subprocess`, not with `multiprocessing`. Once subinterpreters are useful, people will build higher-level APIs for them.
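For example, something like this purely hypothetical wrapper (only `interpreters.create()` and `run()` come from the PEP; everything else here is made up):

    import textwrap

    class Worker:
        """Hypothetical convenience layer over the low-level API."""

        def __init__(self, setup_code):
            # "interpreters" is the stdlib module proposed by PEP 554;
            # nothing in this sketch exists yet.
            self._interp = interpreters.create()
            self._interp.run(textwrap.dedent(setup_code))

        def run(self, code):
            # A real design would return results over a channel instead of
            # relying on side effects inside the sub-interpreter.
            self._interp.run(textwrap.dedent(code))

    worker = Worker("""
        import some_lib
        some_lib.set_up()
        """)
    worker.run("some_lib.handle_request()")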


It's just a PoC to show independently running code; the exec() is not how you will use it, but it simulates importing Python code and running it (because an exec() is what import does :)).


It's the minimum needed to get something going, and any really sane use of it would be evalling something like

    import worker
    worker.run()
The PEP explicitly mentions this, and that something like subinterpreter.run(func, ...) could be considered in the future: https://peps.python.org/pep-0554/#interpreter-call


I think this is so smart. The main thing holding back replacement of the GIL at the moment is that there is a VAST existing ecosystem of Python packages written in C/etc that would likely break without it.

Multiple interpreters with their own GIL keep all of that existing code working without any changes, and mean we can run a Python program on more than one CPU at the same time.


Only C extensions that themselves have no global state and don't depend on the GIL for locking, which most of them do. So they will all require some porting, and it will take time since it requires newer CPython API only available in 3.9+ and some even 3.11+ (PEP 630).


But isn't this basically just nicer multiprocessing?


It's much nicer if you're using the C API...


No, they are in-process and able to access the same memory, although probably put behind a communication layer.


I do think that there’s been a lot of work around GIL removal, and every talk seems to end at the reality that the GIL lets you avoid a loooooot of locking structures, and when you remove it you end up needing many granular locks.


It comes at a cost, of course. You don't really have shared memory state, which is often easiest to conceptually think about.

So you are just transforming the problem into a data sharing problem between interpreters, which requires careful thought on both the language side for abstractions, and the consumer side to use right.

It also makes the tooling and verification much harder in practice - for example, you aren't reasoning about deadlocks in a single process anymore, but both within a single process and across any processes it communicates with.

At an abstract level, they are transformable into each other. At a pragmatic level, well, there is a good reason you mostly see tooling for single-process multi-threaded programs :)


> which is often easiest to conceptually think about

Absolutely, but it is also the easiest to shoot yourself in the foot with. Trade-offs! I'm biased though: I'm a big fan of deep-copy channels (which for small shallow objects is still fast), though not having the option at all for shared memory here will be a bit of a pain for certain things, of course.


If all global state is made thread safe, then whether threads run in subinterpreters or a single interpreter is conceptually irrelevant, and probably easier to implement.


It really really depends on what you mean by global state here. If you mean the global state within the interpreter that's one thing. Preserving the global state of your application is another.

But a weird "global state" (really more a global property) is the semantics between concurrent pieces of code and the expectations about things like setting variables, possibly interleavings etc.

The nice part of different interpreters isn't just getting around the GIL and maintaining similar isolation, but it's almost like a Terms of Service agreement: I opened this can of worms and it's my responsibility to learn what the consequences are.


It is not conceptually irrelevant. With threads, you can create a Python object on one thread, store it into a global (or some shared state), and use that object from a different thread. You can't do that with subinterpreters though.


I think the last decade or so of programming has taught us that people just plain suck at multithreading. Go and Rust are languages that solve this problem in different ways. It would be a tragedy if Python went back to the old way and didn't have a better solution.


"Went back" implies that threads and shared state are not the status quo. They definitely are in Python (and, realistically, they also are in general, given the degree of Rust adoption vis a vis other PLs). So Python will have support them, if only so that we don't have to rewrite all the Python code that's already around. A new language has the luxury of not caring about backwards compatibility like that.

Also, Go doesn't really solve the problem - sure, it has channels, but it still allows for mutable shared state, and unlike Rust, it doesn't make it hard to use.


In my career, I would say 95% of parallelism does not require low level threading primitives like locks. A lot of it is solved by queues which can be provided by the runtime. The rest of the 5% usually takes up 25% of the debugging, lol.


Couldn't you transfer ownership between subinterpreters with shared memory outside the subinterpreter?

Associate some shared memory with each subinterpreter (the same array or map)


It's not a question of having a mechanism to transfer data. Sure, you can easily use a static global in a native module to easily transfer a reference across subinterpreter boundaries. But the moment you try to increment refcount for the referenced object, things already break, because you're going to be using the wrong (subinterpreter-global) lock.


Oh I was thinking from a native python object perspective.

You could have a rule that the refcount must be 1 when sending an object between subinterpreters.

In other words, you cannot keep using an object that was .send()'d to another subinterpreter.

Then you invalidate the reference in the sending subinterpreter when it calls send to the other subinterpreter; the data itself is transferred by assignment.

Can transfer any amount of data with zero copies.


Even a completely empty object will contain a reference to its type, which is itself an object. How will you marshal that? Bear in mind that each subinterpreter has its own copy of type objects, and there isn't even a guarantee that those types match even if their names do.


It sounds like a problem, but I feel it can be solved with engineering and mathematics. Not saying it would be easy though.

Couldn't you separate the storage of the refcounts from the objects and use a map to get at them?

As for the identities between types being different.

To create a subinterpreter that can marshal between subinterpreters without copying the data structures requires a different data structure that is safe to use from any interpreter. We need to decouple the bookkeeping data structures from the underlying data.

We can regenerate the bookkeeping data structures during a .send or .receive.

Maintaining identity equivalence is an interesting problem. I think it's a more fundamental problem that probably has mathematical solutions.

If we think of objects as being vectors in mathematical space, we have pointers between vectors in this memory space.

For a data structure to be position independent, we need some way of declaring references to be global. But we don't want to introduce a layer of indirection on object-to-object references; that would be slower. We could use an atomic counter to ensure that identifiers are globally unique.

Don't want to serialize the access to global types.

It sounds to me it is a many-to-many to many-to-many problem. Which even databases don't solve very well.


It occurred to me that I was told about Unison Lang recently, and this language uses content-addressable functions.

In other words, the code for a function is hashed and that is its identity that never changes while the program is running.

If we use the same approach with Python, each object could have a hash that corresponds to the code only, instead of the data. This is the object's identity even when added to the bookkeeping data of another subinterpreter.

This requires decoupling bookkeeping information from actual object storage, but it replaces pointers with lookups, which could be inlined to pointer arithmetic.


> conceptually irrelevant and probably easier to implement.

Well, it depends on how it’s implemented.

If “made thread safe” means constantly grabbing locks around large blocks of data then the end result is concurrency (hopefully!) but not parallelism. Meaning you might only have one thread active at a time in practice.

Wrapping the universe in a mutex is thread safe. But it’s not a good solution.


I presume they know what they are doing and won't be doing a big world mutex :-)


I'm glad to see this as an outline, which is how I structure most of my project work. It can be hard for others to follow, but it's very concise and scannable (just read the first indentation level for the top-level idea).

To paraphrase Adam Savage from his excellent book, Every Tool's a Hammer, lists [of lists] are a very powerful way to tame the inherent complexity of any project worth doing.


You might like creating outlines using Org mode (the ultimate hammer): https://www.youtube.com/watch?v=VcgjTEa0kU4&t=344


Looking forward to runnable code this time. Most of us are old enough by now to remember many project plans just like this one. Fool me once..


This project has already landed improvements in 3.10, and some much bigger improvements in 3.11. This work for 3.12 is "just" a continuation of that excellent effort:

https://www.phoronix.com/review/python-311-benchmarks/4


Have you followed the 3.11 performance improvements by the same group? It's ~25% faster than 3.10.


faster at what?!

That's a pretty generic statement and likely true only in a very specific frame.


The 25% number is from the pyperformance benchmark suite, which you can replicate. Whether pyperformance is a representative benchmark suite is another question.

https://github.com/python/pyperformance


Cheers.

It rubs people the wrong way, but I always call out blanket statements. Generally languages get faster with each version and there are a lot of numbers thrown around, but that doesn't mean your apps will get anywhere near that boost.

If you're lucky, that one loop that concats strings got a few ms shaved off while that ORM you're using continues to grind the whole thing down.



How does it compare to peak python2.x performance?


Already passed it with the 3.6 dict improvements iirc.


python2.x performance was surpassed long ago. I don't think anyone bothers to benchmark it anymore.


> Per-interpreter isolation for extension modules

This will break many modules. Basically any that use static variables, which is done pretty much everywhere.


Yes, implementing support for this would be a challenge for extension modules. Here is a discussion between the core devs and the numpy team: https://mail.python.org/archives/list/numpy-discussion@pytho...

It's going to be a bit of a chicken-and-egg problem: core Python will need to prove it's worthwhile for extension devs to implement, but core Python will struggle without support from extension devs. We shall see.


IMO, it was this sort of chicken & egg problem that slowed the adoption of 3.x in the first place. I know personally, I wasn't able to use 3.x for anything non-trivial until close to 3.7 because some of the 3rd party libs I needed weren't available. I seriously hope this doesn't happen again, though I am really excited for these improvements to CPython.


I don't disagree, but the positive thing about this is it's opt-in for extensions.

If an extension doesn't support it, it means you just can't use that extension when trying to run multiple interpreters in the same process. Let's see if there's even a good use case for running multiple interpreters in the same process outside of embedded programming; it's not 100% clear yet.


If it's static, would it not get its own allocation within each of the isolated interpreters?


Static modules are loaded as shared libraries/DLLs. The way operating systems implement this is that each library is loaded once per process and its statically allocated memory is mapped into the virtual address space of the process. You can't load one .so/DLL multiple times in some sort of container, so each module would have to implement this isolation itself, probably through some sort of API that the Python runtime offers to the module. It's not rocket science, but it will definitely break existing code where it's common practice to use DLL lifetime hooks as initialization code that allocates some global state that's conveniently shared throughout the module.


> You can't load one so/dll multiple times in some sort of container

I believe you can do that with `dlmopen` in separate link maps. I have worked with multiple completely isolated Python interpreters in the same process that do not share a GIL using that approach.


Thank you for the hint about dlmopen! I had a problem that can be solved by loading multiple copies of a DLL, and it looks like reading manpages of the dynamic linker would have been a better approach than googling with the wrong keywords.


That's great!

There are a few cases where `dlmopen` has issues, for example, some libraries are written with the assumption that there will only be one of them in the process (their use of globals/thread local variables etc.) which may result in conflicts across namespaces.

Specifically, `libpthread` has one such issue [1] where `pthread_key_create` will create duplicate keys in separate namespaces. But these keys are later used to index into `THREAD_SELF->specific_1stblock` which is shared between all namespaces, which can cause all sorts of weird issues.

There is a (relatively old, unmerged) patch to glibc where you can specify some libraries to be shared across namespaces [2].

[1]: https://sourceware.org/bugzilla/show_bug.cgi?id=24776#c13

[2]: https://patchwork.ozlabs.org/project/glibc/patch/20211010163...


IIRC glibc is limited to 16 namespaces though.


Currently it is, yes. I am not sure how fundamental it is. I tried patching glibc to support more (128 in my case) and it seemed to work fine.


It's all a single process, and native modules are just shared libraries, so how would it allocate multiple instances for different interpreters?


Does anyone know of a way to load multiple instances of a DLL in the same process on Linux? A few months ago I was googling for a solution and didn't find anything ready-made. I guess the dynamic linker wants to have a unique address for each symbol, but in principle you should be able to load another DLL instance, initialize it and call its functions indirectly by using function pointers.


How would you find said function pointers?


dlsym and RTLD_LOCAL?


Is this from an official Python Foundation group? Weird that it's not under the main `python` github org.


> Guido van Rossum edited this page yesterday · 13 revisions

Gotta admit. That sounds pretty official


No. He could be improving Python on his own time.


For those who didn't catch the reference: https://www.youtube.com/watch?v=ohDB5gbtaEQ


On Microsoft's time, since they are employing him


To be that guy... Didn't Guido officially step down as BDFL?


Yes, he then retired, came out of retirement to work at Microsoft with a remit to work on whatever he wants, and decided the project he wanted to work on was make CPython faster.


What's the title of the budget for employing him? General brain trust? I'm genuinely amused


The title is "Distinguished Engineer".

I have no idea of the pay but based on my research of salaries for getting a job this year I would wildly speculate high 6 digits to low 7 digits.



Not exactly. I would describe this as coming from the Microsoft faction of the Python Software Foundation. So yes, some members of the Python Software Foundation (mainly Microsoft employees) are behind this, but not all members are.


While the CPython core developers are Fellows of the PSF, the PSF does not provide technical direction. That is not its purpose.


Python core developers are behind it at least


It’s a bit frustrating to see the first item related to parallelism and the GIL. Anybody doing parallel compute in Python has long since worked around these issues. IMHO Python needs better single threaded performance first, and then once all the juice has been squeezed from that lemon, we can sit down and get serious about improving multi threaded ergonomics.


I don't really use Python if I can help it, but I'm still really glad to see people working on this. Whether I like it or not Python will probably always be some part of my job and I really appreciate that there's finally some focus on it getting faster that isn't just "write that part in C".


A "faster" CPython is a decade plus minus four years old story ;_;

I'll check back in a decade.


I wonder if python could possibly be as fast as js if enough money was spent on it.


It's not a case of money imho; it's a case of a juggernaut of a userbase and ecosystem that moves very slowly, and improvements to execution times are (generally) incremental changes, not paradigm shifts, since those make backwards compatibility a nightmare or outright impossible.

I mean 2.x is still in the wild and some companies provide support for it, still!


I think the other issue if we compare it to JS is the unfortunate reliance on C that python has. When Chrome came around, JavaScript was already standardised and there were many competing implementations which had to conform to ECMAScript and give users a relatively consistent experience. So when Google made V8 they could kind of go crazy with optimisations as long as they conformed to the spec.

Python on the other hand has one real implementation, and the ecosystem has become extremely intertwined with that implementation. Implementing a new Python interpreter is great, but it rarely gains traction because most of the ecosystem is so reliant on CPython specific modules that don’t work in the new interpreter so they never really get off the ground.


This is possible, but it would need some backwards-incompatible changes in the object model. We are still likely to see Python 4 one day. People still remember the pain the Python 3 transition caused.


Unlikely. Python has lots of features that were added without any thought to how to make them run fast - it simply wasn't a goal. As a result Python includes a ton of dynamic features that make it really hard to optimise.


I don’t think so; there are some design choices that make the two quite different. This "dictionaries all the way down" approach has a cost.


How will this affect things if I have an instance of the built-in SQLite? Will it be accessible by multiple GILs at once?


SQLite itself supports concurrent reads but not concurrent writes, so Python will likely not improve on that.


Yes, just like you can already access it from multiple threads in the same interpreter right now.
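e.g. the usual pattern today is one connection per thread against the same file (this sketch assumes an existing app.db with an items table):

    import sqlite3, threading

    def count_items(path="app.db"):
        conn = sqlite3.connect(path)   # one connection per thread
        try:
            (n,) = conn.execute("SELECT count(*) FROM items").fetchone()
            print(n)
        finally:
            conn.close()

    threads = [threading.Thread(target=count_items) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()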


> Yes, just like you can already access it from multiple threads in the same interpreter right now.

Yes, but doesn't the GIL currently serialise things?


The GIL serializes all the things you can do in Python... I am not sure I understand the question to be honest.


With the new multi-GIL I wonder how different concurrent GILs will work against the same sqlite database.


I see. It will give you the same performance as using a single SQLite database from different processes now.

Note that SQLite limits access to 1 writer at a time though.


So, is the GIL released when the thread goes to SQLite? Is the behaviour dependent on how the interfacing between Python and C is written?


Yes, it is released [1]. This allows you to access it from multiple threads in the same interpreter, though, so I still don't understand robertlagrant's question.

[1]: https://github.com/python/cpython/blob/4b81139aac3fa11779f6e...


Perhaps it would also be good to build the Artemis project on HN infra: articles and discussions. I will create a mirror on Habr.


I will also review this tech this week, along with the jq tools.


Are there any plans to remove threading from Python?

A year or two ago I read up on the various efforts to make a fast, more parallel CPython, and one of the core underlying problems seemed to be the use of machine threads, resulting in a very high locking load as the large (potentially unlimited) number of threads attempted to defend against each other.

Letting an operating system run random fragments of your code at random times is very much a self-inflicted wound, so I was wondering if the Python community has any plans to not do that any more?


Threaded IO code in Python is fine if you are IO limited, as opposed to CPU limited. Most web workloads are like this, as they spend their time waiting on the database.
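The typical shape of such a workload, where threads spend most of their time blocked on IO rather than executing bytecode (example.com is just a placeholder):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    urls = ["https://example.com/"] * 8   # placeholder URLs

    def fetch(url):
        # The GIL is released while the socket read blocks, so these overlap.
        with urlopen(url) as resp:
            return len(resp.read())

    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(fetch, urls)))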


I would seriously consider a RISC-V assembly port of a Python interpreter.

Removing fanatic compiler abuse is always a good thing. That said, I have seen some assembler macro abuse (some assemblers out there have extremely powerful and complex macro pre-processors), so the hard part would be not to abuse the macro pre-processor of the assembler instead.

I know this is not about making Python actually "faster", but with a Python implementation that does not require those grotesquely and absurdly massive compilers, the SDK stack would be way more reasonable from a technical-cost standpoint.




