Let's Remove the Global Interpreter Lock (morepypy.blogspot.com)
644 points by TiredOfLife on Aug 14, 2017 | 315 comments



The comments here are missing a massive use case: shared memory. Shared memory isn't just about programmer convenience. It's about using a machine's memory resources more effectively.

Yes, shared memory is available in multi-processing, but it doesn't necessarily interact well with existing codes.

I've been working on adding Python support to Legion [1], a task-based runtime system for HPC. Legion wants to manage shared memory so that multiple cores don't necessarily need multiple copies of the data, when the executing tasks don't conflict (all are read-only, or access disjoint data). Legion is C++, so this mostly "just works". Some additional work is required to support GPUs, but it's still not so difficult. But with Python, if we go with multiprocessing, we have to switch to a different mechanism. Worse, Python is an optional dependency for Legion, so we can't depend on Python's multiprocessing support either.

If you have a large existing project, and a use case that can take advantage of shared memory, being forced into Python's multiprocessing scheme for parallelism is a pain.

We've been investigating using a dlmopen approach as well, based on this proof of concept [2]. Turns out that dlmopen in every available version of libc has a critical bug that prevents it from being practically useful, if you have any desire to make use of native modules. You can build a custom libc with this patch [3] but rolling a custom libc is also a massive pain.

In all likelihood we'll end up rolling our own multiprocessing to make this work. If the GIL were truly gone though, we could potentially avoid many of these issues.

[1]: http://legion.stanford.edu/

[2]: https://news.ycombinator.com/item?id=11844268

[3]: https://patchwork.ozlabs.org/patch/496559/


^This. It is a very common use case for applications I work with to create a very large in-memory read-only pd dataframe, put a Flask interface to operations on that dataframe using gunicorn, and expose it as an API. If I use async workers, the dataframe operations are bound by GIL constraints. If I use sync workers, each process needs a copy of the pd dataframe, which the server cannot handle (I have never seen pre-fork shared memory work for this problem). I don't want to introduce another technology to solve this problem.


FWIW, I routinely throw many GBs of pickled dataframes into Redis, and then cluster the workload between multiple processes that are coordinated as a sort of namespaced job queue, all via Redis pubsub, blpop, l/rpush, and set/get. There are much faster and more efficient serialization formats than pickle, like msgpack or protocol buffers, if you really need to squeeze out performance. You just have to chunk your bulk data into pieces and spread it across multiple workers. You have an orchestrator class that puts things onto the queues, pulls things off, loads any modules you need, handles exceptions, etc...

Then you can namespace your queues (and workers), and have separate queues for results handling to push data to the next stage of the pipeline, etc... With stacks of workers, configured as needed. It's all pretty high level from there. GIL has no effect here, and as a side-effect, now you can utilize a massive number of parallel processes for heavy lifting and crunching, even on different machines over the network, whereas that wouldn't be possible with a traditional threaded architecture.

Not saying this necessarily covers your use-case, but it seems strange to use dataframes as a sort of in-memory database, vs using dataframes as the framing to do the munging and heavy lifting. What do you want, to put multiple cursors on it or something? You could do this with greenlets, for what it's worth... But as someone who has gone down that route (multiple greenlets working over a shared stack), I promise doing it with multiple processes and a queue is better, and ultimately way more flexible. Especially if you use something like msgpack or protocol buffers... Then you can have workers from multiple programming languages and development paradigms doing different work at different stages, all orchestrated and working together via Redis.
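For what it's worth, a minimal sketch of that pattern with redis-py and pickle (the queue names and chunk size here are just illustrative):

    import pickle
    import redis

    r = redis.StrictRedis()

    # orchestrator: chunk the dataframe and push each piece onto a namespaced queue
    def enqueue_chunks(df, queue='jobs:crunch', chunk_size=100000):
        for start in range(0, len(df), chunk_size):
            r.rpush(queue, pickle.dumps(df.iloc[start:start + chunk_size]))

    # worker: block until a chunk arrives, crunch it, push the result downstream
    def worker(queue='jobs:crunch', out_queue='results:crunch'):
        while True:
            _, payload = r.blpop(queue)
            chunk = pickle.loads(payload)
            r.rpush(out_queue, pickle.dumps(chunk.sum()))  # stand-in for real work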


The pickling implementation of joblib has support for memory mapping numpy arrays nested in arbitrary data structures such as pandas dataframes.

Save the dataframe in a folder that can be accessed by the gunicorn worker:

    import joblib
    joblib.dump(df, '/folder/shared_data.pkl')
Then in the code run by the flask / gunicorn workers themselves:

    import joblib
    shared_df = joblib.load('/folder/shared_data.pkl', mmap_mode='r')
    # use the shared_df as usual (in-place modifications are not allowed)
Some pandas functions can have issues with read-only buffers though: https://github.com/pandas-dev/pandas/issues/17192 (caused by a currently unsolved bug / limitation of Cython), but it can work for your use case.


This looks very interesting. I am reading the docs https://pythonhosted.org/joblib/parallel.html#manual-managem... and it looks like it would help a lot (possibly solve the issue). Do you have any experience using this in production?


DAMN. I just did a basic test and it kinda just worked?!? I created a test dataframe of 100M rows x 10 cols which took up ~2.3G, and then used joblib.dump within the on_starting hook, which is run when the gunicorn master starts up. Then I loaded that df with joblib.load within the worker, and total memory consumption was practically flat. Then I bumped the number of workers up to 20 and it was still flat. That is actually amazing. Coolest thing I have seen in months for how easy it is. Now I have to test whether the analytics actually work, and do a deep dive into the mechanics of mem-mapping.


Thanks for your feedback. I am glad I could help you.


> create a very large in memory read-only pd dataframe and then put a flask interface to operations on that dataframe using gunicorn and expose as an API. [...]

May I ask what you consider large memory - MBytes, GBytes, TBytes? The simplest solution is to store it as a blob on an SSD and read it via simple file IO, or put it into a DB. But I assume this was too slow, so it would be interesting to go into more detail.

In the end you can do shared memory with multiprocessing in Python, which - I have to admit - requires some setup and bookkeeping work.
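For example, a rough sketch of that setup with a shared numpy buffer (assuming Linux and the default fork start method, so the buffer is inherited by the workers rather than pickled):

    import multiprocessing as mp
    import numpy as np

    SHAPE = (1000, 1000)
    SHARED = mp.RawArray('d', SHAPE[0] * SHAPE[1])  # created before the workers fork

    def col_mean(j):
        # each worker re-wraps the inherited buffer; nothing is copied or serialized
        arr = np.frombuffer(SHARED, dtype=np.float64).reshape(SHAPE)
        return float(arr[:, j].mean())

    if __name__ == '__main__':
        np.frombuffer(SHARED, dtype=np.float64).reshape(SHAPE)[:] = np.random.rand(*SHAPE)
        with mp.Pool(4) as pool:
            means = pool.map(col_mean, range(SHAPE[1]))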


Let's say there are a couple of dataframes that need a matrix multiply and take up about 10 GB on a 32 GB host. I want to parameterize these manipulations and expose them over HTTP. I can only afford to cache 3 sets of them, which means that I can perform 3 concurrent requests. I would like to provide more concurrency than this without reading from disk or storing the data out of process in a separate service, which adds complexity.


Tried a memmap?


I still wish to find a good tutorial about memmap. The doc about it is very formal. Something with clear use cases, patterns, gotchas and best practice would probably make it more popular.


Check out the joblib.dump example mentioned above. It is pretty impressive so far.
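The raw numpy.memmap pattern underneath is also only a couple of lines, if you'd rather not pull in joblib (file name and shape here are made up):

    import numpy as np

    # writer: create a file-backed array and fill it
    mm = np.memmap('/tmp/shared.dat', dtype=np.float64, mode='w+', shape=(1000, 1000))
    mm[:] = np.random.rand(1000, 1000)
    mm.flush()  # make sure the data actually reaches the file

    # readers (e.g. the gunicorn workers): map the same file read-only;
    # the OS loads pages lazily and shares them between processes
    ro = np.memmap('/tmp/shared.dat', dtype=np.float64, mode='r', shape=(1000, 1000))
    print(ro[:10, :10].mean())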


I rarely hear people complain about genuine use-cases but this would seem to be one. However, aren't most/all of the dataframe operations done in C extensions in these cases?


While a lot of NumPy is C and Fortran, Pandas is mostly pure Python and some Cython. And mostly it does not release the GIL.

You often end up having to implement your own C extensions or use Numba for the core of your processing. Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.


> Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.

Is there any sort of list (comprehensive or otherwise) that denotes which NumPy functions are parallelism-friendly? I mean this whether it's in terms of releasing the GIL, in terms of SIMD support, or in terms of being multi-core.


Why are you asking this using a throwaway?

np.dot() is multicore. np.load() (and family) releases the GIL. SIMD mostly depends on the build system, so if you want it you might need to build NumPy from source.

https://stackoverflow.com/questions/24022723/where-can-i-fin...


Is there a way to disable this? In an HPC environment, I don't want routines going multi-core without my explicit permission, under any circumstances. I will already have manually set up the parallelization to be at the highest logical level. If using Python, that usually means I have planned out the number of processes to be equal to the number of cores. If each process then starts doing its own multicore calculation (badly load-balanced!) it overtaxes the node and slows everything down.

I really wish numpy/pandas/scipy wouldn't do this kind of uncontrollable parallelization.


Underlying implementations often have a way to disable parallelism, e.g. OMP_NUM_THREADS=1 or MKL_NUM_THREADS=1.
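Concretely, you can set them in the shell before launching, or from Python as long as it happens before numpy (and the BLAS it links against) is imported; which variables matter depends on your build:

    import os

    # must be set before importing numpy
    os.environ['OMP_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'
    os.environ['OPENBLAS_NUM_THREADS'] = '1'

    import numpy as np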


pd.HDFStore is a good option for storing large DataFrames, and it has some powerful querying capabilities.


But multi-interpreters would allow that, and the article discards it as a valid solution. I find that harsh. It seems much easier to implement, doesn't have the same serialization problem as multiprocessing, and allows you to utilize all the CPUs. Yes, it's not as good as proper threads because you do have more overhead, but it's an order of magnitude better than what we currently have, while being way easier than getting rid of the GIL.

Too bad the current project is on hold.


The argument against sub-interpreters or multi-interpreters is a false dichotomy. There are plenty of scenarios where having multiple interpreters in the same address space would be valuable. Queues could be used effectively to communicate between interpreters, no sharing needed. Where sharing is required, those structures can live on their own non-VM heap with the required locks.


The other glaring issue with CPython is all the globals, meaning there can be only ONE Python VM running in a process space.

The closest work has been done on PyParallel https://news.ycombinator.com/item?id=7861942 but AFAIK it is only for Windows.


This is a good point and one of the few convincing arguments I've heard against the GIL. Thanks for providing so much detail!

Did you consider just mounting a ramdisk and storing data as files? At first glance it seems like a decent fit for sharing read-only data in memory.


From my experience, tmpfs adds an overhead. If I use an in-memory database for SQLite, it is about twice as fast to interact with as a database file loaded into a tmpfs.


Agreed, shared memory is an immensely useful feature for numerical programming, including data science and machine learning. Lots of people will say that those should be written in C++, but I think the rise of machine learning & data science in high-level languages argues against their point.


Yeah. Python is pretty standard now. To implement high throughput scoring on models written in python, I have to run multiple processes with one copy of the python model for each process. For large models like random forests, this can eat up a lot of memory.

Ideally, it would be a single model in memory with access from multiple threads. But that won't work right now because of the GIL.


Having ported Ruby to IBM's Blue Gene/L my advice is to forget about the GIL. Run one Python process per core. Use something like MPI2 for message passing communication. Ruthlessly eliminate bloat code from production binaries and statically link all the things.


I agree wholeheartedly. Almost every time I hear from someone who is upset about the GIL, I find that they would be much better suited to using multiprocessing instead of multithreading.

For 80% of the developers out there, this approach basically assures better, more stable code.


Python's "multiprocessing" means launching another Python interpreter in a subprocess. Each process has a full copy of the Python environment. They may share the base interpreter, but there's a separate copy of every package loaded and all data. Memory consumption is bloated and the CPU caches thrash. Launching a subprocess is expensive; it means a full interpreter launch and a recompile/reload.

"Multiprocessing" is useful when you have a lot of work to do concurrently and not too much data to pass between processes. I've used Python subprocesses that way. Parallelizing your number crunching is probably not going to work very well.


See my other reply in this thread.

> Parallelizing your number crunching is probably not going to work very well. [...]

The question is, what exactly does "number crunching" mean? We do aerial imagery analysis, so image processing in essence, which I would classify as a "number crunching" problem. A common thing e.g. is to do a time-series analysis and you can simply start multiple (2, 4, ..., N with clusters, etc.) processes for each problem. Obviously this works because most methods are computation and/or memory heavy - the additional memory requirements and "overhead" of Python itself (IMHO people overestimate the weight of starting new processes instead of threads) is completely dwarfed by the requirements (memory and CPU) of the method itself.


...which is true, but doesn't mean you can just ignore it.

Interpreter state is among the most frequently accessed memory in many applications, meaning it's ideal to have it in cache. The difference between two interpreter states and one might not be big compared to the data being processed, but it's big enough to bump a lot of interpreter state out of cache, which for many programs can have drastic performance implications.

If you don't think cache locality is important, look at radix sort versus quicksort. Radix sort has a much lower O, but performs worse in most cases because of its poor cache locality.

Look, I get that there are fairly easy ways to work around these problems, but let's not just blithely pretend they aren't problems.


Agreed, but there is a lot of misinformation about the topic. I've met developers who thought the GIL prevents you from running your program in multiple instances at the same time on one machine - which is obviously not the case.

Sure, it's a problem for specific workloads, and Python will get there eventually - I just don't think it is a deal breaker.


Actually things are not as bad as they used to be. Since 3.4 you can alter the way multiprocessing starts processes:

https://docs.python.org/3/library/multiprocessing.html#conte...

The ``forkserver`` method eliminates most of the problems you mention: child processes are only started once, and they fork() from a totally separate process so they don't inherit all of the resources of the main process (in particular, they don't copy the whole heap). I've found this eliminates 90% of the performance-related issues I used to experience with multiprocessing.
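A minimal sketch of opting in (forkserver is POSIX-only; the function here is just a placeholder):

    import multiprocessing as mp

    def work(x):
        return x * x

    if __name__ == '__main__':
        # children are forked from a small, clean server process,
        # not from the (possibly huge) main process
        mp.set_start_method('forkserver')
        with mp.Pool(4) as pool:
            print(pool.map(work, range(8)))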


If you're CPU bound (only reason to care about the GIL anyway), then you want one process per core. So at least the L1 memory cache isn't shared. The separate memory consumption is minimal (3-5MB*N cores).

You don't need to do setup/destroy more than once.


If CPU load is an issue, why would you be using an interpreter in the first place?


It seems like people never make this assessment, or use the GIL argument to put interpreted languages down. I personally run into I/O bound problems way more often than CPU bound ones. That said, I'm mainly doing things in the realm of a Python web developer. Scientists probably hit CPU bound problems more often with Python, but seem to drop down to C/C++ extensions without needing to complain about the problems.


A lot of the heavy lifting is done through calls to C libraries anyhow, with Python just being a convenient way to pass the data around.


Indeed, and in that case the GIL is effectively a non-issue (there's no requirement for the GIL to be held by non-python code).


No, no, if you're manipulating Python objects from C code, you have to hold the lock. You can release it only when not doing anything with objects in Python's memory space. Otherwise you get race conditions and intermittent crashes.


> If CPU load is an issue, why would you be using an interpreter in the first place?

You're basically asking why NumPy, SciPy, Numba, etc. even exist.

They exist because Python is ridiculously fast to develop in compared to, say, C++.


By using numpy, etc. you're basically _not_ using the interpreter because you're using C/C++/fortran code that's been compiled with python bindings.

To combine both your points, the best approach (if you like Python) is to stick with Python due to its ease of development and use libraries such as numpy as far as possible. However, if your use case is CPU bound but not served by those libraries, then you'll either need to develop your own extensions or throw away the interpreter altogether (and go with a different language).


Just because those resources exist does not mean you get to park your 1997 Chevy Cavalier diagonal across three parking spaces.

I'm not going to run your code on my server if your code uses resources so poorly that I can't run other things I want to run on my server.


You are getting downvotes, I suspect, because your comment makes no sense in the context of the post you replied to.

Perhaps you meant to reply to GP?


>Memory consumption is bloated and the CPU caches thrash. Launching a subprocess is expensive

Statically and dynamically loaded binaries are resident in the kernel's page cache. While each process will have them at different locations within its address space (because of ASLR), they _should_ be de-duplicated in RAM; ultimately all these separate in-process images are pointing at the same physical RAM page(s).

So from a hardware cache standpoint you're mostly okay.


That's just the interpreter's executable. All the stuff that's generated from the Python code you load, and any data it generates, is unique to the process.


With Python that's a lot of stuff; I suggest running strace python some_small_script.py to see just how much data Python loads on every single startup.


The other issue with multiprocessing is that it requires the enclosing code to be pickleable, and many Python objects are not pickleable. For example, if I have a thread-safe RPC client and want to send thousands of RPCs using the client, I can't do that with multiprocessing (a subprocess pool; threading pools work). RPC clients manage a TCP connection; if you use multiprocessing, you end up having to make many TCP connections.
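In other words, with threads the pattern is just the following (FakeRpcClient is a stand-in for illustration, not a real library):

    from concurrent.futures import ThreadPoolExecutor
    import threading

    class FakeRpcClient:
        # stand-in for a thread-safe client that owns a single TCP connection
        def __init__(self):
            self._lock = threading.Lock()

        def call(self, payload):
            with self._lock:  # a real client would serialize writes on its socket
                return payload * 2

    client = FakeRpcClient()  # one connection, shared by every worker thread

    with ThreadPoolExecutor(max_workers=32) as pool:
        results = list(pool.map(client.call, range(1000)))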


Absolutely agree. Almost all tasks will perform very well when using multiprocessing. It also has a nice side-effect of steering you towards explicitly coding data flows without fine-grained sharing.

If you need to close that gap between the performance of multiprocessing, and multithreading, then you probably shouldn't be using Python, or any language of the same shape, in the first place.

There is one other option I'd like to see: multiprocessing style, but with multiple Python interpreter instances in the same process — one per thread. There would still be the hard delineation of data boundaries between instances, but less overhead for pushing data between them.


> If you need to close that gap between the performance of multiprocessing, and multithreading, then you probably shouldn't be using Python, or any language of the same shape, in the first place.

Unfortunately, these performance concerns often manifest well after the "rewrite it in a different language" date has expired. There are a lot of people in that boat, and they need better options.

> There is one other option I'd like to see: multiprocessing style, but with multiple Python interpreter instances in the same process — one per thread. There would still be the hard delineation of data boundaries between instances, but less overhead for pushing data between them.

If I understand correctly, the article discusses this ("subinterpreters"), but claims that there is no advantage to this approach vs multiprocessing. Presumably any overhead savings are eaten by GIL contention or some such?


> There are a lot of people in that boat, and they need better options.

Land isn't coming to you, folks, you must start rowing if you want to get there.

Rewrite bit for bit. Module for module. Package for package.


Sounds like you're saying this is infeasible; care to explain why?


I took it to mean that it is feasible. Instead of saying "well we used the wrong language, I guess we're screwed," you rewrite one component at a time, piece by piece, until the whole has been replaced.

This is the approach I try to use myself. It's nearly impossible to replace an entire system all at once. But replacing one part at a time is doable and you can see the improvements much sooner.


By "it is infeasible", I meant, "removing the GIL is infeasible"; not "rewriting is infeasible".


What would that get you?


Lately I've found out that multiprocessing will not help you if your program is multithreaded. There is no sane way of forking a multithreaded program. For one, the child process will inherit a copy of all locks in the state they were in at forking time, possibly causing random crashes and deadlocks.


> There is no sane way of forking a multithreaded program

The sane way of forking a multithreaded process is to exec immediately after.


It is possible to do more after a fork (cf. async-signal-safe), but it's hairy enough to just say — don't, always exec (similar to how doing actual work in a signal handler is generally a very bad idea).


If the child program is multithreaded then it's almost certainly not pure Python in the first place. So, wrap it up in `with nogil:` Cython statements and use the threading module (or concurrent.futures.ThreadPoolExecutor).
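For code that already releases the GIL (e.g. BLAS-backed np.dot), plain threads do scale; a rough sketch:

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    mats = [np.random.rand(2000, 2000) for _ in range(8)]

    def square(m):
        # np.dot drops the GIL while BLAS runs, so these threads
        # can actually execute in parallel on multiple cores
        return np.dot(m, m)

    with ThreadPoolExecutor(max_workers=4) as pool:
        products = list(pool.map(square, mats))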


That is why the 'forkserver' start method exists. https://docs.python.org/3/library/multiprocessing.html#conte...


Except when your use case requires a massive shared data cache that needs to be atomically updated.


Redis could help. Obviously not perfect for every use case but covers many of them.


It doesn't if you need to manage atomic data across the processes, as there's no way to lock and block the other cache consumers (think the data you need to handle cache evictions, etc.)

Also, you're describing multiple python processes + an extra server (redis) process - as a "simpler" solution for the limitation that Python doesn't do multi-threads well.

Of course there are a ton of use cases out there where you can scale in other ways, but threads and shared memory exist for a reason - there's no reason not to call a spade a spade and say the GIL is still a limitation.


Blocking workers in a Redis queue is not hard... You can simply put them all on a pubsub control channel and then orchestrate them that way when you need to do shit. Or literally just take down the processes, or the network, so they disconnect and stop BLPOPing the queue.

Cache evictions can be handled by Redis natively with TTL.

For retries and failure mitigation, you can still lean on Redis via BRPOPLPUSH/RPOPLPUSH.

If you want to scale beyond one machine, you can't rely on threading to help you. So why not just do it right to begin with, and use a parallel worker queue?

It's not a matter of the GIL being a limitation, a single machine is a limitation too. Don't blame your tools because you're misusing them.

As for threading in Python... on a single machine, for one reason or another... I would still rather use multiple processes, or at the very least, would just simply use eventlet and greenthreads.

Not saying it covers all use cases, it's not a silver bullet, and it doesn't replace threading natively, but damnit, it scales better, and it's the right way to do the task at hand.


Wasn't reading data from or using a redis queue, this wasn't a blocking queue issue.

Second, my use case wasn't a simple cache, I was omitting details. So redis having a TTL eviction policy for the values it stores is a moot point. The resources I was dealing with ranged from around 0.5GB to several gigabytes. That was the important working data - but whether or not these objects were available was what had to be coordinated (as well as some other bookkeeping data.)

Also, in this case, scaling beyond one machine was of course important - we were doing that. The issue is that for each machine you allocate, you want to maximize usage of its resources. So each machine gets its own data cache, but nonetheless, we still wanted to max CPU usage per machine. So again, it's multi-process vs. multi-thread, and in this case multi-threaded with shared memory was a much easier paradigm than handling coordination among separate Python processes.

I was just giving an example of reasons one would want true multi-threading in python. I wasn't trying to go into explicit details of an entire project. Please consider this when you reply to people and tell them they're "misusing their tools."

Good day stranger.


I think it's ok to not write everything in Python, and this is a long way from the top of my problems with it.


Of course it is - and that's what people do. The reason for the parent article is that there ARE people that would like to continue to use Python the language, and their existing source code/libraries, but would like not to deal with the GIL. Just because it is not a priority for you doesn't mean it isn't for others.


This is self-fulfilling. As long as Python is useless for a set of tasks that are intrinsic and important to some domains, they won't use it.


> Except when your use case requires a massive shared data cache that needs to be atomically updated

You can delete the last 6 words. Anything where multiple processes would have to read in/acquire a massive dataset to do some independent work qualifies. For instance, running some number (e.g., hundreds to hundreds of thousands) of analytical or statistical tests over a set to pick parameters, etc.


Read-only shared memory can cover that. Python's ref counting does make it a nuisance: you can't share it as a Python object graph.


Of which there are plenty well defined ones to do the job already, and as a plus they can communicate with any language not just Python.


Your parent comment gives good advice, because the GIL is probably here to stay and so there's no use complaining about it. But the idea that multiprocessing gives better results than multithreading is ridiculous.

In languages which don't have a GIL, threads are almost as capable as processes, but lighter weight. Threads are almost always preferable to processes in most languages.

I understand why the GIL is still around, and don't necessarily support removing it, but it's definitely not there because it produces "better, more stable [Python] code".


> In languages which don't have a GIL, threads are almost as capable as processes, but lighter weight.

But also plagued with shared state concurrency bugs, something multi-processing completely avoids so...

> Threads are almost always preferable to processes in most languages.

No, they aren't. It's too easy to write buggy code with threads, it's a flawed model. Now it's certainly true that more people choose threads than processes but that's because they vastly overestimate their ability to write bug free lock based code. Processes are better.


1. Python provides mechanisms for communicating between processes. Literally the exact same mechanisms can be used to communicate between threads. So I'm not sure why you think processes are inherently safer than threads.

2. If we're talking about all languages, I'm really just not sure why you would assume threads imply locks. There are a ton of threading models out there which don't rely on explicit locking, and there are even some that don't use locking, period.


> So I'm not sure why you think processes are inherently safer than threads.

Because they remove the unsafe way of sharing state from the programmer. The issue isn't that state can't be shared correctly with threads, it's that it doesn't have to be done correctly, and programmers are simply terrible at doing it right.

> There are a ton of threading models out there which don't rely on explicit locking, and there are even some that don't use locking, period.

It's not about locks, it's about shared mutable state. Programmers are bad at dealing with shared mutable state, regardless of how access is synchronized.


> Because they remove the unsafe way of sharing state from the programmer. The issue isn't that state can be shared correct in threads, it's that it doesn't have to be done correctly and programmers are simply terrible at doing it right.

Please read what I said before the part you quoted. In fact, maybe read the rest of the chain of comments--the topic of conversation is threads versus processes in Python, and threads in Python do not require you to use shared mutable state, locks, or any of the assumptions you've made. If you can write multiprocess code in Python, you can write multithread code using the same mechanisms for memory-safe interthread communication as you would for interprocess communication.

> It's not about locks, it's about shared mutable state. Programmers are bad at dealing with shared mutable state, regardless of how access is synchronized.

It's not about shared mutable state, because that's not what anyone was talking about before you brought it up, and there are plenty of threading models that don't have shared mutable state, too.

You're preaching to the choir here about locks and shared mutable state being bad, but it has nothing to do with anything that was being discussed before you showed up with a bunch of assumptions.


I know exactly what you said, I "know" multi-threading doesn't require shared mutable state, I never claimed it did.

Don't presume to tell me what topic I might want to digress on, if you don't want to reply then don't, no one forced your hand.


Not the person you replied to, but this thread is really frustrating to read. Multithreading does not imply shared mutable state.


No one claimed it did. Threaded code is plagued with bugs; it was not claimed that all threaded code uses shared state. Your frustration is unwarranted.


The question of shared state vs. message passing is orthogonal to processes vs. threads. Both techniques can be and are commonly used in both situations.


No it isn't. Clearly both techniques "can" be used, but that one "allows" shared state trivially and one doesn't matters greatly; it is not orthogonal, you just don't grasp the point being made about the nature of the choice of abstractions and the problems that come with them.


The GIL is a legit pain when dealing with GUIs.

When you're jumping between C/C++ code and Python code you don't care much about the GIL... until you have a GUI which needs to be kept responsive and needs the GIL to do so.


I've done a fair bit of GUI development in Python, mainly using Qt, and haven't hit any significant responsiveness issues. The multi-threading support in Python is perfectly fine for providing responsive switching between activities and event loops, as long as you don't have anything that locks hard for too long. But in that case you can always split that off into a separate process, e.g. the way browsers nowadays run a process per tab.


I can see how using multiprocessing trumps threads for smaller programs. However it can become memory inefficient to have larger programs running in multiple processes, especially on servers with less resources.


If I run N instances of a program that occupies 8 MB of memory, the memory footprint of the code is much less than N*8 MB due to shared libraries/memory pages.

It's a factor, sure. But, one you should weigh with other factors to determine what is best.


If you have long running, computationally intensive code, with simple interactions, sure. Then multiple processes is the right thing.

But sometimes you are writing a GUI app, or some "real time" code [1]. You put blocking calls onto a different thread to keep the UI responsive. But then you find that the blocking calls still freeze the UI across threads due to the GIL.

Pure Python code is not the problem in this case - the GIL gets released between statements often enough. It is long running C code. You could release the GIL manually in there, but it is not done everywhere. Also, there are often calls that are supposed to be instant (like opening a file, or starting an async operation), that take seconds under bad conditions (when the network is down).

----

[1] well, with Python probably not in the strict definition of real time, but say you are controlling some external device


> multiprocessing instead of multithreading

There's a reason threads exist.


Those reasons aren't what they used to be, resources aren't nearly as limited these days and we now have the hindsight to see that threads lead to very buggy code due to shared state. Processes are better.


Hettinger said that multiprocessing uses pickle for every communication and that this must be accounted for when optimizing.


I am a heavy user of Python and its scientific libraries (numpy, etc.), and although I know about the GIL, I have to add that for us (we do a lot of scientific code-prototyping to evaluate remote sensing processing methods) the GIL hasn't been a problem so far.

E.g. in the remote sensing and earth observation domain you can simply divide your problem (e.g. semantic segmentation) into (maybe over-lapping) subproblems (via e.g. tiling) and start separate processes for each image processing tool-chain.

Granted you may not utilize your resources to the full extent by only applying multiprocessing (and ignoring threading), but in my experience you can solve a lot of problems by simply applying map-reduce-like programs and optimizing for throughput.


Counterpoint: Threads in JRuby work as threads should work. No GIL. No grinding of gears.

Multi-process is just one form of concurrency, and it's not always the best one.


Whoa, thanks for bringing up MPI2, you might have just saved me a lot of painstaking development with the mmap and multiprocessing libraries.


The post addresses this strategy and describes why they consider it insufficient. You can already do this in Python, anyway.


Yes, but they discard multiple interpreters as having no real advantages. This is dishonest, since it should be able to share objects with much, much less overhead than multiprocessing, while allowing the use of multiple CPUs. It's not perfect, but honestly it seems like a very good deal for Python.


Yes, multi-processing is much easier anyway. Not to mention how complicated, not backwards compatible and thorny trying to get rid of the GIL is...


That's not an option when you want to do something like reinforcement learning with lock-free updates. In that case, the networks are small enough that you want to use the CPU, but learning is sensitive enough that you don't want multiple copies of the network getting out of sync. Then you absolutely need multiple cores sharing memory.


I would like to know more.


I can't tell you how happy I was to see your comment at the top of this discussion.

Relevant: "Python is Only Slow If You Use it Wrong" http://apenwarr.ca/diary/2011-10-pycodeconf-apenwarr.pdf


I feel like the GIL is, at this point, Python's most infamous attribute. For a long time I thought it was also the biggest flaw with Python...but over time I care less and less about it.

I think the first thing to realize is that single-threaded performance is often significantly better with the GIL than without it. I think Larry Hastings's first Gilectomy talk was extremely insightful (about the GIL in general and about performance when removing the GIL):

https://youtu.be/P3AyI_u66Bw?t=23m52s

I am not sure I would, personally, trade single-threaded performance for enabling multi-threaded applications. I view Python as a high-level rapid prototyping language that is well suited for business logic and glue code. And for that type of workload I would value single-threaded performance over support for multi-threading.

Even now, a year later, the Gilectomy project is still slightly off performance-wise (although it looks really really close :) ):

https://youtu.be/pLqv11ScGsQ?t=27m32s

As noted elsewhere, multi-processing offers adequate parallelization for this type of logic. Also, coroutines and async libraries such as gevent and asyncio offer easily approachable event loops for maximizing single-threaded resource utilization.

It's true that multi-processing is not a replacement for multi-threading. There definitely are tasks and workloads where multi-processing and its inherent overhead make it unsuitable as a solution. But for those tasks, I question whether or not Python itself (as an interpreted, dynamically typed language) is suitable.

But that's just my $0.02. If there is a way to remove the GIL without negatively impacting single-threaded performance or sacrificing reference counting for a more robust (and heavy) GC, then I am all for it. But if there is not...I would just as soon keep the GIL.


The GIL has been a much bigger problem for perception than it ever has been for performance. Python has lost more mindshare over it than anything else. The few machine cycles that were ever saved by moving away from it were far outweighed by the waste of human cycles.


The few machine cycles that were ever saved by NOT moving away from it (which is the ONLY justification for keeping it) were far outweighed by the waste of human cycles.

If Python would simply suck it up and eat the 20% performance hit, we could stop talking about the GIL and start optimizing code to get the 20% back.


Many projects have solved this problem with dual compilation modes and provide two binaries the user can select from at runtime.

Eliminating the GIL doesn't have to mean actually eliminating it. You could certainly have #defines and/or alternate implementations that make the fine-grained locks no-ops when compiling in GIL mode. Conversely make the GIL a no-op in multithreaded mode.


Which is why multi-interpreters are a good solution. You keep the GIL and its benefits, but you lose the cost of serialization and can share memory.


Could someone who really wants to get rid of the GIL explain the appeal? As far as I understand, the only time it would be useful is when you have an application that is

  1. Big enough to need concurrency

  2. Not big enough to require multiple boxes. 

  3. Running in a situation that can not spare the resources for multiprocessing. 

  4. You want to share memory instead of designing your workflow to handle messages or working off a queue. 

#4 does sound appealing, but is it really worth the effort?


In my five years of python I've run up against this boundary at least once. In your list I would

* take out #2. if something can make use of multiple nodes it can usually make even better use of multi-core parallelization (which affects both computational and memory bandwidth performance). multi-node comes with a much higher communications overhead, so there's a relatively wide range of applications that scale well on multi-core but not multi-node.

* add that #3 comes up as soon as you have complex data structures to share. Serializing and Deserializing (by default with pickle) is a huge overhead for anything a bit more involved. If you design for this from the start you can be fine, but often these things grow and eat up bigger and bigger usecases until you run against the GIL. This basically happens with anything that has enough data and users and need - hey I heard your scheduler tool works well for the cafeteria, I'm sure it can handle our global operations right?

* about #4 - see the previous point.


Once (or a few times) in 5 years puts this problem into the "not worth solving (ROI)" bucket for me.

Those few times, put down the hammer and use some other tool for those non-nail-like jobs.


Here's the thing: Python, especially 3.6, is such a well rounded language that all other major limitations have IMO been solved already. In my view the GIL is the main one left, and reason to pause and think whether python is a good idea at the start of a project. Removing it is therefore worth it, and would also give a nice additional incentive for everyone to switch to python 3.x, so we don't have to keep on maintaining 2.7 with the same code (i.e. the worst of both worlds).


Certainly not worth it for one person to tackle the GIL, but a million people running into it a few times in 5 years, and I think it's economical.


#1 & #2: Consumer CPUs are now pushing 16 cores & 32 threads. Python is limited to ~1/20th of what a single box is capable of. That's a pretty big bottleneck.

#4: Even if you're just talking message passing sending a message between threads is in the 10s of nanoseconds while between processes is 10s of microseconds. That's a ~1000x slowdown on core communication. Given that CPU cores are not getting any faster, that's a pretty big hit to efficiency to take. Similarly simply moving data between processes is expensive, while moving data between threads is free.


Moving data between threads is only free to the extent that synchronization is free. Maybe you could say that moving immutable data between threads is free but I don't think you can say its free in general ... Doing so significantly undersells the complexity that comes with shared memory concurrency.


You seem to be conflating moving with sharing. Moving between threads is always free[1] regardless of if it's mutable or immutable, and there's no concurrency issues at all since it's a move.

Move means the sender no longer has a reference. As in, std::move, rust's ownership transfer, webworker's transferables, etc...

1: Yes there's a single synchronization point where the handoff happens, but this is part of sending a message at all. It's also independent of the size & complexity of the payload itself when we're talking multi-threaded instead of multi-process. You have that exact same sync point that costs the exact same regardless of whether your message consists of a single byte or a multi-gigabyte structure.


Ah I see -- Can you actually describe ownership in python sufficiently well to be able to describe this move operation for any useful python data structures?


Generically ownership is simply who has a reference to the object.

So if a.foo = b, then a 'owns' b. A 'move' is simply handing a different object the reference, then dropping your own reference. For example:

  a.foo = b       # a 'owns' b
  c.foo = a.foo   # a & c share b; however, if the next line is:
  a.foo = None    # a has 'moved' b to c, since c now has the only reference to b
Some languages have codified this to make the contract part of the language, but it doesn't need any first-class language support. It's just a pattern at the end of the day.


You are right about the majority of what you said, but I am pedantically picking on one point. CPU cores are getting faster, but they aren't doing it with clock speed, they are dispatching more instructions per cycle or otherwise making the work faster.


IPC gains per generation are vanishingly tiny, if they exist at all. Skylake -> Kaby Lake, for example, had no IPC improvements at all. A very small clock bump to the various tiers was it.

Even if you look over a large generation gap there's only a ~20% IPC improvement going from an i7-2600K to an i7-7700K ( https://www.hardocp.com/article/2017/01/13/kaby_lake_7700k_v... )

6 years & a shrink from 32nm to 14nm and all it can muster is +20%. Cores are just not getting faster by any meaningful amount.


My compiles have gotten more than 20% faster, so something is making my newer machines faster than my older machines.

That it is not 10x as fast I blame on AMD for not being as competitive as they could have been.


Yeah - I remember when five year old CPUs were basically useless!

(Kaby Lake is basically a new stepping of Skylake - if intel wasn't having problems with new process nodes it likely wouldn't have been released at all, and if it was it would've been used for a one-off chip in the same generation ala the 4770K)


The efficiency hit is very dependent on how large your computation chunks are. If the computation per message batch is on order of 100 ms, it would be <10% loss.


Assuming very small messages that are rarely sent then yes, the hit of multi process is not going to be your biggest issue.


Your criteria 2, 3, and 4 don't make much sense to me. We often have workloads that require multiple boxes, but we still want to make effective use of each box. Common server hardware has dozens of cores, which requires a lot of parallelism to fully utilize. The GIL hinders that, even when most of the work doesn't hold the GIL (see Amdahl's law).

Python multiprocessing doesn't work well with a lot of external libraries. For example, CUDA doesn't work across forks and many system resources can be shared across threads but not processes. Python objects must be pickled to be sent to another process, but not all objects can be pickled (including some built-in objects like tracebacks).

A lot of different parallel programming models can be built on top of threads (shared memory, fork-join, message passing), and to a certain extent they can be mixed. That's not true of Python multiprocessing, which only allows a narrow form of message passing. (It's also buggy, has internal race conditions, and easily leaks resources.)

The problem for CPython is that it may not be possible to remove the GIL without breaking the C API, and a lot of the benefit of Python is the huge number of high-quality packages, many of which use the C API.


CPython doesn't have any reservations about breaking the Python API between minor versions, so why care about the C API? I get where you're coming from, but they've already shown they don't care much for compatibility, so I don't see why that's a big obstacle.


Removing the GIL (in a non-braindead way) likely entails breaking all existing code using the C API. PyPy could do so without breaking cpyext, by maintaining the illusion of a GIL whenever control passes to cpyext.


That makes sense, I hadn't thought about the extent of the breakage.


Does it lock the GIL so numpy can release it again immediately afterwards?


Perhaps it makes the unlock call a no-op before numpy tries to unlock it.


> (see Amdahl's law)

Amdahl's law bears little relevance to throughput computing (i.e. most servers).

> (It's also buggy, has internal race conditions, and easily leaks resources.)

There is also at least one memory corruption bug in multiprocessing (linked a few months back by a fellow HN reader).


There are many cases where the objects are too big to be passed around. Python is used a huge amount in machine learning and data science, where being able to do parallel work on stuff already in memory would be great.


Can't this already be handled by calling out to a C/C++ or FORTRAN procedure that processes the data in multiple threads? For number crunching, Python is almost exclusively used as glue.


You CAN handle it, but why should you have to? If it's possible to remove that barrier, then it absolutely should be removed. If the only answer to a problem is "use another language", then the language in question has a limitation that needs to be addressed.


It is not a limitation at all in this case. Python is just a front to Tensorflow and similar libraries/frameworks so GIL doesn't matter there.


Today's machine learning and data science students don't know how to code in those languages. They know python, and maybe java.


Don't forget R... shudders


What's wrong with R? I know it's not a programmers language but it's great for getting things done.


So, work on data that cannot be broken down into smaller chunks? That makes sense, and is something I have never come across.


I'm sure they can be broken down into smaller chunks, but is it more efficient if they aren't broken down and instead shared memory is used? If you want parallelism you're obviously already worried about performance.


Something like the web worker primitives might work there (transferables & sharing read only data).


Are those applications often bottlenecked by the CPU, as opposed to GPU or data transfer?


The world of algorithms that run well on a CPU is still much, much bigger than the world of algorithms that run well on a GPU, even in machine learning.

And even if you're fortunate enough that Nvidia designs their GPUs to solve your problem, why should the CPU cores sit idle?


Motivation for removing the GIL is basically that when people hear about it they go "hmmm that doesn't sound good". Obviously many applications have been written in GIL languages and there aren't really many practical problems that can't be overcome easily.


I think it may be some Stockholm Syndrome -- people have worked very hard to get around the GIL, and they've come to expect its limitations and respect those solutions.

But I've never heard of someone asking for a GIL to be added to the JVM.


This! Try to implement a controlled task scheduler using multiprocessing and sooner or later you are going to hit some unexpected behavior, like multiprocessing.Queue belly-upping for no reason, UNIX signals propagating throughout the process chain and killing processes left and right, hitting some data/object which is not serializable, etc. Getting multiprocessing to work right takes a LOT of careful effort, which breaks the whole promise of Python.

I've since moved to clojure, which is a language designed with concurrency from ground up. Look at clojure's `atom` - it's basically what every beginning programmer expects from globally shared variables, minus the gymnastics of handling race conditions on your own.

Also, `core.async` is such a beautiful thing to work with for writing schedulers. Compared to this, python's asyncio is an unfunny joke.

I don't think Python's GIL can be removed with ad-hoc locking. Nothing short of a complete re-implementation will do.


Exactly! If a feature is good it's worth adding, and the GIL in any language is not good.

People are just as quick to bemoan a language for not having something (generics, templates, pre-processors) because they see some perceived need, but a GIL is never one of those things.


+1 on this - what is more important for me is some kind of Numba LLVM jit to automatically optimize hotspots : kind of like the JVM hotspot compiler.

Numba already does some of this.

Additionally, I cannot help but wonder if the answer to these problems has been the JVM all along. Especially with JVM 9 and the Truffle framework - https://github.com/securesystemslab/zippy


I was just about to mention Graal & Truffle when I saw your post! I wasn't aware of ZipPy but it looks promising! Java 9 will provide a proper interface for Graal through JVMCI and is only 37 days away from GA [1]. With Graal supposedly only months away from GA [2], ZipPy may very well prove to be the future of high performance Python.

[1] http://www.java9countdown.xyz/ [2] https://www.infoq.com/presentations/polyglot-jvm-graal (see roughly 42:00 - 47:00)

EDIT: Wording.


Say you're running CPU-bound workers that need to load significant data into RAM - say, a machine learning or NLP model. The most cost-effective theoretical approach would be to have that in shared memory, so you're not paying for that RAM multiple times in order to fully utilize all cores. Even if you need multiple boxes, the cost savings per core would be substantial. My understanding is that multiprocessing makes you jump through hoops to set up that shared memory; this would make it largely transparent to the user while remaining performant. I haven't used multiprocessing in production, though, so I could be wildly off base there.


Unless your model actually consists of a large number of Python objects (and not a handful of PyObjects referencing something like a np array), there isn't really anything blocking you from doing so. You can have a master process map the blob of static data into a block of shared memory that's mapped by the secondary processes; ctypeslib lets you access it as a numpy array again.
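Something along these lines, roughly (assuming fork so the RawArray is inherited; the sizes and the scoring function are made up):

    import multiprocessing as mp
    import numpy as np

    WEIGHTS = mp.RawArray('f', 10 * 1000 * 1000)  # shared block for the static model data

    def score(i):
        # every worker sees the same physical pages; as_array makes no copy
        w = np.ctypeslib.as_array(WEIGHTS)
        return float(w[i::1000].sum())  # stand-in for the real scoring

    if __name__ == '__main__':
        np.ctypeslib.as_array(WEIGHTS)[:] = np.random.rand(10 * 1000 * 1000)
        with mp.Pool(4) as pool:
            scores = pool.map(score, range(1000))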


Multiprocessing is a pain to use, and it's slow.


If you’re looking for simple threaded multiprocessing, it’s not that hard/painful:

    from multiprocessing.dummy import Pool
    
    pool = Pool(num_threads)
    result = pool.map(your_func, your_objects)
    pool.close()
    pool.join()
Improve and/or complicate things from there.


This is a nice pattern and there are surprisingly many problems that can be solved that way. AFAIK you do not have to join() here as the processes die after the map call.

Often the challenge is a big amount of (hopefully read-only) data that you want to access in every 'your_func'. The naive solution is to copy the data, but this might blow your memory.


In what way is it a pain that threading is not?


With threading, all of your threads can refer to the same objects. Multiprocessing means you have multiple interpreters running. That means no shared memory, and communication over pretty slow queues. I've definitely wanted to have multithreaded Python programs where all threads referred to the same large read-only data structure. But I can't do this because of the GIL. I mean, I can, but it's pointless. I can't do this with multiprocessing because of the limitations on shared memory with multiprocessing.

Edit: I realize I'm contradicting myself here. No shared memory is a first approximation. You can have shared memory with multiprocessing, but most objects can't be shared.


And yet, if you could have what you want, would it actually be faster?

The costs of synchronizing mutable data between cores is surprisingly high. Any time your CPU thinks that the data that it has in its cache might not be what some other CPU has in its cache, the two have to coordinate what they are doing. And thanks to the fact that Python uses reference counting, data is constantly being changed even though you don't think that you're changing it.

Furthermore if you throw out the GIL for fine-grained locking, you then open up a world of potential problems such as deadlocks. Which look like "my program mysteriously froze". Life just got a lot more complicated.

It is easy to look at all of those cores and say, "I just want my program to use all of them!" But doing that and actually GETTING better performance is a lot trickier than it might seem.


Right, but like I said, I'd be fine with a read-only shared data structure. I have a problem that has a hefty data model. The problem can be decomposed and attacked in parallel, but the decomposition doesn't cut across the data. Right now I run n instances on n cores, but that means making n copies of a large data structure. This requires a lot of system memory, ruins any chance I have of not wrecking the cache (not that I have high hopes there, but still), and forces me into certain patterns, like using long-lived processes because it's expensive to set up the model, that I'd prefer to avoid.


You might want to look at https://stackoverflow.com/questions/17785275/share-large-rea... for inspiration.

If you need to share a large readonly structure, the best way IMO is that approach. Implement the structure in a low-level language that supports mmap (be very sure to make the whole structure be in the mmap'd block - it is easy to wind up with pointers to random other memory and you don't want that!) and have high performance accessors to use in your code.


Thanks for the link! Might be worth going down that path.


Good luck. Another benefit of this strategy is that you can optimize that data structure using techniques that aren't available in higher-level languages. So, for instance, small trees can be set up to have all of the nodes of the tree very close together, improving the odds of a cache hit. You can switch from lots of small strings to integers that index a lookup table of strings for display only.

The amount of work to do this is insane. Expect 10x what it took to write it in a high level language. But the performance can often be made 10-100x as well. Which is a giant payoff.


Thanks! I've already partly rewritten it in C once, but I misunderstood the access pattern and I ended up having a lot of cache misses. The speedup was measurable, but disappointing, and the prospect of doing another rewrite had put me off. I hadn't put two and two together about this being an effective way to share memory under multiprocessing until reading this thread, so it's worth revisiting now.


Yeah, sharing memory between processes is a very delicate ballet to perform. That said, sharing a read-only piece of data is way simpler than you'd expect, depending on size and your forking chain. The documentation could do a better job of explaining the nuances and providing more examples.


Care to elaborate? All I've seen in the docs is how to share arrays or C structures between processes. It would take a substantial rewrite to use either. Is there some kind of CoW mechanism I'm missing?


Serializing data for IPC is often undesirable (copies kill) which leads to multi process shared memory. Sharing memory across process boundaries safely is a problem you avoid entirely with threading. You still need to lock your data (or use immutable data), but the machinery is built into your implementation (and hopefully trustworthy).


It's been a while, and my memory is fuzzy, but I recall either pyodbc or pysybase reacting very poorly with the multiprocessing module. With multiprocessing, Python would segfault after fork. With threading, it would "work" albeit slowly. Also, IIRC, it did not matter if the module was imported before or after the fork, still segfaulted. I never had the time to try and track down the issue that was causing it, though, deadlines and all that.


You can't just use functions defined in your tool, you need to create a faux-cli interface in order to run each parallel worker. Also, copying large datasets between processes is not efficient. And also, there are cases where the fan-out approach is not the best way of parallelizing a task, and passing information back up to a parent task is more complicated than necessary.


"You can't just use functions defined in your tool, you need to create a faux-cli interface in order to run each parallel worker."

the multiprocessing library allows you to launch multiple processes using your function definitions. Its API is almost the same as the threading module's, but the processes do not share data.

It seems the real problem, as you pointed out, is the additional memory. I didn't consider situations where each process would need an identical large data set, instead of just a small chunk to work on.
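One partial mitigation, sketched below under the assumption of a Unix "fork" start method (BIG_MODEL and the sizes are made up): build the data once in the parent and let Pool workers inherit it copy-on-write instead of pickling a copy to each of them. It is not a complete fix, because CPython's reference-count updates will still gradually un-share some pages.

    import multiprocessing as mp

    BIG_MODEL = None   # hypothetical large read-only structure, built once in the parent

    def worker(task_id):
        # Forked children see the parent's BIG_MODEL without it being pickled or copied up front.
        return task_id, len(BIG_MODEL)

    if __name__ == "__main__":
        BIG_MODEL = list(range(10_000_000))           # stand-in for the real data model
        with mp.get_context("fork").Pool(4) as pool:  # "fork" is unavailable on Windows
            print(pool.map(worker, range(8)))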


It gets more interesting when you have a large data set that's required for the computation, but as you compute, you may discover partial solutions that can be cached and used by other workers.

So not only a large read-only data set, but also a read-write cache used by all workers. This sort of thing is relatively easy with threads, but painful and slow with multiprocessing.


Depending on where you want to go and the application, such things may be a good idea for a low number of workers but can become a major bottleneck.


To add to what everyone else said, if you need transactional semantics, it's much simpler with multiple threads. With multiple processes (local or remote), you can't simply share an atomic data structure or a lock; you have to use a distributed lock or consensus algorithm, which are more complex and usually quite "chatty". If memory or network bandwidth is constrained, it may be especially desirable to eliminate this, but even if not, fast locking/transactions may be desirable regardless.

If you're using multiple processes for CPU-bound performance, why not squeeze as much as you can out of each CPU?


Just like you can share memory between processes, you can also share OS-level locks and semaphores between them. A distributed lock manager is not required for the single-node case.
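A minimal sketch of that with the stdlib, assuming the processes are all started by one parent (the file name is made up):

    import multiprocessing as mp

    def append_line(lock, path, text):
        # The lock is an OS-level semaphore shared with the parent; no lock manager involved.
        with lock:
            with open(path, "a") as f:
                f.write(text + "\n")

    if __name__ == "__main__":
        lock = mp.Lock()
        procs = [mp.Process(target=append_line, args=(lock, "out.txt", "line %d" % i))
                 for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()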


I haven't written shared memory code in literally years, I just use Redis now.


It's just low-hanging fruit for performance from the dev's point of view. It's nice and useful, just nowhere near as needed as most people asking for it pretend it is.


> We estimate a total cost of $50k...

Just looking at it from a financial perspective, having a great Python interpreter that doesn't have a GIL seems like a no brainer for $50,000, and it creates another reason why people should take a look at PyPy.

Side note: if you haven't looked at PyPy, check it out, along with RPython

https://rpython.readthedocs.io/en/latest/


Who uses PyPy? I have been hearing about it for so long now, maybe 10 years. And I have been programming in Python almost full-time for 14 years.

But I still don't know anybody who uses it. It seems like the C extension API is still an issue, or am I mistaken?


we run a large scale sockjs cluster.

Switched from CPython to PyPy, instant 3x performance boost.


How can they estimate this? What about all the libraries that might not be compatible with the solution PyPy comes up with?

This feels like a number that might in the end blow up to 10x the original estimate.


It's not the PyPy developers' job to make every Python library threadsafe, people writing libraries will have to make their code threadsafe, like in every other language.


There is a clear difference here, though. Making a change that could lead to poorly written libraries now being broken is clearly the fault of the change. Userspace for these libraries is defined by how they actually behave, not how they were intended to behave.

(And really, was it intended to be dangerous in this way?)


> There is a clear difference here, though. Making a change that could lead to poorly written libraries now being broken is clearly the fault of the change.

No, these libraries are already semantically broken in the same way e.g. libraries which didn't properly close their files and assumed the CPython refcounting GC would wipe their asses were broken.

They're already broken under two non-GIL'd implementations.


I agree. Even developers who are well aware of how to write thread-safe code probably don't even bother with mutex locking in Python. That code isn't poorly written... it's just code targeting the implementation.


No the fault in that situation is a user blindly upgrading PyPy without testing the totality of their software package and its dependencies.

Expecting bad code to magically work forever is unrealistic and hinders progress.


Then just use the version of PyPy with a GIL?


That's not the concern. Python already has threads and race conditions (although the GIL means that the interpreter itself probably won't get corrupted while executing a piece of bytecode).

What Python doesn't have is a C API for extensions that makes sense without a GIL. So ideally a correct threadsafe C extension will continue to be correct, which probably implies that a function called "PyEval_AcquireLock" will continue to provide similar guarantees. Which means that utilizing more cores with pure Python code in one process will probably be a gradual upgrade path.


C extensions will still run under the GIL


Given the amount of C extension code running in a typical large Python app these days, isn't this basically defeating the purpose?


It really depends on the use case I think


Would you say the Python ecosystem is stuffed to the GILs with incompatible libraries?


This reminds me of how the sales/marketing teams in my company typically sell new features: "Not having this feature costs us 50k a month!"

That may well not be the case here; I just found it funny that 50k is our little magic number too.


Andrew Godwin raised £17K against a £2.5K goal to implement the (I believe excellent) Django migrations that are now part of the official repository: https://www.kickstarter.com/projects/andrewgodwin/schema-mig...

Neither do I think that raising $50K for a Python interpreter would be an issue.

PS: I don't find Django an excellent ORM per se. On the other hand it's highly pragmatic, and its automatically generated migrations have saved a good chunk of my time.


There seem to be a lot of naysayers in the comments about removing the GIL. Multiprocess parallelism isn't always appropriate, so I find this to be a very promising change that will definitely make me want to switch to PyPy. Here are the use cases where I've found multiprocessing to be inappropriate:

* High-contention parallel operations. Doing synchronization through a Manager (a separate IPC-based synchronizing broker process) is of course less preferable than, say, a futex.

* Embarrassingly parallel small tasks. This is a big one. If the operation being parallelized is short, then message-passing overhead takes up more runtime than the operation itself, like a bad Amdahl's Law scenario. Shared address space multithreading solves this problem.

* Related: parallelization without the pickling headaches! Many objects can be synchronized but not easily pickled or copied. True multithreading would enable a large number of use cases (map a lambda instead of a named function, anyone? see the sketch at the end of this comment) since the same Python interpreter can just pass a pointer to a single shared object.

* Related: lots of libraries (Keras, TensorFlow, for instance) make heavy use of module level globals, and aren't meant to be run on multiple cores on the same machine (TF, for instance, hogs all GPU memory). Multithreading in these deep learning environments (assuming PyPy support from those packages) is useful for parallelizing the input ingestion pipeline. But this point isn't TF/Keras dependent; I can't recall other modules but don't doubt the heavy use of module-globals that's unfriendly with fork()-ing, especially if kernel-related state is involved.
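To make the pickling point concrete, here is a minimal sketch of the difference (the try/except only exists to show the failure mode):

    from concurrent.futures import ThreadPoolExecutor
    from multiprocessing import Pool

    if __name__ == "__main__":
        data = list(range(8))

        with ThreadPoolExecutor(4) as ex:
            print(list(ex.map(lambda x: x * x, data)))  # fine: same interpreter, nothing pickled

        try:
            with Pool(4) as pool:
                pool.map(lambda x: x * x, data)         # the lambda must be pickled to reach workers
        except Exception as exc:
            print("multiprocessing refused the lambda:", type(exc).__name__)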


> Multiprocess parallelism isn't always appropriate

Using Python isn't always appropriate.


Are you saying that because a language is missing something, when considering a fix for that thing, the existence of other languages/solutions is an argument against that fix?


I'm saying that hammers and saws exist because more than one tool is needed to solve problems.


> There seem to be a lot of naysayers in the comments about removing the GIL.

That's because it's been attempted over and over and over again. And each time it ends up failing due to the decrease in single-threaded performance (the bevy of necessary memory mutexes isn't free), and the extensive amount of work required to make all of the standard libraries threadsafe.

I don't buy the $50,000 cost for a second. Sure, you might be able to safely change the interpreter for that little money, but you couldn't fix up performance and the standard library for that.


Simplicity of implementation and single threaded speed seem to be, well, implementation issues. Nonetheless, they are reasonable doubts about the project. However, my comment was mostly aimed at the other commenters who were saying multiprocessing suffices for parallel workloads - that came off as dismissive for the reasons I mentioned above.


This seems like a good place to spruik something I made, a Python package for profiling how much the GIL is held:

https://github.com/chrisjbillington/gil_load

In my experience, the GIL is not held for nearly as high a proportion of the time as people think, because properly written C extensions and blocking I/O release the GIL. So long as the proportion of time the GIL is held is not approaching 100%, you can still get gains from threading. This is almost always the case in numerically heavy code that uses numpy or scipy, since those extensions release the GIL. Threads speed up this code almost as well as they would in a GIL-free interpreter.

And usually, long before you consider multithreaded code, you'll want to move the bottlenecks of your code over into Cython or something, since that can give speedup factors much larger than multithreading. In which case all you need is a "with nogil:" around the meaty bit of your Cython code, and then it too will be able to get speedups from multithreading.
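You can see the effect even without Cython or numpy, because some stdlib C code (hashing, compression) also releases the GIL for large inputs. A rough sketch of the kind of measurement I mean (sizes and thread counts are arbitrary):

    import hashlib
    import time
    from concurrent.futures import ThreadPoolExecutor

    CHUNKS = [bytes(20_000_000) for _ in range(8)]    # 8 x 20 MB buffers

    def digest(buf):
        return hashlib.sha256(buf).hexdigest()        # hashlib drops the GIL for large inputs

    start = time.perf_counter()
    serial = [digest(c) for c in CHUNKS]
    middle = time.perf_counter()
    with ThreadPoolExecutor(4) as ex:
        threaded = list(ex.map(digest, CHUNKS))
    end = time.perf_counter()

    print("serial: %.2fs  threaded: %.2fs" % (middle - start, end - middle))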


The ideal solution is for someone to design a new programming language that is as similar to Python as possible without requiring a global lock. Rarely used features that make it hard to parallelize Python would be dropped. STM might be built into the language instead of being hacked into one implementation, etc.


> language that is as similar to Python as possible without requiring a global lock

Something like Pony[0] or Nim[1]? I'm not very familiar with either one, but Nim says it is inspired by Python, and on the surface Pony appears to be as well.

[0] https://bluishcoder.co.nz/2015/11/04/a-quick-look-at-pony.ht...

[1] https://nim-lang.org/features.html


So basically recreate an entire language and library ecosystem because there is one feature that is less than ideal? I hope you realize why a better approach may be to reengineer that one component...


Python has many less-than-ideal features. Do you think we finally got it right, that we will use Python forever, and that the library work of the past decade or so is irreplaceable?

"Is it possible that software is not like anything else, that it is meant to be discarded: that the whole point is to see it as a soap bubble?" -- Alan Perlis


Just to go from py2 to py3, which was relatively a MUCH smaller change than a whole new language, it's taken a decade and it's still far from over. I don't see how a whole new language would be any better. And it's not like there's a lack of new languages popping up left and right. There's a reason most of them just die out. It's insanely hard to gain critical mass unless you have a huge backer, like a whole organization or company using the language.


At the end of the day, the reason we write software as software engineers is to solve real-world problems, not to have a perfect, beautiful language. What you are describing is equivalent to doing an amputation when all you need is antibiotics.

There are several things I personally _hate_ about python, but there is a cost-benefit that comes from engineering new things. What new problems are we going to be able to solve by using a new language? If the answer is clear (e.g. imperative programming vs declarative/functional programming let you solve different kind of problems) then it makes sense to do. If certain constructs enable you to completely avoid a recurring mistake (e.g. garbage collection), then it may make sense.

But this?!?!? No man, you don't need a new language to fix this.


> I hope you realize why a better approach may be to reengineer that one component...

Top comment is proposing basically Erlang or an actor model.

As for immutability... well they have to either have it or manage mutable state.

That task of engineering is not something to scoff at, and I think building a new language or using an existing language with those abilities would help. Erlang is not a number-crunching language. But there are others, such as Pony.


If you want other multiplatform, open-source, highly parallel languages with nice syntax and quick turnaround, we already have a few, like Elixir, Racket, or, well, even ES6.

Much of Python's appeal is in its huge, colossal, powerful ecosystem, with modules for everything, and things like numpy or tensorflow using it as the high-level interface language. Not breaking this is probably more important for success than efficient in-process data sharing. (Yes, process pools, queues, and a shared DB cover most of my cases.)


I don't mean to be pedantic, but please explain how ES6 is a "highly parallel" language.


You have web workers, generators, all the async stuff, futures and promises — plenty enough from the language perspective. Maybe node.js does not happen to be multi-threaded, but it's not about the language.


That's all true about Python, too, and it's been true since before Node existed. If Node were adequate, Python would be adequate too.

In fact, other than "run lots of Javascript", I'm not sure I can name a single thing Node did before Python.


By the same logic Python is multithreading-ready as well, since it's only its major implementation that keeps it from supporting multithreading.


> By the same logic Python is multithreading-ready as well, since it's only its major implementation that keeps it from supporting multithreading.

How are web workers not threads? Browsers are more widely deployed than Node, even with the same V8 engine.


Jython and Iron Python lack the GIL. It's just an implementation detail of the underlying VM. There's nothing in the language itself which requires a GIL.


> It's just an implementation detail of the underlying VM.

Python itself is just an implementation detail of the underlying VM.


C API... You can argue it's not a part of the language, but PyPy was forced to support it in the end.


IronPython interoperates with a whole host of C and C++ code. I'm not sure why this would matter?

The initial implementation may need to assume single-threaded C interface support and take a global lock but it wouldn't be a stretch to have these things declare they are multithread aware and relax that restriction.

Forgive me but most of these objections seem like post-hoc rationalizations. The first step is deciding to support a GIL-less multithreaded mode. After that, solve the problems one step at a time.

It is amazing how many times accomplishing "magic" boils down to:

1. Decide we're going to solve this problem.

2. Iterate toward the solution in manageable steps.

#1 is by far the most difficult aspect :)


Jython uses the JNI and IronPython does it through C++/CLI, neither of them support the CPython extension interface, meaning the C modules aren't compatible. Because of this, Jython and IronPython inherit the interface properties of their respective VMs and they can remain thread safe without the GIL.


A fork with no backward compatibility is hardly "the ideal solution". A healthy ecosystem is crucial for the sustainability of any programming language.


Perl 6 doesn't have a GIL, and already has a sane concurrency model, but the lack of libraries and community interest seems to make that pretty much a non-starter.


It’s also still dog-slow for the (Perl 5 / scripting-language) common case, which makes whatever theoretical performance improvements to its semantics a bit academic at this point: https://news.ycombinator.com/item?id=15004977


If you use master instead of the current release that example goes from about two mins to just under one minute for my machine.

Also, if you use .subst('y', 'n') instead of a regex it runs in under 9 seconds locally. That's still much slower than perl 5 (which locally takes less than half a second) but they're making great strides at improving performance.


So, doing another simple change to my code brought the runtime down to a few ms. Doing:

  time yes | head -n1000000 | perl6 --profile -e 'for $*IN.readchars { .subst("y", "n").print }' > /dev/null
says it took 86 ms. Which is pretty decent I’d say.


Sure, but that's still kinda defeating the purpose of providing functionality that's easy to remember and type. Might as well do

  time echo "#include <stdio.h>
  main(n){char*b,*s=n=0;getdelim(&s,&n,-1,stdin);for(b=s;*s;++s)*s=='y'&&*s='n';puts(b);}">s.c|yes|head -n1000000|tcc -w -run s.c>/dev/null

  real	0m0.028s
  user	0m0.023s
  sys	0m0.019s
28 milliseconds!


I was mostly just pointing out that perl6 might not be as slow as the comment you linked suggests. There seems to be something about using `perl6 -pe` that caused it to be slower than you'd expect. However, using a different approach it's reasonably fast, or at least it appears reasonably comparable to the perl 5 example that was provided.

Hopefully more edge cases will be discovered and fixed as the implementation matures as well.


Well, that and Perl has to be one of the most unreadable languages out there ;)


Only if you don't know Perl... To me Python is more unreadable ;)


Only if you know ALL of Perl.

I used to carry around a Perl program of my own on a printout to take to VLSI interviews. That way when I got the "Do you know Perl?" question I could bring it out and force the interviewer into MY stupid subset of Perl rather than being stuck in his stupid subset of Perl.

That's not a compliment to the language.


Is Python really any different or is it just wishful thinking?

Why do I see Python programs that look like a weird mix of Lisp and Java? Surely it is because, even with what Python enforces, there are many, many ways to produce unclear code that really don't even have to do with the language used.

And why did my employer see the need for the comprehensive Python style guidelines manual... I guess Python bit just as hard as Perl.

Also, now that Python is used more by newbies and non-programmers, that is where more of the bad code ends up (aka Perl late 90s, which still pollutes the internet). The quality of Perl frameworks, libs, example code etc. is actually increasing and getting easier to find.


> Is Python really any different or is it just wishful thinking?

I'll stick my neck out and say that personally I feel that Python is actually better in this regard. Perl pissed me off so much that I left Perl at the height of its popularity. I probably had at least 5 years (probably closer to 10+, but my memory of dates is fuzzy--pretty sure I never used Perl 3, though ...) of professional use of Perl under my belt at that point.

I still migrated to Python.

That's a pretty big indictment when someone is willing to go back to being a n00b rather than continue to put up with continuing grief.

> Why do I see Python programs that look like a weird mix of Lisp and Java? Surely it is because even with what Python enforce, there are many many ways to produce unclear code that really don't even has to do with the language used.

The difference, from my experience, is that newbies in Python don't write code that A) other newbies can't understand and B) requires excessive mental effort from experienced folks to understand. Neither of those were true in my experience for Perl. A newb in Perl would very quickly trip something that would boggle even your local Perl expert.

I remember when I was a beginner and posted a 15 line program to comp.lang.perl and watched several of the experts actually wondering what the correct interpretation of the grammar was. I don't think I ever managed to cause something like that in Python. Obviously, I threw that program out post haste.

Sure, people can make amazingly complicated things with generators, decorators, etc once they get the hang of the language. However, newbies don't normally do this in Python. As you point out, they normally write it like Java (for better and worse).

In Perl, newbies immediately had to grapple with things like list vs scalar context, and god help you if you tripped over a corner case (although, to be fair, God, in the form of Randall Schwartz, WOULD quite often help you if you asked on comp.lang.perl ...)

> And why did my employer see the need for the comprehensive Python style guidelines manual... I guess Python bit just as hard as Perl.

I would personally expect that any company writing a very large quantity of code would produce a style guide for any language they use.

> The quality of Perl frameworks, libs, example code etc is actually increasing and getting easier to find.

That's actually really cool.


What makes it so hard for CPython to drop the GIL is keeping backward compatibility for the CPython C API. If you're willing to break the API there's no need for a new language.


If we're talking about creating an entirely new language, I don't think the "ideal" is Python with a few tweaks; it's going to be radically different. The point is, you have to draw the line somewhere, and if you're going to build an entirely new language, you should probably address as many problems as you can; few will switch to an incrementally better Python (unless you can give strong compatibility guarantees, in which case it's arguably not a new language).


We could call it Python 3.


What are these features?


"It mostly works for simple programs, but probably segfaults on anything complicated" is not a promising beginning. Starting with race condition chaos and trying to patch your way out of it with "strategic" locking

a) Inspires much less confidence than starting with a known-correct locking model (the degenerate case being a GIL) and preserving it while improving available concurrency.

and

b) Seems at least 50/50 to end up without much in the way of tangible scalability gains once enough locking has been added to reduce the rate of crashes and data corruption to an acceptable (?!) degree. At least that was my takeaway from all the challenges Larry Hastings has documented while working on the gilectomy. Sure, they don't have to worry about locking around reference counting, but it's not like writing a (concurrent?) GC operating against concurrently executing threads isn't a significant design challenge itself with many tradeoffs to make.


> "It mostly works for simple programs, but probably segfaults on anything complicated" is not a promising beginning.

Perhaps they would have done better to say "it works correctly for all programs that do not assume the built-in data structures are threadsafe". That is an accurate description, what you quoted is a reasonable approximation.


The concurrent garbage collector has already been written. It probably has bugs, but the groundwork was done for the STM effort and has been redone now. The mess of race conditions is surely not a great place to start, which is why it would take a man-year to finish :-) I don't see a much better starting point tbh, other than to look at all the mutable data structures in Python (of which there are too many) and try hard.


This would be great if it means we can run the C portions of Python in threads without performance hits. I recently started a little project that is a cross-platform GUI for batch bzip2 compression, and Python did it quite well with its built-in bzip2 module. But once I tried to do it in parallel, the performance impact of the GIL was obvious. Yes, you can work around that with multi-process, but I'd rather not be spamming the running processes list and have to actually handle separate processes that should be threads.

In the end I settled for C++ and Qt with the native bzip2 library with a few modifications.


In normal CPython, you can design your C extension (such as bzip2) to release the GIL while it runs. This is one of the few times when threads are useful in Python. It's also why scipy etc are as fast as they are.

I don't know if the bzip2 module does this, but it probably should.


This. Any part of my numerical code that is a bottleneck either already comes from scipy or numpy, or is something I'm going to write in Cython if possible. Rewriting in Cython is already the optimisation you would do before going multithreaded, because it can get you factors of 10, 100, etc., whereas multithreading gets me a factor of 4 to 8 depending on how many cores I have and how independent the workload is.

So by the time it comes to consider multiple threads, the bottlenecks that I want to parallelise are already non-GIL-holding.

I wrote a tool to measure what proportion of the time the GIL is held in a program:

https://github.com/chrisjbillington/gil_load

I encourage people to measure what fraction of the time the GIL is actually held in their multithreaded programs. Unless it's approaching 100%, go ahead and use more threads! You will get a speedup. It's my experience that this is true more often than not. The biggest exception is poorly written C extensions that do not release the GIL even though they have no need for it. But if you're writing your own in Cython, it's a matter of just typing `with nogil:`.


Yeah, the stdlib bz2 module does release the GIL for the duration of the compression (though it locks the compressor object simultaneously): https://github.com/python/cpython/blob/d4b93e21c2664d6a78e06...
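So in principle the batch-compression case can stay threaded on CPython. A minimal sketch (the file names are hypothetical):

    import bz2
    from concurrent.futures import ThreadPoolExecutor

    def compress_file(path):
        # bz2.compress runs in C with the GIL released, so these calls overlap across cores.
        with open(path, "rb") as src:
            data = src.read()
        with open(path + ".bz2", "wb") as dst:
            dst.write(bz2.compress(data, compresslevel=9))
        return path

    files = ["a.bin", "b.bin", "c.bin", "d.bin"]
    with ThreadPoolExecutor(max_workers=4) as ex:
        for done in ex.map(compress_file, files):
            print("compressed", done)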


> I'd rather not be spamming the running processes list and have to actually handle seperate processes that should be threads.

I may be a bit naive asking this... but why would you care that much?

Looking at activity monitor on my Mac, I count 14 Google Chrome Helper Process instances each spawning upwards of 13 threads. Adobe does something similar, as do several other programs/applications on my machine. Yet, my machine is mostly idle.

I can only speak for myself here. If I want something done on my computer... I don't care if it spams my process list if that is what it takes to complete the task. Don't crash my machine, but do what you have to do to get it done quickly.


This is a parallel compression application that uses all cores of a system by default. On some systems it may hit 100% HDD, on others near 100% CPU. It's meant to take up as many resources as it can unless its core usage is lowered. But, as with any program that has a high workload, the potential exists that the program's UI will not respond, or perhaps your desktop won't even allow you to get to the UI to stop the process. This is where task manager saves the day.

Along with that, I like it to be a single process so its easily wrappable in whatever monitoring or process-throttling application you want. I will admit I'm completely assuming that multiple processes is harder than a single process to do that with.

Also, when you get up to the 16-thread count, seeing that many processes pop up at the top of your process list is both annoying and doesn't easily show how much the application is using overall. It could also be scary to users who have never seen that before and think it's trying to run a whole bunch of programs.

Yes, some of those are clearly nitpicks and not good technical reasons, but this is a problem that is fixed with a good framework anyways.


A process swap completely wipes the cache. Once swapped in, your process is not up to top speed for a while, until the working set has been copied into cache. You'd like to keep it that way for as long as possible. Best case scenario: one process per core.


There's more overhead communicating between processes than there would be if threads were just modifying shared state.


If you're calling into C, you can release the GIL from C for the duration of the call. You have to re-acquire it before the call returns, and you have to be careful not to call any of the Python C API functions that rely on the GIL (reference counting and such, for sure). Of course, if you want to do this, you can't simply call into a random C library directly but have to write a C stub.


Did you investigate the multiprocessing library?

Also this kind of thing should be relatively light on the GIL if done correctly. The bzip2 module releases the GIL (I assume?), as does file IO, which is most of the workload in your use case?


That doesn't seem quite right. C extensions can release the GIL and still continue running. So long as they are not operating on Python objects directly it is safe.


Were you running on Windows or Linux? It's my understanding that multiple processes doesn't have a big performance penalty on Linux compared to multiple threads.


I was running under Windows, but the application is meant for Windows, Linux, and Mac.


I just can't stop thinking that somewhere along the line one of the Guidos should have reacted to handing out global locks left and right. I mean, that's fine as long as it's only you and your friends using it. But once it starts spreading, these are the kind of issues that need to be kicked out of the way asap. Lock granularity affects the entire design of client code; reducing it basically means rewriting everything.

Ah well, at least it serves as a warning sign for budding language composers as myself. Snabel did full threading before it walked or talked:

https://github.com/andreas-gone-wild/snackis/blob/master/sna...

And to any Pythoneers with sore toes out there: pick a better language or learn to live with it, down-voting me will do nothing to solve your problems. It's a tool, we're supposed to pick the best one for the job; not decide on one for life and defend it to death. Imagine what could happen if language communities started working together rather than competing. There is no price to be won, we're all being taken for a ride.


Ick, I'd forgotten how monkeypatchable the core of Python was.

If not for that, I'd focus on supporting some kind of pseudo-process where multiple instances of the Python interpreter could be loaded but they would only share pure-functional libs which, I assume, could be used in a threadsafe fashion... but then you run into the mutability of those libs. Well, the mutability of everything in python. Plus what happens if those libs expose anything that you could hold a reference to - what happens to refcounting in a multithreaded Python?

Honestly, I feel like the world has passed Python by. At this point the cost of its performance limitations don't seem to be worth its payoff. Not that it's a bad language - I like Python. I just don't really feel the need to use it for anything anymore.


Excellent! Where's the Donate button or call to action for businesses who want to support this? There's a small link in the sidebar to "Donation page", but that doesn't seem to have a place to donate for the remove-the-GIL effort.


As mentioned in the blog post the individual donation buttons are not a resounding success. I'm happy to sign contracts with corporate donors (or even individuals) that we'll deliver. My mail should be public, if not #pypy on freenode or fijal at baroquesoftware.com


Is the issue that individual donations are unpredictable (and therefore difficult to use as justification for such a large scope increase)? Would you consider setting up something akin to a Patreon to allow individuals to commit to recurring monthly support for the project?


The main issue is that the effort it takes to set up and maintain it greatly outweighs the amount of money we get (typically). There is also complexity with taxation, jurisdictions and all kinds of mess that is usually very much not worth a couple of dollars (e.g. $7/week on Gratipay).


> Since such work would complicate the PyPy code base and our day-to-day work, we would like to judge the interest of the community and the commercial partners to make it happen (we are not looking for individual donations at this point).


Personally, I don't think the GIL matters. First of all, most of us run apps on Linux, which has reduced the overhead of processes so much that threads have lost much of their advantage. Secondly, people understand that locks are generally a bad thing to use unless you really are a threading/locking rocket scientist. Most mere mortal developers are better off using message queues. Even the Java world has mostly given up locks in favor of java.util.concurrent, which was implemented by serious experts to handle all of the corner cases that you would not think of. Third, using an external message queuing system like RabbitMQ gives you other benefits. And fourth, writing distributed apps glued together by message queues helps you avoid the dreaded Big Ball of Mud.

At this stage in Python's evolution, I view the GIL removal as a computer science project that some people will implement again, and again, just to learn or to exercise their chops. Great idea! Just don't demand that the entire community of Python developers goes down your road.

If CPython never gets rid of the GIL that suits me just fine. GIL free programming can be done on other implementations of Python like Jython and IronPython. As far as PyPy is concerned, as long as it does not disrupt the use of PyPy as a means of speeding up a CPython app from time to time, then have fun.


Coming from mobile and desktop programming, most use cases I've seen for threads revolve around doing something in the background to keep user interfaces responsive. That use case already has a threadsafe queue. The UI queue.

When your thread finishes or is ready to signal progress, you queue the event to the UI thread and forget about it.

Now, I've been following this pattern for a long time and have no experience dealing with the GIL. How is this removal of the GIL going to affect this use case, if at all?
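For reference, the pattern described above looks roughly like this (a minimal sketch, with a plain queue standing in for whatever "post to the UI thread" call your toolkit provides):

    import queue
    import threading
    import time

    ui_queue = queue.Queue()              # stand-in for the toolkit's UI event queue

    def background_task():
        for pct in (25, 50, 75, 100):
            time.sleep(0.5)               # pretend to do some work
            ui_queue.put(("progress", pct))
        ui_queue.put(("done", None))

    threading.Thread(target=background_task, daemon=True).start()

    while True:                           # stand-in for the UI event loop
        kind, value = ui_queue.get()
        if kind == "progress":
            print("update progress bar to %d%%" % value)
        else:
            print("finished")
            break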


Just reading this post makes me think that it could do with a bit more "marketing" speak. I love python. I use it day to day and realize there is a GIL.

But give me some business reasons as to why removing the GIL is critical. Will it save me a ton of money? Will my stack magically just run faster?

I wonder if Google has already done so since they would benefit quite a bit from a GIL-less python.


> We have some money left in the donation pot for STM which we are not using; according to the rules, we could declare the STM attempt failed and channel that money towards the present GIL removal proposal.

I didn't donate to that pot but that does seem like a judicious and reasonable step to take given the assessment of STM.


I just made a quick test: CPython-3.6.1 vs. Jython-2.7.0 (May 2015)

I ran Larry Hastings' Gilectomy test program x.py: fib(40) on 8 threads. HW: MacBook Pro, 2015, 8 (4+4) cores, 1Gb RAM

Jython ran the program 8 times faster, utilising all 8 cores at >95%. Python ran on 1-2 cores at less than 60% utilisation. (Pretty sure Jython will run 16 times faster on 16 cores.)

It's 2017, why this is acceptable to GvR and the Python community is beyond me.

Jython: real 1m4.959s user 7m38.521s sys 0m2.396s

Python: real 8m19.035s user 8m16.508s sys 0m11.424s


Could you post the code for that fib program? I couldn't find it anywhere.


It is in the Gilectomy branch of Larry Hastings github project. (https://github.com/larryhastings/gilectomy)

I also pasted the fib test to pastebin: https://pastebin.com/Ryyb2K7V
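Not the exact contents of that pastebin, but the shape of the test is roughly this (my own reconstruction, for readers who just want the idea):

    import sys
    import threading

    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    n = int(sys.argv[1]) if len(sys.argv) > 1 else 40
    threads = [threading.Thread(target=fib, args=(n,)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()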


Ah so just the naive recursive fibonacci on 8 threads with no data sharing between them.

Interestingly, doing the same on CPython using the multiprocessing module was ~2x slower than jython/threads. More interestingly, pypy with multiprocessing was ~5x faster than jython/threads.

  $ time jython fib.py 40
  real	1m11.247s
  user	6m14.130s
  sys	0m3.012s
  
  $ time python fib.py 40
  real	2m4.067s
  user	11m46.103s
  sys	0m2.352s
  
  $ time pypy fib.py 40
  real	0m21.040s
  user	1m51.461s
  sys	0m1.892s


Super glad they're going to try using mutexes and not that STM approach which was looking to be immensely complicated. Was not looking forward to the kinds of interpreter bugs that was going to produce.


I can't believe it's 2017 and the official pypy updates come from blogspot; I thought this was a plea from a community member.

Anyway, really good on them to finally move on killing the GIL. It's been a long-time issue - the type that only gets worse the longer you ignore it. That said, I think today Python and the GIL are synonymous and the entire Python ecosystem has almost evolved around the GIL. While I'm sure there are applications that would benefit from its removal, I think on the whole the ecosystem will not change much because of this.


Perhaps a little unrelated, I used the rpyc package to get Jython and CPython working together. In the end I was able to use Java libraries from CPython pretty much seamlessly.


You mean you used RPyC at both ends, on the CPython side and on the Jython side. Cool idea. I knew about RPyC but had not thought of using it in this way. And getting access to Java libraries by doing this, can be very useful, I can see.


Great, and I want to be a billionaire with washboard abs. The main problem is that none of the Python code that currently exists is thread-safe, so you might as well start again from scratch. Python is a needlessly complicated language with two important things: NumPy and TensorFlow. These use Python as a scripting language for C. Just move to Go, Scala or Elixir/Erlang if you want to avoid the GIL (or write anything parallel). You can thank me later!


Do people here use pypy in production? What are the benefits?


I tried. My company has a python API that we run on our machines, we sell the machines to businesses and don't manage them ourselves. We wanted to see if we could get some easy performance increases without too much investment.

At the time (a year ago) there wasn't a way to precompile using pypy, which meant shipping pypy along with gcc and a bunch of development headers for JIT-ing. Additionally, one of the extensions we used for request validation wasn't supported, so we'd be forced to rewrite it. I also found that the warmup time was too much for my liking; it was several times longer than CPython's and it became a nuisance for development. I guess I could've pre-warmed it up automatically, but at that point I had better things to worry about and abandoned trying to switch.

I'm sure, given enough resources, it would be a lot better. But it's not quite as simple as switching over and realizing the performance increases without some initial investment.


Sure. Free 2-5x speedup. pypy + pypy's pip generally works as a transparent drop-in replacement to python + python's pip, so it's free speed.

It doesn't (or didn't) work when you need to rely on an extension that uses Python's C API. I haven't followed the scene in awhile so maybe that's changed. pypy's pip has so many libraries that I hardly notice, so maybe they solved that.

Unfortunately python is fundamentally slower than lua or JS, possibly due to the object model. Python traps all method calls, but even integer addition, comparisons, and so on are treated as metamethods. That's the case for Lua too, but e.g. it's absurdly easy to make a Python object have a custom length, whereas Lua didn't have a __len__ metamethod until after 5.1. I'm not sure it even works on LuaJIT either. Probably in the newer versions.


I can't tell what you mean by the last paragraph there, but oftentimes PyPy's speedups come exactly from inlining stuff like what you refer to there -- Python's not fundamentally slower, it's those kinds of stuff that you can speed up.

(And yeah the CPython API is still a pain point if you've got a library that uses it, although some stuff will still work using PyPy's emulation layer. It'd be great if people stopped using it though.)


For example, Python makes it fairly easy to trap a call to a missing method, both via __getattr__ and __missing__. In JS the only way you can do that is via Proxy objects, and even those have limits.

You can't always inline the arithmetic ops effectively. You can recompile the method each time it's called with different types, but that's why the warmup time is an issue. This wouldn't be a problem if Python didn't make it so trivial to overload arithmetic. JS doesn't.
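For instance, a minimal sketch of the two hooks mentioned above:

    class Remote:
        def __getattr__(self, name):
            # Called only when normal attribute lookup fails; build a stand-in method on the fly.
            def method(*args):
                return "would call %s with %r" % (name, args)
            return method

    class Defaults(dict):
        def __missing__(self, key):
            # Called by dict.__getitem__ when the key is absent.
            return "<no %s>" % key

    print(Remote().ping(1, 2))   # would call ping with (1, 2)
    print(Defaults(a=1)["b"])    # <no b>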


Ah! Yes, agreed, Python does certainly make it too easy to do things that cannot reasonably be sped up.


Twist: Lua makes it trivial to overload arithmetic using metatables, but LuaJIT seems to have solved that. If there is any warmup time, it's hard to tell. Mike Pall is a JIT god, and I wish we had more insight into everything that went into producing one of the best JIT's of all time.

I'd love a comment/post that highlights the differences between JS and Lua as the reason why LuaJIT was able be so effective. There must be differences that make Lua possible to speed up so much. There are easy ones to think of, but the details matter a lot.

EDIT: I found some discussion at https://news.ycombinator.com/item?id=1188246 but it left me wanting more.

Related:

https://stackoverflow.com/questions/4911762/why-is-luajit-so...

http://article.gmane.org/gmane.comp.lang.lua.general/58908

http://lua-users.org/lists/lua-l/2010-03/msg00305.html



I tried it in digital forensics. Depends on the project. May get up to a 5x speedup in the software that runs, after a lot (a loooooooooot) of complaining by it. Many projects didn't manage to run though. In the end, not a truly significant speedup (the bottleneck tends to lie somewhere else) for the effort that is required to get everything to work.

PS: I do realize "digital forensics" is probably not the kind of "production environment" you were thinking. Just a small datapoint about a particular branch of software that, while getting good speedups, may not benefit as much as the "X times faster" line would suggest.


Switched from CPython+Numpy to PyPy years and years ago, got a 60x speedup on a core numerical kernel and 20x speedup on real-world benchmarks. The codebase was a multiplayer game server. Less memory usage overall, leading to a big improvement in the number of players that could be connected.

You have to not have problematic libraries in your system, but honestly they're all either shitty on CPython too (literally every GUI toolkit that is not Tkinter!) or they're stuff like lxml, where the author/maintainer just has an anti-PyPy bias that they won't drop.


We've been running a very large production PyPy deployment across pretty much all our Python apps for about... 4 years now. Saves us a ton of money for essentially no real downside.


Just out of curiosity, would you be willing to answer a few more questions? What has the memory tradeoff been like? What is the workload you're using it for?


Certainly! It's a bit hard to answer some of those questions because it's been so long since we've run CPython, and also because we've now got ~10 apps or so that run on PyPy.

Initially memory tradeoff was definitely significant, somewhere around 40% or so -- it's going to vary across applications though certainly, and in a lot of cases I'm a bit happy our memory usage went up because it forces us more towards "nicer" architectures where data and logic are cleanly separated.

Not that I mean to apologize too much for it, it's something certainly to watch, but for us on our most widely deployed low-latency, high-throughput app, we traded about 40% speedup for 40% RAM on an app that does very little true CPU-bound tasks (it's an s2s webapp where per-request we essentially are doing some JSON parsing, pulling some fields out, building some data structures, maybe calling a database or two, and assembling a response to serialize ~500 times/sec/CPU core).

On more CPU-bound workflows, like one we have that essentially just computes set memberships at 100% resource usage all day long, we saw multiplicative increases, and I can't even mention how much exactly, because the speedup was so great that we couldn't run it in our data center because it started using up all our bandwidth, so I only have numbers for once it was moved into AWS and onto different machines :).

Happy to elaborate more, as you can tell, I think companies with performance-sensitive workloads need to be looking at PyPy, so always happy to talk about our experiences.


Are you using any math packages like NumPy/Pandas or OpenCV?


Not in any production workloads.

They do work these days in PyPy though, so I'd feel comfortable doing so if we did, although I'd probably feel just as comfortable writing whatever numerics in pure-Python too unless it was stuff that already existed easily elsewhere.

On a personal note I've played with OpenCV as well (and done so with PyPy to do some real-time facial analysis on a video stream), but yeah also not for $PRODUCTION_WORK.


We do. 2x speedup.


Python's GIL issue is like the Israeli-Palestinian conflict.

1. People like to talk about it a lot, complain about it and say their opinion of what should be done with it.

2. It's not likely to be resolved for years to come.

3. In the end, the problem has very little effect on people's lives, much much less than the amount of hype around the issue.


Hi, blog post author here. Let me put an offer here:

If you want to ask a question that warrants a response (as opposed to promoting your own effort, which is valid but does not warrant a response), please mail me; the mail is public and I'll put the responses publicly on either my blog or the pypy blog.


>fully working PyPy interpreter with no GIL as a release, possibly separate from the default PyPy release

I have concerns that if such functionality is not in the main release and enabled by default (and consequently doesn't get as much testing), it will just bitrot and in the end be removed.


They're asking for funding to spend on a risk-free attempt at GIL removal (risk-free since it won't bone PyPy mainline), if the attempt meaningfully succeeded I'd imagine their next step would be making it the default.

A fully functional PyPy that could do heavy math in multiple threads would be an amazing tool in the box, but there are plenty of risks to that (penalizing single threaded performance, for example). So this strategy makes plenty of sense to me.

They can't just do it on mainline from the outset because there are huge obstacles to overcome.. for example, that ancient foe, CPython extension interface compatibility, which assumes a single global lock covering all mutable extension data. I don't think there will ever be a way around maintaining the GIL for that, even if pure Python code can freewheel it otherwise


It's not going to happen, you not only have to fix all the legacy code, but also fix the developers.


I liked "STM" as a big-idea approach, but can see how going the traditional way may bear fruit more quickly.

Could the experience gained this way (and by other projects such as the gilectomy) help with a future STM attempt?

I wonder if we need better hardware for STM to work well too.


Just curious: if they solve it in Python, would it be possible to solve it in Ruby too ?


JRuby, Rubinius and TruffleRuby are all existing implementations of Ruby without a GIL.


This is in the cards for Ruby 3. The plan is to migrate away from a global interpreter lock to "guilds" which are akin to execution contexts. These guilds also have a locking mechanism that allows for parallelism which they call the "global guild lock."

You can learn more about concurrency in Ruby 3 at this wonderful blog post: http://olivierlacan.com/posts/concurrency-in-ruby-3-with-gui...


It is already solved for Ruby when you use JRuby.


If I understand correctly the issue in Ruby is the existing C extensions that have been written to assume the lock exists...


It's the same issue with Python. AFAIK there are a number of Python libraries that are not thread-safe, and the GIL prevents them from being an issue.


I thought the GIL was not held during execution of foreign code in python (at least that was one point given for why the GIL wasn't a big deal in practice).


No, it must be explicitly released. The GIL must be held to invoke almost all of the Python C API (main exceptions: acquiring the GIL, the low-level allocator).


GIL does not help thread safety in application code (and external libraries), just in the VM.


I'm fairly certain you're incorrect. With the GIL you don't have to lock shared memory because the assumption is that only one thread will be running at a time. For example shared data structures won't be changed while being being read/written to by multiple threads, because only one thread is actually running.


You are entirely mistaken, unless all you care about are the basic built-in dict/list and some of the other built-in data structures, AND each thread only stores OR reads data (i.e. never reads and then stores it again) from a SINGLE container (you never care about consistent state between two different objects).

In my experience this is almost never the case. Moreover, this type of synchronization is trivial to accomplish with relatively little performance sacrifice.

What is much more complicated is getting more complex logic to work correctly and performantly when you are interacting with multiple different data structures from something more than a saturated loop.
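A minimal sketch of the cross-object consistency point: the GIL serialises individual bytecodes, not this whole compound update, so the explicit lock is still needed to keep the two structures in agreement.

    import threading

    lock = threading.Lock()
    balances = {"a": 100, "b": 0}
    history = []

    def transfer(src, dst, amount):
        # Without the lock, another thread could observe (or interleave with) a half-applied transfer.
        with lock:
            balances[src] -= amount
            balances[dst] += amount
            history.append((src, dst, amount))

    threads = [threading.Thread(target=transfer, args=("a", "b", 10)) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(balances, len(history))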


For this environment (i.e. you already have some kind of concurrent GC) it is probably a significantly easier problem for Ruby than for Python.

Ruby does not have that much dynamism compared to "everything is a dict of pointers to dicts" Python.


> If we can get a $100k contract, we will deliver a fully working PyPy interpreter with no GIL as a release, possibly separate from the default PyPy release.

If done as a separate release, will that version be maintained in the future?


Think about a recursive function whose implementation is changed while it is running. The replacement might have an entirely different algorithm. Which version finishes the calls already on the stack?


The version that was originally activated. I think that's the case in every single parallel implementation of a programming language ever. I can't imagine it working any other way.

When you redefine a method in any language I'm aware of you just change which method the name points to. You don't modify the original method.


So a function:

    def fun(*args):
        if not args:
            return 0
        return fun(*(args[1:]))
would be call-by-address after the first invocation? It could be lookup-by-name by way of code.


The naive implementation, and the semantic model, is always lookup-by-name on every invocation.

In practice we apply speculative optimisations including inline caching and guard removal with remote dynamic deoptimisation via safe points to make it a direct call instead.
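A quick way to see the lookup-by-name semantics in plain Python (a minimal sketch): the recursive call finds whatever the name is bound to at call time, so rebinding the name changes which body the in-flight recursion uses next.

    def fun(n):
        if n == 0:
            return "old base case"
        return fun(n - 1)        # looks up the global name 'fun' on every call

    old = fun

    def fun(n):                  # rebinds the name; 'old' still refers to the first body
        return "new base case"

    print(old(3))                # "new base case": the recursive call found the new 'fun'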


Would this work with cpython extensions that were ported to PyPy?


They would run under GIL (I can't see CPython C API being thread-friendly unless gilectomy succeeds)


Ah, ok. So this approach doesn't completely remove the GIL, but removes it as a barrier for pure python code running in PyPy?

Or does it break the current support for porting cpython extensions?


It removes it for pure Python code. The C extensions still run under the lock (which it is unfair to call an interpreter lock any more).


Sub-interpreters look like an interesting idea. I don't mind being limited to a few primitive immutable objects shared between threads, as long as something is actually shared.


Does it work also in ARM architecture ("weak" memory model), or just in x86/x86-64-like ("strong" memory model)?


Will this work with CFFI?


This is a PERFECT use case for Kickstarter. It makes me sad that this is a blog post that made it to number 1 on HN, with a vast readership and open purse strings... yet there is not a campaign fundraising link.

Use Kickstarter or Plasso to sell a pypy pro license - it's so much easier for companies to pay invoices than to donate.

If nothing else, I would pay for an official conda pypy package which works seamlessly with pandas and blas.


I think they're looking more for corporate backers. They're probably very aware that their GILless PyPy will not run many of the programs and libraries out there that are not written to be thread-safe. And when the GIL is in place, there's really not much reason to write thread-safe code. At the very least you won't notice much when you're writing unsafe code.

So I assume they're not doing a kickstarter to prevent the following from happening:

1. The internet at large will assume they're going to get a GILless PyPy that can actually run their code.

2. A separate PyPy is released that doesn't run their code.

3. People are angry that they didn't get what the thought they were gonna get, like what often happens with kickstarter backed projects.

4. With no corporate support and waning public interest due to the uselessness of a GILless PyPy, the separately released project becomes unmaintained.


> yet there is not a campaign fundraising link.

Did you read the article? They said in the article they aren't asking for individual donations at the moment:

>> we would like to judge the interest of the community and the commercial partners to make it happen (we are not looking for individual donations at this point)

Plus I'm sure they will consider using Kickstarter when the time comes.


Cash is one objective way to discern interest.


Maybe selling packaging is a good idea.... that said, kickstarter does not work in any jurisdiction we can use.


They can raise funds on their own, for example: $67,126 of $105,000 (63.9%) for py3k in pypy.

And for STM in pypy, 2nd call: $59,080 of $80,000 (73.9%).


Removing it is the easy part.



