
In what way is it a pain that threading is not?



With threading, all of your threads can refer to the same objects. Multiprocessing means you have multiple interpreters running, which means no shared memory and communication over pretty slow queues. I've definitely wanted multithreaded Python programs where all threads refer to the same large read-only data structure. The GIL makes that pointless: I can do it, but the threads never actually run Python code in parallel. And I can't do it with multiprocessing because of its limitations on shared memory.

Edit: I realize I'm contradicting myself here. Saying there's no shared memory is only a first approximation. You can have shared memory with multiprocessing, but most objects can't be shared.
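
To make that concrete, here's a minimal sketch (my own example, not something from the docs) of the kind of sharing multiprocessing does support out of the box: flat ctypes buffers via multiprocessing.Array. The limitation is exactly that ordinary Python objects (dicts, class instances, and so on) don't get this treatment.

    from multiprocessing import Process, Array

    def double_all(shared):
        # 'shared' is a ctypes array backed by shared memory, so the
        # mutations are visible to the parent process
        for i in range(len(shared)):
            shared[i] *= 2

    if __name__ == "__main__":
        nums = Array('i', [1, 2, 3, 4])   # shared memory holds C ints, not Python objects
        p = Process(target=double_all, args=(nums,))
        p.start()
        p.join()
        print(list(nums))                  # [2, 4, 6, 8]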


And yet, if you could have what you want, would it actually be faster?

The cost of synchronizing mutable data between cores is surprisingly high. Any time your CPU thinks that the data in its cache might not match what some other CPU has in its cache, the two have to coordinate. And because Python uses reference counting, data is constantly being changed even when you don't think you're changing it.
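
That last point is easy to underestimate. A tiny illustration (my example, not anything authoritative): merely creating another reference to an object writes to its reference count, which lives in the object's header in CPython, so even "read-only" use keeps dirtying cache lines.

    import sys

    data = ["some", "shared", "object"]
    print(sys.getrefcount(data))   # some baseline count

    alias = data                   # "just reading" it creates a new reference...
    print(sys.getrefcount(data))   # ...and the count went up: the object's header changed

    def peek(obj):
        # even passing the object into a function bumps the count for the call
        return sys.getrefcount(obj)

    print(peek(data))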

Furthermore, if you throw out the GIL in favor of fine-grained locking, you open up a world of potential problems such as deadlocks, which look like "my program mysteriously froze". Life just got a lot more complicated.
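
The canonical case, sketched in Python (my example, and deliberately broken; don't run it unless you want a frozen process): two threads each grab one lock and then wait forever for the other's.

    import threading, time

    a = threading.Lock()
    b = threading.Lock()

    def first():
        with a:
            time.sleep(0.1)   # give the other thread time to grab b
            with b:
                pass

    def second():
        with b:
            time.sleep(0.1)
            with a:
                pass

    # Each thread ends up holding one lock and waiting forever for the
    # other. From the outside the program just looks frozen.
    t1 = threading.Thread(target=first)
    t2 = threading.Thread(target=second)
    t1.start(); t2.start()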

It is easy to look at all of those cores and say, "I just want my program to use all of them!" But doing that and actually GETTING better performance is a lot trickier than it might seem.


Right, but like I said, I'd be fine with a read-only shared data structure. I have a problem with a hefty data model. The problem can be decomposed and attacked in parallel, but the decomposition doesn't cut across the data. Right now I run n instances on n cores, but that means making n copies of a large data structure. That eats a lot of system memory, ruins any chance I have of not wrecking the cache (not that I have high hopes there, but still), and forces me into patterns I'd prefer to avoid, like using long-lived processes because setting up the model is expensive.


You might want to look at https://stackoverflow.com/questions/17785275/share-large-rea... for inspiration.

If you need to share a large read-only structure, that approach is the best way IMO. Implement the structure in a low-level language that supports mmap (be very sure the whole structure lives inside the mmap'd block; it is easy to wind up with pointers into random other memory, and you don't want that!) and write high-performance accessors to use from your code.
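
For what the consuming side can look like, here's a rough Python sketch (the file name, layout, and Pool setup are all made up for illustration): pack the structure into one flat file, have each worker map it read-only, and let the OS share the physical pages.

    import mmap
    import struct
    from multiprocessing import Pool

    PATH = "model.bin"   # hypothetical file of little-endian doubles, written once up front
    _file = None
    _view = None         # per-worker read-only view of the shared pages

    def build(values):
        with open(PATH, "wb") as f:
            f.write(struct.pack("<%dd" % len(values), *values))

    def init_worker():
        # map the file once per worker; the OS shares the physical pages,
        # so n workers does not mean n copies of the data
        global _file, _view
        _file = open(PATH, "rb")
        _view = mmap.mmap(_file.fileno(), 0, access=mmap.ACCESS_READ)

    def lookup(i):
        return struct.unpack_from("<d", _view, i * 8)[0]

    if __name__ == "__main__":
        build([1.5, 2.5, 3.5, 4.5])
        with Pool(2, initializer=init_worker) as pool:
            print(pool.map(lookup, [0, 3]))   # [1.5, 4.5]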


Thanks for the link! Might be worth going down that path.


Good luck. Another benefit of this strategy is that you can optimize the data structure using techniques that aren't available in higher-level languages. For instance, small trees can be laid out with all of their nodes packed close together, improving the odds of a cache hit. You can also switch from lots of small strings to integers that index a lookup table of strings used only for display.

The amount of work to do this is insane: expect 10x what it took to write in a high-level language. But the performance can often be 10-100x better as well, which is a giant payoff.
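
The strings-to-integers trick carries over even to plain Python. A toy sketch of my own, just to show the shape of it: the hot data holds compact integer IDs, and the actual strings live in one lookup table used only at display time.

    class StringTable:
        def __init__(self):
            self._ids = {}       # string -> id
            self._strings = []   # id -> string

        def intern(self, s):
            if s not in self._ids:
                self._ids[s] = len(self._strings)
                self._strings.append(s)
            return self._ids[s]

        def display(self, i):
            return self._strings[i]

    table = StringTable()
    # records carry small ints instead of many duplicated strings
    records = [(table.intern("widget"), 3), (table.intern("gadget"), 7),
               (table.intern("widget"), 2)]
    print(records)                        # [(0, 3), (1, 7), (0, 2)]
    print(table.display(records[0][0]))   # widget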


Thanks! I've already partly rewritten it in C once, but I misunderstood the access pattern and I ended up having a lot of cache misses. The speedup was measurable, but disappointing, and the prospect of doing another rewrite had put me off. I hadn't put two and two together about this being an effective way to share memory under multiprocessing until reading this thread, so it's worth revisiting now.


Yeah, sharing memory between processes is a very delicate ballet to perform. That said, sharing a read-only piece of data is way simpler than you'd expect, depending on size and your forking chain. The documentation could do a better job of explaining the nuances and providing more examples.


Care to elaborate? All I've seen in the docs is how to share arrays or C structures between processes. It would take a substantial rewrite to use either. Is there some kind of CoW mechanism I'm missing?


Serializing data for IPC is often undesirable (copies kill), which leads people to multi-process shared memory. Sharing memory across process boundaries safely is a problem you avoid entirely with threading. You still need to lock your data (or use immutable data), but the machinery is built into your implementation (and hopefully trustworthy).
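
A minimal sketch of the threading side of that (my example): the dict is genuinely shared by every worker, and a plain Lock from the standard library is all the machinery you need.

    import threading

    counts = {}                   # shared by every thread: no copies, no pickling
    lock = threading.Lock()

    def record(word):
        with lock:                # the locking machinery is built in
            counts[word] = counts.get(word, 0) + 1

    threads = [threading.Thread(target=record, args=(w,))
               for w in ["a", "b", "a", "c", "a"]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counts)                 # {'a': 3, 'b': 1, 'c': 1}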


It's been a while and my memory is fuzzy, but I recall either pyodbc or pysybase reacting very poorly to the multiprocessing module. With multiprocessing, Python would segfault after the fork; with threading, it would "work", albeit slowly. Also, IIRC, it did not matter whether the module was imported before or after the fork; it still segfaulted. I never had time to track down the cause, though. Deadlines and all that.


You can't just use functions defined in your tool; you need to create a faux-CLI interface in order to run each parallel worker. Copying large datasets between processes is also inefficient. And there are cases where the fan-out approach isn't the best way to parallelize a task, and passing information back up to a parent task is more complicated than it needs to be.


"You can't just use functions defined in your tool, you need to create a faux-cli interface in order to run each parallel worker."

The multiprocessing library lets you launch multiple processes using your own function definitions, with no CLI wrapper needed. Its API is almost the same as the threading library's, but data isn't shared between processes.
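
Something like this minimal sketch (the function and data are made up): the worker body is an ordinary function defined in your tool, and there's no separate command-line entry point.

    from multiprocessing import Pool

    def transform(record):
        # any picklable function defined in your tool can be the worker body
        return record * record

    if __name__ == "__main__":
        with Pool(4) as pool:
            print(pool.map(transform, range(10)))

Arguments and results still get pickled across the process boundary, which is where the copying costs come from.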

It seems the real problem, as you pointed out, is the additional memory. I hadn't considered situations where each process needs an identical large data set rather than just a small chunk to work on.


It gets more interesting when you have a large data set that's required for the computation, but as you compute, you may discover partial solutions that can be cached and used by other workers.

So not only a large read-only data set, but also a read-write cache used by all workers. This sort of thing is relatively easy with threads, but basically impossible with multiprocessing.
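
A simplified sketch of what that looks like with threads (my own example; expensive_compute is a stand-in for the real work): every worker checks and extends the same memo table in place.

    import threading

    cache = {}                    # partial solutions, visible to every worker
    cache_lock = threading.Lock()

    def expensive_compute(x):
        return x * x              # stand-in for the real work

    def solve(subproblem):
        with cache_lock:
            if subproblem in cache:
                return cache[subproblem]
        result = expensive_compute(subproblem)
        with cache_lock:
            cache[subproblem] = result
        return result

    workers = [threading.Thread(target=solve, args=(n,)) for n in [2, 3, 2, 5]]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print(cache)                  # {2: 4, 3: 9, 5: 25}

With multiprocessing, each worker would have its own copy of cache, and keeping them in sync would mean shipping updates through a Manager or a queue.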


Depending on the application and where you want to take it, such things may be a good idea for a small number of workers, but they can become a major bottleneck as the worker count grows.



