> have to use in-process parallelism for some weird reason
Needing shared-memory parallelism is a weird reason now? Pretty much any parallel algorithm that isn't embarrassingly parallel will perform better with threads able to share memory than with message passing between processes.
Exactly, and in my experience pretty much every parallel workload is embarrassingly parallel (gather or compute a batch of data, process it, then merge the results), so I question the continuous need for in-process parallelism instead of solving it trivially with processes.
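For what it's worth, the gather/process/merge pattern you describe really is trivial with a process pool. A minimal sketch (the `square` and `parallel_sum_of_squares` names are mine, just for illustration): each input is pickled to a worker, computed there, and the partial results are merged serially in the parent.

```python
from multiprocessing import Pool

def square(x):
    # CPU-bound work on one item; runs in a worker process
    return x * x

def parallel_sum_of_squares(values):
    # Scatter: each value is pickled and sent to a worker process.
    # Merge: the final reduction happens serially in the parent.
    with Pool(processes=4) as pool:
        partials = pool.map(square, values)
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum_of_squares(range(10)))  # -> 285
```

Note the merge step is where the "trivially" claim can break down: if the partials are large, the pickling cost dominates.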
Python is not great for fine-grained data parallelism (SIMD, GPU), which is increasingly the lion's share: a non-starter for direct inline kernels and pretty bad for DSLs. The result is runtime heroics for embedded ~dataframe DSLs (pyspark, rapids.ai) with high overhead.
OTOH, those heroics do happen, and they've been OK so far. Accelerating differentiable programming is basically an extra transform layer on top of accelerating data-parallel programming. Thankfully, our team writes zero raw OpenCL/CUDA nowadays and instead fairly dense dataframe code. Much as adding async/await did a lot for web programming in Python, I'm curious what it'll take for data-parallel fragments (incl. differentiable ones). If it weren't for the language's resistance to UDFs plus their overhead, and legacy libraries built around blocking, we'd be happy.
I'm not sure what your domain is, but for machine learning that will be slow:
- Parallel matrix multiplication may look embarrassingly parallel, but it has a reduction step that is not trivial to parallelize across processes. You also need to take care of register tiling and L1/L2 cache tiling, which is way easier to do in OpenMP.
- Parallel Monte Carlo tree search: it's much easier and more efficient to spawn/collect trees with a proper spawn/sync library.
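To make the MCTS point concrete, here is a toy sketch of root parallelization, the one variant that does map onto processes: each worker simulates one candidate move independently and the parent merges the statistics at a single sync point. The game (`WIN_PROB`, three moves with fixed win rates) is entirely made up for illustration; real tree parallelization, where workers share and update one tree, is exactly what processes make painful and threads make easy.

```python
import random
from concurrent.futures import ProcessPoolExecutor

# Hypothetical toy game: three candidate moves; move 2 wins 60% of
# rollouts, the others 40%. A rollout returns 1 on a win, 0 on a loss.
WIN_PROB = [0.4, 0.4, 0.6]

def rollouts(args):
    move, n, seed = args
    rng = random.Random(seed)
    return move, sum(rng.random() < WIN_PROB[move] for _ in range(n)), n

def root_parallel_search(n_per_move=2000):
    # Root parallelization: each worker evaluates one move with its own
    # independent playouts; merging win rates is the only sync point.
    tasks = [(m, n_per_move, m + 1) for m in range(len(WIN_PROB))]
    stats = {}
    with ProcessPoolExecutor() as ex:
        for move, wins, n in ex.map(rollouts, tasks):
            stats[move] = wins / n
    return max(stats, key=stats.get)

if __name__ == "__main__":
    print(root_parallel_search())  # picks the move with the best win rate
```

Sharing a single tree across workers (so playouts can exploit each other's statistics) would require either a shared-memory tree with locks or constant inter-process messaging, which is the overhead being argued about upthread.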