I'm not sure what your domain is, but for machine learning that will be slow:

- Parallel matrix multiplication may be embarrassingly parallel, but you have a reduction step that is not trivial to parallelize across processes. You also need to take care of register tiling, L1 cache tiling and L2 cache tiling. It is way easier to do this in OpenMP.

- Parallel Monte Carlo Tree Search: it's much easier and more efficient to spawn/collect trees with a proper spawn/sync library.
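
To make the matrix multiplication point concrete, here is a rough sketch of what I mean, assuming row-major square matrices. It only does one level of cache tiling (no register tiling, no separate L1/L2 blocking), and `matmul_tiled`, `min_int` and the `TILE` value are names and numbers I made up for illustration. Threads split the row tiles, so the reduction over `k` stays inside one thread and no cross-thread reduction is needed. With GCC or Clang you would build it with `-fopenmp`; without that flag the pragma is ignored and the code runs serially.

```c
#define TILE 32 /* assumed tile size, tune it for your cache */

static int min_int(int a, int b) { return a < b ? a : b; }

/* C += A * B for row-major n x n matrices, with one level of tiling.
   Each thread owns whole row tiles of C, so the sum over k for a given
   C element is always accumulated by a single thread. */
void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < n; ii += TILE) {
        int i_end = min_int(ii + TILE, n);
        for (int kk = 0; kk < n; kk += TILE) {
            int k_end = min_int(kk + TILE, n);
            for (int jj = 0; jj < n; jj += TILE) {
                int j_end = min_int(jj + TILE, n);
                /* Work on one TILE x TILE block at a time. */
                for (int i = ii; i < i_end; i++) {
                    for (int k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < j_end; j++) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Doing the same thing across processes means either sharing or copying the tiles of `A`, `B` and `C` between address spaces, which is where most of the pain comes from.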
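
For the tree point, this is roughly the spawn/sync shape I have in mind, sketched with OpenMP tasks. It is not an MCTS implementation: the `Node` struct, its `score` field (standing in for a rollout result) and the function names are all hypothetical. Each child subtree is spawned as a task and the parent syncs on `taskwait` before combining the results.

```c
#include <stddef.h>

/* Hypothetical tree node; `score` stands in for a rollout result. */
typedef struct Node {
    double score;
    struct Node *left, *right;
} Node;

/* Sum the scores of a subtree: spawn one task per child, then sync. */
static double subtree_sum(const Node *node)
{
    if (node == NULL) return 0.0;
    double l = 0.0, r = 0.0;
    /* l and r must be shared so the child tasks write to the parent's copies. */
    #pragma omp task shared(l)
    l = subtree_sum(node->left);
    #pragma omp task shared(r)
    r = subtree_sum(node->right);
    /* Wait for both children before collecting their results. */
    #pragma omp taskwait
    return node->score + l + r;
}

/* Entry point: one thread starts the recursion, the team runs the tasks. */
double tree_sum(const Node *root)
{
    double total = 0.0;
    #pragma omp parallel
    #pragma omp single
    total = subtree_sum(root);
    return total;
}
```

The same recursion with one process per subtree would have to serialize the nodes and ship the results back, which is why I would reach for a spawn/sync runtime here.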