Parallel Programming in Multicore OCaml

sadiq · on July 5, 2020

(Multicore OCaml hacker here - speaking for myself though)

Nice to see people having a look at this. It's an early draft chapter that Sudha, one of the Multicore team, has been working on.

If you have feedback or suggestions I'm sure she'd welcome it on the draft's PR: https://github.com/prismlab/parallel-programming-in-multicor...

If you're new to Multicore OCaml the repo is at https://github.com/ocaml-multicore/ocaml-multicore . It has installation instructions near the bottom (easiest way is via OPAM).

The latest June update of our progress is available at https://discuss.ocaml.org/t/multicore-ocaml-june-2020/6047

For a technical deep dive on how Multicore OCaml works our recent ICFP paper has a lot of detail: https://arxiv.org/abs/2004.11663

Athas · on July 5, 2020

Some months ago, I used my lockdown time to make a repository containing implementations of a (bad) ray tracer in various parallel functional programming languages[0]. I was quite happy that someone contributed a multicore OCaml implementation[1]. I think the most interesting part was the series of pull requests where an expert (I think even a multicore OCaml contributor) significantly improved the performance of the ray tracer [2,3,4]. I think being able to see the individual steps is valuable, because functional programming has long promised "effortless parallelism", and hence it is enlightening to see how different well-optimised parallel functional code is from an original more or less idiomatic sequential implementation. In OCaml's case, I don't think the difference is all that great. Most of the improvements came from using a library that provides a proper task pool, since "multicore OCaml" on its own seems to just provide relatively low-level (but very well implemented) primitives. It's a little bothersome that the programmer has to manually provide a chunk size, rather than letting the runtime figure such things out on its down, but it's not terrible. It was far more annoying that multicore OCaml is (or was) missing functions for measuring wall clock time...

Although there was a later PR that improved performance 40% just by moving code around to minimise allocations... [5] Unfortunate that such minor changes can have such dramatic impact.

[0]: https://github.com/athas/raytracers [1]: https://github.com/athas/raytracers/tree/master/ocaml [2]: https://github.com/athas/raytracers/pull/4 [3]: https://github.com/athas/raytracers/pull/5 [4]: https://github.com/athas/raytracers/pull/6 [5]: https://github.com/athas/raytracers/pull/7

xvilka · on July 6, 2020

Yes, there was also a recent report [1] about the current state of Multicore. Apart from some unsolved problems the two largest problem now are 1) it's not in the mainline yet 2) lack of support in standard/third-party libraries. Sadly, it might take quite a while for ecosystem to catch up. Haskell has a similar problem with Linear Types[2][3].

[1] https://discuss.ocaml.org/t/multicore-ocaml-june-2020/6047

[2] https://gitlab.haskell.org/ghc/ghc/-/wikis/linear-types

[3] https://github.com/tweag/linear-base/

wk_end · on July 5, 2020

> The Multicore OCaml compiler comes with two variants of Garbage Collector, namely a concurrent minor collector (ConcMinor) and a stop-the-world parallel minor collector (ParMinor). Our experiments have shown us that ParMinor performs better than ConcMinor in majority of the cases. ParMinor also does not need any changes in the C API of the compiler, unlike ConcMinor which breaks the C API. So, the consensus is to go forward with ParMinor during up- streaming of the Domains-only Multicore.

Is there any use for ConcMinor at this point or is it at a dead-end?

sadiq · on July 5, 2020

The plans, and implementation effort, are focused on getting ParMinor upstreamed so I suspect ConcMinor will stay at version 4.06.

I think we were all surprised by how well ParMinor performed. There's work on-going to see just how far ParMinor can scale and there's also a few tricks we might be able to do to make it scale even better.

logicchains · on July 6, 2020

>ParMinor performs better than ConcMinor in majority of the cases

Does this just mean better throughput, or also better latency?

gadmm · on July 6, 2020

Regarding latency, beyond the kind of benchmarks proposed in the paper (from which this performance claim comes), keep in mind that the behaviour is qualitatively different between the two alternatives.

Systems programmers in OCaml can apply techniques to achieve a low latency, but these are unlikely to scale as well under ParMinor. Indeed, with the latter, it is not going to be possible to have a low-latency domain and a non-low-latency domain coexist: the whole program has to be written in the low-latency style.

This is clear from the qualification “stop-the-world” for ParMinor, but worth to note for people who would read the claim out of context. There can be further qualitative differences in the GC design that are not measured by the benchmarks.

kcsrk · on July 7, 2020

> but these are unlikely to scale as well under ParMinor

To be clear, the said systems programs are not going to see slowdown when run in a sequential setting (which is what I expect the majority of use cases will be). It is unlikely that these systems programs will immediately want to take advantage of parallelism. Moreover, Multicore OCaml aims to add support for shared-memory parallel programming to increase throughput of the program. Getting high throughput and maintaining very low latency is a big challenge generally, and can't just be solved by the runtime system. It needs a different way of writing programs altogether.

> Indeed, with the latter, it is not going to be possible to have a low-latency domain and a non-low-latency domain coexist: the whole program has to be written in the low-latency style.

This is wrong. If the low-latency code is written as it is (no allocations), then in ParMinor the low-latency domain will be responsive at the cost of increasing the latency on non-low-latency domain. Of course, no GC safe points should be inserted in those low-latency code. This design is not ossified.

Another details to remember is that ParMinor is a stop-the-world parallel minor collector. The major collection is concurrent mark-and-sweep which keeps the latency low. I'd recommend checking out the new concurrent GC for GHC which uses concurrent collection for the major heap, and parallel collection for the minor heap [1], which is similar to ParMinor.

Ultimately, Multicore OCaml aims to offer an easy way of helping 95% of the programs to take advantage of parallel execution for increasing throughput without compromising latency. It is quite hard to fully support different expert cases especially since they rely on the details of the existing runtime system, which may no longer hold when the runtime system changes.

[1] https://dl.acm.org/doi/pdf/10.1145/3381898.3397214

gadmm · on July 8, 2020

Indeed, OCaml multicore preserves the sequential low-latency setting, and this is important. Generalising this style is indeed a challenge, and I believe that there is something to learn from those expert cases.

I do not find realistic the suggestion that low-latency code could simply avoid safe points. To get a low latency, one in general avoids promoting values, but one still allocates on the minor heap. One can achieve much less without allocating at all, and moreover OCaml multicore adds extra safe points for a reason, even if they could be turned off.

In the end, what you write about details of the runtime system illustrates my point: the two minor collectors have qualitative differences regarding latency, that affect available programming styles, and this is not measured by the benchmarks.

kcsrk · on July 6, 2020

Better throughput, and almost no impact on latency overall. See https://arxiv.org/abs/2004.11663 section 6.2 for parallel performance evaluation.

badfrog · on July 5, 2020

How is Domain different/better than Deferred? http://dev.realworldocaml.org/concurrent-programming.html

sadiq · on July 5, 2020

Domain is a unit of parallelism in Multicore OCaml - it's effectively a heavyweight thread. The intention is that you tend to have as many Domains as you have cores you want to use in the computation.

Deferred relates to concurrency rather than parallelism, that you can have multiple overlapping computations. You can have concurrency without parallelism. OCaml has ways you can have parallel computations going on that don't hold the interpreter lock and I think some of the concurrent libraries can utilise this.

Multicore OCaml splits parallelism and concurrency, the former is via Domains and the latter is with Fibers. The paper I linked in my other comment on this thread touches on this a little but kc also has a short write-up of how you can use effects to write a scheduler for multicore's fibers: https://kcsrk.info/ocaml/multicore/2015/05/20/effects-multic...

sigrlami · on July 5, 2020

Is C Api changes stable? Last time I checked there was some breaking changes

kcsrk · on July 6, 2020

No breaking changes with the parallel minor collector (ParMinor). This is the design which will be upstreamed. See https://arxiv.org/pdf/2004.11663.pdf for details.

sigrlami · on July 6, 2020

Thanks! Looks very promising