
For those curious, this implementation is based on a recent line of research called "heartbeat scheduling" which amortizes the overheads of creating parallelism, essentially accomplishing a kind of dynamic automatic granularity control.

Related papers:

(2018) Heartbeat Scheduling: Provable Efficiency for Nested Parallelism. https://www.andrew.cmu.edu/user/mrainey/papers/heartbeat.pdf

(2021) Task Parallel Assembly Language for Uncompromising Parallelism. https://users.cs.northwestern.edu/~simonec/files/Research/pa...

(2024) Compiling Loop-Based Nested Parallelism for Irregular Workloads. https://users.cs.northwestern.edu/~simonec/files/Research/pa...

(2024) Automatic Parallelism Management. https://www.cs.cmu.edu/~swestric/24/popl24-par-manage.pdf


Oh, this is super interesting. I was only aware of the first two while writing Spice. I’ll definitely look into the last two as well. Thanks for sharing!


Nowadays 210 is actually parallel! You can run 210-style code using MaPLe (https://github.com/MPLLang/mpl) and get performance competitive with C/C++.

If you liked 210, you might also like https://futhark-lang.org/ which is an ML-family language that compiles to GPU with good performance.


Huh, the Maple name is already used by a well-known computer algebra project.

https://en.wikipedia.org/wiki/Maple_(software)


The two systems have very different tradeoffs.

A few things in particular:

  * Separate compilation vs whole-program compilation. OCaml uses separate compilation and therefore has a very constrained heap object model which makes it possible to get polymorphism across separately compiled modules. In contrast, MaPLe uses whole-program compilation and therefore is able to monomorphize and optimize much more aggressively. Whole-program compilation can be slow for large projects.

  * The multicore OCaml effort was driven by backwards compatibility, especially in terms of performance -- they wanted to ensure that the performance of existing sequential OCaml code would be completely unaffected by the new run-time system.  In contrast, MaPLe focuses on efficiency and scalability for parallel code.

  * Multicore OCaml will let you implement your own scheduler, as a library, on top of coarse-grained threads. In contrast, MaPLe comes with a built-in scheduler, and it's not easy to change it.

We did a comparison with multicore OCaml in https://dl.acm.org/doi/10.1145/3591284, and found that MaPLe can be significantly faster, but that comes with all of the tradeoffs above. And, it's a cross-language comparison, so take it with a grain of salt. Our comparison in particular emphasized similar source code, but typically, fast code in OCaml just looks different from fast code in MaPLe. For example, in OCaml, you often need to manually unbox certain data structures to get better memory efficiency (but MaPLe will often do this for you, automatically).

By the way, there's a recent effort---led by the compiler team at Jane Street---to make it possible to get automatic data unboxing in OCaml! Some more info here: https://www.janestreet.com/tech-talks/unboxed-types-for-ocam...


I'm one of the authors of this work -- I can explain a little.

"Provably efficient" means that the language provides worst-case performance guarantees.

For example in the "Automatic Parallelism Management" paper (https://dl.acm.org/doi/10.1145/3632880), we develop a compiler and run-time system that can execute extremely fine-grained parallel code without losing performance. (Concretely, imagine tiny tasks of around only 10-100 instructions each.)

The key idea is to make sure that any task which is *too tiny* is executed sequentially instead of in parallel. To make this happen, we use a scheduler that runs in the background during execution. It is the scheduler's job to decide on-the-fly which tasks should be sequentialized and which tasks should be "promoted" into actual threads that can run in parallel. Intuitively, each promotion incurs a cost, but also exposes parallelism.

In the paper, we present our scheduler and prove a worst-case performance bound. We specifically show that the total overhead of promotion will be at most a small constant factor (e.g., 1% overhead), and also that the theoretical amount of parallelism is unaffected, asymptotically.
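
To give a flavor of the amortization idea, here's a toy TypeScript sketch (entirely my own illustration, not MaPLe's actual runtime): create a real parallel task at most once per HEARTBEAT potential fork points, and sequentialize everything else. The real algorithm is heartbeat-driven and promotes the oldest outstanding fork, which is what preserves the asymptotic parallelism; this sketch only shows why the promotion overhead stays a small constant fraction of the total work.

    const HEARTBEAT = 100;      // larger => lower overhead, less parallelism exposed
    let callsSinceBeat = 0;
    let promotions = 0;

    // Called at every potential fork point. A real runtime would spawn or
    // make a task stealable when this "promotes"; here we only count.
    function fork<T>(left: () => T, right: () => T): [T, T] {
      callsSinceBeat++;
      if (callsSinceBeat >= HEARTBEAT) {
        callsSinceBeat = 0;
        promotions++;           // amortized: at most 1 promotion per HEARTBEAT forks
      }
      return [left(), right()]; // sequential stand-in; the result is the same either way
    }

    // Tiny tasks of only a few instructions each, as in the paper's setting.
    function fib(n: number): number {
      if (n < 2) return n;
      const [a, b] = fork(() => fib(n - 1), () => fib(n - 2));
      return a + b;
    }

    console.log(fib(27), "promotions:", promotions);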

All of this is implemented in MaPLe (https://github.com/mpllang/mpl) and you can go play with it now!


Enjoyed playing with this!

N-queens search is another nice recursive example. E.g. call this with nqueens(0, 5, [])

    function nqueens(i: number, n: number, queens: number[][]) {
      // Count the ways to place queens on rows i..n-1, given the queens
      // already placed on rows 0..i-1 as [row, column] pairs.
      if (i >= n) return 1;

      let count = 0;
      for (let j = 0; j < n; j++) {

        // A queen at (i, j) is threatened if it shares a row, column, or
        // diagonal with any previously placed queen.
        let is_threatened = false;
        for (let k = 0; k < queens.length; k++) {
          let x = queens[k][0];
          let y = queens[k][1];

          if (i == x || j == y || i - j == x - y || i + j == x + y) {
            is_threatened = true;
            break;
          }
        }

        if (is_threatened) continue;

        // Place a queen at (i, j) and recurse on the next row.
        count += nqueens(i + 1, n, [[i, j]].concat(queens));
      }

      return count;
    }


Thanks for showing the syntax to use! I computed the number of closed lambda terms of size 10 bits within 0 binders with nclosed(k=0, n=10):

    function nclosed(k:number, n:number) {
      if (n < 2) return 0;
      let sum = nclosed(k+1,n-2); // Lambda
      for (let i = 2; i < n-1; i++) {
        sum += nclosed(k,i) * nclosed(k,n-2-i); // Application
      }
      if (n-2 < k) sum += 1;      // Variable
      return sum;
    }


It's worth noting that you can solve these linear recurrences, `x(t) = a(t)x(t-1) + b(t)`, using a single parallel prefix sum where the elements are the input tuples `(a(t), b(t))`.

The following operator, called `combine`, works. It's associative over these tuples, and the answer x(t) is the second component of the t-th prefix (so the last prefix's second component is the final value). Altogether, this gives you O(n) work and O(log n) span, but using just a single parallel prefix kernel, which may be more efficient in practice.

    combine ((a1, b1), (a2, b2))  =  (a1 * a2, b1 * a2 + b2)

For example, here's a C++ implementation: https://github.com/MPLLang/parallel-ml-bench/blob/main/cpp/l... And, here's an implementation in a functional language: https://github.com/MPLLang/parallel-ml-bench/blob/main/mpl/b...
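
For concreteness, here's a minimal sequential TypeScript sketch of the same idea (the names and the handling of x(0) are mine); in a parallel setting you'd hand `combine` to a scan primitive instead of running this loop:

    type Affine = [number, number];            // (a, b) represents x -> a*x + b

    // Associative: combining (a1, b1) then (a2, b2) composes the two updates.
    function combine([a1, b1]: Affine, [a2, b2]: Affine): Affine {
      return [a1 * a2, b1 * a2 + b2];
    }

    // Inclusive scan with `combine`. The t-th prefix composes updates 1..t,
    // and its second component is x(t).
    function solveRecurrence(a: number[], b: number[], x0: number): number[] {
      const xs: number[] = [];
      let acc: Affine = [1, x0];               // identity update seeded with x(0)
      for (let t = 0; t < a.length; t++) {
        acc = combine(acc, [a[t], b[t]]);
        xs.push(acc[1]);
      }
      return xs;
    }

    // x(t) = 0.5*x(t-1) + 1 with x(0) = 0  ->  [1, 1.5, 1.75]
    console.log(solveRecurrence([0.5, 0.5, 0.5], [1, 1, 1], 0));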

I'm pretty sure this generalizes, too, to abstract multiplications and additions in any field (at first glance, seems like it should but I haven't done the formal proof yet).

Anyway, it would be interesting to compare this against the solution in the arxiv paper.

----

EDIT: ah, good, this is already being discussed here: https://github.com/glassroom/heinsen_sequence/issues/1

And, for reference, I learned this algorithm from Guy Blelloch; see Sec 1.4.1 of https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf


Yes! That's pretty much how I would have thought about this with my puny little brain.

But... I'm not sure 1 parallel scan with 2 floating-point mults and 1 sum per step is faster than 2 parallel scans with 1 sum per step. I don't know which is better in theory or in practice. And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?

Or am I missing something?


On modern multicore hardware this will be memory-bound; the amount of computation per byte is pretty small (just a few arithmetic instructions on average). My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.

And yes, definitely, the numerical accuracy thing could be a problem. I suspect it wouldn't be too difficult to work around, but I can't say for sure off the top of my head.


Memory pressure is even worse on GPUs. I did some work to generalize Blelloch to 2D parallel prefix sums for integral image computation back in 2008 [1], and the number of memory accesses really dominates. On a GPU, for sufficiently small problems the number of passes matters more, and it is worth using a simpler, non-work-efficient algorithm to reduce setup overheads.

[1] https://people.xiph.org/~tterribe/pubs/gpusurf.pdf Section III.A


Thank you, this is helpful... although my initial thought was this may be useful for a rather different type of application: These new AI models, "linear RNNs," that have many layers, each layer processing large batches of sequences in parallel, each sequence with potentially up to millions of tokens, each token with thousands of features. Definitely not small-scale. Hard to reason about, at least for me.


> My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.

Thank you, that's helpful. Your intuition may be right, but I'm not sure either. Too hard to reason about, at least for me. Maybe the thing to do is test both... unfortunately that would involve the hassle of writing code for executing the single scan efficiently on GPUs.


> And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?

I think so. And also potential issues with all the sums. :) (see: "compensated summation") god I love this shit. :)
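
For reference, here's compensated (Kahan) summation as a minimal TypeScript sketch; the running compensation recovers the low-order bits that a plain `sum += x` would drop:

    function kahanSum(xs: number[]): number {
      let sum = 0;
      let c = 0;               // running compensation for lost low-order bits
      for (const x of xs) {
        const y = x - c;
        const t = sum + y;     // big + small: low-order bits of y are lost here...
        c = (t - sum) - y;     // ...and recovered here, algebraically
        sum = t;
      }
      return sum;
    }

    // Adding 1000 copies of 1e-16 to 1: naive summation never moves off 1.
    const xs = [1, ...Array(1000).fill(1e-16)];
    console.log(xs.reduce((s, x) => s + x, 0), kahanSum(xs));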


This is very very well known. Cf https://en.wikipedia.org/wiki/Affine_group

I don't see why people should glorify this with the word "algorithm". It is a trivial undergrad homework exercise, once you give the hint "use parallel reduce / fold / prefixsum".

This may involve more interesting tradeoffs if you deal with large or sparse matrices or matrix-free operators.


If you know more than others do, that's great, but instead of posting putdowns, please share some of what you know so the rest of us can learn.

The trouble with comments like this is that they degrade discussion and put others down without really teaching us anything.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...

https://news.ycombinator.com/newsguidelines.html


Tons and tons of parallel algorithms use prefix sums. The most common use is to compute a collection of offsets in parallel. Some examples:

- compact a hash table (i.e., remove the empty slots)

- flatten a jagged 2D array (sketched below)

- rewrite a dense matrix in compressed-sparse-row (CSR) format
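
As a concrete sketch of the offset pattern (the function names here are mine), flattening a jagged array uses an exclusive prefix sum over the row lengths to find where each row starts in the flat output; every row can then be copied independently, in parallel:

    function flattenJagged(rows: number[][]): number[] {
      // Exclusive prefix sum of row lengths = starting offset of each row.
      // (Written as a loop here; a parallel implementation uses a scan.)
      const offsets: number[] = new Array(rows.length);
      let total = 0;
      for (let i = 0; i < rows.length; i++) {
        offsets[i] = total;
        total += rows[i].length;
      }

      const flat = new Array<number>(total);
      for (let i = 0; i < rows.length; i++) {   // independent rows: a parallel-for
        for (let j = 0; j < rows[i].length; j++) {
          flat[offsets[i] + j] = rows[i][j];
        }
      }
      return flat;
    }

    console.log(flattenJagged([[1, 2], [], [3], [4, 5, 6]]));  // [1, 2, 3, 4, 5, 6]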


Parallelism is only about performance, that's it. If you need something to go faster, parallelism is an option.

Looking into the future, parallelism is one of the only remaining techniques for scaling classical computing. Processors have stopped getting faster. Instead, they're getting fatter (more cores).


It's a doctoral degree with a heavy emphasis on pedagogy and teaching. From https://www.cmu.edu/math/grad/phd/index.html:

> The Doctor of Arts degree shares all requirements and standards with the Ph.D., except with regard to the thesis. The D.A. thesis is not expected to display the sort of original research required for a Ph.D. thesis, but rather to demonstrate an ability to organize, understand, and present mathematical ideas in a scholarly way, usually with sufficient innovation and worth to produce a publishable work. Whenever practical, the department provides D.A. candidates with the opportunity to use materials developed to teach a course. While a typical Ph.D. recipient will seek a position that has a substantial research component, the D.A. recipient will usually seek a position where research is not central.


More and more people nowadays are programming at high levels of abstraction. If you're designing the frontend of a website, or making a mobile game, or developing a stock trading algorithm, or whatever else, then you probably don't want to constantly worry about details of memory management...

On top of this, GC is necessary for some algorithms. Any data structure with partial sharing (e.g. binary search tree with versioning via path-copying) needs GC to be space-efficient. You could either rely on a built-in GC, or write your own. If you write your own, I think you'll find that it is tedious and error-prone due to memory safety issues.
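
To make the sharing concrete, here's a minimal path-copying insert for an (unbalanced) binary search tree in TypeScript, purely as an illustration: each insert copies only the nodes on the search path and shares everything else with the previous version, so with many live versions, nodes become unreachable at hard-to-predict times, which is exactly the situation a GC handles for you.

    type Tree = null | { key: number; left: Tree; right: Tree };

    // Returns a new version of the tree; the old version stays valid and
    // shares all nodes that are not on the insertion path.
    function insert(t: Tree, key: number): Tree {
      if (t === null) return { key, left: null, right: null };
      if (key < t.key) return { key: t.key, left: insert(t.left, key), right: t.right };
      if (key > t.key) return { key: t.key, left: t.left, right: insert(t.right, key) };
      return t;                      // key already present: share the whole subtree
    }

    const v1 = insert(insert(insert(null, 5), 2), 8);
    const v2 = insert(v1, 3);        // v1 is untouched; v2 shares v1's right subtree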


We could already have been doing that in the 1980s, had it not been for UNIX getting out of Bell Labs with source tapes, and bad management decisions.

https://interlisp.org/

http://toastytech.com/guis/cedar.html

"Eric Bier Demonstrates Cedar"

https://www.youtube.com/watch?v=z_dt7NG38V4&t=2s

"Making Smalltalk"

https://www.youtube.com/watch?v=PaOMiNku1_M

http://www.edm2.com/index.php/VisualAge_Smalltalk

It's also worth noting that most BASIC implementations had support for automatic memory management, at least for strings and arrays; the structured, compiled dialects did even better.

There was also database programming with languages like Clipper and FoxPro.

And even if we stay in the UNIX world, that is exactly what using stuff like Perl allowed for: C-like programming without the headaches of manual memory management.

Or the brief fad of 4GL languages.


> then you probably don't want to constantly worry about details of memory management...

Oh, you actually don't have to, that's the whole point... in the past, you (effectively) had a choice between careful manual memory management and garbage collection with its overheads. These days, you can use constructs which take care of that management for you, with very little or no overhead, and without involving a garbage collector.

It's true that sometimes GC is algorithmically necessary, but then the high-level-of-abstraction argument is irrelevant. And in those cases, a GC library does indeed come in useful. You don't need to write your own; others have likely done it already.


