On modern multicore hardware this will be memory-bound; the amount of computation per byte is pretty small (just a few arithmetic instructions on average). My intuition is that the single scan will be faster because it incurs far fewer cache misses.
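To make the memory-traffic point concrete, here's a rough sketch (my own illustration; the chunk size and the particular chunked strategy are just placeholders, and it assumes the length divides evenly into chunks) of the two approaches in JAX. The single scan streams the data once, while the chunked version makes roughly two full read/write passes plus a small pass over the per-chunk totals:

```python
import jax
import jax.numpy as jnp

def single_pass_scan(x):
    # One streaming pass: each input is read once and each prefix written once.
    return jnp.cumsum(x)

def chunked_scan(x, chunk=4096):
    # Pass 1: scan within each chunk (reads and writes the whole array).
    xc = x.reshape(-1, chunk)          # assumes len(x) is a multiple of chunk
    local = jnp.cumsum(xc, axis=1)
    # Small pass: exclusive scan over the per-chunk totals.
    offsets = jnp.concatenate([jnp.zeros((1,), x.dtype),
                               jnp.cumsum(local[:, -1])[:-1]])
    # Pass 2: add each chunk's offset back in (whole array touched again).
    return (local + offsets[:, None]).reshape(-1)

x = jax.random.normal(jax.random.PRNGKey(0), (1 << 20,), dtype=jnp.float32)
print(jnp.max(jnp.abs(single_pass_scan(x) - chunked_scan(x))))  # tiny fp difference
```

Once the array no longer fits in cache, that extra full pass over memory is the difference that matters.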
And yes, definitely, the numerical accuracy thing could be a problem. I suspect it wouldn't be too difficult to work around, but I can't say for sure off the top of my head.
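One quick way to see how much of a problem it actually is (again just a sketch of mine, since the thread doesn't pin down where the accuracy issue comes from) is to compare the float32 scan against a float64 reference and look at the worst-case drift:

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # needed only for the float64 reference

x = jax.random.normal(jax.random.PRNGKey(1), (1 << 22,), dtype=jnp.float32)
ref = jnp.cumsum(x.astype(jnp.float64))    # high-precision reference
f32 = jnp.cumsum(x)                        # the scan as it would actually run

abs_err = jnp.max(jnp.abs(f32 - ref))
rel_err = jnp.max(jnp.abs(f32 - ref) / (jnp.abs(ref) + 1e-30))
print(abs_err, rel_err)
```

If the drift turns out to matter, accumulating in a wider type or compensated (Kahan-style) summation are the usual kinds of workaround to try first.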
Memory pressure is even worse on GPUs. I did some work generalizing Blelloch's scan to 2D parallel prefix sums for integral image computation back in 2008 [1], and the number of memory accesses really does dominate. On a GPU, for sufficiently small problems the number of passes matters more, and it is worth using a simpler, non-work-efficient algorithm to reduce setup overheads.
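For concreteness, the separable formulation of an integral image and one textbook non-work-efficient scan look roughly like this (a generic sketch, not the implementation from [1]; Hillis–Steele is just one example of the "simpler, non-work-efficient" family):

```python
import jax
import jax.numpy as jnp

def integral_image(img):
    # Separable 2D prefix sum: scan the rows, then scan the columns. Two full
    # passes over the image, so memory traffic dominates the cost.
    return jnp.cumsum(jnp.cumsum(img, axis=1), axis=0)

def hillis_steele_scan(x):
    # Non-work-efficient inclusive scan: log2(n) passes and O(n log n) additions,
    # but almost no bookkeeping per pass, which is the trade-off for small inputs.
    n = x.shape[0]
    offset = 1
    while offset < n:
        x = x.at[offset:].add(x[:-offset])
        offset *= 2
    return x

img = jax.random.uniform(jax.random.PRNGKey(0), (1080, 1920))
print(integral_image(img).shape)
print(hillis_steele_scan(jnp.arange(16.0)))  # 0, 1, 3, 6, 10, ...
```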
Thank you, this is helpful... although my initial thought was that this might be useful for a rather different type of application: the new AI models known as "linear RNNs," which have many layers, each layer processing large batches of sequences in parallel, each sequence with potentially millions of tokens, and each token with thousands of features. Definitely not small-scale. Hard to reason about, at least for me.
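For context on how that use case maps onto a scan (my own sketch of the standard trick, with made-up shapes and decay factors, not anything from this thread): the recurrence h[t] = a[t] * h[t-1] + b[t] is associative in the (a, b) pairs, so it can be computed with an associative scan along the time axis while staying fully parallel over batch and features:

```python
import jax
import jax.numpy as jnp

def combine(left, right):
    # Composition of the affine maps h -> a*h + b; this operator is associative,
    # which is what lets the recurrence run as a parallel scan.
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def linear_rnn(a, b):
    # a, b: (batch, time, features). The scan runs along the time axis; batch and
    # feature dimensions are handled by plain elementwise broadcasting.
    _, h = jax.lax.associative_scan(combine, (a, b), axis=1)
    return h

key = jax.random.PRNGKey(0)
a = jax.random.uniform(key, (8, 1024, 256), minval=0.9, maxval=1.0)  # decay factors
b = jax.random.normal(key, (8, 1024, 256))                           # inputs
h = jax.jit(linear_rnn)(a, b)
print(h.shape)  # (8, 1024, 256); satisfies h[:, t] = a[:, t] * h[:, t-1] + b[:, t], with zero state before t=0
```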
> My intuition is that the single scan will be faster because it incurs far fewer cache misses.
Thank you, that's helpful. Your intuition may be right, but I'm not sure either. Too hard to reason about, at least for me. Maybe the thing to do is to test both... unfortunately that would involve the hassle of writing code to execute the single scan efficiently on GPUs.
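One way to "test both" without hand-writing kernels (my suggestion, with the caveat that jax.lax.scan is only a rough proxy for a properly tuned single-pass kernel) is to time the multi-pass associative scan against a sequential-in-time scan that stays elementwise-parallel over batch and features:

```python
import time
import jax
import jax.numpy as jnp

def combine(left, right):
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

@jax.jit
def multi_pass(a, b):
    # Work-efficient parallel scan over the time axis (several passes over memory).
    return jax.lax.associative_scan(combine, (a, b), axis=1)[1]

@jax.jit
def single_pass(a, b):
    # Sequential over time, elementwise-parallel over (batch, features), so each
    # array is streamed once; a rough stand-in for a hand-written single pass.
    def step(h, ab):
        a_t, b_t = ab
        return a_t * h + b_t, a_t * h + b_t
    h0 = jnp.zeros((a.shape[0],) + a.shape[2:], a.dtype)
    _, hs = jax.lax.scan(step, h0, (jnp.swapaxes(a, 0, 1), jnp.swapaxes(b, 0, 1)))
    return jnp.swapaxes(hs, 0, 1)

def bench(fn, *args, iters=10):
    fn(*args).block_until_ready()                 # compile and warm up
    t0 = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
    out.block_until_ready()
    return (time.perf_counter() - t0) / iters

key = jax.random.PRNGKey(0)
a = jax.random.uniform(key, (4, 16384, 512), minval=0.9, maxval=1.0)
b = jax.random.normal(key, (4, 16384, 512))
print("associative_scan:", bench(multi_pass, a, b))
print("lax.scan        :", bench(single_pass, a, b))
print("max |diff|      :", jnp.max(jnp.abs(multi_pass(a, b) - single_pass(a, b))))
```

The shapes above are made up; the interesting thing to sweep is sequence length and feature count, since that decides whether the working set fits in cache or on-chip memory.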