A viable alternative is not to write the GPU code yourself: write a code generator in Scala that spits out GPU code in C. For details, see Claudio Rebbi's work, which uses Scala as a higher-level code generator for CUDA to solve the Dirac-Wilson equation on the lattice (http://wwwold.jlab.org/conferences/lattice2008/talks/poster/... ). In finance, we are actively looking at CUDA for derivative pricing problems in risk analytics. None of us wants to actually write GPU code in C, and we already have a considerable amount of risk analytics work being done in Scala, so a code generator might actually be the way to go.
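To make the idea concrete, here is a minimal sketch of what such a generator can look like (hypothetical names, not Rebbi's actual code): a small Scala program that emits a CUDA C kernel as text, which you then write to a .cu file and compile with nvcc like any hand-written kernel.

    object CudaGen {
      // The generated CUDA C source for a trivial saxpy kernel.
      def saxpyKernel: String =
        """__global__ void saxpy(int n, float a, const float *x, float *y) {
          |  int i = blockIdx.x * blockDim.x + threadIdx.x;
          |  if (i < n) y[i] = a * x[i] + y[i];
          |}""".stripMargin

      def main(args: Array[String]): Unit = {
        import java.nio.file.{Files, Paths}
        // Write the kernel out so nvcc can compile it like hand-written code.
        Files.write(Paths.get("saxpy.cu"), saxpyKernel.getBytes("UTF-8"))
      }
    }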
As an author of that paper, I can tell you that the code generator was rather simple and mainly used to perform loop unrolling, avoid explicit indexing, and replicate bits of code that couldn't quite be encapsulated in inline functions. It's possible to go further, but this sort of metaprogramming doesn't really eliminate the need to write in CUDA C.
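For illustration only (a hypothetical example, not code from the paper or from QUDA), the loop-unrolling and index-elimination style of generation looks roughly like this: instead of emitting a runtime loop over an index variable, the generator unrolls it and folds the indices into constants, so the emitted CUDA C has no explicit indexing left to get wrong.

    object UnrollGen {
      // Emit `components` accumulation statements with the component index
      // folded into constant offsets, instead of a runtime loop over an index.
      def accumulate(components: Int): String =
        (0 until components)
          .map(c => s"  out[$c] += coeff * in[$c];")
          .mkString("\n")

      def main(args: Array[String]): Unit =
        // e.g. a Wilson spinor has 4 spin x 3 color = 12 complex components
        println(accumulate(components = 12))
    }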
For what it's worth, we long ago abandoned Scala in favor of Python for the code generator, just to make it more accessible to others interested in working on the project (generally particle physicists by training): http://lattice.github.com/quda/
And who writes good GPU code generators if the libraries are poorly understood and/or closed source? Certainly, generators are the way to go for a lot of uses, but not all, and someone still needs to write the generators.
Over the last 5 years, I've seen a ton of hot air blown about auto-GPU code generation. The latest hot air is about how magical directives make everything run fast.
Truth is, compilers and code generators are crappy.
If you really want to get good performance, you either have to write your own low-level GPU kernels, or use a library of functions that have already been written at a low level.
All other hot air, while interesting, has yet to be proven at scale on more than a few limited use cases.
There are two parts to writing good GPU code: parallelizing the algorithm and writing the kernels. Automating one part will not save time on the other.
In my practical experience the compilers are pretty good nowadays, and the fine details of the kernel do not matter that much. The performance issues tend to center on local memory usage, bank conflicts, and how much work one kernel instance does; those require hand tuning, and it is exactly there that the compilers underperform. Thankfully, a poor kernel is 'just' a constant factor in the overall time complexity of the algorithm.
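To make that concrete, here is a hedged sketch (hypothetical names, again Scala emitting CUDA C as in the generator discussion above) of how those knobs, shared-memory tile size and how much work one kernel instance does, can at least be exposed as generation-time parameters, even though choosing good values still takes hand tuning.

    object TunableGen {
      // Emit a partial-sum kernel. Assumes it is launched with
      // blockDim.x == blockSize and that blockSize is a power of two.
      def reduceKernel(blockSize: Int, itemsPerThread: Int): String =
        s"""__global__ void partial_sum(const float *in, float *out, int n) {
           |  __shared__ float tile[$blockSize];
           |  int tid  = threadIdx.x;
           |  int base = blockIdx.x * $blockSize * $itemsPerThread + tid;
           |  float acc = 0.0f;
           |  for (int k = 0; k < $itemsPerThread; ++k) {   /* work per instance, fixed at generation time */
           |    int i = base + k * $blockSize;
           |    if (i < n) acc += in[i];
           |  }
           |  tile[tid] = acc;
           |  __syncthreads();
           |  for (int s = $blockSize / 2; s > 0; s >>= 1) { /* shared-memory tree reduction */
           |    if (tid < s) tile[tid] += tile[tid + s];
           |    __syncthreads();
           |  }
           |  if (tid == 0) out[blockIdx.x] = tile[0];
           |}""".stripMargin

      def main(args: Array[String]): Unit =
        println(reduceKernel(blockSize = 256, itemsPerThread = 4))
    }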
At a higher level, the most important thing is how the algorithm itself is described. If the algorithm is described as a serial one, there is no automated way (and most likely never will be a general way) of parallelizing it, short of running it to check the data dependencies, at which point you already have the result. And since the dependencies can change with the inputs, the result of one run cannot be generalized.
This could probably be proved by an argument similar to the halting problem: construct a program that calls the autoparallelizer on itself; if the parallelizer says there is no data dependency between two parts, the program makes them dependent, and if it says there is one, the program makes them independent.
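A rough sketch of that construction, with entirely hypothetical types: assume an oracle that statically decides whether two parts of a program are data-independent, then build a program that asks the oracle about itself and does the opposite of whatever it answers.

    object NoGeneralAutoParallelizer {
      type Program = String

      // Hypothetical oracle: true if partA and partB of p share no data
      // dependency. The point of the argument is that no total, always-correct
      // oracle of this kind can exist.
      def independent(p: Program, partA: String, partB: String): Boolean = ???

      // Adversarial program: whatever the oracle answers about its two parts,
      // it arranges the data flow so that the answer is wrong.
      def adversary(self: Program): Unit = {
        var shared = 0
        if (independent(self, "partA", "partB")) {
          shared = 1        // partA writes what partB reads: actually dependent
          println(shared)   // partB
        } else {
          val local = 1     // partA touches only its own data: actually independent
          println(local)    // partB
        }
      }
    }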
Thus let it be clear:
There is no way whatsoever to take the hard part (thinking in parallel) away. Nothing will take a bunch of serial code in and spit a parallel program out.
Are you confusing syntax and semantics? There's a hurdle you need to cross when writing CUDA code because it's C-like and easy to make "stupid mistakes" in; a code generator would help you there. But the harder part is getting the algorithm correct (and optimal, for a range of data sizes), and a generator is not much use there (except for polymorphism, where templating helps).
Or am I missing something? How do you see code generators helping you get algorithms right?