A viable alternative is not to write the GPU code yourself: write a code generator in Scala that spits out GPU code in C. For details, see Claudio Rebbi's work, which uses Scala as a higher-level code generator for CUDA to solve the Dirac-Wilson equation on the lattice (http://wwwold.jlab.org/conferences/lattice2008/talks/poster/... ). In finance, we are actively looking at CUDA for derivative pricing problems in risk analytics. None of us wants to actually write GPU code in C, and we already have a considerable amount of risk analytics work being done in Scala, so a code generator might actually be the way to go.
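To make the idea concrete, here is a minimal sketch of what such a generator can look like (hypothetical names, not Rebbi's actual code): a small Scala program that emits a CUDA C kernel as text, which you then write to a .cu file and compile with nvcc like any hand-written kernel.

    object CudaGen {
      // The generated CUDA C source for a trivial saxpy kernel.
      def saxpyKernel: String =
        """__global__ void saxpy(int n, float a, const float *x, float *y) {
          |  int i = blockIdx.x * blockDim.x + threadIdx.x;
          |  if (i < n) y[i] = a * x[i] + y[i];
          |}""".stripMargin

      def main(args: Array[String]): Unit = {
        import java.nio.file.{Files, Paths}
        // Write the kernel out so nvcc can compile it like hand-written code.
        Files.write(Paths.get("saxpy.cu"), saxpyKernel.getBytes("UTF-8"))
      }
    }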
As an author of that paper, I can tell you that the code generator was rather simple and mainly used to perform loop unrolling, avoid explicit indexing, and replicate bits of code that couldn't quite be encapsulated in inline functions. It's possible to go further, but this sort of metaprogramming doesn't really eliminate the need to write in CUDA C.
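For illustration only (a hypothetical example, not code from the paper or from QUDA), the loop-unrolling and index-elimination style of generation looks roughly like this: instead of emitting a runtime loop over an index variable, the generator unrolls it and folds the indices into constants, so the emitted CUDA C has no explicit indexing left to get wrong.

    object UnrollGen {
      // Emit `components` accumulation statements with the component index
      // folded into constant offsets, instead of a runtime loop over an index.
      def accumulate(components: Int): String =
        (0 until components)
          .map(c => s"  out[$c] += coeff * in[$c];")
          .mkString("\n")

      def main(args: Array[String]): Unit =
        // e.g. a Wilson spinor has 4 spin x 3 color = 12 complex components
        println(accumulate(components = 12))
    }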
For what it's worth, we long ago abandoned Scala in favor of Python for the code generator, just to make it more accessible to others interested in working on the project (generally particle physicists by training): http://lattice.github.com/quda/
And who writes good GPU code generators if the libraries are poorly understood and/or closed source? Certainly, generators are the way to go for a lot of uses, but not all, and someone still needs to write the generators.
Over the last 5 years, I've seen a ton of hot air blown about auto-GPU code generation. The latest hot air is about how magical directives make everything run fast.
Truth is, compilers and code generators are crappy.
If you really want to get good performance, you either have to write your own low-level GPU kernels, or use a library of functions that have already been written at a low level.
All other hot air, while interesting, has yet to be proven at scale on more than a few limited use cases.
There are two parts to writing good GPU code: parallelizing the algorithm and writing the kernels. Automating one part will not save time on the other.
In my practical experience the compilers are pretty good nowadays, and the fine details of the kernel do not matter that much. The performance issues tend to center on local memory usage, bank conflicts, and how much work one kernel instance does; those require hand tuning, and it is exactly there that the compilers underperform. Thankfully, a poor kernel is 'just' a constant factor in the overall time complexity of the algorithm.
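To make that concrete, here is a hedged sketch (hypothetical names, again Scala emitting CUDA C as in the generator discussion above) of how those knobs, shared-memory tile size and how much work one kernel instance does, can at least be exposed as generation-time parameters, even though choosing good values still takes hand tuning.

    object TunableGen {
      // Emit a partial-sum kernel. Assumes it is launched with
      // blockDim.x == blockSize and that blockSize is a power of two.
      def reduceKernel(blockSize: Int, itemsPerThread: Int): String =
        s"""__global__ void partial_sum(const float *in, float *out, int n) {
           |  __shared__ float tile[$blockSize];
           |  int tid  = threadIdx.x;
           |  int base = blockIdx.x * $blockSize * $itemsPerThread + tid;
           |  float acc = 0.0f;
           |  for (int k = 0; k < $itemsPerThread; ++k) {   /* work per instance, fixed at generation time */
           |    int i = base + k * $blockSize;
           |    if (i < n) acc += in[i];
           |  }
           |  tile[tid] = acc;
           |  __syncthreads();
           |  for (int s = $blockSize / 2; s > 0; s >>= 1) { /* shared-memory tree reduction */
           |    if (tid < s) tile[tid] += tile[tid + s];
           |    __syncthreads();
           |  }
           |  if (tid == 0) out[blockIdx.x] = tile[0];
           |}""".stripMargin

      def main(args: Array[String]): Unit =
        println(reduceKernel(blockSize = 256, itemsPerThread = 4))
    }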
At a higher level, the most important thing is how the algorithm itself is described. If the algorithm is described as a serial one, there is no automated way (and most likely never will be a general way) of parallelizing it, short of running it to check the data dependencies, at which point you already have the result. And since the dependencies can change with the inputs, the result of one run cannot be generalized.
This could probably be proved by an argument similar to the halting problem: construct a program that calls the autoparallelizer on itself; if the parallelizer says there is no data dependency between two parts, the program makes them dependent, and if it says there is one, the program makes them independent.
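A rough sketch of that construction, with entirely hypothetical types: assume an oracle that statically decides whether two parts of a program are data-independent, then build a program that asks the oracle about itself and does the opposite of whatever it answers.

    object NoGeneralAutoParallelizer {
      type Program = String

      // Hypothetical oracle: true if partA and partB of p share no data
      // dependency. The point of the argument is that no total, always-correct
      // oracle of this kind can exist.
      def independent(p: Program, partA: String, partB: String): Boolean = ???

      // Adversarial program: whatever the oracle answers about its two parts,
      // it arranges the data flow so that the answer is wrong.
      def adversary(self: Program): Unit = {
        var shared = 0
        if (independent(self, "partA", "partB")) {
          shared = 1        // partA writes what partB reads: actually dependent
          println(shared)   // partB
        } else {
          val local = 1     // partA touches only its own data: actually independent
          println(local)    // partB
        }
      }
    }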
Thus let it be clear:
There is no way whatsoever to take the hard part (thinking in parallel) away. Nothing will take a bunch of serial code in and spit a parallel program out.
Are you confusing syntax and semantics? There's a hurdle you need to cross when writing CUDA code because it's C-like and easy to make "stupid mistakes" in; a code generator would help you there. But the harder part is getting the algorithm correct (and optimal, for a range of data sizes), and a generator is not much use there (except for polymorphism, where templating helps).
Or am I missing something? How do you see code generators helping you get algorithms right?