How is CUDA-C that much easier than OpenCL? Having ported back and forth myself, the base C-like languages are virtually identical. Just sub "__syncthreads();" for "barrier(CL_MEM_FENCE)" and so on. To me the main problem is that Nvidia hobbles OpenCL on their GPUs by not updating their CL compiler to OpenCL 2.0, so some special features are missing, such as many atomics.