I’m not saying that you cannot write vector code, but that it’s typically a special case. CUDA’s APIs and annotations are bolted onto existing languages rather than being part of a language in which vector operations are natural, first-class operations.
Neither C nor Java has any concept of `a + b` being a vector operation the way a language like, say, APL does. You can come closer in C++, but in the end the memory model of C and C++ hobbles you. FORTRAN is better in this regard.
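For what it’s worth, `std::valarray` is one standard-library way to get close in C++: its arithmetic operators are overloaded elementwise, so `a + b` really is a vector add at the source level (how well it compiles down to SIMD is another matter). A minimal sketch:

```cpp
#include <cstdio>
#include <valarray>

int main() {
    // std::valarray overloads arithmetic operators elementwise,
    // so a + b reads like the APL-style vector notation above.
    std::valarray<double> a = {1.0, 2.0, 3.0, 4.0};
    std::valarray<double> b = {10.0, 20.0, 30.0, 40.0};

    std::valarray<double> c = a + b;   // elementwise add, no explicit loop

    for (double x : c) std::printf("%g ", x);
    std::printf("\n");                 // prints: 11 22 33 44
    return 0;
}
```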
It is always possible to drop down to inline assembler in C and to present vector operators as functions in a library, as in the sketch below.
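A sketch of that library-function approach, using compiler intrinsics as the usual, more portable stand-in for raw inline assembler (the `vec_add` name is just illustrative):

```cpp
#include <cstddef>
#include <immintrin.h>  // x86 AVX intrinsics; compile with -mavx

// Library-style vector operator: elementwise float add.
// Intrinsics map nearly one-to-one onto the SIMD instructions you
// would otherwise write by hand in inline assembler.
void vec_add(float *dst, const float *a, const float *b, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {                  // 8 floats per 256-bit register
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)                            // scalar tail for the remainder
        dst[i] = a[i] + b[i];
}
```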
Alternatively, R does treat vectors as first class, so another language that performs well might be a better choice. Julia comes to mind, but I have little familiarity with it.
With Java, calling into native code via JNI would be an (ugly) option.
When the data is generated on the CPU, shoveling it to the GPU for a single vector operation (or a few) and then shoveling it back to continue on the CPU is most likely going to cost more than the time saved. The round trip is the problem: a PCIe 4.0 x16 link tops out around 32 GB/s, while the GPU’s own memory bandwidth is typically an order of magnitude higher, so a memory-bound operation like a vector add spends far more time in transit than in compute.
No - a CUDA program consists of parts that run on the CPU as well as on the GPU, but the CPU (aka host) code is just orchestrating the process - allocating memory, copying data to/from the GPU, and queuing CUDA kernels to run on the GPU. All the work (i.e. running kernels) is done on the GPU.
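A minimal sketch of that split, assuming a standard CUDA toolkit setup: everything the host does below is bookkeeping (allocate, copy across PCIe, launch), and the only arithmetic is in the kernel:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: runs on the GPU, one thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host code: pure orchestration, as described above.
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = float(i); h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);                              // allocate device memory
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // push data over PCIe
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);        // queue the kernel

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // pull the result back
    std::printf("c[123] = %g\n", h_c[123]);               // expect 369

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```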
There are other programming models (e.g. OpenMP target offload, Intel's oneAPI) and languages (e.g. SYCL) that do let the same code run on either CPU or GPU; see the sketch below.
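For example, a minimal OpenMP target-offload sketch: the same loop runs on the host by default and moves to the GPU when compiled with offload support (e.g. clang++ with `-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda`):

```cpp
#include <cstdio>

// The target region moves to the GPU when offload is enabled;
// the map clauses describe the data movement, and the same source
// falls back to running on the host CPU otherwise.
void vec_add(float *c, const float *a, const float *b, int n) {
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main() {
    constexpr int n = 1000;
    static float a[n], b[n], c[n];
    for (int i = 0; i < n; ++i) { a[i] = float(i); b[i] = 2.0f * i; }
    vec_add(c, a, b, n);
    std::printf("c[10] = %g\n", c[10]);   // expect 30
    return 0;
}
```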
When you use a GPU, you are using a different processor with a different ISA, running its own bare-bones OS, with which you communicate mostly by pushing large blocks of memory across the PCIe bus. It’s a very different feel from, say, adding AVX-512 instructions to your program flow.