MPI is by far the dominant method for communicating among nodes, so learn that. MPI also works fine on a single multicore machine (launch one MPI rank per core), so you can run on your own machine; there's no need for the cloud or anything like that.
For using GPUs, there's CUDA, or offloading with OpenMP or OpenACC.
Indeed, OpenMP offload and OpenACC seem so close to each other (disclaimer: I haven't used either) that it would probably be better if the world converged on one of them.