CUDA-streams probably already use io_uring under the hood.
The issue is that a CUDA-stream-based execution model will spend lots of time pulling work from the queue and spinning up threads for each launch (sound like a familiar problem?).
Instead, you sidestep this issue with persistent kernels: you launch exactly as many thread blocks as the GPU can keep resident, and then pass data to those blocks through an alternative means.
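Roughly what I mean, as a minimal sketch: one resident block, one work slot, and made-up names (Mailbox, work_flag) -- this isn't a CUDA API, just the shape of the idea.

    // Sketch of a persistent kernel fed through mapped pinned host memory.
    #include <atomic>
    #include <cstdio>
    #include <cuda_runtime.h>

    struct Mailbox {                 // lives in mapped pinned host memory
        volatile int work_flag;      // 0 = empty, 1 = work ready, -1 = shut down
        volatile int done_flag;      // GPU sets this to 1 when the batch is finished
        float payload[256];
    };

    __global__ void persistent_kernel(Mailbox* mb) {
        // Launched once; the block stays resident and polls for work instead of
        // being re-launched for every batch.
        for (;;) {
            if (threadIdx.x == 0)
                while (mb->work_flag == 0) { /* spin on mapped host memory */ }
            __syncthreads();
            if (mb->work_flag == -1) return;      // shutdown requested

            mb->payload[threadIdx.x] *= 2.0f;     // "process" the batch

            __syncthreads();
            if (threadIdx.x == 0) {
                mb->work_flag = 0;
                __threadfence_system();           // make the writes visible to the CPU
                mb->done_flag = 1;
            }
        }
    }

    int main() {
        Mailbox* mb;
        cudaHostAlloc(&mb, sizeof(Mailbox), cudaHostAllocMapped);
        mb->work_flag = 0;
        mb->done_flag = 0;

        // Under UVA the same pointer is valid on the device for mapped memory.
        persistent_kernel<<<1, 256>>>(mb);        // launch once, reuse for every batch

        for (int batch = 0; batch < 4; ++batch) {
            for (int i = 0; i < 256; ++i) mb->payload[i] = float(i);
            mb->done_flag = 0;
            std::atomic_thread_fence(std::memory_order_seq_cst);
            mb->work_flag = 1;                    // hand the batch to the GPU
            while (mb->done_flag == 0) { }        // the CPU spin discussed in the EDIT below
            printf("batch %d done, payload[1] = %f\n", batch, mb->payload[1]);
        }
        mb->work_flag = -1;                       // ask the resident kernel to exit
        cudaDeviceSynchronize();
        cudaFreeHost(mb);
        return 0;
    }

The CPU-side busy-wait is exactly the trade-off the EDIT below talks about: the CPU burns a core spinning, but the GPU never goes idle between batches.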
-------
The future of I/O is going to be atomic operations across PCIe with stronger and stronger coherency models. PCIe 3.0 introduced PCIe atomic operations. PCIe 4.0 strengthened them. PCIe 5.0 is rumored to have even stronger coherency rules, and "coherent grids" are being seriously discussed / implemented in high-speed computers (see Compute eXpress Link, or CXL).
Any serious I/O (and GPUs will be part of that I/O future) will likely use PCIe atomics in some regard, quite similar to the atomic compare-and-swaps that the CPU already does between its cores.
I/O is becoming indistinguishable from memcpy + atomics.
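You can already program in this model today through libcu++'s system-scope atomics: cuda::atomic with cuda::thread_scope_system lets the CPU and GPU synchronize through plain loads/stores on pinned memory. A minimal sketch follows; the flag/data layout is made up, and it assumes a GPU/platform that actually supports system-scope atomics across the bus (roughly Pascal or newer, with PCIe-atomics or NVLink support).

    // Sketch of CPU<->GPU signalling with libcu++ system-scope atomics.
    #include <cstdio>
    #include <new>
    #include <cuda/atomic>
    #include <cuda_runtime.h>

    using sys_atomic = cuda::atomic<int, cuda::thread_scope_system>;

    __global__ void consumer(sys_atomic* flag, int* data) {
        // Wait for the CPU's "I/O complete" signal, then read the payload.
        while (flag->load(cuda::std::memory_order_acquire) == 0) { }
        printf("GPU saw data = %d\n", *data);
        flag->store(2, cuda::std::memory_order_release);   // ack back to the CPU
    }

    int main() {
        sys_atomic* flag;
        int* data;
        cudaHostAlloc(&flag, sizeof(sys_atomic), cudaHostAllocMapped);
        cudaHostAlloc(&data, sizeof(int), cudaHostAllocMapped);
        new (flag) sys_atomic(0);       // pinned memory, visible to both sides

        consumer<<<1, 1>>>(flag, data); // GPU starts spinning on the flag

        *data = 42;                                        // the "memcpy" half
        flag->store(1, cuda::std::memory_order_release);   // the "atomics" half
        while (flag->load(cuda::std::memory_order_acquire) != 2) { }  // wait for the ack

        cudaDeviceSynchronize();
        cudaFreeHost(flag);
        cudaFreeHost(data);
        return 0;
    }

Stronger PCIe/CXL coherency mostly means those loads and stores get cheaper and the scope restrictions get less exotic.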
-------
EDIT: In effect, I'm saying that GPU-programmers are more concerned about GPU-efficiency than CPU-efficiency. Sure, the CPU is less efficient spinning on these memory-reads/writes. But the GPU is where the bulk of the calculations are happening. Therefore, it behooves the architect to optimize for the GPU (even if the CPU ends up slightly less efficient).
Whatever argument you have about "queues being more efficient in a hypothetical architecture", the GPU has more calculations to do and more cores to feed (i.e. far more overhead when it comes to Amdahl's law: burning one CPU core on a spin-loop wastes a small slice of the CPU, while letting the GPU starve idles thousands of cores). That means you want to apply those principles to the GPU first, and let the CPU "make up for it", so to speak.
I was under the impression that PCI-E was perfectly capable of sending notifications from one device to another in a somewhat efficient manner. Having said that, this is not my area of expertise - and I do see that if your main concern is to feed the GPU, then blocking a thread might be the optimal solution. I assume that MSI would be too much overhead and might involve some context switching to service the interrupt in the kernel, etc., to allow for asynchronous completion? Also, is it possible to have overlapping memory regions between a high-speed networking card and the GPU's input buffer, which in effect means the CPU just has to tell the GPU to start reading once the network card is done receiving?
Having said that, I don't believe that for most application developers this is a major concern - in cases where you flood the GPU with a firehose of data to compute on, you probably also don't care about what other processes run on the machine, or whether your architectural decisions end up making people's laps uncomfortably hot. I also don't believe that the future of all I/O is just memcpy and atomics - we can already do that today, and it doesn't really bring any speed advantage in the general case. I think the future of I/O is memcpy, atomics, and a good signaling mechanism to flag I/O completion without costly context switches and with as little extraneous memory allocation as possible. Moreover, the future of consumer computing will probably not rely on PCI-E at all and will instead have the GPU and the CPU share all of their memory. And hey, maybe Nvidia will add some cool ARM companion cores to their biggest chips, slap some DDR5 slots on their cards, and sell self-contained solutions, sidestepping PCI-E entirely, at least for feeding data from the CPU to the GPU.
Recent datacenter network controllers (Mellanox, Marvell) have 'GPUDirect' capabilities, i.e. direct interaction between devices with no CPU involvement. I've also seen FPGA+network boards do that with success. And with libraries like NCCL and 200GbE links, you could almost forget you have CPUs or network links in between.
What I miss is a simple but efficient data queue between CPU and GPU. Everyone's doing manual memory reservation and cudaMemcpy; I want an async send (GPU->CPU) with an MPI- or socket-like interface. I've seen someone posting about io_uring from GPU code, but just bragging, no code.
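Something like the following is what I have in mind: a single-producer (GPU) / single-consumer (CPU) ring in pinned memory. Nothing standard exists as far as I know; the names (GpuToCpuRing, gpu_send, cpu_recv) are made up, and it assumes system-scope atomics work across the bus on your platform.

    // Sketch of an async GPU->CPU send queue in pinned memory.
    #include <cstdio>
    #include <new>
    #include <cuda/atomic>
    #include <cuda_runtime.h>

    constexpr unsigned CAP = 1024;    // slots; power of two

    struct GpuToCpuRing {
        cuda::atomic<unsigned, cuda::thread_scope_system> head;  // written by the GPU
        cuda::atomic<unsigned, cuda::thread_scope_system> tail;  // written by the CPU
        int slots[CAP];
    };

    __device__ bool gpu_send(GpuToCpuRing* q, int msg) {          // async send, GPU side
        unsigned h = q->head.load(cuda::std::memory_order_relaxed);
        if (h - q->tail.load(cuda::std::memory_order_acquire) == CAP) return false;  // full
        q->slots[h % CAP] = msg;
        q->head.store(h + 1, cuda::std::memory_order_release);    // publish to the CPU
        return true;
    }

    bool cpu_recv(GpuToCpuRing* q, int* msg) {                    // poll, CPU side
        unsigned t = q->tail.load(cuda::std::memory_order_relaxed);
        if (t == q->head.load(cuda::std::memory_order_acquire)) return false;        // empty
        *msg = q->slots[t % CAP];
        q->tail.store(t + 1, cuda::std::memory_order_release);
        return true;
    }

    __global__ void producer(GpuToCpuRing* q) {
        for (int i = 0; i < 16; ++i)
            while (!gpu_send(q, i * i)) { }   // retry if the CPU falls behind
    }

    int main() {
        GpuToCpuRing* q;
        cudaHostAlloc(&q, sizeof(GpuToCpuRing), cudaHostAllocMapped);
        new (q) GpuToCpuRing;
        q->head.store(0);
        q->tail.store(0);

        producer<<<1, 1>>>(q);                // one GPU thread "sends" asynchronously
        for (int received = 0, msg; received < 16; )
            if (cpu_recv(q, &msg)) { printf("got %d\n", msg); ++received; }
        cudaDeviceSynchronize();
        cudaFreeHost(q);
        return 0;
    }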
Having bought Mellanox, with their BlueField DPUs (8 or 16 ARM cores integrated into the NIC), I feel NVIDIA could probably go the way you're describing. I haven't seen any Mellanox/NVIDIA tech convergence yet, though.