CUDA-streams probably already use io_uring under the hood.
The issue is that a CUDA-stream-based execution model will spend lots of time pulling work from the queue and spinning up threads for each launch (sound like a familiar problem?).
Instead, you sidestep this issue with persistent kernels: you launch exactly as many thread blocks as the GPU can keep resident, and then pass data to those blocks through an alternative means.
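Roughly what I mean, as a minimal sketch: one resident block, one work slot, and made-up names (Mailbox, work_flag) -- this isn't a CUDA API, just the shape of the idea.

    // Sketch of a persistent kernel fed through mapped pinned host memory.
    #include <atomic>
    #include <cstdio>
    #include <cuda_runtime.h>

    struct Mailbox {                 // lives in mapped pinned host memory
        volatile int work_flag;      // 0 = empty, 1 = work ready, -1 = shut down
        volatile int done_flag;      // GPU sets this to 1 when the batch is finished
        float payload[256];
    };

    __global__ void persistent_kernel(Mailbox* mb) {
        // Launched once; the block stays resident and polls for work instead of
        // being re-launched for every batch.
        for (;;) {
            if (threadIdx.x == 0)
                while (mb->work_flag == 0) { /* spin on mapped host memory */ }
            __syncthreads();
            if (mb->work_flag == -1) return;      // shutdown requested

            mb->payload[threadIdx.x] *= 2.0f;     // "process" the batch

            __syncthreads();
            if (threadIdx.x == 0) {
                mb->work_flag = 0;
                __threadfence_system();           // make the writes visible to the CPU
                mb->done_flag = 1;
            }
        }
    }

    int main() {
        Mailbox* mb;
        cudaHostAlloc(&mb, sizeof(Mailbox), cudaHostAllocMapped);
        mb->work_flag = 0;
        mb->done_flag = 0;

        // Under UVA the same pointer is valid on the device for mapped memory.
        persistent_kernel<<<1, 256>>>(mb);        // launch once, reuse for every batch

        for (int batch = 0; batch < 4; ++batch) {
            for (int i = 0; i < 256; ++i) mb->payload[i] = float(i);
            mb->done_flag = 0;
            std::atomic_thread_fence(std::memory_order_seq_cst);
            mb->work_flag = 1;                    // hand the batch to the GPU
            while (mb->done_flag == 0) { }        // the CPU spin discussed in the EDIT below
            printf("batch %d done, payload[1] = %f\n", batch, mb->payload[1]);
        }
        mb->work_flag = -1;                       // ask the resident kernel to exit
        cudaDeviceSynchronize();
        cudaFreeHost(mb);
        return 0;
    }

The CPU-side busy-wait is exactly the trade-off the EDIT below talks about: the CPU burns a core spinning, but the GPU never goes idle between batches.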
-------
The future of I/O is going to be atomic operations across PCIe with stronger and stronger coherency models. PCIe 3.0 introduced PCIe atomic operations. PCIe 4.0 strengthened them. PCIe 5.0 is rumored to have even stronger coherency rules, and "coherent grids" are being seriously discussed / implemented in high-speed computers (see Compute eXpress Link, or CXL).
Any serious I/O (and GPUs will be part of that I/O future) will likely use PCIe atomics in some regard, quite similar to the atomic compare-and-swaps that the CPU already does between its cores.
I/O is becoming indistinguishable from memcpy + atomics.
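You can already program in this model today through libcu++'s system-scope atomics: cuda::atomic with cuda::thread_scope_system lets the CPU and GPU synchronize through plain loads/stores on pinned memory. A minimal sketch follows; the flag/data layout is made up, and it assumes a GPU/platform that actually supports system-scope atomics across the bus (roughly Pascal or newer, with PCIe-atomics or NVLink support).

    // Sketch of CPU<->GPU signalling with libcu++ system-scope atomics.
    #include <cstdio>
    #include <new>
    #include <cuda/atomic>
    #include <cuda_runtime.h>

    using sys_atomic = cuda::atomic<int, cuda::thread_scope_system>;

    __global__ void consumer(sys_atomic* flag, int* data) {
        // Wait for the CPU's "I/O complete" signal, then read the payload.
        while (flag->load(cuda::std::memory_order_acquire) == 0) { }
        printf("GPU saw data = %d\n", *data);
        flag->store(2, cuda::std::memory_order_release);   // ack back to the CPU
    }

    int main() {
        sys_atomic* flag;
        int* data;
        cudaHostAlloc(&flag, sizeof(sys_atomic), cudaHostAllocMapped);
        cudaHostAlloc(&data, sizeof(int), cudaHostAllocMapped);
        new (flag) sys_atomic(0);       // pinned memory, visible to both sides

        consumer<<<1, 1>>>(flag, data); // GPU starts spinning on the flag

        *data = 42;                                        // the "memcpy" half
        flag->store(1, cuda::std::memory_order_release);   // the "atomics" half
        while (flag->load(cuda::std::memory_order_acquire) != 2) { }  // wait for the ack

        cudaDeviceSynchronize();
        cudaFreeHost(flag);
        cudaFreeHost(data);
        return 0;
    }

Stronger PCIe/CXL coherency mostly means those loads and stores get cheaper and the scope restrictions get less exotic.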
-------
EDIT: In effect, I'm saying that GPU-programmers are more concerned about GPU-efficiency than CPU-efficiency. Sure, the CPU is less efficient spinning on these memory-reads/writes. But the GPU is where the bulk of the calculations are happening. Therefore, it behooves the architect to optimize for the GPU (even if the CPU ends up slightly less efficient).
Whatever argument you have about "queues being more efficient in a hypothetical architecture", the GPU has more calculations to do and more cores to feed (i.e. far more overhead when it comes to Amdahl's law: burning one CPU core on a spin-loop wastes a small slice of the CPU, while letting the GPU starve idles thousands of cores). That means you want to apply those principles to the GPU first, and let the CPU "make up for it", so to speak.
I was under the impression that PCI-E was perfectly capable of sending notifications from one device to another in a somewhat efficient manner. Having said that, this is not my area of expertise - and I do see that if your main concern is to feed the GPU, then blocking a thread might be the optimal solution. I assume that MSI would be too much overhead and might involve some context switching to service the interrupt in the kernel, etc., to allow for asynchronous completion? Also, is it possible to have overlapping memory regions between a high-speed networking card and the GPU's input buffer, which in effect means the CPU just has to tell the GPU to start reading once the network card is done receiving?
Having said that, I don't believe that for most application developers this is a major concern - in cases where you flood the GPU with a firehose of data to compute on, you probably also don't care about what other processes run on the machine, or whether your architectural decisions end up making people's laps uncomfortably hot. I also don't believe that the future of all I/O is just memcpy and atomics - we can already do that today, and it doesn't really bring any speed advantage in the general case. I think the future of I/O is memcpy, atomics, and a good signaling mechanism to flag I/O completion without costly context switches and with as little extraneous memory allocation as possible. Moreover, the future of consumer computing will probably not rely on PCI-E at all and will instead have the GPU and the CPU share all of their memory. And hey, maybe Nvidia will add some cool ARM companion cores to their biggest chips, slap some DDR5 slots on their cards, and sell self-contained solutions, sidestepping PCI-E entirely, at least for feeding data from the CPU to the GPU.
Recent datacenter network controllers (Mellanox, Marvell) have 'GPUDirect' capabilities, i.e. direct interaction between devices with no CPU involvement. I've also seen FPGA+network boards do that with success. And with libraries like NCCL and 200GbE links, you could almost forget you have CPUs or network links in between.
What I miss is a simple but efficient data queue between CPU and GPU. Everyone's doing manual memory reservation and cudaMemcpy; I want an async send (GPU->CPU) with an MPI- or socket-like interface. I've seen someone posting about io_uring from GPU code, but just bragging, no code.
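Something like the following is what I have in mind: a single-producer (GPU) / single-consumer (CPU) ring in pinned memory. Nothing standard exists as far as I know; the names (GpuToCpuRing, gpu_send, cpu_recv) are made up, and it assumes system-scope atomics work across the bus on your platform.

    // Sketch of an async GPU->CPU send queue in pinned memory.
    #include <cstdio>
    #include <new>
    #include <cuda/atomic>
    #include <cuda_runtime.h>

    constexpr unsigned CAP = 1024;    // slots; power of two

    struct GpuToCpuRing {
        cuda::atomic<unsigned, cuda::thread_scope_system> head;  // written by the GPU
        cuda::atomic<unsigned, cuda::thread_scope_system> tail;  // written by the CPU
        int slots[CAP];
    };

    __device__ bool gpu_send(GpuToCpuRing* q, int msg) {          // async send, GPU side
        unsigned h = q->head.load(cuda::std::memory_order_relaxed);
        if (h - q->tail.load(cuda::std::memory_order_acquire) == CAP) return false;  // full
        q->slots[h % CAP] = msg;
        q->head.store(h + 1, cuda::std::memory_order_release);    // publish to the CPU
        return true;
    }

    bool cpu_recv(GpuToCpuRing* q, int* msg) {                    // poll, CPU side
        unsigned t = q->tail.load(cuda::std::memory_order_relaxed);
        if (t == q->head.load(cuda::std::memory_order_acquire)) return false;        // empty
        *msg = q->slots[t % CAP];
        q->tail.store(t + 1, cuda::std::memory_order_release);
        return true;
    }

    __global__ void producer(GpuToCpuRing* q) {
        for (int i = 0; i < 16; ++i)
            while (!gpu_send(q, i * i)) { }   // retry if the CPU falls behind
    }

    int main() {
        GpuToCpuRing* q;
        cudaHostAlloc(&q, sizeof(GpuToCpuRing), cudaHostAllocMapped);
        new (q) GpuToCpuRing;
        q->head.store(0);
        q->tail.store(0);

        producer<<<1, 1>>>(q);                // one GPU thread "sends" asynchronously
        for (int received = 0, msg; received < 16; )
            if (cpu_recv(q, &msg)) { printf("got %d\n", msg); ++received; }
        cudaDeviceSynchronize();
        cudaFreeHost(q);
        return 0;
    }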
Having bought Mellanox, with their BlueField DPUs (8 or 16 ARM cores integrated into the NIC), I feel NVIDIA could probably go the way you're describing. I haven't seen any Mellanox/NVIDIA tech convergence yet, though.