For UDP the difference is significant. If you profile any UDP-based high-pps or high-bps application, you will notice that a lot of time is spent in the kernel's network stack - resolving routes, applying iptables rules, evaluating BPF programs, copying data, and ultimately handing packets over to the driver. On something like a QUIC server this can easily account for 30% of CPU time - or even more than 50% if optimizations like GSO (generic segmentation offload) are not used.
io_uring does pretty much nothing to reduce this number. It reduces the syscall overhead, but that isn't actually very high in those applications. What it also does is move the actual cost and latency of system calls from the moment the IO is attempted to the moment `io_uring_enter` is called, which essentially batches all IO. For some applications that can reduce overhead - for others it just means `io_uring_enter` becomes an extremely high-latency operation that stalls the event loop. That symptom can be avoided with kernel-side polling of the submission queue, which no longer requires `io_uring_enter` - but due to the polling overhead that is only viable for a certain set of workloads too.
Kernel bypass (AF_XDP, DPDK, etc.) directly avoids that ~30% overhead, at the cost of reduced tooling and observability.
For TCP the story might be slightly different, since the kernel overhead there is usually lower thanks to more offloads being available. But even there, I don't think there has been much proof that io_uring provides a significantly different performance profile than the existing non-blocking APIs.
first, if you're calling `io_uring_enter` per operation, you're probably not using io_uring in any way that truly minimizes latency. Kernel-side polling matters for more than that - see the recent work on integrating io_uring with the NAPI busy-polling infrastructure.
second, there should be no spurious copies if you set it up right; it is designed with zero-copy capabilities in mind.
third, whether it evaluates eBPF or wastes other time in the kernel stack is a matter of configuration and tuning.
fourth, there are several articles that already demonstrate that it achieves performance close to that of DPDK, with a much easier and more flexible setup.