This article, and indeed much of the discourse about the facility out in the wild, could really use some expansion on at least these two points:
> The application submits several syscalls by writing their codes & arguments to a lock-free shared-memory ring buffer.
That's not all, though, right? Unless you're using the dubious kernel-driven polling thread, which I gather is uncommon, you _also_ have to make a submit system call of some kind after you've finished appending stuff to your submit queue. Otherwise, how would the kernel know you'd done it?
> The kernel reads the syscalls from this shared memory and executes them at its own pace.
I think this leaves out most of the interesting information about how it actually works in practice. Some system calls are not especially asynchronous: they perform some amount of work that isn't asynchronously submitted to some hardware in quite the same way as, say, a disk write may be. What thread is actually performing that work? Does it occur immediately during the submit system call, and is completed by the time that call returns? Is some other thread co-opted to perform the work? Whose scheduling quantum is consumed by this?
Without concrete answers that succinctly describe this behaviour it feels impossible to see beyond the hype and get to the actual trade-offs being made.
> That's not all, though, right? Unless you're using the dubious kernel-driven polling thread, which I gather is uncommon, you _also_ have to make a submit system call of some kind after you've finished appending stuff to your submit queue. Otherwise, how would the kernel know you'd done it?
Correct, the "check for new entries" system call is called io_uring_enter().
> Some system calls are not especially asynchronous: they perform some amount of work that isn't asynchronously submitted to some hardware in quite the same way as, say, a disk write may be. What thread is actually performing that work? Does it occur immediately during the submit system call, and is completed by the time that call returns? Is some other thread co-opted to perform the work?
A kernel thread. The submit system call can optionally be made to wait for completions, but by default it returns immediately.
> Whose scheduling quantum is consumed by this?
That's a good question. The IO scheduler correctly sees them as belonging to the submitting thread, but if you issue a bunch of computation-heavy syscalls, I would not be surprised if they were not correctly accounted for.
> you _also_ have to make a submit system call of some kind after you've finished appending stuff to your submit queue.
Sure. As I understand it, you need a handful of syscalls to get your async IO setup up and running (one io_uring_setup call and a few mmaps), and then you interact with it (both to submit new work and get the results from old work) through the io_uring_enter syscall.
The point is that you're batching things, so you only pay for the context switching of a single physical syscall to make several logical syscalls. It's effectively a solution to the n+1 request problem at the kernel level.
> Otherwise, how would the kernel know you'd done it?
Well, you can make a design where the submission queue spans N+1 pages, and to notify the kernel you write something to the last page, which is write-protected, so the write traps into the kernel. I believe VirtIO has a similar scheme?
> What thread is actually performing that work?
None? You don't really need a user-space thread to execute code in the kernel: otherwise, starting process 1 would have to be the very first thing the kernel does when booting, while in reality it's about the last thing it does in the boot process.
With the multi-core systems we have today, arguably having a whole core dedicated exclusively to some core OS functionality could be more performant than having that core "constantly" switch contexts?
> notify the kernel you write something in the last page which is actually write-protected, so it triggers the kernel trap
Right, that's a page fault; i.e., a context and privilege level switch. I can't imagine that's going to be any cheaper than a system call.
> None? You don't need really need a user-space thread to execute code in the kernel
I didn't mention user space. There are threads in the kernel. From a scheduling perspective, somebody needs to be billed for the work they're doing. If it's not the process that invoked the system call, that seems like a pretty easy way to induce a bunch of noisy work on the system in excess of what the regular scheduling algorithm and resource capping would allow.
> With the multi-core systems we have today, arguably having a whole core dedicated exclusively to some core OS functionality could be more performant than having that core "constantly" switch contexts?
Perhaps! Without a design and some measurement it seems impossible to know. Logically, though, you'll still have (kernel) threads of some kind executing in that special partition. They'll compete for execution time, just like user mode threads compete for execution time, at which point for fairness you'll have to figure out how to bill the work back to the user process somehow.
> With the multi-core systems we have today, arguably having a whole core dedicated exclusively to some core OS functionality could be more performant than having that core "constantly" switch contexts?
Maybe, maybe not; this would be very load dependent.
I feel that with the different kinds of cores in newer CPUs, this might be a good fit for a dedicated core. Maybe a low clock speed (and low power consumption) would be enough.
> Without concrete answers that succinctly describe this behaviour it feels impossible to see beyond the hype and get to the actual trade-offs being made.
This article seems to be an attempt to succinctly describe the high level goals of io_uring to people who don't know, and in that respect I find it succeeded. The questions you're asking seem more related to how io_uring (or its API) are implemented, which is something else. I would hope anyone deciding whether to build something on io_uring would then do more detailed research on the trade-offs before pushing anything to production.
I've been playing with io_uring for the last month or so, just for fun. I'm working on building an async runtime from scratch on top of it. I've been documenting (think of them more as notes to myself) the process thus far:
I'm kind of curious what Alex meant by this, as the security problems relating to io_uring are, to my knowledge, unrelated to the user-space program. It makes sense if you want to disable the feature in your own kernel or remove potential sandbox escape attack surface, but it's like saying "You might want to avoid win32k if you want to use features with a good security track record" (and I know this is kind of apples to oranges but you get the point).
IIUC, io_uring surfaced a bunch of pre-existing-but-rarely-hit code paths that had issues, which was widely taken to mean "io_uring has issues". Google also disabled it on all machines in GCP; it's not clear whether that was because of the same issues or something else. The aforementioned issues have since been fixed.
io_uring also caused a ton of problems for our containers in Kubernetes when Node 20 had enabled it by default. They scrambled and turned it off by default in https://github.com/nodejs/node/commit/686da19abb
I'm extremely curious: what kinds of problems? We had entire Kubernetes nodes becoming unresponsive, and I think io_uring in Node 18.18.0 was responsible, but they had lots of stuff running on them, so I was never able to pinpoint the exact cause.
Although I'm a fan of io_uring, one reason to avoid it is that it involves quite major program restructuring (versus traditional event loops). If you don't want to depend on Linux >= 6 forever, but also want to support other OSes, then you'll have to maintain both versions.
Are you saying io_uring is basically IOCP for Linux? If that's true, then I'm not happy about it being in the kernel. IOCP only goes 2x faster in my experience (vs. threads and blocking i/o) and that isn't worth it for the ugly error prone code you have to write. System calls are also a lot faster on Linux than they are on Windows, so I'd be surprised if io_uring manages even that.