Here's a bunch of questions for you then: is the io_uring API similar to what we see in message-based microkernels like Mach, where any syscall is implemented as a message posted to the kernel? If so, is this what Linux is heading towards?
I also see a similarity between this recent development and what we see in GPU libraries, where early APIs like OpenGL did not provide any control over the graphics stack's "internals", but gradually opened them up with each new major revision, so that it became possible, for instance, to tell the stack to allocate memory for vertex and texture buffers. Are we going to see a similar trend with OS kernels?
Big disclaimer: I support these guys but I don't tell them what to do, so these are just my dumb, manager opinions :)
I think we're going to continue to see more of this shared-memory message passing style for two reasons:
1. Hardware performance (NICs, SSDs, etc.) is out-accelerating CPU performance. You can't keep up if you're hitting a ton of context switches.
2. Context switches are crazy expensive, and (at least temporarily) getting more expensive due to Spectre/Meltdown mitigations.
These considerations pique my interest more so than micro/exo/whatever kernel architecture considerations do.
In terms of stack openness, I think the biggest changes are what's happening with bpf. While it's always been possible to go hack the scheduler however you want, it's not really been feasible -- you're likely to break the thing, and carrying patches around is a giant pain. With bpf hooks, you can manipulate kernel behaviors in a very fine-grained fashion, which has already created a bunch of academic interest and really changed the way we build our low-level systems software (containers, networking, etc).
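To make that concrete, here's a minimal, observe-only sketch of that kind of hook, assuming a libbpf/CO-RE toolchain (the program and map names are just illustrative): a kprobe on the kernel's try_to_wake_up that counts wakeups per PID, with no out-of-tree patch to carry around.

    // minimal_wakeups.bpf.c -- illustrative sketch, assumes a libbpf toolchain
    // (clang -target bpf); counts scheduler wakeups per PID, observe-only.
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u32);    /* pid */
        __type(value, __u64);  /* wakeup count */
    } wakeups SEC(".maps");

    SEC("kprobe/try_to_wake_up")
    int count_wakeups(void *ctx)
    {
        __u32 pid = bpf_get_current_pid_tgid() >> 32;
        __u64 one = 1, *val;

        val = bpf_map_lookup_elem(&wakeups, &pid);
        if (val)
            __sync_fetch_and_add(val, 1);    /* bump existing counter */
        else
            bpf_map_update_elem(&wakeups, &pid, &one, BPF_ANY);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

The load/attach boilerplate on the userland side is exactly what libbpf skeletons are meant to tame, which is where the developer-experience complaints further down the thread come in.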
The biggest thing here is that visibility into internals and strong ABI guarantees are inherently in conflict, and it'll take some time to figure out the more nuanced view.
They're not that expensive; something like 1000 cycles. When you factor in the atomic operations required to post to and pull from shared-memory queues, which take 10-25x more cycles than normal operations, you're not actually saving much at all.
The value in io_uring, much like Windows IOCP, will end up being that it provides a single API that everybody targets. io_uring will displace libraries like libuv. No userland project can compete with a blessed kernel interface.
What performance improvements io_uring does provide are irrelevant to the vast majority of people excited about it: people using Python and Node.js, who generally leave orders of magnitude more performance potential on the table.
> They're not that expensive; something like 1000 cycles. When you factor in the atomic operations required to post to and pull from shared-memory queues, which take 10-25x more cycles than normal operations, you're not actually saving much at all.
I thought that io_uring buffers weren't shared? Or do you mean in a multithreaded app?
An io_uring context is effectively a thread pool, except the worker threads are kernel threads. The userland application posts an operation request (poll, open, read/write, etc.) to a special mmap'd buffer; the kernel consumes it either when the application calls io_uring_enter or, in SQPOLL mode, via a kernel thread dedicated to that context. The request is then either performed inline or, if it's a potentially blocking operation (e.g. a file read), handed off to a worker thread. The results are then posted back to another shared buffer-based message queue. (I'm ignoring what happens when the shared buffers are filled, which has its own implications--good and bad--for performance.)
It's basically no different than a typical multi-threaded userland application using a ring buffer message queue, except the worker threads operate in kernel context and can access unpublished kernel APIs.
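For reference, a minimal sketch of that round trip from the userland side using liburing (assuming a reasonably recent liburing and kernel; the queue depth and file path are arbitrary, error handling omitted):

    /* Illustrative liburing example: one read submitted through the shared
     * SQ ring, completion picked up from the CQ ring. Link with -luring. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[4096];

        io_uring_queue_init(8, &ring, 0);           /* mmap'd SQ/CQ rings shared with the kernel */

        int fd = open("/etc/hostname", O_RDONLY);   /* any readable file will do */

        sqe = io_uring_get_sqe(&ring);              /* grab a free submission slot */
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);                     /* one io_uring_enter call (not needed with SQPOLL) */

        io_uring_wait_cqe(&ring, &cqe);             /* block until the completion lands in the CQ ring */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);              /* mark the completion as consumed */

        io_uring_queue_exit(&ring);
        return 0;
    }

Everything crosses via the two shared rings; io_uring_enter is the only syscall in the loop, and even that can be skipped with SQPOLL.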
But like all multi-threaded designs, the performance choke points are always the places where you need to either obtain a lock or use atomic operations. The problem with locks is obvious. But atomic operations absolutely destroy the performance of pipelined CPU architectures, both directly and indirectly (e.g. if working data is being passed to a thread scheduled on a different core).
Ever wonder why software transactional memory never really succeeded despite the hype? Because a lock often only requires a single atomic operation, whereas lockless algorithms often rely on a whole series of atomic operations. In practice they scale well asymptotically but have horrible absolute performance.
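To put the "one atomic vs. a series of atomics" point in concrete terms, here's a toy C11 comparison (illustrative only): the uncontended spinlock acquire below is a single atomic exchange, while even the simplest lock-free push (a Treiber-stack style CAS loop) needs an atomic load plus at least one compare-and-swap, and more retries under contention.

    /* Toy C11 sketch: atomic-operation cost of a lock vs. a lock-free push.
     * Illustrative only, not production code. */
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool held; } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        /* Uncontended fast path: one atomic exchange. */
        while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
            ;  /* spin */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->held, false, memory_order_release);
    }

    struct node { struct node *next; int value; };

    static void lockfree_push(struct node *_Atomic *head, struct node *n)
    {
        /* Lock-free fast path: an atomic load plus a CAS; the loop retries
         * (more atomics) whenever another thread wins the race. */
        struct node *old = atomic_load_explicit(head, memory_order_relaxed);
        do {
            n->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
                     head, &old, n,
                     memory_order_release, memory_order_relaxed));
    }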
I'm not saying that io_uring's operation queue is intrinsically slower. My point is that 1) context switches, especially in Linux, aren't that slow and 2) all the atomic memory operations have real costs. So the relative benefit of the message passing and worker thread architecture (which is a fixed cost in io_uring, much like a context switch for a syscall) isn't as great as you'd think. It's almost de minimis from the perspective of whole-application architecture.
> Ever wonder why software transactional memory never really succeeded despite the hype?
You can implement software transactional memory with locks just fine. Are you talking about hardware transactional memory perhaps?
(In some sense, database transactions are pretty similar to software transactional memory.
And that points to a different possible reason why software transactional memory (STM) hasn't taken off:
STM works really well in a language like SQL or Haskell with carefully restricted side-effects. But bolting STM onto an impure language is asking for trouble.)
PS Your main point about locks vs atomic primitives still stands regardless.
IIRC the Mach kernel implements IPC by making a context switch (syscall) for every message. There are also other microkernels with different IPC implementation tradeoffs, though. Are there any that implement async RPC through pure message passing via shared-memory ring buffers?
I know that DragonflyBSD was born on the idea of implementing syscalls as messages, while retaining the structure of a monolithic kernel (FreeBSD 4), but I don't know if the messages are kept in userland or there's still a context switch involved.
The annoying part of having a relatively large piece of shared data between userland and kernel is that the kernel needs a way to ensure that this data hasn't been tampered with "illegally".
Is this really fundamentally different to existing syscalls? They already have to guard against the possibility that another thread does funny business with the memory accessed by the syscall. You add ring buffer control structures, sure, but apart from that it seems largely the same problem?
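The usual discipline is the same in both cases: snapshot the shared data into kernel-private memory, validate the snapshot, and only ever operate on the snapshot, so a concurrent userland write can't flip a value between the check and the use. A made-up sketch of that pattern (not actual io_uring code; all names and limits are hypothetical):

    /* Hypothetical "copy, then validate, then use" pattern for an entry in a
     * buffer shared with userland. Types, limits, and dispatch() are made up. */
    #include <errno.h>

    #define MAX_LEN 4096
    #define NR_OPS  8

    struct entry { unsigned op; unsigned len; unsigned long long user_addr; };

    static int dispatch(unsigned op, unsigned long long user_addr, unsigned len)
    {
        (void)op; (void)user_addr; (void)len;
        return 0;  /* stand-in for actually performing the operation */
    }

    int handle_entry(const volatile struct entry *shared)
    {
        struct entry snap;

        /* Copy into private memory first; userland may keep scribbling on
         * *shared, but it can no longer affect what we validate below. */
        snap.op        = shared->op;
        snap.len       = shared->len;
        snap.user_addr = shared->user_addr;

        if (snap.op >= NR_OPS || snap.len > MAX_LEN)
            return -EINVAL;

        return dispatch(snap.op, snap.user_addr, snap.len);  /* use the snapshot only */
    }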
I don't know much about kernel security dev, let alone plain kernel dev, so take what I said, and am about to say, with a (big) grain of salt.
Indeed, it seems there's a similar issue with traditional syscalls. I suppose that a larger attack surface means more potential vulnerabilities. The whole buffer must be considered "unsafe" by the kernel. I don't know how that structure is allocated, or whether it could be relocated onto some other existing area to trick the kernel into doing IO there, or something else.
I really don't have a clear picture of what could be done, so it might just be my paranoid sense tingling. I should probably be more trusting of what the kernel devs do; they know their craft.
> Is this really fundamentally different to existing syscalls?
Originally (and perhaps still), DragonflyBSD was attempting to become a single-system-image OS: it was supposed to allow multiple boxes to work together in a way that looked like a single computer.
I could definitely see making syscalls be more like messages as being a useful starting abstraction for this scenario.
Where I work we spend lots of time doing highly off-the-beaten-path FRP stuff for very asynchronous IO -- the type of stuff everyone gave up on as too difficult long ago :). That should work quite well for this.
You should talk to the Haskell cohort at Facebook, and they can talk to us.
It is also interesting how Go, where IO is part of the language rather than a library, could use these mechanisms to do meaningful things transparently in the future.
In the end it'll look pretty much like dataflow programming; why not take a look at the visual systems people are using for that? E.g. PureData and friends.
The interesting problem is that it’s still kind of a developer experience mess. We can get pretty far with libbpf skeleton, libbpf-rs, etc, but I think we’re still waiting on the “killer” framework for this (or some other kind of language support).
[1] I work with Pavel, Jens, and Alexei