Descriptorless Files for Io_uring (lwn.net)
113 points by signa11 on July 30, 2021 | 49 comments



I hope one day async io-heavy programs will be written with a big user space part and many small ebpf parts running in kernel space communicating with userspace using io_uring without wasting a single syscall. This could bring system efficiency to new levels.


We’re working on it :) [1]

The interesting problem is that it’s still kind of a developer experience mess. We can get pretty far with libbpf skeleton, libbpf-rs, etc, but I think we’re still waiting on the “killer” framework for this (or some other kind of language support).

[1] I work with Pavel, Jens, and Alexei


> We’re working on it :) [1]

Here's a bunch of questions for you then: is the io_uring API similar to what we see in message based micro-kernels like Mach, where any syscall is implemented as a message posted to the kernel? If so, is this what Linux is heading towards?

I also see a similarity between this recent development and what we see in GPU libraries, where early API like OpenGL did not provide any control of the graphic stack "internals", but gradually opened them with each new major revision, so that it became possible to tell the stack to allocate memory for vertex and textures buffers for instance. Are we going to see a similar trend with OS kernels?

edit: added missing word (emphasized)


Big disclaimer: I support these guys but I don't tell them what to do, so these are just my dumb, manager opinions :)

I think we're going to continue to see more of this shared-memory message passing style for two reasons: 1. Hardware performance (NICs, SSDs, etc.) is out-accelerating CPU performance; you can't keep up if you're hitting a ton of context switches. 2. Context switches are crazy expensive, and (at least temporarily) getting more expensive due to Spectre/Meltdown mitigations.

These considerations pique my interest more so than micro/exo/whatever kernel architecture considerations do.

In terms of stack openness, I think the biggest changes are what's happening with bpf. While it's always been possible to go hack the scheduler however you want, it's not really been feasible -- you're likely to break the thing, and carrying patches around is a giant pain. With bpf hooks, you can manipulate kernel behaviors in a very fine-grained fashion, which has already created a bunch of academic interest and really changed the way we build our low-level systems software (containers, networking, etc).

The biggest thing here is that visibility into internals and strong ABI guarantees are inherently in conflict, and it'll take some time to figure out the more nuanced view.


> 2. Context switches are crazy expensive,

They're not that expensive; something like 1000 cycles. When you factor in all the atomic operations, which take 10-25x more cycles than normal operations, required to post and pull from shared memory queues, you're not actually saving much at all.

The value in io_uring, much like Windows IOCP, will end up being that it provides a single API that everybody targets. io_uring will displace libraries like libuv. No userland project can compete with a blessed kernel interface.

What performance improvements io_uring does provide are irrelevant to the vast majority of people excited about it: people using Python and Node.js, who generally leave orders of magnitude more performance potential on the table.


> They're not that expensive; something like 1000 cycles. When you factor in all the atomic operations, which take 10-25x more cycles than normal operations, required to post and pull from shared memory queues, you're not actually saving much at all.

I thought that io_uring buffers weren't shared? Or do you mean in a multithreaded app?


An io_uring context is a thread pool, except the worker threads are kernel threads. The userland application posts an operation request (poll, open, read/write, etc.) to a special mmap'd buffer that is dequeued by a kernel thread dedicated to that context. That kernel thread then either performs the operation itself, or if it's a potentially blocking operation (e.g. a file read) passes it to another worker thread. The results are then posted back to another shared buffer-based message queue. (I'm ignoring what happens when the shared buffers are filled, which has its own implications--good and bad--for performance.)

It's basically no different than a typical multi-threaded userland application using a ring buffer message queue, except the worker threads operate in kernel context and can access unpublished kernel APIs.
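
For reference, that submit/complete flow looks roughly like this through liburing (a minimal sketch; error handling omitted):

    // Minimal sketch of the submit/complete flow described above,
    // using liburing. Error handling omitted for brevity.
    #include <fcntl.h>
    #include <stdio.h>
    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);    // mmaps the SQ and CQ rings

        int fd = open("/etc/hostname", O_RDONLY);
        static char buf[4096];

        // Post a read request to the submission queue...
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);              // one io_uring_enter() for all queued requests

        // ...and reap the result from the completion queue.
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }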

But like all multi-threaded designs, the performance choke points are always the places where you need to either obtain a lock or use atomic operations. The problems with locks are obvious. But atomic operations absolutely destroy the performance of pipelined CPU architectures, both directly and indirectly (e.g. if working data is being passed to a thread scheduled on a different core).

Ever wonder why software transactional memory never really succeeded despite the hype? Because a lock often only requires a single atomic operation, whereas lockless algorithms often rely on a whole series of atomic operations. In practice lockless algorithms, while they scale well asymptotically, have horrible absolute performance.

I'm not saying that io_uring's operation queue is intrinsically slower. My point is that 1) context switches, especially in Linux, aren't that slow and 2) all the atomic memory operations have real costs. So the relative benefit of the message passing and worker thread architecture (which is a fixed cost in io_uring, much like a context switch for a syscall) isn't as great as you'd think. It's almost de minimis from the perspective of whole-application architecture.
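
(To make the lock-vs-lockless cost difference concrete, here's a toy sketch: the uncontended mutex path is roughly one atomic to acquire plus one to release, while even the simplest lock-free update pays an extra atomic for every failed retry.)

    #include <pthread.h>
    #include <stdatomic.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter_locked;
    static _Atomic long counter_lockfree;

    // Locked: ~1 atomic to acquire (uncontended) + 1 to release.
    void add_locked(long n)
    {
        pthread_mutex_lock(&lock);
        counter_locked += n;
        pthread_mutex_unlock(&lock);
    }

    // Lock-free: every failed CAS is another full atomic round-trip.
    void add_lockfree(long n)
    {
        long old = atomic_load(&counter_lockfree);
        while (!atomic_compare_exchange_weak(&counter_lockfree, &old, old + n))
            ;   // 'old' was refreshed by the failed CAS; retry
    }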


> Ever wonder why software transactional memory never really succeeded despite the hype?

You can implement software transactional memory with locks just fine. Are you talking about hardware transactional memory perhaps?

(In some sense, database transactions are pretty similar to software transactional memory.

And that points to a different possible reason why software transactional memory (STM) hasn't taken off:

STM works really well in a language like SQL or Haskell with carefully restricted side-effects. But bolting STM onto an impure language is asking for trouble.)

PS Your main point about locks vs atomic primitives still stands regardless.


IIRC the Mach kernel implements IPC by making a context switch (syscall) for every message. There are also other microkernels with different IPC implementation tradeoffs, though. Is there any that implements async RPC through pure message passing via shared memory ring buffers?


I know that DragonflyBSD was born on the idea of implementing syscalls as messages, while retaining the structure of a monolithic kernel (FreeBSD 4), but I don't know if the messages are kept in userland or there's still a context switch involved.

The annoying part of having a relatively large piece of shared data between userland and kernel is that the kernel needs a way to ensure that this data hasn't been tampered with "illegally".

I wonder how this is handled with io_uring.


Is this really fundamentally different to existing syscalls? They already have to guard against the possibility that another thread does funny business with the memory accessed by the syscall. You add ring buffer control structures, sure, but apart from that it seems largely the same problem?


I don't know much about kernel security dev, let alone plain kernel dev, so take what I said and about to say with a (big) grain of salt.

Indeed, it seems there's a similar issue with traditional syscalls. I suppose that a larger attack surface means more potential vulnerabilities. The whole buffer must be considered "unsafe" by the kernel. I don't know how that structure is allocated, if it can be relocated to some other existing area to trick the kernel into doing IO there or something else.

I really don't have a clear picture of what could be done, so it might just be my paranoid sense tingling. I definitely should be more trustful wrt what's done by the kernel devs, they know their craft.


> Is this really fundamentally different to existing syscalls?

Originally, and perhaps still, DragonflyBSD was attempting to become a Single-System-Image OS: it was supposed to allow multiple boxes to work together in a way that looked like a single computer.

I could definitely see making syscalls be more like messages as being a useful starting abstraction for this scenario.


Sounds like the “file sealing” API combined with memfd handles that: https://lwn.net/Articles/593918/


Where I work we spend lots of time doing highly off the beaten path FRP stuff for very asynchronous IO. The type of stuff everyone gave up on as too difficult long ago :). That should work quite well for this.

You should talk to the Haskell cohort at Facebook, and they can talk to us.


It brings to mind how Spark driver programs work.

It is also interesting to consider how Go, where IO is part of the language rather than a library, could use these mechanisms to do meaningful stuff transparently in the future.


In the end it'll look pretty much like dataflow programming; why not take a look at the visual systems people are using for that? E.g. PureData and friends.


Given where it is now, I dream of seeing it go the next step towards what Alexia Massalin's Synthesis kernel did, and start JIT'ing across the user/kernel boundary. Of course, there be dragons.


The specter of Spectre raises its ugly head.


syscalls are really not that expensive. The hardware overhead to cross the privilege boundary both times is only like 100 nanoseconds on modern hardware, rising to maybe a few hundred nanoseconds with full speculation mitigations enabled. Essentially everything else is work that someone in the system needs to do. That work would not go away even if you were to link everything into the kernel, so you really only save the hardware overhead by reducing the number of syscalls. So, unless your code is mainly hot loops <1-5 us in duration with a syscall needed per loop, you would not really gain that much efficiency by reducing the number of syscalls. To gain real benefits you need to make the syscall implementations themselves faster, which is certainly possible with a different syscall paradigm.


It's expensive enough that the first thing I do when diagnosing slow software is to break out "strace -c", because shockingly often the cause is someone who thought syscalls wouldn't be that expensive.

You're right, they're not that expensive, but once you end up triggering multiple syscalls in a tight loop where only a tiny fraction of them were needed, it often does totally kill performance.

Especially because while the call itself is not that expensive, you rarely call something that does no work. So the goal should be not just to remove the syscall, but to remove the reason why someone made lots of syscalls in the first instance.

E.g. a pet peeve of mine: people who call read() with size 1 because they don't know the length of the remaining data, instead of doing async reads into a userspace buffer. I have cut CPU use drastically so many times in applications that did that. The problem then is of course not just the context switch, but the massive number of read calls.
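
For illustration, a sketch of both patterns (handle_byte() is a hypothetical stand-in for whatever the application does per byte):

    #include <unistd.h>

    void handle_byte(char c);   // hypothetical per-byte processing

    // The anti-pattern: one read() syscall per byte.
    void consume_slow(int fd)
    {
        char c;
        while (read(fd, &c, 1) == 1)
            handle_byte(c);
    }

    // Same bytes delivered, one syscall per ~64 KiB.
    void consume_buffered(int fd)
    {
        char buf[65536];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            for (ssize_t i = 0; i < n; i++)
                handle_byte(buf[i]);
    }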


> read() with size 1

FYI shells do this when reading line-by-line from non-seekable streams (e.g. pipes). It's not really feasible to do buffer I/O when you have subprocesses around, and you can't just ungetc(3) into arbitrary file descriptors. A potential performance worth nothing IMO—that is, if you ever do text processing with a bunch of read commands and not sed/awk/whatever like a normal person. Of course this doesn't apply to seekable files, but it still has to rewind back to immediately after the last newline character so there's that.


Yeah, there are occasional acceptable reasons to do this, but they are thankfully rarely what causes an issue.

More like e.g. how the MySQL client library used to do this (many years ago; it had been fixed last I checked).


> The hardware overhead to cross the privilege boundary both times is only like 100 nanoseconds on modern hardware

When your total time budget is 1 or 2 milliseconds that sounds really really expensive


100 nanoseconds out of 1 millisecond is 1/10000, or ~0.01%, which is really not very significant. As I said, if you have hot loops with a 1 microsecond budget then you might see material benefits with a kernel architecture that requires fewer privilege crossings. However, in most such cases there are fairly easy ways of reorganizing the code to use the system architecture more efficiently, such as by batching multiple writes in userspace, that will also result in fewer syscalls in your critical path (and thus fewer privilege crossings) without changing the kernel architecture. It is actually a fairly rare problem where the only solution is to issue a vast number of syscalls quickly, and in those cases that is usually a limitation of the kernel data structures and syscall implementation. Just to clarify the distinction: in the former case you issue the same number of "syscalls", but the new kernel architecture results in fewer privilege crossing events, whereas my suggestion here is to rearchitect so you need to issue fewer syscalls.
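
As a hypothetical sketch of that userspace batching: gathering pending buffers into a single writev() turns several logical writes into one privilege crossing.

    #include <string.h>
    #include <sys/uio.h>

    // Sketch: three logical writes, one syscall.
    ssize_t send_three(int fd, const char *a, const char *b, const char *c)
    {
        struct iovec iov[3] = {
            { (void *)a, strlen(a) },
            { (void *)b, strlen(b) },
            { (void *)c, strlen(c) },
        };
        return writev(fd, iov, 3);
    }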


Cases where you have soft realtime constraints like the above are often interactive apps like browsers or games that do a lot of input, audio, and graphics related syscalls that don't easily batch.


Are there any differences in cache misses and mechanical sympathy?


I measure ~80ns here for the most-trivial syscalls (I measured getpid() and ftruncate(-1)).
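
Something like the following minimal loop (a sketch, not necessarily the exact harness; numbers vary with CPU, kernel version, and mitigation settings):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        enum { N = 10 * 1000 * 1000 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            syscall(SYS_getpid);   // raw syscall, bypasses any libc caching
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per syscall\n", ns / N);
        return 0;
    }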

However, the "real" cost often is higher, because a syscall essentially prevents out-of-order execution from hiding the cost of those cycles. Quoting the Intel manual:

> Instruction ordering. Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible)

On modern out-of-order CPUs that's often going to cause a lot of knock-on slowdown. Waiting for all prior instructions to retire will often mean waiting for memory accesses etc that would effectively be "free" without the syscall.

And then you have the icache, TLB costs of the syscall - harder to measure, because it won't show up in simple test programs...


Yes, there is some overhead due to not being able to speculate across the privilege boundary. However, the cache effects that you mention are not syscall overhead that you could get rid of by reducing privilege crossings. They are caused by the fact that the syscall is actually executing code. You would see the same effects even if you were to link everything together into the kernel and it were just a subroutine call.

That means that if cache effects due to syscalls are your major problem then you will basically not benefit even if you were to use io_uring or equivalent mechanisms to reduce your privilege crossings. The actual problem is that you are asking the kernel to do too much work and the only solution there is to rearchitect to require less time manipulating kernel data structures.


> However, the cache effects that you mention are not syscall overhead that you could get rid of by reducing privilege crossings. They are caused by the fact that the syscall is actually executing code.

That's part of it, sure. But not the whole issue. You need code mapped elsewhere leading to icache/iTLB effects, the GP registers needs to be saved to memory costing you data cache entries, etc.

> That means that if cache effects due to syscalls are your major problem then you will basically not benefit even if you were to use io_uring or equivalent mechanisms to reduce your privilege crossings. The actual problem is that you are asking the kernel to do too much work and the only solution there is to rearchitect to require less time manipulating kernel data structures.

I don't agree at all with this. When executing nontrivial syscalls, the various cache misses inside the kernel alone are a major performance issue.

Compare e.g. the perf stat output for these two fio runs. Both use the following "base" command:

    fio --time_based=1 --runtime=10 --name test --filename=/dev/nvme6n1 \
        --numjobs=1 --rw randread --allow_file_create=0 --invalidate=0 \
        --ioengine io_uring --direct=1 --bs=4k --registerfiles --fixedbufs \
        --gtod_reduce=1 --iodepth 128

One run adds --iodepth_batch_submit=1 --iodepth_batch_complete_max=1 and the other --iodepth_batch_submit=128 --iodepth_batch_complete_max=128, which leads the first to submit and reap one IO per io_uring_enter() call and the second to batch up to 128 at a time.

There's a decent IO throughput difference between the two (non-batched IOPS=402k, batched IOPS=586k), but that's partially due to the different number of interrupts, so I don't want to focus on that too much.

The difference in the number of instructions executed / IPC (34B to 46B / 1.27 to 1.7 IPC), iTLB misses (64M vs 20M), and icache misses (2.9B vs 441M) IMO shows pretty clearly that there's a significant difference in cache usage.

Non-batched:

         11,192.12 msec task-clock                #    1.081 CPUs utilized          
             2,249      context-switches          #  200.945 /sec                   
                40      cpu-migrations            #    3.574 /sec                   
             3,355      page-faults               #  299.765 /sec                   
    26,815,015,190      cycles                    #    2.396 GHz                      (39.24%)
    34,035,530,697      instructions              #    1.27  insn per cycle           (46.28%)
     7,219,263,051      branches                  #  645.031 M/sec                    (45.30%)
        70,070,545      branch-misses             #    0.97% of all branches          (44.46%)
    10,964,930,461      L1-dcache-loads           #  979.701 M/sec                    (44.52%)
       377,649,453      L1-dcache-load-misses     #    3.44% of all L1-dcache accesses  (44.86%)
        10,565,529      LLC-loads                 #  944.015 K/sec                    (31.40%)
         4,111,354      LLC-load-misses           #   38.91% of all LL-cache accesses  (33.55%)
   <not supported>      L1-icache-loads                                             
     2,937,881,568      L1-icache-load-misses                                         (33.44%)
     9,856,349,351      dTLB-loads                #  880.651 M/sec                    (33.01%)
            95,858      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (32.66%)
           202,652      iTLB-loads                #   18.107 K/sec                    (32.21%)
        64,404,931      iTLB-load-misses          # 31781.05% of all iTLB cache accesses  (31.82%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      10.349357908 seconds time elapsed

       4.271785000 seconds user
       6.925925000 seconds sys


Batched:

         11,047.76 msec task-clock                #    1.068 CPUs utilized          
             2,215      context-switches          #  200.493 /sec                   
                40      cpu-migrations            #    3.621 /sec                   
             3,349      page-faults               #  303.138 /sec                   
    26,607,362,936      cycles                    #    2.408 GHz                      (38.78%)
    46,169,616,345      instructions              #    1.74  insn per cycle           (45.89%)
    10,085,406,417      branches                  #  912.891 M/sec                    (44.98%)
        33,167,238      branch-misses             #    0.33% of all branches          (44.29%)
    14,626,458,068      L1-dcache-loads           #    1.324 G/sec                    (44.17%)
       384,373,355      L1-dcache-load-misses     #    2.63% of all L1-dcache accesses  (44.93%)
        11,998,821      LLC-loads                 #    1.086 M/sec                    (31.99%)
         5,558,319      LLC-load-misses           #   46.32% of all LL-cache accesses  (33.42%)
   <not supported>      L1-icache-loads                                             
       441,021,238      L1-icache-load-misses                                         (33.50%)
    13,012,218,216      dTLB-loads                #    1.178 G/sec                    (33.08%)
           101,259      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (32.59%)
           197,359      iTLB-loads                #   17.864 K/sec                    (32.13%)
        20,294,577      iTLB-load-misses          # 10283.08% of all iTLB cache accesses  (31.62%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses

      10.348157248 seconds time elapsed

       3.922330000 seconds user
       7.130332000 seconds sys

You can see similar effects even if you eliminate the actual storage device. E.g. when doing buffered reads from a file in page cache I get extreme differences in icache misses (1.7B vs 31M) and iTLB misses (27M vs 600k), yielding IOPS=790k vs IOPS=1098k (with memory copying being the bottleneck on this machine, thanks to Intel server CPU coherency handling, but proportions are similar on my laptop).


Indeed. That's equivalent to one cache miss. Spending engineering effort on memory locality and making your data structures compact probably pays off better for most apps. But maybe eBPF could help with that as well?


If rsync was rewritten this way, would it bring big improvements when running from userspace as a normal user?


> I hope one day async io-heavy programs will be written with a big user space part and many small ebpf parts running in kernel space communicating with userspace using io_uring without wasting a single syscall.

Nicely stated elevator pitch. Big CSP / Communicating Sequential Processes[1] / process-calculus feel, from the lots of little parts working together description you have! We're somewhat re-primed to move in this direction from having done serverless "lambdas"/functions-as-a-service, too, I'd say.

[1] https://en.wikipedia.org/wiki/Communicating_sequential_proce...


Learning from exo-kernels might be the alternative.


io_uring combined with ebpf is learning from exokernels AFAICT.

XOK, for instance, had at least three or four in-kernel VMs exposed to user space, depending on how you count them, and was sort of defined by user space using those VMs to safely manipulate kernel contexts.


Those VMs were used to decide who owns (e.g.) a network packet, which is important for exokernels since the tiny kernel is only there to multiplex the hardware and so needs to know ownership, but that is really nothing at all like eBPF.


They did all sorts of other work too. For one example, their futex() equivalent was a program called a wake predicate that could read from that process's address space to decide if that thread should transition back to runnable.

And deciding who owns a network packet is literally how the "extended Berkeley Packet Filter" got its start as well.


OT: What does io_uring stand for? I get the io part, but what is uring? Is that an abbreviation for something?


The interface uses ringbuffers for communication between the kernel and userspace. That's how I have always understood it at least, not sure if it's actually correct.


Userspace ring buffers.


It’s a ring buffer. Not sure what the u is. Micro?


No, it is two ring buffers, with the kernel in between moving requests from one side to the other. My guess is that the name describes the shape.


userspace?


I always assumed it's the verb uring[1], but that might well be wrong.

[1]: https://en.wiktionary.org/wiki/uring


Does this allow you to get around limits on the number of open files? I work on a system that only allows 4096 open files, which means VSCode's file watching code doesn't work on large projects. It would be great if this provided a workaround.


I highly doubt it. Suppose you want to run untrusted user code with fork() + setrlimit(RLIMIT_NOFILE, &limit) + exec(...). It would be rather silly if the user code could just create an io_uring to bypass resource limits.


But the open file limits aren't a trust issue, they're a resource issue.

If this allows avoiding the reason for the resource limitation in the first place then there's no problem as far as I can see.


> But the open file limits aren't a trust issue, they're a resource issue.

Not necessarily. I might use open file limits to deny all OS resources to user code except for some pipes that I explicitly provide it with from the parent process, like the following pseudocode:

    pipe(child_stdin_pipe);
    pipe(child_stdout_pipe);
    pipe(child_stderr_pipe);
    pid = fork();
    if(pid is child) {
        dup2(child_stdin_pipe[0], 0);
        dup2(child_stdout_pipe[1], 1);
        dup2(child_stderr_pipe[1], 2);
        close_range(3, INT_MAX);
        setrlimit(RLIMIT_NOFILE, 3);
        // maybe setrlimit a bunch of other stuff, like cpu time
        exec(...);
    }
    elif(pid is parent) {
        close(child_stdin_pipe[0]);
        close(child_stdout_pipe[1]);
        close(child_stderr_pipe[1]);
        
        // send info to untrusted user code
        write(child_stdin_pipe[1], ...);
        
        // retrieve info from untrusted user code
        read(child_stdout_pipe[0], ...);
        read(child_stderr_pipe[0], ...);
        
        wait();
    }
Of course, this can be done more easily with platform-specific tools like seccomp, but the point remains.


I never thought Linux could become any more monolithic. But somehow it is!

(Not that it's a bad thing.)



