Block-layer I/O polling merged into Linux kernel (lwn.net)
109 points by chilledheart on March 1, 2016 | 13 comments



Interesting summary from a comment on the article:

>There are basically two types of polling on the block side. One takes care of interrupt mitigation, so that we can reduce the IRQ load in high IOPS scenarios. That is governed by block/blk-iopoll.c, and is very much like NAPI on the networking side, we've had that since 2009 roughly. It still relies on an initial IRQ trigger, and from there we poll for more completions, and finally re-enable interrupts once we think it's sane to do so. This is driver managed, and opt-in.

>The new block poll support is a bit different. We don't rely on an initial IRQ to trigger, since we never put the application to sleep. We can poll for a specific piece of IO, not just for "any IO". It's all about reducing latencies, as opposed to just reducing the overhead of an IRQ storm.
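
To make the distinction concrete, here is a toy userspace analogy (my own sketch, nothing from the kernel or the patches): one thread plays the "device", and the waiter either sleeps on a condition variable until it is signalled, which is roughly the interrupt model, or busy-spins on a flag, which is roughly the new polling model. Build with cc -pthread.

    /* Toy illustration of the two waiting strategies: sleeping until
     * notified (interrupt-style) vs. busy-polling a flag (poll-style).
     * Purely an analogy; no kernel code involved. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int done;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    static void *device(void *arg)
    {
        (void)arg;
        usleep(100);                     /* pretend the I/O takes ~100us */
        atomic_store(&done, 1);          /* completion is now visible */
        pthread_mutex_lock(&lock);
        pthread_cond_signal(&cond);      /* the "interrupt": wake the sleeper */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        /* Interrupt-style: go to sleep until the device signals completion. */
        atomic_store(&done, 0);
        pthread_create(&t, NULL, device, NULL);
        pthread_mutex_lock(&lock);
        while (!atomic_load(&done))
            pthread_cond_wait(&cond, &lock);  /* pays for a wakeup + context switch */
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        puts("interrupt-style wait finished");

        /* Poll-style: burn CPU spinning on the flag, notice completion instantly. */
        atomic_store(&done, 0);
        pthread_create(&t, NULL, device, NULL);
        while (!atomic_load(&done))
            ;
        pthread_join(t, NULL);
        puts("poll-style wait finished");
        return 0;
    }

The spinning version finishes the wait with no scheduler round trip, which is exactly the latency the block-layer change is chasing; the cost is a core kept busy while it waits.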


It's worth mentioning that this isn't just any commenter: it's Jens Axboe, the current block layer maintainer.


Would somebody mind explaining what this means in layman's terms? I understand what this change is, based on the article, but I don't know what implications it has from a high-level viewpoint, because I don't know much about the I/O system in the first place. For instance, does this mean that a program will now be able to detect block-level disk changes via this mechanism, or is it something completely different?


I don't think the two other responders to your question quite have the whole answer, so I'm going to chime in. Warning: I'm not a kernel developer; I've only read the article and I work with Linux daily.

To answer your second question, this doesn't have anything to do with disk changes/inotify/etc. that a program would use. My understanding is that currently *many* I/O devices respond to the kernel's request for data by triggering an interrupt, which the kernel then takes some time to get around to servicing. That interrupt handling can be a bit slow, adding latency while waiting for the disk to respond. The new system, rather than waiting for an interrupt, continuously checks the driver for new data, and because it doesn't rely on an interrupt it can achieve far better latency. Lower latency means more operations per second.

With spinning disks, which can only do <200 operations per second at latencies around 5ms, this won't have much of an effect. But with SSDs, which can do >2000 operations per second at latencies around 0.5ms, trimming 0.1ms off each operation (I made that number up) by polling rather than waiting on an interrupt can mean roughly 25% more operations per second.
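
To sanity-check that arithmetic (using the same made-up 0.1ms figure), a trivial calculation:

    /* Back-of-the-envelope check of the numbers above; the latencies are
     * the made-up figures from the comment, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        double latency = 0.0005;   /* 0.5 ms per operation */
        double saved   = 0.0001;   /* the hypothetical 0.1 ms shaved off by polling */

        double before = 1.0 / latency;            /* ~2000 ops/s */
        double after  = 1.0 / (latency - saved);  /* ~2500 ops/s */

        printf("%.0f -> %.0f ops/s (+%.0f%%)\n",
               before, after, 100.0 * (after - before) / before);
        return 0;
    }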


I fear this may have a negative impact on power consumption. If I'm not wrong, the point of having interrupts was to save the CPU cycles wasted on checking status, i.e. polling. Overall, I think it would be interesting to see some benchmarks in real scenarios.

EDIT: Thinking a bit more about it, interrupts were introduced when CPUs were much slower than they are now. So the tradeoff I'm worried about probably isn't that bad.


The power difference may actually be immeasurable, or even an improvement on some systems. Switching from one power state into a lower-power state may actually involve one-time energy payments, which are eventually negated by the power savings as time progresses. In an extreme scenario, a core might power-down its cache to reduce leakage current, which involves sending all of its pending writes onto some bus, which is costly, and then it'll be reading a lot of data back in from elsewhere once it is powered back on as it refills the cache.

In the worst case, you might spend more time and energy switching between power states than you actually spend in a lower power state.

But indeed, it does seem counter-intuitive even with that, as there are often power mode changes available that would pay off in just a few microseconds. It sounds to me like x86 may suffer from a limited IRQ system - there are other systems out there in which IRQ overhead is < 10 cycles.


Aha. Very interesting. Thanks so much for taking the time to spell that out for me.


Processing CPU interrupts has overhead. Previously, for storage devices you always got one CPU interrupt per I/O operation. If you have many SSDs doing some ridiculous number of IOPS, then you spend a lot of CPU time handling all those interrupts.

This change means that when the I/O load is high there is no longer one CPU interrupt per I/O operation; instead, multiple operations are processed at once, so the CPU has more time to run user-space applications that actually do something with all that data.
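
As a rough illustration of why "a lot of CPU" (both numbers below are invented for the sake of the example, not measurements of any real driver):

    /* Hypothetical cost of taking one interrupt per I/O at SSD rates. */
    #include <stdio.h>

    int main(void)
    {
        double iops        = 1e6;   /* assumed: 1M I/Os per second across fast SSDs */
        double irq_cost_us = 5.0;   /* assumed: 5 us of CPU work per interrupt */

        /* CPU-seconds spent per wall-clock second just servicing interrupts,
         * i.e. how many cores that work would fully occupy. */
        double cores = iops * irq_cost_us / 1e6;
        printf("~%.1f cores spent purely on interrupt handling\n", cores);
        return 0;
    }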


Completely different. This is all about how the system handles reads and writes.

The original behavior was that the OS/hardware would tell your program when it had data ready (hardware-limited to hundreds of times/sec). This was changed to your program showing up occasionally with a large truck to load data (basically CPU-limited).

The old way was perfectly fine for mechanical hard disks, but with SSDs they were running into limits on how often you can process interrupts (think of them as a hardware-level context switch: very expensive).


I don't know a whole lot about this area either, but it looks more like this allows the queueing system used by each block device to offer a choice to the user (via sysfs). The user can set it to use this new mechanism if they want, which should give significant speed gains for devices with little latency between a read/write request and the data actually being read or written.

In short: I don't think this has anything to do with programs at all; it looks like nothing changes for userspace, and I believe it's just the locking mechanism on the device.
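
For what it's worth, the sysfs knob in question appears to be exposed as /sys/block/<device>/queue/io_poll (writable as root; check your kernel for the exact name). Once it's enabled, an ordinary synchronous O_DIRECT read like the sketch below is the kind of I/O the kernel can complete by polling. The device path is just a placeholder, and nothing in the program itself changes, which is why this stays invisible to applications:

    /* Plain synchronous O_DIRECT read; the kernel may complete it via the
     * polled path if the queue has polling enabled. Device path is a
     * placeholder and reading it usually requires root. */
    #define _GNU_SOURCE              /* for O_DIRECT on glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1";

        int fd = open(dev, O_RDONLY | O_DIRECT);   /* bypass the page cache */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* O_DIRECT needs an aligned buffer; 4096 is safe for most devices. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        /* If polling is enabled for this queue, the kernel may spin for this
         * specific completion instead of sleeping until the IRQ arrives. */
        ssize_t n = pread(fd, buf, 4096, 0);
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes from %s\n", n, dev);

        free(buf);
        close(fd);
        return 0;
    }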


I/O will hopefully be "faster" for PCIe SSDs and maybe other devices, with no other differences visible to non-kernel developers.


Where this could be interesting is in cloud/VM environments where the block device may actually be mounted over the network.

The performance improvements for fast devices cited in a link from the article [1] are pretty dramatic, but I wonder how slow a device needs to be before polling becomes a problem. That same link mentions that slow devices also benefit, but speculates that it may be because the CPU isn't able to go into a deeper sleep state.

[1] http://lwn.net/Articles/663543/


I would think that you'd have the network latency profile and the disk latency profile, and that they would add in amusing (read: non-intuitive) ways. I read "mounted" to indicate a drive, which isn't always a good assumption.



