A new kernel polling interface (lwn.net)
129 points by doener on Aug 27, 2018 | 44 comments



Does anybody know why they won't just copy kqueue? kqueue seems to be a very good interface to me. It can batch multiple events and operations in a single syscall. It supports many types of events, not just file descriptor readiness changes. Lots of people are already familiar with it and there is already code out there to take advantage of it.
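
For reference, a single kevent() call can both apply a batch of changes and block for results. A minimal sketch (FreeBSD/macOS; watching stdin plus a timer, purely for illustration):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int kq = kqueue();
        struct kevent changes[2], events[8];

        /* Watch stdin for readability and arm a 500 ms timer. */
        EV_SET(&changes[0], STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);
        EV_SET(&changes[1], 1, EVFILT_TIMER, EV_ADD, 0, 500, NULL);

        /* One syscall: apply both changes and wait for up to 8 events. */
        int n = kevent(kq, changes, 2, events, 8, NULL);
        for (int i = 0; i < n; i++)
            printf("event: ident %lu, filter %d\n",
                   (unsigned long)events[i].ident, events[i].filter);
        close(kq);
        return 0;
    }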


Kqueue doesn't support submission to begin with, there's no kernel<->ring-buffer support (which is the point of this article), I'm not sure if it can be nested, and the set of features it supports differs fairly wildly from e.g. epoll's.

Meanwhile, I agree: it's a lovely interface, and vastly less system-call-heavy than epoll.


> Kqueue doesn't support submission

What do you mean by this? If you mean that kqueue doesn't support adding or modifying events simultaneous with querying, then as the OP said it does.

> no kernel<->ringbuffer support which is the point of this article

That's an implementation detail. Indeed, a 2007 Linux kqueue patch used ring buffers. See https://lwn.net/Articles/233462/

> I'm not sure if it can be nested

Can you poll on kqueue descriptors recursively? (That is, install a notification event for kqueue descriptor C with kqueue B, which in turn is installed with kqueue A, such that readiness for kqueue C bubbles up to kqueue A.) Yes, you can. Same thing with Solaris port descriptors.
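
A minimal sketch of that nesting (FreeBSD/macOS; the stdin watch at the innermost level is just for illustration):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        int kq_a = kqueue(), kq_b = kqueue(), kq_c = kqueue();
        struct kevent ev, out;

        /* Watch stdin inside the innermost queue... */
        EV_SET(&ev, STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq_c, &ev, 1, NULL, 0, NULL);

        /* ...then install kq_c in kq_b, and kq_b in kq_a. */
        EV_SET(&ev, kq_c, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq_b, &ev, 1, NULL, 0, NULL);
        EV_SET(&ev, kq_b, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq_a, &ev, 1, NULL, 0, NULL);

        /* Blocking on the outermost queue wakes when stdin is readable. */
        return kevent(kq_a, NULL, 0, &out, 1, NULL) == 1 ? 0 : 1;
    }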

> cross section of features it supports differs fairly wildly from e.g. epoll.

kqueue is an API framework for exposing pollable notification events (i.e. event filters in kqueue parlance). If the semantics of an event type are different, just use a different identifier. Different BSDs support different identifiers (e.g. macOS supports polling on a Mach port with EVFILT_MACHPORT) or different flags (notes in kqueue parlance) to control the semantics of pre-existing identifiers.

Expecting Linux to adopt kqueue is unreasonable, especially at this point when Linux has reproduced many (but hardly all) of the common kqueue event filters. What remains irksome, however, is how epoll and friends have ignored the design decisions and real-world experience of kqueue. Most annoying IMO is the behavior of epoll on fork. This despite the fact that kqueue was both ridiculously well documented and mature by the time epoll came out, not to mention the simple fact that the semantics are just atrocious for anyone familiar with Unix programming.[1]

[1] It seems obvious to me that this poor behavior of epoll was an implementation shortcut. Other poor behaviors were premature optimizations. All turned out, IMHO, to have been entirely unnecessary. Tragic considering how Linux once championed the idea that most of the time (if not all of the time) you can make the convenient or most sensible semantics at least as performant as designs that sacrifice ergonomic, composable semantics at the altar of optimization (and in particular optimizations with very specific, very niche use cases in mind). This notion is still preached in the Linux world but not practiced as much, a consequence of corporate influence. (You can see the same thing in FreeBSD, though to a lesser extent. FreeBSD is better at polishing their turds.)


> What do you mean by this?

OP's comment is in the context of an article describing extensions to AIO to support polling; one of the benefits of that mechanism is that a single call can submit new IO operations simultaneously with registering readiness notifications. AFAIK neither epoll nor kqueue supports that.
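
If I'm reading the proposal right, usage would look roughly like this (a sketch using raw syscalls against the 4.19 uapi header; the wrapper functions, fds, and my reading that the poll mask goes in aio_buf are illustrative, not gospel):

    #include <linux/aio_abi.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <poll.h>

    /* Thin wrappers; glibc doesn't expose these syscalls directly. */
    static long sys_io_setup(unsigned nr, aio_context_t *ctx)
    {
        return syscall(SYS_io_setup, nr, ctx);
    }
    static long sys_io_submit(aio_context_t ctx, long nr, struct iocb **ios)
    {
        return syscall(SYS_io_submit, ctx, nr, ios);
    }

    /* Queue a read on file_fd and a poll on sock_fd in one io_submit(). */
    long submit_read_and_poll(int file_fd, int sock_fd, void *buf, size_t len)
    {
        aio_context_t ctx = 0;
        if (sys_io_setup(32, &ctx) < 0)
            return -1;

        struct iocb rd, pl;
        memset(&rd, 0, sizeof rd);
        rd.aio_lio_opcode = IOCB_CMD_PREAD;   /* an ordinary async read */
        rd.aio_fildes     = file_fd;
        rd.aio_buf        = (unsigned long)buf;
        rd.aio_nbytes     = len;

        memset(&pl, 0, sizeof pl);
        pl.aio_lio_opcode = IOCB_CMD_POLL;    /* readiness notification */
        pl.aio_fildes     = sock_fd;
        pl.aio_buf        = POLLIN;           /* poll mask lives in aio_buf */

        struct iocb *batch[2] = { &rd, &pl };
        return sys_io_submit(ctx, 2, batch);  /* one syscall for both */
    }

Completions would then presumably be reaped with io_getevents(), or consumed straight from user space with the ring-buffer mode the article describes.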

Thanks for weighing in, plenty to digest here

Re: epoll-on-fork, do you mean the behaviour where it associates with the wrong kernel object? I knew about that one, although fork sounds curious, but I think you're probably just referring to the trouble fd leaks can cause due to the former issue


Linux is the shining example of NIH syndrome.

Everyone else (IOCP, kqueue, etc) solved this problem. And then, Linux created the famously broken epoll().


I downvoted this because you provided no evidence that either IOCP or kqueue was the one true solution, no evidence that the epoll designer was even aware of the existence of kqueue, and, if they were, no evidence that it was relevant to their work at all.

There is no reference to the context behind rejecting a kqueue-like API (the opinion at the time was that it looks a lot like ioctl, which is very true).

There is no rationale for why IOCP or kqueue is obviously a better design for Linux compared to epoll etc etc.

But just chalk it up to NIH, because that is the easiest and cheapest explanation for just about anything.

FWIW, in the context of network APIs before kqueue, there was the then-standard STREAMS, and the decision-making that led to Linux having epoll is the same that led to it avoiding the tragedy that was STREAMS. But if Linux had STREAMS today and not epoll or kqueue, the people who cry NIH and not-following-standards today would instead be crying about how much STREAMS sucks and asking why Linux doesn't do its own thing.


> There is no rationale for why IOCP or kqueue is obviously a better design for Linux compared to epoll etc etc.

I can, I think, partially address this, though unfortunately in the negative.

My understanding of IOCP is that the model requires the following: if you want to, say, receive data on a socket, you supply the IOCP w/ a buffer/length. After some data has been received, the buffer and the amount received are returned to you.

The problem with this model is that the buffer is effectively locked up in kernel space until the I/O is complete. Compare that to readiness notifications (select() and derivatives, such as poll/epoll/kqueue), where you can share the same receive buffer among all receiving sockets. (You might need to buffer things like partially received commands, but you can perhaps use a much smaller buffer there, and/or only allocate it in the event you require it.)

It has been a long time since I've done Windows programming, so please correct me if I've gotten the above wrong. But that's a fundamental difference and advantage that epoll/kqueue have over IOCP.

Now, I do think the IOCP model is very much conceptually simpler, and I would guess that it is easier to write correct IOCP code than epoll code, but at the expense of memory in some situations. But IOCP doesn't (I don't think) cover as many situations as epoll/kqueue.


People confuse IOCP with Windows' Overlapped I/O. kqueue supports completion notifications; it all depends on the semantics of the event filter and its flags. Likewise for epoll. For example, both kqueue and epoll support pollable I/O completion notifications for AIO, which is the analogous Unix API to Overlapped I/O. (Similarly, people conflate implementation details with architecture, such as when people explain that AIO isn't like Overlapped I/O by describing how AIO and Overlapped I/O are implemented, without explaining how the API necessarily makes it so.)

The benefit of IOCP and Overlapped I/O on Windows isn't the design. The benefit is that it comes complete out of the box, whereas on Linux and *BSD you either need to roll your own or supplement inconsistent kernel interfaces that people tend to avoid. But both IOCP and Overlapped I/O are higher-level APIs than traditional Unix readiness notification. The problem on Windows is that there's nothing like epoll or kqueue, which is critically important when you're trying to write library code that works with different event models. (Even on Windows, Overlapped I/O isn't always ideal, especially in libraries trying to avoid callbacks or to support multi-threading strategies different from those dictated by Overlapped I/O.) Windows does implement something equivalent to traditional Unix readiness notification internally--it's how IOCP and Overlapped I/O are implemented--but it's unpublished and exceptionally opaque. See https://github.com/piscisaureus/wepoll


>My understanding of IOCP is that the model requires the following: if you want to, say, receive data on a socket, you supply the IOCP w/ a buffer/length. After some data has been received, the buffer and the amount received are returned to you.

And then you have a language with GC and you want async IO, and your runtime becomes so messy with IOCP support that you just use poll on Windows, as many languages do. kqueue is fine, but messy and incomprehensible, and I don't see how it is fundamentally better than epoll with ET.


> Everyone else (IOCP, kqueue, etc) solved this problem

Wasn't IOCP taken essentially intact from VAX/VMS? Like, from the 1970s?

I don't understand why someone wouldn't have been aware of it when creating epoll().


There was a Linux patchset a long time ago (called kevent) which was pretty much a kqueue clone. There was a lot of discussion, but in the end those who opposed it won. IIRC, the idea of fitting everything kqueue does (a lot) into a single syscall with a single interface was one of the things they didn't like.


There are other major benefits of kqueue() when using sockets as a client that have essentially "forced" me to use BSD instead of Linux/epoll...

The way epoll() is written triggers so many syscalls that things like socket clients can get woken up for every single packet, whereas kqueue coordinates with other options set on the descriptor, such as not returning until X bytes have been received and/or the server has closed the connection.

...not to mention how much more tightly it's integrated with the rest of the IO subsystem.


I might be missing something, but Linux caters to this already:

   SO_RCVLOWAT and SO_SNDLOWAT
          Specify the minimum number of bytes in the buffer until the socket
          layer will pass the data to the protocol (SO_SNDLOWAT) or the user
          on receiving (SO_RCVLOWAT). These two values are initialized to 1.
On the transmit side:

   EPOLLET
      Sets the Edge Triggered behavior for the associated file descriptor.
      The default behavior for epoll is Level Triggered. See epoll(7) for
      more detailed information about Edge and Level Triggered event
      distribution architectures.
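
Combining the two would look roughly like this (a minimal sketch; the 4 KB threshold and fd names are made up):

    #include <sys/epoll.h>
    #include <sys/socket.h>

    int watch_socket(int epfd, int sock_fd)
    {
        /* Don't consider the socket readable until >= 4096 bytes queue up. */
        int lowat = 4096;
        if (setsockopt(sock_fd, SOL_SOCKET, SO_RCVLOWAT,
                       &lowat, sizeof lowat) < 0)
            return -1;

        /* Edge-triggered: one wakeup per readiness transition. */
        struct epoll_event ev = { .events = EPOLLIN | EPOLLET,
                                  .data.fd = sock_fd };
        return epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);
    }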


It's also worth noting that kqueue() doesn't use the socket options for this. Instead, it uses the EVFILT_READ filter, whose low-water mark can be set per event with the NOTE_LOWAT flag. More here: https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2
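
Something like this, if memory serves (a sketch for FreeBSD; 4096 is an arbitrary threshold):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    int watch_with_lowat(int kq, int sock_fd)
    {
        /* NOTE_LOWAT: don't report EVFILT_READ until >= 4096 bytes queue up. */
        struct kevent ev;
        EV_SET(&ev, sock_fd, EVFILT_READ, EV_ADD, NOTE_LOWAT, 4096, NULL);
        return kevent(kq, &ev, 1, NULL, 0, NULL);
    }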


Correct. You can set those socket options and change the triggering behavior of epoll. But those socket options do not actually affect how epoll handles the data when it comes in. They are completely ignored by epoll.


Not sure how old your experience is, but I tested from Python via epoll a few minutes after posting (never used that option before!) and it worked as described.


Behind the scenes, Python may have another layer in there to check the socket options, etc., before returning. I.e., just because it looks like it's working at the Python level doesn't mean it's working the same way underneath.


My experience is from several years of writing high-performance HTTP clients in pure C.


This changed recently so both of you are right in different ways.


Do you have a link to either an announcement of this change, or C code that demonstrates it? If it has changed, I might start doing some things differently...



I dealt with all of this in 2015, and spent three months trying to get epoll to perform similarly to kqueue; namely, to avoid returning on any and every packet received and, preferably, to abide by some sort of minimum bytes received (or connection closing) before returning.

Unless I'm missing something (and I could very well be missing something), I'm not seeing how that is any different than anything I tried...and I went through every socket option I could find, even if it seemed irrelevant.

Have you written code that actually performs this way? Or are you speculating? I really do wish I could do this with epoll, but I (and several others) just never were able to figure it out. And kqueue simply performed at least an order of magnitude better than any variation I could hack together with epoll.


I had neither written code to test it nor was I speculating - I was reading the kernel source, which is pretty clear (if you know in general how file polling is implemented in the Linux kernel).

However, because of what you have observed I wrote up a quick test, which confirms to my satisfaction that epoll_wait() respects SO_RCVLOWAT on TCP sockets: https://gist.github.com/keaston/d2473b8b996a34a5860b0744684f...

It seems likely that you were dealing with an old "enterprise" kernel in 2015 that hadn't had this fix backported to it?


Wow. When I saw your code, I thought to myself "yeah, but does it work when the socket is for a client?" Turns out, it does.

My only guess is that I was botching something else when I was digging into this a few years ago.

I can't tell you how much I appreciate you putting that together. This seriously changes a lot of what I work on. Thank you.

The client code I wrote: https://gist.github.com/benwills/95ee844853d3b18588fb3df7d56...


Good example.

The errors are just because you need to check for EAGAIN or EWOULDBLOCK when read() returns -1, and not bail out when that happens.
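
In other words, something like this in the read loop (a sketch; the helper name and return codes are just illustrative):

    #include <sys/types.h>
    #include <unistd.h>
    #include <errno.h>

    /* Returns 0 when the socket is drained, 1 on peer close, -1 on error. */
    int drain(int fd, char *buf, size_t len)
    {
        for (;;) {
            ssize_t n = read(fd, buf, len);
            if (n > 0)
                continue;                 /* process the n bytes, keep reading */
            if (n == 0)
                return 1;                 /* peer closed the connection */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;                 /* drained; go back to epoll */
            if (errno == EINTR)
                continue;                 /* interrupted; retry */
            return -1;                    /* a real error */
        }
    }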


AIO != epoll()


Given that I wasn't asserting that it is, I'm not sure what you're getting at. I was responding to the parent comment.


I guess I misunderstood. It sounded like you were limited by the epoll() interface and felt forced to kqueue() and BSD... but you have other alternatives...


AIO also can batch multiple events and operations in a single syscall.


The punchline:

> Multiple notifications can be consumed without the need to enter the kernel at all, and polling for multiple file descriptors can be re-established with a single io_submit() call. The result, Hellwig said in the patch posting, is an up-to-10% improvement in the performance of the Seastar I/O framework. More recently, he noted that the improvement grows to 16% on kernels with page-table isolation turned on.


> But sometimes three is not enough; there is now a proposal circulating for a fourth kernel polling interface

I’m not convinced. Can anybody explain what’s the set of desirable properties of polling interfaces, and why we need at least four different interfaces to implement all of them?


> I’m not convinced. Can anybody explain what’s the set of desirable properties of polling interfaces, and why we need at least four different interfaces to implement all of them?

I'll give you a partial answer.

One desirable property is that each iteration of your event loop is not O(n) with the total number of descriptors being watched. select() and poll() are flawed for that reason—the entire list of file descriptors is passed in on each iteration and has to at least be compared to the previous iteration. No one writes things using these ancient interfaces anymore. (That's a bit unfortunate given that all the modern interfaces are single-platform, so everyone who cares about portability needs an abstraction layer, but it is what it is.) epoll() is better.
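
To make the difference in shape concrete, a rough sketch of the epoll pattern (register once, then wait repeatedly; error handling omitted):

    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    void event_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);  /* register once */

        struct epoll_event ready[MAX_EVENTS];
        for (;;) {
            /* Only descriptors that actually became ready come back. */
            int n = epoll_wait(epfd, ready, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                /* handle ready[i].data.fd ... */
            }
        }
    }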

The kernel doesn't break old programs, so old interfaces stick around, basically no matter how bad they are. There are three interfaces now, so there have to be at least three. Whether there can be only three or there have to be four comes down to whether epoll is (or can become) good enough. If folks keep coming up with new requirements that can't be met with existing interfaces, there will just be more and more interfaces over time...


It's still reasonable to write things using the old select() or poll() interfaces, as long as it isn't something that has to scale up to thousands of file descriptors.


So, the answer is “we don’t think we need three, let alone four, but we happen to have two bad ones that we don’t want to get rid of”?

If those old ones don’t have any unique properties, couldn’t they be implemented on top of a single syscall, or is the syscall interface sacred on Linux, and calls cannot be retired?


What tends to happen is that the kernel is changed to implement the old syscall using the new infrastructure on the kernel side. There's no real cost to having another syscall number used.


But it grows the amount of code in the kernel, making it more likely to be buggy, more so if some of these old interfaces get used less and less (and, hence, likely tested less and less), and the underlying root interface gets refactored again and again to support new polling interfaces.


We are generally talking about very simple wrapper functions here that don't have a lot of scope for hidden bugs to creep in.


The syscall interface is considered to be Linux's public API, and must very strictly never break backward compatibility. They cannot retire syscalls.


Is there no way to instrument old source so the old interface compiles into the new interface, given it's proven equivalent?


I can tell you some undesirable properties, which is what keeps motivating people to think of new ones.

Bad property 1. Limited number of things you can block on. select(2) can only do FD_SETSIZE descriptors.

Bad property 2. Entire list of descriptors needs to be re-examined on every blocking call. select(2) and poll(2) have this problem. Entering the syscall, the kernel needs to copy the list of descriptors, and after returning from it, the application needs to scan the list to see what changed - these are both O(N) operations. (Aside: Windows's WaitForMultipleObjects suffers from the former problem but not the latter, but has bad property #1 since it's limited to 64 handles.) epoll and kqueue fix this by having the kernel maintain the list of descriptors across calls, notifying you of which ones change.

Bad property 3. Too many syscalls, which epoll suffers from and this latest one is trying to solve. kqueue(2) tries to mitigate this by allowing you to batch add/remove descriptors, rather than doing a separate syscall for each.


Backward compatibility. Linux can never remove the older interfaces in this case. Only extending them or creating new ones is possible. If you want new properties, you need a new interface.


And with the kernel showing up in things like SpaceX rockets, that is the sensible solution.


This is from January...


Yes, but the new feature will only come in Linux 4.19, which has only been released as rc1 so far:

https://lore.kernel.org/lkml/CA+55aFw9mxNPX6OtOp-aoUMdXSg=gB...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...



