Arrakis: The Operating System Is the Control Plane

Animats · on Nov 16, 2014

We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation.

That's the way IBM mainframes have worked since IBM VM, 1972.

There's a lot to be said for this. For historical reasons, most microprocessor systems have an I/O architecture that's far too much like that of a 1970s minicomputer - device registers with meaning determined by the peripheral. Minicomputers had that because they couldn't afford enough transistors to do mainframe-type channels. That problem went away a long time ago, but the legacy architecture remains.

You need both channels with access control and peripherals designed for channels to make this work without OS intervention. Peripherals may have privileged functions - user programs may be restricted to an address range on a disk, or an IP and MAC address on a network controller.
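To make the "address range on a disk" restriction concrete, here is a minimal sketch of the check a channel or virtualized controller would have to apply to every request. The `VirtualDisk` class and its fields are hypothetical names for illustration; real hardware does this in the device, not in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualDisk:
    """Hypothetical per-application view of a disk: one contiguous block range."""
    first_block: int   # first physical block assigned to this application
    block_count: int   # number of blocks the application may touch

    def translate(self, block: int, count: int = 1) -> int:
        """Map an application-relative request to a physical block number.

        Raises PermissionError if any part of the request falls outside
        the range assigned to this virtual disk.
        """
        if block < 0 or count <= 0 or block + count > self.block_count:
            raise PermissionError(
                f"blocks {block}..{block + count - 1} are outside the assigned range"
            )
        return self.first_block + block


def try_translate(disk: VirtualDisk, block: int, count: int = 1):
    """Return the physical block number, or None if the request is rejected."""
    try:
        return disk.translate(block, count)
    except PermissionError:
        return None
```

Because the check depends only on state set up in advance (the assigned range), it can be enforced on each operation without calling into the OS, which is the point being made above.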

justincormack · on Nov 16, 2014

SR-IOV is pretty much designed like this. The master PCI function has all the privileged ops while the slave devices have none.

The Intel network cards (all the 10g ones, some 1g) at least have IP and MAC filtering, and support at least 64 virtual network cards for each physical port.

It has taken longer for storage, but NVMe has SR-IOV, which means you can split out virtual drives without the OS having to check block ranges. Not widely available yet, although Google cloud now has support[1].

[1] https://cloud.google.com/compute/docs/local-ssd

stevelaz · on Nov 15, 2014

This is pretty awesome! In the past I worked on a project in which we implemented a control-plane/data-plane separation architecture within Linux. The system had two CPUs: Linux ran on one with the configuration apps, and all network-related IO ran on the other. The problem was that this was implemented at the kernel level, and whenever an application needed to share data across the CPUs it was slow. The implementation could probably have been better, but this was a long time ago and I can't recall everything that was done. Regardless, Arrakis looks like a great project with a lot of potential.

Imagine this type of stack being used in an embedded system. I've worked on embedded projects that achieved high throughput, but in most cases there were FPGAs and DSPs doing a lot of work to help. Userspace-to-kernel context switch delays have always been a latency issue with any embedded Linux system I've worked on.

With Arrakis, it looks like one would be able to achieve high performance without the need for FPGAs and/or DSPs (depending on the use case, of course).

Side note: Cool, I noticed they're using lwip from Adam Dunkels. He's an amazing programmer.

walterbell · on Nov 15, 2014

This is using hardware isolation (IOMMU, SR-IOV) to reduce the need for software (including kernel) isolation.

See also Intel SGX (https://www.virusbtn.com/virusbulletin/archive/2014/01/vb201...), disaggregation platforms (seL4, Qubes, Genode) and userspace networking (Intel DPDK, https://01.org/packet-processing).

jarcane · on Nov 15, 2014

The main website: https://arrakis.cs.washington.edu/ Github: https://github.com/UWNetworksLab/arrakis

Intriguing.

shmerl · on Nov 16, 2014

Are there any bootable images to play with? Or is it all build-from-source at this stage?

marknadal · on Nov 15, 2014

This seems like genuinely good stuff. I'm not a systems guy, so question: How long would it theoretically take for somebody to hack NodeJS to use this? What about getting NodeJS to use zero-copy buffers as mentioned, by somehow overriding JSON.stringify? Thanks.

Animats · on Nov 16, 2014

Don't get too excited about zero-copy. Copying is cheap if the data was recently referenced and is in cache. Conversely, if the data was put in memory by a peripheral device, it costs almost as much as a copy to get it into the CPU's caches.

The bookkeeping associated with zero-copy often exceeds the copying cost. This was the curse of the original message-passing Mach implementation, which gave microkernels a bad name.
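A small sketch of where that bookkeeping comes from, using Python's `memoryview` as a stand-in for a zero-copy buffer (the function names and buffer sizes are made up for illustration). A copy is independent of its source the moment it is made; a view stays tied to the original buffer, so something has to track when that buffer may be reused.

```python
# Stands in for a receive buffer that a device fills and later refills.
rx_buffer = bytearray(b"x" * 4096)


def take_copy(buf: bytearray, start: int, end: int) -> bytes:
    """Copying path: the result owns its own bytes."""
    return bytes(memoryview(buf)[start:end])


def take_view(buf: bytearray, start: int, end: int) -> memoryview:
    """Zero-copy path: the result aliases the buffer it was taken from."""
    return memoryview(buf)[start:end]


copied = take_copy(rx_buffer, 16, 2048)
view = take_view(rx_buffer, 16, 2048)

# The buffer is reused for the next message while both results are still held.
rx_buffer[16] = ord("y")

copy_unchanged = copied[0] == ord("x")   # the copy still holds the old data
view_changed = view[0] == ord("y")       # the view now sees the new data
```

To keep the view safe, the buffer's owner has to know every outstanding reference before reusing it. That tracking is the cost being weighed against a plain copy; this sketch shows only the aliasing, not the relative timings.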

marknadal · on Nov 16, 2014

What are your thoughts on http://kentonv.github.io/capnproto/ ?

Animats · on Nov 16, 2014

That the sender can probably crash the receiver with a malformed offset in a message.

I'd like to see marshalling as a language feature. It's compilable, done often, and has an effect on performance. Many marshalling systems, from OpenRPC to protocol buffers, use a precompiler. But that adds another level of language.

geofft · on Nov 16, 2014

> That the sender can probably crash the receiver with a malformed offset in a message.

Errr, that would be a major bug in capnproto. While it's definitely possible the software has bugs, it's certainly a design constraint that the sender absolutely cannot crash the receiver.

http://kentonv.github.io/capnproto/faq.html#arent-messages-t...
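For readers following the exchange above, here is a minimal sketch of what "a malformed offset cannot crash the receiver" means in practice. The wire format is a toy one invented for this example (an 8-byte header holding an offset and a length), not Cap'n Proto's actual encoding, and `read_field` is a hypothetical helper.

```python
import struct

# Hypothetical header: two little-endian u32 values, (offset, length),
# both relative to the body that follows the header.
HEADER = struct.Struct("<II")


def read_field(message: bytes) -> bytes:
    """Return the field the header points at, rejecting out-of-bounds pointers."""
    if len(message) < HEADER.size:
        raise ValueError("message shorter than header")
    offset, length = HEADER.unpack_from(message, 0)
    body_len = len(message) - HEADER.size
    # Validate against the received size before touching the body.
    if offset > body_len or length > body_len - offset:
        raise ValueError("offset/length points outside the message")
    start = HEADER.size + offset
    return message[start:start + length]


def is_well_formed(message: bytes) -> bool:
    """True if the sender-supplied offset and length stay inside the message."""
    try:
        read_field(message)
        return True
    except ValueError:
        return False
```

Python slicing would not read out of bounds even without the check; in C or C++ the same unchecked arithmetic would be an out-of-bounds read, which is why the validation has to happen on every pointer taken from untrusted input.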

cwp · on Nov 15, 2014

Impossible to know up front.

I think one issue you'd run across when doing something like that is that Node is not an application in the sense that Redis is an application. It's mostly a JIT-compiler for applications that are written in Javascript. Porting a Node application to run on Arrakis would probably involve changes to both Node and the Javascript application code, particularly if you want to run close to the hardware performance limits. That's doable, but I doubt you could preserve the APIs that Node currently exposes to applications. (The FS module would be especially problematic!)

What could be useful is to ditch Node's existing IO APIs and provide bindings to the Arrakis IO library directly.

binarymax · on Nov 15, 2014

I'm not a systems guy either, but from watching the video, I don't think it would be applicable. The examples used were Redis and Memcached, which, while both having rich APIs, are very much purpose-built for data tasks. The purpose of Arrakis is to remove the kernel layer from the data-plane for minimum overhead when doing that type of I/O.

Node is not a data specific application and won't live entirely in the data-plane. It is general purpose and libuv especially has a lot of moving parts that might blur between the data-plane and control-plane. I would imagine you would have better luck building a unikernel dedicated for node than to try and shoehorn node into Arrakis.

arstneio · on Nov 15, 2014

The main enabler seems to be the fact that isolation is now redundantly enforced in two places - both by the kernel, and by IO device drivers. In return for making the hardware support non-optional, Arrakis eliminates the kernel's involvement.

rbanffy · on Nov 15, 2014

> The main enabler seems to be the fact that isolation is now redundantly enforced in two places

Can the kernel realize it and, if the hardware can manage isolation with OS semantics, step aside?

This isolation support in hardware reminds me of the "Software on Silicon" thing Oracle has shown on their new SPARC. Is offloading more and more application and OS level logic to hardware going to explode?

MrDom · on Nov 16, 2014

Darn, I thought this was going to be about audio consoles[1].

[1]: http://arrakis-systems.com

digi_owl · on Nov 16, 2014

So in essence this is akin to running DOS inside a hardware assisted VM?