This is pretty interesting, but unfortunately their intro page doesn't do a good job of explaining why. (For instance, they use the word "unprecedented" without really saying what makes Arrakis different from previous exokernel-like designs.) I'll try to summarize what I got out of skimming the paper[1]:
When you have multiple applications running on the same machine, you need some way to safely share resources between them; for example, incoming network packets are a resource. A kernel handles this by keeping a data structure mapping sockets to processes, and demultiplexing data that comes in from the network card. Hypervisors work the same way, except at the level of virtual machines rather than processes.
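To make that concrete, here's a toy sketch of the demultiplexing step in C - every name in it (the port-indexed table, net_rx_demux, etc.) is made up for illustration and is not how Linux actually does it:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of in-kernel demultiplexing: a table maps each port to the
     * socket that owns it; the driver hands every incoming packet to
     * net_rx_demux(), which queues it on the right socket for the owning
     * process to read later. */
    struct packet { uint16_t dst_port; struct packet *next; };
    struct socket { struct packet *rx_head, *rx_tail; };

    static struct socket *port_table[65536];     /* dst port -> owning socket */

    static void net_rx_demux(struct packet *pkt)
    {
        struct socket *sk = port_table[pkt->dst_port];
        if (!sk) return;                         /* no listener: drop it */
        pkt->next = NULL;
        if (sk->rx_tail) sk->rx_tail->next = pkt;
        else             sk->rx_head = pkt;
        sk->rx_tail = pkt;                       /* process drains this via recv() */
    }

    int main(void) {
        struct socket web = { 0 };               /* a "process" listening on port 80 */
        port_table[80] = &web;

        struct packet p = { .dst_port = 80 };
        net_rx_demux(&p);
        printf("queued: %s\n", web.rx_head ? "yes" : "dropped");
        return 0;
    }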
Arrakis does the same thing, but relies on hardware support in the network card to dispatch packets to the right process. This relies on a standard called SR-IOV[2] which allows the OS to configure a PCI device to present itself as multiple virtual subdevices. The kernel programs the NIC to dispatch packets to different buffers depending on the destination MAC address of each incoming packet; after that, packets can be dispatched with no kernel involvement at all. Similarly, you can tell a disk controller to present a particular extent of a disk as a new virtual storage device.
The blurb about memory protection seems to be a red herring, because as far as I can see they haven't done anything to change that. There's still a kernel, which handles requests for resource mappings, and processes are still isolated from each other. But once they've requested the mappings that they need, the normal execution path doesn't involve any syscalls, and so there's no kernel overhead. The real contribution of the paper is designing an API around this idea and proving that real applications like Redis can be ported to it.
But where does one get SR-IOV devices to experiment with? It looks like the Intel 82576 chipset has SR-IOV and can be had in a $50 card: two ports, 8 filters per port. The Intel 82599 is a 10Gb chip with more filters per port. (With Linux SR-IOV support, it can manifest as multiple devices.)
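For the record, the standard Linux knob for spawning those virtual functions is the sriov_numvfs sysfs attribute. A minimal sketch - the interface name eth0 and the count of 4 VFs are assumptions for illustration:

    #include <stdio.h>

    /* Ask the Linux PF driver to spawn 4 virtual functions for an
     * 82576-style NIC. Each VF then shows up as its own PCI device /
     * network interface that can be handed to a guest or a process. */
    int main(void)
    {
        FILE *f = fopen("/sys/class/net/eth0/device/sriov_numvfs", "w");
        if (!f) { perror("sriov_numvfs"); return 1; }
        fprintf(f, "4\n");
        fclose(f);
        return 0;
    }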
Disk storage is less obvious. The LSI 2308 and 3008 controller chips probably support it, but I'm not finding a commodity card or a motherboard integrating one.
They mention in the paper that existing devices have problems which prevent them from actually securely isolating clients of the sub-devices. Combine this with the fact that there was a lot of web activity about SR-IOV in 2009 and not much now: either it became too common to mention, or it's dwindling into an idea that didn't catch on. The wait for secure SR-IOV might be interminable.
In the paper, they say they are using "an Intel MegaRAID RS3DC040 RAID controller with 1GB cache of flash-backed DRAM, exposing a 100GB Intel DC S3700 series SSD as one logical disk". I'm not familiar with it, but is that RAID controller insufficiently commodity?
Interesting, it's a fork of Barrelfish [1], which is the one-core-one-OS OS. When I first heard about it, it sounded like Multi-DOS (several instances of MS-DOS running at once), but it's a bit more sophisticated than that :-). Other than cache contention (which is always going to be a problem), it's an interesting approach.
"Applications are becoming so complex that they are miniature operating systems in their own right and are hampered by the existing OS protection model"
Sure, that's true for browsers, as they mention, and a few other degenerate cases (e.g. virtualization software?) - but that's certainly not the case for the vast majority of applications I run (text editor, terminal, mail client, IM client, etc.). How does this argument hold?
But the point still holds. Let's say the script interpreter of MS Office (or GIMP or Sublime, etc.) needs access to the hard drive. The system, no matter how locked down, still needs to give full access to the hard drive, unless it wants to break the app.
From there, the same exploits that were previously possible become possible again - if they break out of whatever sandbox is in place, they can access everything. I guess the OS might work better for apps that don't need these rights to begin with, but then those apps usually aren't much of a problem in regular OSes anyway.
The thing is, parts of the app might need access to the hard drive, but that doesn't mean the whole app needs it. For example, your email client as a whole needs hard drive access, but the email parser just needs a channel to receive the messages and return a data structure. So you can isolate it, and then if someone sends an email that exploits some parser bug to achieve code execution, it still can't delete or read your files.
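You can approximate that on Linux today with an ordinary process boundary plus seccomp. A minimal sketch - parse_subject() is a stand-in for the real (untrusted) parser, the rest is the real fork/pipe/prctl API; under SECCOMP_MODE_STRICT the child can only read, write, and exit, so even a fully compromised parser can only scribble on its pipes:

    #define _GNU_SOURCE
    #include <sys/prctl.h>
    #include <linux/seccomp.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <string.h>

    /* Stand-in for the real parser; imagine this is the code you don't trust. */
    static void parse_subject(const char *msg, char *out, size_t n) {
        const char *s = strstr(msg, "Subject: ");
        if (!s) { snprintf(out, n, "(none)"); return; }
        s += 9;
        size_t len = strcspn(s, "\r\n");        /* subject ends at the line break */
        if (len >= n) len = n - 1;
        memcpy(out, s, len);
        out[len] = '\0';
    }

    int main(void) {
        int in[2], out[2];                      /* parent -> child, child -> parent */
        pipe(in); pipe(out);

        if (fork() == 0) {                      /* sandboxed child */
            close(in[1]); close(out[0]);        /* must close before going strict */
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
            /* from here on, only read/write/exit are permitted */
            char msg[4096], subj[256];
            ssize_t r = read(in[0], msg, sizeof msg - 1);
            msg[r > 0 ? r : 0] = '\0';
            parse_subject(msg, subj, sizeof subj);
            write(out[1], subj, strlen(subj));
            syscall(SYS_exit, 0);               /* raw exit: allowed by strict mode */
        }

        close(in[0]); close(out[1]);
        const char *mail = "Subject: hello world\r\n\r\n...";
        write(in[1], mail, strlen(mail));
        char subj[256];
        ssize_t r = read(out[0], subj, sizeof subj - 1);
        subj[r > 0 ? r : 0] = '\0';
        printf("parsed subject: %s\n", subj);
        wait(NULL);
        return 0;
    }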
Most of the time the OS doesn't have a general-purpose interpreter in the kernel, even if it has an interpreter for important OS functionality (sh/bash/PowerShell). And even then, most of the time these interpreters aren't meant to be embedded in applications.
It's a fair point about PostgreSQL, but DBs that need maximum speed/control at times are the exception, not the rule.
As the OS gets thinner and more like a hypervisor, the apps and app-like VMs get more OS-like. One day the browser might have daemons.
I'd guess at the same time all these pieces will continue to become more distributed. Hypervisors and apps will present distributed environments (storage, processing, failover), even more than they do today.
At one time, all an Operating System did was load other programs into memory so the processor would run them instead. All of the sandboxing, garbage collection, security, and other features were added to make it more usable.
Correct. The usability gain was not having to (e.g.) write your own printer drivers if you wanted your program to use a printer. But if nowadays (e.g.) printer controllers are smart enough to take on most of the tasks that the OS abstracts from the processes running under it, then maybe we can turn that abstraction into a much thinner layer between the processes and the hardware.
It already does. Just write an extension and schedule a function. The one thing I don't think you can do yet is listen on a socket. At least not without doing a websocket connection to a separate service. Or bringing a native component with you.
Firefox extensions can, and always have been able to. ChatZilla, for example, has been talking directly to IRC servers from Mozilla browsers since 2000, and even server sockets are possible[1]. Native components are also possible - they're called XPCOM components. Or you can use js-ctypes[2].
In Linux, I imagine this would be implemented through cgroups. Since in systemd those are starting to play a prominent role in service configuration (for resource limiting and process location and control), this could eventually be quite easy to configure if implemented correctly.
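For anyone curious what that looks like under systemd's hood: at the filesystem level, a cgroup-v2 resource cap is just writes into /sys/fs/cgroup. A rough sketch, assuming cgroup2 is mounted in the usual place and we're running as root - systemd does the moral equivalent of this when you set MemoryMax= on a service:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_str(const char *path, const char *val) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return; }
        fputs(val, f);
        fclose(f);
    }

    int main(void) {
        char pid[32];
        mkdir("/sys/fs/cgroup/demo", 0755);                  /* create a new cgroup */
        write_str("/sys/fs/cgroup/demo/memory.max", "100M"); /* cap its memory */
        snprintf(pid, sizeof pid, "%d", getpid());
        write_str("/sys/fs/cgroup/demo/cgroup.procs", pid);  /* move ourselves in */
        return 0;
    }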
Reading the linked page, I feel this is geared towards VMs. I can only imagine this being convenient for machines running nothing but something like Docker.
If you like this concept, you may also find Mirage[1] interesting. Mirage compiles the application code into the kernel to run directly on the Xen hypervisor. (Thus system calls become ordinary function calls. They do some tricks to maintain security.)
It's interesting to see many concepts and general design philosophy of exokernels make their way into modern systems. Zero-copy, mmap, RDMA, vectored I/O, fibers/switchto, FUSE - all of these are attempts to push as many policy decisions into user space as possible so that the OS only deals with securely multiplexing the hardware.
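Vectored I/O is the smallest example of that philosophy - the application decides how the data is laid out, and the kernel just moves the bytes in one crossing. Using the real writev(2):

    #include <sys/uio.h>
    #include <string.h>
    #include <unistd.h>

    /* Hand the kernel several discontiguous buffers in one syscall, so the
     * policy of how data is laid out stays entirely in user space. */
    int main(void) {
        char hdr[]  = "HTTP/1.1 200 OK\r\n\r\n";
        char body[] = "hello\n";
        struct iovec iov[2] = {
            { .iov_base = hdr,  .iov_len = strlen(hdr)  },
            { .iov_base = body, .iov_len = strlen(body) },
        };
        writev(STDOUT_FILENO, iov, 2);   /* header + body, one kernel crossing */
        return 0;
    }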
The irony is that rather than new kernels, these are being added on as new APIs to the Linux kernel. I suppose that makes a lot of sense because it's much easier to expose a new syscall and see if it gets any adoption rather than convincing everybody to switch to a whole new OS.
It looks similar. MIT's Exokernel seems to use kernel-level hooks to control access to things like disk blocks and network packets. Arrakis takes a similar approach, but restricts those hooks to a subset that can be implemented in hardware instead of in the kernel.
In reply to many comments about the lack of safety of the approach: They claim that "[they] demonstrate that operating system protection is not contradictory with high performance".
Abstract of their latest paper:
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation.
We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing 2-5x end-to-end latency and 9x throughput improvements for a popular persistent NoSQL store [i.e. Redis] relative to a well-tuned Linux implementation.
>"The application gets the full power of the unmediated hardware, through an application-specific library linked into the application address space."
This is pretty concerning, actually. I don't think I want or trust shady companies like Adobe to be running DRM-laden code directly on my hardware.
Vendor lock-in is an increasingly common phenomenon and I'm picturing a really alarming future if this sort of OS takes off. I like that the linux kernel sits between my software and my hardware.
Want to watch a Sony DVD? Better hope you have a webcam so that the media player application can directly access your facial reactions to the media being played and upload it to Sony's servers.
There's no reason why you couldn't run those apps in a sandbox, even inside Arrakis. And if Sony wanted to force you to have a webcam, they could do it now - they don't need Arrakis.
Been done before (some would say to death), and the reason we have memory protection between applications is forgotten because people don't realize how nice it is. Sure, sure, your big "well engineered" web browser needs direct access to the hardware for speed, but painful experience has taught us that giving app programmers direct access to hardware is a recipe for failure. Besides, there are already plenty of workarounds to get faster (e.g., mmap) or even direct access to hardware from userland, not to mention the myriad of virtualization and protection schemes and levels in userland (e.g., SELinux). This seems like a solution in search of already solved problems. Although as a research project, it does seem interesting...
Arrakis doesn't eliminate memory protection or process isolation, nor does it give applications direct hardware access. It takes advantage of modern hardware virtualization capabilities to present certain virtual devices to the guest applications, eliminating the kernel from the I/O path for critical operations like receiving a packet.
Good points. Indulge me for a minute while I try to formulate what they might say in response: most of the complexity of an OS is device drivers.
You're right, the basics of process isolation, scheduling, and resource allocation are pretty dull.
Maybe the UW project makes the move to userspace official? At 1,000,000 packets per second it is more efficient to put the whole network stack in userspace. Assume similar gains can be had for all types of hardware, and shared libraries eliminate the overhead of loading the whole network stack every time you run 'ping'.
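The shape of a userspace receive path is roughly the following - all the structs are hypothetical, the point is just that "receive a packet" becomes a memory poll instead of a syscall:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical userspace receive path: the NIC DMAs packets into a ring
     * mapped into the application's address space, and the app polls the
     * descriptors instead of making a recv() syscall per packet. */
    struct rx_desc { volatile uint32_t ready; uint32_t len; uint8_t data[2048]; };
    struct rx_ring { struct rx_desc *descs; uint32_t size, head; };

    static void poll_ring(struct rx_ring *r, void (*deliver)(uint8_t *, uint32_t))
    {
        while (r->descs[r->head].ready) {      /* NIC set this via DMA */
            struct rx_desc *d = &r->descs[r->head];
            deliver(d->data, d->len);          /* app-level protocol code */
            d->ready = 0;                      /* hand descriptor back to NIC */
            r->head = (r->head + 1) % r->size;
        }
    }

    static void deliver(uint8_t *data, uint32_t len) {
        printf("got %u-byte packet\n", (unsigned)len);
        (void)data;
    }

    int main(void) {
        struct rx_desc descs[4] = { { .ready = 1, .len = 64 } };  /* pretend DMA */
        struct rx_ring ring = { descs, 4, 0 };
        poll_ring(&ring, deliver);
        return 0;
    }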
That's just a 30-second pitch for the idea. Userspace is not new (see: Mach). I think what is changing is the average lifetime of each process. Both desktops and servers are beginning to become a commodity to the point where the old reasons for not using a microkernel are starting to become irrelevant: e.g. message passing cost isn't as much of a problem if the file server runs 'smbd' 99% of the time and that is the only really performance-sensitive application. Likewise for a desktop, except the running process is 'Chrome', or for a tablet running 'dalvikvm'.
Really it's a move away from code complexity that wasn't possible before but is becoming possible.
So are we talking microkernel here or something else? Sorry to constrain by categorizing, but it's what I do sometimes :) and I am thinking of some OSes (mostly research/academic) that went this way (of no layers between userland and the hardware) that weren't microkernels, but I can't remember any names right now. I have to admit, part of what keeps my interest in this is that I used to dabble in OSes and haven't gotten to in quite some time :(
EDIT: Exokernel was what I was remembering (I think; https://en.wikipedia.org/wiki/Exokernel). I can see some of the advantages, but as a current maintenance programmer and previous RTLinux hacker, I have to say I have my doubts about giving most programmers direct access to hardware :) That being said, giving programmers (and admins!) easier access to containerization, virtualization, or just intra-process protection in general would be a good idea. As I've said, there are plenty of options right now, but none seem to be well documented or advertised. There was a CACM article a while back that predicted a future in which you had one smartphone, but it had split domains for work and personal apps and data. Interesting stuff, and as you say, finally becoming possible.
The idea behind exokernels isn't that you give "most programmers" direct access to hardware, it's that you give a library writer direct access to hardware, and then programs link in the libraries that are most appropriate for their application. Basically, separate protection from policy & abstraction: the kernel still provides protection, but the library handles all policy & abstraction details. So instead of having your database mimic disk blocks on top of a filesystem, just expose the raw disk blocks to the DB. Instead of having your garbage collector fight the virtual memory system, expose the TLB and physical pages to the language runtime. Instead of requiring that every read() go through the kernel, have the NIC write directly into application buffers.
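A weak form of the "raw disk blocks" idea already exists on Linux: open the block device with O_DIRECT and the page cache and filesystem policy get out of the way. A sketch - the device path is an assumption, and O_DIRECT wants aligned buffers, hence posix_memalign:

    #define _GNU_SOURCE              /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

        ssize_t n = pread(fd, buf, 4096, 0);   /* read block 0, no page cache */
        printf("read %zd bytes\n", n);
        free(buf);
        close(fd);
        return 0;
    }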
The big change that makes this desirable now but undesirable before is that machines are becoming increasingly single-tasked. When everybody owned a desktop and had a dozen programs on it, you needed to write an abstraction that could support all of those programs. When everybody just runs a web browser that connects up to a web server in the cloud which connects up to appservers and databases running on separate machines, each machine only needs to support one type of application. So let that app link in the abstractions it needs and only those abstractions, ensure that existing daemons don't trample over each other, and have the kernel get out of the way.
> So instead of having your database mimic disk blocks on top of a filesystem, just expose the raw disk blocks to the DB. Instead of having your garbage collector fight the virtual memory system, expose the TLB and physical pages to the language runtime. Instead of requiring that every read() go through the kernel, have the NIC write directly into application buffers.
This sounds great. Do you know of any literature/writings on these kinds of things for programming languages, specifically? For example, the thing you mentioned about language runtimes.
There's little about exokernels because to my knowledge no exokernel has had widespread production use. Lisp Machines had a garbage collector that was integrated into the kernel's pagefault system, though, which meant that most collections would not trigger page faults.
How does that George Santayana quote go? "Those who cannot remember the past are condemned to repeat it"?
I like seeing new ideas being tested in the OS arena. It's a real shame the two dominant OSs these days are Unix and VMS. I refuse to believe these two are the best humans can come up with.
From "The Rise of 'Worse is Better'", written in 1989:
"The good news is that in 1995 we will have a good operating system and programming language; the bad news is that they will be Unix and C++."
The predicted date was a little early, but otherwise he pretty much got it right. I think that, from a late 80s Lisp perspective, modern Windows fits into the "UNIX" category, and Java/C#/ObjC/whatever are close enough to C++ to count.
Windows and UNIX are not so different. Monolithic kernels written in C or C++, permissions largely done with user granularity, byte-addressed memory, virtual memory with per-process address spaces, no hardware support for tagged pointers or garbage collection.... Compared to the variety that's gone before, they look nearly identical in their fundamentals.
Hrm, I wonder what they were using for source control before making it available on GitHub. They've squashed all commits into one giant commit, which is really, really unfortunate, especially for people who might want to contribute or understand the code base better. There are tools to port commit history over into git; they should have used such a tool.
Or they found the history too messy, didn't want to publish full author information with email (possibly, because it would mean getting permission) and similar cases.
It can be bad to do that if there might have been some private stuff in the history, especially if it's pretty long. This is one of the things that led to the 4chan hack (a private key deep in the history).
Although this might be the best operating system ever in existence, I think it's quite hard to get anywhere in the desktop OS market. How can they find users? And if they don't find users, how can they find people to work with them on their software?
I think the idea is to give each app dedicated cores. (And maybe save one core for miscellaneous stuff.) If the app wants threads it can use a libOS that provides whatever flavor of threads it prefers.
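You can already fake the "dedicated core" half of that on Linux with the real sched_setaffinity API; the core number below is arbitrary, and the libOS threading is left as a comment:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin this process to core 3 and let the app bring its own scheduling. */
    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);
        if (sched_setaffinity(0, sizeof set, &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the app's own userspace threads on this core from here ... */
        return 0;
    }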
Until your application implements memory protection, etc... i.e. their premise is that modern complex applications contain an OS inside them. [I don't fully agree with that premise, though remembering the state of "nspr" even back in 1999, I'm not surprised that 10+ years later Mozilla didn't have a problem coming up with their own OS - I'm more surprised that it didn't happen much earlier :)]
As an operating system for a dedicated, single-purpose server this may be okay. As an operating system for mobile phones, this may one day be alright (when phones get about 12 cores). As an operating system for workstations and desktops, this is probably the worst idea I have heard in a long time. It sounds like a hipster version of Multi-DOS. At one application per core (or at least I think that's the idea), you are severely limiting the ability to multitask. So, on an 8-core system I have one core running the exokernel (1), another core running a GUI (2), another core running an audio application (3), another running my web browser (4), another with my editor (5), another with git (6), another with a torrent going (7), and another with email (8). Given the description, I am hoping that the GUI using a core and other applications having access to it is possible. I also hope that audio services don't need a core, or else the audio application developer will need to reimplement OSS4 and/or ALSA in his/her application. That's just about idiotic... oh well...
[1] http://arrakis.cs.washington.edu/wp-content/uploads/2013/04/...
[2] http://blog.scottlowe.org/2009/12/02/what-is-sr-iov/