Snap: A Microkernel Approach to Host Networking (ai.google)
144 points by sandGorgon on Oct 28, 2019 | 75 comments



"While early microkernel work saw significant performanceoverheads attributed to inter-process communication (IPC)and address space changes [15,16,20], such overheads areless significant today. Compared to the uniprocessor systemsof the 80s and 90s, today’s servers contain dozens of cores,which allows microkernel invocation to leverage inter-coreIPC while maintaining application cache locality. This ap-proach can evenimproveoverall performance when there islittle state to communicate across the IPC (common in zero-copy networking) and in avoiding ring switch costs of systemcalls. Moreover, recent security vulnerabilities such as Melt-down [43] force kernel/user address space isolation, even inmonolithic kernels [25]. Techniques like tagged-TLB supportin modern processors, streamlined ring switching hardwaremade necessary with the resurgence of virtualization, andIPC optimization techniques such as those explored in theL4 microkernel [29], in FlexSC [58], and in SkyBridge [44],further allow a modern microkernel to essentially close theperformance gap between direct system calls and indirectsystem calls through IPC."


You're right to quote some text that addresses the idea that "microkernels are slow", which keeps popping up.

To that, I add this article [0], which does a thorough job of dispelling that idea.

[0] https://blog.darknedgy.net/technology/2016/01/01/0/


IPC is a tad slower than not IPC.

However, sandboxing methods like WebAssembly can be used to run two processes in the same address space. Your filesystem driver then simply lives as a module in the kernel's address space while still being a normal process. The IPC turns into a simple jump through a pointer value provided by the kernel (which, depending on the trust level, can point to a stub that validates parameters or be a plain pointer to the correct function).
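A rough sketch in C of that "IPC becomes an indirect call" idea (all names here are hypothetical, purely for illustration):

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical "driver" entry point living in the same address space. */
    static int fs_read_raw(int fd, void *buf, size_t len) {
        (void)fd; (void)buf;
        return (int)len;                /* pretend we read len bytes */
    }

    /* Validation stub the kernel could install for a less trusted caller. */
    static int fs_read_checked(int fd, void *buf, size_t len) {
        if (buf == NULL || len > 4096)  /* arbitrary sanity checks */
            return -1;
        return fs_read_raw(fd, buf, len);
    }

    /* Call table the kernel hands out: the "IPC" is just an indirect call. */
    struct fs_calls {
        int (*read)(int fd, void *buf, size_t len);
    };

    int main(void) {
        char buf[64];
        struct fs_calls trusted   = { .read = fs_read_raw };     /* plain pointer  */
        struct fs_calls untrusted = { .read = fs_read_checked }; /* validated path */

        printf("trusted read:   %d\n", trusted.read(0, buf, sizeof buf));
        printf("untrusted read: %d\n", untrusted.read(0, buf, sizeof buf));
        return 0;
    }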


How would that work? The entire premise of microkernels is that everything runs in a dedicated process, so that if one crashes, the system can keep all other processes running and only needs to recover that single process.

The cost of switching processes is not a problem of address space; the MMU makes sure of that, and shared memory is a reasonable tradeoff for certain domains where performance is necessary. Normally you'd rather use messages, but that is a different topic.

In either case, the cost of a process switch is in handling the registers and cache. That will not go away no matter how you do it, which is why a multicore implementation using messages can actually turn out faster: less switching and more locality.


The experimental Singularity OS [0] from Microsoft implemented processes in the same address space using so-called "Software Isolated Processes" or SIPs [1]. Singularity was a microkernel design with the filesystem, networking, and drivers implemented outside the kernel in a variant of C# called Sing#. It seems Sing# had ownership semantics similar to Rust, which allowed SIPs to share memory by transferring ownership through lightweight message passing.

[0] https://en.wikipedia.org/wiki/Singularity_%28operating_syste...

[1] https://courses.cs.washington.edu/courses/cse551/15sp/papers...


Well, with WebAssembly the process can run in ring 0 but still be isolated as if it were running in ring 3. The crash mechanism isn't different: if the driver crashes, the kernel can kill it and deallocate its resources, and a device manager process can then restart the driver. The advantage is that you can include small, audited and verified binaries that run code in ring 0 to remove abstraction layers from raw hardware access.

The cost of switching processes is significantly reduced when you don't need to switch privilege levels; not having to invalidate the TLB or change the address space at all makes a context switch not significantly more expensive than a function call.

You get compile-time isolation and can still take advantage of the MMU when needed.

And such an implementation does not prevent you from writing your driver so that it runs on each core, takes advantage of that, or even passes messages. An ethernet driver could still, for example, pass a message when the "send data to TCP socket" function is called, while allowing another program to use the same function without any message passing, depending on what is better for your use case.


I'm not sure I believe it's a good idea to give up on the hardware protection, as that leaves it to the software implementation to ensure it's secure. If you compromise an application in user mode, it will not get you far without another exploit in something that runs in supervisor mode. The hardware makes that certain, and it's well verified. We've seen time and time again that simple buffer overflows get exploited, and the more that runs in supervisor mode, the larger the attack surface is.

If a driver runs in user mode, an exploit needs to exploit the hardware as well - and that is for all intents and purposes something that we see very rarely.

If the same driver runs in "software user mode" but executes as supervisor (basically inside a VM environment), we need constant security checks in software, and an exploit now has the VM code to further exploit; if successful, that will automatically grant it supervisor access.

In both cases it's assumed that neither implementation has access to more interfaces than necessary for it to do its work. For instance, a driver for a mouse does not need access to the disks.


The thing is, you don't have to give up hardware protection. If you don't trust code, you can still run it in ring 3 with all the associated overhead. The point is being able to choose how close an application runs to "no overhead" until you're at a level where the driver is a function call away.

From my experience, a lot of hardware is terribly insecure against exploits. Not necessarily the CPU but stuff like your GPU or HBAs, ethernet cards, etc.

With software containment, the advantage is that you can set it up so that drivers need to declare their interfaces and privileges beforehand. In an ELF or WASM binary you have to declare imported and linked functions; it should not be difficult to leverage that to determine what a driver can effectively do. With WASM you get the added benefit that doing anything other than using the declared interfaces results in a compile-time error.

A driver can be written so that a minimal, audited interface talks to the hardware almost directly with some security checks, while the WASM part handles the larger logic and provides the actual functionality.

WASM isn't a supervisor, so exploits on VM code aren't that relevant. Exploiting the WASM compiler/interpreter/JIT is more interesting, but those are already exposed to the daily internet shitstorm of exploits, so I think they are fairly safe.


I suppose it remains to be seen if someone can make a PoC. I'm skeptical but ultimately I do not know enough to decide either way.

> it should not be difficult to ...

Famous last words.


It works by defining "process" in a way that doesn't imply "registers and cache." If the concern is the ability for one process to survive despite arbitrary misbehavior from another process (and not, say, Spectre-style attacks), software-based fault isolation like NaCl or PittSFIeld or using a restricted language runtime like wasm or Lua demonstrably works fine for that, no hardware context switch required.


That sounds like a lot of overhead for a kernel. There is a reason it's normally written in a mix of C and ASM.


It wouldn't be that much overhead in reality; you can still write the kernel in a mix of C and ASM (or, more modern, Rust and ASM). The kernel doesn't even need to know about these isolation mechanisms: since you run everything in ring 0, you can hook the appropriate APIs and interfaces in a process. The kernel itself would be insanely small and thus easier to defend.



Nebulet [0] was an interesting take on this -- a microkernel written in Rust that runs WebAssembly modules in ring 0. Unfortunately, the project seems to have run out of steam.

[0] https://github.com/nebulet/nebulet


So, you'd like to expose the whole kernel to Spectre-type attacks. Wonderful!


Well, for trusted code that doesn't expose any mechanism to run foreign code (unlike, say, browsers), Spectre is largely a non-issue.

So the trusted core part of the OS can run without any spectre prevention, though you can still enable the various hardware protections available in the chicken bits.

And if it's necessary to protect against Spectre attacks, you can use shim layers or even isolation into ring 3 to take preventative measures. This allows leveraging performance where important and security where necessary.

If it's in webassembly, you can even run two versions of a driver; one with spectre-mitigations compiled in and one without, sharing one memory space and the kernel can choose to invoke either one depending on the call chain.


Trusted code has to be free from vulnerabilities to be immune, so it's still an issue even for trusted code. And I'm pretty sure neither WebAssembly nor other sandboxing methods can fully mitigate speculative attacks on out-of-order CPUs within the same address space; you'd need a programming language and compiler designed from scratch for it.


Well, it doesn't have to be free from vulnerabilities, not any more than any other OS code. The sandboxed code that is running trusted (ie without trampolines and spectre-defenses) would still hold the guarantees given by the sandbox (WASM), which are pretty much on par with what a modern browser can do for JS and WASM. And keep in mind that both WASM and JS now have spectre-defenses, so there is no need for a PL from scratch for this.


> And keep in mind that both WASM and JS now have spectre-defenses, so there is no need for a PL from scratch for this.

As far as I remember, they weren't able to completely defend against side-channel attacks within the same process and decided to rely on process isolation instead, estimating that it would be too much work to address all known Spectre-class vulnerabilities in their existing compilers and too hard to ensure the defenses wouldn't be broken later by compiler developers.


Another point is that macOS and iOS are moving towards a user space driver stack and away from Kernel space.


Because, save for a few exceptions (e.g. timers, early debug serial...), running drivers in supervisor mode isn't very smart.

Drivers tend to have high bug density, and supervisor mode means high potential for damage.

Much can be mitigated by running drivers in userspace[0][1], especially with some IOMMU help.

[0] https://www.cs.vu.nl/~herbertb/papers/minix3_dsn09.pdf

[1] https://wiki.minix3.org/doku.php?id=www:documentation:reliab...


Android and Windows as well.


Linux has been moving towards hybrid drivers for certain components for a while now.

I think you can run large parts of the filesystem and network stack in user-land now.


The big difference is that the other OSes are doing it no matter how many pitchforks come their way, while with Linux you need to find a security-conscious distribution.


I think it's hard to say that without knowing what's going on inside those companies and how many internal disagreements there have been.

And even if Apple now talks about user-mode drivers - do we even know what percentage they aim for?


Yes we do, as they presented at WWDC, all of them.

They are following a two-release process: in release N, the user space drivers for a specific class get introduced and the respective kernel APIs are deprecated. In release N + 1, those deprecated APIs are removed.


I disagree; these are VERY old comparison papers (~30 years old) containing very little data, from what I can see at a quick glance.

I saw a talk a couple of years ago by Tanenbaum where he said he would be ok with Minix being 20% slower than a monolithic kernel like Linux, indicating that it was currently slower than that.

Granted, Minix has not seen the kind of optimization that popular monolithic kernels have, due to lack of manpower.

So, I really look forward to seeing benchmarks made between monolithic kernels and new micro kernels like Google's Zircon, and Redox once they've had sufficient time to mature.


> Tanenbaum where he said he would be ok with Minix being 20% slower than a monolithic kernel like Linux, indicating that it was currently slower than that.

It doesn't indicate anything of the sort. He was making the point that the reliability and security benefits of microkernels are simply more important than performance in his mind.


They may be old, but what major advances in monolithic kernels have there been to invalidate them now? There seem to have been significant advances in microkernels. (I was glad to have been using fast microkernel-ish systems daily in the 1980s, and not VAX/VMS.)


>but what major advances in monolithic kernels have there been to invalidate them now?

What are the significant advances in micro kernels that do not apply to monolithic kernels?

Also they are not only VERY old, they seem intentionally vague when it comes to actual data about the systems they are comparing against. In short, I welcome the new micro kernels so that we can see a comparison between modern monolithic and modern micro kernels and actually get a good representation of what the performance difference is.

Because if not for performance, there is no reason not to use a micro kernel.


The L4 work is usually quoted for relatively recent advances. Presumably monolithic kernels don't benefit from performance work that's specific to micro-kernel message passing, if performance was all that mattered.

I don't remember the OS4000 context switching time but it was fast in comparison with other systems of the 1980s, and it was very fast in actual use (running real-time and interactive processes). The performance of L4Linux is quoted within the typical margin of speculative execution mitigations relative to Linux. However, it's a strange idea to me that speed is all that matters for an OS, and not reliability, trustworthiness, etc.


It always was a lot less important than proponents of monolithic kernels made it seem; two things were usually left out of such discussions:

- paging is a very efficient way to copy a block of data from one process to another (see the sketch after this list).

- the perceived speed of an interactive system has everything to do with responsiveness and very little with actual throughput. And responsiveness is associated with near real-time operation which happens to be something micro kernel based systems excel at.
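Not quite the microkernel page-remapping trick, but a minimal sketch of the "no byte-by-byte copy between processes" idea, using a shared anonymous mapping so the receiving process simply sees the same pages:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* One page backed by a shared anonymous mapping: both processes
         * reference the same physical page, so handing over the block is
         * a matter of page mapping, not of copying bytes around. */
        size_t len = 4096;
        char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(page, "block handed over via page mapping");

        if (fork() == 0) {          /* "receiver" */
            printf("child sees: %s\n", page);
            _exit(0);
        }

        wait(NULL);
        return 0;
    }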


With the advent of multi-core CPUs, monolithic kernels jumping in on slow shared-memory concurrency kind of killed the idea that monolithic kernels are fast. They can't be fast unless they use the actor model, but that goes against the idea of a monolith, hence they can't do better than microkernels.


There are already so many software projects called Snap.


For Google that's a non-issue, they will simply promote their own use of the term over the previous uses.


Even Google has a limit though, they could've named their programming language "The".


I would say taking the '+' character and attempting to make it their own is proof positive that they don't have limits.


Wonder if their MicroQuanta microsecond granularity Linux scheduler would be useful for other near realtime things like audio or electronics interfacing. I guess the low latency scheduler is a key part of making Snap work in userspace.

https://lkml.org/lkml/2019/9/6/177


"A change to the kernel stack takes 1-2 months to deploy; a new Snap release gets deployed on a weekly basis."


At many companies rolling out a new kernel in a month would be an outright miracle.


The actual paper, where their TLS certificate is not yet expired, is https://storage.googleapis.com/pub-tools-public-publication-...


Unpopular opinion ahead:

The deal with user-facing libraries like this is that I'd rather they generalize this and expose the existing network drivers through a uring-like interface, so that user processes can take care of packet decap the way they want.

Of course, it would be helpful to have a stub to map the PCI BARs to userspace, and hopefully without any message signalled interrupts.

These two alone may be hard to do with all the existing network drivers out there. Engineering feats like these are good but not helpful to most people unless they are simple and generic enough to work on most devices.

I hope the guys who wrote this take note and eventually layer this out and open source the library.


Looks like the certificate for ai.google expired today, what a coincidence!


And with HSTS, modern browsers will make sure you can't ignore it :) In the meantime https://storage.googleapis.com/pub-tools-public-publication-... works.


Not sure about Firefox but in Chrome you can type "thisisunsafe" on the page to bypass even HSTS (you can test at https://subdomain.preloaded-hsts.badssl.com/ )


Oops, the TLS cert on this site expired just nine minutes ago.


Is it still safe?


"Snap has been running in production for over three years, supporting the extensible communication needs of several large and critical systems."

Are there many systems using microkernels in production environments? I mean at this kind of scale, or a somewhat similar one?

3 years seems like a long time, whereas up until now microkernels seemed fairly niche to me, something reserved for more experimental systems.

The only one I can think of off the top of my head is Fuchsia.


QNX has been around for quite a while and used across many areas: https://blackberry.qnx.com/en

Edit: The Wikipedia entry has a little less fluff than the official homepage: https://en.wikipedia.org/wiki/QNX


Well, Snap isn't a microkernel. It's Linux userspace software with microkernel inspiration. So you can't use it as a basis for comparison.


The Secure Enclave processor on Apple devices runs a version of the L4 microkernel

[0] https://support.apple.com/en-us/HT209632


OKL4 is a microkernel that’s shipped in a billion mobile devices.


I'm surprised that microkernel-like IPC still hasn't found its way into any *NIX system. The closest we've gotten is System V IPC, which has never really taken off.

Is this just difficult to design well or are people genuinely okay with socket(AF_UNIX, SOCK_SEQPACKET, 0)?
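For reference, the status quo being alluded to looks roughly like this (a minimal sketch using socketpair(2) with SOCK_SEQPACKET; error handling mostly omitted):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int sv[2];

        /* Connected pair of reliable, ordered, message-boundary sockets. */
        if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }

        if (fork() == 0) {                      /* "server" */
            char buf[128];
            ssize_t n = recv(sv[1], buf, sizeof buf, 0);
            if (n > 0)
                printf("server got %zd bytes: %s\n", n, buf);
            _exit(0);
        }

        /* "client": each send() is delivered as one discrete message. */
        const char msg[] = "hello from userspace IPC";
        send(sv[0], msg, sizeof msg, 0);

        wait(NULL);
        return 0;
    }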


I don't think it's difficult to design an IPC mechanism. But I DO think it's difficult to design an IPC mechanism that everyone is happy with. Some people want only synchronous IPC, others also asynchronous (out of order responses), some want multicast, some want events without associated request, some want pub/sub, etc.

If at the end people continue building their own IPC mechanisms on top of TCP/IP or unix domain sockets then the in-kernel mechanism will just be another thing to maintain.

There have been some endeavours to bring new IPC mechanisms into the Linux kernel (AF_BUS, KDBUS, BUS1). I think those failed for similar reasons (although I'm not sure where BUS1 is now; the others are definitely discontinued).


KISS: QNX MsgSend, MsgReceive, MsgReply. That's really all it takes.

http://www.qnx.com/developers/docs/6.5.0/index.jsp?topic=%2F...

You can go all the way from the lowest level kernel uses to very high level application constructs with that. Re-inventing existing wheels badly is something the software world excels at.
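For the curious, the shape of it in C (following the QNX Neutrino API in the docs above; simplified, with error handling omitted):

    #include <sys/neutrino.h>   /* ChannelCreate, ConnectAttach, MsgSend, ... */
    #include <sys/types.h>
    #include <stdio.h>

    /* Server: create a channel and block in MsgReceive until a client sends. */
    void serve(void) {
        int chid = ChannelCreate(0);
        char msg[64];
        for (;;) {
            int rcvid = MsgReceive(chid, msg, sizeof msg, NULL);
            /* ...handle the request in msg... */
            MsgReply(rcvid, 0, "ok", 3);        /* unblocks the sender */
        }
    }

    /* Client: attach to the server's channel, then do a blocking send. */
    int request(pid_t server_pid, int server_chid) {
        char reply[64];
        int coid = ConnectAttach(0, server_pid, server_chid,
                                 _NTO_SIDE_CHANNEL, 0);
        /* MsgSend blocks until the server has received and replied. */
        int status = MsgSend(coid, "ping", 5, reply, sizeof reply);
        printf("reply: %s (status %d)\n", reply, status);
        return status;
    }

The send/receive/reply rendezvous is the whole data path; everything else is setup.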


> That’s really all it takes.

Not quite. Those functions need established connections, so you need ConnectAttach and ConnectDetach as well. But none of that is useful unless you can identify clients so you also need ConnectClientInfo.

This isn’t a dig, having worked on a custom IPC system I found the QNX approach to be the best of all worlds.


Sure, but the essence of the actual transfer, the part where performance matters is send/receive/reply.


> But I DO think it's difficult to design an IPC mechanism that everyone is happy with. Some people want only synchronous IPC ...

Every form of IPC can be implemented on top of asynchronous message passing. The interface is not the problem. The problem is building high-performance designs with all the batching, memory-mapped buffers, syscall avoidance, etc.


> Every form of IPC can be implemented on top of asynchronous message passing

Sure, if you want to introduce inherent DoS vulnerabilities into your IPC subsystem, not to mention slow down IPC so much that it's practically unusable. Many early microkernels were asynchronous, and synchronous microkernels like L4 beat them easily every time.

Furthermore, synchronous IPC can be made immune to DoS which is inherent to async IPC: https://www.researchgate.net/publication/4015956_Vulnerabili...


Haven’t even some L4 designs moved to async IPC for performance reasons? I remember having read about OKL4/Genode being async.

Fuchsia certainly is async, but that's not in the L4 family.


> Haven’t even some L4 designs moved to async IPC for performance reasons?

I know only of async notifications, which I believe require no allocation of storage and so don't open up DoS opportunities.


> Sure, if you want to introduce inherent DoS vulnerabilities into your IPC subsystem, not to mention slow down IPC so much that it's practically unusable.

This is nonsense. There could be DoS vulnerabilities in implementations, but they are not inherent to async message passing.


Yes they are. Who owns the buffers needed to store the async messages?

1. If they're booked to the receiver, then clients can easily DoS receivers by flooding them with messages.

2. If they're booked to the sender, then receivers can easily DoS senders by blocking indefinitely.

3. If they're booked to the kernel (which is most common for true async message passing, unfortunately), then senders or receivers can DoS the whole system by the above two mechanisms.

And that's only the most basic analysis. I suggest you read the paper I linked and its references if you want a more in-depth analysis of IPC vulnerabilities and performance properties.


You are oversimplifying it and dropping too many important details, so that it ceases to be a real system and becomes some weird system susceptible to DoS.

In a high-performance scenario it would be more like this: a process shares fixed ring-like buffers with the kernel where it can put messages; messages it can't put there it either accumulates locally until it can, or just drops; and there is some kind of polling or event-notification mechanism to know when it can put more messages into, and get more messages from, the shared buffers.
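A rough sketch of the producer side of such a scheme (names are entirely hypothetical; a real implementation would map the ring into memory shared with the kernel and add the event/poll mechanism):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define RING_SLOTS 256          /* fixed capacity, power of two */
    #define MSG_SIZE   64

    /* Fixed-size ring, conceptually shared with the kernel. */
    struct ring {
        _Atomic uint32_t head;      /* advanced by the producer */
        _Atomic uint32_t tail;      /* advanced by the consumer (kernel) */
        uint8_t slots[RING_SLOTS][MSG_SIZE];
    };

    /* Try to enqueue; returns false when the ring is full, i.e. the
     * producer sees backpressure and must accumulate or drop locally. */
    static bool ring_try_send(struct ring *r, const void *msg, size_t len) {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SLOTS || len > MSG_SIZE)
            return false;           /* full (or oversized): no blocking */

        memcpy(r->slots[head % RING_SLOTS], msg, len);
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    int main(void) {
        static struct ring r;       /* in reality, mapped shared with the kernel */
        const char msg[] = "hello";
        if (!ring_try_send(&r, msg, sizeof msg)) {
            /* backpressure: accumulate locally, drop, or wait for an event */
        }
        return 0;
    }

When the ring is full, the sender, not the kernel or the receiver, pays the cost, which is the point being argued here.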

P.S. I can't access the paper, but presumably they are making the same faulty assumptions if they claim the same things you did.


No, you are oversimplifying by assuming mutual trust between processes, which is not a suitable assumption for a system-level message passing system.

> process shares fixed ring-like buffers with the kernel where it can put messages

I am making a claim like "Turing machines can't solve the Halting problem", and you are saying, "If you put a limit on the number of computation steps, then the Halting problem is decidable". But such a system is no longer a Turing machine.

What you are describing is not asynchronous IPC. With async IPC, you ought to be able to send a message at any time without blocking. That's what async IPC means.

If you must sometimes block or throttle before you can successfully send a message, even if only in principle, then it's no longer async IPC. It is instead a mixed sync/async system, which invariably becomes necessary in order to address the inherent limitations of async IPC.

> messages it can't put there it either accumulates locally until it can or just drops

So DoS against the sender, like I said. Try assuming a less liberal threat model and see how far async IPC takes you.


There is no assumption of mutual trust between processes, and there is nothing synchronous about it: you still don't know whether any of the messages reached their destinations. This is just backpressure, asynchronously propagated (or synchronously, depending on your interpretation of it). And still no DoS; it's completely up to the application to decide what to do with the messages it generates too fast.


> This is just backpressure, asynchronously propagated

The need to handle back pressure is exactly why it's not pure async IPC.

> And still no DoS, it's completely up to the application to decide what to do with the messages it generates too fast.

And if the program can't discard messages, then it's a DoS. If the program can instead rely on the receiver to keep up so it doesn't need to make this choice, then there's a trust assumption between these processes.

There's no escaping this tradeoff with async IPC.


> And if the program can't discard messages, then it's a DoS. If the program can instead rely on the receiver to keep up so it doesn't need to make this choice, then there's a trust assumption between these processes.

It doesn't work like that; this is getting into hypothetical, non-real-world systems again. Trust is especially interesting in this context, because if you don't trust other processes, you absolutely have to be able to discard their messages. They can misbehave, crash, or stop responding at any time.

But say somehow you can't discard messages and don't trust the other side. It's still only about backpressure handling. For example, the kernel can simply refuse to deliver messages to a recipient it knows is not consuming its incoming messages, and instead return them to senders' incoming buffers or rejected buffers or whatever. Senders can then decide what to do with that information: wait for the recipient to become ready again (waiting is not DoS), stop generating messages for that recipient, accumulate them, or just drop them. Minimal cooperation is required, of course, but not trust; if they don't cooperate, they hurt no one but themselves. It's all still pure asynchronous stuff. And in fact, all high-performance real-world asynchronous communication deals with backpressure without DoS or trust, though it can also discard messages.


Android has had it since Project Treble; classic Linux drivers are deemed legacy in the Treble architecture.

They make use of Android IPC to communicate among themselves and the kernel.

https://source.android.com/devices/architecture/hidl/binder-...

Now, given the role of Linux on Android and what is accessible to userspace, maybe we shouldn't consider it a *NIX system anyway.


It's difficult to tack onto an existing kernel syscall interface. You really want a capability-based interface to keep most of the permission checks out of the data plane. And even then, modern microkernel IPC goes to extreme lengths. For instance, the L4s tend to do crazy stuff like not saving all of the registers, making sure to only clobber the registers that can't be arguments to the call itself. You might be able to tack that onto the front of the syscall interface, sort of like how objc_msgSend works, but it'd be a huge pain.


Google's Fuchsia seems to have it, though I'm not sure how it's implemented. https://fuchsia.dev/fuchsia-src/reference/syscalls#channels


Mach?


Projects need to stop calling themselves snap


Answering the dead question:

> Do they pay people to do this?

Yes you fucking bet they do.


> 0 points by Gibbon1

Makes my point



