Experimental KVM-based VMM, Written in Go (github.com/google)
221 points by simonebrunozzi on Jan 7, 2015 | 45 comments



Because it's not obvious without glancing at the code, this relies on 9P-over-virtio to implement the filesystem, which means it's pretty much always going to be limited to running Linux images. This sounds like a nice and clean solution, but at least it forces the guest to do basically no useful caching of its file system, in order to remain coherent with the host fs.

The remaining alternative is on-demand synthesis of a block image given a directory tree, which IIRC Qemu supports.

As always, making a couple of syscalls isn't a huge amount of work, and this project punts on anything difficult (e.g. supporting traditional boot environment & emulated devices) needed for handling basically any other OS. It can't even boot a standard vmlinuz.

Not sure what gap this is supposed to fill. If you're running a Linux host, and you want ultra-quick dev machine booted from your filesystem, User Mode Linux already exists and has way more support and flexibility. If you need support, performance, and code that's been security audited, you probably want Qemu/Xen/libvirt.

Edit, because it may not be clear to a lot of people: KVM setup is literally a handful of ioctl() system calls; almost everything not related to device emulation is handled by Linux for you. So when you say you're implementing a KVM-based hypervisor that doesn't implement any device emulation, that is to say there's very little real work being done. It's easy to claim to be fast when your code does nothing. The only special sauce left is a fundamentally slow method for handling the root filesystem, which is worth avoiding for the reasons I mentioned.
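To make the "handful of ioctl() calls" concrete, here is a minimal Go sketch of the first three steps: querying the API version, creating a VM, and creating one vCPU. It is not taken from novm. The request numbers are the x86 values of KVM_GET_API_VERSION, KVM_CREATE_VM and KVM_CREATE_VCPU from <linux/kvm.h>; the helper names (ioNr, ioctl, setup) are made up for illustration. A real VMM would still have to register guest memory, set up registers and loop on KVM_RUN.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

const kvmIOType = 0xAE // KVMIO in <linux/kvm.h>

// ioNr mirrors the C _IO(type, nr) macro for ioctls that carry no argument
// struct. This encoding assumes x86, where the "no direction" bits are zero.
func ioNr(typ, nr uintptr) uintptr { return typ<<8 | nr }

var (
	kvmGetAPIVersion = ioNr(kvmIOType, 0x00)
	kvmCreateVM      = ioNr(kvmIOType, 0x01)
	kvmCreateVCPU    = ioNr(kvmIOType, 0x41)
)

// ioctl is a thin wrapper around the raw system call.
func ioctl(fd, req, arg uintptr) (uintptr, error) {
	r, _, errno := syscall.Syscall(syscall.SYS_IOCTL, fd, req, arg)
	if errno != 0 {
		return 0, errno
	}
	return r, nil
}

// setup opens /dev/kvm, creates a VM and a single vCPU, then tears them down.
func setup() error {
	kvm, err := os.OpenFile("/dev/kvm", os.O_RDWR, 0)
	if err != nil {
		return fmt.Errorf("open /dev/kvm: %w", err)
	}
	defer kvm.Close()

	version, err := ioctl(kvm.Fd(), kvmGetAPIVersion, 0)
	if err != nil {
		return fmt.Errorf("KVM_GET_API_VERSION: %w", err)
	}

	vm, err := ioctl(kvm.Fd(), kvmCreateVM, 0) // returns a new VM fd
	if err != nil {
		return fmt.Errorf("KVM_CREATE_VM: %w", err)
	}
	defer syscall.Close(int(vm))

	vcpu, err := ioctl(vm, kvmCreateVCPU, 0) // arg is the vCPU id
	if err != nil {
		return fmt.Errorf("KVM_CREATE_VCPU: %w", err)
	}
	defer syscall.Close(int(vcpu))

	fmt.Printf("KVM API version %d, vm fd %d, vcpu fd %d\n", version, vm, vcpu)
	return nil
}

func main() {
	if err := setup(); err != nil {
		fmt.Println("KVM not usable here:", err)
	}
}
```

On a machine without /dev/kvm (or without permission to open it) the program only prints the error, which is the expected outcome there.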


9p works from OSes other than Linux; there's a driver for Windows as well. That said, it will likely always work best in Linux. Unfortunately, even over virtio, 9p has awful performance.

> The remaining alternative is on-demand synthesis of a block image given a directory tree, which IIRC Qemu supports.

If you mean vvfat, that hasn't worked in quite a while, and it never supported writes.

> If you're running a Linux host, and you want ultra-quick dev machine booted from your filesystem, User Mode Linux already exists and has way more support and flexibility.

UML tends to bitrot; it's hit or miss whether it even compiles in each new release, and it doesn't get any significant new development. It also relies on ptracing the target process, which is rather awful; unfortunately there's no better way to implement userspace syscalls.

I wonder if UML could be accelerated by adding some new features to the BPF-based seccomp to support syscall emulation? BPF could translate each syscall into some efficient communication to the UML kernel process.

Alternatively, with a 64-bit address space available, UML could learn to do its own memory and process management, and not rely on host Linux processes at all.


I imagine seccomp and ptrace have pretty similar performance, since they both wake the 'host' using a signal, and stay out of the way until their trigger is met. As for UML stability, Debian have been providing packages of it for years; I can't remember the last time I built it directly.


Could you elaborate on 9p/virtio performance? I've had pretty good results booting qemu/kvm VMs with 9p root filesystems...

UML used to have some host kernel patches that were a lot faster for intercepting syscalls. There was never enough interest (even before Vanderpool/Pacifica) to merge it though.


> Could you elaborate on 9p/virtio performance? I've had pretty good results booting qemu/kvm VMs with 9p root filesystems...

The throughput is passable, if you do large enough reads; however, the latency for individual operations is terrible. Try booting via virtio-9p and building a kernel inside the virtual machine, as compared to the host system. Last time I tried it, it took several orders of magnitude longer, due in large part to the latency of simple operations like stat or open/read/close.


On high-latency links, 9P is slow, but on low-latency links (such as QEMU should provide), 9P should be very fast. Looks like a QEMU implementation problem to me.

That being said, calling QEMU's 9P "9P" is a stretch. 9P was designed with multiplexing in mind, but it's not multiplexed in QEMU; in fact, multiplexing in QEMU uses a completely different mechanism. You can't mount QEMU's thing in Plan 9 (unless someone added support for this when I wasn't looking).


I wouldn't find it surprising at all if the fault lies with qemu's implementation rather than with some inherent property of 9p. The performance test I ran was with qemu's virtio 9p implementation: compared to a kernel compile on the host, one in the guest took a few percent longer on a virtual block device, or hundreds of times longer on virtio-9p.


> 9p works from OSes other than Linux; there's a driver for Windows as well.

Do you have a link to the Windows virtio 9p driver? I'm not aware of such a thing existing.


https://code.google.com/p/ninefs/ . I don't know if it supports the virtio transport, but there's a Windows virtio base driver, so it likely wouldn't be difficult to adapt ninefs to use that.


There are several emulated devices simply because there's no way around it (the rtc, PCI bus, uart, etc.). Obviously KVM handles a fair amount of emulation (PIT, IOAPIC) which is very nice.

Although the Linux implementation for 9p isn't great, there's nothing "fundamental" that limits the caching in the guest. For many use cases, you will have no shared mutable state in the host. Think docker: you have some shared read-only state, but the write bits are yours alone. You may cache anything freely.
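As an illustration of that point, here is a minimal Go sketch of a read cache that never invalidates. It is only correct under the assumption stated above, namely that nothing else mutates the backing data. The names (roCache, fetchFunc) are hypothetical, and the fetch function stands in for a slow round trip to the host.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc stands in for a slow read from the host (for example over 9P).
type fetchFunc func(path string) ([]byte, error)

// roCache caches file contents forever. That is only safe when the backing
// files are read-only or private to this guest.
type roCache struct {
	mu    sync.Mutex
	data  map[string][]byte
	fetch fetchFunc
}

func newROCache(f fetchFunc) *roCache {
	return &roCache{data: make(map[string][]byte), fetch: f}
}

// Read returns the cached contents, fetching them on the first access only.
func (c *roCache) Read(path string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.data[path]; ok {
		return b, nil
	}
	b, err := c.fetch(path)
	if err != nil {
		return nil, err
	}
	c.data[path] = b
	return b, nil
}

func main() {
	trips := 0
	c := newROCache(func(p string) ([]byte, error) {
		trips++ // count round trips to the "host"
		return []byte("contents of " + p), nil
	})
	for i := 0; i < 3; i++ {
		if _, err := c.Read("/usr/lib/libfoo.so"); err != nil {
			fmt.Println("read failed:", err)
			return
		}
	}
	fmt.Println("host round trips:", trips)
}
```

With shared mutable state on the host, the same cache would need invalidation or revalidation, which is where the coherence cost discussed earlier comes from.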

This project was simply about playing with these ideas (and implementing a VMM using Go). You should check the slides out to understand the gap that I was looking at, but suffice to say I think there's a wide opportunity for something that gives you a process-like model but whose interface is that of a virtual machine. The key point is that you don't really want an ultra-quick "machine", you probably want to run a process ala Docker. I still want it to be contained as a VM however, for security and compatibility reasons.


If you take a glance at Paragraph 1 of the Readme (it's the main description for the project), it says, "Its goal is to provide an alternate, high-performance Linux hypervisor for cloud workloads." So you can probably answer your own question about the intentions of this project.


> it forces the guest to do basically no useful caching of its file system

Of a filesystem was the impression I got. The author implied that you could, in addition, mount any block device provided to you in any way you wanted.

> It can't even boot a standard vmlinuz.

It was mentioned at the end that while it won't run the real-mode code therein, it will attempt to extract the ELF binary and run that.


One could write a Windows IFS driver for 9P if that had some benefit. As it is, there's a Windows IFS FUSE driver that has a 9P server written for it.


You could, but you almost certainly don't want to. Network filesystems that try to emulate regular filesystem semantics fundamentally suck, not least because you basically can't do any useful caching, and the caching that is done (because it must be done) can introduce very hard to detect bugs (e.g. coherence between a file and its in-memory mapping; if you want to execute a program from the filesystem, you need such a mapping).


Ignoring details like the impossibility of loading the kernel (because the Windows kernel has to be loaded by its own bootloader, which does a lot of setup), you will not be able to boot from 9P.


>but at least it forces the guest to do basically no useful caching of its file system, in order to remain coherent with the host fs.

Sure, but I assume the host/hypervisor is doing the caching on its side, so this may not be a huge loss for the gain.


Is it interesting if only because it's written in a "memory-safe"* language?

* Terms and conditions apply.


Could this in any way explain why Google Cloud does not support Ubuntu?



Author here.

As noted at the top of the README (and by others here), this is not an official Google product. It's an experimental project, mostly to play with some ideas (and Go).

I gave a talk at LinuxCon in August about novm. The slides may also be of interest, and are available here: http://events.linuxfoundation.org/sites/events/files/slides/...


Thank you very much for your Huptime project! It has now become an indispensable tool in my sysadmin arsenal.


Do you mean that you've just discovered it, or that you've actually been using it?

I have no idea how widely used it is. If it's useful to you, I'll happily invest more time to address some niggling issues (flaky tests, etc.) and to maintain it.


We've been actively using it. Just a couple of weeks ago, I managed to get it to restart processes in docker containers in a downtimeless way! I can't thank you enough for it :)


That's awesome. Totally makes my day.


I can't figure out which one of the several Go9P implementations you started with. Or is this something you rolled yourself? Any differences between this and the others?


It was go9p (https://code.google.com/p/go9p/).

But it was heavily modified (nearly rewritten, except fmt.go and p9.go) to suit my specific needs.


Cool! I really like what you're doing.


Thanks for the note - well done, by the way. Very interesting.


Don't leave me hangin'! I got to "How do you build a VMM? (part 3)" and ... that's it? I feel like I need some kind of epilogue.


Ah, those were actually a few backup slides. The last one was "Thanks! Questions?" :)


(Off topic) Do also check out the author's Huptime project (https://github.com/amscanne/huptime): a really simple, language-agnostic way to do downtimeless restarts for any process.

It's been a lifesaver at my startup, especially when we are running many different (micro)services.


Incidentally, someone is working on Go bindings for the OS X Hypervisor (https://github.com/penberg/go-osxhv).


On a related note, Google's compute service is supposed to be kvm+not qemu. Does anyone know anything about the latter?


I've only heard rumors of its code name: Vanadium (but would love to get confirmation of that). Would be great to get this open source -- the world could use KVM + not-QEMU. Speaking for us in SmartOS, we didn't like where QEMU was headed, and we have essentially forked it for our KVM implementation -- we would love to explore Vanadium, if that's even what it's called.. ;)


What is wrong with where QEMU is headed?


Off topic. In the README, it says "This is not an official Google product." I've seen that line in some repos under the github.com/google organization. What does it really mean?


(Edit: comment below from DannyBee is authoritative.)

Google has two paths to open sourcing code: in one Google retains the copyright (but grants a permissive license like Apache), and in the other you retain the copyright but cannot work on it on Google time or Google hardware. (You can tell which category a given piece of software is under by looking at the license headers on the code; the linked software is the first category.)

It's easier to release software under the first category, even if it's just some random hack you tinker with. (I frequently hack on stuff on my corporate-provided laptop.) But if you do so then the code has the word Google all over it even if it's not something the company intends to support. I don't know for certain the reasoning but I imagine the sentence you quote helps reduce confusion about who is sponsoring the project.


"but I imagine the sentence you quote helps reduce confusion about who is sponsoring the project."

Yes. I've updated the sentence a bit for future projects. But historically, what has happened is that people make a lot of assumptions about code Google releases and what it means for X or Y.

I've seen entire press stories about how Google has created some new product that does X or Y, when it's just some random googler's code.

(This started even before it was in the google namespace on github, and it was just a random code.google.com project).

It seems without some disclaimer, it is roughly impossible to get people to disassociate the two.


It'd probably help to only publish official things under the google/ namespace on github, but I guess it's too late for that. :)


github namespacing at the time made this kind of a pain (we still needed to be able to admin the projects, etc), but yeah.


20% projects maybe? If nothing else, it's a nice perk.


No, evmar has it right. It means it is some random googler's code. It means Google has literally nothing to do with it, other than happening to own the code.

(I.e., it's not an experimental product; it's not a product at all. It's just some guy releasing some code.)


Seems like a perfect thing to complement libcontainer of the docker project.


Can this be used to run Docker images?




