Ways we harden our KVM hypervisor at Google Cloud: Security in plaintext (googleblog.com)
199 points by EvgeniyZh on Jan 25, 2017 | 89 comments



> We have built an extensive set of proprietary fuzzing tools for KVM. We also do a thorough code review looking specifically for security issues each time we adopt a new feature or version of KVM. As a result, we've found many vulnerabilities in KVM over the past three years. About half of our discoveries come from code review and about half come from our fuzzers.

Why aren't these tools being open sourced?


This is a valid question. Syzkaller is definitely one such tool that is open source, but the tool we've been using for the longest is internal only.

The reason it's not open source is because it's built on a lot of internal tools and it would be hard to separate out those dependencies. For example it relies on our custom VMM. We've discussed refactoring it so that we could get rid of the dependencies and open sourcing it, but that work has to be balanced with other priorities.


Some of the tools have been: https://github.com/google/syzkaller


Maybe because they think it's a competitive advantage.


More likely it is just heavily integrated with internal Google tooling like Chubby and Borg.

I suspect they still use American Fuzzy Lop[1] for much of their fuzzing, even if they wrote some of it from scratch. The Linux Foundation's Core Infrastructure Initiative's[2] "Fuzzing Project" is more or less continuous integration running AFL.[3]

[1] http://lcamtuf.coredump.cx/afl/

[2] https://www.coreinfrastructure.org/grants

[3] https://fuzzing-project.org/


Because the fuzzers can be used by blackhats to find vulnerabilities to be exploited. The blackhats surely write their own fuzzers, but why give them a head start?


Say it with me, "Security through obscurity is no security at all". Surely you'd know this working for Mozilla :)

Probably due to business reasons in that it isn't meant to be a general purpose fuzzer and is likely tied to internal google infra.


Say it with me: "Obfuscation on top of solid security practices provably reduces risk against talented hackers." It's why the Skype attack took forever. They spent so long on reverse engineering where open software with obvious weaknesses would've been cracked almost instantly. Now, apply that concept to runtime protections, NIDS, recovery mechanisms, etc. The hacker has much more work to do, which increases the odds of discovery when their attempts fail.

Obfuscation was also critical to spies helping their nations win many wars. There are so many ways for well-funded, smart attackers to beat you when they know your whole playbook. When they don't, you can last a while. If it changes a lot, you might last a lot longer. If they aren't well-funded or are maximizing ROI (most malware/hackers), they might move on to other targets that aren't as hard.


That's fair, but I still believe in the idea of Kerckhoffs's principle. The gist is that you should build a system based on the idea that all parts are open, because any security researcher (whitehat OR blackhat) worth their salt will eventually figure it out. Of all the layers for good defense in depth, obscurity is generally a mediocre one.

Edit: Spelled Auguste Kerckhoffs's name correctly.


I am with you, and I think everyone in infosec generally agrees with the principle, but I just wanted to add that from this whole "attacker math" perspective, often you just want to make your app a little less fun to play with than the next guy's. The bored security researcher will just move on to a less hardened target. When you design for real security and mitigate low and medium risk issues with obfuscation where it makes sense, it can lower the total cost of your information security program, because you control the exposure and the information out there.

As long as you know why and when to obfuscate vs. implement real security measures it is a valuable tool. If I can make something that much more expensive for an attacker I haven't bought any iron clad security but I have moved a system from "likely to be hacked by a bored security researcher" to "likely to be hacked only by a highly motivated or well funded attacker".

When I'm doing time-boxed black-box assessment work and have no choice but to expend effort figuring things out (stripped symbols, obfuscated names, purposefully unfriendly backend things, etc.), it means I have less time to play with the important stuff. This exactly simulates and helps inform the "real attacker" picture.

So, it can be mediocre technically (no actual security increase in a technical sense), but perfect when it was really low cost to implement. When you spot those low cost obfuscation opportunities they are often worth doing :)

And on the content of the article, I was right there saying, "Okay, just because you now have your own pile of code doesn't mean it is any more secure!" In an absolute sense true. But if you can save a bundle of money not having to rush out to emergency patch some feature, distracting from important work, because QEMU got a new widely disclosed vuln... you are winning :)


Kerckhoffs's principle is similar to the principle of TCBs in high-assurance security. You want your main trust to be something small and highly vetted. The openness part introduces extra attack or defense potential to that. This can only help if significantly more effort is going into bug-fixing the open tool than exploiting it. Currently, that's backwards for software that's not incredibly popular.

It should be straightforward to test our hypotheses with a few examples plus a highly simplified deployment. The scenario is private information served via web server over TLS, with people only seeing what their credentials authorize them to. A NIDS looks for baseline activity and attack profiles. The attacker wants all of it and has one month to get in.

Option 1: Regular http server on Linux leveraging OpenSSL for protection of secrets. All configuration data except passwords or private keys are published. This includes how traffic looks.

Option 2: Unknown, non-mainstream OS w/ decent quality & defaults running unknown server with unknown crypto & compiler-assisted protections in unknown configuration with unknown traffic patterns.

Which do you think will succeed easiest? Your interpretation of Kerckhoffs suggests No 1 is going to be hardest for the attacker. Mine says No 2 is going to give them a lot of detective work that's also likely to set off the NIDS. We do know No 1 gets smashed regularly and with low effort. Evidence leans in my favor so far.

Option 3: Runs OpenBSD on POWER processors with an HSM doing the crypto. The HSM is a black box w/ tamper resistance that's also a black box. It cost a fortune, like the POWER server itself. Reverse engineering HSMs, especially if each attack bricks one, can take tons of time and money. The server is advertised as an Intel server running FreeBSD or Linux with options turned off. Think your attacker will hack it since obfuscation = obscurity = no security? And if they could, how many are even able to try, given the economic cost and skills required?

Option 4: Runs on a Boeing SNS server w/ HSM. That's one of the earliest systems in high-assurance security, certified through NSA pentesting in the early 1990s. No reported hacks to this day (20+ years), although there's undoubtedly something to hit in there. Also unavailable for purchase outside defense. If available, you're probably spending $60-150k a unit, if the XTS-400 is any indicator of the cost of low-volume, high-security servers. Docs say it has Xeon CPUs, custom firmware, a "transactional" kernel (also tiny), and MLS policy. Think your attacker will do better than the others did over two decades?

Option 5: Uses the LOCK platform. An ancestor of SELinux made by Secure Computing Corporation. It did Type Enforcement at the level of CPU and memory interactions, with a security kernel at the software layer. Built-in crypto-processor called SIDEARM. A UNIX layer running deprivileged for the server-side app. Security-critical components developed and reviewed in a rigorous way. Although no longer available, this organization still has the installation media, which it uses on obsolete computers it buys off the Internet. The untrusted networking interface just says it's a BSD. So, it's a high-security product that's not available for sale or on eBay and that looks like an old UNIX box from the outside. How do you think the remote attacker will get in?

I hope I've amply demonstrated that Kerckhoffs's principle, or at least the interpretation you're bringing to it, is incorrect. The best approach is a combo of solid, vetted security with obfuscation. Some obfuscations can even make it impossible for the vast majority of attackers to hack the system. They'll go for supply chain poisoning or infiltration before trying to hack SNS or LOCK. If physical and personnel security are good, then that obfuscation just bought you a lot.


> They spent so long on reverse engineering where open software with obvious weaknesses would've cracked almost instantly

Not necessarily true. If the software is available to everyone white hats are much more likely to find bugs and help fix them. They might actually have skin in the game alongside you.

White hats won't bother with proprietary software at all, and baddies sure aren't going to turn in their exploits; they'll just sit on them or sell them. If you're being targeted by sophisticated nation-state attackers, keeping the code private isn't going to help you. These are people who make worms like Stuxnet, MITM major Internet services, and pop government employee Gmail accounts for their full-time job.

You're just reciting the same tired old rhetoric that security through obscurity is a valid defense mechanism. It's just not.


" If the software is available to everyone white hats are much more likely to find bugs and help fix them. They might actually have skin in the game alongside you."

The state of most FOSS security says otherwise. A better assumption is virtually nobody will review the code for security unless you get lucky. If they do, they won't review much of it. Additionally, unless its design is rigorous, the new features will often add vulnerabilities faster than casual reviewers will spot and fix them. This situation is best for the malware authors.

"White hats won't bother with proprietary software at all"

You mean there's never been a DEFCON or Black Hat conference on vulnerabilities found in proprietary systems + responsible disclosure following? I swore I saw a few.

Regardless, proprietary software should be designed with good QA plus pentesting contracts. Those relying on white hats to dig through their slop are focusing on extra profit instead of security. ;) White hats will also definitely improve proprietary software for small or no payment if they can build a name finding flaws in it. Some even do it on their own for the same reason. This effect goes up if the proprietary software is known for good quality, where finding a bug is more bragworthy.

"You're just reciting the same tired old rhetoric that security through obscurity is a valid defense mechanism. It's just not."

You're misstating my points to create a strawman easier to knock down. I said attacking unknowns takes more effort than attacking knowns. I also said, if monitoring is employed, the odd behavior that comes with exploration increases odds alarms will be set off. These are both provably true. That means obfuscation provably can benefit security. Whether it will varies on case-by-case basis per obfuscation, protected system, and use case.

Feel free to look at my obfuscated options in recent reply to SEJeff to tell me how you'd smash them more easily than a regular box running Linux and OpenSSL whose source & configs are openly published to allegedly benefit their security.


There's a difference between the system being open and the test tools being open. Mozilla has open sourced most of its fuzzers [1], but only after they are no longer finding existing bugs. The fuzzers are then used to prevent regressions.

[1] https://github.com/MozillaSecurity


Not sure why you are getting down-voted. "Security through obscurity is no security at all" is a very true statement. It is the reason why the code for AES is known to everybody.


Except it's not true. Obscurity still has cost to decipher, and the cost may be the ends you're looking for. Not all security-oriented goals are make-or-break.

Hell, if you're talking about a time scale of hours (not uncommon with 0-days), even using a trivial cipher could slow people (attempting to understand (and then fix) your vulnerability) down for long enough to "get away" with the data/transfer/rootkit/whatever.

Obfuscation has its role; it's to retard understanding, not to prevent understanding.


Obfuscation as a way to prevent copyright violation makes sense. Obfuscation as a way to purposefully hide security holes is terrible. "Security through obscurity is not real security" is true, and has nothing to do with obfuscation in general. It has more to do with auditability.

Real security has a quantifiable difficulty to break through. Security through obscurity means the quantity of effort needed to break through is an unknown.

Example:

We do know what it takes to break bcrypt. So if you've implemented bcrypt for security, great. Not obscure, but known to be safe.

We don't know how long it'll take a random black hat to find out you're storing passwords in plaintext but hiding the fact cleverly.

If you release your source code, auditors / the community can quickly see that "oh, storing plaintext passwords is a bad idea" and fix the bug. If you don't, you might not know you're vulnerable, and the obscurity will ultimately cost you for your ineptitude.
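
To make the bcrypt half concrete, here's a minimal sketch in C, assuming a libxcrypt-based system (most modern Linux distros) where crypt() and crypt_gensalt() understand the "$2b$" bcrypt prefix; link with -lcrypt.

    /* Hypothetical sketch: hash a password with bcrypt via libxcrypt.
       The cost factor 12 means 2^12 key-setup rounds, which is exactly
       the quantifiable difficulty described above. */
    #include <crypt.h>
    #include <stdio.h>

    int main(void) {
        /* NULL rbytes asks libxcrypt to draw its own entropy for the salt. */
        char *setting = crypt_gensalt("$2b$", 12, NULL, 0);
        char *hash = crypt("correct horse battery staple", setting);
        printf("store this, not the password: %s\n", hash);
        return 0;
    }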


I guess you can call certain forms of protections people use "real" security versus "unreal" security, but I don't see your demarcation in practice.

> Obfuscation as a way to purposefully hide security holes is terrible.

I misspoke; I meant to say 'obscurity', which is the relevant concept in this thread, and there are most certainly reasons to have security through obscurity: once you've found a flaw, you must fix it before its obscurity vanishes. This is certainly relevant to the development of fuzzers, where novel approaches could reveal 0-days.


Hashing algorithms have historically been mostly obscurity. It turns out we're really good at coming up with functions we think are one way and later find aren't.

MD4 and SHA0 were both once believed to be good...


I don't think it's so much about obscurity as it is about an arms race. Hashing algorithms are constantly being measured up against new exploit methods, faster cracking speeds, etc. It's a feature that we found collisions and other problems, not a bug.

A bug would be us continuing to use those algorithms without being able to mitigate their flaws.

The fact that we can find out that these functions are not as good as we hope and improve upon them is argument against obscurity. You can't do those things unless knowledge of these functions is common knowledge.


The authors have noteworthy bona fides. Andy used to work for the NSA, and Nelly has 16 patents with names like "Informed implicit enrollment and identification".

https://www.linkedin.com/in/andrew-honig-82239750 https://www.linkedin.com/in/nelly-porter-708a10b


Google Compute Engine has live migration, so host/hypervisor updates (including security updates!) can be applied without taking VMs offline. AWS doesn't support live migration. Azure has a partial solution with in-place migrations, which involves taking VMs down for ~30 seconds.

(I work on Google Container Engine.)


> Google Compute Engine has live migration, so host/hypervisor updates (including security updates!) can be applied without taking VMs offline. AWS doesn't support live migration.

I don't work for AWS, so I can only speculate about their internal technology; but they've demonstrated an ability to apply security updates without requiring guests to reboot.


I believe in certain situations they've been able to use ksplice or similar mechanisms.


Yeah, but you can do that with plain-old qemu. It's a mystery to me why more places don't do this.


Can you explain the Azure solution some more please, to the best of your knowledge?


This links to the announcement and has some speculation about how it's implemented. It looks like it might be basically a snapshot and a fast host reboot? https://www.petri.com/new-azure-iaas-features-announced-augu...


Probably means that upgrades are done in-place. They don't move the VM between machines, but they do something like move it from the old process to an upgraded version (or in some other way upgrade the software underneath).


> Non-QEMU implementation: Google does not use QEMU, the user-space virtual machine monitor and hardware emulation. Instead, we wrote our own user-space virtual machine monitor that has the following security advantages over QEMU: [...] No history of security problems. QEMU has a long track record of security bugs, such as VENOM, and it's unclear what vulnerabilities may still be lurking in the code.

Has this alternative VMM/hardware emulator been released? As far as I can tell, the answer is no. In that light, it's more than a little weird to congratulate yourself on not having a "long track record of security bugs" in your internal-use-only unreleased tool compared to generally available software in wide use.


I see that Google has open sourced a VMM written in Go [1], but I doubt this is the VMM discussed in the article. Has Google open sourced this software?

[1] https://github.com/google/novm


Google has not open sourced this for a variety of technical and business reasons. I'm not going to get into all of them, but for example a lot of our device emulation is built upon internal Google services. It wouldn't work outside a Google data center. It's not out of the realm of possibility that Google open sources some parts of it in the future, but I don't know if that will ever happen.


Adding to Andy's point here -- if you consider the services that back GCE in terms of the features offered by our virtual networks, virtual storage devices, etc. they necessarily have tight dependencies on "How Google is Built" to hit the performance targets we aim for.

I, personally, am optimistic that we'll be able to do it for the core at some point. Something not mentioned in the original article is that architecturally our VMM allows us to easily change the set of available functionality (virtual devices, backing implementations of those, etc.). So, for example, the fuzzing Andy described as depending on our custom VMM isn't even linked into the VMM run in GCE, nor are custom devices we keep around for testing and development. I'd very much like it if we could layer the VMM in such a way that we could release a useful core that forms the basis of the VMM we consume internally. Today, though, the layering just isn't quite there (among other technical and business reasons, as Andy says), and there are Googley specifics in places that would make even the VMM core (minus devices, etc.) unviable outside of Google.


Hey folks, thanks for the great comments, keep them coming. I am one of the article authors, AMA.


> Non-QEMU implementation: Google does not use QEMU, the user-space virtual machine monitor and hardware emulation. Instead, we wrote our own user-space virtual machine monitor that has the following security advantages over QEMU: Simple host and guest architecture support matrix. QEMU supports a large matrix of host and guest architectures, along with different modes and devices that significantly increase complexity.

I've noticed that a lot of projects that do support multiple architectures, particularly obscure ones, tend to find oddball edge cases more easily than those that don't. For example, not assuming the endianness of the CPU arch forces you to get network code to cooperate well.

> Because we support a single architecture and a relatively small number of devices, our emulator is much simpler.

No doubt it's simpler than QEMU but I wonder if adding tests to QEMU, even if they're only for the specific architectures they're running (most likely x86-64) would have been just as usable.

Then again when you're Google and have the resources to build a VM runtime from the ground up it's easier to convince management that "This is the right decision!".


> I wonder if adding tests to QEMU, even if they're only for the specific architectures they're running (most likely x86-64) would have been just as usable.

> Then again when you're Google and have the resources to build a VM runtime from the ground up it's easier to convince management that "This is the right decision!".

Is it possible you're either underestimating the effort it takes to make QEMU solid or overestimating the effort it takes to write an emulator?

I worked at a company where I hacked up QEMU as a stopgap before we switched to an in-house solution (this wasn't Google, although I've also worked at Google). I made literally hundreds of bug fixes to get QEMU into what was, for us, a barely usable state and then someone else wrote a solution from scratch in maybe a month and a half or two months. I doubt I could have gotten QEMU into the state we needed in a couple months. And to be clear, when I say bug fixes, I don't mean features or things that could possibly arguably be "working as intended", I mean bugs like "instruction X does the wrong thing instead of doing what an actual CPU does".

BTW, I don't mean to knock QEMU. It's great for what it is, but it's often the case that a special purpose piece of software tailored for a specific usecase is less effort than making a very general framework suitable for the same usecase. Even for our usecase, where QEMU was a very bad fit, the existence of QEMU let us get an MVP up in a week; I applied critical fixes to our hacked up QEMU while someone worked on the real solution, which gave us a two month head start over just writing something from scratch. But the effort it would have taken to make QEMU production worthy for us didn't seem worth it.


KVM does not use any of the CPU/instruction emulation in QEMU. It only uses the device emulation code and the interface to the host (asynchronous I/O, sockets, VNC, etc.).

We are adding unit tests for a lot of new code, and some parts of the code (especially the block device backends) have a comprehensive set of regression tests.

Also, distributions can disable obsolete devices if they wish. Red Hat does that in RHEL, for both security and supportability reasons. So if you want a free hardened QEMU, use CentOS. :-) Several other companies do so, including Nutanix and Virtuozzo.


> disable obsolete devices

Highly recommended!

Venom – A security vulnerability in virtual floppy drive code (~2 years ago)

https://news.ycombinator.com/item?id=9538437


Unfortunately VENOM was not so easy because some OSes (ehm Windows XP but also 2003...) only support driver floppies as opposed to driver CD-ROMs.

But we disable a bunch of old SCSI adapters, NICs, most audio cards, the whole Bluetooth emulation subsystem. All the cross-architecture emulation is also compiled out (x86-on-x86 emulation is still left in, until nested virtualization matures---which the Google folks are helping us with too!---but we only support it for libguestfs appliances).

Furthermore, in RHEL most image formats are forbidden or only supported read-only in the emulator (you can still use qemu-img to convert to and from them). Read-only support can be useful because of virt-v2v, an appliance that reads from VMware or Hyper-V images and tweaks them to run as KVM guests.


Mmm, in this case I suspect a lot of the benefit is not having to carry around the 90% of QEMU that's completely unused in the x86-KVM usecase (ie all of the emulation, all the devices for non-x86, all the random PCI devices you don't care about, etc) -- you don't have to security-audit that (it won't run if you're using QEMU but you'd have to convince yourself of that). Plus you don't need to care about maintaining compatibility with previous QEMU command lines or migration data formats.

Incidentally for instruction emulation the quality is rather variable: for instance I trust the 64-bit ARM userspace instructions pretty well because we were able to random-instruction-sequence test them and worked all the bugs out; x86 emulation I trust rather less because very few people want to emulate that these days because everybody's got the hardware, so bugs don't get found or fixed. QEMU is an enormous million-line codebase which satisfies multiple use cases several of which barely overlap at all, and its level of robustness and testing depends a lot on which parts you're looking at...


They have moved MMIO instruction emulation from KVM to userspace though. This is not yet part of upstream KVM.

I'm not sure how much of the emulation they left in the kernel, but something probably is there because handling simple MOV instructions in the kernel can have a massive effect on performance. Andy, what can you say? :)


That VMX is seriously unfriendly toward full exits to user mode. I have some ideas to mitigate this. Intel could step up and fix it easily if they cared to.

For those unfamiliar with the issue: in a hypervisor like KVM on arcane hardware like x86, switching from guest mode to host kernel mode is considerably faster than switching from guest mode to host user mode. The reason you'd expect is that guest -> host user involves going to host kernel first and then to host user, but the actual kernel->user transition uses SYSRET and is very fast. The problem is that, in VMX (i.e., Intel's VM extensions), a guest exit kicks you back to the host with a whole bunch of the host control register state badly corrupted. To run normal kernel code, the host only needs to fix up some of the state, but to go all the way to user mode, the kernel needs to fix up the state completely, and Intel never tried to optimize control register programming, so this takes a long time (several thousand cycles, I think). I don't know if SVM (AMD's version) is much better.

As just one example, many things on x86 depend on GDTR, the global descriptor table register. VMX restores the GDTR base address on VM exit, but it doesn't restore the GDTR size. Exits to host user mode need to fix up the size, and writing to GDTR is slow.

How hard would it be to instrument the in-kernel emulation to see which instructions matter for performance? I bet that MOV (reg to/from mem) accounts for almost all of it with ADD and maybe MOVNT making up almost all the balance. Instructions without a memory argument may only matter for exploits and for hosts without unrestricted guest mode.

Hmm. Is SYSCALL still busted? The fact that we emulate things like IRET scares me, too.

Edit: added background info
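
To illustrate the GDTR point: a rough sketch, with made-up names and definitely not KVM's actual code, of the kind of fixup the host must do before it can safely return to user mode.

    /* Illustrative sketch only: VMX restores the GDTR base from the
       host-state area on VM exit but forces the limit to 0xFFFF, so
       the host must reload the full descriptor-table register (an
       expensive microcoded operation) before heading to user mode. */
    #include <stdint.h>

    struct __attribute__((packed)) desc_ptr {
        uint16_t limit;
        uint64_t base;
    };

    static inline void reload_host_gdt(const struct desc_ptr *host_gdt) {
        asm volatile("lgdt %0" : : "m"(*host_gdt));
    }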


Well I was thinking of andyhonig but I am not surprised to see you here, either...


Wait, x86 still requires instruction emulation for non-weirdo non-legacy cases? My vague recollection of the KVM Forum talk G. did was that you don't need it for "modern" guests.

(We were talking about emulation-via-just-interpret-one-instruction in userspace in upstream QEMU the other day -- you'd want it for OSX hypervisor.framework support too, after all. And maybe for the corner cases in TCG where you'd otherwise emulate one instruction and throw away the cached translation immediately.)


Apart from the legacy case, you need it for MMIO---KVM for ARM also has a mini parser for LDR/STR instructions.

x86 however has all sorts of wonderful read-modify-write instructions too. You need to support those, but it would still be a small subset of the full x86 instruction set if all you want to support is processors newer than circa 2010.
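
For reference, what userspace sees on that path: by the time a trapped MMIO access reaches the VMM, the kernel-side emulator has decoded the instruction into a plain read or write in kvm_run->mmio, and a read-modify-write shows up as a read exit followed by a write exit. A hedged sketch (function names made up; mmio_read()/mmio_write() stand in for device-model hooks):

    #include <linux/kvm.h>
    #include <stdint.h>

    /* Hypothetical device-model hooks, not a real API. */
    void mmio_read(uint64_t addr, uint8_t *data, uint32_t len);
    void mmio_write(uint64_t addr, const uint8_t *data, uint32_t len);

    void handle_mmio_exit(struct kvm_run *run) {
        if (run->exit_reason != KVM_EXIT_MMIO)
            return;
        if (run->mmio.is_write)
            mmio_write(run->mmio.phys_addr, run->mmio.data, run->mmio.len);
        else
            mmio_read(run->mmio.phys_addr, run->mmio.data, run->mmio.len);
    }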


KVM for ARM doesn't parse instructions -- you can just use the register info the hardware gives you in the syndrome register, which covers everything except oddball cases like trying load-multiple to a device, which doesn't happen in practice and so we don't support it.


Yeah, it still gets hit now and then. It should not get hit often in the typical steady state, though, which is why you can punt it to userspace with little performance penalty.

(I work on the custom VMM we run)


(Echoing Bonzini) You don't need it to be in the kernel for modern guests (performance wise), but you still need it.


Current implementation has everything in userspace. The perf hit hasn't been compelling enough to justify even minor perf improvements.

(Yet-another-Googler: I worked on this and spoke about it at KVM Forum)


Interesting. So ioeventfd is also handled in userspace, I guess.

A couple years ago I measured a huge slowdown on userspace vmexits for guests spanning multiple NUMA nodes, because of cacheline bouncing on tsk->sighand->siglock. Maybe you're not using KVM_SET_SIGNAL_MASK.

(Steve, I suppose?)
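
(For anyone following along: KVM_SET_SIGNAL_MASK is the vcpu ioctl that installs a signal mask applied only while KVM_RUN executes; if I read the parent right, swapping it on each entry/exit is what touches tsk->sighand->siglock. A rough sketch of its use, with a made-up helper name and assuming x86-64's 8-byte kernel sigset:)

    #include <linux/kvm.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>

    int set_run_sigmask(int vcpufd, const sigset_t *unblocked) {
        const int kernel_sigset_size = 8;   /* kernel sigset on x86-64 */
        struct kvm_signal_mask *mask =
            malloc(sizeof(*mask) + kernel_sigset_size);
        mask->len = kernel_sigset_size;
        memcpy(mask->sigset, unblocked, kernel_sigset_size);
        int ret = ioctl(vcpufd, KVM_SET_SIGNAL_MASK, mask);
        free(mask);
        return ret;
    }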


ioeventfd for PIO exits is still handled in the kernel, but that one is easy since it's a dedicated VMEXIT type.

We do very little that typically requires trapping MMIO, particularly in places that are performance sensitive (VIRTIO Net and VIRTIO SCSI do not, and honestly there's not too much that guests do inside GCE that isn't either disk or networking :).
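
(For readers unfamiliar with ioeventfd: it binds an eventfd to a guest I/O address, so a kick completes in the kernel by signaling the fd instead of taking a full userspace exit. A minimal sketch; the helper name, port, and access width are made up for illustration:)

    #include <linux/kvm.h>
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>

    /* "vmfd" is the VM file descriptor returned by KVM_CREATE_VM. */
    int register_pio_doorbell(int vmfd, uint16_t port) {
        int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
        struct kvm_ioeventfd ioev = {
            .addr  = port,
            .len   = 2,                        /* 16-bit OUTs */
            .fd    = efd,
            .flags = KVM_IOEVENTFD_FLAG_PIO,   /* port I/O, not MMIO */
        };
        if (ioctl(vmfd, KVM_IOEVENTFD, &ioev) < 0)
            return -1;
        return efd;   /* poll/epoll this fd to learn about guest kicks */
    }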


You are right, some instructions are not suitable for userspace due to their performance implications and have to stay in the kernel. We identified a small set of them, for example, some parts of IOAPIC support have to stay put.


LAPIC I think? But those get their own special vmexit code so they do not need emulation (on Ivy Bridge or newer Xeons).

IOAPIC is legacy and replaced by MSI. I am surprised you don't use ioeventfd though!


> I am surprised you don't use ioeventfd though!

We do in some cases, for both networking and storage. Since our devices are (mostly) VIRTIO (of pre-1.0 vintage), we're using it for OUTs into BAR0 (which again of course get their own VMEXIT and don't require emulation).

By and large we try to elide the exits entirely if we can, naturally, although in today's GCE production environment serialized request/response type workloads will see exits on every packet. Streaming workloads fare better, as we do make use of EVENT_IDX and aggressively try to find more work before advancing the used.avail_idx field.
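
(For anyone unfamiliar with EVENT_IDX: the suppression check is the one from the virtio spec, mirrored in linux/virtio_ring.h; a kick is needed only when the new index steps past the event index the other side last published:)

    #include <stdint.h>

    /* Wrap-safe 16-bit comparison from the virtio spec: notify only if
       new_idx has moved past event_idx since old_idx. */
    static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                       uint16_t old_idx)
    {
        return (uint16_t)(new_idx - event_idx - 1) <
               (uint16_t)(new_idx - old_idx);
    }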


> Is it possible you're either underestimating the effort it takes to make QEMU solid or overestimating the effort it takes to write an emulator?

Not just possible, it's highly likely.


Are there open-source efforts to create a better emulator?


Intel is working on QEMU-lite, and there are some QEMU alternatives, like novm and kvmtool, that aren't focused on emulating the full PC architecture.


Depends on how you define better. If better means running old games and demos more faithfully, there's DOSEMU and DOSBOX. If better means emulating newer processors on older ones, there's Bochs. None of them supports KVM (except maybe DOSEMU??).

For KVM use, QEMU is pretty much the only choice with support for a wide range of guests, architectures, and features. lkvm (aka kvmtool) doesn't support Windows, UEFI, s390 hosts, live migration, etc.

At the same time, QEMU's binary code translator is improving. As pm215 said elsewhere, 64-bit ARM guest support is much better than x86 support, and we're also working on SMP which is scalable and pretty much state-of-the-art for cross-architecture emulators (of course QEMU is already scalable to multiple guest CPUs when using KVM, but doing it in an emulator is a different story).


> No doubt it's simpler than QEMU but I wonder if adding tests to QEMU, even if they're only for the specific architectures they're running (most likely x86-64) would have been just as usable.

A little later in the post I believe this is somewhat addressed:

> QEMU code lacks unit tests and has many interdependencies that would make unit testing extremely difficult.

Personally based on my previous experience at VMware and passing familiarity with QEMU, I think they made the right call.

(I work at Google, but not on the hypervisor)


> No doubt it's simpler than QEMU but I wonder if adding tests to QEMU, even if they're only for the specific architectures they're running (most likely x86-64) would have been just as usable.

There are some folks who have this view, but doing it from scratch also has the advantage that it integrates much more cleanly with Google's global shared codebase. There's a huge body of existing work that I can leverage more or less trivially. This includes things like Google's internal metrics and monitoring framework, our RPC framework, etc. Yes, you could bolt these onto the side of qemu, but qemu is a C codebase and most of Google (including the custom VMM described in the article) is not.

Additionally, when software is built using the same style, tools, and best practices as the rest of Google's codebase, it makes it easy for other engineers in the company to contribute. We benefit from Google-wide code cleanups, *SAN analysis tools, codebase-wide refactorings that make code easier to reason about the correctness of, etc.

Several years ago I think the question would've been a lot more difficult to answer, but today I think the advantages of the route taken are unambiguous.

(my team owns the virtual network devices visible to GCE VM and the first chunk of the on-host dataplane, one virtual hardware component of the custom VMM we run :)


So you don't run vhost? This post is getting interestinger and interestinger! :-)


We do not. That it's not vhost you can infer (correctly) from the topology of our VIRTIO Net device; specifics on that front will have to wait for another day, though :)


> I've noticed that a lot of projects that do support multiple architectures, particularly obscure ones, tend to find oddball edge cases more easily than those that don't. For example, not assuming the endianness of the CPU arch forces you to get network code to cooperate well.

It's a balancing act, like anything is. Do you add or reject this patch from a contributor? This new feature someone wants, or a bug fixed? Is it better off designed/done in a different way? Is this kind of work maintainable e.g. 3 years from now when I'm not working on it? Can we reliably continue to support these systems or know someone who will care for them? Can we reasonably understand the entire system, and evolve it safely? Do you need more moving parts than are absolutely required? Does that use case even matter, everything else aside? The last one is very important.

Of course there's something to be said about software systems that maintain high quality code while supporting a diverse set of use cases and environments, like you mention.

But -- I'd probably say that, that result? It's likely more a function of focused design than it is a function of "trying to target a lot of architectures". Maybe targeting lots of architectures was a design goal, but nonetheless, the designing aspect is what's important. No amount of portability can fix fundamental misunderstandings of what you're trying to implement, of course. And that's where a lot of problems can creep in.

In this case, they have a very clear view of what they want, and a lot of what QEMU does is very irrelevant. It may be part of QEMU's design to be portable, but ultimately it is a lot of unnecessary, moving parts. You can cut down its scope dramatically with this knowledge -- from a security POV, that's very often going to be a win, to remove that surface area.

(Also, I'm definitely not saying QEMU is bloated or something, either. It's great software and I use it every day, just to be clear.)


The KVM API is quite simple: https://lwn.net/Articles/658511/
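
For the curious, here's a condensed sketch of the flow that article walks through, with error handling stripped: enough to run a few bytes of real-mode guest code that writes a character to a port and halts.

    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int main(void) {
        const uint8_t code[] = {
            0xba, 0xf8, 0x03,   /* mov dx, 0x3f8 */
            0xb0, 'H',          /* mov al, 'H'   */
            0xee,               /* out dx, al    */
            0xf4,               /* hlt           */
        };

        int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);

        /* One 4 KiB slot of guest "RAM" at guest-physical 0x1000. */
        uint8_t *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        memcpy(mem, code, sizeof(code));
        struct kvm_userspace_memory_region region = {
            .slot = 0,
            .guest_phys_addr = 0x1000,
            .memory_size = 0x1000,
            .userspace_addr = (uint64_t)mem,
        };
        ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);

        int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
        int runsz = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, NULL);
        struct kvm_run *run = mmap(NULL, runsz, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpufd, 0);

        /* Real mode, segment base 0; start executing at 0x1000. */
        struct kvm_sregs sregs;
        ioctl(vcpufd, KVM_GET_SREGS, &sregs);
        sregs.cs.base = 0;
        sregs.cs.selector = 0;
        ioctl(vcpufd, KVM_SET_SREGS, &sregs);
        struct kvm_regs regs = { .rip = 0x1000, .rflags = 0x2 };
        ioctl(vcpufd, KVM_SET_REGS, &regs);

        for (;;) {
            ioctl(vcpufd, KVM_RUN, NULL); /* blocks until the next exit */
            switch (run->exit_reason) {
            case KVM_EXIT_IO:   /* our OUT; data sits just past kvm_run */
                putchar(*((char *)run + run->io.data_offset));
                break;
            case KVM_EXIT_HLT:
                return 0;
            }
        }
    }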


I'm one of the authors of this post, AMA.


Right now there doesn't seem to have been a lot of work in exposing security features to guests. Is there interest in supporting stuff like Secure Boot or any form of runtime attestation?


QEMU + KVM + OVMF supports Secure Boot (including SMM emulation for the trusted base), or are you talking specifically of GCE?


Yeah, specifically GCE


Thanks for the feedback, unfortunately I can't publicly comment on future product plans.


AM(almost)A :)


This sounds much more secure than AWS and their use of Xen.


How so? What leads you to believe that Amazon hasn't taken similar measures?


https://aws.amazon.com/security/security-bulletins/

search "Xen Security" and "XSA Security"

Many do say AWS is not affected; it is a bit of effort to review them all. The newest I found is: https://aws.amazon.com/security/security-bulletins/XSAsecuri... (Dec. 2015)

Interestingly, AWS was not affected by the August 2016 XSA Security Advisory (XSA-182) that impacted Qubes OS: http://blog.quarkslab.com/xen-exploitation-part-3-xsa-182-qu...

https://aws.amazon.com/blogs/aws/ec2-maintenance-update/ (Sep. 2014)


Amazon isn't confident enough in their solution: they will only support customers with high levels of regulatory and compliance requirements (e.g. HIPAA) if those customers use dedicated physical hosts.

https://aws.amazon.com/ec2/dedicated-hosts/

https://aws.amazon.com/blogs/security/frequently-asked-quest...


Does Google Cloud support HIPAA workloads on regular shared instances?

I thought HIPAA itself mandates physical dedicated host separation.



Does anyone have a better summary of the Xen vs KVM security debate? The best I know of is the brief Qubes architecture document from 2010: https://www.qubes-os.org/attachment/wiki/QubesArchitecture/a... (Section 3).

KVM using Google's hardened device emulation may eliminate some of Qubes's QEMU concerns.

Qubes author critique on the state of Xen security: https://lists.xen.org/archives/html/xen-devel/2015-11/msg006...


If there is a DigitalOcean engineer in the house, I'd be curious to know how this compares to how they run their KVM setup.


Given that it took 4 years of continuous customer requests just to get support for booting a custom kernel on Digital Ocean, I'd be surprised if they're doing anything this elaborate.


This is a great example of the advantages only a tiny number of technical companies have. It is great technology and great that Google can do this, but who has the resources to competently implement a hypervisor and customize it to their infrastructure?


Vultr has supported custom ISOs for a long, long time now.


Among several others. DO are going for the 80% case, not full spectrum capability.


It's usually the elaborate, "cloudy" stuff that prevents booting custom OSes (mostly just networked storage, e.g. https://github.com/scaleway/image-proposals/issues/11 Scaleway would need a Network Block Device driver to run FreeBSD).

Normal "VPS" providers often support this, e.g. prgmr just lets you boot into a recovery image (Linux, FreeBSD, NetBSD) and do whatever you want to the virtual hard drive, install anything that runs on Xen there. It's like running a Xen VM on your own computer, except it's someone else's computer.

Interestingly, Digital Ocean uses local storage. I guess the issue was just with the boot process.


> Alerts are generated if any projects that cause an unusual number of Rowhammer errors.

Is that correct English?


Almost; I think it parses correctly by substituting 'if' with 'for'.


Thanks for catching. Fixed.


By the way, FreeBSD's vmm doesn't use QEMU, a fresh new userspace part (bhyve) was written. I think the same is happening with OpenBSD's vmm. It seems kinda weird that Linux KVM is commonly used with QEMU…


Q. How do you secure your infrastructure Google? A. We hire thousands of programmers to write custom software. There are no known vulnerabilities. No. Known. Vulnerabilities.


That's not even an accurate representation of what is publicly claimed; Google runs a VRP specifically because it acknowledges that software sometimes has bugs and it's committed to fixing them. Google also employs quite a few of the security researchers that contribute to identifying and fixing security issues both within and beyond Google.


I don't work for Google but have seen KVM Forum presentations and mailing list discussions where they contributed back to KVM. They have world-class security researchers working on keeping KVM secure.

Your comment implies they rely on security through obscurity and it misses the legitimate security work they are doing. Did you read anything at all before posting your comment?

Start here: http://www.linux-kvm.org/images/f/f6/01x02-KVMHardening.pdf



