An EPYC escape: Case-study of a KVM breakout (googleprojectzero.blogspot.com)
239 points by headalgorithm on June 29, 2021 | 40 comments



The old Theo quote... "You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes."


There are three ways of looking at this: one in which he's wrong, and two in which he's saying something much less interesting than it sounds like he's saying.

As this P0 post says repeatedly: KVM has a relatively small attack surface. It's audited pretty carefully relative to the rest of the kernel. The idea behind KVM-based workload isolation is that it trades a very large attack surface (the entire kernel) for one that is by definition much smaller (a subsystem of the kernel that handles virtualization). The rest of the kernel runs behind that subsystem, as far as isolated applications are concerned. This is a very good trade in practice. The Linux kernel (really, any Unix kernel) is much less trustworthy than kernel VMM drivers.

So if de Raadt means to be saying that virtualization is by design less secure than shared-kernel isolation, he's wrong, just sort of plainly.

On the other hand, there are at least two valid points he can be making with this line.

First: virtualization systems consist of more than just the kernel virtualization driver. If you include QEMU in the mix, for instance, it's debatable whether you've gained much over a single exposed Linux kernel. Especially at the time de Raadt wrote this, it would be totally fair game to say that virtualization was a security shitshow compared to jailing processes or whatever. Of course, the future belongs to memory-safe VMMs that use a smaller and smaller subset of memory-unsafe kernel code.

Second, it's just hard to write anything without security holes, so if "it's not bug free" is the dunk here, well, let him cast the first stone, &c.


Option #4: Theo was responding to virtualization advocates who claimed VMs offered security isolation as good as or better than physically separate boxen.

At the time, and in some cases still today, plenty of advocates make that claim. Many others just assume the truth of it, if only because to question it would cause cognitive dissonance with the prevalence of cloud hosting. (Notwithstanding that some savvier companies use EC2 much like they would a traditional server leasing provider, using instance types that take up the entire machine. Security and convenience are sometimes a trade-off, but some bargains are better than others if you don't succumb to simplistic, categorical claims. Which is what Theo was railing against.)

EDIT: For context, here's the original post https://marc.info/?l=openbsd-misc&m=119318909016582. It's from 2007, when hardware virtualization extensions were new and all VMMs had to emulate network interface cards and similar hardware. These days virtio devices are commonplace, which helps substantially reduce the footprint. OpenBSD even has its own native VMM, which of course only supports virtio devices.


Sure. I 100% think there's a reasonable reading of de Raadt's virtualization take; you don't even have to be charitable to find it. It's just not the take that people who try to dunk with it on message boards are reading.


Surname convention trivia: The lower case "de" is only used when preceded by another part of the name, like first name, initials, or another part of the last name. When used on its own, or prefixed by a title like Mr., then "De Raadt" is correct. Here's a description of this: https://www.dutchgenealogy.nl/how-to-capitalize-dutch-names-...

So e.g. you would write Vincent van Gogh, V. van Gogh, Mr. Van Gogh, or Van Gogh.


Taking that quote at face value, it is shallow and needlessly binary. It's shallow because it is a truism (paraphrasing: you cannot write bug-free software). It is needlessly binary because there is such a thing as the size of your trusted computing base.


Bug-free, maybe not. But without security holes? You definitely can.


This is a ridiculous statement.

The context, and the forum, are what make it important.

Given your response logic, we could take any quote by any person and apply the same. Arguably, your response deserves the same.

To remain constructive, I suggest you research further. As my ongoing engagement (employment) and general interest in Information Technology continue, I cannot help but further relate to Theo's attitude... lol


A smaller attack surface should still lead to fewer security holes, though.


While reading over that kernel code, I kept wondering how it could possibly be code reviewed before landing. The amount of context required to spot mistakes must be overwhelming.

If there was ever a case for Literate Programming, I'd vote for using it on kernel code before I would on a rando CRUD app.


People think that because what they are doing is arcane or complicated, the code then has the right to be arcane and complicated.

This is not a local effect. Many compilers have more use of goto than you might expect: because goto is acceptable in some parts of a compiler, it becomes less unacceptable in parts where it should remain unacceptable.

It doesn't help that C is an awful programming language, which actively encourages repetition and the expansion of the surface area from which bugs can occur.


KVM is stable, small, and well fuzzed. I'm not super surprised this CVE is in the nested support, since it's complicated and in flux. Last time I ran cloc on x86 KVM it was only ~50k lines, but that was years ago.

The amount of context necessary for virt is sort of silly. IMO most of the complexity is not from the code style (though kernel code style doesn't help). It's mostly that reading KVM requires thorough context on the underlying architecture, and its virtualization features.

That said, I'd really like to see parts of KVM replaced with Rust code (particularly the x86 instruction emulator). Shoving it into a safe language would give me some extra faith.


It makes you wonder what else is lurking in there, yet to be discovered.

Agree on your concern around code review. It's even worse than that, though, because it seems that the same internalized context that allows you to see mistakes also allows your brain to buy in to the same assumptions that allowed the bug to be written in the first place.

I spent most of my career in infosec. When I write code I try to follow most of the practices I've been preaching, but I certainly don't make fewer mistakes.


Huge difference between knowing better and doing better.


Remember too that the majority of the Linux kernel isn't covered by any unit or integration tests.


A unit test isn't going to spot a bug like this, nor would a pre-written integration test. The kernel is extensively fuzzed.


What methods/techniques/processes would have caught this? What's the most cost effective of those? (Are those even the right questions?)


Formal verification, assuming you can even specify the problem well enough.


Fuzzing is probably the most likely, but even then...


That was a really good read. And nothing that something like Rust would solve: just userspace with access to more than one core outracing the kernel between the validation checks and the action.
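
To make the bug class concrete: it's a double fetch, where the host reads a guest-writable value once to validate and again to use. A minimal sketch in Rust (all names hypothetical, and note it's entirely safe code, which is the point):

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;
    use std::thread;

    fn main() {
        // A guest-controlled length field, shared with the "host" loop below.
        let shared_len = Arc::new(AtomicUsize::new(8));
        let buf = [0u8; 16];

        let attacker = {
            let shared_len = Arc::clone(&shared_len);
            thread::spawn(move || {
                // The "guest vCPU" flips the length between valid and invalid.
                for _ in 0..1_000_000 {
                    shared_len.store(8, Ordering::Relaxed);
                    shared_len.store(4096, Ordering::Relaxed);
                }
            })
        };

        for _ in 0..1_000_000 {
            // Check (fetch #1): the value looks in-bounds...
            if shared_len.load(Ordering::Relaxed) <= buf.len() {
                // Use (fetch #2): ...but may have changed in between.
                let len = shared_len.load(Ordering::Relaxed);
                if len > buf.len() {
                    println!("TOCTOU window hit: validated <= 16, used {}", len);
                    break;
                }
            }
        }
        attacker.join().unwrap();
    }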


I believe that Rust actually would allow you to protect against this sort of thing if you integrated the low-level API with Rust's ownership/linearity system. Most Rust APIs for embedded SoCs ("HALs") do this correctly.

In the example in the OP, ownership rules were violated, as there were two direct mutable handles to the same object.
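
As a hedged sketch of what those HALs do (made-up names): the hardware handle can be taken exactly once, and mutation requires &mut self, so safe code can never hold two live mutable handles to the same block.

    use std::sync::atomic::{AtomicBool, Ordering};

    static TAKEN: AtomicBool = AtomicBool::new(false);

    // Stand-in for a memory-mapped or guest-shared structure.
    pub struct GuestConfigBlock {
        data: [u8; 64],
    }

    impl GuestConfigBlock {
        /// Returns the one and only handle, or None if it was already taken.
        pub fn take() -> Option<GuestConfigBlock> {
            if TAKEN.swap(true, Ordering::AcqRel) {
                None
            } else {
                Some(GuestConfigBlock { data: [0; 64] })
            }
        }

        /// Writes require `&mut self`, so the borrow checker forbids a
        /// second simultaneous writer.
        pub fn write(&mut self, off: usize, val: u8) {
            self.data[off] = val;
        }
    }

    fn main() {
        let mut block = GuestConfigBlock::take().expect("first take succeeds");
        block.write(0, 1);
        assert!(GuestConfigBlock::take().is_none()); // a second take fails
    }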


Trying to impose Rust-like “only one mutable reference at a time” semantics on a VM host’s management of guest memory is somewhere between a performance disaster and a complete nonstarter. We’re talking about TOCTOU where the writer is a vCPU. There are two big problems with Rust-like semantics:

1. Revoking access from other vCPUs is very expensive. It would be a huge performance loss to freeze all vCPUs or TLB-shootdown them just to read guest memory for emulation.

2. Multiple vCPUs at once fundamentally mutate the same memory. Shared-memory ISAs aren’t Rust.

What could be done is to have all access to guest memory be unsafe and to strongly encourage reading the whole block at once. This might be tricky to make work cleanly and efficiently.
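
A sketch of that discipline (hypothetical names, with a local struct standing in for guest memory): snapshot the whole block once, then validate and act only on the snapshot.

    use std::ptr;

    #[repr(C)]
    #[derive(Clone, Copy)]
    struct VmcbSnippet {
        efer: u64,
        cr0: u64,
    }

    /// # Safety
    /// `guest_ptr` must point to readable guest-shared memory holding
    /// at least `size_of::<VmcbSnippet>()` bytes.
    unsafe fn snapshot(guest_ptr: *const VmcbSnippet) -> VmcbSnippet {
        // One volatile copy: every later check and use sees the same
        // bytes, no matter what a vCPU writes afterward.
        unsafe { ptr::read_volatile(guest_ptr) }
    }

    fn main() {
        let guest_shared = VmcbSnippet { efer: 0x1500, cr0: 0x8000_0031 };
        let snap = unsafe { snapshot(&guest_shared) };
        // Validate the copy, then act on the copy only.
        let svme = snap.efer & (1 << 12) != 0; // EFER.SVME
        println!("efer={:#x} cr0={:#x} svme={}", snap.efer, snap.cr0, svme);
    }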


I think the way you’d want to implement it would be to only allow atomic copies (no aliasing) of the configuration blocks. There might be a way to do it efficiently without just pausing all the virtual cores. For example, you could blow out the memory mapping to that block so that it would trigger a page fault if and only if the guest tried to access that block, and you could make the page fault handler block until the supervisor was done copying. Or something along those lines.


Blowing away the memory mapping is a rather expensive operation. It’s also unclear why an atomic copy is useful. To prevent TOCTOU, you also need to prevent two consecutive atomic copies that inadvertently assume they get the same result each time.


I've done something really similar (albeit in modern C++) that created a type-safe guard around DMAed structures to provide mutual exclusion between the device and the driver (and to handle the appropriate cache management as the buffer transitions on non-coherent systems, but that part isn't really applicable to this article).
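
In Rust terms the same idea might look like a typestate guard (a hedged sketch with made-up names, cache maintenance elided): handing the buffer to the device consumes the driver's handle, so the driver can't touch the memory mid-transfer.

    struct DmaBuffer {
        data: Vec<u8>,
    }

    /// While this token exists, only the device may access the memory.
    struct DeviceOwned {
        buf: DmaBuffer,
    }

    impl DmaBuffer {
        fn new(len: usize) -> Self {
            DmaBuffer { data: vec![0; len] }
        }

        /// Hand the buffer to the device. Consumes `self`, so no
        /// driver-side handle can outlive the transfer. (A real driver
        /// would also do a cache clean here on non-coherent systems.)
        fn give_to_device(self) -> DeviceOwned {
            DeviceOwned { buf: self }
        }
    }

    impl DeviceOwned {
        /// Reclaim after the device signals completion. (A real driver
        /// would do a cache invalidate here before reading.)
        fn reclaim(self) -> DmaBuffer {
            self.buf
        }
    }

    fn main() {
        let buf = DmaBuffer::new(4096);
        let in_flight = buf.give_to_device();
        // buf.data[0] = 1; // would not compile: `buf` has been moved
        let buf = in_flight.reclaim();
        assert_eq!(buf.data.len(), 4096);
    }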


How do kernels written in {C, Rust} handle the fact that userspace code can create UB situations in kernel code, and prevent userspace memory unsafety from causing the kernel to malfunction?


Typically by simply not having important data structures like that in untrusted memory. You're either dealing with raw, fairly untyped buffers, or structs that get marshalled into kernel space from user space before even being validated. I did hear about some similar TOCTOU bugs with seccomp filtering, though, which required a fix to move the marshalling to happen before the BPF filter was run.
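
A minimal sketch of that marshal-then-validate pattern (hypothetical names; a byte slice stands in for user memory):

    #[repr(C)]
    #[derive(Clone, Copy)]
    struct Request {
        op: u32,
        len: u32,
    }

    /// Copy the struct out of user-controlled memory first...
    fn copy_from_user(user_mem: &[u8]) -> Option<Request> {
        if user_mem.len() < std::mem::size_of::<Request>() {
            return None;
        }
        let mut req = Request { op: 0, len: 0 };
        unsafe {
            std::ptr::copy_nonoverlapping(
                user_mem.as_ptr(),
                &mut req as *mut Request as *mut u8,
                std::mem::size_of::<Request>(),
            );
        }
        Some(req)
    }

    fn handle(user_mem: &[u8]) {
        // ...then validate and use only the kernel-side copy, which
        // userspace cannot mutate between the check and the use.
        let Some(req) = copy_from_user(user_mem) else { return };
        if req.len as usize <= 4096 {
            println!("op={} len={}", req.op, req.len);
        }
    }

    fn main() {
        handle(&[1, 0, 0, 0, 16, 0, 0, 0]); // op=1, len=16 (little-endian)
    }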


True, although this kind of reasoning against Rust, or any other memory-safe systems programming language for that matter, always misses the point.

Seatbelts, protective jackets, and helmets don't prevent death in every accident either, yet plenty of people appreciate still being around because they were wearing them when it mattered.


Wonder no more: hardly anything in Linux has ever been meaningfully reviewed. Code looks like it couldn't possibly pass review because it can't, and didn't.


> There is, instead, a somewhat involved (if somewhat informal) process designed to ensure that each patch is reviewed for quality and that each patch implements a change which is desirable to have in the mainline.

- https://www.kernel.org/doc/html/latest/process/2.Process.htm...


That's basically the process by which gatekeepers tell newcomers to go away. It doesn't really apply to the vast majority of changes from established contributors. You only have to hang out on l-k for a few minutes to get the flavor of things.

The only patches getting thorough code reviews are getting them from organizations with their own internal code review culture.


Do you have evidence for your claims?


> Wonder no more: hardly anything in Linux has ever been meaningfully reviewed.

Incorrect.

> Code looks like it couldn't possibly pass review because it can't, and didn't.

What's this supposed to mean? Are you talking about a specific piece of code, all of Linux, or some strange general statement? No matter what, it seems wrong, if nothing else because you seem to be ascribing your own unique definition to the concept of "review" or what code must look like in order to "pass review".

It's clear that you have some dislike for Linux, maybe technical maybe personal, maybe justified maybe not. Unfortunately, though, even if it is a well-founded technical dislike, it has clouded your vision so much that you're lashing out and posting trivially disproven nonsense like this on an internet forum. I think it's time for you to take a breath and think about what you are hoping to achieve.

Linux has extensive review efforts, many of which are public, but also many more private ones that you can nevertheless see the evidence of, both pre- and post-merge, which I'm almost certain you would know of unless you're just a completely clueless troll.

So given that, do you think it bolsters your credibility, or achieves your goal of readjusting people's opinion of Linux, to blurt out such garbage? I don't think it does, but I think you sound frustrated and angry, perhaps because nobody will listen to you.

Let's start with a clean slate and let's hear your misgivings about Linux, the people involved with it, the process, etc.


> Assuming our guest can get full unrestricted access to any MSR (which is only a question of timing thanks to init_on_alloc=1 being the default for most modern distributions)

Can someone elaborate on how init_on_alloc would be helpful to an attacker?


My guess is because the exploit requires zeroing out an ACL bitmap that the host uses to control MSR access, and init_on_alloc zeroes out memory as it's allocated, which is the state you want as an attacker here.
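
If that guess is right, the shape of the problem is roughly this (hedged sketch with simplified indexing; the real SVM MSR permission map uses two bits per MSR across several ranges). In the MSRPM a set bit means "intercept this MSR", so an all-zero bitmap passes every access straight through:

    fn msr_intercepted(msrpm: &[u8], bit_index: usize) -> bool {
        msrpm[bit_index / 8] & (1 << (bit_index % 8)) != 0
    }

    fn main() {
        // A freshly allocated, init_on_alloc-style zeroed bitmap...
        let msrpm = vec![0u8; 8192];
        // ...intercepts nothing, which is exactly the state an attacker
        // wants the host to end up using.
        assert!(!msr_intercepted(&msrpm, 0x1234));
        println!("zeroed MSRPM: no MSR access intercepted");
    }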


I wonder if KASAN could've helped here.


It's a TOCTOU bug, so probably not.


This is a work of art.

Thanks for posting.


I wonder if SeKVM is also affected https://news.ycombinator.com/item?id=27434073


The article does a decent job of trying to explain things for non-experts, but there are some acronyms that are non-obvious, like model-specific register (MSR) and Extended Feature Enable Register (EFER). It may be too hard a read if you don't know those, though.



