Hacker News
We replaced Firecracker with QEMU (hocus.dev)
409 points by hugodutka on July 10, 2023 | 149 comments



At CodeSandbox we use Firecracker for hosting development environments, and I agree with the points raised. Though I don't think that means you shouldn't use Firecracker for running long-lived workloads.

We reclaim memory with a memory balloon device, for the disk trimming we discard (& compress) the disk, and for i/o speed we use io_uring (which we only use for scratch disks, the project disks are network disks).

It's a tradeoff. It's more work and does require custom implementations. For us that made sense, because in return we get a lightweight VMM that we can more easily extend with functionality like memory snapshotting and live VM cloning [1][2].

[1]: https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...

[2]: https://codesandbox.io/blog/cloning-microvms-using-userfault...


I don't know if this is relevant, but I've been intrigued by DragonflyBSD's "vkernel" [0] feature which (supposedly) allows for cloning the entire runtime state of the machine (established TCP connections, etc.) into a completely new userland memory space. I think they use it mostly for kernel debugging right now, but it's interesting to think about the possibilities of being able to just clone an entire running operating system to a new computer without interrupting even a single instruction.

[0] https://www.dragonflybsd.org/docs/handbook/vkernel/


These blogs are wonderful. I'd read them back when I was figuring out Firecracker snapshot/restore, and wanted to say so here.

> i/o speed we use io_uring

custom io_uring based driver for the VM block devices? or what do you mean here?


Thank you!

> custom io_uring based driver for the VM block devices? or what do you mean here?

We're using the async io backend that's shipped with Firecracker for our scratch disks.
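For anyone curious, the io_uring backend is selected per drive in Firecracker's API; a sketch of a scratch-disk config (socket path, drive ID, and image path here are made up) looks something like:

```shell
# Attach a scratch drive using Firecracker's io_uring ("Async") backend.
# Requires a recent Firecracker and a kernel with io_uring support; the
# socket path and drive details below are hypothetical.
curl --unix-socket /run/firecracker.sock \
  -X PUT "http://localhost/drives/scratch" \
  -H "Content-Type: application/json" \
  -d '{
        "drive_id": "scratch",
        "path_on_host": "/var/lib/vms/scratch.ext4",
        "is_root_device": false,
        "is_read_only": false,
        "io_engine": "Async"
      }'
```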


Someone posted this and then immediately deleted their comment: https://qemu.readthedocs.io/en/latest/system/i386/microvm.ht...

I didn't know it existed until they posted, but QEMU has a Firecracker-inspired target:

> microvm is a machine type inspired by Firecracker and constructed after its machine model.

> It’s a minimalist machine type without PCI nor ACPI support, designed for short-lived guests. microvm also establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.
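For reference, a minimal microvm boot looks roughly like this (kernel and rootfs paths are placeholders); note it uses virtio-mmio devices since there's no PCI:

```shell
# Boot a microvm guest; vmlinux and rootfs.ext4 are placeholder paths.
# virtio-blk-device (mmio) is used instead of the usual PCI variant.
qemu-system-x86_64 \
  -M microvm,x-option-roms=off,rtc=off \
  -enable-kvm -cpu host -m 512m -smp 2 \
  -kernel vmlinux -append "console=ttyS0 root=/dev/vda rw" \
  -nodefaults -no-user-config -nographic -serial stdio \
  -drive id=root,file=rootfs.ext4,format=raw,if=none \
  -device virtio-blk-device,drive=root
```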


"the fork was very very bad for eating soup - this is a story about how we migrated to a spoon"

...firecracker does fine at what it was designed for: short-running, fast-start workloads.

(oh, and the article starts by slightly misusing a bunch of technical terms, firecracker's not technically a hypervisor per se)


It's not that simple. Many other companies running longer jobs, including their competition, use Firecracker.

So while Firecracker was designed for things running just a few seconds, there are many places running jobs on it for way longer than that.

The problem is that if you want to make it work with long-running general-purpose images you don't control, you have to put a ton of work into making it behave nicely at all levels of your infrastructure and code... which is costly... which a startup competing on an online dev environment (as opposed to, e.g., a VM hosting service) probably shouldn't waste time on.

So AFAIK the decision in the article makes sense, but the reasons listed for it are oversimplified to the point you could say they aren't quite right. Idk why; it could be anything from the engineer genuinely believing that, to them avoiding friction with some shareholder/project lead who's obsessed with "we need to use Firecracker because the competition does too".


..so is it more to support directly deploying functions to the cloud? Like, what AWS Lambda and CloudFront Functions might be built on?


I'm pretty sure firecracker was literally created to underlie AWS Lambda.

EDIT: Okay, https://www.geekwire.com/2018/firecracker-amazon-web-service... says my "pretty sure" memory is in fact correct.


That being said, firecracker also runs long-running tasks on AWS in the form of Fargate


As does the paper [1] with details in section 4.1.

[1]: https://www.usenix.org/system/files/nsdi20-paper-agache.pdf


yes, it was created originally for AWS Lambda

mainly it's optimized to run code only briefly (init time max 10s, max run time 15 min, and default max request time 130s AFAIK)

also it's focused on thin serverless functions: e.g., deserialize some request, run some simple business logic, then delegate to other lambdas based on it. These kinds of functions often have similar memory usage per call, and if a call is an outlier the VM instance can just be discarded soon after (i.e., at most after a new instance has started up, so at most ~10s later)


"Firecracker's RAM footprint starts low, but once a workload inside allocates RAM, Firecracker will never return it to the host system."

Firecracker has a balloon device you can inflate (ie: acquire as much memory inside the VM as possible) and then deflate... returning the memory to the host. You can do this while the VM is running.

https://github.com/firecracker-microvm/firecracker/blob/main...
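The flow is roughly: attach the balloon before boot, then PATCH its target size at runtime. A sketch (the API socket path is hypothetical):

```shell
SOCK=/run/firecracker.sock   # hypothetical API socket path

# Before boot: attach a balloon device to the VM.
curl --unix-socket "$SOCK" -X PUT "http://localhost/balloon" \
  -d '{"amount_mib": 0, "deflate_on_oom": true, "stats_polling_interval_s": 1}'

# While the VM runs: inflate to 1 GiB (the guest driver hands that much
# memory back to the host), then deflate to return it to the guest.
curl --unix-socket "$SOCK" -X PATCH "http://localhost/balloon" \
  -d '{"amount_mib": 1024}'
curl --unix-socket "$SOCK" -X PATCH "http://localhost/balloon" \
  -d '{"amount_mib": 0}'
```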


The first footnote says: "If you squint hard enough, you'll find that Firecracker does support dynamic memory management with a technique called ballooning. However, in practice, it's not usable. To reclaim memory, you need to make sure that the guest OS isn't using it, which, for a general-purpose workload, is nearly impossible."


> is nearly impossible

for many mostly "general purpose" use cases it's quite viable, or else ~fly.io~ AWS Fargate wouldn't be able to use it

this doesn't mean it's easy to implement the necessary automated tooling etc.

so depending on your dev resources and priorities it might be a bad choice

still, I feel the article was being subtly judgmental, moving some parts quite relevant to its content into a footnote and omitting that this "supposedly unusable" feature is used successfully by various other companies...

like it was written by an engineer being overly defensive about their decision, having had to defend it for the 100th time because shareholders, customers, and higher-level management just wouldn't shut up about "but that competitor uses Firecracker"


> which, for a general-purpose workload, is nearly impossible

That depends on the workload and the maximum memory allocated to the guest OS.

A lot of workloads rely on the OS cache/buffers to manage IO, so unless RAM is quite restricted you can ask the kernel to release that pretty easily prior to having the balloon driver do its thing. In fact I'd not be surprised to be told the balloon process does this automatically itself.
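E.g., dropping the clean page cache from inside the guest before inflating the balloon is a one-liner (run as root in the guest):

```shell
# Flush dirty pages to disk, then drop the clean page cache, dentries
# and inodes, so the balloon driver has free pages to reclaim.
sync
echo 3 > /proc/sys/vm/drop_caches
```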

If the workload does its own IO management and memory allocation (something like SQL Server, which will eat what RAM it can and does its own IO caching), or the VM's memory allocation is too small for OS caching to be a significant use after the rest of the workload (you might pare memory down to the bare minimum like this for a “fairly static content” server that doesn't see much variation in memory needs and can be allowed to swap a little if things grow temporarily), then I'd believe it is more difficult. That is hardly the use case for Firecracker though, so if that is the sort of workload being run, perhaps reassessing the tool used for the job was the right call.

Having said that my use of VMs is generally such that I can give them a good static amount of RAM for their needs and don't need to worry about dynamic allocation, so I'm far from a subject expert here.

And, isn't firecraker more geared towards short-lived VMs, quick to spin up, do a job, spin down immediately (or after only a short idle timeout if the VM might answer another request if one comes in immediately or is already queued), so you are better off cycling VMs, which is probably happening anyway, than messing around with memory balloons? Again, I'm not talking from a position of personal experience here so corrections/details welcome!


I'm struggling to understand how qemu with free page reporting isn't exactly the same as a firecracker balloon.
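(For reference, enabling it in QEMU is a single knob on the balloon device; a sketch, with the rest of the machine config elided:)

```shell
# QEMU 5.1+: with free-page-reporting=on, the guest balloon driver
# periodically reports freed pages and the host releases them (via
# MADV_FREE), shrinking the QEMU process RSS without manual inflation.
qemu-system-x86_64 -enable-kvm -m 4G \
  -device virtio-balloon,free-page-reporting=on \
  ...   # remaining machine/disk/network options as usual
```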


Yeah, it's a pretty hard problem, as you'd need to defragment physical memory (while fixing all the virtual-to-physical mappings) to make a contiguous block to free.


A bit disingenuous to make a broad sweeping claim, then have a footnote which contradicts that claim, and upon closer inspection even that claim is incorrect.

It's absolutely usable in practice, it just makes oversubscription more challenging.


That, and the fact that this came after "several weeks of testing", tells me this team doesn't have much virtualization experience. Firecracker is designed to quickly virtualize one headless stateless app (like a container), not run hundreds of different programs in a developer environment.


Yes, we use this at CodeSandbox for reclaiming memory to the host (and to reduce snapshot size when we hibernate the VM).


I really want VM's to integrate 'smarter' with the host.

For example, if I'm running 5 VM's, there is a good chance that many of the pages are identical. Not only do I want those pages to be deduplicated, but I want them to be zero-copy (ie. not deduplicated after-the-fact by some daemon).

To do that, the guest block cache needs to be integrated with the host block-cache, so that whenever some guest application tries to map data from disk, the host notices that another virtual machine has already caused this data to be loaded, so we can just map the same page of already loaded data into the VM that is asking.


This seems like a security issue waiting to happen when you’re running code from different users.


https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.ht...

zero-copy is harder as one system upgrade on one of them will trash it, but KSM is overall pretty effective at saving some memory on similar VMs


KVM has KSM (kernel samepage merging) since a long time ago that de-duplicates pages.
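Enabling it is just a couple of sysfs writes on the host (root required); note KSM only scans regions an application has madvise()d as MADV_MERGEABLE, which QEMU does for guest RAM by default:

```shell
# Turn on KSM scanning (as root on the host).
echo 1 > /sys/kernel/mm/ksm/run
# Optionally tune how many pages each scan pass examines:
echo 1000 > /sys/kernel/mm/ksm/pages_to_scan
# Check how many pages are actually being shared:
cat /sys/kernel/mm/ksm/pages_sharing
```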


It has side channel attacks so be careful when enabling: https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)


But that makes a copy first, and only later notices that the pages are the same and merges them again.

Better to not make copies in the first place.


How are you going to know in advance that the pages are going to be the same?

e.g. your guest kernel is loading an application into memory, by reading some parts of an ELF file from disk. Presumably each VM has its own unique disk, so the hypervisor can't know that this is "the same" page of data as another VM has without actually reading it into memory first and calculating a hash or something.

If the VMs share a disk image (e.g. the image is copy-on-write), then I could see it being feasible - e.g. with KVM, even if your VMs are instantiated by distinct userspace processes, they would probably share the pages as they mmap the same disk image. You would still need your virtualised disk device to support copy-on-write, which may or may not be possible depending on your use case.

But your copy-on-write disk images will probably quickly diverge in a way that makes most pages not shareable, unless you use some sort of filesystem optimised for that.

Lastly, since you mentioned Chromium or Slack in another comment - I'm sure you'll find nearly all of the loading time there is not spent loading the executable from disk, but actually executing it (and all its startup/initialisation code). So this probably won't be the speedup you're imagining. It would just save memory.


> pages not shareable, unless you use some sort of filesystem optimised for that.

btrfs on the host would have support for deduplication of identical pages in the disk images. It's true that a CPU-costly scan would be needed to identify new shared pages, if for example, two VM's are both updated to the latest distro release.
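That scan can be done with e.g. duperemove, which hashes extents and deduplicates matches through the kernel's FIDEDUPERANGE ioctl (the image directory path here is hypothetical):

```shell
# Hash-scan the VM image directory and dedupe identical extents in place.
# -d performs the dedupe (otherwise dry run), -r recurses, -h human sizes.
duperemove -dhr /var/lib/vm-images/
```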


Chromium load time with an empty profile on my system is 4.5 seconds with a cleared disk block cache, and 2.0 seconds with a warm disk cache.

So startup time could be better than halved. Seems worth it.


It's not really possible in the hypervisor, as it doesn't know what the guest will be putting in its memory beforehand.


Doubt it is worth the hassle. How many do you really expect to be identical?

An OS isn't large. Your spotify/slack/browser instance is of comparable size. Says more about browser based apps but still.


> An OS isn't large. Your spotify/slack/browser instance is of comparable size.

A fairly recent Windows 11 Pro image is ~26GB unpacked and 141k dirents. After finishing OOBE it's already running like >100 processes, >1000 threads, and >100k handles. My Chrome install is ~600MB and 115 dirents. (Not including UserData.) It runs ~1 process per tab. Comparable in scope and complexity? That's debatable, but I tend to agree that modern browsers are pretty similar in scope to what an OS should be. (The other day my "web browser" flashed the firmware on the microcontroller for my keyboard.)

They're not even close to "being comparable in size," although I guess that says more about Windows.


My reading was that the "comparable in size" was more about memory footprint and less about usage of storage


Basically all code pages should be the same if some other VM has the same version of ubuntu and running the same version of spotify/slack.

And remember that as well as RAM savings, you also get 'instant loading' because there is no need to do slow SSD accesses to load hundreds of megabytes of a chromium binary to get slack running...


If you already know so much about your application(s), are you sure you need virtualization?


The second I read "shared block cache" my brain went to containers.

If you want data colocated on the same filesystem, then put it on the same filesystem. VMs suck, nobody spins up a whole fabricated IBM-compatible PC and gaslights their executable because they want to.[1] They do it because their OS (a) doesn't have containers, (b) doesn't provide strong enough isolation between containers, or (c) the host kernel can't run their workload. (Different ISA, different syscalls, different executable format, etc.)

Anyone who has ever tried to run heavyweight VMs atop a snapshotting volume already knows the idea of "shared blocks" is a fantasy; as soon as you do one large update inside the guest the delta between your volume clones and the base snapshot grows immensely. That's why Docker et al. has a concept of layers and you describe your desired state as a series of idempotent instructions applied to those layers. That's possible because Docker operates semantically on a filesystem; much harder to do at the level of a block device.

Is a block containing b"hello, world" part of a program's text section, or part of a user's document? You don't know, because the guest is asking you for an LBA, not a path, not modes, not an ACL, etc. - If you don't know that, the host kernel has no idea how the page should be mapped into memory. Furthermore storing the information to dedup common blocks is non-trivial: go look at the manpage for ZFS' deduplication and it is littered w/ warnings about the performance, memory, and storage implications of dealing with the dedup table.

[1]: https://www.youtube.com/watch?v=coFIEH3vXPw


People run containers for two reasons: #1. They cannot control their devs with python dependencies. #2. Everyone runs containers! Can't be left behind.


I've tried to use virtio-pmem + DAX for the page cache to not be duplicated between the guest and the host. In practice the RAM overhead of virtio-pmem is unacceptable and it doesn't support discard operations at all. So yes a better solution would be needed.


OpenVZ does this. If you have 5 VMs each loading the same library then memory is conserved, as I understand it.


kvm does the same with KSM.


Not precisely, in that KSM does it after the fact while OpenVZ has it occur as a consequence of its design, on the loading of the program.

See (OpenVZ) "Containers share dynamic libraries, which greatly saves memory." It's just 1 Linux kernel when you are running OpenVZ containers.

https://docs.openvz.org/openvz_users_guide.webhelp/_openvz_c...

See (KVM/KSM): "KSM enables the kernel to examine two or more already running programs and compare their memory. If any memory regions or pages are identical, KSM reduces multiple identical memory pages to a single page. This page is then marked copy on write."

https://access.redhat.com/documentation/en-us/red_hat_enterp...

In KVM's defense, it supports a much wider range of OSes; OpenVZ only really does different versions of Linux, while KVM can run OpenBSD/FreeBSD/NetBSD/Windows and even OS/2 in addition to Linux.


KSM is a Linux kernel feature, not directly related to KVM.


Well that's all nice, but that would also need to be compute-efficient for it to be worthwhile and near-real-time dedupe of memory pages would be a REALLY tough challenge.


Pretty straightforward for disk blocks. Many VM disks are already de-duped, either through snapshotting or through copy-on-write host filesystems.

The host block cache will end up deduplicating it automatically because all the 'copies' lead back to the same block on disk.
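A minimal sketch of that setup with reflink copies (shares blocks on XFS/btrfs; with `=auto` it falls back to a plain copy on filesystems without reflink support):

```shell
# Create a base image once (truncate stands in for a real rootfs image).
truncate -s 64M base.img
# Clone it for each VM; on btrfs/XFS each clone shares all blocks with
# the base until written, so the host block cache holds them only once.
cp --reflink=auto base.img vm1.img
cp --reflink=auto base.img vm2.img
cmp base.img vm2.img && echo "clones identical"
```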


I believe we do this on Windows for Windows Sandbox. It works well but you will take a hit on performance to do the block resolution compared to always paging into physical memory.

https://learn.microsoft.com/en-us/windows/security/applicati...


Are you sure you're not thinking "copy on write" rather than "zero copy"? The latter implies you can predict in advance which pages will be the same forever...


The pages would be copy-on-write, but since this would mostly be for code pages, they would never be written, and therefore never copied.

By 'zero copy', I mean that when a guest tries to read a page, if another guest has that page in RAM, then no copy operation is done to get it into the memory space of the 2nd guest.


No mention of Cloud Hypervisor [1]…perhaps they don’t know about it? It’s based in part on Firecracker and supports free page reporting, virtio-blk-pci, PCI passthrough, and (I believe) discard in virtio-blk.

[1]: https://www.cloudhypervisor.org/


We do, and we'd love to use it in the future. We've found that it's not ready for prime time yet and it's missing some features. The biggest problem was that it does not support discard operations yet. Here's a short writeup we did about VMMs that we considered: https://github.com/hocus-dev/hocus/blob/main/rfd/0002-worksp...


Thanks for the link to the elaboration! FYI footnotes 3 and 4 seem to be swapped.


> footnotes 3 and 4 seem to be swapped

Maybe they are async footnotes and there is a race condition. /s


The article did an ok job of explaining the firecracker limitations they ran into but it was extremely skimpy when it came to qemu and just rushed to the conclusion “we did a lot of work so try our product.”


yeah I was reading so I could find out what they did.

I understand that they need to sell their product but jeez. don't leave us hanging like that


I didn't want to go into all the technical details, but we have another write-up that goes into details about RAM management: https://github.com/hocus-dev/hocus/blob/main/rfd/0003-worksp...

Other than making sure we release unused memory to the host, we didn't customize QEMU that much. Although we do have a cool layered storage solution - basically a faster alternative to QCOW2 that's also VMM independent. It's called overlaybd, and was created and implemented in Alibaba. That will probably be another blog post. https://github.com/containerd/overlaybd


> I didn't want to go into all the technical details

HN is here for the technical details ;)


that should be the HN motto!


Thirded.. Ed.


Fly uses Firecracker, and they host long-running processes. I wonder what's their opinion about it.


I think their usecase makes a lot of sense as their workloads consume a predefined amount of ram. As a customer you rent a VM with a specified amount of memory so fly.io does not care about reclaiming it from a running VM.


Depends on whether they're using smart memory allocation to keep costs lower, i.e., if they can detect patterns where certain workloads only need N amount of memory at Y time, they can effectively borrow memory from one VM for use in another that has an opposite statistical likelihood of needing that memory.

This is why paying for dedicated memory is often more expensive than its counterpart: dedicated memory is not considered part of the pool.


Generalized oversubscription like that is very challenging if not impossible to do securely, since you want to keep workloads isolated to single tenant numa nodes.

E.g. using the firecracker jailer: https://github.com/firecracker-microvm/firecracker/blob/main...


I think it would be easier to just enable KSM (KVM can use it to deduplicate memory and save some RAM on duplicate blocks like people loading same libraries into memory)


Shit like this is how you burn customers and then they find out you aren't changing the sheets, they change hotels.

Just because you can doesn't mean you should.


Almost everything is oversubscribed in our world.

Malloc will happily “return” the 15 TiB you asked for.

If 10000 people called 911 at the same time, only a tiny fraction would get through (and even fewer would get help).

Evacuating a large city by road would result in giant traffic jams.

There are 5-8x as many parking spots as there are cars (and we still can’t find a goddamn spot).

And of course… the great toilet paper shortage of 2020.


It’s a common technique though. I believe it’s called oversubscription, where you rent the same hardware to more tenants hoping they won’t use it all at once.

Fly.io themselves admitted they’re oversubscribed and AWS is doing the same for years now

Source: https://fly.io/blog/the-serverless-server/


We're not using virtio memory ballooning right now (we've never had the time to get it working).


I don't see the problem. Is this not the value proposition of the cloud? At scale things like block level de-duplication and over-provisioning of compute are more efficient to say nothing of power and cooling. This efficiency drives down cost. As long as your data isn't leaking and you get the hardware you pay for when you need it what's the problem?


Value to whom? Selling a thing and not delivering it is our current milieu: scam people and pocket the difference.

The entire vibe of this thread is

1) everyone is doing it

2) efficiency drives cost down (to the vendor) but those savings are not passed to the consumer

3) "hardware you pay for when you need it", all predicated on that IF, and it doesn't always happen.

Oversubscription should always be opt-in, otherwise it is an underhanded scam.


>.. they change hotels

And then they learn all hotels are doing exactly the same thing. One hotel doing it is a risk; all hotels doing it is industry standard.

Airlines, hotels, restaurants, doctors and so on oversubscribe all the time. Whoever complains is free to move on and add to their further disappointments.


Memory (DDR4) is like 1.50€ per GB nowadays. There is no need to skimp on it. The most expensive component is still the CPU if you actually want to do work and not just idle your webserver 99% of the time.


this ignores just how competitive these services have become and how much competition there is in this marketplace. Anything that allows a company to balance user experience with cost of hardware will be evaluated. The fact is this is super common because even though RAM is super cheap, at scale, these things add up. If you as a provider don't have to spend 120K on RAM and can use that on more CPUs or something else, it can mean having a competitive edge over the provider that does have to spend that 120K.


If they were competitive, they would be within an order of magnitude of on-prem, but they are not, they are 5x or more. "Super Common" is a weasel word.


Super common though. Everyone from Hetzner to AWS does this, unless you specifically pay for full allocation, which is typically ~2x the price.

Chances are you are using systems that do this and you haven't even noticed.


Werner Vogels seems to disagree: https://twitter.com/Werner/status/25137574680


Tweeted Sep 21, 2010, might still be true for EC2 VMs though.


There's the 't' series of instances that offer burstable CPU. AFAIK still 1:1 on memory though, and there's models that allow you to pay to stay un-throttled when using t family instances vs. getting throttled when out of burst credits.


We like Firecracker. People should use whatever makes sense for them.


> The main issue we've had with QEMU is that it has too many options you need to configure. For instance, enabling your VM to return unused RAM to the host requires at least three challenging tasks

This just works on Hyper-V Linux guests btw. For all the crap MS gets they do some things very right.


It kind of just works. It’s actually broken under Debian 13 for some reason; memory usage infinitely balloons if the feature is enabled.


13 is debian-testing so I guess Microsoft still has time to make it work - last I checked it wasn't yet on Azure supported list.


For context: Most developers run testing since debian-stable averages two years behind


Presumably this doesn't use the "microvm" machine type in QEMU? (also on front page right now https://news.ycombinator.com/item?id=36673945)


I came to the same conclusion as OP. QEMU is the most stable, hackable, well-supported VM hypervisor on the market. Setting it up is a pain, but once you get it set up with all your custom scripts, you never have to do it again. Ever. Even in your next project.


I toyed with it a bit and was delighted to get it running. Only to discover getting even basic networking going is another mission in itself.

Light is cool but for many tasks that level of Spartan is overkill

If I’m investing time in light it might as well be wasm tech


I know that Firecracker does not let you bind mount volumes, but QEMU does. So, we changed to QEMU from Firecracker. If you run the workloads in Kubernetes, you just have to change a single value in a yaml file to change the runtime.

I would be scared to let unknown persons use QEMU that bind mounts volumes as that is a huge security risk. Firecracker, I think, was designed from the start to run un-sanitized workloads, hence, no bind mounting.
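Presumably the "single value" is the pod's RuntimeClass; with something like Kata Containers installed (runtime names and image here are hypothetical), switching the sandbox runtime looks roughly like:

```yaml
# Hypothetical pod spec: switching the sandboxing runtime is one field.
apiVersion: v1
kind: Pod
metadata:
  name: build-env
spec:
  runtimeClassName: kata-qemu   # e.g. a QEMU-backed Kata runtime handler
  containers:
    - name: workspace
      image: example.com/dev-env:latest
```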


> you just have to change a single value in a yaml file

The most dangerous 12-word sentence.


I know a good way to make a process make the most of the hardware and play cooperatively with other processes: don't use virtualization.

I will never understand the whole virtual machine and cloud craze. Your operating system is better than any hypervisor at sharing resources efficiently.


In this context (the blog post) and the reason firecracker was created, was to isolate workloads.

And if you're running untrusted code, then using a virtualized environment is the easiest (I'd even say best) way to go about it.


> cloud craze.

Automatic scaling is great. Cloud parallelization (a.k.a fork) is absolutely wild once you get it rolling. Code deployments are incredibly simple. Never having to worry about physical machines or variable traffic loads is worth the small overhead they charge me for the wrapper. The generic system wide permissions model is an absolute joy once you get over the learning curve.


After reading the README of virtualization tools (and looking at the author) I discovered the benefits of using them. I recommend also giving that a try.


I do have to use it since someone early on in the company I work at decided to do everything with AWS and Kubernetes.

The fact of the matter is that it's just inefficient, slow and expensive.

Bare metal is simple, fast, and keeps you in control.


bare metal has high capex and low opex

cloud vms have low capex and high opex

which one is more expensive is a function of many variables


to get decent prices for cloud you need to commit to 3 years of usage upfront, and if you do that then it's about the same price as buying the hardware outright.

Otherwise it's three times more expensive.


I want to segregate, not to share.


Tl;dr: We tried to misuse technology and we failed. If Firecracker was developed for a single binary executed for a short period of time, why do you try to use it for multiple executables running for a long time? Does it make any sense to even try?


AWS uses Firecracker to execute long-running Fargate tasks; it's hardly misuse.


Where in the "sales" pitch on the fancy-CSS website as well as the README does it say only to use it for single-shot workloads?

I think the complaints are perfectly valid.


"Firecracker is an alternative to QEMU that is purpose-built for running serverless functions and containers safely and efficiently, and nothing more." [1]

Interesting. I guess we are reading a different website.

1. https://firecracker-microvm.github.io/


Listen people, Firecracker is NOT A HYPERVISOR. A hypervisor runs right on the hardware. KVM is a hypervisor. Firecracker is a process that controls KVM. If you want to call firecracker (and QEMU, when used in conjunction with KVM) a VMM ("virtual machine monitor") I won't complain. But please please please, we need a word for what KVM and Xen are, and "hypervisor" is the best fit. Stop using that word for a user-level process like Firecracker.


Nitpick: it’s not accurate to say that a hypervisor, by definition, runs right on the hardware. Xen (as a type-1 hypervisor) has this property; KVM (as a type-2 hypervisor) does not. It’s important to remember that the single core responsibility of a hypervisor is to divide hardware resources and time between VMs, and this decision-making doesn’t require bare-metal.

For those unfamiliar, the informal distinction between type-1 and type-2 is that type-1 hypervisors are in direct control of the allocation of all resources of the physical computer, while type-2 hypervisors operate as some combination of being “part of” / “running on” a host operating system, which owns and allocates the resources. KVM (for example) gives privileged directions to the Linux kernel and its virtualization kernel module for how to manage VMs, and the kernel then schedules and allocates the appropriate system resources. Yes, the type-2 hypervisor needs kernel-mode primitives for managing VMs, and the kernel runs right on the hardware, but those primitives aren’t making management decisions for the division of hardware resources and time between VMs. The type-2 hypervisor is making those decisions, and the hypervisor is scheduled by the OS like any other user-mode process.


Type-1 and type-2 hypervisor is terminology that should at this point be relegated to the past.

It was never popularly used in a way accurate to the origin of the classification: the original paper by Popek and Goldberg talked about formal proofs for the two types, and they really have very little to do with how the terms began being used in the '90s and '00s. Things have changed a lot with computers since the '70s, when the paper was written and the terminology was coined.

So, language evolves, and Type-1 and Type-2 came to mean something else in common usage. And this might have made sense to differentiate something like esx from vmware workstation in their capabilities, but it's lost that utility in trying to differentiate Xen from KVM for the overwhelming majority of use cases.

Why would I say it's useless in trying to differentiate, say, Xen and KVM? Couple of reasons:

1) There's no performance benefit to type-1 - a lot of performance sits on the device emulation side, and both are going to default to qemu there. Other parts are based heavily on CPU extensions, and Xen and KVM have equal access there. Both can pass through hardware, support sr-iov, etc., as well.

2) There's no overhead benefit in Xen - you still need a dom0 VM, which is going to arguably be even more overhead than a stripped down KVM setup. There's been work on dom0less Xen, but it's frankly in a rough state and the related drawbacks make it challenging to use in a production environment.

Neither term provides any real advantage or benefit in reasoning between modern hypervisors.


> Type-1 and type-2 hypervisor is terminology that should at this point be relegated to the past.

Maybe it's because of the time I grew up in, but in my mind the prototypical Type-I hypervisor is VMWare ESX Server; and the prototypical Type-II hypervisor is VMWare Workstation.

It should be noted that VMWare Workstation always required a kernel module (either on Windows or Linux) to run; so the core "hypervisor-y" bit runs in kernel mode either way. So what's the difference?

The key difference between those two, to me is: Is the thing at the bottom designed exclusively to run VMs, such that every other factor gives way? Or does the thing at the bottom have to "play nice" with random other processes?

The scheduler for ESX Server is written explicitly to schedule VMs. The scheduler for Workstation is the Windows scheduler. Under ESX, your VMs are the star of the show; under Workstation, your VMs are competing with the random updater from the printer driver.

Xen is like ESX Server: VMs are the star of the show. KVM is like Workstation: VMs are "just" processes, and are competing with whatever random bash script was created at startup.

KVM gets loads of benefits from being in Linux; like, it had hypervisor swap from day one, and as soon as anyone implements something new (like say, NUMA balancing) for Linux, KVM gets it "for free". But it's not really for free, because the cost is that KVM has to make accommodations to all the other use cases out there.

> There's no performance benefit to type-1 - a lot of performance sits on the device emulation side, and both are going to default to qemu there.

Er, both KVM and Xen try to switch to paravirtualized interfaces as fast as possible, to minimize the emulation that QEMU has to do.


>Maybe it's because of the time I grew up in, but in my mind the prototypical Type-I hypervisor is VMWare ESX Server; and the prototypical Type-II hypervisor is VMWare Workstation.

My point is that these are largely appropriated terms - neither would fit the definitions of type 1 or type 2 from the early days when Popek and Goldberg were writing about them.

> Or does the thing at the bottom have to "play nice" with random other processes?

From this perspective, Xen doesn't count. You can have all sorts of issues from the dom0 side and competing with resources - you mention PV drivers later, and you can 100% run into issues with VMs because of how dom0 schedules blkback and netback when competing with other processes.

ESXi can also run plenty of unmodified linux binaries - go back in time 15 years and it's basically a fully featured OS. There's a lot running on it, too. Meanwhile, you can build a linux kernel with plenty of things switched off and a root filesystem with just the bare essentials for managing kvm and qemu that is even less useful for general purpose computing than esxi.

>Er, both KVM and Xen try to switch to paravirtualized interfaces as fast as possible, to minimize the emulation that QEMU has to do.

There are more things being emulated than there are PV drivers for, but this is a bit outside of my point.

For KVM, the vast majority of implementations are using qemu for managing their VirtIO devices as well - https://developer.ibm.com/articles/l-virtio/ - you'll notice that IBM even discusses these paravirtual drivers directly in the context of "emulating" the device. Perhaps a better way to get the intent across here would be to say that qemu handles the device model.

From a performance perspective, ideally you'd want to avoid PV here too and go with sr-iov devices or passthrough.


According to the actual paper that introduced the distinction, and adjusting for change of terminology in the last 50 years, a type-1 hypervisor runs in kernel space and a type-2 hypervisor runs in user space. x86 is not virtualizable by a type-2 hypervisor, except by software emulation of the processor.

What actually can change is the amount of work that the kernel-mode hypervisor leaves to a less privileged (user space) component.

For more detail see https://www.spinics.net/lists/kvm/msg150882.html



There are arguments in both directions for something like KVM. Wikipedia states it pretty well:

> The distinction between these two types is not always clear. For instance, KVM and bhyve are kernel modules[6] that effectively convert the host operating system to a type-1 hypervisor.[7] At the same time, since Linux distributions and FreeBSD are still general-purpose operating systems, with applications competing with each other for VM resources, KVM and bhyve can also be categorized as type-2 hypervisors.[8]

https://en.wikipedia.org/wiki/Hypervisor#Classification


Not really, calling KVM a type-1 is a misunderstanding of what the “bare-metal” distinction is referring to. The real difference between the two types is whether the hypervisor owns the hardware or not. In the case of a type-1, the hypervisor runs below the kernel and controls access to the hardware, even for the kernel. In type-2, the hypervisor runs on the kernel, which owns the hardware, and must go through the kernel to use hardware resources.


But that's not how that works. KVM is as "bare-metal" in access to the system as ESXi is. The hypervisor code runs in ring 0 in both cases.


Keep fighting the good fight, friend.

Although I’ll note that the line between a VMM and a hypervisor is not always clear. E.g., KVM includes some things that other hypervisors delegate to the VMM (such as instruction completion). And macOS’s Hypervisor.framework is almost a pass-through to the CPU’s raw capabilities.


I think you could help me answer the question that has been in my mind for a month :)

Is there any article that tells the difference and relationship between KVM, QEMU, libvirt, virt-manager, Xen, Proxmox etc. with their typical use cases?


KVM is the Linux kernel's implementation of the CPU virtualization extensions, which accelerate VMs to near bare-metal speeds.

Qemu is a user space system emulator. It can emulate in software different architectures like ARM, x86, etc. It can also emulate devices, networking, disks, etc. It is invoked via the command line.

The reason you'll see Qemu/KVM a lot is because Qemu is the emulator, the thing actually running the VM. It utilizes KVM (on Linux; macOS has HVF, for example) to accelerate the VM when the host architecture matches the VM's.

Libvirt is an XML based API on top of Qemu (and others). It allows you to define networks, VMs (it calls them domains), and much more with a unified XML schema through libvirtd.

Virsh is a CLI tool to manage libvirtd. Virt-manager is a GUI to do the same.

Proxmox is Debian under the hood with Qemu/KVM running VMs. It provides a robust web UI and easy clustering capabilities, along with nice-to-haves like easy management of disks, Ceph, etc. You can also manage Ceph through an API with Terraform.

Xen is an alternative hypervisor (like ESXi). Instead of running on top of Linux, Xen has its own microkernel. This means less flexibility (there's no Linux underneath running things), but also simpler management and less attack surface. I haven't played much with Xen though; KVM is kind of the de facto standard, but IIRC AWS used to use a modified Xen before KVM came along and ate Xen's lunch.


>Xen is an alternative hypervisor (like esxi). Instead of running on top of Linux, Xen has it's own microkernel. This means less flexibility (there's no Linux body running things), but also simpler to manage and less attack surface. I haven't played much with xen though, KVM is kind of the defacto, but iirc AWS uses a modified Xen.

If you had actually played with Xen you'd know it's not actually easier to manage. And the increased-security claims are dubious at best, since the thing that would be attacked (dom0, which manages the whole system and runs Linux) has direct, unfettered access to the Xen microkernel. There's a reason many sites migrated away from Xen to KVM. Also, many Xen drivers de facto run partly in the Linux dom0 instance, so you don't even get that isolation.

We ran Xen for a few years, as KVM at first was still not as refined and Xen was first to market with something mature, and it was just a million little annoying things.

KVM offers far simpler and more straightforward management. A VM is just a process. You can look at its CPU usage via normal tools. No magic. No driver problems.


> I haven't played much with xen though, KVM is kind of the defacto, but iirc AWS uses a modified Xen.

AWS switched to KVM, and even a lot of AWS systems that report themselves as Xen are running as KVM with a compat shim to say it's Xen.


> Xen is an alternative hypervisor (like esxi). Instead of running on top of Linux, Xen has it's own microkernel. This means less flexibility (there's no Linux body running things), but also simpler to manage and less attack surface.

You're going to need dom0 (a "control domain") on any Xen host. Gotta have something running xl and the rest of the toolstack for managing it. dom0less technically exists but the drawbacks mean it's not really usable by most people in a production situation.


KVM is the kernel-based virtual machine, with libvirt being its API abstraction over all of it. QEMU is a virtual machine host that leverages KVM or software virtualization to spin up machines on the host. virt-manager does the same. Xen is another virtual machine host, like KVM. Proxmox is a virtual machine manager (like QEMU, virt-manager) but is web-based. Libvirt will provide abstraction for KVM, QEMU, and Xen.

Use cases: the Proxmox web interface exposed on your local network on a KVM Linux box that uses QEMU to manage VMs. Proxmox will allow you to do that from the web. QEMU is great for a single machine or a small fleet of machines but should be automated for any heavy lifting. Proxmox will do that.


This is almost entirely wrong especially as far as QEMU, Libvirt and virt-manager are concerned.

QEMU is a low level process that represents the virtual machine. It has no equivalent in Xen. Using QEMU directly is not a good idea unless your needs for VM configurations change all the time and you hardly reuse VMs.

Libvirt is at a higher level than QEMU. It manages the QEMU processes and gives them access to system resources (image files, network interfaces, pass-through PCI devices). It also makes it easy to manage the configuration of your virtual machines and the resources they use.

Higher still is virt-manager, which is a GUI interface for libvirt. Proxmox sits at roughly the same level as virt-manager.
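To make the layering concrete, here's a minimal sketch of what a libvirt domain definition might look like. Every name and path below (the VM name, disk image, network) is illustrative, not from any real setup; libvirt turns XML like this into the corresponding QEMU command line and manages the resulting process:

```xml
<!-- Illustrative libvirt domain definition; names and paths are made up. -->
<domain type='kvm'>
  <name>demo-vm</name>
  <memory unit='GiB'>2</memory>
  <vcpu>2</vcpu>
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
  </os>
  <devices>
    <!-- qcow2 disk exposed to the guest as a virtio block device -->
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/demo-vm.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <!-- NIC on libvirt's "default" NAT network -->
    <interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
```

You'd hand this to libvirt with `virsh define demo-vm.xml` and then `virsh start demo-vm`, instead of assembling the QEMU arguments yourself.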


How? KVM and Xen are kernel level. QEMU uses KVM but also has a software virtualization capability. Libvirt is an API abstraction over it all. virt-manager is a gui app to manage libvirt machines. Proxmox as well. Proxmox VE talks to VMHost via libvirt.


Libvirt does not use KVM. Libvirt uses either QEMU (which in turn might or might not use KVM) or Xen or other hypervisors. So it's incorrect to say that Libvirt abstracts over KVM.

And virt-manager indeed manages Libvirt machines so it's not at the level of QEMU as you wrote in the parent comment:

> Proxmox is a virtual machine manager (like QEMU, virt-manager)


Semantics, libvirt abstracts over KVM via QEMU because QEMU/KVM/HVT is all one driver.


KVM is not enough to create a virtual machine. KVM only virtualizes the processor, not a whole machine.


>Using KVM, one can run multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc.

Straight from their site. QEMU is the user space interface, KVM the kernel space driver. It’s enough to run whatever OS. That’s the point.

For libvirt: https://libvirt.org/drivers.html

They support a bunch as well.


I don't want to necessarily make this an argument to/from authority, but for some context here - you are discussing this with Paolo Bonzini, maintainer of KVM, contributor to QEMU. In the list of people that best understand the difference and demarcation points between KVM and QEMU, he's pretty far up there.


Exactly, it's QEMU that abstracts over the processor virtualization APIs of KVM, Microsoft WHPX, Apple Hypervisor.framework etc. Not Libvirt.


I don't know if _one_ such article exists, but here is a piece of tech doc from oVirt (yet another tool) that shows how - or that - VDSM is used by oVirt to communicate with QEMU through libvirt: https://www.ovirt.org/develop/architecture/architecture.html...

In really simple terms, so simple that I'm not 100% sure they are correct:

* KVM is a hypervisor, or rather it lets you turn linux into a hypervisor [1], which will let you run VMs on your machine. I've heard KVM is rather hard to work with (steep learning curve). (Xen is also a hypervisor.)

* QEMU is a wrapper-of-a-sorts (a "machine emulator and virtualizer" [2]) which can be used on top of KVM (or Xen). "When used as a virtualizer, QEMU achieves near native performance by executing the guest code directly on the host CPU. QEMU supports virtualization when executing under the Xen hypervisor or using the KVM kernel module in Linux." [2]

* libvirt "is a toolkit to manage virtualization platforms" [3] and is used, e.g., by VDSM to communicate with QEMU.

* virt-manager is "a desktop user interface for managing virtual machines through libvirt" [4]. The screenshots on the project page should give an idea of what its typical use-case is - think VirtualBox and similar solutions.

* Proxmox is the above toolstack (-ish) but as one product.

---

[1] https://www.redhat.com/en/topics/virtualization/what-is-KVM

[2] https://wiki.qemu.org/Main_Page

[3] https://libvirt.org/

[4] https://virt-manager.org/


PS: typically, if you want to run VMs you are faced with only two paths: 1) you want one or more VMs on your personal computer, or 2) you want one or more VMs in an office environment.

On the first path you are likely going to be just fine with VirtualBox, VMWare Workstation or Hyper-V (Windows only) / Parallels (Mac intended). Which one you should pick depends on your desired use of the machines.

On the second path you would go with a solution that deals with the nitty-gritty details, such as Proxmox, oVirt, Hyper-V, ESXi, or any of the other many available options - granted you are not going full cloud-based, which opens up a whole lot of different options too.

You would generally never need to worry about which components are needed where and why. I've had to worry about it once or twice before, because I've had to debug why an oVirt solution was not behaving like I wanted it to behave. Knowing the inner workings helps in that case.


> On the first path you are likely going to be just fine with VirtualBox, VMWare Workstation or Hyper-V (Windows only) / Parallels (Mac intended).

As a Linux user, why would you want to use VirtualBox or VMWare Workstation? They are not so well integrated with the system, and, frankly, VirtualBox is more of a toy VM player... just go for virt-manager. It gives a conceptually similar interface to VirtualBox, but better integration with the rest of the system, especially when it comes to stuff like sending different key combinations.

I honestly cannot think of a single benefit to using VirtualBox (and I'm less familiar with VMWare player) compared to virt-manager. My guess is that it's more often used because it's also a common choice on MS Windows, so, you get more hits if you are going to search the Web for questions associated to VMs / you'd get tutorials for how to set up a VM that use VirtualBox. But, if you apply yourself to learning how either one of these works, you'd see no reason to choose it.


The only reason for using it would be that the GUI is a little bit more capable. But the feature crippling without the PUEL add-ons is horrible.


[flagged]


Honestly, when we're discussing with humans we want the humans' response. Anyone of us can ask ChatGPT, I don't know why people keep pasting its answers


people keep doing it because it lets them farm upvotes.


Pretty good I must say. AWS however is using KVM only these days!


What is AWS Nitro then?


Nitro is the custom hardware that runs in AWS data centers (or in outposts). You can use it bare metal, or if you use VM instances it will run Linux (plus KVM) and a custom user space component that replaces QEMU.


Damn, outdone by ChatGPT


I think people just pick the coolest sounding term. Imagine someone is sharing what they are working on, what’s cooler sounding “I am working on a virtual machine monitor” or “I am working on a hypervisor”. Hypervisor just sounds futuristic and awesome.

It’s like with “isomorphic” code. That just sounds much cooler than “js that runs on the client and the server”.


> virtual machine monitor

Is it good to think of libvirt as a virtual machine monitor, or is that more "virtual machine management"?


I'd love to get a clear explanation of what libvirt actually does. As far as I can tell it's a qemu argument assembler and launcher. For my own use-case, I just launch qemu from systemd unit files:

https://wiki.archlinux.org/title/QEMU#With_systemd_service
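For context, such a unit is just a QEMU invocation wrapped by systemd. Everything below (unit name, paths, QEMU flags) is an illustrative sketch, not taken from the wiki page:

```ini
# /etc/systemd/system/qemu-demo.service -- illustrative example
[Unit]
Description=QEMU VM: demo
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/qemu-system-x86_64 \
    -enable-kvm \
    -machine q35 -cpu host \
    -m 2G -smp 2 \
    -drive file=/var/lib/vms/demo.qcow2,if=virtio,format=qcow2 \
    -nic user,model=virtio-net-pci \
    -nographic

[Install]
WantedBy=multi-user.target
```

Note that as a system unit this runs QEMU as root; adding a `User=` line is a partial mitigation for the privilege-separation work that libvirt otherwise does for you.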


The main important point is that Libvirt takes care of privilege separation.

It makes sure that if your VM and/or QEMU are broken out of, there are extra layers to prevent getting access to the whole physical machine. For example it runs QEMU as a very limited user and, if you're using SELinux, the QEMU process can hardly read any file other than the vm image file.

By contrast the method in the arch wiki runs QEMU as root. QEMU is exposed to all sort of untrusted input, so you really don't want it to run as root.

Libvirt also handles cross machine operations such as live migration, and makes it easier to query a bunch of things from QEMU.

For more info see https://www.redhat.com/en/blog/all-you-need-know-about-kvm-u...


Thank you, this is the best explanation of the utility that I've seen. I appreciate it.


> As far as I can tell it's a qemu argument assembler

Raises a few questions to me:

Can you use KVM/do KVM stuff without QEMU?

Can you do libvirt stuff without QEMU?

Hoping the answers to both aren't useless/"technically, but why would you want to?"


> Can you use KVM/do KVM stuff without QEMU?

Yes, there are a few things out there, like Firecracker, that use KVM without using QEMU. I'm not completely aware of all of them, but they do exist

> Can you do libvirt stuff without QEMU?

Yes, it can also manage LXC containers and a few other types like Xen, bhyve, and Virtuozzo, as well as QEMU without KVM. The without-KVM part is important for letting you run VMs that emulate architectures other than the native one.

For a good bit of this, it is "why would you want to" but there are definitely real cases where you'd want to be able to do this. Like the LXC or Virtuozzo support means that you can run lighter weight containers (same underlying tech as Docker essentially) through the same orchestration/management that you use for virtual machines. And the Bhyve support lets you do the same thing for running things on top of FreeBSD (though I've never used it this way) so that a heterogeneous mix of hosts is managed through the same interfaces.


There is a cute article from LWN demoing using KVM directly, without anything else: https://lwn.net/Articles/658511/


That is really neat, it never occurred to me that you could just use KVM as a sort of sandbox.


> Can you use KVM/do KVM stuff without QEMU?

There are projects doing that, although qemu is the de facto standard and your best bet if you don't need to boot your machines in 20ms

> Can you do libvirt stuff without QEMU?

Libvirt has many backends, so yes. IIRC it can even manage VirtualBox, although I'm not sure why anyone would want to

> Hoping the answers to both aren't useless/"technically, but why would you want to?"

...why? Is there a problem kvm+qemu+libvirt doesn't solve for you?


> Can you use KVM/do KVM stuff without QEMU?

Here's a post of someone using KVM from Python (raw, without needing a kvm library or anything): https://www.devever.net/~hl/kvm


libvirt supports other hypervisors than QEMU, such as VirtualBox, Xen and LXC.


I know ChatGPT responses get downvoted here but I fed it this to help me better understand this:

> is KVM a hypervisor? is it type 1 or type 2? is QEMU a hypervisor, is it type 1 or type 2? if QEMU is using KVM, is QEMU then not a hypervisor in that use case?

Yes, KVM (Kernel-Based Virtual Machine) is indeed a hypervisor. It's a type 1 hypervisor, also known as a "bare metal" hypervisor. This is because KVM directly runs on the host's hardware to control the hardware and to manage guest operating systems. The fact that it's a Linux kernel module that allows the Linux kernel to function as a hypervisor makes it very efficient.

QEMU (Quick Emulator) is a bit more complex. By itself, it is technically a type 2 or "hosted" hypervisor, meaning it runs within a conventional operating system environment. QEMU is a generic, open-source machine emulator and virtualizer that can emulate a variety of hardware types and host a range of guest operating systems.

However, when QEMU is used with KVM, the picture changes somewhat. In this case, KVM provides the hardware virtualization where it allows the host machine to call CPU instructions of full virtualization. Then, QEMU emulates the hardware resources and provides the user interface for the VM, thus allowing for better performance and usability. It's this combination of KVM's hardware acceleration capabilities and QEMU's emulation capabilities that makes them often used together.

In this case, QEMU is not acting purely as a hypervisor; it's providing hardware emulation and user interface for the VMs, while KVM is the part providing the hypervisor functionality. However, we often refer to the combination of "QEMU/KVM" as a unit when talking about this mode of operation.


It's a lot of glue to present a consistent interface but it also does the management part.

"API to a virtualization system" would probably be the closest approximation, but it also does some more advanced stuff, like coordinating cross-host VM migration


"Firecracker...'s excellent for running short-lived workloads...A little-known fact about Firecracker is its lack of support... for long-lived workloads."

Okay.



