I really want VMs to integrate 'smarter' with the host.
For example, if I'm running 5 VMs, there is a good chance that many of the pages are identical. Not only do I want those pages to be deduplicated, but I want them to be zero-copy (i.e. not deduplicated after the fact by some daemon).
To do that, the guest block cache needs to be integrated with the host block-cache, so that whenever some guest application tries to map data from disk, the host notices that another virtual machine has already caused this data to be loaded, so we can just map the same page of already loaded data into the VM that is asking.
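To make the idea concrete, here is a toy sketch in Python. It is not how any real hypervisor works; `HostPageCache`, `get_page`, `fake_disk_read` and the image name are all made up for illustration. It assumes both guests read from the same backing image, so the host can key its cache on (image, page index) and hand the second guest the page the first guest already loaded.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size for the sketch


@dataclass
class HostPageCache:
    """Toy host-side cache: one buffer per (image id, page index)."""
    pages: dict = field(default_factory=dict)
    disk_reads: int = 0

    def get_page(self, image_id, page_index, read_from_disk):
        key = (image_id, page_index)
        if key not in self.pages:
            # First guest to touch this page pays for the disk read.
            self.disk_reads += 1
            self.pages[key] = bytearray(read_from_disk(image_id, page_index))
        # Later guests get the very same buffer: no second read, no copy.
        return self.pages[key]


def fake_disk_read(image_id, page_index):
    # Stand-in for a real block read (hypothetical helper).
    return bytes([page_index % 256]) * PAGE_SIZE


cache = HostPageCache()
page_for_vm1 = cache.get_page("ubuntu-base.img", 7, fake_disk_read)
page_for_vm2 = cache.get_page("ubuntu-base.img", 7, fake_disk_read)
```

Both guests end up holding the same object, and the fake disk is only read once. A real implementation would also have to map the page read-only and break the sharing on write.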
How are you going to know in advance that the pages are going to be the same?
e.g. your guest kernel is loading an application into memory, by reading some parts of an ELF file from disk. Presumably each VM has its own unique disk, so the hypervisor can't know that this is "the same" page of data as another VM has without actually reading it into memory first and calculating a hash or something.
If the VMs share a disk image (e.g. the image is copy-on-write), then I could see it being feasible - e.g. with KVM, even if your VMs are instantiated by distinct userspace processes, they would probably share the pages as they mmap the same disk image. You would still need your virtualised disk device to support copy-on-write, which may or may not be possible depending on your use case.
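A small userspace illustration of that point, using Python's `mmap` with `ACCESS_COPY` (a private, copy-on-write mapping). This is only an analogy for what a VMM process might do with a shared image, not KVM code; the temporary file stands in for the disk image, and `vm1`/`vm2` are just two mappings in one process.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Create a one-page "disk image" filled with 'A'.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * PAGE)
    image_path = f.name

with open(image_path, "r+b") as f1, open(image_path, "r+b") as f2:
    # Two private (copy-on-write) mappings of the same file.
    vm1 = mmap.mmap(f1.fileno(), PAGE, access=mmap.ACCESS_COPY)
    vm2 = mmap.mmap(f2.fileno(), PAGE, access=mmap.ACCESS_COPY)

    vm1[0:5] = b"dirty"          # triggers a private copy for vm1 only
    vm1_view = bytes(vm1[0:5])
    vm2_view = bytes(vm2[0:5])   # still sees the original data

    vm1.close()
    vm2.close()

with open(image_path, "rb") as f:
    on_disk = f.read(5)          # the image itself is untouched

os.remove(image_path)
```

Until one side writes, both mappings can be backed by the same page-cache page; the write only affects the mapping that made it.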
But your copy-on-write disk images will probably quickly diverge in a way that makes most pages not shareable, unless you use some sort of filesystem optimised for that.
Lastly, since you mentioned Chromium or Slack in another comment - I'm sure you'll find nearly all of the loading time there is not spent loading the executable from disk, but actually executing it (and all its startup/initialisation code). So this probably won't be the speedup you're imagining. It would just save memory.
> pages not shareable, unless you use some sort of filesystem optimised for that.
btrfs on the host would have support for deduplication of identical pages in the disk images. It's true that a CPU-costly scan would be needed to identify new shared pages if, for example, two VMs are both updated to the latest distro release.
> An OS isn't large. Your spotify/slack/browser instance is of comparable size.
A fairly recent Windows 11 Pro image is ~26GB unpacked and 141k dirents. After finishing OOBE it's already running like >100 processes, >1000 threads, and >100k handles. My Chrome install is ~600MB and 115 dirents. (Not including UserData.) It runs ~1 process per tab. Comparable in scope and complexity? That's debatable, but I tend to agree that modern browsers are pretty similar in scope to what an OS should be. (The other day my "web browser" flashed the firmware on the microcontroller for my keyboard.)
They're not even close to "being comparable in size," although I guess that says more about Windows.
Basically all code pages should be the same if some other VM has the same version of Ubuntu and is running the same version of Spotify/Slack.
And remember that as well as RAM savings, you also get 'instant loading' because there is no need to do slow SSD accesses to load hundreds of megabytes of a Chromium binary to get Slack running...
The second I read "shared block cache" my brain went to containers.
If you want data colocated on the same filesystem, then put it on the same filesystem. VMs suck, nobody spins up a whole fabricated IBM-compatible PC and gaslights their executable because they want to.[1] They do it because their OS (a) doesn't have containers, (b) doesn't provide strong enough isolation between containers, or (c) the host kernel can't run their workload. (Different ISA, different syscalls, different executable format, etc.)
Anyone who has ever tried to run heavyweight VMs atop a snapshotting volume already knows the idea of "shared blocks" is a fantasy; as soon as you do one large update inside the guest, the delta between your volume clones and the base snapshot grows immensely. That's why Docker et al. have a concept of layers, and you describe your desired state as a series of idempotent instructions applied to those layers. That's possible because Docker operates semantically on a filesystem; it's much harder to do at the level of a block device.
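A back-of-the-envelope simulation of that divergence, as a rough sketch only. It assumes each clone's update rewrites a random subset of blocks, which real package updates do not do, and `shared_fraction` is a made-up helper. Note that even when two guests install the same update, the block layer just sees two sets of private writes.

```python
import random


def shared_fraction(total_blocks, rewritten_blocks, clones, seed=0):
    """Fraction of base-snapshot blocks that every clone still shares."""
    rng = random.Random(seed)
    still_shared = set(range(total_blocks))
    for _ in range(clones):
        # Each clone dirties its own set of blocks (assumed random here).
        dirty = set(rng.sample(range(total_blocks), rewritten_blocks))
        still_shared -= dirty
    return len(still_shared) / total_blocks


for n in (1, 2, 5):
    print(n, "clones:", shared_fraction(100_000, 30_000, n))
```

Under this random-write assumption the shared fraction falls off roughly like 0.7 to the power of the clone count when each clone rewrites 30% of its blocks.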
Is a block containing b"hello, world" part of a program's text section, or part of a user's document? You don't know, because the guest is asking you for an LBA, not a path, not modes, not an ACL, etc. If you don't know that, the host kernel has no idea how the page should be mapped into memory. Furthermore, storing the information to dedup common blocks is non-trivial: go look at the manpage for ZFS' deduplication and it is littered w/ warnings about the performance, memory, and storage implications of dealing with the dedup table.
People run containers for two reasons:
#1. They cannot control their devs with python dependencies.
#2. Everyone runs containers! Can't be left behind.
I've tried to use virtio-pmem + DAX so that the page cache isn't duplicated between the guest and the host. In practice the RAM overhead of virtio-pmem is unacceptable and it doesn't support discard operations at all. So yes, a better solution would be needed.
See (KVM/KSM): "KSM enables the kernel to examine two or more already running programs and compare their memory. If any memory regions or pages are identical, KSM reduces multiple identical memory pages to a single page. This page is then marked copy on write."
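For anyone who hasn't seen the after-the-fact approach, here is a toy model of the scan, merge, and copy-on-write cycle the quote describes. It is a simplification in pure Python: `ToyMemory` and its methods are invented names, and the hash lookup is a shortcut for the sketch rather than KSM's actual data structure.

```python
import hashlib

PAGE_SIZE = 4096


class ToyMemory:
    def __init__(self, pages):
        # frame id -> contents; virtual page i starts on its own frame i
        self.frames = {i: bytes(p) for i, p in enumerate(pages)}
        self.mapping = list(range(len(pages)))

    def scan_and_merge(self):
        """Point virtual pages with identical contents at a single frame."""
        seen = {}
        for vpage, frame in enumerate(self.mapping):
            digest = hashlib.sha256(self.frames[frame]).digest()
            canonical = seen.setdefault(digest, frame)
            # Compare full contents too, so a hash collision cannot merge pages.
            if canonical != frame and self.frames[canonical] == self.frames[frame]:
                self.mapping[vpage] = canonical
        live = set(self.mapping)
        self.frames = {f: c for f, c in self.frames.items() if f in live}

    def write(self, vpage, data):
        """Overwrite a whole page, copying first if the frame is shared."""
        frame = self.mapping[vpage]
        if self.mapping.count(frame) > 1:
            frame = max(self.frames) + 1   # copy-on-write: take a fresh frame
            self.mapping[vpage] = frame
        self.frames[frame] = bytes(data)


mem = ToyMemory([b"\x00" * PAGE_SIZE, b"\x00" * PAGE_SIZE, b"\x01" * PAGE_SIZE])
mem.scan_and_merge()
frames_after_merge = len(mem.frames)
mem.write(1, b"\x02" * PAGE_SIZE)
frames_after_write = len(mem.frames)
```

The two zero pages collapse onto one frame during the scan, and the later write to page 1 splits them apart again. The cost being argued about in this thread is the scan itself, which has to read every candidate page.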
In KVM's defense, it supports a much wider range of OSes; OpenVZ only really does different versions of Linux, while KVM can run OpenBSD/FreeBSD/NetBSD/Windows and even OS/2 in addition to Linux.
Well that's all nice, but that would also need to be compute-efficient for it to be worthwhile and near-real-time dedupe of memory pages would be a REALLY tough challenge.
I believe we do this on Windows for Windows Sandbox. It works well but you will take a hit on performance to do the block resolution compared to always paging into physical memory.
Are you sure you're not thinking "copy on write" rather than "zero copy"? The latter implies you can predict in advance which pages will be the same forever...
The pages would be copy-on-write, but since this would mostly be for code pages, they would never be written, and therefore never copied.
By 'zero copy', I mean that when a guest tries to read a page, if another guest has that page in RAM, then no copy operation is done to get it into the memory space of the 2nd guest.