
I really want VMs to integrate 'smarter' with the host.

For example, if I'm running 5 VMs, there is a good chance that many of the pages are identical. Not only do I want those pages to be deduplicated, but I want them to be zero-copy (i.e. not deduplicated after the fact by some daemon).

To do that, the guest block cache needs to be integrated with the host block cache, so that whenever some guest application tries to map data from disk, the host notices that another virtual machine has already caused this data to be loaded and can simply map the same page of already-loaded data into the VM that is asking.
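
A rough sketch of what this could look like at the host level, assuming the VMs' disks are backed by the same read-only base image (the path below is made up): a read-only MAP_SHARED mapping is served straight out of the host page cache, so every process mapping the file sees the same physical pages, and mincore() shows which of them another VM has already faulted in.

    /* Sketch: shared read-only mappings of one backing file share a single
     * page-cache copy. "/var/lib/images/base.img" is a made-up path. */
    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/var/lib/images/base.img", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        long pagesz = sysconf(_SC_PAGESIZE);
        size_t len = 16 * pagesz;

        /* A read-only shared mapping is backed by the host page cache:
         * every process mapping this range sees the same physical pages. */
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* mincore() reports which pages are already resident, e.g. because
         * another VM's backend has already faulted them in. */
        unsigned char vec[16];
        if (mincore(p, len, vec) == 0) {
            int resident = 0;
            for (size_t i = 0; i < len / pagesz; i++)
                resident += vec[i] & 1;
            printf("%d of %zu pages already in the page cache\n",
                   resident, len / pagesz);
        }

        munmap(p, len);
        close(fd);
        return 0;
    }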




This seems like a security issue waiting to happen when you’re running code from different users.


https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.ht...

Zero-copy is harder, as a system upgrade on one of them will trash the sharing, but KSM is overall pretty effective at saving some memory across similar VMs.


KVM has had KSM (kernel samepage merging), which deduplicates identical pages, for a long time.
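
For reference, KSM only scans memory that a process has opted in with madvise(MADV_MERGEABLE); QEMU does this for guest RAM when memory merging is enabled. A minimal sketch of the opt-in (not QEMU's actual code), assuming KSM is switched on via /sys/kernel/mm/ksm/run:

    /* Sketch: allocate a pretend "guest RAM" region and mark it mergeable so
     * ksmd may collapse identical pages into shared copy-on-write pages. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t ram_size = 256UL << 20;              /* 256 MiB of "guest RAM" */

        void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED) { perror("mmap"); return 1; }

        /* ksmd will now scan this region and merge pages that are identical
         * across all MADV_MERGEABLE regions, marking them copy-on-write. */
        if (madvise(ram, ram_size, MADV_MERGEABLE) != 0)
            perror("madvise(MADV_MERGEABLE)");      /* e.g. kernel without KSM */

        /* ... hand `ram` to the guest as its physical memory ... */
        return 0;
    }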


It is vulnerable to side-channel attacks, so be careful when enabling it: https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)


But that makes a copy first, and only later notices that the pages are the same and merges them again.

Better to not make copies in the first place.


How are you going to know in advance that the pages are going to be the same?

e.g. your guest kernel is loading an application into memory, by reading some parts of an ELF file from disk. Presumably each VM has its own unique disk, so the hypervisor can't know that this is "the same" page of data as another VM has without actually reading it into memory first and calculating a hash or something.

If the VMs share a disk image (e.g. the image is copy-on-write), then I could see it being feasible - e.g. with KVM, even if your VMs are instantiated by distinct userspace processes, they would probably share the pages as they mmap the same disk image. You would still need your virtualised disk device to support copy-on-write, which may or may not be possible depending on your use case.

But your copy-on-write disk images will probably quickly diverge in a way that makes most pages not shareable, unless you use some sort of filesystem optimised for that.

Lastly, since you mentioned Chromium or Slack in another comment - I'm sure you'll find nearly all of the loading time there is not spent loading the executable from disk, but actually executing it (and all its startup/initialisation code). So this probably won't be the speedup you're imagining. It would just save memory.


> pages not shareable, unless you use some sort of filesystem optimised for that.

btrfs on the host supports deduplication of identical blocks in the disk images. It's true that a CPU-costly scan would be needed to identify new shared blocks if, for example, two VMs are both updated to the latest distro release.
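
For what it's worth, out-of-band dedup tools do that scan in userspace and then ask the filesystem to share extents with the FIDEDUPERANGE ioctl, which only merges ranges the kernel verifies are byte-identical. A rough sketch, with made-up image paths and a single hard-coded range:

    /* Sketch: ask btrfs (or XFS) to deduplicate one range that is identical
     * in two disk images. The kernel re-checks that the bytes match before
     * sharing the extent. The image paths are made up. */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int src = open("/var/lib/images/vm1.img", O_RDONLY);
        int dst = open("/var/lib/images/vm2.img", O_RDWR);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        size_t len = 128 * 1024;                    /* dedupe one 128 KiB range */
        struct file_dedupe_range *arg =
            calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
        arg->src_offset = 0;
        arg->src_length = len;
        arg->dest_count = 1;
        arg->info[0].dest_fd = dst;
        arg->info[0].dest_offset = 0;

        if (ioctl(src, FIDEDUPERANGE, arg) < 0) {
            perror("FIDEDUPERANGE");
        } else if (arg->info[0].status == FILE_DEDUPE_RANGE_SAME) {
            printf("deduplicated %llu bytes\n",
                   (unsigned long long)arg->info[0].bytes_deduped);
        } else {
            printf("ranges differ (or were skipped), nothing shared\n");
        }

        free(arg);
        close(src);
        close(dst);
        return 0;
    }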


Chromium load time with an empty profile on my system is 4.5 seconds with a cleared disk block cache, and 2.0 seconds with a warm disk cache.

So startup time could be better than halved. Seems worth it.


It's not really possible in the hypervisor, as it doesn't know beforehand what the guest will be putting in its memory.


Doubt it is worth the hassle. How many pages do you really expect to be identical?

An OS isn't large. Your spotify/slack/browser instance is of comparable size. Says more about browser-based apps, but still.


> An OS isn't large. Your spotify/slack/browser instance is of comparable size.

A fairly recent Windows 11 Pro image is ~26GB unpacked and 141k dirents. After finishing OOBE it's already running like >100 processes, >1000 threads, and >100k handles. My Chrome install is ~600MB and 115 dirents. (Not including UserData.) It runs ~1 process per tab. Comparable in scope and complexity? That's debatable, but I tend to agree that modern browsers are pretty similar in scope to what an OS should be. (The other day my "web browser" flashed the firmware on the microcontroller for my keyboard.)

They're not even close to "being comparable in size," although I guess that says more about Windows.


My reading was that the "comparable in size" was more about memory footprint and less about usage of storage


Basically all code pages should be the same if some other VM has the same version of Ubuntu and is running the same version of spotify/slack.

And remember that as well as RAM savings, you also get 'instant loading' because there is no need to do slow SSD accesses to load hundreds of megabytes of a chromium binary to get slack running...


If you already know so much about your application(s), are you sure you need virtualization?


The second I read "shared block cache" my brain went to containers.

If you want data colocated on the same filesystem, then put it on the same filesystem. VMs suck, nobody spins up a whole fabricated IBM-compatible PC and gaslights their executable because they want to.[1] They do it because their OS (a) doesn't have containers, (b) doesn't provide strong enough isolation between containers, or (c) the host kernel can't run their workload. (Different ISA, different syscalls, different executable format, etc.)

Anyone who has ever tried to run heavyweight VMs atop a snapshotting volume already knows the idea of "shared blocks" is a fantasy; as soon as you do one large update inside the guest, the delta between your volume clones and the base snapshot grows immensely. That's why Docker et al. have a concept of layers, and you describe your desired state as a series of idempotent instructions applied to those layers. That's possible because Docker operates semantically on a filesystem; it's much harder to do at the level of a block device.

Is a block containing b"hello, world" part of a program's text section, or part of a user's document? You don't know, because the guest is asking you for an LBA, not a path, not modes, not an ACL, etc. If you don't know that, the host kernel has no idea how the page should be mapped into memory. Furthermore, storing the information needed to dedup common blocks is non-trivial: go look at the man page for ZFS's deduplication and it is littered with warnings about the performance, memory, and storage implications of dealing with the dedup table.

[1]: https://www.youtube.com/watch?v=coFIEH3vXPw


People run containers for two reasons: #1. They can't control their devs' Python dependencies. #2. Everyone runs containers! Can't be left behind.


I've tried using virtio-pmem + DAX so that the page cache isn't duplicated between the guest and the host. In practice, the RAM overhead of virtio-pmem is unacceptable and it doesn't support discard operations at all. So yes, a better solution would be needed.


OpenVZ does this. If you have 5 VMs each loading the same library then memory is conserved, as I understand it.


KVM does the same with KSM.


Not precisely, in that KSM does it after the fact, while OpenVZ has it occur as a consequence of its design, at the time the program is loaded.

See (OpenVZ) "Containers share dynamic libraries, which greatly saves memory." It's just 1 Linux kernel when you are running OpenVZ containers.

https://docs.openvz.org/openvz_users_guide.webhelp/_openvz_c...

See (KVM/KSM): "KSM enables the kernel to examine two or more already running programs and compare their memory. If any memory regions or pages are identical, KSM reduces multiple identical memory pages to a single page. This page is then marked copy on write."

https://access.redhat.com/documentation/en-us/red_hat_enterp...

In KVM's defense, it supports a much wider range of OSes; OpenVZ only really does different versions of Linux, while KVM can run OpenBSD/FreeBSD/NetBSD/Windows and even OS/2 in addition to Linux.


KSM is a Linux kernel feature, not directly related to KVM.


Well, that's all nice, but it would also need to be compute-efficient to be worthwhile, and near-real-time dedupe of memory pages would be a REALLY tough challenge.


Pretty straightforward for disk blocks. Many VM disks are already deduped, either through snapshotting or through copy-on-write host filesystems.

The host block cache will end up deduplicating it automatically because all the 'copies' lead back to the same block on disk.
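
On btrfs/XFS that's exactly what a reflink clone gives you: the new image shares the base image's extents, so unmodified blocks also share one page-cache copy. A minimal sketch of creating such a clone with the FICLONE ioctl (what cp --reflink=always uses), with made-up paths:

    /* Sketch: create a reflink clone of a base image. Both files then point
     * at the same extents, so unmodified blocks share one page-cache copy.
     * Paths are made up; needs a reflink-capable filesystem (btrfs, XFS). */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int base  = open("/var/lib/images/base.img", O_RDONLY);
        int clone = open("/var/lib/images/vm-new.img",
                         O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (base < 0 || clone < 0) { perror("open"); return 1; }

        /* FICLONE shares all of base's extents with the new file, copy-on-write. */
        if (ioctl(clone, FICLONE, base) < 0) {
            perror("FICLONE");                      /* e.g. EOPNOTSUPP on ext4 */
            return 1;
        }

        printf("clone created; blocks stay shared until either side writes\n");
        close(base);
        close(clone);
        return 0;
    }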


I believe we do this on Windows for Windows Sandbox. It works well, but you take a performance hit doing the block resolution compared to always paging into physical memory.

https://learn.microsoft.com/en-us/windows/security/applicati...


Are you sure you're not thinking "copy on write" rather than "zero copy"? The latter implies you can predict in advance which pages will be the same forever...


The pages would be copy-on-write, but since this would mostly be for code pages, they would never be written, and therefore never copied.

By 'zero copy', I mean that when a guest tries to read a page, if another guest has that page in RAM, then no copy operation is done to get it into the memory space of the 2nd guest.



