
I really want VMs to integrate 'smarter' with the host.

For example, if I'm running 5 VMs, there is a good chance that many of the pages are identical. Not only do I want those pages to be deduplicated, but I want them to be zero-copy (i.e. not deduplicated after the fact by some daemon).

To do that, the guest block cache needs to be integrated with the host block cache, so that whenever some guest application tries to map data from disk, the host notices that another virtual machine has already caused this data to be loaded and can simply map the same page of already-loaded data into the VM that is asking.
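
A rough sketch of what this could look like at the host level, assuming the VMs' disks are backed by the same read-only base image (the path below is made up): a read-only MAP_SHARED mapping is served straight out of the host page cache, so every process mapping the file sees the same physical pages, and mincore() shows which of them another VM has already faulted in.

    /* Sketch: shared read-only mappings of one backing file share a single
     * page-cache copy. "/var/lib/images/base.img" is a made-up path. */
    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/var/lib/images/base.img", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        long pagesz = sysconf(_SC_PAGESIZE);
        size_t len = 16 * pagesz;

        /* A read-only shared mapping is backed by the host page cache:
         * every process mapping this range sees the same physical pages. */
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* mincore() reports which pages are already resident, e.g. because
         * another VM's backend has already faulted them in. */
        unsigned char vec[16];
        if (mincore(p, len, vec) == 0) {
            int resident = 0;
            for (size_t i = 0; i < len / pagesz; i++)
                resident += vec[i] & 1;
            printf("%d of %zu pages already in the page cache\n",
                   resident, len / pagesz);
        }

        munmap(p, len);
        close(fd);
        return 0;
    }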




This seems like a security issue waiting to happen when you’re running code from different users.


https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.ht...

Zero-copy is harder, as a system upgrade on one of them will trash the sharing, but KSM is overall pretty effective at saving some memory across similar VMs.


KVM has had KSM (kernel samepage merging), which deduplicates identical pages, for a long time.
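
For reference, KSM only scans memory that a process has opted in with madvise(MADV_MERGEABLE); QEMU does this for guest RAM when memory merging is enabled. A minimal sketch of the opt-in (not QEMU's actual code), assuming KSM is switched on via /sys/kernel/mm/ksm/run:

    /* Sketch: allocate a pretend "guest RAM" region and mark it mergeable so
     * ksmd may collapse identical pages into shared copy-on-write pages. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t ram_size = 256UL << 20;              /* 256 MiB of "guest RAM" */

        void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED) { perror("mmap"); return 1; }

        /* ksmd will now scan this region and merge pages that are identical
         * across all MADV_MERGEABLE regions, marking them copy-on-write. */
        if (madvise(ram, ram_size, MADV_MERGEABLE) != 0)
            perror("madvise(MADV_MERGEABLE)");      /* e.g. kernel without KSM */

        /* ... hand `ram` to the guest as its physical memory ... */
        return 0;
    }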


It is vulnerable to side-channel attacks, so be careful when enabling it: https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)


But that makes a copy first, and only later notices that the pages are the same and merges them again.

Better to not make copies in the first place.


How are you going to know in advance that the pages are going to be the same?

e.g. your guest kernel is loading an application into memory, by reading some parts of an ELF file from disk. Presumably each VM has its own unique disk, so the hypervisor can't know that this is "the same" page of data as another VM has without actually reading it into memory first and calculating a hash or something.

If the VMs share a disk image (e.g. the image is copy-on-write), then I could see it being feasible - e.g. with KVM, even if your VMs are instantiated by distinct userspace processes, they would probably share the pages as they mmap the same disk image. You would still need your virtualised disk device to support copy-on-write, which may or may not be possible depending on your use case.

But your copy-on-write disk images will probably quickly diverge in a way that makes most pages not shareable, unless you use some sort of filesystem optimised for that.

Lastly, since you mentioned Chromium or Slack in another comment - I'm sure you'll find nearly all of the loading time there is not spent loading the executable from disk, but actually executing it (and all its startup/initialisation code). So this probably won't be the speedup you're imagining. It would just save memory.


> pages not shareable, unless you use some sort of filesystem optimised for that.

btrfs on the host supports deduplication of identical blocks in the disk images. It's true that a CPU-costly scan would be needed to identify new shared blocks if, for example, two VMs are both updated to the latest distro release.
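
For what it's worth, out-of-band dedup tools do that scan in userspace and then ask the filesystem to share extents with the FIDEDUPERANGE ioctl, which only merges ranges the kernel verifies are byte-identical. A rough sketch, with made-up image paths and a single hard-coded range:

    /* Sketch: ask btrfs (or XFS) to deduplicate one range that is identical
     * in two disk images. The kernel re-checks that the bytes match before
     * sharing the extent. The image paths are made up. */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int src = open("/var/lib/images/vm1.img", O_RDONLY);
        int dst = open("/var/lib/images/vm2.img", O_RDWR);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        size_t len = 128 * 1024;                    /* dedupe one 128 KiB range */
        struct file_dedupe_range *arg =
            calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
        arg->src_offset = 0;
        arg->src_length = len;
        arg->dest_count = 1;
        arg->info[0].dest_fd = dst;
        arg->info[0].dest_offset = 0;

        if (ioctl(src, FIDEDUPERANGE, arg) < 0) {
            perror("FIDEDUPERANGE");
        } else if (arg->info[0].status == FILE_DEDUPE_RANGE_SAME) {
            printf("deduplicated %llu bytes\n",
                   (unsigned long long)arg->info[0].bytes_deduped);
        } else {
            printf("ranges differ (or were skipped), nothing shared\n");
        }

        free(arg);
        close(src);
        close(dst);
        return 0;
    }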


Chromium load time with an empty profile on my system is 4.5 seconds with a cleared disk block cache, and 2.0 seconds with a warm disk cache.

So startup time could be better than halved. Seems worth it.


It's not really possible in the hypervisor, as it doesn't know beforehand what the guest will be putting in its memory.


Doubt it is worth the hassle. How many pages do you really expect to be identical?

An OS isn't large. Your spotify/slack/browser instance is of comparable size. Says more about browser-based apps, but still.


> An OS isn't large. Your spotify/slack/browser instance is of comparable size.

A fairly recent Windows 11 Pro image is ~26GB unpacked and 141k dirents. After finishing OOBE it's already running like >100 processes, >1000 threads, and >100k handles. My Chrome install is ~600MB and 115 dirents. (Not including UserData.) It runs ~1 process per tab. Comparable in scope and complexity? That's debatable, but I tend to agree that modern browsers are pretty similar in scope to what an OS should be. (The other day my "web browser" flashed the firmware on the microcontroller for my keyboard.)

They're not even close to "being comparable in size," although I guess that says more about Windows.


My reading was that the "comparable in size" was more about memory footprint and less about usage of storage


Basically all code pages should be the same if some other VM has the same version of Ubuntu and is running the same version of spotify/slack.

And remember that as well as RAM savings, you also get 'instant loading' because there is no need to do slow SSD accesses to load hundreds of megabytes of a chromium binary to get slack running...


If you already know so much about your application(s), are you sure you need virtualization?


The second I read "shared block cache" my brain went to containers.

If you want data colocated on the same filesystem, then put it on the same filesystem. VMs suck, nobody spins up a whole fabricated IBM-compatible PC and gaslights their executable because they want to.[1] They do it because their OS (a) doesn't have containers, (b) doesn't provide strong enough isolation between containers, or (c) the host kernel can't run their workload. (Different ISA, different syscalls, different executable format, etc.)

Anyone who has ever tried to run heavyweight VMs atop a snapshotting volume already knows the idea of "shared blocks" is a fantasy; as soon as you do one large update inside the guest, the delta between your volume clones and the base snapshot grows immensely. That's why Docker et al. have a concept of layers, and you describe your desired state as a series of idempotent instructions applied to those layers. That's possible because Docker operates semantically on a filesystem; it's much harder to do at the level of a block device.

Is a block containing b"hello, world" part of a program's text section, or part of a user's document? You don't know, because the guest is asking you for an LBA, not a path, not modes, not an ACL, etc. If you don't know that, the host kernel has no idea how the page should be mapped into memory. Furthermore, storing the information needed to dedup common blocks is non-trivial: go look at the man page for ZFS's deduplication and it is littered with warnings about the performance, memory, and storage implications of dealing with the dedup table.

[1]: https://www.youtube.com/watch?v=coFIEH3vXPw


People run containers for two reasons: #1. They can't control their devs' Python dependencies. #2. Everyone runs containers! Can't be left behind.


I've tried using virtio-pmem + DAX so that the page cache isn't duplicated between the guest and the host. In practice, the RAM overhead of virtio-pmem is unacceptable and it doesn't support discard operations at all. So yes, a better solution would be needed.


OpenVZ does this. If you have 5 VMs each loading the same library then memory is conserved, as I understand it.


KVM does the same with KSM.


Not precisely, in that KSM does it after the fact, while OpenVZ has it occur as a consequence of its design, at the time the program is loaded.

See (OpenVZ) "Containers share dynamic libraries, which greatly saves memory." It's just 1 Linux kernel when you are running OpenVZ containers.

https://docs.openvz.org/openvz_users_guide.webhelp/_openvz_c...

See (KVM/KSM): "KSM enables the kernel to examine two or more already running programs and compare their memory. If any memory regions or pages are identical, KSM reduces multiple identical memory pages to a single page. This page is then marked copy on write."

https://access.redhat.com/documentation/en-us/red_hat_enterp...

In KVM's defense, it supports a much wider range of OSes; OpenVZ only really does different versions of Linux, while KVM can run OpenBSD/FreeBSD/NetBSD/Windows and even OS/2 in addition to Linux.


KSM is a Linux kernel feature, not directly related to KVM.


Well, that's all nice, but it would also need to be compute-efficient to be worthwhile, and near-real-time dedupe of memory pages would be a REALLY tough challenge.


Pretty straightforward for disk blocks. Many VM disks are already deduped, either through snapshotting or through copy-on-write host filesystems.

The host block cache will end up deduplicating it automatically because all the 'copies' lead back to the same block on disk.
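
On btrfs/XFS that's exactly what a reflink clone gives you: the new image shares the base image's extents, so unmodified blocks also share one page-cache copy. A minimal sketch of creating such a clone with the FICLONE ioctl (what cp --reflink=always uses), with made-up paths:

    /* Sketch: create a reflink clone of a base image. Both files then point
     * at the same extents, so unmodified blocks share one page-cache copy.
     * Paths are made up; needs a reflink-capable filesystem (btrfs, XFS). */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int base  = open("/var/lib/images/base.img", O_RDONLY);
        int clone = open("/var/lib/images/vm-new.img",
                         O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (base < 0 || clone < 0) { perror("open"); return 1; }

        /* FICLONE shares all of base's extents with the new file, copy-on-write. */
        if (ioctl(clone, FICLONE, base) < 0) {
            perror("FICLONE");                      /* e.g. EOPNOTSUPP on ext4 */
            return 1;
        }

        printf("clone created; blocks stay shared until either side writes\n");
        close(base);
        close(clone);
        return 0;
    }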


I believe we do this on Windows for Windows Sandbox. It works well, but you take a performance hit doing the block resolution compared to always paging into physical memory.

https://learn.microsoft.com/en-us/windows/security/applicati...


Are you sure you're not thinking "copy on write" rather than "zero copy"? The latter implies you can predict in advance which pages will be the same forever...


The pages would be copy-on-write, but since this would mostly be for code pages, they would never be written, and therefore never copied.

By 'zero copy', I mean that when a guest tries to read a page, if another guest has that page in RAM, then no copy operation is done to get it into the memory space of the 2nd guest.



