I've seen multiple comments here about using dedup for VMs. Wouldn't it be a lot more efficient for this to be implemented by the hypervisor rather than the filesystem?
I'm a former VMware certified admin. How do you envision this working? All the data written to the VM's virtual disk will cause blocks to change, and the storage array is the best place to keep track of that.
You do it at the file system layer. Clone the template, which creates only metadata referencing the original blocks, then perform copy-on-write as needed.
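On Linux, for example, any filesystem with reflink support (XFS, Btrfs, OpenZFS with block cloning) exposes exactly this. A minimal sketch, assuming the FICLONE ioctl from <linux/fs.h> and hypothetical file names:

    import fcntl
    import os

    FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>; Linux-only

    src = os.open("template.img", os.O_RDONLY)
    dst = os.open("clone.img", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # The clone shares the template's extents; no data is copied.
        # The filesystem does copy-on-write when either file is written.
        fcntl.ioctl(dst, FICLONE, src)
    finally:
        os.close(src)
        os.close(dst)

The clone is a metadata-only operation, so it completes in roughly constant time regardless of how large the template image is.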
> When dedup is enabled [...] every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
Linked clones shouldn't need that. They likely start out with only references to the original blocks, and then replace them as they change. If so, it's a different concept (it would mean that any new duplicate blocks are not shared), but for the use case of "spin up a hundred identical VMs that then change comparatively little" it sounds more efficient performance-wise, with a negligible loss in space efficiency.
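Conceptually, I'm imagining something like this toy sketch (purely illustrative, not any real hypervisor's disk format):

    # Toy model of a linked clone: reads fall through to the shared base
    # image, writes go to a per-clone delta, and nothing is ever hashed.
    class LinkedClone:
        def __init__(self, base):
            self.base = base    # shared, read-only template blocks
            self.delta = {}     # blocks this clone has overwritten

        def read(self, block_no):
            return self.delta.get(block_no, self.base[block_no])

        def write(self, block_no, data):
            self.delta[block_no] = data   # O(1), no dedup-table lookup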
Am I certain of this? No, this is just what I quickly pieced together based on some assumptions (albeit reasonable ones). Happy to be told otherwise.
Linked clones aren't really used in ESXi these days; instant clones are, and they ARE pretty nifty and heavily used in VDI, where you need to spin up many thousands of desktop VMs. But they have to keep track of which blocks change, so every clone has a delta disk. At the end of the day you are just moving around where this bookkeeping happens. And it is best done on an enterprise-grade array with ultra-optimized inline dedupe, like a Pure array.
I'm not sure that's true, because the hypervisor can know which blocks are related to begin with. From what I quoted above, it seems the file system instead does a lookup based on block content to determine whether a block is a dupe (I don't know if it uses a hash, which means processing the whole block, or something like an RB tree, which avoids reading the whole block if it already differs early from the candidates). Unless there is a way to explicitly tell the file system that you are copying blocks for that purpose, and VMware actually does that.

If not, then leaving it to the file system or even the storage layer should have a definite performance cost, albeit in exchange for higher space efficiency, because a content lookup can deduplicate blocks that are identical but not directly related. That would give a space benefit if you do things like installing the same applications across many VMs after cloning, but assuming that isn't commonly done (I think you should clone after establishing all common state like app installations, if possible), my gut feeling is very much that the performance benefit of more semantic-level hypervisor bookkeeping outweighs the space gains from "dumb" block-oriented fs/storage bookkeeping.
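To illustrate the kind of content-based lookup I mean, a toy sketch of my reading of the article's dedup-table description (definitely not ZFS's actual on-disk structure):

    import hashlib

    class DedupTable:
        def __init__(self):
            self.table = {}     # checksum -> [block_address, refcount]
            self.next_addr = 0

        def write(self, block):
            key = hashlib.sha256(block).digest()  # every write hashes the data...
            entry = self.table.get(key)           # ...and pays a table lookup
            if entry is not None:
                entry[1] += 1                     # duplicate: just bump the refcount
                return entry[0]
            addr, self.next_addr = self.next_addr, self.next_addr + 1
            self.table[key] = [addr, 1]           # new content: allocate a block
            return addr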
Your phrasing sounds like you're unaware that filesystems can also do the same kind of cloning that a hypervisor does, where the initial data takes no storage space and only changes get written.
In fact, it's a much more common feature than active deduplication.
VM drives are just files, and it's weird that you imply a filesystem wouldn't know about the semantics of a file getting copied and altered, and would only understand blocks.
Uh, thanks for the personal attack? I am aware that cloning exists, and I very explicitly allowed for such a mechanism changing the conclusion in both of my comments. My trouble was that I wasn't sure how much filesystem-level cloning is actually in use in the relevant contexts.

Does POSIX have some sort of "copyfile()" system call nowadays? Last I knew (outdated, I'm sure), the cp command, for example, seemed to just read() blocks into a buffer and write() them out again, and I don't see how the filesystem layer would detect that as a clone without a lookup. I was quoting and basing my assumptions on the article:
> The downside is that every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
Which, if universally true, is very different from what a hypervisor could do instead, and I've detailed the potential differences. But if a hypervisor does use some sort of clone system call, that would indeed shift the same approach into the fs layer, and my genuine question is whether it does.
I said "your phrasing sounds like" specifically to make it not personal. Clearly some information was missing but I wasn't sure exactly what. I'll try to phrase that better in the future.
It sounds like the information you were missing is that cp has a flag (--reflink) to make cloning happen. I think it even became the default behavior in recent coreutils.
Also, the article quote is strictly talking about dedup. That downside does not generalize to the clone/reflink features, which use a much more lightweight mechanism.
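Roughly, the difference looks like this (a toy sketch, not any real filesystem's on-disk layout): a clone just bumps reference counts on the existing extents, and only a write to a still-shared extent pays anything extra.

    class Extent:
        def __init__(self, data):
            self.data = data
            self.refs = 1

    def reflink_clone(extents):
        for e in extents:
            e.refs += 1          # metadata-only; no data read, hashed, or copied
        return list(extents)

    def write(extents, i, data):
        e = extents[i]
        if e.refs > 1:           # copy-on-write only while the extent is shared
            e.refs -= 1
            extents[i] = Extent(data)
        else:
            e.data = data        # sole owner: overwrite in place

No content hashing happens anywhere, which is why the per-write dedup-table cost from the article simply doesn't apply here.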
Huh? What do you mean? They absolutely are. I've made extensive use of them in ESXi/vSphere clusters in situations where I'm spinning up and down many temporary VMs.
Linked clones do not exist in ESXi. Horizon Composer is what is/was used to create them, and that requires a vCenter Server and a bit of infrastructure, including a database.
No, you can create them via the API with just vCenter, no extra infrastructure beyond that. The pyVmomi library has example code showing how to do it. IIRC it is true that standalone ESXi does not offer the option to create a linked clone by itself, but if I wanted to be a pedant I'd argue that linked clones do exist in ESXi, as that is where vCenter deploys them.
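From memory, the pyVmomi version looks something like this (untested sketch based on the sample code; assumes you're already connected to vCenter and the source VM has a snapshot to back the delta disks):

    from pyVmomi import vim

    def create_linked_clone(src_vm, folder, name):
        relocate = vim.vm.RelocateSpec()
        # Delta disks backed by the snapshot's disks instead of full copies:
        relocate.diskMoveType = "createNewChildDiskBacking"

        spec = vim.vm.CloneSpec()
        spec.location = relocate
        spec.snapshot = src_vm.snapshot.currentSnapshot
        spec.powerOn = False

        # Returns a vSphere task; wait on it to get the new VM.
        return src_vm.CloneVM_Task(folder=folder, name=name, spec=spec)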