
Imagine the case where you're trying to create a storage system for a large number of virtual machine images (e.g., you're trying to build your own equivalent of AWS Machine Images). There is obviously a lot of duplication between parts of images, and not necessarily at the same offset: identical data also shows up at offsets that are n*2^k bytes apart, where 2^k is the block/sector size.

You could consider building this storage system on top of BLAKE3's tree model. Namely, you store blocks as small Merkle trees, and an image is basically a collection of blocks with a different 'hat' on top. Unfortunately, BLAKE3 makes this hard, because the same block ends up having a different Merkle tree node depending on the offset at which it's stored.
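
For illustration, here's a toy model of why that happens. Real BLAKE3 mixes a 64-bit chunk counter into each 1 KiB chunk's chaining value; the SHA-256-based leaf_hash below is just a stand-in for that, not the actual compression function:

    import hashlib

    CHUNK = 1024  # BLAKE3's chunk size in bytes

    def leaf_hash(counter: int, data: bytes) -> bytes:
        # Stand-in for BLAKE3 chunk hashing: the chaining value depends
        # on the chunk's position in the stream, not just its content.
        return hashlib.sha256(counter.to_bytes(8, "little") + data).digest()

    block = b"\xab" * CHUNK

    # Identical content at chunk index 0 vs. index 7 yields different
    # leaf hashes, so equal data at different offsets never shares subtrees.
    assert leaf_hash(0, block) != leaf_hash(7, block)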




Author of HashBackup here. I don't see how any kind of hash tree would be effective at de-duplicating VM images, other than the degenerate case of an exact copy, which is easy to detect with a single file hash.

Most OSes use 4K block sizes. To get the best dedup you have to hash every 4K block and look each one up individually in a dedup table. Two VM images could both contain an identical 4GB file, but every 4K block of that file could be stored at different offsets in the VM images. A tree hash wouldn't let you dedup anything but identical sections stored at identical offsets, whereas a dedup table and 4K blocks allow you to dedup the entire file.
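
A minimal sketch of that scheme (names are illustrative, not HashBackup's actual code):

    import hashlib

    BLOCK = 4096
    store = {}  # block hash -> block data, shared across all images

    def ingest(image: bytes) -> list[bytes]:
        # Represent an image as a list of block hashes; each distinct
        # block is stored once, regardless of its offset in any image.
        recipe = []
        for off in range(0, len(image), BLOCK):
            block = image[off:off + BLOCK]
            h = hashlib.sha256(block).digest()
            store.setdefault(h, block)
            recipe.append(h)
        return recipe

    def restore(recipe: list[bytes]) -> bytes:
        return b"".join(store[h] for h in recipe)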


Sounds to me like you are trying to use the innards of a hash algorithm for something for which it was not designed...

Either modify the algorithm to your needs and rename it.

Or just use something that's already suitable off the shelf. Plenty of Merkle trees out there.


I think CDC (content-defined chunking) is what you're looking for. Some backup tools like restic use it. See https://en.m.wikipedia.org/wiki/Rolling_hash
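
Roughly, CDC picks chunk boundaries from the data itself using a rolling hash, so the boundaries survive insertions and shifts. A simplified sketch (restic actually uses Rabin fingerprints; the polynomial hash and parameters here are just for illustration):

    B, W, MOD = 31, 64, 1 << 32   # base, window size, modulus
    BW = pow(B, W, MOD)           # B^W, to drop the byte leaving the window
    MASK = (1 << 13) - 1          # 13-bit mask -> average chunk ~8 KiB
    MIN = 2048                    # minimum chunk size

    def chunks(data: bytes):
        # Cut wherever the low 13 bits of the rolling hash are zero;
        # boundaries depend only on nearby content, not on offsets.
        h, start = 0, 0
        for i, byte in enumerate(data):
            h = (h * B + byte) % MOD
            if i >= W:
                h = (h - data[i - W] * BW) % MOD
            if (h & MASK) == 0 and i + 1 - start >= MIN:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]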


Duplicated myself, sry


> You could consider building this storage system on top of BLAKE3's tree model.

Consider a cryptocurrency proof-of-work that did that without the chunk counter. It'd be trivially exploitable by precalculating all of the tree except the chunk that changes per nonce.


You mean something like a CDC algorithm? I know that some backup tools like Restic use this.

https://en.m.wikipedia.org/wiki/Rolling_hash


You can use a CDC algorithm, but if you know that duplication mostly occurs at power-of-two boundaries, there is no need: deduplicating on a binary-tree basis is sufficient.
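
A minimal sketch of that idea, assuming duplicate runs are aligned to power-of-two boundaries (illustrative only; unlike BLAKE3, the node hashes here carry no position counter, so identical aligned subtrees hash the same and get stored once):

    import hashlib

    BLOCK = 4096
    table = {}  # node hash -> leaf bytes or (left hash, right hash)

    def leaf(data: bytes) -> bytes:
        h = hashlib.sha256(b"L" + data).digest()
        table.setdefault(h, data)
        return h

    def node(left: bytes, right: bytes) -> bytes:
        h = hashlib.sha256(b"N" + left + right).digest()
        table.setdefault(h, (left, right))
        return h

    def build(image: bytes) -> bytes:
        # Hash 4K leaves, then pair them up level by level. Two images
        # sharing an identical, aligned 2^k-byte run produce the same
        # subtree hash, so the whole run dedups in one table hit.
        level = [leaf(image[o:o + BLOCK])
                 for o in range(0, len(image), BLOCK)] or [leaf(b"")]
        while len(level) > 1:
            if len(level) % 2:
                level.append(leaf(b""))  # pad odd levels (one possible choice)
            level = [node(level[i], level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]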



