
Imagine the case where you're trying to create a storage system for a large number of virtual machine images (e.g., you're trying to build your own equivalent of AWS Machine Images). There is obviously a lot of duplication between parts of images, and not necessarily at the same offset: identical data also shows up at offsets that are n*2^k bytes apart, where 2^k is the block/sector size.

You could consider building this storage system on top of BLAKE3's tree model. Namely, you store blocks as small Merkle trees, and an image is basically a collection of blocks with a different 'hat' on top. Unfortunately, BLAKE3 makes this hard, because the same block ends up having a different Merkle tree node depending on the offset at which it's stored.
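
For illustration, here's a toy model of why that happens. Real BLAKE3 mixes a 64-bit chunk counter into each 1 KiB chunk's chaining value; the SHA-256-based leaf_hash below is just a stand-in for that, not the actual compression function:

    import hashlib

    CHUNK = 1024  # BLAKE3's chunk size in bytes

    def leaf_hash(counter: int, data: bytes) -> bytes:
        # Stand-in for BLAKE3 chunk hashing: the chaining value depends
        # on the chunk's position in the stream, not just its content.
        return hashlib.sha256(counter.to_bytes(8, "little") + data).digest()

    block = b"\xab" * CHUNK

    # Identical content at chunk index 0 vs. index 7 yields different
    # leaf hashes, so equal data at different offsets never shares subtrees.
    assert leaf_hash(0, block) != leaf_hash(7, block)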




Author of HashBackup here. I don't see how any kind of hash tree would be effective at de-duplicating VM images, other than the degenerate case of an exact copy, which is easy to detect with a single file hash.

Most OSes use 4K block sizes. To get the best dedup you have to hash every 4K block and look each one up individually in a dedup table. Two VM images could both contain an identical 4GB file, but every 4K block of that file could be stored at different offsets in the VM images. A tree hash wouldn't let you dedup anything but identical sections stored at identical offsets, whereas a dedup table and 4K blocks allow you to dedup the entire file.
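
A minimal sketch of that scheme (names are illustrative, not HashBackup's actual code):

    import hashlib

    BLOCK = 4096
    store = {}  # block hash -> block data, shared across all images

    def ingest(image: bytes) -> list[bytes]:
        # Represent an image as a list of block hashes; each distinct
        # block is stored once, regardless of its offset in any image.
        recipe = []
        for off in range(0, len(image), BLOCK):
            block = image[off:off + BLOCK]
            h = hashlib.sha256(block).digest()
            store.setdefault(h, block)
            recipe.append(h)
        return recipe

    def restore(recipe: list[bytes]) -> bytes:
        return b"".join(store[h] for h in recipe)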


Sounds to me like you are trying to use the innards of a hash algorithm for something for which it was not designed...

Either modify the algorithm to your needs and rename it.

Or just use something that's already suitable off the shelf. Plenty of Merkle trees out there.


I think CDC (content-defined chunking) is what you're looking for. Some backup tools like restic use it. See https://en.m.wikipedia.org/wiki/Rolling_hash
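
Roughly, CDC picks chunk boundaries from the data itself using a rolling hash, so the boundaries survive insertions and shifts. A simplified sketch (restic actually uses Rabin fingerprints; the polynomial hash and parameters here are just for illustration):

    B, W, MOD = 31, 64, 1 << 32   # base, window size, modulus
    BW = pow(B, W, MOD)           # B^W, to drop the byte leaving the window
    MASK = (1 << 13) - 1          # 13-bit mask -> average chunk ~8 KiB
    MIN = 2048                    # minimum chunk size

    def chunks(data: bytes):
        # Cut wherever the low 13 bits of the rolling hash are zero;
        # boundaries depend only on nearby content, not on offsets.
        h, start = 0, 0
        for i, byte in enumerate(data):
            h = (h * B + byte) % MOD
            if i >= W:
                h = (h - data[i - W] * BW) % MOD
            if (h & MASK) == 0 and i + 1 - start >= MIN:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]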


Duplicated myself, sry


> You could consider building this storage system on top of BLAKE3's tree model.

Consider a cryptocurrency proof-of-work that did that without the chunk counter. It'd be trivially exploitable by precalculating all of the tree except the chunk that changes per nonce.


You mean something like a CDC algorithm? I know that some backup tools like Restic use this.

https://en.m.wikipedia.org/wiki/Rolling_hash


You can use a CDC algorithm, but if you know that duplication mostly occurs at power-of-two boundaries, there is no need: deduplicating on a binary-tree basis is sufficient.
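
A minimal sketch of that idea, assuming duplicate runs are aligned to power-of-two boundaries (illustrative only; unlike BLAKE3, the node hashes here carry no position counter, so identical aligned subtrees hash the same and get stored once):

    import hashlib

    BLOCK = 4096
    table = {}  # node hash -> leaf bytes or (left hash, right hash)

    def leaf(data: bytes) -> bytes:
        h = hashlib.sha256(b"L" + data).digest()
        table.setdefault(h, data)
        return h

    def node(left: bytes, right: bytes) -> bytes:
        h = hashlib.sha256(b"N" + left + right).digest()
        table.setdefault(h, (left, right))
        return h

    def build(image: bytes) -> bytes:
        # Hash 4K leaves, then pair them up level by level. Two images
        # sharing an identical, aligned 2^k-byte run produce the same
        # subtree hash, so the whole run dedups in one table hit.
        level = [leaf(image[o:o + BLOCK])
                 for o in range(0, len(image), BLOCK)] or [leaf(b"")]
        while len(level) > 1:
            if len(level) % 2:
                level.append(leaf(b""))  # pad odd levels (one possible choice)
            level = [node(level[i], level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]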



