What I dislike about BLAKE3 is that they added explicit logic to ensure that identical chunks stored at different offsets result in different Merkle tree nodes (a.k.a. the ‘chunk counter’).
Though this feature is well intentioned, it makes the hash function hard to use for a storage system that tries to do aggressive data deduplication.
Furthermore, on platforms that provide native instructions for SHA hashing, BLAKE3 isn’t necessarily faster, and it is certainly more power hungry.
Wouldn’t the storage system doing this skip that part of the hash and do the chunking itself, so no issue? (Hash chunks individually, instead of feeding everything in linearly.)
Otherwise the hash isn’t going to be even remotely safe for most inputs?
Answer: identify chunks via something like rsync's rolling window or GearHash, then name those chunks by Blake3.
Trying to use Blake3's tree structure directly to dedupe is a misunderstanding of the problem you're trying to solve. Removing the counter would not let you use Blake3 alone for this purpose.
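Here's a minimal sketch of what that answer looks like in Python, with a Gear-style rolling hash for the chunk boundaries and the third-party blake3 package for naming. The table seed, mask, and size limits are just illustrative choices, not anything from the original suggestion:

    import random
    from blake3 import blake3  # pip install blake3

    # Gear-style CDC: a fixed random per-byte table and a rolling hash;
    # a boundary is declared when the low bits of the hash are all zero.
    _rng = random.Random(42)
    _GEAR = [_rng.getrandbits(64) for _ in range(256)]
    MASK = (1 << 13) - 1              # ~8 KiB average chunk (illustrative)
    MIN_CHUNK, MAX_CHUNK = 2048, 65536

    def chunks(data):
        start, h = 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + _GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            size = i - start + 1
            if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def store(data, chunk_store):
        # Name each chunk by its own BLAKE3 hash; identical content gets
        # the same name no matter where it sits in any particular file.
        manifest = []
        for c in chunks(data):
            name = blake3(c).hexdigest()
            chunk_store.setdefault(name, c)   # dedup happens here
            manifest.append(name)
        return manifest

Because the boundaries are chosen by content rather than by offset, an insertion early in a file only disturbs the chunk it lands in; everything after it still hashes to the same names.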
Could you point to how this is implemented and how it can be used? From the sound of it, you're trying to do something like rsync's rolling-window comparison?
Imagine the case where you're trying to create a storage system for a large number of virtual machine images (e.g., you're trying to build your own equivalent of AWS Machine Images). There is obviously a lot of duplication between parts of images. And not necessarily at the same offset, but also at different offsets that are n*2^k bytes apart, where 2^k represents the block/sector size.
You could consider building this storage system on top of BLAKE3's tree model. Namely, you store blocks as small Merkle trees, and an image is basically a collection of blocks with a different 'hat' on top of it. Unfortunately, BLAKE3 makes this hard, because the same block will end up having a different Merkle tree node depending on the offset at which it's stored.
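As a rough illustration of the scheme being described (my own sketch of a position-independent block tree, not BLAKE3's actual construction): each fixed-size block is hashed on its own, so an identical block yields an identical leaf regardless of offset, and the per-image 'hat' is just a plain binary Merkle root over those leaves. BLAKE3's chunk counter is exactly what stops its internal tree from being reused this way.

    from blake3 import blake3  # pip install blake3

    BLOCK = 1 << 12   # 4 KiB blocks; any fixed power of two works for this sketch

    def leaf_hashes(image):
        # Hash every block independently of its offset, so the same block
        # contributes the same leaf in every image that contains it.
        return [blake3(image[i:i + BLOCK]).digest()
                for i in range(0, len(image), BLOCK)]

    def merkle_root(leaves):
        # The per-image 'hat': a plain binary Merkle tree over the leaves.
        level = leaves or [blake3(b"").digest()]
        while len(level) > 1:
            level = [blake3(level[i] + (level[i + 1] if i + 1 < len(level) else b"")).digest()
                     for i in range(0, len(level), 2)]
        return level[0]

Two images that share a block (at any block-aligned offset) share the corresponding leaf, and any subtree built only from shared leaves is shared as well.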
Author of HashBackup here. I don't see how any kind of hash tree would be effective at de-duplicating VM machine images, other than the degenerate case of an exact copy, which is easy to detect with a single file hash.
Most OSes use 4K block sizes. To get the best dedup you have to hash every 4K block and look up each one individually in a dedup table. Two VM images could both contain an identical 4GB file, but every 4K block of that file could be stored at different offsets in the VM images. A tree hash wouldn't let you dedup anything but identical sections stored at identical offsets, whereas a dedup table and 4K blocks allow you to dedup the entire file.
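A minimal sketch of that fixed-block approach (illustrative code only, not HashBackup's actual implementation; BLAKE3 is used here simply because it's the hash under discussion):

    from blake3 import blake3  # pip install blake3

    BLOCK = 4096   # typical OS block size, per the comment above

    def ingest(image, dedup_table):
        # Represent the image as a list of block IDs; each distinct 4K
        # block is stored once in dedup_table regardless of its offset.
        block_ids = []
        for off in range(0, len(image), BLOCK):
            block = image[off:off + BLOCK]
            bid = blake3(block).hexdigest()
            dedup_table.setdefault(bid, block)   # only new blocks cost space
            block_ids.append(bid)
        return block_ids

    # Two images containing the same large file at different 4K-aligned
    # offsets end up sharing every block of that file in dedup_table.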
> You could consider building this storage system on top of BLAKE3's tree model.
Consider a cryptocurrency PoW that did that without the chunk counter. It'd be trivially exploitable by precalculating all of the tree except the chunk that changes per nonce.
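Roughly what that precalculation looks like, sketched with a generic binary Merkle tree (my own toy example, not BLAKE3's real chaining; its counter exists precisely to bind each chunk to its position):

    from blake3 import blake3  # pip install blake3

    def H(*parts):
        return blake3(b"".join(parts)).digest()

    # Toy block: 7 fixed chunks plus one final chunk that carries the nonce.
    fixed = [bytes([i]) * 1024 for i in range(7)]

    # Without a per-chunk counter, every node that covers only fixed chunks
    # can be hashed once up front and cached...
    left_root = H(H(H(fixed[0]), H(fixed[1])), H(H(fixed[2]), H(fixed[3])))
    node_45   = H(H(fixed[4]), H(fixed[5]))
    leaf_6    = H(fixed[6])

    def root_for_nonce(nonce):
        # ...so each attempt only pays for the path above the nonce chunk:
        # one leaf hash plus three combines, instead of rehashing all 8 KiB.
        leaf_7 = H(nonce.to_bytes(8, "little"))
        return H(left_root, H(node_45, H(leaf_6, leaf_7)))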
You can use a CDC algorithm, but if you know that duplication mostly occurs at power-of-two boundaries, there is no need to use one. Deduplicating on a binary tree basis is sufficient.