What I dislike about BLAKE3 is that they added explicit logic to ensure that identical chunks stored at different offsets result in different Merkle tree nodes (a.k.a. the ‘chunk counter’).
Though this feature is well intentioned, it makes the hash function hard to use for a storage system that tries to do aggressive data deduplication.
Furthermore, on platforms that provide native instructions for SHA hashing, BLAKE3 isn’t necessarily faster, and it is certainly more power hungry.
Wouldn’t the storage system doing this skip that part of the hash and do the chunking itself, so no issue? (Hash chunks individually, instead of feeding everything in linearly.)
Otherwise the hash isn’t going to be even remotely safe for most inputs?
Answer: identify chunks via something like rsync's rolling window or GearHash, then name those chunks by Blake3.
Trying to use Blake3's tree structure directly to dedupe is a misunderstanding of the problem you're trying to solve. Removing the counter would not let you use Blake3 alone for this purpose.
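Here's a minimal sketch of what that answer looks like in Python, with a Gear-style rolling hash for the chunk boundaries and the third-party blake3 package for naming. The table seed, mask, and size limits are just illustrative choices, not anything from the original suggestion:

    import random
    from blake3 import blake3  # pip install blake3

    # Gear-style CDC: a fixed random per-byte table and a rolling hash;
    # a boundary is declared when the low bits of the hash are all zero.
    _rng = random.Random(42)
    _GEAR = [_rng.getrandbits(64) for _ in range(256)]
    MASK = (1 << 13) - 1              # ~8 KiB average chunk (illustrative)
    MIN_CHUNK, MAX_CHUNK = 2048, 65536

    def chunks(data):
        start, h = 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + _GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            size = i - start + 1
            if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def store(data, chunk_store):
        # Name each chunk by its own BLAKE3 hash; identical content gets
        # the same name no matter where it sits in any particular file.
        manifest = []
        for c in chunks(data):
            name = blake3(c).hexdigest()
            chunk_store.setdefault(name, c)   # dedup happens here
            manifest.append(name)
        return manifest

Because the boundaries are chosen by content rather than by offset, an insertion early in a file only disturbs the chunk it lands in; everything after it still hashes to the same names.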
Could you point to how this is implemented and how it can be used? From the sound of it, you're trying to do something like rsync's rolling-window comparison?
Imagine the case where you're trying to create a storage system for a large number of virtual machine images (e.g., you're trying to build your own equivalent of AWS Machine Images). There is obviously a lot of duplication between parts of images. And not necessarily at the same offset, but also at different offsets that are n*2^k bytes apart, where 2^k represents the block/sector size.
You could consider building this storage system on top of BLAKE3's tree model. Namely, you store blocks as small Merkle trees, and an image is basically a collection of blocks with a different 'hat' on top of it. Unfortunately, BLAKE3 makes this hard, because the same block will end up having a different Merkle tree node depending on the offset at which it's stored.
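As a rough illustration of the scheme being described (my own sketch of a position-independent block tree, not BLAKE3's actual construction): each fixed-size block is hashed on its own, so an identical block yields an identical leaf regardless of offset, and the per-image 'hat' is just a plain binary Merkle root over those leaves. BLAKE3's chunk counter is exactly what stops its internal tree from being reused this way.

    from blake3 import blake3  # pip install blake3

    BLOCK = 1 << 12   # 4 KiB blocks; any fixed power of two works for this sketch

    def leaf_hashes(image):
        # Hash every block independently of its offset, so the same block
        # contributes the same leaf in every image that contains it.
        return [blake3(image[i:i + BLOCK]).digest()
                for i in range(0, len(image), BLOCK)]

    def merkle_root(leaves):
        # The per-image 'hat': a plain binary Merkle tree over the leaves.
        level = leaves or [blake3(b"").digest()]
        while len(level) > 1:
            level = [blake3(level[i] + (level[i + 1] if i + 1 < len(level) else b"")).digest()
                     for i in range(0, len(level), 2)]
        return level[0]

Two images that share a block (at any block-aligned offset) share the corresponding leaf, and any subtree built only from shared leaves is shared as well.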
Author of HashBackup here. I don't see how any kind of hash tree would be effective at de-duplicating VM machine images, other than the degenerate case of an exact copy, which is easy to detect with a single file hash.
Most OSes use 4K block sizes. To get the best dedup you have to hash every 4K block and look up each one individually in a dedup table. Two VM images could both contain an identical 4GB file, but every 4K block of that file could be stored at different offsets in the VM images. A tree hash wouldn't let you dedup anything but identical sections stored at identical offsets, whereas a dedup table and 4K blocks allow you to dedup the entire file.
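A minimal sketch of that fixed-block approach (illustrative code only, not HashBackup's actual implementation; BLAKE3 is used here simply because it's the hash under discussion):

    from blake3 import blake3  # pip install blake3

    BLOCK = 4096   # typical OS block size, per the comment above

    def ingest(image, dedup_table):
        # Represent the image as a list of block IDs; each distinct 4K
        # block is stored once in dedup_table regardless of its offset.
        block_ids = []
        for off in range(0, len(image), BLOCK):
            block = image[off:off + BLOCK]
            bid = blake3(block).hexdigest()
            dedup_table.setdefault(bid, block)   # only new blocks cost space
            block_ids.append(bid)
        return block_ids

    # Two images containing the same large file at different 4K-aligned
    # offsets end up sharing every block of that file in dedup_table.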
> You could consider building this storage system on top of BLAKE3's tree model.
Consider a cryptocurrency PoW that did that without the chunk counter. It'd be trivially exploitable by precalculating all of the tree except the chunk that changes per nonce.
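Roughly what that precalculation looks like, sketched with a generic binary Merkle tree (my own toy example, not BLAKE3's real chaining; its counter exists precisely to bind each chunk to its position):

    from blake3 import blake3  # pip install blake3

    def H(*parts):
        return blake3(b"".join(parts)).digest()

    # Toy block: 7 fixed chunks plus one final chunk that carries the nonce.
    fixed = [bytes([i]) * 1024 for i in range(7)]

    # Without a per-chunk counter, every node that covers only fixed chunks
    # can be hashed once up front and cached...
    left_root = H(H(H(fixed[0]), H(fixed[1])), H(H(fixed[2]), H(fixed[3])))
    node_45   = H(H(fixed[4]), H(fixed[5]))
    leaf_6    = H(fixed[6])

    def root_for_nonce(nonce):
        # ...so each attempt only pays for the path above the nonce chunk:
        # one leaf hash plus three combines, instead of rehashing all 8 KiB.
        leaf_7 = H(nonce.to_bytes(8, "little"))
        return H(left_root, H(node_45, H(leaf_6, leaf_7)))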
You can use a CDC algorithm, but if you know that duplication mostly occurs at power-of-two boundaries, there is no need to use one. Deduplicating on a binary tree basis is sufficient.