So, my deduplication is about merging various archives I have of various things ...

redsaz · 2023-11-05T22:25:18 1699223118

I like the strategy of only hashing the first part of a file as a multiphase approach of deduping in order to quickly eliminate unique files, I wish I had done it that way with my util. Maybe for v2!

at_a_remove · 2023-11-05T23:44:11 1699227851

I think my next pass at it will be to merge the MD5 and SHA-1 steps so that we get two outputs at the end, this way I would save on the file-reading.

But after that, I think one percent of the end of the file. Then ten percent of the start of the file, then ten percent of the end of the file ...

Given that metadata is typically at the end or beginning of a file, those seem like the best place to look for differences.

I would be open to other hashes so long as they were drop-in easy. I'm not concerned about malicious, forced collisions because they would have to overcome two different kinds of hashing, and the most it would earn is a delay, since there's always a byte-by-byte comparison at the very end.

I suspect I would similarly want to use multiple fingerprinting methods for the visual characteristics of an image file.

redsaz · 2023-11-06T04:55:02 1699246502

> I'm not concerned about malicious, forced collisions

Consider taking a look at xxh3, possibly. It seems a pretty decent contender, hashing speed-wise, to trade off of secure hashing: https://github.com/Cyan4973/xxHash