
This was a great write up. I've already sent it to a few people.

On the question of what happens if a file's contents change after the initial checksum, the man page for rsync[0] has an interesting explanation of the *--checksum* option:

> This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync uses a "quick check" that (by default) checks if each file's size and time of last modification match between the sender and receiver. This option changes this to compare a 128-bit checksum for each file that has a matching size. Generating the checksums means that both sides will expend a lot of disk I/O reading all the data in the files in the transfer (and this is prior to any reading that will be done to transfer changed files), so this can slow things down significantly.

> The sending side generates its checksums while it is doing the file-system scan that builds the list of the available files. The receiver generates its checksums when it is scanning for changed files, and will checksum any file that has the same size as the corresponding sender's file: files with either a changed size or a changed checksum are selected for transfer.

> Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred, but that automatic after-the-transfer verification has nothing to do with this option's before-the-transfer "Does this file need to be updated?" check. For protocol 30 and beyond (first supported in 3.0.0), the checksum used is MD5. For older protocols, the checksum used is MD4.
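
In practice the difference is a single flag; the paths below are just placeholders:

    # default "quick check": compare size + mtime only
    rsync -av /data/ backup:/data/

    # -c / --checksum: read and hash every same-size file on both sides
    rsync -avc /data/ backup:/data/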

[0]: https://linux.die.net/man/1/rsync




> For protocol 30 and beyond (first supported in 3.0.0), the checksum used is MD5. For older protocols, the checksum used is MD4.

Newer versions (≥3.2?) support xxHash and xxHash3:

* https://github.com/WayneD/rsync/blob/master/checksum.c

* https://github.com/Cyan4973/xxHash

* https://news.ycombinator.com/item?id=19402602 (2019 XXH3 discussion)


As mentioned in a sibling comment, linux.die.net's manpage is outdated here (it covers rsync 3.0.6). Current versions of rsync (>= 3.2.0) autonegotiate the checksum type from several options (including a few variants of xxHash, MD5, and MD4) unless overridden by the user, or unless one side is pre-3.2.0 (in which case you get the old behavior of MD5 or MD4 depending on protocol version). The pre-transfer and transfer checksums also don't necessarily have to be the same (I don't particularly care to hunt down the default priority lists right now, so I'm not sure whether they match by default).

Manpage for rsync 3.2.4, from Debian testing: https://manpages.debian.org/testing/rsync/rsync.1.en.html
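
If you want to see what your build supports and force a particular algorithm, something like this works on 3.2.x (the exact flag spelling and algorithm names vary by version and build, so treat it as a sketch):

    # 3.2.x prints the supported checksum list in its version output
    rsync --version

    # override the negotiated choice explicitly
    rsync -av --checksum-choice=xxh128 src/ dst/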


I was surprised to learn recently that sha1 is typically about 10% faster than md5. I have some coworkers who I need to remind of this, since they’re still using md5 for caching.
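
That kind of claim is easy to re-check on whatever hardware matters to you; OpenSSL ships a microbenchmark (throughput varies a lot by CPU, e.g. chips with SHA extensions accelerate SHA-1 but not MD5):

    # prints hashing throughput for a range of block sizes
    openssl speed md5 sha1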


And Blake3 is almost 7 times faster than SHA-1...
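
Assuming the b3sum CLI is installed (it's a separate tool, and it hashes with multiple threads by default), the difference is easy to see on a big file; the filename here is just a placeholder:

    time sha1sum big-file.img
    time b3sum big-file.img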


I thought xxHash was only used for the chunk hash, not the whole file hash?


Failure cases of the 'quick check':

* Underlying disk device corruption - but modern disks do internal error checking, and should emit an IO error.

* Corruption in RAM/software bug in the kernel IO subsystem. Should be detected by filesystem checksumming.

* User has accidentally modified a file and set the mtime back. Checksumming fixes this case (a shell sketch at the end of this comment shows how the quick check gets fooled).

* User has maliciously modified a file and set the mtime back. Since it's MD5 (broken), the malicious user can make the checksum match too. Checksumming doesn't help.

Given that, I see no users who really benefit from checksumming. It isn't sufficient for anyone with really high data integrity requirements, while also being overkill for typical use cases.
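
To make the mtime cases concrete, here's roughly how a changed file slips past the quick check (the filename is a placeholder, and the file is assumed to be longer than the overwritten range so its size stays the same):

    # remember the original mtime on a scratch file
    touch -r important.conf /tmp/stamp

    # overwrite a few bytes in place without changing the size
    printf 'XXXX' | dd of=important.conf bs=1 seek=100 conv=notrunc

    # put the old mtime back
    touch -r /tmp/stamp important.conf

    # size and mtime now match the remote copy, so a plain rsync
    # skips the file; rsync --checksum still detects the change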


> * User has maliciously modified a file and set the mtime back. Since it's MD5 (broken), the malicious user can make the checksum match too. Checksumming doesn't help.

No, md5 is not broken like that (yet). There is no known second-preimage attack against md5; the practical collision vulnerabilities only affect cases where an attacker controls the file content both before and after the update.


The very existence of filesystem checksumming is because your first point isn't always true.

Also, filesystem checksumming does not guard against RAM/kernel bugs. On top of that, filesystem checksumming is very rare.


It has its use: I discovered corruption in files copied between external hard drives over a USB hub, using --checksum. It was a bad hub; it hasn't happened again since I replaced it (finding it once got me paranoid enough to check every once in a while).

For anyone curious about how to find such problems without changing the files, I used "--checksum --dry-run --itemize-changes"
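
Concretely, something like this compares two copies byte-for-byte without modifying either side (the mount points are placeholders):

    # -a for recursion/metadata; nothing is copied thanks to --dry-run
    rsync -a --checksum --dry-run --itemize-changes /mnt/diskA/ /mnt/diskB/

Any file it itemizes is one whose contents (or metadata) differ between the two copies.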


Or they're using a filesystem where mtime is always zero, i.e. unsupported.

Which is an actual case that has occurred for me.


I guess zfs send and similar are better solutions, but what if we could query the filesystem for existing checksums of a file and save IO that way, if filesystems on both sides already stored usable checksums?


> I guess zfs send and similar are better solutions.

It depends. I recently built a new zfs pool server and needed to transfer a few TB of data from the old pool to the new pool, but I built the new pool with a larger record size. If I’d used zfs send the files would have retained their existing record size. So rsync it was.
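
Roughly what that looks like (the pool and dataset names are made up):

    # new dataset created with the larger record size
    zfs create -o recordsize=1M newpool/data

    # rsync rewrites every file, so the data lands in 1M records;
    # zfs send/recv would have carried the old block layout across
    rsync -aHAX /oldpool/data/ /newpool/data/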


Unless you are also doing FS-level deduplication using the same checksums, it generally makes no sense for these to be cryptographic hashes, so they're not necessarily suitable for this purpose.

IIRC neither ZFS nor btrfs use cryptographic hashes for checksumming by default.


> on is a short hand for fletcher4 for non-deduped datasets and sha256 for deduped datasets

* https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Chec...

* https://people.freebsd.org/~asomers/fletcher.pdf

* https://en.wikipedia.org/wiki/Fletcher%27s_checksum

Strangely enough, SHA-512 is actually about 50% faster than SHA-256:

> ZFS actually uses a special version of SHA512 called SHA512t256, it uses a different initial value, and truncates the results to 256 bits (that is all the room there is in the block pointer). The advantage is only that it is faster on 64 bit CPUs.

* https://twitter.com/klarainc/status/1367199461546614788
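
The SHA-512-vs-SHA-256 gap is easy to reproduce: SHA-512 uses 64-bit words and chews through a 128-byte block per compression call versus 64 bytes for SHA-256, so it wins on 64-bit CPUs (though chips with SHA extensions accelerate SHA-256 and can reverse the result). OpenSSL's benchmark shows it directly:

    openssl speed sha256 sha512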


It seems that SHA512t256 is another name for SHA-512/256. It's a shame that the initialization is different from SHA-512, as it would have been very useful to be able to convert a SHA-512 hash into a SHA-512/256 hash.


ZFS send does not do any kind of cross checking with the receiving end, so no, not really ideal. Even incremental ZFS sends don't do this; it keeps state on the sending side only. It's OK for its intended use cases, but it's not a direct rsync replacement.

It's also hard/impossible to restore individual files out of a ZFS send stream without restoring the whole thing so I've reverted to using tarballs of ZFS snapshots for backups instead of ZFS send. Again, it was never really meant for this so it was my mistake trying to use it that way.
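
For what it's worth, the tarball-of-a-snapshot approach can be as simple as archiving the hidden snapshot directory (names here are placeholders, and the .zfs directory may need snapdir=visible set on the dataset):

    zfs snapshot pool/data@backup-2022-07
    tar -czf /backups/data-2022-07.tar.gz \
        -C /pool/data/.zfs/snapshot/backup-2022-07 .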


> https://linux.die.net/man/1/rsync

linux.die.net is horribly outdated. This particular page is from 2009.

Up-to-date docs are here:

https://download.samba.org/pub/rsync/rsync.1


Which version is the samba one? Latest release? Git?

If you want to see the man page of the version in Debian, that would be https://manpages.debian.org/testing/rsync/rsync.1.en.html

Disclaimer: I wrote the software behind manpages.debian.org :)


"This manpage is current for version 3.2.5dev of rsync" – so I guess it's from git.


Yeah, I normally just do the quick check, but once in a while I run a full rescan to make sure nothing got damaged over time. Definitely worth it.



