
This was a great write up. I've already sent it to a few people.

On the question of what happens if a file's contents change after the initial checksum, the man page for rsync[0] has an interesting explanation of the *--checksum* option:

> This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync uses a "quick check" that (by default) checks if each file's size and time of last modification match between the sender and receiver. This option changes this to compare a 128-bit checksum for each file that has a matching size. Generating the checksums means that both sides will expend a lot of disk I/O reading all the data in the files in the transfer (and this is prior to any reading that will be done to transfer changed files), so this can slow things down significantly.

> The sending side generates its checksums while it is doing the file-system scan that builds the list of the available files. The receiver generates its checksums when it is scanning for changed files, and will checksum any file that has the same size as the corresponding sender's file: files with either a changed size or a changed checksum are selected for transfer.

> Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred, but that automatic after-the-transfer verification has nothing to do with this option's before-the-transfer "Does this file need to be updated?" check. For protocol 30 and beyond (first supported in 3.0.0), the checksum used is MD5. For older protocols, the checksum used is MD4.
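
In practice the difference is a single flag; the paths below are just placeholders:

    # default "quick check": compare size + mtime only
    rsync -av /data/ backup:/data/

    # -c / --checksum: read and hash every same-size file on both sides
    rsync -avc /data/ backup:/data/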

[0]: https://linux.die.net/man/1/rsync




> For protocol 30 and beyond (first supported in 3.0.0), the checksum used is MD5. For older protocols, the checksum used is MD4.

Newer versions (≥3.2?) support xxHash and xxHash3:

* https://github.com/WayneD/rsync/blob/master/checksum.c

* https://github.com/Cyan4973/xxHash

* https://news.ycombinator.com/item?id=19402602 (2019 XXH3 discussion)


As mentioned in a sibling comment, linux.die.net's manpage is outdated here (it covers rsync 3.0.6). Current versions of rsync (>= 3.2.0) autonegotiate the checksum type from several options (including a few variants of xxHash, MD5, and MD4) unless overridden by the user, or unless one side is pre-3.2.0 (in which case you get the old behavior of MD5 or MD4 depending on protocol version). The pre-transfer and transfer checksums also don't necessarily have to be the same (I don't particularly care to hunt down the default priority lists right now, so I'm not sure whether they match by default).

Manpage for rsync 3.2.4, from Debian testing: https://manpages.debian.org/testing/rsync/rsync.1.en.html
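
If you want to see what your build supports and force a particular algorithm, something like this works on 3.2.x (the exact flag spelling and algorithm names vary by version and build, so treat it as a sketch):

    # 3.2.x prints the supported checksum list in its version output
    rsync --version

    # override the negotiated choice explicitly
    rsync -av --checksum-choice=xxh128 src/ dst/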


I was surprised to learn recently that sha1 is typically about 10% faster than md5. I have some coworkers who I need to remind of this, since they’re still using md5 for caching.
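
That kind of claim is easy to re-check on whatever hardware matters to you; OpenSSL ships a microbenchmark (throughput varies a lot by CPU, e.g. chips with SHA extensions accelerate SHA-1 but not MD5):

    # prints hashing throughput for a range of block sizes
    openssl speed md5 sha1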


And Blake3 is almost 7 times faster than SHA-1...
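
Assuming the b3sum CLI is installed (it's a separate tool, and it hashes with multiple threads by default), the difference is easy to see on a big file; the filename here is just a placeholder:

    time sha1sum big-file.img
    time b3sum big-file.img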


I thought xxHash was only used for the chunk hash, not the whole file hash?


Failure cases of the 'quick check':

* Underlying disk device corruption - but modern disks do internal error checking, and should emit an IO error.

* Corruption in RAM/software bug in the kernel IO subsystem. Should be detected by filesystem checksumming.

* User has accidentally modified a file and set the mtime back. Checksumming fixes this case (a shell sketch at the end of this comment shows how the quick check gets fooled).

* User has maliciously modified a file and set the mtime back. Since it's MD5 (broken), the malicious user can make the checksum match too. Checksumming doesn't help.

Given that, I see no users who really benefit from checksumming. It isn't sufficient for anyone with really high data integrity requirements, while also being overkill for typical use cases.
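
To make the mtime cases concrete, here's roughly how a changed file slips past the quick check (the filename is a placeholder, and the file is assumed to be longer than the overwritten range so its size stays the same):

    # remember the original mtime on a scratch file
    touch -r important.conf /tmp/stamp

    # overwrite a few bytes in place without changing the size
    printf 'XXXX' | dd of=important.conf bs=1 seek=100 conv=notrunc

    # put the old mtime back
    touch -r /tmp/stamp important.conf

    # size and mtime now match the remote copy, so a plain rsync
    # skips the file; rsync --checksum still detects the change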


> * User has maliciously modified a file and set the mtime back. Since it's MD5 (broken), the malicious user can make the checksum match too. Checksumming doesn't help.

No, md5 is not broken like that (yet). There is no known second-preimage attack against md5; the practical collision vulnerabilities only affect cases where an attacker controls the file content both before and after the update.


The very existence of filesystem checksumming is because your first point isn't always true.

Also, filesystem checksumming does not guard against RAM/kernel bugs. On top of that, filesystem checksumming is very rare.


It has its use: I discovered corruption in files copied between external hard drives over a USB hub, using --checksum. It was a bad hub; it hasn't happened again since I replaced it (finding it once got me paranoid enough to check every once in a while).

For anyone curious about how to find such problems without changing the files, I used "--checksum --dry-run --itemize-changes"
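
Concretely, something like this compares two copies byte-for-byte without modifying either side (the mount points are placeholders):

    # -a for recursion/metadata; nothing is copied thanks to --dry-run
    rsync -a --checksum --dry-run --itemize-changes /mnt/diskA/ /mnt/diskB/

Any file it itemizes is one whose contents (or metadata) differ between the two copies.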


Or they're using a filesystem where mtime is always zero, i.e. unsupported.

Which is an actual case that has occurred for me.


I guess zfs send and similar are better solutions, but what if we could query the filesystem for existing checksums of a file and save IO that way, if filesystems on both sides already stored usable checksums?


> I guess zfs send and similar are better solutions.

It depends. I recently built a new zfs pool server and needed to transfer a few TB of data from the old pool to the new pool, but I built the new pool with a larger record size. If I’d used zfs send the files would have retained their existing record size. So rsync it was.
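
Roughly what that looks like (the pool and dataset names are made up):

    # new dataset created with the larger record size
    zfs create -o recordsize=1M newpool/data

    # rsync rewrites every file, so the data lands in 1M records;
    # zfs send/recv would have carried the old block layout across
    rsync -aHAX /oldpool/data/ /newpool/data/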


Unless you are also doing FS-level deduplication using the same checksums, it generally makes no sense for these to be cryptographic hashes, so they're not necessarily suitable for this purpose.

IIRC neither ZFS nor btrfs use cryptographic hashes for checksumming by default.


> on is a short hand for fletcher4 for non-deduped datasets and sha256 for deduped datasets

* https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Chec...

* https://people.freebsd.org/~asomers/fletcher.pdf

* https://en.wikipedia.org/wiki/Fletcher%27s_checksum

Strangely enough, SHA-512 is actually about 50% faster than SHA-256:

> ZFS actually uses a special version of SHA512 called SHA512t256, it uses a different initial value, and truncates the results to 256 bits (that is all the room there is in the block pointer). The advantage is only that it is faster on 64 bit CPUs.

* https://twitter.com/klarainc/status/1367199461546614788
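
The SHA-512-vs-SHA-256 gap is easy to reproduce: SHA-512 uses 64-bit words and chews through a 128-byte block per compression call versus 64 bytes for SHA-256, so it wins on 64-bit CPUs (though chips with SHA extensions accelerate SHA-256 and can reverse the result). OpenSSL's benchmark shows it directly:

    openssl speed sha256 sha512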


It seems that SHA512t256 is another name for SHA-512/256. It's a shame that the initialization is different from SHA-512, as it would have been very useful to be able to convert a SHA-512 hash into a SHA-512/256 hash.


ZFS send does not do any kind of cross checking with the receiving end, so no, not really ideal. Even incremental ZFS sends don't do this; it keeps state on the sending side only. It's OK for its intended use cases, but it's not a direct rsync replacement.

It's also hard/impossible to restore individual files out of a ZFS send stream without restoring the whole thing so I've reverted to using tarballs of ZFS snapshots for backups instead of ZFS send. Again, it was never really meant for this so it was my mistake trying to use it that way.
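
For what it's worth, the tarball-of-a-snapshot approach can be as simple as archiving the hidden snapshot directory (names here are placeholders, and the .zfs directory may need snapdir=visible set on the dataset):

    zfs snapshot pool/data@backup-2022-07
    tar -czf /backups/data-2022-07.tar.gz \
        -C /pool/data/.zfs/snapshot/backup-2022-07 .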


> https://linux.die.net/man/1/rsync

linux.die.net is horribly outdated. This particular page is from 2009.

Up-to-date docs are here:

https://download.samba.org/pub/rsync/rsync.1


Which version is the samba one? Latest release? Git?

If you want to see the man page of the version in Debian, that would be https://manpages.debian.org/testing/rsync/rsync.1.en.html

Disclaimer: I wrote the software behind manpages.debian.org :)


"This manpage is current for version 3.2.5dev of rsync" – so I guess it's from git.


Yeah, I normally just do the quick check, but once in a while I run a full rescan to make sure nothing got damaged over time. Definitely worth it.



