How does rsync work? (stapelberg.ch)
319 points by secure on July 2, 2022 | 51 comments



This is also available as a video, "Why I wrote my own rsync":

* https://media.ccc.de/v/gpn20-41-why-i-wrote-my-own-rsync

* https://www.youtube.com/watch?v=wpwObdgemoE


This was a great write up. I've already sent it to a few people.

On the question of what happens if a file's contents change after the initial checksum, the man page for rsync[0] has an interesting explanation of the *--checksum* option:

> This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync uses a "quick check" that (by default) checks if each file's size and time of last modification match between the sender and receiver. This option changes this to compare a 128-bit checksum for each file that has a matching size. Generating the checksums means that both sides will expend a lot of disk I/O reading all the data in the files in the transfer (and this is prior to any reading that will be done to transfer changed files), so this can slow things down significantly.

> The sending side generates its checksums while it is doing the file-system scan that builds the list of the available files. The receiver generates its checksums when it is scanning for changed files, and will checksum any file that has the same size as the corresponding sender's file: files with either a changed size or a changed checksum are selected for transfer.

> Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred, but that automatic after-the-transfer verification has nothing to do with this option's before-the-transfer "Does this file need to be updated?" check. For protocol 30 and beyond (first supported in 3.0.0), the checksum used is MD5. For older protocols, the checksum used is MD4.

[0]: https://linux.die.net/man/1/rsync
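
For reference, the behavioral difference boils down to one flag; a minimal illustration (hosts and paths are placeholders):

  # default "quick check": compare only size and mtime per file
  rsync -av /data/ backup:/data/

  # force a full-content comparison before deciding what to transfer
  rsync -avc /data/ backup:/data/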


> For protocol 30 and beyond (first supported in 3.0.0), the checksum used is MD5. For older protocols, the checksum used is MD4.

Newer versions (≥3.2?) support xxHash and xxHash3:

* https://github.com/WayneD/rsync/blob/master/checksum.c

* https://github.com/Cyan4973/xxHash

* https://news.ycombinator.com/item?id=19402602 (2019 XXH3 discussion)


As mentioned in a sibling comment, linux.die.net's manpage is outdated here (covering rsync 3.0.6). Current versions of rsync (>= 3.2.0) autonegotiate checksum type between several different options (including a few variants of xxhash, md5, and md4) unless overridden by the user, or one side is pre-3.2.0 (in which case you get the old behavior with MD5 or MD4 depending on protocol version). The pre-transfer and transfer checksums also don't necessarily have to be the same (I don't particularly care to hunt down the default priority lists right now, so I'm not sure if they are the same or not by default).

Manpage for rsync 3.2.4, from Debian testing: https://manpages.debian.org/testing/rsync/rsync.1.en.html
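
If you'd rather pin the algorithm than rely on negotiation, 3.2.x also has --checksum-choice (aka --cc); I believe xxh128 and md5 are among the accepted names, depending on how rsync was built, and rsync --version on a 3.2.x build lists the compiled-in checksums:

  # override the negotiated checksum (hosts/paths are placeholders)
  rsync -avc --checksum-choice=xxh128 /data/ backup:/data/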


I was surprised to learn recently that sha1 is typically about 10% faster than md5. I have some coworkers who I need to remind of this, since they’re still using md5 for caching.


And Blake3 is almost 7 times faster than SHA-1...
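
Relative speed depends a lot on the CPU (SHA extensions, vectorization) and the library build, so it's worth measuring on your own hardware. A quick sketch with Python's hashlib (BLAKE3 needs a third-party package, so it's left out here):

  import hashlib, time

  payload = b"\x00" * (256 * 1024 * 1024)  # 256 MiB of zeros

  for name in ("md5", "sha1", "sha256", "blake2b"):
      h = hashlib.new(name)
      start = time.perf_counter()
      h.update(payload)
      h.digest()
      elapsed = time.perf_counter() - start
      print(f"{name:8s} {len(payload) / elapsed / 1e6:8.1f} MB/s")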


I thought xxHash was only used for the chunk hash, not the whole file hash?


Failure cases of the 'quick check':

* Underlying disk device corruption - but modern disks do internal error checking, and should emit an IO error.

* Corruption in RAM/software bug in the kernel IO subsystem. Should be detected by filesystem checksumming.

* User has accidentally modified the file and set the mtime back. Checksumming fixes this case.

* User has maliciously modified file and set mtime back. Since it's MD5 (broken), the malicious user can make the checksum match too. checksumming doesn't help.

Given that, I see no users who really benefit from checksumming. It isn't sufficient for anyone with really high data integrity requirements, while also being overkill for typical use cases.
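
For context, the "quick check" being discussed is essentially this comparison; a simplified sketch, not rsync's actual code (real rsync also has knobs like --size-only and --modify-window):

  import os

  def quick_check_unchanged(src_path, dst_path):
      """Rough approximation of rsync's default 'quick check'."""
      try:
          src, dst = os.stat(src_path), os.stat(dst_path)
      except FileNotFoundError:
          return False  # missing on either side -> needs a transfer
      # identical size and mtime -> assume identical content
      return (src.st_size == dst.st_size
              and int(src.st_mtime) == int(dst.st_mtime))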


> * User has maliciously modified file and set mtime back. Since it's MD5 (broken), the malicious user can make the checksum match too. checksumming doesn't help.

No, md5 is not broken like that (yet). There is no known second-preimage attack against md5; the practical collision vulnerabilities only affect cases where an attacker controls the file content both before and after the update.


The very existence of filesystem checksumming is because your first point isn't always true.

Also, filesystem checksumming does not guard against RAM/kernel bugs. On top of that, filesystem checksumming is very rare.


It has its uses: I discovered corruption in files copied between external hard drives over a USB hub, using --checksum. It was a bad hub; it hasn't happened again since I changed it (finding it once got me paranoid enough to check every once in a while).

For anyone curious about how to find such problems without changing the files, I used "--checksum --dry-run --itemize-changes"
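
Spelled out as a full command, that's something like (paths are placeholders; -n is --dry-run):

  # list files whose content differs, without modifying anything
  rsync -avn --checksum --itemize-changes /mnt/driveA/ /mnt/driveB/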


Or they're using a filesystem where mtime is always zero, i.e. unsupported.

Which is an actual case that has occurred for me.


I guess zfs send and similar are better solutions, but what if we could query the filesystem for existing checksums of a file and save IO that way, if filesystems on both sides already stored usable checksums?


> I guess zfs send and similar are better solutions.

It depends. I recently built a new zfs pool server and needed to transfer a few TB of data from the old pool to the new pool, but I built the new pool with a larger record size. If I’d used zfs send the files would have retained their existing record size. So rsync it was.
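
A sketch of that workflow, with made-up pool/dataset names:

  # new dataset gets the larger record size
  zfs create -o recordsize=1M newpool/media

  # rewriting the files via rsync makes them pick up the new record size
  rsync -avhP /oldpool/media/ /newpool/media/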


Unless you are also doing FS-level deduplication using the same checksums, it generally makes no sense for these to be cryptographic hashes, so they're not necessarily suitable for this purpose.

IIRC neither ZFS nor btrfs use cryptographic hashes for checksumming by default.


> on is a short hand for fletcher4 for non-deduped datasets and sha256 for deduped datasets

* https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Chec...

* https://people.freebsd.org/~asomers/fletcher.pdf

* https://en.wikipedia.org/wiki/Fletcher%27s_checksum

Strangely enough, SHA-512 is actually about 50% faster than SHA-256:

> ZFS actually uses a special version of SHA512 called SHA512t256, it uses a different initial value, and truncates the results to 256 bits (that is all the room there is in the block pointer). The advantage is only that it is faster on 64 bit CPUs.

* https://twitter.com/klarainc/status/1367199461546614788
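
The algorithm is a per-dataset property, e.g. (dataset name is hypothetical; the zfsprops manpage lists the accepted values, which include fletcher4, sha256 and sha512, plus skein/edonr/blake3 depending on the build):

  # inspect and change the checksum algorithm for one dataset
  zfs get checksum tank/data
  zfs set checksum=sha512 tank/data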


It seems that SHA512t256 is another name for SHA-512/256. It's a shame that the initialization is different from SHA-512, as it would have been very useful to be able to convert a SHA-512 hash into a SHA-512/256 hash.


ZFS send does not do any kind of cross checking with the receiving end so no, not really ideal. Even incremental ZFS sends don't do this, it keeps state on the sending side only. It's ok for its intended usecases but it's not a direct rsync replacement.

It's also hard/impossible to restore individual files out of a ZFS send stream without restoring the whole thing so I've reverted to using tarballs of ZFS snapshots for backups instead of ZFS send. Again, it was never really meant for this so it was my mistake trying to use it that way.


> https://linux.die.net/man/1/rsync

linux.die.net is horribly outdated. This particular page is from 2009.

Up-to-date docs are here:

https://download.samba.org/pub/rsync/rsync.1


Which version is the samba one? Latest release? Git?

If you want to see the man page of the version in Debian, that would be https://manpages.debian.org/testing/rsync/rsync.1.en.html

Disclaimer: I wrote the software behind manpages.debian.org :)


"This manpage is current for version 3.2.5dev of rsync" – so I guess it's from git.


Yeah I always do the quick check normally but once in a while I run a full rescan to make sure nothing got damaged over time. Definitely worth it.


Nice write up. rsync is great as an application but I found it more cumbersome to use when wanting to integrate it into my own application. There's librsync but the documentation is threadbare and it requires an rsync server to run. I found bita/bitar (https://github.com/oll3/bita) which is inspired by rsync & family. It works more like zsync which leverages HTTP Range requests so it doesn't require anything running on the server to get chunks. Works like a treat using s3/b2 storage to serve files and get incremental differential updates on the client side!
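
The zsync-style trick is plain HTTP range requests, which any static file host (S3, B2, nginx, ...) already supports; roughly:

  # fetch only bytes 0-1023 of a published archive (URL is a placeholder)
  curl -s -H "Range: bytes=0-1023" -o chunk.bin https://example.com/archive.img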


Always happy to see my pet project mentioned (bita) and that it is actually being used by others than me :)


When trying to understand rsync and the rolling checksum I stumbled upon a small python implementation in some self-hosted corner of the web[0], which I have archived on GH[1] (not the author, but things can vanish quickly, as proved by the bzr repo which went poof[2]).

[0]: https://blog.liw.fi/posts/rsync-in-python/

[1]: https://github.com/lloeki/rsync/blob/master/rsync.py

[2]: https://code.liw.fi/obsync/bzr/trunk/
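
For anyone who just wants the gist: the weak rolling checksum is Adler-32-like, and the key property is that sliding the window one byte forward is O(1). A toy sketch (not the linked implementation, and it glosses over rsync's exact constants and offsets):

  M = 1 << 16

  def weak_checksum(block):
      """Weak checksum of a whole block (simplified rsync-style)."""
      a = sum(block) % M
      b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
      return a, b

  def roll(a, b, out_byte, in_byte, block_len):
      """Slide the window one byte: drop out_byte, append in_byte."""
      a = (a - out_byte + in_byte) % M
      b = (b - block_len * out_byte + a) % M
      return a, b

  data = b"the quick brown fox jumps over the lazy dog"
  a, b = weak_checksum(data[0:16])
  a, b = roll(a, b, data[0], data[16], 16)
  assert (a, b) == weak_checksum(data[1:17])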




Ah I see now, I didn't traverse deep enough when checking.


I don't think you can clone that.


True, but I imagined fetching every file with wget or equivalent and then just using it locally. Doesn't work anyway because the data is not all there.


See also the 1996 original paper by Tridgell (also of Samba fame) and Mackerras:

* https://rsync.samba.org/tech_report/

* https://www.andrew.cmu.edu/course/15-749/READINGS/required/c...


Another great resource from Tridgell himself is this Ottawa Linux Symposium talk: http://olstrans.sourceforge.net/release/OLS2000-rsync/OLS200...


I encountered a strange situation 2 days ago. I rsync my pdf files periodically between my hard drives. rsync showed no differences between the two folder trees, but if I did `diff -r` between the two, 3 pdfs came out different.

I checked the three individually but they showed no corruption or changes either side. How can this happen?

Edit: the hard drive copy is previously rsynced from this copy & both copies are mirrored with google cloud bucket.

The 3 files which showed up as different have the same MD5 checksum.


rsync uses a heuristic based on file times and sizes to compare files. To compare file content, use the --checksum option (computationally expensive to run).


Yes, but it doesn't answer why rsync and the checksums report the files as the same, while diff reports them as different.


It isn't clear from the conversation so far: do you use "--checksum"?


Rsync can checksum a lot of megabytes per second. In general I'd say the disk IO is much more expensive than the computation.


If your PDF files have the same MD5 checksum but "diff" shows differences then this is an MD5 collision.

Maybe it's a trivial thing, eg. your 3 files got resynchronized right between running rsync and running diff. So you should have retried rsync after the diff.

Or you obtained these PDFs from a source that purports to demonstrate MD5 collisions. Or someone is attacking you by replacing your files. Or, more likely, this is user error, and you are not reporting to us what's happening exactly.

You can always do a diff on a hex dump of the PDF content and see with your own eyes what part of the PDF is actually different. It's not that hard to interpret the format and know which PDF structure changed. You can run "qpdf --qfd input.pdf" on both versions and this uncompresses all structures to make the internal content human readable (besides images).
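
Something along these lines (output names are placeholders; note the flag is spelled --qdf):

  # normalize both PDFs so their object structure is readable text
  qpdf --qdf --object-streams=disable a.pdf a.qdf.pdf
  qpdf --qdf --object-streams=disable b.pdf b.qdf.pdf

  # byte-level differences of the originals, for a first look
  cmp -l a.pdf b.pdf | head

  # structural diff of the normalized versions
  diff a.qdf.pdf b.qdf.pdf | less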


I tried, and the difference between the two turns out to be all gibberish.

>Or, more likely, this is user error, and you are not reporting to us what's happening exactly.

Here's the sequence:

1. rsync -av src/ dest/

2. diff -r src/ dest/

[Shows 3 pairs of differences]

3.

   md5 file1a & file1b; compare
   md5 file2a & file2b; compare
   md5 file3a & file3b; compare
[All three pairs match MD5]

4. run rsync - no difference still

5. Compare the differences with diff - shows gibberish. All three copies open fine in pdf viewers. The qpdf option doesn't make sense because all 3 happen to be advanced math textbooks, and the plaintext is impossible to read.

In fact I just checked now, and the same error pattern persists. It's not something I am terribly concerned about, since I have 3 copies of my data - but this is a pattern which showed up for the first time. I do this exercise regularly (once per month).


Would you mind emailing me file1a and file1b assuming you are willing to share them? Send to: m (at) zorinaq.con

It's possible to have a scenario where your synchronization process (involving Google Cloud?) somehow changed the modification time of your files so that both copies in src/ and dst/ have the same timestamp, in that case rsync will not notice they are different (if they also have the same size). Like the other reply said you have to use rsync --checksum or -c to force rsync to compare the content of files.


As noted in a sibling comment, you should check out the --checksum / -c option. Specifically, replace step 1 with:

   rsync -acv src/ dest/

It will be much slower - using lots of cpu and disk access - but more thorough.


Typo: it's "--qdf".


The same happened to me syncing to an SD card. The reason may be different timestamp resolutions, which make identical files look different. I synced from HFS+ to FAT32 back then.
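
FAT's coarse (2-second) mtime granularity is what rsync's --modify-window option is for; something like this, with placeholder paths, treats timestamps within the given tolerance as equal:

  # tolerate coarse FAT timestamps instead of re-copying everything
  rsync -av --modify-window=1 ~/photos/ /media/sdcard/photos/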


I don't see why any of this is needed. Just install Dropbox, and...


I'm gonna interpret this in the best faith possible, assume you're referencing the infamous Dropbox HN comment, upvote you to counteract the downvotes from people who missed the joke, and link to the aforementioned comment: https://news.ycombinator.com/item?id=9224


The irony is that BrandonM's classic HN comment makes more sense these days, as Dropbox continues its evolution towards Microsofthood. Increasingly, using Dropbox means putting up with an unceasing deluge of product promotions while you're trying to get your work done. No such annoyances with rsync and ftp and the like.

Just last week, Dropbox unilaterally decided I didn't want local copies of the shared files on my laptop, which made for some awkwardness inside a secured facility with no Internet access.


Let them use rsync for you?


Rsync's worst issue is someone port scanning and brute-forcing their way into your system. Close the port.


Most people use rsync over ssh
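
i.e. no rsync daemon and no extra open port, just ssh (host and paths are placeholders; modern rsync already defaults to ssh for host:path targets, -e just makes it explicit):

  # rsync tunneled over ssh; only the ssh port needs to be reachable
  rsync -av -e ssh /data/ user@backuphost:/data/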


Don't most consumer routers have all ports blocked? Who is connecting a computer directly to the modem these days?


Most of them do NAT, which isn't semantically the same as blocking all ports, but functionally it is.



