
If your PDF files have the same MD5 checksum but "diff" shows differences then this is an MD5 collision.

Maybe it's something trivial, e.g. your 3 files got resynchronized right between running rsync and running diff. So you should have retried rsync after the diff.

Or you obtained these PDFs from a source that purports to demonstrate MD5 collisions. Or someone is attacking you by replacing your files. Or, more likely, this is user error, and you are not reporting to us what's happening exactly.

You can always do a diff on a hex dump of the PDF content and see with your own eyes what part of the PDF is actually different. It's not that hard to interpret the format and know which PDF structure changed. You can run "qpdf --qfd input.pdf" on both versions and this will uncompress all structures to make the internal content human readable (except for images).
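For binary files like PDFs, a byte-level comparison can be easier to interpret than a textual diff. Here is a minimal Python sketch of that idea; it is not from the thread, and the helper name `differing_ranges` and its `max_ranges` parameter are made up for illustration. It lists the byte ranges where two files differ, which tells you where to look in a hex dump. It reads both files fully into memory, which should be acceptable for textbook-sized PDFs.

```python
from pathlib import Path


def differing_ranges(path_a, path_b, max_ranges=10):
    """Return up to max_ranges (start, end) byte ranges where the files differ.

    end is exclusive. If the files have different lengths, the extra tail of
    the longer file is reported as a final range.
    """
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    common = min(len(a), len(b))
    ranges = []
    start = None  # start offset of the differing run we are currently inside

    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i
        elif start is not None:
            # The differing run just ended at offset i.
            ranges.append((start, i))
            start = None
            if len(ranges) >= max_ranges:
                return ranges

    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b) and len(ranges) < max_ranges:
        ranges.append((common, max(len(a), len(b))))
    return ranges
```

An empty result means the two files are byte-for-byte identical. A handful of small ranges suggests localized changes (metadata, a single object), while differences spread across the whole file suggest the file was rewritten or recompressed.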




I tried, and the difference between the two turns out to be all gibberish.

>Or, more likely, this is user error, and you are not reporting to us what's happening exactly.

Here's the sequence:

1. rsync -av src/ dest/

2. diff -r src/ dest/

[Shows 3 pairs of differences]

3. Compute and compare the MD5 of each pair:

   md5 file1a & file1b; compare
   md5 file2a & file2b; compare
   md5 file3a & file3b; compare

[All three pairs match MD5]

4. Run rsync again - still no difference

5. Compare the differences with diff - shows gibberish. All three copies open in PDF viewers. The qpdf option doesn't make sense because all 3 happen to be advanced math textbooks, and the plaintext is impossible to read.
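The check in steps 2 and 3 can be done in one pass to rule out user error. Below is a minimal Python sketch, not the poster's actual procedure; the function names `file_md5`, `same_bytes` and `classify_pair` are hypothetical. It compares each pair both by MD5 digest and byte for byte, so that "same MD5, different bytes" is reported explicitly instead of being inferred from two separate tools.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_bytes(path_a, path_b, chunk_size=1 << 20):
    """Return True if the two regular files are byte-for-byte identical."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached EOF at the same point
                return True


def classify_pair(path_a, path_b):
    """Describe how two files relate: identical, different, or colliding."""
    if same_bytes(path_a, path_b):
        return "identical"
    if file_md5(path_a) == file_md5(path_b):
        return "MD5 collision: same digest, different bytes"
    return "different"
```

If `classify_pair` returns "different" for a pair whose MD5s appeared to match earlier, the earlier comparison was most likely run on the wrong files or compared by eye incorrectly.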

In fact I just checked now, and the same error pattern persists. It's not something I am terribly concerned about, since I have 3 copies of my data - but this is a pattern that showed up for the first time. I do this exercise regularly (once per month).


Would you mind emailing me file1a and file1b assuming you are willing to share them? Send to: m (at) zorinaq.con

It's possible that your synchronization process (involving Google Cloud?) somehow changed the modification time of your files so that both copies in src/ and dest/ have the same timestamp. In that case rsync will not notice they are different (if they also have the same size). Like the other reply said, you have to use rsync --checksum or -c to force rsync to compare the content of files.
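To illustrate why matching size and timestamp can hide a content difference, here is a simplified Python model of the two comparison modes. It is only a sketch under stated assumptions, not rsync's actual implementation: the function names are hypothetical, the timestamp comparison is reduced to whole seconds, and SHA-256 stands in for whatever checksum rsync uses internally.

```python
import hashlib
import os


def quick_check_same(path_a, path_b):
    """Simplified quick check: equal size and equal mtime (whole seconds)."""
    stat_a = os.stat(path_a)
    stat_b = os.stat(path_b)
    return (stat_a.st_size == stat_b.st_size
            and int(stat_a.st_mtime) == int(stat_b.st_mtime))


def content_digest(path, chunk_size=1 << 20):
    """Hash the file content; SHA-256 is an arbitrary stand-in here."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def would_transfer(path_a, path_b, checksum=False):
    """Decide whether a sync would copy the file under this simplified model.

    checksum=False mimics the default size+mtime check,
    checksum=True mimics the effect of comparing file content.
    """
    if checksum:
        return content_digest(path_a) != content_digest(path_b)
    return not quick_check_same(path_a, path_b)
```

With two files of equal size and identical modification time but different content, `would_transfer(a, b)` returns False while `would_transfer(a, b, checksum=True)` returns True, which matches the symptom of rsync reporting nothing to do while diff still shows differences.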


As noted in a sibling comment, you should check out the --checksum / -c option. Specifically, replace step 1 with:

1. rsync -acv src/ dest/

It will be much slower - using lots of CPU and disk access - but more thorough.


Typo: it's "--qdf".



