Rediscovering the Rsync Algorithm (incubaid.com)
185 points by nicolast on Feb 14, 2012 | hide | past | favorite | 43 comments



As cool as the rsync algorithm is, I'd much rather we had the dsync utility outlined in this USENIX '08 paper. http://www.usenix.org/event/usenix08/tech/full_papers/pucha/...

An adaptive protocol that dynamically matches the system's load, whether it's CPU, disk, or network. Does anyone know what happened to this?


Apologies - downvoted by mistake!


There is a Better Way. Instead of using fixed sized blocks, use variable sized blocks. Decide the block boundaries using the data in the blocks themselves. This will reduce your search from O(n^2) to O(n).

Tarsnap does this. My project (ddar) does the same.


I've heard that called 'anchored hashing'. It's recently been used by FreeArc: http://encode.ru/threads/43-FreeArc?p=28237&viewfull=1#p...

How does ddar do it?


ddar calculates a 32-bit Rabin fingerprint over a sliding 48-byte window (a Rabin fingerprint is a type of checksum that can be quickly "slid" using just the previous fingerprint, the new byte coming in, and the old byte going out). The fingerprint is compared against a mask which has n bits set, where 2^n is the desired target block size. If there's a match, then the location of the window is taken to be a block boundary. With random data this leads to a geometric distribution of block sizes with a mean around the target block size, but minimum and maximum block sizes are also enforced for pathological cases, which skews this slightly.

Blocks whose boundaries are determined by this algorithm are hashed (SHA256 currently). The hash is used to key each block in a reference-counted key/value store.

Then each archive member is just a list of block hashes.


If you were to elaborate on and brush up the description, it would make an excellent HN submission. Good stuff anyways.


I wrote a library that does this. I haven't posted it publicly because it's part of a much larger project (not yet complete), but I could put it on GitHub if anybody is interested.

There are implementations in Lua, Java, and C.


This would be really cool.


I'll ping you on twitter when I post it.


The rsync algorithm and program are both great, and I use the program a lot to update directory trees across the network. It's also my default tool for synchronizing two directories on the same system. The rsync program correctly optimizes for this case by skipping the rsync algorithm and completely copying changed files. However, it still uses multiple processes and seemingly still calculates some hashes, making it slower than it needs to be.

Joey found [0] that running rsync once in dry-run mode to find which files have changed, copying each of them with cp, and then running rsync a second time to handle things like deletions and file permissions results in a major speedup.

[0] http://kitenet.net/~joey/blog/entry/local_rsync_accelerator/
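The idea can be sketched in Python; this is a hypothetical reconstruction for illustration, not Joey's actual script (which drives rsync and cp directly). Pass 1 lists files failing a cheap size/mtime check, pass 2 bulk-copies them; a real run would still finish with rsync to handle deletions and permissions.

```python
import os
import shutil

def changed_files(src_dir: str, dst_dir: str):
    """Yield paths relative to src_dir that fail the size/mtime quick check."""
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, src_dir)
            try:
                s = os.stat(src)
                d = os.stat(os.path.join(dst_dir, rel))
                if s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime):
                    continue  # looks unchanged; skip it
            except FileNotFoundError:
                pass  # missing on the destination; copy it
            yield rel

def copy_changed(src_dir: str, dst_dir: str):
    """Plain bulk copy of the changed files (the fast 'cp' pass)."""
    for rel in changed_files(src_dir, dst_dir):
        dst = os.path.join(dst_dir, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(os.path.join(src_dir, rel), dst)  # copy2 preserves mtime
```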


Don’t walk the folder and ‘rsync’ each file you encounter

If I just tell rsync to synchronise two directories, what does it do internally? I might have assumed it does the more naive option, but in practice it seems to do a lot of upfront calculation, which suggests it's doing something more sophisticated.


Rsync will compare timestamps to transfer only changed files.


The default behavior also checks file size in addition to modification time.


Is there a purpose to that?


Modification time can be altered, or incorrect. File size is just another fast check that can be done with the same data returned by stat'ing the file.

Rsync can also compare hashes if you don't trust the size and time on the files.
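A minimal sketch of the two checks in Python (illustrative only; the function names are made up, and real rsync's checks and checksums differ in detail):

```python
import hashlib
import os

def quick_check_equal(src: str, dst: str) -> bool:
    """rsync's default fast test: same size and same modification time."""
    try:
        a, b = os.stat(src), os.stat(dst)
    except FileNotFoundError:
        return False
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)

def checksum_equal(src: str, dst: str) -> bool:
    """Content-based comparison, in the spirit of `rsync --checksum`."""
    def digest(path):
        h = hashlib.sha256()  # rsync itself uses different checksums
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
        return h.digest()
    return digest(src) == digest(dst)
```

A file rewritten in place with the same length and a restored mtime will fool the quick check, which is exactly when the slower checksum comparison earns its keep.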


Side note, but in case it's helpful to someone: if you need to have rsync.exe on Windows, here's one path:

https://github.com/thbar/rsync-windows


Do you know of any other implementations of the rsync algorithm other than the actual rsync program? And where are they used?

Do you know how and where dropbox uses rsync?

There have been some attempts to port the rsync program to other languages/platforms [1], but they are usually not in sync with the current rsync program. I am talking about ports of the program, not new implementations of the algorithm.

[1] https://github.com/MatthewSteeples/rsync.net


There's librsync [0], although it's not compatible with the rsync program. Duplicity [1] uses it.

[0] http://librsync.sourceforge.net [1] http://duplicity.nongnu.org



rsync is great. I use the "-H" and "--link-dest" options to make incremental backups which look like snapshots. Been doing this for the better part of a decade; would be interested to know if there's A Better Solution(tm) out there...


rdiff-backup http://rdiff-backup.nongnu.org/ may be what you are looking for?


rsnapshot, which does all that for you with automated rotation, built on top of rsync.


cool


If someone committed any of that code to a repository I was working on, I'd hang them out to dry. It's 2011 and people are still using one- and two-letter variable names.

An interesting article, but I have neither the time nor the inclination to understand the code, which is the core of it.


Of course, this is a wonderful example of how a literate style works. Despite the small variable names, the code is easy to understand because of the accompanying text. My (admittedly biased) leaning is that it is easier to understand than a similar snippet with "good variable names" but without the accompanying explanation.

That is, if you aren't willing to read this article, then I have my doubts that you would have worked through any free standing code.


Yeah, but you should still have meaningful variable names.

Let's be honest here, very few people practice a good literate style.


Hmm... I definitely do not want to just toss out meaningful variable names. I'm torn, though, as I feel that if we had better practices in place to promote literate styles in general, we'd be in a better place. (That is, I think the return from literate styling is higher than the return from meaningful variable names.)

It really just comes down to adding a narrative to what you're doing. I think the README convention that github has helped push out there is doing more to get people using each other's code than any amount of variable naming.


>I think the README convention that github has helped push out there is doing more to get people using each other's code than any amount of variable naming.

Well… the README file is an ancient convention that probably stretches all the way back to the 70s.

IMHO, no, not at all. Github is helping people cos it lowered the cost of publishing code by 1) making it easy and 2) putting code at the forefront of the project.

In my ideal world, we would all use something like Docco - http://jashkenas.github.com/docco/ - but in my day to day I count myself lucky if I find the barest amount of class documentation, let alone a literate style.

So: people will do what is easy. Everyone should start by having meaningful variable names.


Yes, apologies, I meant to word that in a better way. I knew github did not invent the readme, it just seems they are sparking a revival of it. In that I mean you don't just get a bloody jar file (or whatever) and have to go looking for ways to use it. Take a look at any good shared project out there and there is just a text file outlining how to get going with it.

I think this is possibly my being jaded with Java's documentation conventions. It can be great for index level documentation, but is absolutely terrible at showing use cases and why things were done the way they are.


Has anybody looked through the code of the rsync program itself? Looks pretty intimidating. Would be nice to see a comment from somebody better versed in this style of programming.

http://ftp.samba.org/pub/rsync/


Many people feel that a blend of slight obscurity and succinctness is the core ingredient of an elegant solution. As a math major at university, I find this extremely prevalent in the textbooks of proof courses. I generally find it irritating and, to an extent, patronizing.


> Many people feel that a blend of slight obscurity and succinctness is the core ingredient of an elegant solution.

My exemplar for mathematical elegance is page 3 of Serre's A Course in Arithmetic. In a tidy page of text, it proves what takes many textbooks a long and tedious chapter, introducing important concepts like the Frobenius. It's so short and neat I can reproduce it here in full:

Let K be a field. The image of Z in K is an integral domain, hence isomorphic to Z or to Z/pZ, where p is a prime; its field of fractions is isomorphic to Q or to Z/pZ = F_p. In the first case, one says that K is of characteristic zero; in the second case, that K is of characteristic p.

The characteristic of K is denoted by char(K). If char(K) = p != 0, p is also the smallest integer n > 0 such that n·1 = 0.

Lemma. If char(K) = p, the map sigma : x -> x^p is an isomorphism of K onto one of its subfields K^p.

We have sigma(xy) = sigma(x) sigma(y). Moreover, the binomial coefficient (p choose k) is congruent to 0 (mod p) if 0 < k < p. From this it follows that sigma(x + y) = sigma(x) + sigma(y); hence sigma is a homomorphism. Furthermore, sigma is clearly injective.

Theorem 1. (i) The characteristic of a finite field K is a prime number p != 0; if f = [K:F_p], the number of elements of K is q = p^f. (ii) Let p be a prime number and let q = p^f (f >= 1) be a power of p. Let Omega be an algebraically closed field of characteristic p. There exists a unique subfield F_q of Omega which has q elements. It is the set of roots of the polynomial X^q - X. (iii) All finite fields with q = p^f elements are isomorphic to F_q.

If K is finite, it does not contain the field Q. Hence its characteristic is a prime number p. If f is the degree of the extension K/F_p, it is clear that Card(K) = p^f, and (i) follows.

On the other hand, if Omega is algebraically closed of characteristic p, the above lemma shows that the map x -> x^q (where q = p^f, f >= 1) is an automorphism of Omega; indeed, this map is the f-th iterate of the automorphism sigma : x -> x^p (note that sigma is surjective since Omega is algebraically closed). Therefore, the elements in Omega invariant under x -> x^q form a subfield F_q of Omega. The derivative of the polynomial X^q - X is q X^(q-1) - 1 = p p^(f-1) X^(q-1) - 1 = -1, and is not zero. This implies (since Omega is algebraically closed) that X^q - X has q distinct roots, hence Card(F_q) = q. Conversely, if K is a subfield of Omega with q elements, the multiplicative group K* of nonzero elements in K has q-1 elements. Then x^(q-1) = 1 if x in K* and x^q = x if x in K. This proves that K is contained in F_q. Since Card(K) = Card(F_q) we have K = F_q which completes the proof of (ii).

Assertion (iii) follows from (ii) and from the fact that all fields with p^f elements can be embedded in Omega since Omega is algebraically closed.

> proof courses

I can't think of a single course I took, starting with the first semester, that wasn't a "proof course".


> I can't think of a single course I took, starting with the first semester, that wasn't a "proof course".

ODE, Calc 1-3 and their labs, multivariate calculus, matrix methods. Congrats to you for being able to skip 8 courses; not all of us are that talented.


When I studied the equivalents of calc 1 and 2 (from a textbook), the textbook proved everything, although somewhat informally. I mean, with epsilons and deltas and everything, but with a lot more prose than you see in a math paper on arXiv. The different textbook I later used for calc 3, which I actually did take a class in, also proved everything. As did the professor, in class, on the blackboard. I basically never read the book or did any of the homework for that class; I just rederived things from first principles during the exams, based on my memories of the lectures. I always finished the exams last, but I got an A in the class. This was all in the US. My father's calculus textbook, from which I'd learned calculus to start with, was also from his calculus courses in the US.

I take it my experience was atypical?


No, in many US universities rigorous calculus proofs are saved for a class called something like Real Analysis, which is typically taken as the 4th or 5th class for a math major.

Compare e.g.

Transcendental functions, techniques and applications of integration, indeterminate forms, improper integrals, infinite series.

with

Algebraic and topological structure of the real number system; rigorous development of one-variable calculus including continuous, differentiable, and Riemann integrable functions and the Fundamental Theorem of Calculus; uniform convergence of a sequence of functions; contributions of Newton, Leibniz, Cauchy, Riemann, and Weierstrass.


Professors at my university do prove things in the classes I mentioned above, but I was attempting to draw a distinction between courses where the emphasis is the proof (Abstract/Contemporary Algebra) and courses where the emphasis is the process (matrix methods). Sorry for any confusion.


That must be a US thing. It seems strange to first do a course without proofs and then later redo it with proofs. My point was that where I studied, the courses for mathematicians were all "proof courses"; there were separate courses for engineers and other scientists who are happy to accept mathematics on faith.


I'm not arguing about which education system is better, but having a degree of familiarity with the subject matter you plan on proving things in helps out a lot. Proving things about continuity/discontinuity would be a bit difficult if you'd never had to identify continuous functions in calculus.


Well, you're never going to deeply understand everything in a subject in only one or even several passes over the material, but there's no reason you can't concurrently treat analysis both formally and informally, with lots of concrete examples and intuition to go along with abstract definitions and proofs.


I agree. Although I enjoyed and got some value from the article once I puzzled through the naming, it would not have killed the author to use more descriptive names, and it would have made the code much more beautiful, at least to me. As it is, it seems needlessly opaque. Clean code communicates intent, and this code made that intent more difficult to puzzle out than it needed to be.

I'm not saying I never use one- and two-letter variable names; they're useful where their meaning is unambiguous, as with small inner loops. But when your goal is to educate others, it's better to err on the side of verbosity. That, to me, is more literate than the alternative.


My philosophy is that the length of a variable or function name should be inversely proportional to the size of the scope it affects. This means that if a name is only local to the current procedure then it hardly matters what you call it.

I have no problem with his use of a procedure with the signature "let rec loop acc s .... in" when it is fundamentally a placeholder recursive control structure. Such a signature is as much a common boilerplate pattern in OCaml as "for (i = 0; i < length; i++) { ... }" in a C-like language.

That said, his top level procedures are named things like "apply" and "updates" which are possibly too generic, but OCaml has a nice module system so the names need only convey meaning unambiguous with relation to each other; there is no global namespace to pollute.

This is a common problem of clean functional languages: their declarations can be unambiguously correct with less specificity. Function declarations need not specify the subject and object of the sentences they represent, since these can be inferred; they must only specify the verb. But this means skimming becomes less fruitful: one must read everything to understand anything.


I wonder what the recommended naming of loop variables is in 2011. Any guidelines?


Robert C Martin (Uncle Bob)'s Clean Code is a great read and discusses good naming in exhaustive and occasionally alarming depth.

http://www.amazon.com/Clean-Code-Handbook-Software-Craftsman...



