- Unison is written in OCaml, which is (probably) a perfectly fine language but not commonly used
- Unison synchronizes files to files, but for a Dropbox-like system you really want deduplication for space savings (i.e. server-side storage is a bunch of pointers to content-addressed blocks.)
- in general, it's not clear to me that the client is really the hard part of Dropbox. Note e.g. the part where Dropbox now runs its own data centers for cost reasons.
- there's a million fiddly things to get right, and Unison hasn't had that much usage
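To make the deduplication bullet above concrete, here is a minimal sketch of "a bunch of pointers to content-addressed blocks". This is not how Dropbox or Unison actually store data. The `BlockStore` class, its method names, and the tiny `BLOCK_SIZE` are all made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny on purpose so the demo is easy to follow


class BlockStore:
    """Toy server-side store: files are lists of pointers to shared blocks."""

    def __init__(self):
        self.blocks = {}  # sha256 hex digest -> block bytes (stored once)
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Identical blocks hash to the same key, so they are kept only once.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        # Reassemble the file by following its block pointers in order.
        return b"".join(self.blocks[d] for d in self.files[name])


store = BlockStore()
store.put("a.txt", b"aaaabbbbaaaa")
store.put("b.txt", b"bbbbcccc")
print(len(store.blocks), "unique blocks stored for 5 logical blocks")
```

A plain file-to-file sync has no equivalent of the `blocks` table, which is the space saving the bullet is talking about.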
> there's a million fiddly things to get right, and Unison hasn't had that much usage
Unison is the only bidirectional sync tool that I trust to get the details right. It is backed by a formal model with various proofs of correctness. Such models are also easily represented in OCaml, which I can assure you is more than a fine language, especially if one cares about correctness. Dropbox has struggled to get these details correct in the past (see the paper "Mysteries of DropBox", co-authored by Unison's author Professor Benjamin Pierce). TL;DR: Pierce teaches these Python hackers how to fix their broken code.
>Unison is the only bidirectional sync tool that I trust to get the details right. It is backed by a formal model with various proofs of correctness.
That's just about the sync process/stages, the easy part that can actually be formalized.
The "million fiddly things" are about OS and filesystem issues, incompatibilities, and so on, and Dropbox has a hugely larger test base for those things...
I disagree. POSIX, although somewhat dated, has provided a good enough abstraction layer for filesystems and OSs. My proof of this is the number of different and successful filesystems for Unix/Linux. If the abstraction didn't work, everyone would be forced to use the same filesystem.
The issues and subtleties are with bidirectional sync. It is not "the easy part". Dropbox didn't get it right in the past, and we have no proof they have it right now.
>POSIX, although somewhat dated, has provided a good enough abstraction layer for filesystems and OSs.
First of all, POSIX semantics are not what Windows supports. Second, even where available, POSIX covers only a tiny part of the possible issues. It is adequate for naive apps that need to open or write some files, not for a reliable sync tool.
The issues discussed in that link, such as concurrency, lack of atomicity and failure, are all factors that make bidirectional sync a hard problem. Yes, POSIX provides few guarantees, so that burden rests on the application code above it. The formal model must consider the race conditions and failure modes; it is not "the easy part". I do acknowledge there is a modelling gap to overcome and some problems with POSIX, but I disagree that it's the hardest part of the problem.
Protecting against corruption with checksums is really the job of the filesystem. ZFS will do this and hopefully more will follow. I suppose a file sync tool could detect a change of contents with no change of mtime, but that would be expensive.
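For what it's worth, the "change of contents with no change of mtime" check could look roughly like the sketch below. The function names and the three labels are my own invention, not anything Unison does. The expensive part is `snapshot`, which has to read every byte of every file on each scan.

```python
import hashlib
import os


def snapshot(path):
    """Return (mtime in ns, sha256 of contents). Reads the whole file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return os.stat(path).st_mtime_ns, h.hexdigest()


def classify(old, new):
    """Compare two snapshots of the same path taken at different times."""
    old_mtime, old_digest = old
    new_mtime, new_digest = new
    if old_digest == new_digest:
        return "unchanged"  # same bytes, even if the file was touched
    if old_mtime == new_mtime:
        return "suspect"    # bytes differ but mtime did not move: possible corruption
    return "modified"       # bytes and mtime both changed: an ordinary edit
```

A tool would store the old snapshot in its own database and only propagate files classified as `modified`, flagging `suspect` ones for a human.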
> - Unison synchronizes files to files, but for a Dropbox-like system you really want deduplication for space savings (i.e. server-side storage is a bunch of pointers to content-addressed blocks.)
> - in general, it's not clear to me that the client is really the hard part of Dropbox. Note e.g. the part where Dropbox now runs its own data centers for cost reasons.
The whole point of Unison is that you do not need a server. And certainly not a server run by a for-profit corporation in the United States of Surveillance.
Erm, with pairwise syncing you still want a "server". I've used unison as my primary file syncing tool for a decade, and trying to maintain a spanning tree among partially-available nodes/disks is a pain in the ass. And creating loops means you can't rely on file deletions.
I agree it doesn't have to be a corporate-owned, or even corporate-snoopable "cloud" server, and if this is your goal then unison will work well. Also you can trivially solve the dedup using zfs, but I don't think dedup is a killer feature for personal storage.
(Better would be a "manual dedup" utility to catch those instances where you copied off the same thing multiple times over the years. I'd appreciate any recommendations here, although writing one seems trivial when I get around to it - calculate recursive sha256sums for every node in the filesystem tree, and look for the biggest matching ones)
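A rough sketch of that "recursive sha256sums" idea, in case it saves someone an afternoon. Assumptions baked in: only regular files and directories are handled (symlinks are skipped, and special files such as FIFOs are not guarded against), a directory's hash is derived from its sorted child names and child hashes, and zero-byte matches are ignored as noise. `index_tree` and `find_duplicates` are names I made up.

```python
import hashlib
import os
from collections import defaultdict


def index_tree(path, index):
    """Return (digest, size) for path and record it under index[digest]."""
    size = 0
    if os.path.isdir(path):
        h = hashlib.sha256(b"dir\0")
        for name in sorted(os.listdir(path)):
            child = os.path.join(path, name)
            if os.path.islink(child):
                continue  # assumption: symlinks are ignored entirely
            digest, child_size = index_tree(child, index)
            # A directory's identity is its child names plus their digests.
            h.update(os.fsencode(name) + b"\0" + digest.encode() + b"\n")
            size += child_size
    else:
        h = hashlib.sha256(b"file\0")
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
                size += len(chunk)
    digest = h.hexdigest()
    index[digest].append((size, path))
    return digest, size


def find_duplicates(root):
    """Return [(size, [paths...])] for matching nodes, biggest first."""
    index = defaultdict(list)
    index_tree(root, index)
    groups = [
        (entries[0][0], sorted(path for _, path in entries))
        for entries in index.values()
        if len(entries) > 1 and entries[0][0] > 0
    ]
    return sorted(groups, key=lambda g: g[0], reverse=True)
```

Note that when two directories match, everything inside them matches too, so the output repeats itself on the way down. A real tool would want to suppress the nested matches.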
> Erm, with pairwise syncing you still want a "server". I've used unison as my primary file syncing tool for a decade, and trying to maintain a spanning tree among partially-available nodes/disks is a pain in the ass. And creating loops means you can't rely on file deletions.
You do not need a spanning tree or loops, you just need to do a topological sort on the nodes/disks and then consistently sync them in that order. There is no magic algorithm that will handle conflict resolution for arbitrary syncing.
The problem I'm referring to isn't conflict resolution, but rather spurious re-creation of files that have been deleted. Unison's pairwise syncing cannot distinguish a file that has been deleted from one that has yet to be created (whereas say vector clocks do).
Create file F on A. Sync A-B, A-C. Delete F on A. Sync A-B. ?????. Sync B-C. Sync A-B. File F now re-exists on A (and B and C).
??? is some event where you cannot sync A-C to directly save A's changes on C, yet you still want to save any changes from B on C. Say a remote machine crashes and becomes unavailable, you didn't have the time over 3G, or some other type of ad-hoc craziness which is why you're using distributed syncing in the first place.
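Here is a toy model of that sequence, for anyone who wants to step through it. It is a heavy simplification and not Unison's real implementation: replicas are just sets of file names, and each pair of replicas remembers the set of names it agreed on at its last sync (standing in for Unison's per-pair archive). All names in the code are mine.

```python
class Replica:
    """A replica is modeled as a set of file names; contents are ignored."""

    def __init__(self, name):
        self.name = name
        self.files = set()


def sync(x, y, archives):
    """Pairwise sync using only what this pair saw at its own last sync."""
    key = frozenset((x.name, y.name))
    last = archives.get(key, set())  # empty if this pair has never synced
    for name in sorted(x.files | y.files):
        if name in x.files and name in y.files:
            continue
        present, absent = (x, y) if name in x.files else (y, x)
        if name in last:
            # Both had it last time, one lacks it now: treat as a deletion.
            present.files.discard(name)
        else:
            # This pair never saw it: indistinguishable from a new file.
            absent.files.add(name)
    archives[key] = set(x.files)


def replay_scenario():
    a, b, c = Replica("A"), Replica("B"), Replica("C")
    archives = {}
    a.files.add("F")          # Create file F on A.
    sync(a, b, archives)      # Sync A-B
    sync(a, c, archives)      # Sync A-C
    a.files.discard("F")      # Delete F on A.
    sync(a, b, archives)      # Sync A-B: the deletion reaches B.
    deleted_on_b = "F" not in b.files
    # "?????": A and C cannot sync directly at this point.
    sync(b, c, archives)      # Sync B-C: the B-C pair has no history of F.
    sync(a, b, archives)      # Sync A-B: F looks newly created on B.
    return deleted_on_b, a.files, b.files, c.files


print(replay_scenario())
```

In this model the B-C pair has no record that F ever existed on both sides, so C's stale copy is read as a creation, and F comes back everywhere.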
Which implies you need to choose one node from {A,B,C} that is the most likely to be available to sync the other two to. That node can also provide connectivity to a larger network, which generalizes to a spanning tree.
IMO a topological sort would be an even stricter requirement than a spanning tree, in that if one node becomes unavailable, you can't sync anything "below" it! Also, what does syncing disks "in order" mean when changes can happen at any time? (e.g. I use unison for maildir).
> The problem I'm referring to isn't conflict resolution, but rather spurious re-creation of files that have been deleted.
That is conflict resolution, over the set of files. Which is why Unison provides the "reconciling changes" UI.
> Which implies you need to choose one node from {A,B,C} that is the most likely to be available to sync the other two to.
Yes, that is a way to avoid update conflicts.
> IMO a topological sort would be an even stricter requirement than a spanning tree, in that if one node becomes unavailable, you can't sync anything "below" it!
Yes, it is also a way to avoid update conflicts.
> (eg I use unison for maildir).
Why? IMAP already takes care of synchronization, without any potential problems.
Eh, we can disagree on a precise definition of abstract "conflict resolution", but I'm referring to unison's behavior with -batch. There's no logical conflict that needs to be resolved by user choice or algorithm. When looking at the larger system of all the replicas, unison's behavior is to simply undelete a file that was explicitly deleted.
> Why? IMAP already takes care of synchronization, without any potential problems.
Less configuration, less attack surface. Sharing mail over NFS is a common thing, and I view unison as a type of distributed filesystem.
Thanks! I guess formulating that question, I was halfway to apt-cache search. I had envisioned a tool that would find an entire equivalent dir tree (two unpacked source tarballs -> one match), and at first glance I don't know if fdupes will do that. But it might make sense to use an off the shelf tool with a little more manual labor.
Unison has support for that on some platforms. On Linux, the option "-repeat watch" uses inotify to watch for changes and rerun unison each time there's a change. You can combine that with -auto and some of the options specifying policies for resolving conflicts (e.g. -prefer newer) to get a setup that works without user interaction.
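A small sketch of assembling that invocation from a script. Only the options named above are used. The two roots are placeholders, and the helper name is made up. Check the Unison manual for your version before relying on this, since conflict policies like `-prefer newer` depend on modification times being meaningful on both sides.

```python
import shlex


def unison_watch_command(root_a, root_b):
    """Build an argv list for a hands-off, continuously running sync."""
    return [
        "unison", root_a, root_b,
        "-repeat", "watch",   # rerun whenever the filesystem watcher reports a change
        "-auto",              # accept non-conflicting actions without asking
        "-prefer", "newer",   # resolve conflicts in favor of the newer file
    ]


# Hypothetical roots, for illustration only.
cmd = unison_watch_command("/home/me/docs", "ssh://backup.example//home/me/docs")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```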
I remember being surprised when I found references to Unison in the Dropbox Linux client. This was in the early days, and I'm quite sure they have rewritten it since. It would be cool to get the full story from a Dropbox employee.