Asynchronous filesystem replication with FUSE and rsync (github.com/immobiliare)
76 points by Lethalman on Dec 11, 2014 | 43 comments



FUSE is definitely the way to go for file synchronization, since the system will never miss a write and can lazily load data for reads. It means that the entire filesystem doesn't have to fit on every machine that's synced. For a more sophisticated FUSE sync filesystem, be sure to check out orifs:

http://ori.scs.stanford.edu/

The best introduction to orifs is their paper, which is linked from the above site.
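
To make the "never miss a write" point concrete, here's a minimal, hypothetical sketch of a FUSE passthrough layer (libfuse 2.x high-level API) where a sync daemon could hook every write. The backing directory path and the "dirty" logging are purely illustrative, and this is not how orifs or SFS actually work; create/unlink/rename handlers are omitted for brevity.

  /*
   * Minimal FUSE passthrough sketch (libfuse 2.x high-level API).
   * Build (assuming pkg-config and libfuse dev headers are installed):
   *   gcc -Wall passthrough.c $(pkg-config fuse --cflags --libs) -o passthrough
   */
  #define FUSE_USE_VERSION 26
  #define _GNU_SOURCE
  #include <fuse.h>
  #include <stdio.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <dirent.h>
  #include <sys/stat.h>

  static const char *backing = "/srv/data";   /* hypothetical backing store */

  static void real_path(char *out, size_t n, const char *path)
  {
      snprintf(out, n, "%s%s", backing, path);
  }

  static int pt_getattr(const char *path, struct stat *st)
  {
      char p[4096];
      real_path(p, sizeof p, path);
      return lstat(p, st) == -1 ? -errno : 0;
  }

  static int pt_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                        off_t off, struct fuse_file_info *fi)
  {
      char p[4096];
      DIR *d;
      struct dirent *de;
      (void)off; (void)fi;
      real_path(p, sizeof p, path);
      if (!(d = opendir(p)))
          return -errno;
      while ((de = readdir(d)) != NULL)
          fill(buf, de->d_name, NULL, 0);
      closedir(d);
      return 0;
  }

  static int pt_open(const char *path, struct fuse_file_info *fi)
  {
      char p[4096];
      int fd;
      real_path(p, sizeof p, path);
      if ((fd = open(p, fi->flags)) == -1)
          return -errno;
      fi->fh = fd;
      return 0;
  }

  static int pt_read(const char *path, char *buf, size_t size, off_t off,
                     struct fuse_file_info *fi)
  {
      ssize_t res = pread(fi->fh, buf, size, off);
      (void)path;
      return res == -1 ? -errno : (int)res;
  }

  static int pt_write(const char *path, const char *buf, size_t size, off_t off,
                      struct fuse_file_info *fi)
  {
      ssize_t res = pwrite(fi->fh, buf, size, off);
      if (res == -1)
          return -errno;
      /* Every write passes through here, so a sync daemon can record
       * "path is dirty" before the write is acknowledged. */
      fprintf(stderr, "dirty: %s\n", path);
      return (int)res;
  }

  static int pt_release(const char *path, struct fuse_file_info *fi)
  {
      (void)path;
      close(fi->fh);
      return 0;
  }

  static struct fuse_operations pt_ops = {
      .getattr = pt_getattr,
      .readdir = pt_readdir,
      .open    = pt_open,
      .read    = pt_read,
      .write   = pt_write,
      .release = pt_release,
  };

  int main(int argc, char *argv[])
  {
      /* e.g. ./passthrough -f /mnt/synced */
      return fuse_main(argc, argv, &pt_ops, NULL);
  }

Since the userspace process sees every operation, it can also fetch missing data on demand for reads, which is what makes partial replicas possible.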


Very interesting. I've always wanted a solution for asynchronously replicating hundreds of GBs over long distances.

I looked longingly at InterMezzo/Coda a long time ago, but it never went anywhere; I played with block-level replication (lvmsync), but it doesn't allow concurrent use; in the end, the only solutions I had to fall back on were rsync (which needs to iterate over the entire directory structure, which is crazy expensive) and git and/or unison (both of which can't cope with many GBs).

I'm going to give orifs a try.


Unison works great on my >2TB collection. Granted, most files are static and my update bandwidth is a tiny fraction of that, doable over DSL.

The only real issues I've found with unison are:

- It works exclusively in terms of replica pairs. Still, I'd rather have a limited, solid tool than a flaky, perfect one.

- Lack of inotify support (the hacky, half-embedded Python stuff that appeared after the original author lost interest doesn't count).

- 3x disk access for copying new files (hash source, transfer, hash destination). This could be reduced to 2x with similar consistency guarantees, but it's really only an issue when creating new replicas, and can be worked around by copying the data first.

- For use as a 'filesystem', having to come up with a file organization and wrapper scripts to choose what is replicated where, while still being sure what counts as a "full replica".


Just in case you're not aware, here are some other options:

btsync (http://getsync.com) works well with at least 500GB of photo-sized files.

git-annex works well for large collections. I am using this for my own photo collection, which is about 450GB. I like it because you can do partial checkout.

I haven't tried it with a large collection but git-annex assistant should work well too, if you're interested in automatic sync.


Another git-annex user here, with a few hundred GBs. I have my laptop, a VPS, a Nexus tablet and an S3 account being synced without any issues (with a different subset of files on each).


I wonder if this could be used as a replacement for Vagrant's rsync feature.[1]

From Vagrant's documentation:

  Vagrant can use rsync as a mechanism to sync a folder to the
  guest machine. This synced folder type is useful primarily in
  situations where other synced folder mechanisms are not
  available, such as when NFS or VirtualBox shared folders aren't
  available in the guest machine.

  The rsync synced folder does a one-time one-way sync from the
  machine running to the machine being started by Vagrant.

The disadvantage of the above is that it's a one-time, one-way sync. SFS would overcome this limitation, if I'm not mistaken.

[1] http://docs.vagrantup.com/v2/synced-folders/rsync.html


It's one-time right now (though there's an open ticket to make it bidirectional [1]), but you can use `vagrant rsync-auto` to watch for changes and continuously sync. I posted an article a few weeks ago highlighting one of the reasons I use rsync rather than NFS shares with Vagrant [2].

Though I would love to see FUSE+rsync (like this article mentions) as a default/standard option in Vagrant!

[1] https://github.com/mitchellh/vagrant/issues/3062

[2] http://www.midwesternmac.com/blogs/jeff-geerling/nfs-rsync-a...


Vagrant could run inotify+rsync periodically from both ends. That works for a small number of files. SFS is for when rsyncing the whole tree isn't an option, i.e. when there's a very large number of files.


Wouldn't the inotify API be a better way to detect file writes than writing a full FUSE wrapper?


Yes, we thought a lot about using inotify; our first prototype used it too.

- Our system needs to cope with millions of directories. Millions of directories for inotify mean a lot of structures in the kernel; for large numbers it can also mean gigabytes of RAM. Add to that the userspace mapping of watch descriptors to paths needed to handle rename operations.

- Using inotify would take a lot of time at startup to recurse into millions of subdirectories and register watches.

- inotify will not automatically watch new directories. You have to list the contents of a directory right after it is created and add watches for its subdirectories, and so on. Not a problem at all, but it's way simpler with FUSE.

- If your system is not able to keep up with inotify events, you miss events, because the kernel cannot buffer them all. For us it's better to slow down the system than to ever miss an event.

- inotify is attached to the original filesystem. That means it's hard (not impossible) to handle loops in an active/active replication setup. With SFS, replicated data is written straight to the original filesystem rather than through the mountpoint, so it doesn't feed back into the event stream.

- If the inotify application crashes, you lose events, because software keeps writing to the filesystem. If SFS crashes, the mountpoint becomes unwritable, so the application gets an error and can switch to different storage.

As you can read, some of these choices depend on the requirements, and ours were not met by inotify.
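
To illustrate a few of those points (one kernel watch per directory, the startup walk, hand-watching new directories, and queue overflow), here's a rough standalone C sketch of recursive inotify watching. The event mask and buffer size are illustrative choices only, not what SFS or any particular tool uses.

  /*
   * Rough sketch of recursive inotify watching (Linux, glibc).
   * Build: gcc -Wall watchtree.c -o watchtree
   */
  #define _GNU_SOURCE
  #include <ftw.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/inotify.h>

  static int ifd;

  /* nftw callback: one kernel watch object per directory found at startup. */
  static int add_watch(const char *path, const struct stat *st,
                       int type, struct FTW *ftwbuf)
  {
      (void)st; (void)ftwbuf;
      if (type == FTW_D &&
          inotify_add_watch(ifd, path, IN_CLOSE_WRITE | IN_CREATE | IN_DELETE |
                                       IN_MOVED_FROM | IN_MOVED_TO) == -1)
          perror(path);
      return 0;
  }

  int main(int argc, char *argv[])
  {
      const char *root = argc > 1 ? argv[1] : ".";
      char buf[64 * 1024];

      if ((ifd = inotify_init()) == -1) {
          perror("inotify_init");
          return 1;
      }

      /* Startup cost: the whole tree is walked once just to register
       * watches; with millions of directories this alone takes a while. */
      nftw(root, add_watch, 64, FTW_PHYS);

      for (;;) {
          ssize_t len = read(ifd, buf, sizeof buf);
          if (len <= 0)
              break;
          for (char *p = buf; p < buf + len; ) {
              struct inotify_event *ev = (struct inotify_event *)p;
              if (ev->mask & IN_Q_OVERFLOW)
                  fprintf(stderr, "queue overflow: events were lost\n");
              if ((ev->mask & IN_CREATE) && (ev->mask & IN_ISDIR))
                  /* New directories are not watched automatically: we must
                   * walk and watch them, and anything created inside them
                   * before we do so was never reported. */
                  fprintf(stderr, "new dir: %s\n", ev->len ? ev->name : "?");
              else if (ev->len)
                  fprintf(stderr, "event on: %s\n", ev->name);
              p += sizeof(struct inotify_event) + ev->len;
          }
      }
      return 0;
  }

With a FUSE wrapper none of this bookkeeping exists: there is nothing to register, new directories are covered automatically, and writes block instead of events being dropped.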


There's a new thing (fanotify). LWN did a short series on different filesystem notification methods in July:

https://lwn.net/Articles/604686/

https://lwn.net/Articles/605128/

Unfortunately the fanotify article is yet to come!

(Having said that, I certainly appreciate your pain trying to use inotify to track write events on an entire filesystem)


fanotify is super limited, and not useful for this app. It's from 2009, but has seen little use.


AFAIK inotify isn't especially great for recursively watching a large directory.


Interesting. I haven't looked at the details and assumed it was similar to the OS X FSEvents API. Maybe Linux has some catching up to do in this area?


We've done that as a workaround for old legacy web apps where nobody had originally thought about failover (and had spread data updates all over the filesystem, intermingled with code... it was not fun to untangle). It works, but we've always supplemented it with regular full-filesystem rsyncs.

Our cases have also had very limited semantics (uploads that add new files or modify existing ones, but never delete anything), which made things easier.

This seems a lot cleaner...

Though these days we mostly use glusterfs and most of those old apps have thankfully been retired or cleaned up.


GlusterFS is for synchronous replication. GlusterFS asynchronous replication, at the time SFS was written, only supported master-slave.


Yes, the point of bringing up Gluster is that synchronous replication is the best solution for our use case (we have a relatively low volume of filesystem writes, but need the files to be consistently replicated near-instantly).

But for applications where async is a good fit, your solution does definitely seem much nicer than the inotify + rsync stuff we were doing (we ended up with that - temporary - solution after writing off lsyncd, though I don't remember why).


This is neat. I like that it's using stable off-the-shelf unix components.

I'm putting together the new https://neocities.org fileserver stuff right now, so I'll definitely be looking into this.

The current plan is to use hourly rsyncs, and then implement this (or some flavor of it): http://code.google.com/p/lsyncd/

Re: inotify vs FUSE: the former is event-driven via an API; the latter, I believe, hooks in at a lower level. Which one is better here is entirely debatable. Gluster uses a similar approach to this one, I believe. I'm not an expert on Unix file APIs, so take all of this with a grain of salt.

The biggest reason we can't use Gluster replication is that if you request a file while replicating, it goes and asks all the servers whether the file is on them, instead of just failing because it's not on the local system. That's fine for many things, but it's an instant DDoS if someone runs siege on the server and just blasts you with random bunk 404 requests. You can't cache your way out of that one. Apparently performance for requests for lots of small files can be pretty slow too.

SSHFS (and rsync using SSH) blow S3 out of the water on performance for remote filesystem work. The difference is pretty insane.


Gluster does what you configure it to do, and unless you have a crazy setup it won't do what you think it does.

For your type of setup, you'd combine the "distribute" and "replicate" translators. The "distribute" translator spreads files out over sub-volumes by hashed path. Unless you specifically enable "lookup-unhashed", it will not forward requests for files that are not found to any sub-volume other than the one matching the hash.

The underlying volumes would then be "replicate", and that's where you'd decide how many replicas you want.

There are other cluster translators besides distribute, which provide more complex distribution methods, but with those you do run into overhead in determining which subvolume holds the files.

You'd also usually layer in the io-cache translator (which you can add on both the clients and the servers, and which provides in-memory caching), so while there are possible pathological cases, Gluster is pretty resilient.

Also, Gluster will replicate directly between servers to heal, but for reads/writes through the FUSE fs the client talks directly to each backend and writes in parallel to each member of the replicate set for that file. So there normally is no separate replication step unless a server is/was down or a disk has crashed: normally, once a write is reported as complete, the file is on all replicas.

But frankly, almost regardless of the file storage solution, your best bet is still to put Nginx or another suitable reverse proxy in front of the file servers, so you can e.g. cache the most requested files in faster/more expensive ways (memcached, or on SSDs).


"it's an instant DDoS if someone runs siege on the server and just blasts you with random bunk 404 requests"

Would you run your fileservers on a publicly accessible network? If not, then the above attack would require the attackers to have access to your private, internal network, and at that point it could be argued that you've already lost.

Of course, it's good to design for defense in depth, so that you could survive a hit on your internal network as well. But I'm not sure how much sleep I'd lose over the possibility of a DDoS attack against a private network.

On the other hand, if your servers are on a private network, I'm not sure why you'd use SFS over something like NFS.


In this context a web-server is just a fancy interface to a file server. So you have a bunch of read-only "file servers" sitting on the public web talking to a private read/write back end.


For newly built applications, why not set up a Ceph or Swift cluster, then use an S3 interface for access to files/objects? Total and minimum replica counts can be configured, so you get something like eventual consistency if you use a min smaller than the total.


That's true. I've tried Ceph, Swift and Riak CS.

Our requirements were two servers, several terabytes of storage, and not losing data because of an unknown bug in the filesystem.

Such filesystems are quorum-based and certainly need more than two servers for terabytes of data, which is more than we could afford. Also, if you lose quorum, the system hangs to ensure atomicity. But we don't need such properties.

Also, after some testing, those filesystems turned out to be about 5-10% slower than traditional filesystems, with few nodes of course. They are designed to scale to many nodes.


Thank you


I'm an employee at Immobiliare.it, and yesterday we released some of our internal software to the public on GitHub for the first time. Just wanted to share our work :)


I wonder how your system deals with write conflicts? The documentation is not clear about that...


Sounds clear to me:

Because there are no locks, the last write wins according to rsync's update semantics, based on file timestamp + checksum. If two files have the same mtime, rsync compares the checksums to decide which one wins.

https://github.com/immobiliare/sfs#consistency-model
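
As a toy illustration of that rule (not SFS's or rsync's actual code), here's a small C sketch that picks a winner between two replicas by mtime and falls back to a content digest only to break exact ties. The FNV-1a digest is just a stand-in for the checksum rsync itself uses.

  /* Toy model of "last write wins": newest mtime wins; on an mtime tie,
   * fall back to a content digest. Build: gcc -Wall winner.c -o winner */
  #include <stdio.h>
  #include <stdint.h>
  #include <sys/stat.h>

  static uint64_t digest(const char *path)      /* FNV-1a over file contents */
  {
      uint64_t h = 1469598103934665603ULL;
      FILE *f = fopen(path, "rb");
      int c;
      if (!f)
          return 0;
      while ((c = fgetc(f)) != EOF)
          h = (h ^ (uint64_t)c) * 1099511628211ULL;
      fclose(f);
      return h;
  }

  /* Returns whichever replica "wins" the conflict. */
  static const char *winner(const char *a, const char *b)
  {
      struct stat sa, sb;
      if (stat(a, &sa) == -1) return b;
      if (stat(b, &sb) == -1) return a;
      if (sa.st_mtime != sb.st_mtime)
          return sa.st_mtime > sb.st_mtime ? a : b;   /* last write wins */
      return digest(a) >= digest(b) ? a : b;          /* arbitrary but stable */
  }

  int main(int argc, char *argv[])
  {
      if (argc == 3)
          printf("winner: %s\n", winner(argv[1], argv[2]));
      return 0;
  }

The important property is that both sides of an active/active pair resolve the same conflict the same way, so they converge on one version instead of ping-ponging.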


Since it's using FUSE, it would be nice if it had lazy initial syncing, i.e. an empty slave filesystem would serve/sync accessed files even if they haven't been synced yet, at least until the full sync is completed. Millions of files or GBs of data can take a long time for the initial sync, and lazy syncing would allow you to start serving clients with no delay.


It already works like this; there's no initial sync.


This is very interesting indeed for the described use case. AFAIK the other alternative is lsyncd.


lsyncd uses inotify. Also, AFAIK it rsyncs (or uses other backends on) the whole tree instead of single files, but that's of course not a show-stopper, as it would only take a few changes to implement.

In addition, I don't think lsyncd was designed for an active/active setup, but rather for a master-slave setup.


Lsyncd does not sync the entire tree every time. It depends on the situation; most of the time, in our use case, it starts an rsync for 5 specific files. Only on startup, or when the system is overloaded, does it sync the entire tree.


Good, thanks for clarifying.



No, I've just read about it on HN today. It's interesting. I guess with millions of files you'd run out of inodes quickly if you don't prune the history periodically. But it's a good idea.


I have been looking into this, and my current idea is tending towards using VMs running DragonFly BSD.

VM-1 - a local NFS server that runs DBSD and the HAMMER filesystem, with many nice features (automatic snapshots, etc.). It will be fast, especially if VM-1 is on the same physical host node as my worker VMs.

VM-2 is remote, and receives the DBSD filesystems I send it. All snapshots etc. from the original FS are retained. If the connection is interrupted, the Hammer sync code will figure it out and restart from the latest successful transaction.


Sounds really cool. Have you actually implemented this as a prototype yet?


Would this work with subfolders being synced from different machines onto a single machine?

Even better, could it grab data as required from other machines?


You mean one way only, from N machines to a single machine? Yes. When the other machines add or change any file, it will be uploaded to that single machine.


Poor name for a filesystem - there are already a couple of filesystems named "SFS".


If this could be ported to Windows, Mac, Android and iOS, I'd drop Dropbox in a heartbeat.


I too am basically looking for a Windows port of rsync. I'm aware of a project that runs on top of Cygwin (DeltaCopy?), but it hadn't kept up with the latest rsync protocol last I checked.

Any recommendations?


rsync runs very well in Cygwin (or on its own without it - just copy rsync.exe and cygwin1.dll from a Cygwin installation and you're good to go if you have some ssh; if you want rsync to use Cygwin's ssh, you'll also need to copy ssh.exe and a handful of other DLLs).

There doesn't seem to be anything else close to rsync's robustness or completeness of functionality - on Windows or Linux.



