The key feature of Time Machine is hard links to directories -- which is only possible on modern HFS+ (and rsync doesn't even try). Some people like the UI too, of course.
Without hard linked directories, a full --link-dest backup of a decent sized disk, with zero file changes from the previous pass, can easily consume 100MB (and take 45 minutes to perform).
This disk consumption might seem insignificant today, but that's 2.4GB per day if you run a standard Time Machine-equivalent backup schedule. Of course you might not choose to do that, because the previous hour's backup would only finish 15 mins before the next one started, which is insane.
These numbers are from direct experience on a 2TB, approximately 60% utilized source drive.
That said, I use rsync, not Time Machine, for my OSX backups. You'll want a few additional switches for HFS+, and if your target drive is HFS+ also, make sure you turn OFF "ignore ownership on this volume" in the Finder... but the script posted here has the right general idea. Somewhere on my project list is adding directory hardlinking to rsync.
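For reference, here's roughly what I mean, sketched (paths are illustrative, and I'm assuming rsync 3.x from MacPorts/Homebrew, since the stock OS X rsync lacks -X/-A):

    SRC="/Volumes/Source/"                   # trailing slash: copy contents
    DST="/Volumes/Backup"
    NEW="$DST/$(date +%Y-%m-%d-%H%M%S)"

    # -X/-A carry xattrs and ACLs (they matter on HFS+); --link-dest hard-links
    # anything unchanged against the previous snapshot.
    rsync -aXA --delete --link-dest="$DST/latest" "$SRC" "$NEW" \
      && ln -snf "$NEW" "$DST/latest"        # repoint "latest" on success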
The hard linking of directories is a hack, not really a feature. There's a good reason why filesystems (including HFS+ when not being used for Time Machine) do not do it.
It's a hack, but it's also the best (only?) way to deduplicate all that metadata without breaking the ability to access the history using pre-existing filesystem operations.
Volume snapshots are a better way. The only filesystems I know of that support them are ZFS, BtrFS, and HAMMER. I'm not sure if snapshots are implemented using hard links or if they are more fundamental.
Another way, which is worse, is block-level deduplication. All of the above filesystems support it, as does NTFS.
I wish Apple would adopt HAMMER for Mac OS. It is BSD-licensed and more suitable for a memory-constrained environment than ZFS.
I'm not aware of any volume snapshot system that presents the history as a plain directory tree. At best, you can use some special mount options to mount a snapshot, but you generally have to use a special tool particular to that filesystem in order to access a snapshot, and creating the snapshot always requires such a tool. There's no standard Unix way to create or access volume snapshots. Time Machine histories can be created, accessed, and analyzed entirely with standard tools except for the modified `ln`.
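Concretely (the volume and machine names here are made up), every snapshot is just a directory tree you can walk with the usual tools:

    ls /Volumes/TM/Backups.backupdb/my-mac/             # one directory per snapshot
    cp -a "/Volumes/TM/Backups.backupdb/my-mac/2013-01-01-120000/Macintosh HD/Users/me/notes.txt" \
          ~/notes-from-january.txt                      # restore with plain cp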
Btrfs snapshots are implemented using COW-shared trees (multiple trees for data and metadata). They can't be implemented with hardlinks, that would be a very leaky abstraction.
Why would deduplicating the metadata be important for this use case? A fixed number of available inodes on some filesystems is the only reason I can think of.
Because of the aforementioned problem where it takes 100MB and the better part of an hour just to record a new snapshot when literally nothing has changed since the previous snapshot. Deduplicating gets rid of that 100MB overhead, and being able to explicitly do it with hard links instead of having to rely on the filesystem to discover the duplication on its own takes care of most of the running time overhead.
The problem with soft-links is that if the underlying file/directory is deleted, you are screwed. For example, if you have 100 full-machine backups and want to free some space by deleting every other one, you have to be careful that none of the backups you are keeping have soft-links to files in the backups you are deleting.
With hard-links the underlying data is not deleted until all hard-links are deleted, so you can delete any individual backup directory without losing data in any other backup directory.
A soft-link is like a pointer in C whereas a hard-link is like a C++ shared_ptr, i.e. reference counted.
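A quick shell illustration of the difference (file names are just for the example):

    echo data > original
    ln    original hard      # second name for the same inode; link count is now 2
    ln -s original soft      # the soft link only stores the path "original"
    rm original
    cat hard                 # still prints "data": the inode survives until its last link is gone
    cat soft                 # fails: the path it pointed to no longer exists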
When is this trend of "like Time Machine" backup software going to stop?
Time Machine, as can be seen at http://www.apple.com/support/timemachine/, is tightly integrated into the OS and provides a self-defining interface and user experience.
This GitHub page is for a wrapper shell script around rsync, which is not like Time Machine.
I love/hate Time Machine and would love to switch to something better. The problem is that to be secure, you need both versioned backups and redundant storage of the backup. Time Machine has trouble with network volumes (this doesn't stop me from using my DroboFS, but I have to struggle with it regularly). It also doesn't have a cloud backend. The best option for having both is Crashplan, but the UI sucks. So if there's anyone else like me, we're regularly in the market for options that offer redundancy, version history, and a solid UI.
Time Machine is a great "enable and forget about it" solution, but it has some limitations too. For example, it can only back up to a drive, not to a folder within the drive. Also, in my case, I wanted to back up the Users folder of my Windows Bootcamp partition, but it cannot be done. Excluding files during the backup is also not possible. The nice thing about a small bash script to handle all this is that it can be easily customized to your needs.
Oh, and integration with the OS X recovery partition / OS reinstallation mechanism that allows you to point to a Time Machine backup as the recovery point for your re-installation.
The integration with OS reinstall works very well, and is pretty seamless from the user's perspective. I used to do two backups-- a local TM backup and a separate cloud backup, but I found it was actually easier to just use Automator to mount my TM volume once a day, image it, and send that up to the cloud. When my TM volume failed last year, I just pulled the latest image and put it on a replacement drive, and I was back up and running.
TM has had a few problems, but by and large it is one of the quiet successes in OS X, and probably my favorite feature of the OS. Why Microsoft hasn't put something like it in Windows is baffling to me.
I found that just running an Automator script for the "new disk image from selection" command on the root of the drive worked perfectly-- set an iCal event to run that script once a day, and you're done.
Be sure to test this to make sure it restores, but in my case it works flawlessly.
rdiff-backup uses reverse diffs for its incremental backups, not hard links. This has the nice advantage that when a large file changes, the backup grows only by the size of the diff, not by a whole new copy of the file.
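Roughly like this (paths are placeholders), assuming the standard rdiff-backup command line:

    rdiff-backup /home /mnt/backup/home                        # current mirror + reverse diffs in rdiff-backup-data/
    rdiff-backup --list-increments /mnt/backup/home            # what history is available
    rdiff-backup -r 3D /mnt/backup/home /tmp/home-3-days-ago   # restore the state from 3 days ago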
This is really cool. For me, most of my data is "in the cloud" now. Important and frequently-accessed project files are in Dropbox, code is on Github, and most of my music is either streamed from the web in Spotify or my actual library is stored/streamed via iTunes Match. Because of this, I actually don't have a considerable amount of data to keep backed-up, and a lightweight non-time-machine solution like this looks perfect. I tried rsync before but never got into a solid routine.
I've become obsessed with the "12factor" app approach everywhere in my digital life, so that if a device ever disappeared, was stolen, or died, I could get a replacement fully operational without any problems. Like an app-server dying, just launch a new one and it will bootstrap itself.
I'm kinda crazy with my new "homelab" and started it off with an old 2U Poweredge I got on eBay for about $200. It's cheaper than a Synology/Drobo, has room for 6 drives, and the dual quad-core processors + 16GB of RAM is pretty cool too. It's running FreeNAS right now in a VM with a 3TB ZFS pool. I've created an AFP share that appears to my mac as Time Machine and over my gigabit-network it does a pretty fast backup. I really like the FreeNAS software. It's open-source, runs on FreeBSD, and the UI/admin tool is built in Django.
Agreed. I've used rsnapshot for years and am not sure how this script really differs (better or worse). I can say that rsnapshot is a real workhorse that gives me a lot of peace of mind.
I've used dirvish for several years (nearly a decade), both locally and remotely without issue. No configuration is necessary on the client, only on the backup server (alternatively you can flip it and put all of your configuration on the client). Uses rsync, hard links, and can keep snapshots at various intervals.
Seconded. I set up `rsnapshot` some years ago to backup monthly, weekly and daily. It's been running ever since without interruption, and it's saved my butt on more than one occasion.
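My setup boils down to something like this (retention counts are just an example, and the config fields must be tab-separated; newer rsnapshot spells the keyword `retain`, older versions call it `interval`):

    # /etc/rsnapshot.conf (fragment)
    snapshot_root   /backup/snapshots/
    retain  daily   7
    retain  weekly  4
    retain  monthly 6
    backup  /home/  localhost/

    # crontab entries that drive it
    30 3 * * *   /usr/bin/rsnapshot daily
    30 4 * * 1   /usr/bin/rsnapshot weekly
    30 5 1 * *   /usr/bin/rsnapshot monthly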
I wrote a similar script that I have been using for many years now. These sorts of backups are very convenient.
I started with hard links too, but nowadays I prefer to format my backup disk with btrfs and use btrfs snapshots instead, though I still support hard links.
I prefer btrfs snapshots due to their support for COW, so if I decide to play with a backup I won't mess up all the other versions of that backup. With hard links you should never write to your existing backups.
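The gist of it (subvolume names invented for the example):

    btrfs subvolume create /mnt/backup/current           # one-time setup
    rsync -a --delete /home/ /mnt/backup/current/        # sync into the writable subvolume
    btrfs subvolume snapshot -r /mnt/backup/current \
          "/mnt/backup/snap-$(date +%F-%H%M)"            # cheap, read-only COW snapshot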
Lately I added a helper script to mount remote filesystems and lock MySQL databases, but it could be easier to use.
Most of my code is checks to make sure I won't write somewhere I shouldn't.
As quesera noted below, on a not-so-big modern disk with 500,000 files, the metadata can easily be in the 50-100MB range, which adds up to >1GB for metadata (even when nothing has changed) if you back up every hour.
You should, however, consider bup (https://github.com/bup/bup) - it takes less than a minute to figure out that nothing needs to be done, and it deduplicates parts of files (that is, if you have a 20GB virtual machine image and you've changed one byte in the middle of it, the next snapshot is going to take ~10KB, not 20GB). The older releases don't keep ownership/modification times, but there's a new version pending release soon that does.
It also works well remotely (through ssh), can do integrity checks (bup fsck), and can add redundancy (using par2; important after deduplication). And it has a fuse frontend that makes it all accessible as a file system, as well as an ftp frontend.
It's missing some options for extended attributes (SELinux, ACLs) and (on OSX) resource forks.
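A typical run looks something like this (branch name and paths made up):

    bup init                        # one-time repository setup (~/.bup by default)
    bup index /home/me              # stat pass to find what changed
    bup save -n laptop /home/me     # chunked, deduplicated snapshot
    bup fsck -g                     # generate par2 recovery blocks
    bup fuse /mnt/bup               # browse every snapshot as a filesystem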
Also, if you're doing multiple backups of the same data to a filesystem over time, it's worth doing 'cp -al' from the previous backup to the current backup destination, then rsync over the top of that - that way multiple copies of files which haven't changed don't take up any extra space.
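i.e. something along these lines (dates and paths are illustrative; cp -al is the GNU coreutils spelling):

    cp -al /backup/2013-01-01 /backup/2013-01-02     # hard-link the entire previous tree
    rsync -a --delete /home/ /backup/2013-01-02/     # rsync rewrites only changed files, breaking just their links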
> Also, if you're doing multiple backups of the same data to a filesystem over time, it's worth doing 'cp -al' from the previous backup to the current backup destination, then rsync over the top of that - that way multiple copies of files which haven't changed don't take up any extra space
Isn't that already taken care of by its use of rsync's --link-dest option?
You don't want to hard-link backups to the non-backup copy -- the non-backup files might change, and if it is a change that doesn't remove the file first, it will modify the backups if the backups are hard linked.
Doesn't vim intentionally do this if it edits files that are hard linked? emacs breaks the hard link. I'm not saying either behavior is best, but in this case one might be surprising!
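The mechanism, sketched: an in-place write is visible through every link, while the write-temp-then-rename dance many editors do leaves the other link pointing at the old inode.

    echo v1 > a && ln a b        # a and b name the same inode
    echo v2 > a                  # in-place truncate+write: b now reads v2 too
    echo v3 > a.tmp && mv a.tmp a
    cat b                        # still v2: a was swapped for a brand-new inode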
Link-backup does this as well, plus it knows how to build hard links to old backups even when directory structure or filenames change, effectively de-dup support. It does this by building a content addressable index on the destination filesystem that backup trees hard-link against.
ZFS solves the hard-link problem spectacularly well (and adds a whole bunch of data-integrity verification on top of that) with snapshots.
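For comparison, the ZFS version of the whole dance (pool/dataset names invented):

    zfs snapshot tank/home@2013-01-02      # instant; unchanged blocks are shared via COW
    zfs list -t snapshot                   # every snapshot, with the space it uniquely holds
    ls /tank/home/.zfs/snapshot/           # browse any snapshot read-only
    zfs rollback tank/home@2013-01-02      # or roll the live dataset back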
What I don't like about that solution is it can't dedupe (ZFS dedupe just ultimately doesn't work very well). Hence my interest in (and, if anyone checks my post history, constant spruiking of) bup - which does efficient dedupe and output of git pack-files. Stick that on a ZFS volume with snapshots, and you've got block-level checksummed, versioned and deduplicated backups.
What it's all missing, of course, is a pleasant interface to use it with (one which doesn't fall back to the thing I see way too often in a lot of these scripts: "don't worry, we're just going to stat your entire filesystem every 20 minutes").