Bup: Efficient file backup system based on the git packfile format (github.com/bup)
186 points by tekacs on Feb 15, 2014 | 61 comments



Another backup possibility I currently use: ZFS on a backup server (not necessarily ZFS on the system being backed up). Pull the data onto a ZFS dataset with rsync on the backup host, then take a snapshot to get an "incremental backup".

Simplified, it looks like: rsync -avx remote:/etc /backup/ && zfs snapshot backup@`date +%F`

With zfSnap (https://github.com/graudeejs/zfSnap) you can specify how long snapshots are kept, e.g. "rsync && zfSnap -d -a 1w backup".

You can access every snapshot through the /backup/.zfs/snapshot directory, and you also get ZFS's built-in compression and optional data deduplication.
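
For example, roughly (assuming the dataset is just called "backup" and is mounted at /backup):

    zfs set compression=lz4 backup   # transparent compression on the backup dataset
    zfs set dedup=on backup          # optional dedup; needs plenty of RAM
    ls /backup/.zfs/snapshot         # every snapshot, browsable read-only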

If you also have ZFS on the remote host, you can use zfs send and zfs receive to transfer the snapshot directly to the backup server, instead of using rsync for the diff.
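
Roughly like this, assuming a dataset tank/data on the remote host and a local backup pool (all names are placeholders):

    # initial full copy of a snapshot
    ssh remote zfs send tank/data@2014-02-14 | zfs receive backup/data
    # later, send only the incremental diff between two snapshots
    ssh remote zfs send -i tank/data@2014-02-14 tank/data@2014-02-15 \
        | zfs receive backup/data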


I do the same with btrfs. I actually tried bup first, but had a series of problems (see mailing list) and switched to btrfs snapshots. My main disk is also btrfs so I send incremental snapshots ("btrfs send -p"; faster than rsync and I can keep using the machine without making the backed-up state inconsistent), but the rsync method is fine for other source file systems.
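
For reference, the incremental flow looks roughly like this (subvolume paths and host names are placeholders):

    # take a read-only snapshot, then send only the delta against the previous one
    btrfs subvolume snapshot -r /home /snapshots/home-2014-02-15
    btrfs send -p /snapshots/home-2014-02-14 /snapshots/home-2014-02-15 \
        | ssh backuphost btrfs receive /backup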


rsync.net is an example of a host that does something like this (with daily snapshots)

    $> ssh rsyncnet ls .zfs/snapshot
    daily_2014-02-09
    daily_2014-02-10
    daily_2014-02-11
    daily_2014-02-12
    daily_2014-02-13
    daily_2014-02-14
    daily_2014-02-15
They also let you customise how long these snapshots are kept (IIRC), at the cost of the incremental extra storage.


Just a clarification ...

All accounts get 7 days (as the parent shows) and all 1TB+ accounts get 7 days + 4 weeks.

However, if you want a custom schedule (say, 30 days, 8 weeks, and 6 months), it only costs more if the space the snapshots use exceeds your existing account's quota.

So, if you have a 100 GB account and you use 60 GB and your fancy snapshot schedule (of which the first 7 days is always free) only uses 35 GB ... then there is no additional cost at all.


rsync.net looks awesome, but it is just way too expensive for me.


You should email us and ask about the "new customer HN reader discount".

Also, note that 1TB accounts are 15c per GB, per month, and 10TB accounts are 7c per GB, per month - all with no traffic or usage costs.

This compares very favorably with S3 and blows the Mozy Pro pricing out of the water[1].

[1] https://mozy.com/product/mozy/business


Have you asked for the HN discount?

https://news.ycombinator.com/item?id=6554313

(I happen to get a discount for being a prgmr.com user, but IIRC the terms are similar.)


Where can I learn more about using rsync and ZFS?


Bup is lovely. I used it to back up my huge home folder and only switched away to rdiff-backup because (at the time) there was no support for deleting old revisions.
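
For reference, the basic workflow looked roughly like this (paths and backup names are just examples):

    bup init                          # create the default repository in ~/.bup
    bup index /home/me                # build/refresh the file index
    bup save -n home /home/me         # write a deduplicated backup set named "home"
    bup save -r backuphost: -n home /home/me   # or save straight to a remote bup server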

Is there any support for that? (Of course, for a large enough hard drive, it's not much of a problem...)


People have been actively working on a "prune" feature but it seems to never quite get finished. This would indeed be nice to have, although it's less important than you might think, given really good deduplication (which bup has). Currently bup has a very simple model - never delete anything - which is hard to screw up, so you're very unlikely to lose data.


I really have to finish that feature...


I wrote ddar, which is basically this but solves that particular problem by using something other than the git packfile format.

http://www.synctus.com/ddar and http://github.com/basak/ddar

It's recently been made available on Homebrew, too.


Nice. Could you add some description there of how the deduplication works? I assume it uses rolling checksums to create chunk boundaries, just like bup?


> I assume it uses rolling checksums to create chunk boundaries, just like bup?

That's right.


I don't see a link to the source code on your page. Is it just me?


See my github link for the latest source. The other page predates my uploading to github, so there's only a tarball source download there. I should move everything over to github now, really.


How actively is it developed? How reliable is it currently?


> How actively is it developed?

It's a simple tool that does a simple job. It's pretty much done.

> How reliable is it currently?

No known bugs.


Looking at the README under "Stuff that is stupid":

  > bup currently has no way to prune old backups

Thanks for the rdiff-backup shout-out. I'm looking for a nice way to do system backups of large VM images to my NAS without having to install CrashPlan. Bup and rdiff-backup both look pretty good.


Obnam is also great: http://liw.fi/obnam/

This does deduplication, but can also encrypt the deduplicated blocks and store them on (say) S3.


Obnam can also push/pull backups over SSH/SFTP. Seconded.


I've been working on that feature for some time. I'll try to finish it for the next release! If you want to test my existing code contact me.


How do people who would use this kind of thing manage to have remote servers with terabytes of available disk space on them?

Anything is possible with money, of course, but how is this anything other than really expensive?

For example, AWS S3 would be $235/month (that's $2,820/year!) for 3 TB, not even including any data-out transfer charges. Sure, there are others that are cheaper, but only marginally so.

Is this really what people are doing? Makes the commercial services sound really cheap.


My suggestion is to back up your cloud servers, which are expensive and redundant and have good uplink speeds, to home servers, which are cheap and have good downlink speeds. You don't need your backup file server to be ultra-reliable or even up all the time, so the cheapest possible PC sitting on a home internet connection is a pretty good choice. That way, 3TB is just $150 or so plus your electricity, and it's not a per-month fee.


I'm curious how you would quickly restore that much data from a home uplink back to the cloud?


You would send the hard drive via snail-mail to Amazon.

A mass restoration is expected to be rare, so it's okay for it to be a bit more expensive.


If you ignore the cloud services and rent dedicated servers, you can get up to 6 TB of disk space for $50 a month.

edit: 45 TB for €300/month: https://www.hetzner.de/en/hosting/produkte_rootserver/xs29


Unless you don't need any redundancy (the cloud has some), that's more realistically about 39 TB (in RAID 50, if the controller supports such a setup).

If so, that's still less than €8/TB/mo, which is better than what most cloud storage providers offer. As a bonus, you also have some spare memory and CPU resources (which you could resell to others as, for example, memcached instances) and the possibility of getting a proper SLA.


My strategy is:

1. Regularly back up "important" directories (code/, papers/, web/, etc.) to fairly safe/redundant cloud storage with incremental history. I have pretty little of this, <50 GB, so it's not super-expensive.

2. Occasionally exchange bulk but less-important backups with my brother, so we're each the other's high-latency, questionable-durability "off-site backup". No incremental dumps here, just rsync. This is where my MP3 collection, DVD rips, and the like go.

3. Photos, which are important but also bulk, go to Flickr, which is free.

4. Don't back up stuff I can re-acquire, e.g. big public datasets I've downloaded to work on, or Debian ISOs. Also, I don't back up the OS, just my data.

There do, however, seem to be some cloud services that offer big full-disk backups for a surprisingly low flat price, e.g. http://www.backblaze.com/ is $5/mo/machine.


We use a dedicated server. It's a fairly basic machine with a single Xeon CPU, 32 GB of RAM and a lot of drive bays, which I think doesn't cost much. It currently has 16 x 3 TB drives in RAID 50. RAID 50 isn't great for performance, but it's fast enough to saturate gigabit in sequential operations (which backups are). So it has 32 TB of useful storage for a price of around €700 per month (a leased server at a high-end hosting company, so it could be much cheaper if you buy it yourself or use a cheaper provider). Per TB that's around €21.5 per month. Our reasons for doing this weren't based on the price of cloud storage, though; they were based on having the data on our own machine, with disk encryption and a connection only to our internal management network rather than the public internet.


You shouldn't think about backups as paying for $/GB. You should think of it as paying for the ability to restore correctly and instantly when you want to restore. That's where the value really lies.


AWS is never the cheapest option. AWS is great when you're quickly scaling up and down, but the flexibility comes at a (high) cost.

For backups, you're dealing with a relatively consistent or predictable amount of data. Buy the appropriate dedicated server for your needs.


Remember that there's a huge difference between files you need to read and write at any time with low latency, and backups, which happen at longer intervals and are read infrequently, with looser latency demands. For backups, the service to compare is Glacier, which is about an order of magnitude cheaper.

What I'd consider is essentially the CrashPlan model: P2P / external backups locally (i.e. at full LAN speed) and an off-site replica, which can be cheaper and slower as long as you have high confidence that it'll be available eventually. That way normal operations are fast, but if the building burns down you're covered and presumably have higher priorities than waiting for a restore to run.


AWS is for startups with venture capital, ephemeral storage and compute, or someone needing to deploy a high traffic website instantly.


I hear a lot about rdiff-backup, but I think it has two main drawbacks:

* the website shows no new release since 2009.

* it has no de-duplication.

I was considering moving my 5+ year old rdiff-backup setup to one of these new, promising programs:

* obnam [http://liw.fi/obnam/]

* attic [https://pythonhosted.org/Attic/]

They both do automatic de-duplication, old backup deletion and remote encryption.


The original rdiff-backup author went on to create duplicity. Maybe a big part of the rdiff-backup community has followed him?

I'm using obnam now.


For a link, http://duplicity.nongnu.org/

I back up locally using duplicity, then ship those files off to Amazon Glacier using mt-aws.

https://github.com/vsespb/mt-aws-glacier
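
Roughly like this; the paths, vault and config names are placeholders, and the mtglacier options are from memory, so check its README:

    # encrypted, incremental backup to a local directory with duplicity
    duplicity /home/me file:///backup/duplicity
    # throw away backup chains older than six months
    duplicity remove-older-than 6M --force file:///backup/duplicity
    # then sync the duplicity volumes up to a Glacier vault with mt-aws-glacier
    mtglacier sync --config glacier.cfg --dir /backup/duplicity \
        --vault mybackups --journal /backup/glacier.journal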


I'm assuming you're not the author. But just in case the author wanders by: How did you decide which parts to write in C?


It's not too hard, actually. A line of python is roughly 80x slower than a line of C (no exaggeration). But a typical line of python does a lot more than a typical line of C. So things you can do in a "loose" loop (like once per 64k block) are usually ok in python. Things you have to do in a "tight" loop (like once per byte) need to be in C.

I once did a presentation about python performance optimization lessons from bup: http://lanyrd.com/2011/pycodeconf/sghxk/

And it's true, I'm not the most active maintainer anymore. The people who took over seem to be doing a pretty good job though.


I'd be curious to know whether your stance on PyPy has changed at all since 2011 (if it's something you've taken a fresh look at since), given their progress in that time.

I would humbly submit, at the least, that my own position has shifted to seeing PyPy as a viable option for high-speed code (albeit in substantial part due to its better interaction with C these days).


Not the author either, but I've made that decision based on profiling the code. It's generally easy to see which functions are hurting performance and could be rewritten in a faster language.


The author is https://news.ycombinator.com/user?id=apenwarr, though it looks like he isn't the most active maintainer nowadays.


The most efficient backup system I've used so far for operating systems is tarsnap. The only drawback is that restores are really slow.


I love Tarsnap, but its S3 storage costs aren't exactly brilliant. Figuring out exactly how you want to store your keys is also something to think about upfront (albeit a consideration that arises from the increased security you get 'for free').

Alternatively, CrashPlan and other consumer-style services have a bad habit of using very slow, heavy, system-wide file scanning that drags the whole machine down. :/

Having said this, I'm fairly certain I've seen lengthy discussions of the merits and flaws of Tarsnap and similar backup services on similar HN posts.

(https://news.ycombinator.com/item?id=5767116 is a good source for lots of that sort of discussion)


Not really closely related, but another solution that uses git infrastructure to back up large files is git-annex:

https://git-annex.branchable.com/
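
A minimal sketch of the idea (the remote, repo and file names are placeholders, and the remote repo needs git-annex initialised too):

    git init ~/big && cd ~/big
    git annex init "laptop"
    git annex add disk-image.qcow2      # git tracks a symlink; the content goes into the annex
    git commit -m "add disk image"
    git remote add backuphost ssh://backuphost/~/big.git
    git annex copy --to backuphost disk-image.qcow2   # ship the large content to the remote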


... and there are some tangential benefits to using git-annex as well:

http://rsync.net/products/git-annex-pricing.html


and git-annex has a bup backend :)


So, it's 2014 and people still use homegrown variations of tar, rsync, git and whatnot. Or a half-done solution like this, or an abandoned solution like Box Backup.

Why on earth isn't there already a perfect cross-platform open source backup program? :)

I know, I know... why don't I make one myself? Because we don't need another half-done solution :-b


Every "perfect cross platform open source program" had to start as a homegrown variation of tar, rsync, git, and whatnot. It's not like they fall out of the sky.


Of course things don't come out of nothing. I just find it so odd that we have top-quality open source OSes, monitoring systems, programming languages, IDEs, browsers, graphics suites, etc., but no backup...


Not having a backup isn't a "pain point"... until you have a disaster. And disasters are infrequent enough...


(Don't mean to say this is a bad product or anything, just not a finished product...)


you can help finish it


This is an outstanding project with great potential.

A killer for many overpriced commercial services.


Where would you host the backups?



Like here: https://bluevm.com/cart.php?gid=42 (search for '100 GB').


A 'backup system' that runs on Python. How oxymoronic.


Seriously. I write all my critical software in assembly so that my super-fast disks and networks aren't bottlenecked by unnecessary CPU instructions! Backup software always values speed over correctness!


You mean like rdiff-backup, which I've been using in production against millions of files for more than half a decade without a single problem?

It may be that some languages or runtimes yield consistently more reliable software than others, but I'd bet that the individual programmer, their coding style and their practices have more of an effect on reliability.


How is that different from a shell script that calls programs?



