Bup: Efficient file backup system based on the git packfile format (github.com/bup)
186 points by tekacs on Feb 15, 2014 | 61 comments



Another backup possibility I currently use: ZFS on a backup server (not necessarily ZFS on the system being backed up). Pull the data onto a ZFS dataset with rsync on the backup host, then take a snapshot to get an "incremental backup".

Simplified, it looks like: rsync -avx remote:/etc /backup/ && zfs snapshot backup@`date +%F`

With zfSnap (https://github.com/graudeejs/zfSnap) you can specify how long snapshots are kept, e.g. "rsync && zfSnap -d -a 1w backup".

You can access every snapshot through the /backup/.zfs/snapshot directory, and you also get ZFS's built-in compression and optional data deduplication.
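
For example, roughly (assuming the dataset is just called "backup" and is mounted at /backup):

    zfs set compression=lz4 backup   # transparent compression on the backup dataset
    zfs set dedup=on backup          # optional dedup; needs plenty of RAM
    ls /backup/.zfs/snapshot         # every snapshot, browsable read-only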

If you also have ZFS on the remote host, you can use zfs send and zfs receive to transfer the snapshot directly to the backup server, instead of using rsync for the diff.
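
Roughly like this, assuming a dataset tank/data on the remote host and a local backup pool (all names are placeholders):

    # initial full copy of a snapshot
    ssh remote zfs send tank/data@2014-02-14 | zfs receive backup/data
    # later, send only the incremental diff between two snapshots
    ssh remote zfs send -i tank/data@2014-02-14 tank/data@2014-02-15 \
        | zfs receive backup/data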


I do the same with btrfs. I actually tried bup first, but had a series of problems (see mailing list) and switched to btrfs snapshots. My main disk is also btrfs so I send incremental snapshots ("btrfs send -p"; faster than rsync and I can keep using the machine without making the backed-up state inconsistent), but the rsync method is fine for other source file systems.
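
For reference, the incremental flow looks roughly like this (subvolume paths and host names are placeholders):

    # take a read-only snapshot, then send only the delta against the previous one
    btrfs subvolume snapshot -r /home /snapshots/home-2014-02-15
    btrfs send -p /snapshots/home-2014-02-14 /snapshots/home-2014-02-15 \
        | ssh backuphost btrfs receive /backup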


rsync.net is an example of a host that does something like this (with daily snapshots)

    $> ssh rsyncnet ls .zfs/snapshot
    daily_2014-02-09
    daily_2014-02-10
    daily_2014-02-11
    daily_2014-02-12
    daily_2014-02-13
    daily_2014-02-14
    daily_2014-02-15
They also let you customise how long these snapshots are kept (IIRC), at the cost of the incremental extra storage.


Just a clarification ...

All accounts get 7 days (as the parent shows) and all 1TB+ accounts get 7 days + 4 weeks.

However, if you want a custom schedule (say, 30 days, 8 weeks, and 6 months), it only costs more if the space the snapshots use exceeds your existing account's quota.

So, if you have a 100 GB account and you use 60 GB and your fancy snapshot schedule (of which the first 7 days is always free) only uses 35 GB ... then there is no additional cost at all.


rsync.net looks awesome, but it is just way too expensive for me.


You should email us and ask about the "new customer HN reader discount".

Also, note that 1TB accounts are 15c per GB, per month, and 10TB accounts are 7c per GB, per month - all with no traffic or usage costs.

This compares very favorably with S3 and blows the Mozy Pro pricing out of the water[1].

[1] https://mozy.com/product/mozy/business


Have you asked for the HN discount?

https://news.ycombinator.com/item?id=6554313

(I happen to get a discount for being a prgmr.com user, but IIRC the terms are similar.)


Where can I learn more about using rsync and ZFS?


Bup is lovely. I used it to back up my huge home folder and only switched away to rdiff-backup because (at the time) there was no support for deleting old revisions.
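
For reference, the basic workflow looked roughly like this (paths and backup names are just examples):

    bup init                          # create the default repository in ~/.bup
    bup index /home/me                # build/refresh the file index
    bup save -n home /home/me         # write a deduplicated backup set named "home"
    bup save -r backuphost: -n home /home/me   # or save straight to a remote bup server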

Is there any support for that? (Of course, for a large enough hard drive, it's not much of a problem...)


People have been actively working on a "prune" feature but it seems to never quite get finished. This would indeed be nice to have, although it's less important than you might think, given really good deduplication (which bup has). Currently bup has a very simple model - never delete anything - which is hard to screw up, so you're very unlikely to lose data.


I really have to finish that feature...


I wrote ddar, which is basically this but solves that particular problem by using something other than the git packfile format.

http://www.synctus.com/ddar and http://github.com/basak/ddar

It's recently been made available on Homebrew, too.


Nice. Could you add some description there of how the deduplication works? I assume it uses rolling checksums to create chunk boundaries, just like bup?


> I assume it uses rolling checksums to create chunk boundaries, just like bup?

That's right.


I don't see a link to the source code on your page. Is it just me?


See my github link for the latest source. The other page predates my uploading to github, so there's only a tarball source download there. I should move everything over to github now, really.


How actively is it developed? How reliable is it currently?


> How actively is it developed?

It's a simple tool that does a simple job. It's pretty much done.

> How reliable is it currently?

No known bugs.


Looking at the README under "Stuff that is stupid":

  > bup currently has no way to prune old backups

Thanks for the rdiff-backup shout-out. I'm looking for a nice way to do system backups of large VM images to my NAS without having to install CrashPlan. Bup and rdiff-backup both look pretty good.


Obnam is also great: http://liw.fi/obnam/

This does deduplication, but can also encrypt the deduplicated blocks and store them on (say) S3.


Obnam can also push/pull backups over SSH/SFTP. Seconded.


I've been working on that feature for some time. I'll try to finish it for the next release! If you want to test my existing code contact me.


How do people who would use this kind of thing manage to have remote servers with terabytes of available disk space on them?

Anything is possible with money, of course, but how is this anything other than really expensive?

For example, AWS S3 would be $235/month (that's $2,820/year!) for 3 TB, not even including any data-out transfer charges. Sure, there are others that are cheaper, but only marginally so.

Is this really what people are doing? Makes the commercial services sound really cheap.


My suggestion is to back up your cloud servers, which are expensive and redundant and have good uplink speeds, to home servers, which are cheap and have good downlink speeds. You don't need your backup file server to be ultra-reliable or even up all the time, so the cheapest possible PC sitting on a home internet connection is a pretty good choice. That way, 3TB is just $150 or so plus your electricity, and it's not a per-month fee.


I'm curious how you would quickly restore that much data from a home uplink back to the cloud?


You would send the hard drive via snail-mail to Amazon.

A mass restoration is expected to be rare, so it's okay for it to be a bit more expensive.


If you ignore the cloud services and rent dedicated servers, you can get up to 6 TB of disk space for $50 a month.

edit: 45 TB for €300/month: https://www.hetzner.de/en/hosting/produkte_rootserver/xs29


Unless you don't need any redundancy (the cloud has some), that's more realistically about 39 TB (in RAID 50, if the controller supports such a setup).

If so, that's still less than €8/TB/mo, which is better than what most cloud storage providers offer. As a bonus, you also have some spare memory and CPU resources (which you could resell to others as, for example, memcached instances) and the possibility of getting a proper SLA.


My strategy is:

1. Regularly back up "important" directories (code/, papers/, web/, etc.) to fairly safe/redundant cloud storage with incremental history. I have pretty little of this, <50 GB, so it's not super-expensive.

2. Occasionally exchange bulk but less-important backups with my brother, so we're each the other's high-latency, questionable-durability "off-site backup". No incremental dumps here, just rsync. This is where my MP3 collection, DVD rips, and the like go.

3. Photos, which are important but also bulk, go to Flickr, which is free.

4. Don't back up stuff I can re-acquire, e.g. big public datasets I've downloaded to work on, or Debian ISOs. Also, I don't back up the OS, just my data.

There do, however, seem to be some cloud services that offer big full-disk backups for a surprisingly low flat price, e.g. http://www.backblaze.com/ is $5/mo/machine.


We use a dedicated server. It's a fairly basic machine with a single Xeon CPU, 32 GB of RAM and a lot of drive bays, which I think doesn't cost much. It currently has 16 x 3 TB drives in RAID 50. RAID 50 isn't great for performance, but it's fast enough to saturate gigabit in sequential operations (which backups are). So it has 32 TB of useful storage for a price of around €700 per month (a leased server at a high-end hosting company, so it could be much cheaper if you buy it yourself or use a cheaper provider). Per TB that's around €21.5 per month. Our reasons for doing this weren't based on the price of cloud storage, though; they were based on having the data on our own machine, with disk encryption and a connection only to our internal management network rather than the public internet.


You shouldn't think about backups as paying for $/GB. You should think of it as paying for the ability to restore correctly and instantly when you want to restore. That's where the value really lies.


AWS is never the cheapest option. AWS is great when you're quickly scaling up and down, but the flexibility comes at a (high) cost.

For backups, you're dealing with a relatively consistent or predictable amount of data. Buy the appropriate dedicated server for your needs.


Remember that there's a huge difference between files you need to read and write at any time with low latency, and backups, which happen at longer intervals and are read infrequently, with looser latency demands. For backups, the service to compare is Glacier, which is about an order of magnitude cheaper.

What I'd consider is essentially the CrashPlan model: P2P / external backups locally (i.e. at full LAN speed) and an off-site replica, which can be cheaper and slower as long as you have high confidence that it'll be available eventually. That way normal operations are fast, but if the building burns down you're covered and presumably have higher priorities than waiting for a restore to run.


AWS is for startups with venture capital, ephemeral storage and compute, or someone needing to deploy a high traffic website instantly.


I hear a lot about rdiff-backup, but I think it has two main drawbacks:

* the website shows no new release since 2009.

* it has no de-duplication.

I was considering moving my 5+ year old rdiff-backup setup to one of these new, promising programs:

* obnam [http://liw.fi/obnam/]

* attic [https://pythonhosted.org/Attic/]

They both do automatic de-duplication, old backup deletion and remote encryption.


The original rdiff-backup author went on to create duplicity. Maybe a big part of the rdiff-backup community has followed him?

I'm using obnam now.


For a link, http://duplicity.nongnu.org/

I back up locally using duplicity, then ship those files off to Amazon Glacier using mt-aws.

https://github.com/vsespb/mt-aws-glacier
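
Roughly like this; the paths, vault and config names are placeholders, and the mtglacier options are from memory, so check its README:

    # encrypted, incremental backup to a local directory with duplicity
    duplicity /home/me file:///backup/duplicity
    # throw away backup chains older than six months
    duplicity remove-older-than 6M --force file:///backup/duplicity
    # then sync the duplicity volumes up to a Glacier vault with mt-aws-glacier
    mtglacier sync --config glacier.cfg --dir /backup/duplicity \
        --vault mybackups --journal /backup/glacier.journal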


I'm assuming you're not the author. But just in case the author wanders by: How did you decide which parts to write in C?


It's not too hard, actually. A line of python is roughly 80x slower than a line of C (no exaggeration). But a typical line of python does a lot more than a typical line of C. So things you can do in a "loose" loop (like once per 64k block) are usually ok in python. Things you have to do in a "tight" loop (like once per byte) need to be in C.

I once did a presentation about python performance optimization lessons from bup: http://lanyrd.com/2011/pycodeconf/sghxk/

And it's true, I'm not the most active maintainer anymore. The people who took over seem to be doing a pretty good job though.


I'd be curious to know whether your stance on PyPy has changed at all since 2011 (if it's something you've taken a fresh look at since), given their progress in that time.

I would humbly submit, at the least, that my own position has shifted to seeing PyPy as a viable option for high-speed code (albeit in substantial part due to its better interaction with C these days).


Not the author either, but I've made that decision based on profiling the code. It's generally easy to see which functions are hurting performance and could be rewritten in a faster language.


The author is https://news.ycombinator.com/user?id=apenwarr, though it looks like he isn't the most active maintainer nowadays.


The most efficient backup system I've used so far for operating systems is tarsnap. The only drawback is that restores are really slow.


I love Tarsnap, but its S3 storage costs aren't exactly brilliant. Figuring out exactly how you want to store your keys is also something to think about upfront (albeit a consideration that arises from the increased security you get 'for free').

Alternatively, CrashPlan and other consumer-style services have a bad habit of using very slow, heavy, system-wide file scanning that drags the whole machine down. :/

Having said this, I'm fairly certain I've seen lengthy discussions of the merits and flaws of Tarsnap and similar backup services on similar HN posts.

(https://news.ycombinator.com/item?id=5767116 is a good source for lots of that sort of discussion)


Not really closely related, but another solution that uses git infrastructure to back up large files is git-annex:

https://git-annex.branchable.com/
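
A minimal sketch of the idea (the remote, repo and file names are placeholders, and the remote repo needs git-annex initialised too):

    git init ~/big && cd ~/big
    git annex init "laptop"
    git annex add disk-image.qcow2      # git tracks a symlink; the content goes into the annex
    git commit -m "add disk image"
    git remote add backuphost ssh://backuphost/~/big.git
    git annex copy --to backuphost disk-image.qcow2   # ship the large content to the remote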


... and there are some tangential benefits to using git-annex as well:

http://rsync.net/products/git-annex-pricing.html


and git-annex has a bup backend :)


So, it's 2014 and people still use homegrown variations of tar, rsync, git and whatnot. Or a half-done solution like this, or an abandoned solution like Box Backup.

Why on earth isn't there already a perfect cross-platform open source backup program? :)

I know, I know... why don't I make one myself? Because we don't need another half-done solution :-b


Every "perfect cross platform open source program" had to start as a homegrown variation of tar, rsync, git, and whatnot. It's not like they fall out of the sky.


Of course things don't come out of nothing. I just find it so odd that we have top-quality open source OSes, monitoring systems, programming languages, IDEs, browsers, graphics suites, etc., but no backup...


Not having a backup isn't a "pain point"... until you have a disaster. And disasters are infrequent enough...


(Don't mean to say this is a bad product or anything, just not a finished product...)


you can help finish it


This is an outstanding project with great potential.

A killer for many overpriced commercial services.


Where would you host the backups?



Like here: https://bluevm.com/cart.php?gid=42 (search for '100 GB').


A 'backup system' that runs on Python. How oxymoronic.


Seriously. I write all my critical software in assembly so that my super-fast disks and networks aren't bottlenecked by unnecessary CPU instructions! Backup software always values speed over correctness!


You mean like rdiff-backup, which I've been using in production against millions of files for more than half a decade without a single problem?

It may be that some languages or runtimes yield consistently more reliable software than others, but I'd bet that the individual programmer, their coding style and their practices have more of an effect on reliability.


How is that different from a shell script that calls programs?



