Encrypted Backup Shootout (acha.ninja)
208 points by andrewchambers on Jan 3, 2021 | 107 comments



I get that performance is interesting to graph, but it's very much secondary in importance when compared to the backup solution being bulletproof. I've found encrypted Borg very difficult to get wrong and setup is very simple. I've also successfully recovered two systems with the tool without issue.

Not saying that Borg is necessarily the best solution, just that we should evaluate the important metrics.


>but it's very much secondary in importance when compared to the backup solution being bulletproof

To an extent, anyway. I moved to Borgmatic precisely because the restore speed on Tarsnap made it unsuitable for my needs. Unfortunately I didn't uncover this problem until I actually needed to restore, which extended an outage by about half a day. Having a vague sense of the tool's performance is a good thing, I think.


I moved to Borgmatic precisely because the restore speed on Tarsnap made it unsuitable for my needs.

Yes, as long-time Tarsnap users, we got a very unpleasant shock a few months ago when our main server literally went pop one morning and we needed to do a full restore of everything to get the replacement up and running. It took many times longer than we were expecting to download all the data, and consequently caused us extra days of unnecessary downtime.

To his credit, Colin was responsive when we asked for advice and did suggest something we could try to make things a bit faster (which it did). But really, when you’ve just had a catastrophic failure, messing around hand-holding your backup tool just to get your essential data restored as quickly as possible shouldn’t be needed. Backup tools need to Just Work, quickly and reliably, every time.

Given that Tarsnap also seems to work out very expensive these days, it’s now in the category of software where I’m happy to have used it and it did do its job even if not ideally, but unless there are dramatic improvements on both counts in the near future I think we’ll probably be looking for another option.


> I get that performance is interesting to graph, but it's very much secondary in importance when compared to the backup solution being bulletproof.

No. I need a backup that consistently finishes before I close the lid to put my laptop in my backpack at the end of the day. Too slow -> no backup.


Unless daily work datasets are huge, you would be hard pressed to find a backup system that is unable to pull that off, though it's unclear from your description when you consider your backup to start.

My multi-daily backups (always incremental in Kopia) take 3 minutes, with a working set of 4 million files.


Many dev environments come in virtual machine images. On a good day they're throwaway Vagrant builds, but sometimes there's a significant manual setup required and you want to include them in a backup.

Incidentally, some backup utilities run into unexplainable performance issues trying to delta diff very large files such as VM images. Rsync in particular sometimes stalls completely when trying to sync a 30GB VM image to a remote server.


I think Kopia should deal fine with this, though I don't think I have many such large files that I modify.

The reason why I think it's fine is that it doesn't really try to do "delta diffs" at all. Instead it uses a rolling hash to cut the input file into variable-length chunks; those chunks are added to the chunk storage, and the actual files are then constructed from those chunks.

Obviously such copies should be done when the files are stable, so either not running or using a snapshot. (If using LVM, this can be arranged easily with any filesystem, except ones that don't like getting mounted multiple times..)
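
For example, with LVM the dance is roughly the following (volume names are made up; the backup step is whatever tool you use):

    # freeze a point-in-time view of the VM-image volume
    lvcreate --snapshot --size 5G --name vm-snap /dev/vg0/vm-images
    mount -o ro /dev/vg0/vm-snap /mnt/vm-snap
    # run your backup tool against /mnt/vm-snap here
    umount /mnt/vm-snap
    lvremove -y /dev/vg0/vm-snap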

So while I guess size-related bugs could be there, there's no fundamental property that makes the task harder as the files themselves grow bigger.


There are backup tools that can make a complete backup of your VM while it is still running, at least for VMware virtual machines [0].

You can schedule the backup to start some time before you leave at the end of the day.

[0] https://www.vimalin.com

disclaimer: I am the author of vimalin


Why would you sync the VM image instead of doing a file level backup from within the VM? File-level backups lets you control include/exclude lists (no need to back up /var/tmp for example, or exclude *.dbf files as they should be backed up with an appropriate database plugin that does a hot base backup, then grabs transaction logs afterwards).
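
For instance, with restic that's roughly (repository URL and patterns are placeholders):

    restic -r sftp:backup@backuphost:/srv/restic backup / \
        --one-file-system \
        --exclude /var/tmp \
        --exclude '*.dbf'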


The main purpose of my backup is for getting back to doing invoiceable hours after the statistically inevitable hardware failure or misplacement (I've yet to lose a laptop, fingers crossed). It's not that the data on those VMs is irreplaceable; any significant work output will be in a VCS outside my laptop anyway. It's that rebuilding it from pieces in case of loss requires work that no client will pay for.

That being the case, I much prefer being able to re-image my home dir partition onto a new laptop/SSD and get back to paying work, over having to fiddle with individual VMs that only exist as partial backups.

Edit: I've actually considered the opposite of what you suggest, just doing a raw partition image for backup. SSD is really fast, ethernet is fast, and the runtime would be consistent. I just haven't gotten around to trying it in practice.


How does developing a backup application differ from developing other projects?

What do you test differently? What do you avoid changing? How do you avoid breaking things? How do you get enough beta testers when the costs of failure are so extreme?


Each project can have more or less the same metrics, but different priorities. In a backup tool for example, integrity > restore speed > backup speed (unless the backup speed is so slow that you cannot backup). In a project that can afford to lose data, speed may be more important than integrity.


Worth noting that Restic is suitable for Windows. Not sure if Bupstash was tested on macOS. Borg also has FreeBSD support and experimental support for Windows via WSL2 and Cygwin.


I was facing this problem when selecting a backup strategy for client machines. Servers are one thing where i can rely on them to be always running, but clients go off grid occasionally.

As for restic, yes it's available on Windows, but has a serious speed degradation when datasets increase above 2TB, mainly due to chunks being 2-5MB, where Borg uses 400MB chunks. And we're not talking a 10% slowdown. It's to the point where prune operations on repositories >2TB can/will take multiple days to complete.

For clients, I evaluated multiple solutions, one was Duplicati (https://www.duplicati.com/), and while it appears to work well most of the time, it sometimes stops backing up for no apparent reason. Backuppc also looked promising, but doesn't (didn't ?) support source encrypted backups.

In the end i ended up with Arq (https://www.arqbackup.com/) for Windows and Mac clients. It works reliably, and checks all the right boxes.

For servers i was initially really hooked on restic as it allows backups over HTTPS instead of SSH, but the performance issues were a showstopper.


It would be nice to see kopia.io in the comparison.

Kopia is backup software based on similar concepts to other modern apps (infinite increments, rolling-hash-based deduplication, immutable blobs, encryption) but has some additional things going for it, such as support for a wide array of backends (including S3; I use it on my small Ceph cluster), repo synchronization (using the same array of backends), and the ability to do concurrent backups on the same repo - something borgbackup still lacks.

Actually that last bit is the reason I chose to go with it at all, and I have not regretted it since, though beyond testing I have only recovered small trees from it. It can back up multiple computers to the same repository (with deduplication) at different times without a lengthy index rebuild. You can also do this with an intermediate server if you want clients to have different credentials to the repository (this is also quite difficult with borgbackup), but I haven't set that up yet.
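
The day-to-day usage is small, roughly like this (filesystem backend shown, paths made up; S3 works the same way via a different repository create subcommand):

    # first machine: create the repository and take a snapshot
    kopia repository create filesystem --path /mnt/backup/kopia-repo
    kopia snapshot create /home/alice

    # second machine: connect to the existing repository and snapshot into it
    kopia repository connect filesystem --path /mnt/backup/kopia-repo
    kopia snapshot create /home/bob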

I guess performance-wise it doesn't hurt that it's written in Go, though I don't have good numbers on that because my backup storage is not the fastest available :). It's able to do concurrent access to S3, which helps a lot.


How much does it take CPU wise on the client? I use arq which does need an app on the server side, but it's bog slow in scanning your client and uploading new chunks, taking up a significant amount of CPU. While something like carbon copy cloner is orders of magnitude faster.

I'm upgrading to an M1 mac soon although, so that might not be as much of an issue anymore.


Difficult to compare. I mean high CPU usage is good, it gets its job done faster, right?-) At least it is able to use multiple cores for it.

Just started a backup and it seems the CPU usage hovered at about 400%, varying from 90% to 900% at first; then I guess some second phase consumed 150%-200% CPU. I have an AMD Ryzen 3900X.

The increment seemed to take about 3 minutes with my 4M files. No idea really how much it changed compared to previous one, probably very little. Backup size was 320 gigabytes, and IO-wise the backing Ceph S3 was probably the bottleneck, not the local SSD.


I created nFreezer ("encrypted freezer") for this purpose at the end of 2020: https://github.com/josephernest/nfreezer.

Main features:

* encrypted at rest locally (the data is never decrypted on the remote server)

* handles file renames / moves (no data is re-sent, it can save GB of transfer!)

* single Python file of ~ 250 lines of code. It is quick to read the code and decide if you like/trust it or not.


As a borg user I've been quite happy after iterating between rolling my own and managing the headaches.

My most recent memory of borg was when I realized I'd been backing up superfluous files I didn't expect. I thought I'd be stuck with them until the end of time because they've made it in to my long term snapshots. Fortunately `borg recreate` did exactly what I needed and didn't feel as dangerous as the page suggests.
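
For reference, the invocation is roughly this (repo path and pattern are placeholders); it rewrites the affected archives in place:

    borg recreate --exclude 'home/*/superfluous-dir' /path/to/repo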


I have never heard of bupstash before. I am a happy user of borg and my first reaction was, why another backup tool? But then I noticed that bupstash is written in Rust and if I understand it correctly, only requires a single binary on the server? That could give it an edge, as borg requires the Python runtime. But given that bupstash is alpha, I will surely "wait and see"


That's a nice feature of Restic as well. It's written in Go and also requires a single binary.


Oh it is? I thought it was borg's predecessor. But now I see that it was called Attic. Might need to check out Restic more closely.

What I like with borg is that it has the wrapper borgmatic that simplifies scheduling to keep the right amount of backups over time. Without a similar ability of another tool I would be reluctant to switch.


Restic does the same, though on the command line. I migrated from Borg to Restic and it took half an hour or so, the two tools are really similar.
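
For example, retention plus pruning is one command (values are just illustrative):

    restic -r /srv/restic-repo forget --prune \
        --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --keep-yearly 2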

What sold me on Restic is that it doesn't actually require a binary on the server. I don't know how much slower it is without one, though.


I wonder if borg plays well with e.g. PyOxidizer. Technically you still require the Python runtime, but it's embedded, so you've still got just a single file.


The borg releases are provided as a single self-sufficient binary, created with pyinstaller.


HashBackup (author here) is written in Python and has a single binary with only system shared libraries such as libc as dependencies. It isn't particularly easy to do, but necessary IMO for a backup tool if you want to recover a server.


Sorry if this is a naive question, but how comparable would bupstash be to Duplicity [0] (commonly used via its wrapper, Deja Dup [1])?

Duplicity uses librsync and GnuPG; it also uses rdiff/rdiffdir to produce tarred incremental snapshots, if I recall correctly.

[0]: http://duplicity.nongnu.org/

[1]: https://wiki.gnome.org/Apps/DejaDup/Details


> The snapshots are all made to tmpfs so hopefully does not measure delays introduced by the network or disk activity.

This by itself would make an interesting benchmark. In the previous company that I worked for, there was a customer who used Veeam Backup to a disk backed by Ceph, in an 8+3 erasure-coded pool. And they were complaining to us about disk space overuse by Ceph and poor performance. We traced that to the fact that Veeam uses 128KB transfer size (too small) and excessive fsyncs.


I keep thinking of btrfs, where one can snapshot the fs directly, almost instantly at nearly zero storage cost. And then send around incremental diff between snapshots.

Alas, as the FAQ states, there is no native encryption in btrfs. https://btrfs.wiki.kernel.org/index.php/FAQ#Does_btrfs_suppo...

Interestingly bupstash documents how to store a btrfs send "image" into bupstash. I imagine the primary use for this is adding encryption (otherwise why use bupstash at all? perhaps simple uniformity I suppose, if one already used bupstash): https://bupstash.io/doc/guides/Filesystem%20Backups.html#Btr...


I think both bcachefs and zfs have snapshots and encryption. I don't know if you can send incremental snapshots with bcachefs, but you definitely can with zfs.
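
A rough sketch with zfs (pool/dataset names made up); a raw send even lets the receiving side store the data without ever seeing the key:

    zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/home
    zfs snapshot tank/home@monday
    # initial full replication; --raw ships the encrypted blocks as-is, so backuphost never needs the key
    zfs send --raw tank/home@monday | ssh backuphost zfs receive backup/home
    zfs snapshot tank/home@tuesday
    # subsequent runs only send the delta between snapshots
    zfs send --raw -i tank/home@monday tank/home@tuesday | ssh backuphost zfs receive backup/home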


Personally, I just use BTRFS on top of LUKS. On my remote backup server there's a little bit of weirdness, but it works out so that only my personal laptop has the key, which is of course itself password protected.


my hope would be that btrfs subvolumes could each be keyed differently, which would enable finer granularity of management than whole disk at a time.

and ideally, the contents could be encrypted but still send-receiveable.

thus, I could have an encrypted subvolume of, say, my photos, and send a snapshot of that over to a friend for backup. and they could send me some of their subvolumes, again securely, along with incremental updates over time.

with LUKS, we'd need to agree ahead of time how big a partition to make for each other's encrypted contents & there's not really any way to ensure, when I use my key to unlock & incrementally update my subvolume, that my friend "forgets" and unmounts the subvolume after I update it. Big ask, but if btrfs did add encryption, it'd truly be the only file system tool we needed.


restic used to be my tool of choice in the past, but I unfortunately ran into an issue where a single corrupted blob caused many files to be unrecoverable. Thankfully, I noticed it when testing the backups so I didn't actually lose any data.

I've since switched to using encrypted rclone backups due to the design decision of having each source file map to one target file. For my specific use case, it's more important that corruption of backup files has minimal impact vs. having good deduplication. I also like that the format is simple enough where I feel confident I could write an extraction tool that skips checksum validation or bad blocks if I really need to.


Yeah, this is one downside of these tools: they amplify the risk of corruption unless you take additional redundancy measures.


Did you test the backups with Restic's built in test command, or by restoring?


I tested by attempting to do a full restore to a temp directory.


Do you remember if `restic check` said that everything was fine?
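
(For anyone reading along: a plain check only verifies the repository structure; actually reading back the pack data needs the extra flag, e.g.:)

    restic -r "${repo}" check --read-data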


Unfortunately I don't remember. From the (little) notes I kept, the corruption occurred with restic version 0.9.3. I had also tried restoring with the latest code from git master, but I didn't make a note of the exact commit (it didn't make a difference anyway).

The commands that triggered the corruption were the following. I tested the restore around the 10th run of this.

    repo=sftp:chenxiaolong@<ip address>:/home/restic
    restic -r "${repo}" backup -x /home
    restic -r "${repo}" backup -x --exclude /stuff/Android/AOSP /stuff
    restic -r "${repo}" forget --keep-last 3 --prune


Does this risk exist with un-encrypted Borg back ups also?

(Compressed deduped also doesn’t map file to file).


Almost certainly, yes. It's an inherent risk with any backup system that deduplicates at a block level, encrypted or not -- if the same block appears in many files, then it's stored only once (which is good in terms of space efficiency), but if that block is corrupted, then all the files sharing that block will be affected. Incidentally, this is not just an issue with backup systems; almost any compressed data format will also have similar risks. It's a tradeoff of the safety in redundancy for space efficiency.
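
One partial mitigation, at the cost of extra space, is to add external parity over the repository files, e.g. with par2 (a rough sketch; paths are made up, and it needs re-running as the repository changes):

    # add ~10% recovery data covering the repository's pack files
    par2 create -r10 /backups/repo/recovery.par2 /backups/repo/data/*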


Would it be possible or desirable for one of these backup systems to natively implement parity?


Having recently acquired a NAS for use (among other things) as a single source of truth for photos spanning ~15 years, I have done a lot of research around backups.

These are the threats I have considered and would like to mitigate against, in no particular order:

1. Physical damage or theft to the NAS and supporting hardware (e.g. backup drives in my home).

2. Accidental deletion or corruption of files through user error.

3. Ransomware which targets my NAS. In particular a sophisticated malware author could target common cloud backup destinations by looking for credentials stored on the NAS, and delete any backups, although I have not heard of any such attacks in the wild.

The first threat seems simple enough to mitigate: make regular backups backed by cloud storage, and keep offline credentials for accessing the backup in multiple geographical locations as well as a cloud based password manager.

The second is also not troublesome: use a filesystem which supports versioning and take regular snapshots.

The third is where I have been somewhat disappointed by the options. An effective strategy is to keep an external hard drive which is plugged into the NAS regularly and keeps a clone of the data. However, an extremely cunning malware author could still pre-empt this by corrupting data on plugged in drives. This is extremely unlikely, but here's where I was hoping for better options in cloud backups: effectively all the pieces are in place for immutable backups, except for the tooling.

Options such as restic and rclone don't have good (if any) support for targets which support immutability, such as AWS S3 and BackBlaze B2. My current solution is to use a version of restic which I have patched to not require delete permissions when targeting B2, and very carefully manage API keys so that deleting backups would require compromising my BackBlaze account. In case a script does try to corrupt the backup repository, there are few if any supported ways of accessing a past version of a B2 bucket, although rclone comes very close and could support this very nicely with some minor tweaks to the B2 backend.
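
For the curious, the restricted key looks something like this with the b2 CLI (bucket and key name are placeholders; the point is omitting the deleteFiles capability):

    b2 create-key --bucket my-backups restic-append \
        listBuckets,listFiles,readFiles,writeFiles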

I will be keeping a close eye on this, and hopefully if I have the time I can make some PRs to push the open source tooling in this direction.


I am surprised you didn’t find security to be a key issue with network attached storage.

* Physical security. On 1, in case you have a synology NAS, it does not offer full disk encryption. Its folder encryption also has a number of problems. Your mitigation here (back ups) doesn’t help with loss of data to others.

* Network security. In addition to the physical security, consumer NAS devices don’t do enough in network security. Some of them come with closed source operating systems with a lot of potentially dangerous sharing and networking features. The code is often not reviewed.

On 2., you can use ZFS or btrfs, and they offer good features, but come with separate set of problems.

I spent some time on NAS security and couldn’t find a good solution. I thought I’d better let Amazon and Google secure my data.


Regarding consumer NASs having poor security, I completely agree; however, it wasn't too relevant to my personal threat model. Backups of my computers (which could potentially compromise credentials) are encrypted before they go on the NAS, and if I needed to sync anything sensitive it would be encrypted locally.

Using my NAS for sharing photos and files with family/friends opens up security holes that encryption at rest wouldn't help with, and I accept the tradeoff of potentially leaking data. What is less acceptable to me is any risk of data loss.


Does bupstash support archiving to AWS S3? That's what I love about restic. I can backup my files to cheap cloud storage and I don't have to maintain any offsite server or storage for this to work.


Work in progress I'm afraid. It does need an offsite server to provide the access controls.


It doesn't support S3 yet.

More importantly, S3 / Backblaze B2 have 99.999999999% durability, so you can be sure that your data is there when you need it


It would be interesting to see tarsnap added to the evaluation.


Not entirely sure that I'm reproducing the benchmark correctly, but running Amazon Linux 2 I have:

    # mount -t tmpfs -o size=32G tempfs tmp
    # gdir=~/linux/.git
    # mkdir tmp/linux
    # for commit in $(git --git-dir="$gdir" rev-list v5.9 | head -n 20 | tac); do \
          mkdir ~/tmp/linux/$commit; \
          git "--git-dir=$gdir" archive "$commit" | tar -C ~/tmp/linux/$commit -xf -; \
      done
    # ~ec2-user/tarsnap-autoconf-1.0.39/tarsnap -c --dry-run --print-stats tmp
    tarsnap: Performing dry-run archival without keys
             (sizes may be slightly inaccurate)
                                           Total size  Compressed size
    All archives                          20278980504       4406609078
      (unique data)                        1767400183        258587867
    This archive                          20278980504       4406609078
    New data                               1767400183        258587867
So (assuming I reproduced the benchmark correctly!) that's 258 MB, or about 1/3 less space than bupstash.

I also had "while true; do ps aux | grep tarsnap | grep -v grep; sleep 5; done" running in a different window to monitor the RSS (I don't know what the right flags are to time(1) to get that data on Linux) and it maxed out at 11472 kB.
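
(If GNU time is installed, its verbose mode reports peak RSS directly, e.g.:)

    /usr/bin/time -v tarsnap -c --dry-run --print-stats tmp 2>&1 | grep 'Maximum resident set size'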

I suspect that Tarsnap is slower than the other options, though. That's something which I'm not able to compare easily.

If the author is interested in adding Tarsnap to his comparison, I'd love to help (and can provide a free account for benchmarking).


Since you have the well-known "ps aux | grep $THING | grep -v grep" pattern in there, I'll plug one of my favorite shell functions:

  psgrep () {
    ps aux | sed -n '1p;/\<sed\>/d;/'"$1"'/p'
  }
This does basically the same, but includes the header line from `ps aux` in the output. For instance:

  $ psgrep alacritty
  USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  majewsky    3500  0.4  0.5 1427280 93644 ?       Sl   12:55   0:01 alacritty


Interesting! Might also be useful for CSV mogrification then.


Really interesting to see, thanks for doing that, I feel like many fast and good rolling hash functions are yet to be invented.


I’ve had no experience with the other tools mentioned in this post so I cant provide benchmark feedback, but just wanted to say that I’ve been a happy tarsnap customer for a couple of years now. It’s well documented and does exactly what it’s supposed to. Well done.


I once had a spectacular accident where duplicity managed to delete most of my local file system on a (misguided) partial restore attempt.

After that I went back to good old tar/cpio (for incrementals)+gpg and haven't looked back. While they have their quirks, they're at least well understood quirks, and they're never going to delete your file system under you.


> and they're never going to delete your file system under you.

So you haven't used tar with --recursive-unlink yet?


I knew someone would come up with a counter example. Or restoring over the block device.

But at least these mistakes are easy to avoid. While with duplicity destruction was fairly easy.


Author here - I agree, I can't stress enough how important it is for a backup system to be totally reliable.

I hope to do future posts comparing reliability in adverse conditions such as low memory, full disk, etc.


In rust there's also Proxmox Backup Server:

https://pbs.proxmox.com/wiki/index.php/Main_Page

Can be used without proxmox, has a web UI.


Proxmox BS is recent and looks very good. The only problem, at the moment, is that the client is only available for a small set of platforms/OSes. Also no bare metal recovery of a proxmox VE installation (AFAIK).


I have used https://www.arqbackup.com (commercial, Mac and Win only) with local encryption for many years. It supports tons of local, network, and cloud storage endpoints. They do have FLOSS restore utilities at https://arqbackup.github.io should the app ever die. IDK if they'll ever support Linux on the backup side, but it's the bee's knees IMO.


ZFS is a helluva encrypted backup tool. After ZFS got encryption, I switched almost completely to it. I rsync to ZFS drives when the data isn't already on ZFS; then it's just snapshot sending and receiving.


Having dealt with this at some scale.

Simplicity is everything. Skip the encryption - seriously. Or do what Microsoft does and send the keys somewhere you can absolutely get them (BitLocker can be forced to back up its keys into AD).

You want good access control and fantastic ability to recover. Your transfer will be over a secure link.

I did the client side encryption. I'm not convinced it's worth it or I have secrets so important that I need to worry about someone at AWS reading my S3 bucket (which has its own encryption).


> Simplicity is everything. Skip the encryption - seriously.

I think it's reasonable to assume that anything we upload "to the internet" will stay there forever and there is a chance it'll become public at some moment in the future. Anyone can become a subject of scrutiny by legal or state actors, or a high-value target for blackmailers. If it happens it'll be at the worst moment and will incur major costs.


Encrypting data at rest, including backups can easily become a regulatory item. Of course there is a risk that you will lose your keys, so don't and check your back-ups integrity and your ability to restore frequently so that when the time comes you will have it down to a routine.

The alternative is that one day you find that your precious backups have taken a stroll through the countryside and end up on pastebin or something worse depending on the contents. Backups are typically much less secure than the servers they are copies of, hence the good practice of encrypting them.

If your backups do not contain data that would embarrass you or someone else if it should get lost then you probably shouldn't encrypt on the off chance that you will lose access to the keys. Lots of scientific data would fall under that description (but not all, for instance, studies could easily contain PII or sensitive info).


Exactly this: people don't realize that encryption involves considering risk of losing your data (lost keys) against risk of someone seeing your data (no encryption).


Most cheap storage services available to people for offsite backup in the global west are subject to military surveillance, making the latter risk probability approach 1 if you visit certain websites or search for certain terms.


If you are that concerned about encryption keys, just print them and store in some place off-site. Make as many copies as you wish.

But don't send data in bulk into the internet without encryption. If you are not reviewing what is there before you send, then you may be sending anything.

Also, don't trust a service that will send your keys to some place you don't control.


I'm pretty much in agreement, for the most part. Which is why I didn't put encryption as a priority on my backup tool for quite a while.

Two things that you'd want encryption on

1) for compliance (dealing with customer data, patient medical records etc).

2) Lots of secrets in your browser data directory (logged in session IDs, cookie values, site data).


For home stuff, I go for simplicity and idiot-proofness. Run everything in VMs, and backup the disk image. I don’t even trust myself with vhdx snapshots. Copying the vhdx between two nvme drives first to reduce the VM down time, then to NAS.


I agree, I think with access controls the encryption is not always worth it. I am going to add an optional way to skip encryption to bupstash for this reason.


In the EU with GDPR it is nearly impossible to skip encryption.

Access control would need to be very stringent, doubly so if you don't control the hardware.

If you lose that data because you didn't encrypt it, it can cost you millions (the trade-off being that if you lose the keys, your company might be toast).


https://github.com/andrewchambers/bupstash#stability-and-bac...

> Bupstash is alpha software, while all efforts are made to keep bupstash bug free, we currently recommend using bupstash for making REDUNDANT backups where failure can be tolerated.


Interesting, I've been using bup[1] for years but it looks like bupstash is scratching the same itches I had.

[1]: https://github.com/bup/bup


Thanks for that important tidbit! Looks like alpha software that might well be worth keeping an eye on, though.


gzip is a particularly poor performer when it comes to spotting repetition across larger amounts of data. Use bzip2 instead if you want to give home-baked tarballs a somewhat fair competition against Bupstash etc.


Or lzop for speed, while still being better than gzip.


For speed, LZ4 is great.


There's also lrzip for large files: https://github.com/ckolivas/lrzip


Thanks for the suggestion, I mainly used gzip as that is what my muscle memory goes for when compressing tarballs, it is a fair criticism.


I found out that xz -1 produces smaller files than gzip -6, faster. That's what we use now.


You should try zstd, it often beats xz. xz beats zstd only in high compression scenarios (-9).


How does that compare to xz?


In my experience it depends on the type of data, though I've never seen it do better or worse by more than a negligible amount (on real world data; I'm sure you could find the type it's better at and exploit that).


Or zstd?


Isn't zstd more focused on performance without too big a decrease in compression ratio, rather than on getting better compression ratios? I might be mistaken.


zstd's performance curve is entirely within gzip's. For every level of gzip compression, zstd has a level with better compression at the same speed, and a level with the same (or better) compression at a higher speed. Note that this is all for single core; zstd can also run in parallel.
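
Easy enough to sanity-check on your own data, e.g.:

    # compare output size (bytes) and wall time at comparable levels, single-threaded
    time tar -cf - ./data | gzip -6 | wc -c
    time tar -cf - ./data | zstd -3 | wc -c
    time tar -cf - ./data | xz -1   | wc -c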


That's really cool, thanks for the info!


xz for the most part does a much better job than bzip2.


Both gzip and bzip2 are very old. You should use zstd.


Or xz, my current favorite.


Let's make a list:

attic (python) - https://github.com/jborg/attic

borg (python) - https://github.com/borgbackup/borg

bupstash (rust) - https://github.com/andrewchambers/bupstash

duplicacy (go) - https://github.com/gilbertchen/duplicacy

duplicati (c#) - https://github.com/duplicati/duplicati

duplicity (python) - https://github.com/henrysher/duplicity

kopia (go) - https://github.com/kopia/kopia

nfreezer (python) - https://github.com/josephernest/nfreezer

rdedup (rust) - https://github.com/dpc/rdedup

restic (go) - https://github.com/restic/restic

rclone (go) - https://github.com/rclone/rclone

rsnapshot (perl) - https://github.com/rsnapshot/rsnapshot

snebu (c) - https://github.com/derekp7/snebu

tarsnap (c) - https://github.com/Tarsnap/tarsnap

I think there are many more out there (https://github.com/restic/others) - I personally use

  restic
while technology-wise (speed, only restore needs a password) I would prefer

  rdedup
which is an impressive piece of software but unfortunately without file iterator... :-)


Would be curious how it compares with Duplicacy (which seems to be really popular) as well as Kopia, which is newer.


The tar+gzip+gpg runs first, and doesn't get the benefit of the OS buffer cache. Maybe clear the buffer cache between methods.

I'd also be interested how well running gpg before gzip in the pipeline works. It would certainly change the space used.


(Full disclosure -- I'm the primary maintainer / author of Snebu)

I would like to see how these compare against Snebu (https://www.snebu.com), now that it supports encryption. The project has been around for a number of years, but had a bit of a code refactoring when encryption was added.

The interesting approach it takes is that it uses GNU tar to grab the files, optionally passes it through the included Tarcrypt (https://www.snebu.com/tarcrypt.html) filter (which LZOP-compresses then encrypts the file contents while keeping tar-compliant headers within the tar file). The tar file is then ingested by the Snebu backend, each backed up file is written to the data store (vault) and compressed (if it arrived in plain text, instead of processed via tarcrypt). The metadata (filenames, size, owner, permissions... etc, client name, backupdate, retention schedule) is stored in an SQLite DB (the backup catalog).

The encryption (handled exclusively by tarcrypt so the audit footprint is small) uses public key encryption (in the usual manner -- the RSA public key encrypts a randomly generated key, which in turn is used with AES-256). You can specify multiple RSA key files, so you can have a backup key (or keys) if you want. You can also change keys any time. Since backupsets are snapshot/deduplicated, you may (most likely will) have files in a restore set that are encrypted with different RSA keys. Not a problem, you will be prompted for the passphrase for each one at the beginning of the restore (and if you use the same passphrase when changing out keys, you only have to enter it once).
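
(Not tarcrypt's actual on-disk format, just an openssl sketch of the general hybrid shape described above, with made-up filenames:)

    # random session key, AES-256 for the bulk data, RSA public key to wrap the session key
    openssl rand -hex 32 -out session.key
    openssl enc -aes-256-cbc -pbkdf2 -pass file:session.key -in files.tar -out files.tar.enc
    openssl pkeyutl -encrypt -pubin -inkey backup_rsa_pub.pem -in session.key -out session.key.wrapped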

(BTW, tarcrypt can be used outside of Snebu -- it may eventually get spun off as its own project).

Other unique features that others lack:

* Key files are for a backup set, not tied to the repository

* Both push and pull backups are supported -- pull backups (initiated by the server) don't need any agent or client software installed (except tarcrypt if encryption is used)

* Granular user access permissions -- you can grant a user access to backup, but not delete, or access to restore specific hosts, etc.

* Multiple-client support. You can have dozens of clients on the same backend repository (I have about 75 or so in a development lab environment).

* Low dependency count -- doesn't require specific versions of python or other dependencies. Written in C, depends on lzo2, sqlite3, openssl.

Main drawback compared to tools such as Restic is that it is server-based, in that a client can't back up to "dumb storage".

Compared to Borg and Restic, only file-level deduplication is performed. Compression is a bit weaker (but fast), so size is a bit bigger than Borg (but smaller than Restic).

Snebu has a smaller developer base. This should be easier to fix when the internal code structure documentation is finished.

Automated end-to-end tests aren't included in the repository. This is being addressed.


> * Both push and pull backups are supported -- pull backups (initiated by the server) don't need any agent or client software installed (except tarcrypt if encryption is used)

Nice! Sounds like you can configure it so the client machine can't destroy all the backups? That's a feature I've wanted in backup software for a long time but haven't actually seen.


> Nice! Sounds like you can configure it so the client machine can't destroy all the backups? That's a feature I've wanted in backup software for a long time but haven't actually seen.

Tarsnap offers this feature. Pretty nifty, but the price is that you can’t have the client handle backup rotation (keep hourly backups for a day, daily for a week etc), the pruning must be set up via a different machine.


Yes, that is correct. You can either have the backup server ssh into your clients, or set up an account on the backup server for each client to use. In that case, 'snebu' is installed owned by the user ID snebu, and is set-uid to that ID.

When starting up, it checks if your UID is different than your EUID. If so, it looks up your UID in a table to determine what you can do (listbackups, restore, newbackup/submitfiles, expire, purge) and which backup sets it can affect.

Back on the pull backups, the backup server ssh's to a client (optionally using a non-privileged user), then sudo's to root (or if backing up files owned by a specific account, can sudo to that account). And no software is required on the client in that case (except tarcrypt if doing encryption). So it is really easy to set up backups on a fleet of servers.

BTW, I see in your profile that you are a C++ programmer -- feel free to drop by the github repository (github.com/derekp7/snebu) if you have any issues or ideas (I have Github discussions enabled also for free form chat).


> Nice! Sounds like you can configure it so the client machine can't destroy all the backups? That's a feature I've wanted in backup software for a long time but haven't actually seen.

Check out IBM Spectrum Protect. It's impressive in very many ways, also not that cheap. https://en.m.wikipedia.org/wiki/IBM_Tivoli_Storage_Manager

Also this forum is quite a goldmine: https://adsm.org/forum/index.php#tsm-ibm-tivoli-storage-mana...


borg can do that if you restrict the client's ssh key to only run borg in append-only mode: https://borgbackup.readthedocs.io/en/stable/usage/notes.html...
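
Concretely, the server-side authorized_keys entry looks roughly like this (paths and key are placeholders):

    command="borg serve --append-only --restrict-to-path /backups/client1",restrict ssh-ed25519 AAAA... client1@laptop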


With Borg you can reconfigure the chunker settings to tune memory and space usage depending on what type of files you're backing up (large VMs, small text files, etc.).

There's some notes in GitHub issues with some recommendations

There are also implications for backup speed so I'm not sure such a naive test tells much (and maybe the other programs tested have similar configurables?)

On another note, for encryption, I just copy the key to LastPass or make an AES-encrypted 7z and upload it to Dropbox/Google Drive. In big cases, the data gets synced locally to devices with the client installed.


I appreciate that all the graphs, as far as I saw, used "smaller is better", which is helpful.

However as someone else wrote, these are the metrics the author chose, they make bupstash the clear winner, and he wrote that program. So, keep that in mind.

Still, very nice presentation.


As the author, I totally agree. Especially for the deduplication benchmark, there are so many variables that affect it that it is definitely worth restating.

That being said, I am quite confident the results are accurate and don't mind other people running their own tests.


Question: did you unmount/mount the source filesystem between tests? When the size of data is less than RAM you are measuring cache speed, not so much disk speed. (And yes I realize that you have SSDs.)


source filesystem was also ramfs for these.


This happens organically. If there is a benchmark the author doesn't use, then they won't be able to improve bupstash's performance on it.

The old saying is "you manage what you measure"


This is true; a few months ago bupstash was half the speed of restic for 'put' operations, and because of my benchmarking efforts it is now faster.

I did have the benefit of being able to publish my benchmarks at the time of my choosing.


Did you also compare tarsnap and rclone? Tarsnap should be interesting.



