Tarsnap – Online backups for the truly paranoid (tarsnap.com)
139 points by type0 on Jan 28, 2017 | 74 comments



Colin is probably one of the best minds in encryption that I know of. He wrote scrypt[1] as well. My gripe has always been that while Tarsnap is a product that is clearly built for developers, it doesn't have to look and feel like the '90s or use obscure, non-intuitive billing (picodollars). The web interface and landing site need some designer/UX love for sure, but at the same time I think Colin is happy and satisfied with how things are going. Highly recommend Tarsnap.

[1] - https://en.wikipedia.org/wiki/Scrypt


I like that it works like tar and that the website works perfectly in lynx, which are two things that make my life easier when I'm dealing with a catastrophe.


Funny story about lynx: We were recently making some fixes to improve the website on mobile devices and the question came up "how should the site behave if the screen is too narrow to keep the navigation on the left side of the site?"

My immediate reply: "You see how the site behaves in lynx? It should behave like that."


"We"? Is Tarsnap no longer just you?


Yes. I hired Tarsnap's first (non-founder) employee in May 2015.


http://www.kalzumeus.com/2014/04/03/fantasy-tarsnap/ - pretty much everything you just said.


In case anyone sees this and not your other comment (which is now at the bottom of the page), I replied to it here: https://news.ycombinator.com/item?id=13505465


If you want a hand with some design work for the site, drop me a note. I'd be happy to help, and local to you.


Sounds like a poorly founded rant:

* No need to use DMS; cron emails you the output of your tasks anyway. A three-line script gets you "only email on error" (a minimal sketch follows below).

* The proposed pricing model goes against everything that tarsnap stands for, charging arbitrary amounts by placing users in arbitrary ranges.

* The proposed pricing model means a 13x price increase for me (WTF!?).

* No need to be a sysadmin or anything like it to configure it. Granted, it's not for the masses, but for anyone in the tech industry it's just a matter of RTFM.

Want something for end-users? Write a nice UI wrapper around tarsnap and overcharge that. Same backend, different product.
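
For the record, a minimal sketch of that "only email on error" wrapper from the first point (archive name and paths are placeholders):

  #!/bin/sh
  # cron mails whatever a job prints, so stay silent on success and
  # print only on failure.
  output=$(tarsnap -c -f "backup-$(date +%Y-%m-%d)" /home /etc 2>&1) ||
      printf 'tarsnap backup failed:\n%s\n' "$output"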


Tarsnap saved HN's bacon when a drive catastrophically failed shortly after I started working on it some years ago. (I no longer do.)

Cannot recommend enough.

Here's my sleep deprived post shortly after: https://news.ycombinator.com/item?id=7069013


Not sure why Tarsnap has bubbled up on HN again, but I'm always happy to answer questions about Tarsnap if there's anything anyone here wants to know.


Hi cperciva, can you please explain the following:

1) "Every Tarsnap archive acts like it is completely independent of all other archives"

Let's say I'm backing up `/somedir` that looks like this:

  # archive1 - initial backup
  somedir
  ├── file1.txt
  ├── file2.txt
  └── file3.txt

  # archive2 - 2nd backup
  somedir
  └── file3.txt (modified)
Are you suggesting that if I delete `archive1` and `rm -rf /somedir`, I can restore `/somedir` with `archive2` alone?

If so, how is that possible?

2) Does tarsnap stream the backup data to the tarsnap service or does it create the encrypted archive and only after this upload it?

For instance, if I've got 100 GB to backup (first time), do I need 100 GB of free space for tarsnap to work?


> Are you suggesting that if I delete `archive1` and `rm -rf /somedir` that I can restore `/somedir` with `archive2` alone?

If archive2 only contains somedir/file3.txt, then if you extract archive2 you'll get somedir/file3.txt but not file1.txt or file2.txt.

But if you create archive2 containing all three files, tarsnap will recognize the duplicate data in file1.txt and file2.txt and not re-upload it; but when you extract archive2 you'll get all three files. Tarsnap reference-counts blocks of data so that when you delete archive1 it only deletes the data which is not used by any of the remaining archives.

I often tell new Tarsnap users that they should start by forgetting everything they know about incremental backups. Tell Tarsnap what data you want to have in an archive, and let it do the work of figuring out what's new.
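
In command terms, that workflow looks something like this (archive names are just examples):

  # Day 1: archive the whole directory.
  tarsnap -c -f archive1 somedir
  # Day 2: archive the whole directory again; blocks already stored
  # for file1.txt and file2.txt are deduplicated, not re-uploaded.
  tarsnap -c -f archive2 somedir
  # Deleting archive1 removes only the blocks no remaining archive
  # uses; archive2 still restores all three files.
  tarsnap -d -f archive1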

> Does tarsnap stream the backup data to the tarsnap service or does it create the encrypted archive and only after this upload it?

Tarsnap uploads data as it collects it. Tarsnap needs a small amount of disk space to keep track of which blocks have been uploaded previously, but it's less than 1% of the size of the data archived.


Always wanted to try Tarsnap for personal use, but I could only budget for it if the cost were comparable to Dropbox/Microsoft OneDrive/Google Drive/iCloud/Amazon Drive, which are all ~$10/TB/mo; instead, Tarsnap seems about an order of magnitude more expensive. Or is the product just not aimed at me?

I currently use Arq[1] for backups (which has its own encryption[2] with a user choice of cloud backend), but Tarsnap has such a stellar reputation I'd definitely try it if I could :)

[1] https://www.arqbackup.com [2] AES-256 with PKCS5_PBKDF2_HMAC_SHA1 for key derivation, implemented by OpenSSL, with an open file format https://www.arqbackup.com/arq_data_format.txt


Tarsnap stores data in S3, which is the primary cost driver. For the extra expense, you get multi-region durability of up to six nines, I think.


Are you really sure you need all that storage? Tarsnap has deduplication, so you might end up using a lot less.

Or not. It depends on your use-case, but I'm just suggesting you consider that factor too.


I haven't used tarsnap either, but I see it as cold storage. Dropbox, OneDrive, etc. are all hot storage and would get rendered useless by a cryptolocker.

Can't say anything about Arq, as I've never heard of it.


How can I download backups into a different directory from where they originated?

[I was just trying out tarsnap, got to the restore-test section, and was a bit stumped. It seems like "-xC /other/dir" might do it, but the man page is pretty opaque. I'd like to test a restore without the risk of clobbering my data with some misconfigured backup!]


It works just like tar for the most part, so it extracts into whatever the current working directory happens to be. I'd just do (cd /other/dir; tarsnap -x -f whatever-archive).


Thanks, that makes sense


Downloading large amounts of data from a tarsnap archive is very slow. Any tips/tricks to speed it up?


Scott Wheeler wrote a tool for speeding up extracts by running multiple tarsnap processes: https://github.com/directededge/redsnapper

There's also the trivial (but often unhelpful) advice of "be closer to the server" -- the problem is one of round trip latencies since the internal design of tar (and thus tarsnap) works on the principle of "read a block; do something with it; repeat". This is what I feel to be the biggest technical pain point in tarsnap, and I know in theory how to fix it, but every time I've started focusing my energy on it other things have come up requiring more urgent attention. :-/


Is using spiped viable for redis servers? Can all the servers that need to connect to the redis server connect via spiped using the same secret key? Or do multiple connections mess things up? How many simultaneous incoming connections can one listener deal with?


I don't use redis, but I understand that lots of people use spiped with redis, yes. Each connection going through spiped is just another connection; spiped is intended to be a "drop in and change where you point your services at" way to make TCP connections secure.
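
A typical redis setup, sketched with illustrative hostnames and ports:

  # Generate a shared secret key and copy it to every host.
  dd if=/dev/urandom bs=32 count=1 of=keyfile
  # On the redis server: decrypt traffic arriving on port 8025 and
  # forward it to the local redis instance.
  spiped -d -s '[0.0.0.0]:8025' -t '[127.0.0.1]:6379' -k keyfile
  # On each client: encrypt connections made to a local port and
  # relay them to the server's spiped listener.
  spiped -e -s '[127.0.0.1]:6379' -t 'redis.example.com:8025' -k keyfile
  # Clients then talk to 127.0.0.1:6379 as if redis were local.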

Re simultaneous connections: The current release is limited to 500, but the next release (a matter of weeks) will be able to handle an arbitrarily large number of connections. That said, it uses select/poll based non-blocking I/O, so performance may suffer if most of your connections are idle. (This hasn't been an issue for any spiped uses I'm aware of.)


Colin,

Any interest or thoughts about adding Google Cloud Storage as an option vs AWS S3?

S3 (us-east-1) runs $0.023 per GB. Google Cloud Storage in a single region runs slightly less at $0.02 per GB.

However, bandwidth may be the biggest cost factor for you, not storage.


The price difference between S3 and Google Cloud Storage isn't enough to make a meaningful difference, particularly since I'd need to set up server code in Google Cloud to manage it.

Bandwidth isn't a huge cost; but the server which keeps track of where all the bits are in S3 and shuffles them around is surprisingly expensive.


Disclosure: I work on Google Cloud (and love the idea of Tarsnap, including the picodollars).

What about if you were to age things into Nearline and Coldline using lifecycle policies? Also, someone claims you do multiple regions, which our "default" multiregional does at less than half the cost of S3.

Admittedly we only recently (finally!) added per-object storage classes. But unless I misunderstand how your blocks work in tarsnap, wouldn't that be amenable to having some large portion as Nearline and possibly even Coldline? If not, I'd love to understand! (contact info in my profile).


Curious what AWS instance type you are using for that machine that keeps track of all the Tarsnap bits? I'm guessing it is highly available, so multiple instances of that type?


How many copies do you keep of the data to avoid corruption in storage?


Tarsnap data is stored in S3, with cryptographic authentication codes to detect any corruption; while data corruption has been detected in transit on many occasions (and requests retried as a result), I am not aware of S3 ever corrupting any data in situ.


Thank you.


A "truly paranoid" that is not a software developer can use Tarsnap?


I'm not sure what you're asking, but if your question is "can paranoid non-developers use Tarsnap", absolutely. You'll need to have some basic comfort with UNIX, though.


Yes, the question was whether a non-developer has the technical knowledge necessary to use Tarsnap. You answered it, thanks.


I see some comments here from people wondering why anyone would use this or searching for use cases. I'll share mine.

I use tarsnap to keep daily backups of a web server with a handful of sites. I send over their content, configuration, and database backups. This isn't $$$-critical stuff so I just make a once-daily database backup and I'm fine with that. Every few months I verify one of the backups to make sure everything is still working, but it is extremely hands-off. I just don't have to think about it.
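
The cron job itself is nothing fancy; an illustrative entry (with pg_dump standing in for whatever dumps your database, and % escaped because it's special in crontabs):

  15 3 * * * pg_dump sitedb > /var/backups/sitedb.sql && tarsnap -c -f "web-$(date +\%F)" /var/www /etc/nginx /var/backups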

It is currently costing me about 3 cents per day. I could probably reduce that a bit with some pruning. It was 2 cents per day for a while.

Here is my most recent billing statement:

  2016-01-26 Balance                   43.152827492555316729
    Client->Server Bandwidth   8.2MB    0.002069643750000000
    Server->Client Bandwidth  41.2kB    0.000010296000000000
    Daily storage             3.35GB    0.027041509164110040
  2016-01-27 Balance                   43.123706043641206689
So far I have used a total of $7 of my $50 deposit. Let's hope Colin keeps the service running at least until I've worked through that. Might be a few more years, though. Here are my statistics:

  $ tarsnap --list-archives | wc -l
  496
  $ tarsnap --print-stats --humanize-numbers
                                         Total size  Compressed size
  All archives                               200 GB            86 GB
    (unique data)                            8.4 GB           3.3 GB
It appears I have been using it for about 496 days. I can access any specific day's snapshot and download it if I need to. The 3.3GB here is the same as the 3.3GB for daily storage on my billing statement. I'm guessing that is also very close to the total amount of Client->Server traffic I have generated in my account's lifetime. Probably within a few percent? Not sure I can quickly verify that, though.


2¢, 3¢, 10¢ per day, does that really matter? We are talking about less than a Starbucks for a full month of backups. Sorry, but developers who literally optimize cents away at the cost of their own time and effort are a pet peeve of mine.


> 2¢, 3¢, 10¢ per day, does that really matter?

It doesn't matter right now, but if you're building a company which you hope to scale up, it's good to have costs which won't get too big when said scaling happens -- because when your company explodes overnight, you're going to be too busy keeping everything else running to spend time reworking your backup strategy.

(Also: If you don't have backups now because you're "not big enough to have data worth protecting", you're going to be too busy to start doing backups when you suddenly do have data worth protecting.)


> but if you're building a company which you hope to scale up

This is a popular sentiment among engineers at startups and in my opinion founders should work hard to (kindly) beat it out of the team as soon as possible.

Basically, you can use that line to justify nearly any engineer hobby. We need microservices! We need message queues! We need master-to-master replicated NoSQL databases spread geographically! We need Redis, Kafka and Cassandra with a CQRS event source pumping data in there, oh and also to a Postgres so we can do arbitrary queries for management reports! We need our backups to cost not 3 but 2 cents a month!

But the truth is that, statistically, the chance that a startup will not actually scale up is much bigger than the chance that, once it does, the sharply increasing Tarsnap bill will drive it to bankruptcy.

I agree with you that you need backups from the start, but whether they cost $.02 per month or $10 per month initially really doesn't matter. There are a lot of features to build (and kill), users to acquire, content to market, and the team is tiny.

I used to laugh about Twitter during their early growth days, fail whales all over the place. "What, they made that in Rails? Idiots." Now that I'm a startup founder myself, I realize they did it exactly right, and I strongly doubt Twitter would be what it is today if they had wasted time building a perfectly scalable tweet-processing timeline before putting the site out there.


I think it's a false dichotomy that great architecture and fast iteration are at odds. In fact, great architecture is what allows fast iteration to happen.

Yes, simplify your system by using as few moving parts as possible. But that also means don't use bloated frameworks that silently slow down your iteration pace with technical debt.

So many startups I've consulted with were stuck having to redo core architecture right when they found market fit. It's a tough position to be in. The main rule to follow is that good design allows better design to happen later; it's worth investing in that.


I fully agree and I don't believe anything in my comment contradicts yours.

> So many startups I've consulted with were stuck with having to redo core architecture right when they found market fit. It's a tough position to be in.

Good point, but watch out for survivorship bias here: plenty of startups failed before they reached product-market fit, and some did so because they wasted time doing the wrong things. All things told, I'd rather run a startup that needs to hire you to fix the core at the worst possible moment than a startup that fails. But I agree fully nevertheless: good design begets good design, and it really doesn't take much time.


> I agree with you that you need backups from the start, but whether they cost $.02 per month or $10 per month initially really doesn't matter.

For one service, no. But you can easily have one or two dozen such services, and then it really adds up, particularly with per-user or per-server charges.


Of course not. I'm mainly trying to show that it is incredibly cost-effective for this particular use case.


Sorry if my comment seemed rude, that wasn't my intention. I've had too much coffee and I'm wound a bit too tight today.


I see what you're saying, but I like the breakdown anyway. The fact that it's "less than a Starbucks for a full month of backups" at 2-3 cents a day is a good selling point. If its UI were layperson-friendly, that line by itself would get a bunch of customers.


Honestly, I think he was just trying to illustrate how inexpensive the service is, and noting that it could go even lower if someone cared enough (which I don't think he does). That was my takeaway, not that he needed to optimize it.


This has appeared on HN a few times... I'd really like to use tarsnap but it's simply too expensive to be viable (to which I guess the response is: don't use tarsnap for full backups -- but this makes it practically useless as far as diminishing the complexity of my current backup scheme[0]).

[0] http://duplicity.nongnu.org/


Have you found services with more favorable costs? I'd be curious to hear what they might be.

Besides that, you might just consider replicating to a mini-computer of some sort in another location that you have permission to use, like a friend's or relative's house. A Raspberry Pi connected to an external drive should probably do the trick.

Duplicity looks like a good tool by the way, thanks for bringing it up.


I also use duplicity for my backups. I don't use the S3 option, preferring to write to my own storage, but if I were to turn on S3, what exactly does Tarsnap give me beyond what I'm already getting? Other than an additional middleman taking a cut?


Duplicity isn't bad; I often recommend it to people who want something similar to Tarsnap but don't want Tarsnap itself.

The biggest disadvantage to Duplicity is that it has a far more constrained archive management model -- with tarsnap you can delete any archive at any time, but Duplicity works with a "full plus incrementals" model, which means that (a) you can't delete an archive without deleting the archives which "depend" on it, and (b) you'll inevitably need multiple full archives.
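
In command terms, that model looks something like this (the target URL is illustrative):

  # One full backup, then incrementals that each depend on the
  # chain before them:
  duplicity full /home/user sftp://backup.example.com/dup
  duplicity incremental /home/user sftp://backup.example.com/dup
  # You can only prune whole chains, e.g. keep the last two fulls
  # plus their dependent incrementals:
  duplicity remove-all-but-n-full 2 --force sftp://backup.example.com/dup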

Other points which are probably of lower importance to most users: Duplicity's website and documentation are even worse than Tarsnap's; and they rely on GPG, which has a pretty lousy security track record.


I'm a little unclear - does Duplicity run on macOS?


Yes, there's a Homebrew package for duplicity.


How much data are you trying to store?


I've been a tarsnap user for years (keeping personal photos and emails). I'm really pleased with the service.

I added a cron script years ago with the right paths/filenames, and cron emails me the output (the default). Eventually, the script got a bit more complex, because I wanted a nicer email subject and other silly aesthetics.

Still, the service is great and deduplication is awesome: data transfer is extremely low every day, but I have a separate archive for every day for the last several years.

I have a backup of my tarsnap keys off-site, so I won't lose any photos even if something like the house burning down happens. As for the emails, chances are I'll never need those backups, since they're on my laptop, desktop, and Fastmail.


I do use tarsnap and like it, but in recent years I've found another, similar approach that I prefer.

What I do is create a local backup using borg, so I get deduplication, compression, and encryption. Once I update my borg archive (a few megabytes per day), I sync it to a cloud storage provider like GCS, S3, or Backblaze.
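
Roughly, the daily routine is (repo path and remote name are placeholders):

  # One-time setup of an encrypted local repository.
  borg init --encryption=repokey /backups/borg
  # Daily: create a deduplicated, compressed, encrypted archive...
  borg create --compression lz4 /backups/borg::{hostname}-{now} ~/data
  # ...then mirror the repository files to cloud storage.
  rclone sync /backups/borg remote:my-backup-bucket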

The reason I prefer it is that I have a copy of my backups locally, which is convenient if I ever need them. Also, borg is friendlier when it comes to managing old archives.


Hi Colin, what would you say is the strongest argument in favour of Tarsnap over Arq?

(As an aside: I've been following your work for the past four years - as someone who is quite passionate about infosec, I'd like to thank you deeply for your contributions to the community.)


I don't know enough about Arq to feel qualified to answer this question.


How does tarsnap compare to borgbackup https://borgbackup.readthedocs.io/en/stable/ ? rsync.net has quite some praise for it: https://www.reddit.com/r/linux/comments/42feqz/i_asked_here_...


As always, picodollar pricing is so confusing.


Agreed; as a consumer I prefer flat pricing, even if it's nominally more expensive, simply because it's easier to budget for and requires less thinking (honestly).


I kind of like it, though. It signals that you pay _exactly_ what you use, not a cent more.


The price is quoted in GB-months right on the home page. All the picodollar pricing does is eliminate any uncertainty over rounding.


I use Tarsnap and it's great for two big reasons:

1. Deduplication. I have a 10 GB database but only about 30 MB changes every day. So I can back up the full database each day but only pay for an extra 30 MB.

2. Security. Colin is an expert in security and Tarsnap has a respectable bug bounty program. Plus I can have a Tarsnap crypto key which only allows read and write - no delete - which adds an extra layer of security in case a hacker gains access to my server.
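
The key setup for point 2 is done with tarsnap-keymgmt; roughly (key paths are placeholders):

  # Derive a key that can create (-w) and read (-r) archives but
  # not delete them; the full key stays offline.
  tarsnap-keymgmt --outkeyfile /root/tarsnap-rw.key -r -w /root/tarsnap.key
  # Point the nightly backups at the limited key.
  tarsnap --keyfile /root/tarsnap-rw.key -c -f nightly-backup /data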


I've been using tarsnap as a backup tool for a small NGO, especially for backing up the DB. I take a dump of the DB every 2 hours (we have almost 1k users now) and send it over.

The biggest pain is that it now takes ages to list archives. I think I have like 1500 of them. But otherwise: I'm really glad tarsnap exists. I deposited $5 in November and have $4.673539677478560010 left (I'm pretty sure Colin can now tell who I am, using the unique amount of picodollars).

My stats now:

  pawwer@pro16:~$ tarsnap --print-stats --humanize-numbers
                                         Total size  Compressed size
  All archives                               137 GB            42 GB
    (unique data)                            2.0 GB           489 MB

  pawwer@pro16:~$ tarsnap --list-archives | wc -l
  1425
Thanks, Colin. It's a great service.

However, I back up my personal photos someplace else; tarsnap is a bit pricey for that. You can get Backblaze for $0.005/GB-month with free transfer, so I use that.


Why did you never follow Patrick's advice: http://www.kalzumeus.com/2014/04/03/fantasy-tarsnap/

Not your thing? I've always wondered.


I do follow some of his advice. :-)

This bit: "Yes folks, Tarsnap — “backups for the truly paranoid” — will in fact rm -rf your backups if you fail to respond to two emails." is no longer true; you get three emails, and often more if you're a long-time Tarsnap user / I recognize your email address for some reason / you have a history of getting your data almost deleted.

A short time later, Patrick writes "If Colin does, in fact, exercise discretion about shooting backups in the head, that should be post-haste added to the site" -- something I considered doing, but ended up with a firm "nope, not happening" on, for two reasons:

1. I hate to advertise "discretion" because people get upset if you exercise your discretion in a way which is not in their favour, and

2. I have solid statistical evidence that people respond to incentives. An email which says "Your Tarsnap account will be deleted soon" is far more likely to make people do something than an email which says "Your Tarsnap account needs more money and if you don't pay up soon and don't write back then I'll think about deleting it" (which is actually closer to the truth). I really really don't want to delete someone's data when they still need it -- it has happened a handful of times and I feel awful about it -- and implying the presence of a ruthless cron job is quite an effective mechanism for preventing that.

Later, Patrick writes:

> Current strap line: Online backups for the truly paranoid

> Revised strap line: Online backups for servers of serious professionals

Here, I simply disagree with Patrick; nobody interprets "truly paranoid" as meaning "diagnosed by a psychiatrist as suffering from mental illness". I think this branding has been highly effective.

> Tarsnap is for backing up servers, not for backing up personal machines. It is a pure B2B product. We’ll keep prosumer entry points around mainly because I think Colin will go nuclear if I suggest otherwise, but we’re going to start talking about business, catering to the needs of businesses, and optimizing the pieces of the service “around” the product for the needs of businesses.

I have an unfair advantage over Patrick here: I know Tarsnap's user base. With the exception of Stripe, which started using Tarsnap thanks to Patrick (err, the other Patrick...), every large corporate user of Tarsnap I can think of started using Tarsnap thanks to a sysadmin who had used Tarsnap personally. In economic terms, Tarsnap's "personal" customers provide most of their "lifetime value" as sales channels to their employers.

This is already getting a bit long to be an HN comment, so I'll stop going through point by point. Suffice to say that a number of things Patrick suggests have either happened or are in progress. Customer testimonials? There's now a page full of them (starting with Stripe). Improved getting-started documentation? Done. Advice for dealing with a variety of common scenarios? A whole page of tips. Binary packages for common platforms? Due to be announced in a few days (currently available as "experimental"). A GUI? In progress, hopefully landing soon.

I've used this metaphor before, but I like it so I'm going to use it again. Patrick gives great business advice, but Tarsnap is not just a strategy for me to make money. So I treat his advice like ships treat navigational beacons: Paying close attention to them, and using them to plot a course, but not steering directly towards them.


Ironically, he claims tarsnap is run like a lemonade stand mostly because he uses it for his serious business yet lacks a process for checking his balance every week. That should be just one extra little step in his backup checks anyway.


If you're doing backups for your business, I've written on how to properly encrypt backups[1] and how to use Google Compute Engine for backups[2]. I'm working on write-ups for AWS and Azure that should post within the next few weeks.

[1] https://summitroute.com/blog/2016/12/25/creating_disaster_re...

[2] https://summitroute.com/blog/2016/12/25/using_google_for_bac...


This is certainly one way to do backups. Two things which come to mind on first reading:

1. You're encrypting backups but not authenticating them; someone with access to your archives could trivially truncate an archive or replace one archive with another, and there's a nontrivial chance that they could splice parts of different archives together.

2. Every archive you're creating is stored as a separate full archive; this will probably limit you to keeping a handful of archives at once. With a more sophisticated archival tool, you could store tens of thousands of backups in the same amount of space.


These are both accurate.

For 1, I ensure that an attacker cannot modify my archives after they've been uploaded by giving the backup service "put"-only privileges. This is unfortunately not possible with GCE, as I point out in a warning banner in the article, but it is with AWS, which I'll post soon. My use case is primarily to have a backup in the event of a devops mistake or a malicious attacker (ransomware), so I assume that someone with write access to my archives would just delete them; authenticating them isn't as big a concern, although it would be a good idea just to ensure the files aren't corrupted in some other way.

For 2, my storage needs currently aren't expensive (100 GB of archives per day, which means pennies per day for all of them), but eventually I plan on sending just diffs. I also wanted to create and send backups in the simplest possible way to help people get up and running as fast as possible, which meant limiting myself to the "openssl" command and other basic commands. The other, smarter solutions I'm aware of are either tied to a service (ex. tarsnap) or don't keep the data encrypted at the backup location.
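
If I do add authentication while keeping the "basic commands" constraint, an encrypt-then-MAC sketch might look like this (filenames and secret files are illustrative):

  # Encrypt, then MAC the ciphertext with a second secret; verify
  # the HMAC before decrypting so truncated or spliced archives
  # are rejected.
  openssl enc -aes-256-cbc -salt -in backup.tar -out backup.tar.enc -pass file:enc.secret
  openssl dgst -sha256 -hmac "$(cat mac.secret)" backup.tar.enc > backup.tar.enc.hmac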


Not bad! Can I recommend that you try out the per-object storage classes and lifecycle policies? Particularly if folks are going to be effectively rsyncing things, it's really handy to minimize the combination of retrieval fees and storage fees (this really depends on the manner of backup, incremental versus full, etc.).

Also not mentioned is that each service also supports versioning, which for backups that don't do block-based backup, can be an alternative DR plan (e.g., don't allow some users to delete the last version).

All in all, a good start (complete with helpful screenshots!). Looking forward to the guides on S3 and Azure Blobs.

Disclosure: I work on Google Cloud.


Tarsnap is great if you back up a file system holding small files. If you use it to back up e.g. VM disk images, the slow restore makes it an impractical tool. Worth knowing before switching to Tarsnap...


They are only as secure as the words on the page where they describe their security. Can you really trust it?


Does Tarsnap backup tarsnap?... just curious :)


I use Tarsnap to back up Tarsnap servers, yes. But obviously not to back up the backups, because that wouldn't make any sense.



