Tarsnap outage postmortem (tarsnap.com)
553 points by anderiv on July 27, 2023 | 319 comments




Ok, I really wasn't expecting this to land at the top of HN. I'd love to stick around to answer any questions people have, but it's 10PM and my toddler decided to go to bed at 5PM... so if I'm lucky I can get about 4 hours of sleep before she decides that it's time to get up. I'll check in and answer questions in the morning.


Why would I use your service over restic?

God bless you Colin, but reading this, it appears you're the only one in charge of the infrastructure for this service. I'm glad you're clear about no SLA, but this seems like a big liability between me and my backups.


It's been a pretty well-known fact for years that Tarsnap is basically a one-man show, and yet Colin has managed to provide fantastic service so far. Sometimes having the people who built the service also run it is actually a big plus, compared to other services where you first have to fight through outsourced & underpaid support that's limited to template answers, only to finally get some "engineer" who got the job 2 months ago and knows less about their system than I do...


And to be frank, I've seen plenty of mission-critical services at $bigco which may have had a team of engineers working on them, but the core functionality was maintained, understood, and supported by effectively one senior engineer. If anything went wrong, the supporting junior staff might have been able to fix reasonably simple stuff, but there was essentially one person who understood the system deeply enough to handle problems of any real significance.


Absolutely.

Early in my career, I became the second person able to support and operate a system that was public facing and responsible for billions of dollars of activity that mattered to many individuals and stakeholders. The entire team retired over a period of six months, after giving the folks in charge a year or more of notice. After about 12 weeks, I was the sole guy, training 4-5 new people.

We’re all probably using a service like this. As demonstrated by Twitter, well engineered systems can persist, even without proper care and feeding, until they don’t.


I hate to bring this up, but what about the bus factor? If Colin is physically unable to continue maintaining the service and something like this happens again, how will anyone be able to get their data out? It's not really a concern about the service Tarsnap provides today.


There's an old Sys Admin saying (perhaps from Allan Jude of ScaleEngine) that goes something like "if your data doesn't exist in at least three places, it doesn't actually exist at all..."

That is to say, if Tarsnap is the only place you're keeping sensitive/important data, then you're "not doing it right" as a backup. Things happen... your hard drive can die suddenly and a data center can burst into flames all on the same day.


I feel like OVH will never stop hearing about this. This has been, frankly, a traumatic event for many sysadmins I believe, and one that was shared by many from the same source, which is quite different from the standard variation of "that time when I erased the production database" (looking at you GitLab, but also at myself!). I mean, at this point it's somewhere between a legend and a cautionary tale and I don't know what else to call it. A bad Wednesday probably.


> I feel like OVH will never stop hearing about this.

To be fair, they deserve it a bit as they went up in flames twice.

Indeed, after the first fire, the geniuses over there collected all the UPSes and batteries they could find from the DC and stored them all in a pile in a closed container... where they predictably bulged, failed, sparked and eventually triggered another fire after a couple of days.


Why the scare quotes? I would expect any well-experienced power user to know a complicated system better than a fresh engineer two months into working on it, with no previous experience on the system. Especially if the power user is an engineer themself.


You really shouldn't, if that's a major concern for you, and it is a valid concern. For the same reason I'll never use PurelyMail, even though it's otherwise perfect.

I know you didn’t ask me — but I don’t think Colin can answer differently other than saying that he is training a family member or friend to take over if needed.

Here's more: https://news.ycombinator.com/item?id=7514753 and this is also linked there: http://mail.tarsnap.com/tarsnap-users/msg00846.html

Very old threads, but I am not sure much has changed there: https://www.tarsnap.com/contact.html

Why would you use it instead of restic? Well, for pricing in picodollars ;-)

And because it has a functional GUI with a tiny system footprint, and there really aren't many such solutions out there.


> God bless you Colin, but reading this, it appears you're the only one in charge of the infrastructure for this service

Hence the toddler.


I am really confused by this comment thread; I'm reading the toddler somehow being in charge of running the infrastructure as a joke, yet I can't see it as either clearly a joke or clearly serious.

I’m a native English speaker but sometimes I swear I’m losing grasp on communication in the Internet age and am sincerely trying to understand this all.


The joke is that the toddler is for future maintenance, not now.


My toddler runs https://rangerovers.pub and it mostly holds up okay. He's not great at yaml because he can't really read so the significant whitespace is a problem, but he knows how to run the backups and ensure the mail handler isn't choking on all the Russian spammers. We try to limit his screen time though so he's only on for the 15-minute maintenance window. The Aprilscherz frontend for Docker is a big help.


Are you suggesting that those who build enterprises don't have time for kids? Seems plausible, but is the difference in lifestyle so consistently prevalent as to be stereotypical? Elon has 10!


Raising the toddler to have some help running the business.


Might take a while. Tarsnap has never had an employee without a doctorate. She's a very bright girl but I'll be surprised if she gets her doctorate before 2040.


So you're the one in charge of the unix epoch rollover?


Not just help: there is now a clear heir to take over if (the gods forbid) cperciva ever succumbs to illness or is defeated in battle.


Tarsnap natively protects against inadvertent or malicious deletion or corruption: old Tarsnap backups are immutable. The low-cost competitors (restic, borg, etc.) seem to have this feature as an afterthought, and they make it surprisingly difficult.

(FWIW, S3 can be somewhat straightforwardly configured so that old data is effectively immutable. Google Cloud Storage’s similarly named versioning feature appears to be far weaker.)
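For example, something along these lines with the AWS CLI (bucket name is a placeholder, and Object Lock has to be enabled when the bucket is created):

    # keep old versions of objects when they are overwritten or deleted
    aws s3api put-bucket-versioning --bucket my-backup-bucket \
        --versioning-configuration Status=Enabled

    # with Object Lock enabled, enforce a retention window during which
    # object versions cannot be deleted
    aws s3api put-object-lock-configuration --bucket my-backup-bucket \
        --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":90}}}'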


Yep, S3 is reasonably easy to configure for immutability. I personally use restic to send (encrypted) blobs to https://www.borgbase.com which has append-only mode and monitoring to warn me if some backups didn't happen.


BorgBase is another "little" service that I use and like, just like Tarsnap and to some extent rsync.net. They also have an excellent GUI app, Vorta (it's FOSS; the BorgBase dev is the maintainer).


Even large organizations can have fairly regular availability issues. I appreciate the noted flaw of a "single point of failure", but I also see orgs where 100s of people have access to the infrastructure, make a change, and then it breaks something. I wouldn't do business with an org just because they have many people; that doesn't mean they're operationally sound, at least not to my expectations.


If the data is super important you should be using two different providers anyway for backups.


Honestly, whose data isn't "super important"? All my data is super important. Even the crap I just throw on my Google Drive. I want to keep it.

What is this mythical unimportant data that people still want to back up?


I mean, you called it crap and then said it's super important. That's what hoarders say.

Subjectively you may feel that your data is super important, but objectively it probably isn't.

When people talk about 'super important' (totally a technical term), I think of things like DB backups in software companies, backups of financial reporting for firms, etc. Not your tax return from 2008.


My nginx config is not super important. My old reports written for study are not super important. My pirated movie copies are not super important.

These are examples of data that I could easily live without. Where losing it would either be a matter of re-doing old work, or just forgetting about old and minor things.


>What is this mythical unimportant data that people still want to back up?

I have lots of stuff like this. Often it is easier to just back up an entire folder than to go through subfolders separating stuff into important and not very important. Storage costs are low enough to just back up (almost) everything. Also, one often doesn't know what may be important/useful in future. For example a couple of years ago I had this huge buildroot system (600 GB) to build firmware images for a single-board computer I spent quite a while putting together. The project I was doing it for got cancelled so I had no need to keep it. Still I wish I had, as I'd love to be able to tinker with it now, but 600 GB is not a trivial amount to store so it got deleted. Most of this data was pulled from various online resources that don't exist anymore too.

What's the moral of my story? If you have a fast internet connection (I don't), back up "everything" to the cloud. Then find the "really important stuff" like the pictures of your children etc. and back it up again to a different cloud.

If you're in the middle of nowhere on a slow LTE connection like me, building a NAS box is not a bad idea for backups.


I have stuff like that on a hard drive in my home, on a persistent storage volume from Linode, and on Dropbox.


Anything that you stashed just for convenience, but could re-download or re-generate if really needed, or simply live without... frankly, like 90% of the stuff on my disks falls into the category "I'll read/view it one day", which in reality I'll probably never have the time or patience to open ever again.


Strange, 90% of the things trapped in my flash memory are system files.


You should get another drive and reserve it for data. The cost is negligible and it really makes everything much simpler.

Optimizing your system or upgrading it just becomes a "trash boot drive and reinstall" operation, applied without a care in the world.


The stuff I don't want to fuck around searching for and re-downloading from torrents, for example.


I really used to enjoy formatting my machines about every 6 months.

Well I used to until macOS kinda went off the rails a bit. Now it’s mostly an exercise in running my arch script for my thinkpad.

Being stuck between operating systems is kinda a mess though, makes backup and file sync in general really hard. But everyone’s gotta have their own cloud, right?!

Why can’t I just put a cloud under my bed and forget about it?


> Why can’t I just put a cloud under my bed and forget about it?

Just buy a Synology NAS. Keep default settings, set up a few user accounts, tweak a few things here and there, enable encryption, install Active Backup on all your devices, done.

There are many cheaper/more open options for self-owned NAS storage, but unlike a Synology they're definitely far away from "and forget about it".


What use is an SLA? If a service goes down for too long, are you really going to hire a lawyer and sue over the SLA, or just... use another backup?


Not even then - most SLAs say that if it's breached, you pay less. Not that you get money back


It's not about suing, but defining expectations about how you can rely on a service.

For example, my team has people across the world for HW bringup, so we can't allow our code hosting or CI to be down for more than a few hours. Of course, backups have different uptime requirements, but as for everything, it's a tradeoff between features, of which an SLA is one.

Tarsnap's features are granularity of cost, reliability of storage, and encryption, but not 99.999% uptime.


> It's not about suing, but defining expectations about how you can rely on a service.

Meeeeh, my ISP cut off around 100+ fiber connections in my town and spent three weeks fixing it. My neighbor has a business line; there's an SLA on those that, among other things, requires them to reestablish his connection within 3-5 hours. It took them over 500 hours, so that SLA is useless for anything but forcing compensation.

The problem is that the SLA should give an indication of available resources, but in reality it's mostly a contractual thing for most companies; they'll pay the "fine" or refund a customer if they fail to hit their SLA and that's about it. Tarsnap most likely has better availability than many midsize competitors simply because it's just one person who really cares about it. Doesn't help if he's hit by a bus though.


SLAs can be meaningless like that. However the better ISPs have in place a backup system that doesn't use the same fiber/wires. Sure, the backup might be a radio or satellite feed and so be slower, but it will get/keep you online. This costs a lot more per month though, so if you are not paying for that service your SLA will probably just be "we give you a free month" (which hurts them enough that they will do some things to prevent downtime, but not enough that they put redundant fiber paths in the ground).


The "problem" is that no sane company will sign for any damage compensation on some cheapo few dollars a month service.


A company could... if you have N users, you pay M for storage per user, and downtime costs you X, then it could be that a discount of Y means (M - Y) * N = X


Agree. You get a discount if something breaks. But SLA really only works for larger services where the cost of fixing something is small when compared to the discounts.


Then SLO (service level objectives) should be enough.


huh, never heard of SLO before.


and now also go google what SLI means ;-)


Very roughly:

SLI - Service Level Indicator - the metrics, e.g. latency of each request/response cycle.

SLO - Service Level Objective - the threshold we are aiming for, e.g. 10 ms from request to response averaged over a 1-hour period.

SLA - Service Level Agreement - the contract with the customer about what happens if we breach it (credits given, put the CTO in stocks and throw eggs at him, etc.)


I know it's a joke, but I think if an SLA involved putting a CTO in stocks and throwing eggs at him then that'd encourage me to sign up for the service. Especially if the video of it were posted after every incident.

Instead we get refunded some pitiful amount when our business is seriously disrupted for an extended period of time.


:-)

My youngest once found some sort of chocolate drops called "unicorn poo" - which seems a more ironic thing to chuck at CTOs!


Don't let the CTO be a scapegoat. Entire executive leadership, all board members and the 5 largest shareholders.


We have just written the Sarbanes-Oxley for the tech regulation industry. All we need now is a congresswoman and a senator and a good acronym

Secure

Technology

Oversight for

Corporate

Software

STOCS Act here we come !

Edit : yeah I could not get the K in ... that's hard


Korporate.


I'm curious how the prices shake out against services like Wasabi, since it's just dumping to an AWS S3 bucket.

Wasabi does $7/TB with no ingress/egress fees. My NAS is set up to rclone to it about once a day and I've yet to have any problems.
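Such a daily sync can be roughly a cron entry like this (remote name, bucket, and paths are placeholders, assuming a Wasabi S3 remote already set up with rclone config):

    # mirror the NAS share to Wasabi every night at 03:15
    15 3 * * * rclone sync /volume1/data wasabi:my-nas-backup --log-file /var/log/rclone-wasabi.log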


I haven't checked the pricing in a long time, but you can use Tarsnap even if you have to back up only 7.3 KB (okay, I might be exaggerating here, but you get the drift) and pay for only that much. You can't do that with Wasabi et al.
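At the published $0.25/GB-month (250 picodollars per byte-month), that hypothetical 7.3 KB works out to roughly:

    7,300 bytes x 250 picodollars/byte-month ≈ 1,800,000 picodollars ≈ $0.0000018/month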

Also it’s really simple and does what it says it does, nothing more, nothing less. In today’s everything convoluted and bloated world this is a luxury imho. The GUI app is also quite good and functional. Support is prompt (that is if you need it).

You don’t have to worry about file being deleted just because your machine didn’t connect or backup for some time even if you keep paying (hello Backblaze) etc. I mean there’s no circus, melodrama , and cliffhangers involved.

I personally would never use it backup my entire laptop, due to price alone. But I have a subset of VVI files and Tarsnap is one of more than one backups for those files. So for that use-case Tarsnap is perfect for me, so far.


Backblaze has kept my ‘shutdown two years ago’ machine data without issue. What problems did you have with them (or did others have)?


Backblaze has a policy of allowing backups of external disks, but the disks have to be connected at least once every 30 days, or they'll delete the backups. I understand they want to avoid abuse, but the lack of any grace period, or ability of support to add an override, really soured the service for me.


You can just pay extra for extended or infinite retention. https://www.backblaze.com/cloud-backup/features/extended-ver...


I did this; it's come in useful a few times.


Huh, I had to start paying them $2 more for my nonexistent PC I think, but otherwise was fine. I have only 1 TB of total storage on that PC though, so maybe that’s the reason.


Uptime isn't an important property of a backup solution, so I'm not sure where the expectation comes from?


It sure should be up when you need it, exactly at the time you need it.


In future postmortems (of which I hope there will be very few or even none) you may want to spell out your 'lessons learned' to show why particular items will never recur.


It always amuses me how people want reassurance that the next crisis will be a fresh, new problem, and not one the person can demonstrably solve.

A lot of 'lessons learned' analysis boils down to this: in order to prevent a recurrence of X, we introduced complex subsystem Y, the unexpected effects of which you can read about in our next post-mortem.


That's an overly cynical take, post-mortems are not for anyone's reassurance, they are a learning opportunity.

The airline industry is as safe as it is because every accident gets thoroughly investigated with detailed reports ("post-mortems") including what to do differently going forward. These are taken as gospel among all players in the industry and as a result, you very rarely see two different accidents caused by the same thing anymore.


That was entirely not what I was getting at and is a cheap shot that is well beneath you, especially because I suspect that you know that that wasn't what I was getting at.


My comment wasn't intended personally; your words about "will never recur" just reminded me of this peculiarity of software systems, where it's often error handling/monitoring/backups/etc. that cause cascading failures in the systems they're intended to safeguard.

I'm sorry if I misconstrued your meaning, but I am flattered that you think there are things beneath me!


Fair enough. I see the whole function of a postmortem in a very simple way: to avoid recurrence of the same fault. Yes, there will be plenty of new ones to make. But if you don't change your processes as the result of a failure you are almost certainly going to see a repeat because the bulk of the conditions are still the same. All it takes then is a minor regression and you're back to where you were before. This I've seen many times in practice and I suspect that Colin isn't immune to it. And yes, I look up to you, your writing is usually sharp and on point and has both amused me and educated me. So you have an image to live up to ;)


You'd love my team's recent postmortem, featuring the comment "action items have been copied from the previous postmortem".


Could be as simple as "test restore a new server every 1-2 years"


You should consider this possible lesson:

"Our simple model that fails gracefully did so and was simple to recover"

Redundancies and failsafes are not free - they add complexity.

99.9% availability fails in boring ways.

99.999% availability fails in fascinating ways.


Yeah, I was going to do that but it was getting late, I wanted to get some sleep, and the post-mortem had already been waiting far too long to be sent out.

The main lesson learned was "rehearse this process at least once a year".


Agreed, that's the big one. But also: when sleep deprived: take a nap!


The infrastructure page* says,

> at the present time it is possible — but quite unlikely — that a hardware failure would result in the Tarsnap service becoming unavailable until a new EC2 instance can be launched and the Tarsnap server code can be restarted ... So far such an outage has never occurred

I read the postmortem as saying that a hardware failure did cause it to be unavailable and the code could not simply be restarted; a new server had to be built.

If that is correct, as well as writing up lessons learned (as Jacques mentions), this page could be updated with outage information -- or even info on changes to reduce the risk of repetition.

For what it's worth, one outage of a single day in fifteen years is impressive. If my ballpark math is correct, that's about 99.98% uptime, i.e. a bit short of four nines.

* http://www.tarsnap.com/infrastructure.html


This was an extremely well written and thoughtful postmortem, but I hope to never see one from you again. :)


It was a postmortem without the mandatory "how can we prevent this in the future" steps…


In 15+ years of running this service, this is one of two (2) postmortems he's ever published, and the first in eleven (yes, 11) years.


I think that's a little unfair given what was in the postmortem. It may not be a separate section with the key points, but all the information is there about what the issues were and what the solutions are. I think it's fair to assume they're actually acting on those without them needing to be reiterated at the bottom of the page.


I agree, we don't really need a "key points/future actions" section that boils down to "The service will be geo redundant"


Well, for sure he has fixed several bugs, but he didn't say that he would be testing his disaster recovery procedure every year in the future for example.


Yes, rehearsing the process every year is the main lesson learned. Sorry, it was getting late and I wanted to get the email out so I cut it short.


Time to get your toddler providing round-the-clock support! ;)

Have been having some luck reading https://www.amazon.com/No-Cry-Sleep-Solution-Toddlers-Presch... - available everywhere libraries (blockbuster for books!) are found.


She's generally a wonderful girl. Right now she's dealing with her second molars coming in and just picked up a cold though, which is throwing off her sleep schedule.


How long do you keep the transaction logs before rewriting them?

I too had a few EC2 instances go down with signs of being severed from the EBS in the recent couple of weeks; mine were in eu-west.


There's a continual background cleaning process which depends on the amount of storage which can be reclaimed -- there's a tradeoff between cleaning too slowly (and paying for wasted storage) and cleaning too fast (and paying for lots of S3 operations). I think it averages a couple weeks right now.


Thank you for the post-mortem Colin and I hope you get some sleep!


Thanks, I did! My long suffering wife was up at 3:30 though. :-(


What I'm wondering is, I had data on Tarsnap, why am I only hearing about this now?


Some recommendations on the AWS front (not sure if some of these are already implemented since the postmortem does not go into AWS details).

- Set up nightly automatic snapshots of EBS volumes (this is supported natively now in AWS under the lifecycle manager; see the sketch after this list).

- Use EBS volumes of the new GP3 type, and perhaps use provisioned IOPS.

- Set up an auto-scaling group with automatic failover. Of course this increases cost, but it should be able to automatically fail over to a standby EC2 instance (assuming all the code works automatically, which the blog post indicates is not currently the case).
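For that first point, a rough sketch with the AWS CLI (account ID, role, and tag values are placeholders): nightly snapshots of every volume tagged Backup=true, keeping a week of them.

    aws dlm create-lifecycle-policy \
        --execution-role-arn arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole \
        --description "nightly EBS snapshots" \
        --state ENABLED \
        --policy-details '{
            "ResourceTypes": ["VOLUME"],
            "TargetTags": [{"Key": "Backup", "Value": "true"}],
            "Schedules": [{
                "Name": "nightly",
                "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
                "RetainRule": {"Count": 7},
                "CopyTags": true
            }]
        }'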


Can you say a bit more about the log-structured S3 filesystem? I wrote something very similar recently (https://github.com/isaackhor/objectfs) and I'm curious what made you settle on that architecture. The closest thing I know of that's similar is Nvidia's ProxyFS (https://github.com/NVIDIA/proxyfs)


> the central Tarsnap server (hosted in Amazon's EC2 us-east-1 region)

What prevents you from distributing load among other regions?

(Also: did you ever think about abandoning AWS?)


Nice write up. A couple questions:

- The use of "I" raises the question: what's the "bus factor" of Tarsnap? If you were unavailable, temporarily or permanently, what are the contingency plans?

- Will you be making any other changes to improve the recovery time, or did the system mostly function as designed? For example having a hot spare central server?


Are you gonna switch to us-east-2?


> Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people credits for outages when it seems fair" policy, on 2023-07-13 (after some dust settled and I caught up on some sleep) I credited everyone's Tarsnap accounts with 50% of a month's storage costs.

This speaks volumes to me about what kind of person Percival is; that credit would appear to be generously on the "make customer whole" side of the fence, and unlike the major cloud providers, he didn't make each customer come and individually grovel for it. And a clearly written, technical, detailed PM, too. This is how it ought to be done, and done everywhere. Thanks for being a beacon of light in the dark.


"Thanks for being a beacon of light in the dark."

That's well put.

It makes me very happy to live in a world where tarsnap exists and is priced in picodollars.


For the record, I'm happy to live in a world where rsync.net exists. I've pointed quite a few customers in your direction over the years, when tarsnap hasn't been suitable for their needs for a variety of reasons.


They make a good pairing. I backup my ZFS NAS to rsync.net for all my media and Tarsnap for all my documents/critical things.


I am only using rsync.net at the moment, more specifically with the discounted "borg" mode without an explicit full shell.

Your comment sounds like tarsnap is more secure (in terms of longevity) than rsync.net. Is this true? If yes, why?

Genuine question, because I'm using rsync.net for my critical stuff and would gladly move to tarsnap if appropriate.


For me it's threefold.

First is that my usage of Tarsnap pre-dates my usage of rsync.net, so it's been the primary backup of my home directory since ~2010. I haven't felt a need to change it and on the 2 occasions I needed to restore from it everything worked perfectly. i.e. don't fix what isn't broken.

The second is that while I can in theory restore from rsync.net, I actually never have... this is more a testament to the reliability of ZFS though, I guess, and local snapshots have always been enough. That said, the convenience of send/recv is sort of awesome.

Lastly, I don't use ZFS on my client machines. If I did I would probably consider rsync.net for everything.

So it's not really that I explicitly think one is more secure or durable than the other; it's that Tarsnap has fulfilled my DR needs successfully for a long time and I have come to trust it to do so in the future.


The downtime could have been much shorter if you had properly set up and _tested_ disaster recovery steps. Create a full-fledged separate staging system which you can bring down and recreate, periodically test various failure modes, and document all the detailed steps of a system restore, etc.

Also, I would suggest thinking about the business long term and seeing if you can increase the revenue enough to enable you to hire a part-timer who can be of great help in case a similar event happens.

We are also a small cloud solution provider (we focus on ML APIs) and over the years it has become clear to us that when you use cloud hardware (either dedicated or virtual), outages periodically happen. RAM, HDDs or other parts of the hardware can just malfunction at any time. So this is something which 100% needs to be taken into consideration when running any high-availability online service over the long term.


Hats off to you for an honest postmortem and your capable handling of a difficult situation. The only remark I would offer is with respect to sleep deprivation—when you're the only person who can fix a problem, there's no shame in trading some additional outage time for a fresh mind. Though it feels weird to go nap when all the klaxons are blaring, problems are too easy to compound under the combination of adrenaline and inadequate sleep.


Don't worry, I had a couple naps in there. "This seems to be running smoothly but it will take several more hours; I'll set my alarm to wake me up in two hours and have a nap" is part of why I didn't notice the second step was unnecessarily I/O bound.


IIUC the process had a few steps where you only had to wait while data was transferred or processed for long periods. They were probably useful for taking a nap or eating or just drinking more coffee.


Based on the description it sounds like it should be relatively easy to test this recovery process on a regular basis, to catch any lingering bugs and evaluate the recovery time. As they say, the only backups are the ones you have tested.


As someone who just discovered my DR process does not work by testing it, 100% this. The only plan that is likely to work is a repeatable tested one.


Ideally, the thing you do in an emergency is largely routine, so that it happens by instinct rather than being a special case you need to remember. It should not be different in arbitrary ways.

For example, in both trains and cars, thanks to anti-lock braking, the correct way to stop the vehicle ASAP is to brake just like normal but as hard as you can; the computers will automatically solve the much trickier problem of turning your input into maximum deliverable braking force by periodically releasing brakes on sticking wheels.

If you run a fire drill, it's surprisingly difficult to get employees to use fire doors that they're used to finding alarmed and unusable. Even though intellectually they know that, say, the door at the bottom of the stairwell is a fire door, with crash bars and leads directly to the outside world, and this is a fire drill, they are likely to (for example) exit on a higher floor and go through a chokepoint lobby, as they would normally, instead of following this safer path that is emergency only. Sadly it is hard to fix buildings after construction if they were designed with such "unused" emergency exits.

For a backup process, having restoring machine images be a service that is sometimes, though not constantly, used anyway for some other reason is a good way to be comfortable with how it works, that it works, etc. At work for example we routinely test upgrades on test servers restored from a recent backup. Restore serviceA to testA, apply the upgrade, discover the upgrade completely ruins the service, throw testA away and report that this upgrade is garbage. But in the process we gained confidence in the restore process; when things go badly wrong, infrastructure people aren't trying to recall something they only ever did in a drill, they are very used to this procedure because they do it "all the time".


This is terrible. Instinct cannot be trusted. Write it down.


There are two types of emergencies - checklist ones, and panic ones. You need to have both, but realize that in the panic ones people do NOT operate rationally.

This is why house doors open in but business doors have to open out - if there’s a crush against a fire door it opens.

You even see this in aviation, where everything is checklisted; the pilots will first stabilize the plane in an emergency and then run the checklist. And small planes that operate unexpectedly always have higher crash rates.


Pilots are a little special, their panic mode is also a checklist, known as the memory items.

This doesn't work for normal people because normal people don't drill non-normal events until the response is instinctive.


Normal people should drill certain non-normal events (for example, all drivers should know how to decelerate and get off the road quickly).

But you should NEVER design a system that requires normal people to drill non-normal events; even planes have been redesigned to "fix" problems where the pilot had to do something unintuitive or unexpected, because eventually it WILL catch up to you.


Note that in airplanes (unlike cars) you normally cannot just get in a new one and fly. You first get training on that particular plane. If everything goes perfectly any pilot can get in any plane and fly it, but if any little thing goes wrong they had better know how the plane flies very well so they can get it stable enough to run the checklist.


You probably shouldn't just get in a new car and drive it, but people do. I remember at a hire-car place once the team I worked with were given an automatic; the guy driving had never driven an automatic transmission before, but his license authorises it (UK licenses allow everybody to drive an automatic, but you need to test in a manual to drive manual), and so they just lent him a car with a completely different driving style. He had to get them to show him how to even drive it away out of their car park.

I learned in the small car from the same brand as my father's larger car, so that the controls were in the same place and the symbols on stuff were identical; all that was different once I had a license and borrowed dad's car was that it's longer and has more power.

It also probably shouldn't be legal for me to drive today, but it is. I learned 25 years ago, and I haven't driven anything in over a decade, so a rational system would say nah, you're too rusty, get a refresher course, but there's no mandate for that.


It is kind of mind-bogglingly insane that you can be 25 years old (or younger in some states/places), having only ever driven a Smart car (so you have your license), and you can walk into U-Haul and rent a 26-foot box truck with a trailer, and the most they do is tell you not to go under low overpasses or into drive-thrus.


Yep! I've been meaning to do it for a while but there was always something higher priority... I didn't realize until this outage that it had been almost a decade since I had tested it.

Rehearsing this annually is definitely going to be a high priority.


I always appreciate seeing a professional, courteous, and honest postmortem like this one.


(caveat: I may be running on old tarsnap company info but) I must say, the ONLY thing that has ever made me shy away from seriously using tarsnap was the prospect of an unexpected Colin Percival outage. i.e. key person risk. I'm guessing I'm not alone in this.


It's an MTBF-like calculation: do you trust the one-person company that has a well-engineered solution with few moving parts over the much larger company with a probably less well-engineered solution with far more moving parts?

I personally would go with the simpler solution because in my experience you need an awful lot of extra complexity before you get to the same level of reliability that you have with the simpler system. Most complexity is just making things worse.

You can see this clearly when it comes to clustering servers. A single server with a solid power supply and network hookup will be more reliable than any attempt at making that service redundant until you get to something like 5x more costly and complex. Then maybe you'll have the same MTBF as you had with the single server. Beyond that you can have actual improvements. YMMV and you may be able to get better reliability at the same level of performance in some cases, but on average it always first gets far more complex, costly and fragile before you see any real improvements.

I strongly believe that the best path to real reliability is simplicity (which is: as simple as possible) and good backups. For stuff that needs to be available 24x7 and 365 days per year this limits your choices in available technologies considerably.


While I get this as a risk, I'm not convinced it's any more risky than a larger corporate entity.

This is Colin's job. Colin has his name attached to it. It's really important to Colin.

You're not going to get the same kind of service from BigBackupCorp. Their employees are replaceable, their management is replaceable, and to be honest, you as a customer are replaceable, if they decide to move in a different direction and become BigFlowerArrangementShippingCorp.

The neat thing about a small business is that it runs entirely on its own profits. There are no stock price games or VC jiggery-pokery or anything like that. If it's a profitable business, there will be somebody to come along and take it over and make it their job with their name attached to it. I think the open Internet benefits a lot from this sort of thing.


It isn’t necessarily about Colin quitting. Key person gets hit by bus is also always a concern. You can say someone will pick it up, but I know nothing of whether such plans are in place. Does the person who would inherit the business have the know how to sell it? Is there enough documentation in place for a transfer of assets to be successful?


This is how that scenario shakes out:

  1. Key person gets hit by bus
  2. You see the black bar on Hacker News and learn the sad news
  3. You go download all your data from the service, which is still up because there is no bus access to data centers.
  4. You feel like a jerk for all your creepy "hit by bus" talk.
  5. A few weeks later, some VC-funded operation with multiple employees you depended on disappears overnight without a trace.


> You go download all your data from the service

Just about this step... you are supposed to have it already. You just have to find another service and start using it.


I wonder if Tarsnap could stand up to everyone downloading their data from it at once, especially without anyone to help keep it alive.


Companies and corporations get "killed" often too, even if the people in them are alive.


Make a list of the competitors tarsnap has outlived and maybe it will change your calculus a bit. The risk you need to evaluate is not "what if something happens to the proprietor" (which I've always found pretty macabre), but "what if something happens to him and then the service goes down and also I never backed up my backups". This is a risk you can make as small as you want with judicious planning.


I mean, if you are on HN, you will probably learn of a Colin outage within 24 hours, so practically speaking you would really only have a problem if your primary data storage, Tarsnap, and Colin all failed in the same 24 hour window or so before you had time to switch to a new backup provider.


Pretty sure his brother works on tarsnap too.

They should take separate buses to ______.


Pretty sure his brother works on tarsnap too.

Yes, I hired him in 2015 IIRC. If you look at tarsnap's GitHub you'll see a lot of commits from gperciva.


Oh! Do say Hi to Graham for me.

He mentored me so that I was able to contribute and eventually help maintain and manage LilyPond's Documentation and Patch Testing in a meaningful and rewarding way - all without any programming experience.


Nice. Being able to work with your family is great.


I would never consider a backup provider to be more reliable than that, because if you depend on it, it will fail you at the hardest time.

Better to have multiple layers of backup, of which tarsnap and friends are only one, and verify regularly.


> The second step failed almost immediately, with an error telling me that a replayed log entry was recording data belonging to a machine which didn't exist. This provoked some head-scratching until I realized that this was introduced by some code I wrote in 2014: Occasionally Tarsnap users need to move a machine between accounts, and I handle this storing a new "machine registration" log entry and deleting the previous one

Recommend writing a TLA+ model to catch stuff like this


What would be the benefit of tarsnap over using something like restic+backblaze at order(s) of magnitude lower cost? What specific need would motivate you to pay $3000 per TB-year?


Some of us have lots of extra money and like an excuse to give some of it to cperciva so he doesn't have to work a shit job and can apply his skills and talents to bigger, better things?

(People here asking about the low Bus Factor: you don't keep your backups in one service/location, eh? You use Tarsnap and Restic with Backblaze, Rsync.net, S3, etc. right? "Backups are a tax you pay for the luxury of restore.")


Extremely good deduplication means that for the core set of very important data I backup to Tarsnap the costs are negligible. I imagine the math is probably different if your data is changing more frequently. I for instance use other services to manage my video and photo libraries but my accounting databases, critical documents, etc are backed up to Tarsnap.

I have been using Tarsnap for a decade and not only have there been minimal availability issues, there have been almost no issues of any kind that I can recall.


It sounds like most of the 26h downtime was spent restoring backups. Incidentally, this is exactly the reason why Tarsnap is unusable for me for production environments. Backup restoration (as a user) is excruciatingly slow. When my systems are offline, I have no patience to wait for hours for my backup service. Maybe things are better now; last I tried was a few years ago, when Tarsnap took on the order of an hour to restore a backup of a few GBs.


Unfortunately, looks like https://www.tarsnap.com/infrastructure.html will have to be updated.

>> So far such an outage has never occurred; but over time Tarsnap will become more tolerant of failures in order to minimize the probability that such an outage occurs in the future.


Unrelated to the outage, but I'm curious nonetheless: would it be possible to hook up Tarsnap's encryption software to a Dropbox folder? I'm not sure if it even makes sense to use Tarsnap for this, but I'd love to have an easy setup that allows me to use Dropbox's servers but only let them see encrypted data so they can't snoop.


You probably want something like https://cryptomator.org/


Doesn't plain old Duplicity (https://duplicity.us/) do that already? (except for de-duplication)


Tarsnap is undoubtedly expensive, but it also donates to various efforts!

Neglecting the pricing, does Tarsnap have any advantage over Restic?

Restic also deduplicates, using little data.


The deduping in restic is just on the edge of acceptable for me, making me think I'd have trouble with a lot more data. Basically the once-a-month "prune" operation takes about 36h (to B2). I feel I could be tuning something, but also it works and I don't want to touch it.


I back up around 2 TB with Restic, and also tried locally with Borg. The size is nearly the same. Sadly, I can't even test with Tarsnap! (absurd pricing for 2 TB).


> absurd pricing for 2TB

Well, it can't be that ba..

    $0.25 x 2000 = $500
Yikes. And this is without BW costs.

At $500/M you can just rent a dedicated physical server with a lot of HDDs and still have money left for your favourite pumpkin latte.

For comparison rsync.net says it's $0.015 per GB/Mo, for 2TBs that's $30/m and no BW costs.


Not in any way affiliated but I'm a happy user of Scaleway's Object Storage [0] together with S3QL [1]. It's not the fastest but they give you 75GB of storage for free so that's a fair trade [2].

[0] https://www.scaleway.com/en/object-storage

[1] https://github.com/s3ql/s3ql

[2] https://www.scaleway.com/en/pricing/?tags=storage


I'm renting a $15/mo 2 TB Atom machine from OVH/Kimsufi as a second target for backups.

Now that I think about it... some kind of micro-distributed backup server (throw it on a few of your machines, auto-replicate between them) would be a neat project...


It's not even that neat.

Just slap rsync/syncthing to the backup dir.


I do use syncthing on a NAS + remote cheapo server for my day-to-day stuff, and Bareos for the rest.

It's just a PITA to add another instance.


Curious how much you back up, which version of restic you're running, and why you think the deduplication is borderline unacceptable. There were several major (orders of magnitude) improvements made to pruning within the past ~1 year, that's why I'm interested.


A straight upgrade, that I can do :) It's been running for years without one.

I was only edgy about it because when it takes 36h it blocks the next daily backup, and I wondered whether that was going to get worse (it hasn't).


The max-unused percentage feature is well worth it to 80/20 the prune process and only prune the data which is easiest to prune away (i.e. not try to remove small files from big packs but focus on packs which have lots of garbage).

In general, there's an unavoidable trade-off between creating many small packs (harder on metadata throughout the system, inside restic and on the backing store but more efficient to prune) versus creating big packs which are more easy on the metadata but might create big repack cost.

I guess a bit more intelligent repacking could avoid some of that cost by packing stuff together that might be more likely to get pruned together.
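For example, something like this (repository URL is a placeholder):

    # tolerate up to 10% unused data in packs instead of repacking aggressively
    restic -r s3:s3.amazonaws.com/my-restic-bucket prune --max-unused 10%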


Tarsnap is undoubtedly expensive, but it also donates to various efforts!

I mean.. you could purchase a cheaper service and also donate to various efforts. Bonus: Then you'd also be able to pick those efforts.


How do you compare the two, price-wise? With Restic, you have to provide your own storage.


Aren't these storage prices absurd? Please let me know if I'm misunderstanding.


The prices are absurdly high if your use-case is storage of large volumes of data that regularly change. It wouldn't be sensible to use Tarsnap for that, and you probably want to use one of the bulk backup services instead.

Tarsnap makes a lot of sense when you benefit from the encryption and (especially) de-duplication features that it offers. For me, all of my most important personal and business data, from multiple decades, compresses-and-deduplicates down to around 6GiB. Considering the high value of the data I store in it, tarsnap's pricing actually feels absurdly low.


> Tarsnap makes a lot of sense when you benefit from the encryption and (especially) de-duplication features that it offers.

Can you provide more detail why you think so? I don't believe there is any use case in which tarsnap makes sense, other than maybe some Plan-C backup solution which you fall back on in the highly unlikely event that neither Plan-A nor Plan-B worked.

Concretely, what benefits does tarsnap offer over restic or borg in combination with rsync.net, to make up for the substantial downsides (such as insanely slow restore, complete lack of wetware redundancy or being written in C[1])?

[1] https://www.tarsnap.com/bounty-winners.html


I use tarsnap because the asymmetric crypto means I can give my cron job authorization to create backups, but it doesn't have authorization to read or delete(!) backups.

This ability is critical to prevent a compromised system from having its data wiped and having all backups wiped as well.

I haven't been able to figure out how to do this in any other system. But if someone has a tutorial, I am all ears.
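For reference, the Tarsnap side of this looks roughly like the following (key file paths are placeholders):

    # derive a write-only key from the master key (done once; keep the master key offline)
    tarsnap-keymgmt --outkeyfile /root/tarsnap-write.key -w /root/tarsnap-master.key

    # the cron job uses the write-only key: it can create archives,
    # but it cannot list, read, or delete them
    tarsnap --keyfile /root/tarsnap-write.key -c -f "home-$(date +%Y%m%d)" /home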


Replying to myself:

I've been musing on this subject all afternoon. I'm a user of Tarsnap, and I do find it expensive, in the sense that I would prefer to back up larger amounts of data for less money. At the moment I back up photos separately from Tarsnap and in an ad hoc way.

But I still cannot figure out a way to get all the benefits I get from Tarsnap from any other software solution.

* Must be usable under Nixos.

* Backups must be asymmetrically encrypted so that backups can be automated, yet a compromise of the system cannot immediately gain read authorization to archived data.

* Backups must be append-only without further credentials, or otherwise prevent a compromised system from being able to delete existing archives.

* Deduplication between archives while still allowing archives to independently be deleted.

Using the ZFS snapshot functionality with rsync.net, for example, with Duplicity comes close. However, as I recall, duplicity wants to do regular (typically monthly) full backups and then incremental backups from there. You cannot remove these full backups without deleting the entire month's worth of backups, and because the full backups are independently encrypted, there is (of course) no deduplication between full snapshots, even though the data is still likely largely the same. And because the snapshots are encrypted, it is impossible for the rsync.net storage to see or even know that large parts of the encrypted data are identical.

AFAICT there is really nothing else that does what Tarsnap does.


IMO AWS S3 works fine for this:

* Create an S3 bucket and enable versioning

* Create a new user and give it only s3:PutObject on your new bucket

* Create an auth keypair for that user and put it on your server
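A rough sketch of those steps with the AWS CLI (bucket and user names are placeholders):

    aws s3api create-bucket --bucket my-backup-bucket
    aws s3api put-bucket-versioning --bucket my-backup-bucket \
        --versioning-configuration Status=Enabled

    # a user that can only add new objects; with versioning on, old versions survive overwrites
    aws iam create-user --user-name backup-writer
    aws iam put-user-policy --user-name backup-writer --policy-name put-only \
        --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"s3:PutObject","Resource":"arn:aws:s3:::my-backup-bucket/*"}]}'
    aws iam create-access-key --user-name backup-writer   # these keys go on the server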

Now any server compromise that gets those keys can only add new data to your backup bucket, and can't read, overwrite, or delete any previous backup.

There's no dedup, so that could be a deal-breaker.

There's also no real encryption (though that shouldn't be too hard to add I guess). I don't really see the gain though. Anyone who compromises the server keys is blocked from reading by AWS permissions. Granted, that's not quite as reliable as good crypto for blocking reading, but on the deleting side, there's never going to be anything but the auth system of whatever solution you're using to block that.

I get that there's some applications out there where preventing data exfiltration is important enough to need strong crypto (though is that really important when we're talking about full compromise of your server, which gets the attacker direct access to the data anyways?), but I decided that the risk of failing to implement properly or full data loss due to losing the keys or them being corrupted wasn't worth the risk of blocking somebody who somehow compromised the AWS account security from being able to read backup data.


The lack of deduplication isn't necessarily a deal breaker. Let's see.

My main machine is currently storing 1.6 TB (compressed) of total archives with tarsnap, but only 33 GB (compressed) of unique data within those archives. So if S3 is 50x cheaper, then not having deduplication would be a wash.

However other comments here suggest that S3 is only 10x cheaper.
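Spelling that out with the S3 price quoted elsewhere in the thread:

    1.6 TB / 33 GB ≈ 48x deduplication ratio
      33 GB x $0.25/GB-month  (Tarsnap, after dedup) ≈  $8.25/month
    1600 GB x $0.023/GB-month (S3, no dedup)         ≈ $36.80/month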


Restic's rest-server does this too, or afaik you can configure restic to use S3 with object locks or whatever it's called
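Roughly, the rest-server route looks like this (paths and host are placeholders):

    # on the backup host: serve the repo and refuse deletes/overwrites of existing data
    rest-server --path /srv/restic --listen :8000 --append-only

    # on the client: back up over the REST backend
    restic -r rest:http://backup-host:8000/ backup /home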

Edit: just saw your sibling / reply-to-self comment. This setup would fulfill the requirements you posted, or at least I would assume that restic runs under (or compiles for) your nix OS. It doesn't use asymmetric encryption for this but the goal of append-only is there

> because the snapshots are encrypted, it is impossible for the rsync.net storage to see or even know that large parts of the encrypted data is identical

If they don't see a large amount of data incoming, they'll know large parts of the data are identical (or removed, I suppose). Hiding traffic volumes is fundamentally only possible by introducing dummy data


Rsync.net can do similar, because of their snapshot system. By default there's no way to delete a snapshot except through the schedule set up (you can write to them and ask if it's necessary for some reason). It doesn't use asymmetric crypto to do this, but that is neither necessary nor sufficient for the purpose of preventing accidental or malicious deletion of backups.

https://www.rsync.net/resources/howto/snapshots.html


Right I've considered that. It is however limited to like 7 snapshots.

The thing is that tarsnap deduplicates over arbitrarily long time periods, letting me make arbitrarily long staggered sequences of retained archives.

Perhaps I should really reconsider if I really need such long lived archives, but it is hard to bring myself to drop them.


You can do the same with rsync.net, they just charge you for the extra space (the differential space, like with tarsnap) instead of providing it for free (from what I can see the limits in the web UI are like 1000 daily and weekly snapshots, 200 monthly snapshots, 100 quarterly snapshots, and 10 yearly snapshots, which I suspect are arbitrary 'good enough for most' numbers, not some hard limit based on what they can profitably provide). I personally use the direct ZFS option so I can set up the snapshots exactly how I want, but it is extra effort and doesn't provide quite as good a guarantee that they won't be overwritten (it's resilient against a compromise of the server uploading the backups because I've set up scripts that way, but it doesn't protect against compromise of the login credentials for the VM in the same way).


Oh thanks. I did not know that. That does seem good enough.

I just now need a deduplicating asymmetrically encrypted backup program.

I've tried duplicity in the past, and maybe I should try it again. But my recollection is that duplicity will just fail to do backups at the slightest hint of any problem. Like maybe if the last backup was interrupted then no more backups for you until you attend to it.

Edit: More memories returning of having to dig out my decryption key to resync the metadata when duplicity gets unhappy, and then, since my target server was append-only, duplicity was upset when it wasn't allowed to overwrite any of its incomplete metadata files. I guess the ZFS snapshot technique would alleviate the latter issue.

To be fair, if tarsnap gets confused it needs the keys to do its fsck command, but I recall this sort of thing happening regularly with duplicity and almost never with tarsnap.


No, not limited.

An rsync.net account can have any arbitrary schedule of snapshots - including days, weeks, months, quarters and years.


My post was specifically addressing the comments around cost.


No, you specifically claimed that "tarsnap makes a lot of sense" in certain situations. I think that's incorrect and that there are basically no situations where tarsnap makes sense as a primary or even secondary backup solution. Even when completely ignoring costs. This is a strong claim, so it should be easy to provide a counter-example if one exists.


That would be true if there were zero other options that offer encrypted backups, but other software offers that too. Many also offer deduplication. And deduplication is less of a needed feature if you don't pay through the nose per GB.


It's insane. Not sure how anyone can accept such rip-off pricing.

Tarsnap : $0.25 / GB storage, $0.25 / GB bandwidth cost

rsync.net : $0.015 / GB storage, no bandwidth cost

s3 : $0.023 / GB storage, some complicated bandwidth pricing

If tarsnap is built on top of S3, they're charging 10 times the storage cost. Easy money from the uninformed?


How's the saying in every HN thread go? "Don't set your prices based on your costs, set your prices based on the value you deliver." or something like that.

Tarsnap is a wonderful piece of software. You're paying for that.

That said, is the value of "Tarsnap" worth the price difference from "Borg+rsync.net"? (Or Restic, I've been meaning to look into Restic). I'm not so sure. These days I'm a customer of rsync.net, not of Tarsnap.

But I still firmly disagree with the "Colin's just exploiting the uninformed" angle.


Completely ignoring costs, can you name a single use case for which tarsnap would be better than Borg or restic on rsync.net?


As a consultant on a DR project, you can increase your billables by 10x due to the extremely slow backup restore speeds.


Backing up anything Windows with more granularity than top level directories?

Ugh.

Try picking and choosing specific file types or file extensions from filesystems holding thousands of files.

I ended up having to cobble together some god-awful pre-process powershell with multiple pipes just because restic fails to be able to grep using Windows reliably.

:(


> restic fails to be able to grep using Windows reliably

That is news to me. I backup almost a million files spread across 4 Windows devices, with heavy use of --files-from and --iexclude and it seems to work. What am I missing?

I agree that restic filtering options are pretty limited. Too limited, really. But what's there seems to work?


Compared with restic, Tarsnap performs asymmetric encryption, which lets you run automated backups without needing to enter any passwords (or otherwise storing your encryption passwords in plain text).

Compared with duplicity, Tarsnap does full deduplication across all backups for any given "machine", while still letting you independently remove any snapshots you like. i.e. no special "full snapshot" that must always be kept around, and no need for multiple full snapshots that have no deduplication between them.
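
A minimal sketch of what that looks like in practice (archive names and paths are just illustrative):

  # Each run creates a self-contained archive, but only blocks not
  # already stored for this machine are actually uploaded.
  tarsnap -c -f home-2023-07-27 /home/me
  tarsnap -c -f home-2023-07-28 /home/me
  # Any archive can be deleted on its own; the remaining archives
  # still restore in full.
  tarsnap -d -f home-2023-07-27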


rsync.net is also overpriced for strictly backup purposes. Make sure you do check out restic; it can use S3 or Backblaze B2 (I actually use both) as backends instead of something expensive like rsync.net. The value of these boutique storage services evaporates when you start using restic.
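
A rough sketch of that setup (bucket names and paths are placeholders; credentials and the repository password come from environment variables):

  # Backblaze B2 backend: restic reads B2_ACCOUNT_ID, B2_ACCOUNT_KEY and
  # RESTIC_PASSWORD from the environment.
  restic -r b2:my-backup-bucket:host1 init
  restic -r b2:my-backup-bucket:host1 backup /home/me
  # Same idea with S3 (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY), after
  # an equivalent init of that repository:
  restic -r s3:s3.amazonaws.com/my-backup-bucket backup /home/me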


The complexity with S3 or Backblaze B2 is the granular pricing for various operations on the service. It’s difficult to optimize costs compared to a fixed “$X per GB” price, which is easier for people to understand and forecast. Other than that, rsync.net provides the same pricing for its different data center locations, which means there is some subsidization going on.

There are services like rsync.net that support borg at a lower price. Borgbase is one of them. I haven’t used either of these.


> There are services like rsync.net that support borg at a lower price.

And rsync.net is even one of them!

"Special "borg accounts" are available at a very deep discount for technically proficient users." -- https://www.rsync.net/products/borg.html

...hrm, it seems they didn't update that page with last year's price drop. https://web.archive.org/web/20220319135035/https://www.rsync... It used to be a deep discount, now it's the same for <100TB. I wonder if they did drop the Borg prices too and just forgot to update that page?


I consider zfs far more reliable than restic, which was still rough around the edges the last time I tried it a few years ago.

Running rsync to the target and forgetting about it is quite easy, though I admit rsync.net's deal is getting worse these days, imposing minimum usage requirements here and there.


Yes. That's the thing about Tarsnap, a service with a TikZ diagram on its front page, built around a Unix utility, that meters in picodollars. It's meant to bilk money from uninformed mom-and-pop backup users.


I'm having a hard time believing that anyone remotely interested in Tarsnap's value prop is also an "uninformed mom-and-pop".

This "uninformed mom-and-pop" is potentially compiling the client application from source, but can't do basic math to compare tarsnap's pricing to the top 20 or so competitors that rank above tarsnap in SEO?


That was sarcasm. :)


Should have looked at the username, massive egg on face.


I tried Tarsnap briefly once and was charged billions of picodollars. It definitely preys on the ignorant.


S3: Upload bandwidth is free, download is what I'd consider to be astronomical at $0.09/GB ($90/TB).

Geez, that's really not improving the comparison with Tarsnap.


If you are primarily cost-driven, you missed one:

Backblaze: $0.005 / GB storage, $0.01 / GB download.


You can send your backups to backblaze and S3 and it still would be cheaper


There is also the cost of development and ongoing support. You haven't factored that into your "10 times the storage cost" calculation.


The 10x doesn't seem like it's enough to pay for more than a single EC2 server though.


Sounds like a very poor excuse when competitors are way cheaper.


You don’t really need an excuse when people are paying regardless.


>Easy money from the uninformed?

I don't think so. Anyone who can use this software I'm sure knows what other options exist.


Then I'd like to know what they think of the benefit of spending $25/mo for just 100GB.


This is why I use s3 sync, versioning and lifecycles for mine on a Standard-IA bucket. My 120GB costs $1.80 a month. No way would I pay tarsnap prices.

The 120GB is the contents of my OneDrive and local repository trees. This is everything I've ever done that I want to keep, and it's approximately 115GB of photos and not a lot else!
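
A rough sketch of that setup with the AWS CLI (the bucket name, paths, and lifecycle policy file are placeholders):

  # One-time: turn on versioning so overwritten/deleted files are kept.
  aws s3api put-bucket-versioning --bucket my-backup-bucket \
    --versioning-configuration Status=Enabled
  # Add a lifecycle rule (defined in lifecycle.json) to expire old
  # noncurrent versions after some number of days.
  aws s3api put-bucket-lifecycle-configuration --bucket my-backup-bucket \
    --lifecycle-configuration file://lifecycle.json
  # Regular sync, writing objects straight to Standard-IA.
  aws s3 sync ~/OneDrive s3://my-backup-bucket/onedrive \
    --storage-class STANDARD_IA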


> If tarsnap is built on top of s3, they're charging 10 times for the storage cost. Easy money from the uninformed?

That's pretty much any SaaS... look at the various log or metrics gathering solutions, where you pay serious multiples of what it would cost to run the same software on your own instance.


You can get a HN reader’s discount on rsync.net (email them to ask for it or search on HN), bringing the price down to $0.12 / GB, and everything else remains the same.


The HN reader’s discount is lower than that these days (we probably "normalized"[1] the parent's account to reflect that).

We also have .edu / student / nonprofit discounts. Email us.

Finally, Debian and FreeBSD project members get free accounts. See the committers handbook, etc., for details.

[1] Whenever we lower our prices, we increase quota on existing customers to "normalize" them to the new price/GB. If you do nothing, your rsync.net account just grows over time due to this.


I'm not here to defend it but I only use it as a secondary backup for my mail server (so basically append-only) and for a low amount of gigabytes it's fine and just works. No, I wouldn't want to write changing 100GB there.


in that case I'm curious about all the downvotes for asking a reasonable question


Well, while I use Tarsnap for a very small amount of data (due to pricing) and I quite like it for my use case, your question might have been downvoted for one of two reasons:

- your comment was a very valid question but rather quip-like and offhanded, it just seemed a bit off… something like that

- Tarsnap is an hn darling

If I have to pick one I think it’s the latter :)


For anyone not familiar with the (now 16 year old) comment that Colin is perhaps best remembered here for - https://news.ycombinator.com/item?id=35079

He’s brought far more value to the community than that, of course.


There’s also the fact that he has quite publicly been repeatedly nagged to raise his prices by another “HN darling” and he has been resistant to it. It’s actually quite an interesting read that covers a lot of the things brought up in this discussion:

https://www.kalzumeus.com/2014/04/03/fantasy-tarsnap/


To be fair, any backup service should probably have a model of "pay per length of backup stored".

If I want to store my 100GB of data now, and I want to have it stored for a year, I want to pay for that year's worth of storage of 100GB of data now and not worry about any money or account problems for that bit of data.


But basically what he suggests is turning Tarsnap into yet another SaaS offering I wouldn't give a second glance to. The screenshot of his proposal for the new entry page says... nothing.

Which is probably okay if you want to pivot from a geek-ish service to one that geeks don't use, of course. Does the owner want that?


If it's the latter, then the quality of readers in here is quite screwed up, trying to make people endorse crazy pricing just because he's some known guy around here.


> because he's some known guy around here

Oh dear. It’s an HN thing. I have had brushes with it only once or twice across various accounts across years but it’s very much an HN thing.

Whenever you see an utterly useless quip (or sometimes even name calling or offensive words) being heavily upvoted you should know that some alpha HNer has arrived on the scene :)

But to be honest I have never seen the author of Tarsnap engage in such privileged gentlemanly d-baggery. He is quite cool, as they say.

Anyway I just ignore it and move on. But again OP could have worded the question better. I mean no matter how good or bad you want to feel about it — it’s just a vc run anon forum and just another forum.


I definitely could have worded it better, but it's a little crazy how different the discourse is when the subject of a discussion is something HN loves versus something it hates.


I didn't mean for it to sound like the first one but I understand ;)


Don't want to speak for Colin, but every time this is brought up, it's explained that Tarsnap uses very little data due to its design. Probably much less than rsyncing your data every hour to a cheaper provider.


Since pricing is purely based on storage used, it's very cost efficient for certain use cases.

I've been using Tarsnap for 10+ years. There's some Linux stuff getting backed up, configs and such. It costs next to nothing for this kind of usage.


It’s meant for people who have a lot of duplicate data and store small files. Anyone who has data that cannot be deduplicated much would be paying tons of money.

While on the price, patio11 (Patrick) has written an article about tarsnap’s issues more than nine years ago (April 2014). One of the suggestions was to raise prices, IIRC. It’s a long post, but you can read it [1] and the HN post [2] from that time.

[1]: https://www.kalzumeus.com/2014/04/03/fantasy-tarsnap/

[2]: https://news.ycombinator.com/item?id=7523953


yes, people have been saying they should "charge more" for over a decade


Can you not be snide and please help me understand? It seems 50 times more expensive than B2. I'm genuinely curious about the product.


“What I Would Do If I Ran Tarsnap” goes into a lot of detail:

https://www.kalzumeus.com/2014/04/03/fantasy-tarsnap/


Roughly: The number of hours of my time that would be required to get something with even theoretically equivalent features would be sufficient to make the cost - and opportunity cost - involved seem far more reasonable.

Plus "written by cperciva and heavily battle tested by Serious Sysadmins" is a feature I couldn't recreate myself - notice that while there was an outage, part of the reason for it taking a while was a conscious choice to take a much longer path to resolution than bringing up the previous server in the name of paranoia. Paranoia about data corruption is a nice thing to have in a backup system and something I'm happily willing to trade-off uptime for.

However: For backups of bulk data then, yes, it's going to be relatively expensive. I wouldn't put e.g. my media backups on tarsnap, but "use tarsnap for your git repositories and other high value data, and something else for the rest" is both perfectly doable and an approach I suspect cperciva himself would endorse.


> notice that while there was an outage, part of the reason for it taking a while was a conscious choice to take a much longer path to resolution than bringing up the previous server in the name of paranoia. Paranoia about data corruption is a nice thing to have in a backup system and something I'm happily willing to trade-off uptime for.

As an Actual Serious Sysadmin that Actually Manages Big Systems for a Living, that screams lack of preparation more than anything else.

Yes, you should be careful, but you should also have procedures in place and know the system well enough to trust it. And the fact is that the "boring" architecture of an RDS DB, instead of that S3 database abomination thing, would just start right up if the master DB server failed.

It honestly looks like a trap many intelligent people fall into, where they turn their cool-but-ultimately-flawed mental exercise into the bedrock of the product. I don't want to use a baby's-first-database on my production servers (I'm looking at you, Lennart Poettering and journald) and I don't want my data/metadata stored in some experimental one.


Your comments seem to me to be about how long it took to bring up a new system.

Without agreeing or disagreeing with those, "I'm not going to trust the filesystem on the existing machine" was the choice I was talking about.


You don't need to build it yourself. You can use restic or rclone with S3/B2 and both support encryption.


I am aware there are existing tools that handle encryption.

The tarsnap architecture still does more things.

You're welcome to feel that you don't need those things, but that wasn't my point.


The service is a layer on top of S3 if that helps answer things.

https://www.tarsnap.com/faq.html#is-tarsnap-reliable


Then the obvious question is "why would I use this instead of something else over S3" (ex. rclone), to which I think the answer is ease of use (don't need to deal with AWS yourself, encryption/deduplication/compression handled for you, nice interface), which isn't everything to everyone but is certainly useful.


You need a remote service that keeps backup readonly. You’re not covering attack scenarios if you just use raw object storage from your client machine.

I have written about this some time ago if you’re interested: https://www.franzoni.eu/ransomware-resistant-backups/


I'd classify that under "ease of use" - you can do it with S3 yourself (your post is a pretty good explanation of the how, from a quick skim), or you can just use tarsnap and not worry about it.


You can see from my post that doing that _properly_ is quite convoluted and requires a good deal of technical skills.

So it's not just ease of use. It's actual _functionality_ to me - getting from raw object storage to a fully working, attack-resistant backup strategy, is not trivial; hence, comparing tarsnap (or rsync.net, or borgbase, or whatever) to B2 or S3 makes little to no sense.

You _could_ compare it to crashplan or backblaze personal backup if you like, but IIRC those don't work for *nix systems, only for Win and Mac.


How does tarsnap keep backups read-only? Just having the service act as a barrier is not enough.


It supports distinct authentication keys with read, write, and/or delete permissions for the same data protected by a given encryption key.

Those restrictions are enforced by the service.
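
Roughly, via tarsnap-keymgmt (the file paths, email address, and machine name below are placeholders):

  # Register the machine once; this produces a key with read, write and
  # delete permissions.
  tarsnap-keygen --keyfile /root/tarsnap.key \
    --user you@example.com --machine myserver
  # Derive a write-only key for the automated backup job; a compromised
  # server holding only this key cannot read or delete existing archives.
  tarsnap-keymgmt --outkeyfile /root/tarsnap-write.key -w /root/tarsnap.key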


So one bug and it is gone.

I thought it used the read-only features of S3/Glacier or something…


Learning to use some backup tool that does the same things sounds way better than paying 10 times more in ongoing storage costs.

There's also a service like rsync.net where you can just rsync to the destination and they do the versioning and so on, for less than a tenth of the cost of tarsnap.


Eh, cost/benefit; some people are backing up 100MB of documents and don't care, some people are backing up terabytes of media and have the time.


Thank you.


i'm only being snide if you consider the prices high; i don't have to agree with that


Everything about tarsnap is absurd. It's basically the world's most absurd backup service (insanely expensive, poor UX, bus factor of ~1, restoring moderate amounts of data appears to take days (!)[1]), brought about by an absurdly bad allocation of human capital (it's run by a double Putnam challenge winner, with several other impressive accomplishments), and as such, absurdly beloved by HN.

[1] In case of an emergency, you will always be able to get back your data from tarsnap at a blazing rate of 50kB/s https://github.com/Tarsnap/tarsnap/issues/333.


If tarsnap has even a modicum of popularity, thanks to these prices it would be bringing in bank. If he's making bank, that means he's providing value (even if it's just to "uninformed"). And it seems this system mostly runs itself, so it's a side gig. It's probably a far more effective allocation than many other possible allocations of human capital.

How many of the world's best and brightest are doing all sorts of busywork? At least Colin has some time to do whatever he wants to do while running tarsnap.


> If tarsnap has even a modicum of popularity

It won't, though, because of the points mentioned by the post you're replying to. It's been 15 years; tarsnap is as popular as it's going to get.


> If he's making bank, that means he's providing value

I don't find that that logically follows from making bank. Not everyone who makes bank is a positive influence.

Tarsnap does provide value, even if I think it's less than its cost: I'm just commenting on the general case that making money would mean you're providing good value



Not to be that guy, but it’s unreadable on iOS, whether zoomed in or in reader mode, in either portrait or landscape.

Colin, could the website be updated to the 2010s? :P


"Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting."

https://news.ycombinator.com/newsguidelines.html


Shame. It actually spawned an interesting discussion about reader mode, ASCII emails, browser rendering engines and the meeting of old and new. And I learnt things.


I believe you and yes those are upsides - but we have to moderate for how these things work in general, and in general the downside of such digressions is much greater.


It's not Colin's fault that you're using a browser that can't render an html rendition of an email which has been widely in use since before iOS existed.

This is entirely Safari's fault for not having good compatibility with a common existing webpage format.

Anyway, if you're the intended audience (someone using tarsnap), you also received a copy to your email address, where you can read the text with your email reader of choice.


Is <pre> really the correct kind of HTML to use for the body of a plain-text email? It looks like paragraphs of text to me.

<p> is far more appropriate

That isn’t apple’s problem, nor mine.


pre is the only correct element to use since in many emails, the exact formatting and linebreaks and such are important.

For example, a code review on a mailing list can only make sense with the linebreaks and spacing preserved.

However, as you knew to try, there is "reader mode", which is meant to heuristically ignore the exact html in order to display textual content.

Firefox's reader mode has no trouble figuring out that this is a block of text that can be reflowed.

Safari's heuristics clearly fall short on one of the more common kinds of textual blurbs you might want to reader-view-ize.

Seems like a safari problem to me.


> Firefox's reader mode has no trouble figuring out that this is a block of text that can be reflowed.

It was absolutely unreadable on mobile Firefox (Android), initially didn't even think to use the reader mode, which indeed did help make it readable!

I think this is the first time I've ever actually needed that functionality, never quite got it beforehand. Thanks for the suggestion!


Ok you changed my view


<pre> is the correct HTML. People writing plain-text email expect to be able to do things like add ASCII diagrams:

  -------        -------
  | foo |  --->  | bar |
  -------        -------
It's an older technology but it checks out.


it's a plain text email, which is exactly the sort of thing <pre> is for - pre-formatted text


The HTML rendition in this case did its best to be hard to read.

It’s not impossible, and it's likely just a fault of whatever mailing list software is used, but it could be better, and it’s nice if people let you know, right?


How can simple black-on-white text with a legible font be hard to read?


It doesn’t adjust to my screen size. So in portrait mode I have sentences with 80 characters per line that are 5 points wide.

That’s barely possible to read on a high dpi screen, but fairly uncomfortable.


I understand some websites are doing lots of funky stuff, but at this point the issue is with your particular client if it cannot zoom bare text correctly and smoothly. Any decent client should let you zoom so that those 80-character lines fill the width of your phone screen in portrait mode, regardless of font size and dpi.


Try it if you don't have perfect 100% vision (20/20, as it's called in the USA, I think?) and have a phone that fits in a normal pocket.

Optimal reading width for speed/comprehension is also fewer than 80 characters afaik. I think different sources I read years ago were undecided whether it's closer to 60- or 70-character lines. Either way, rescaling when there is no ASCII art or position-dependent characters seems rather basic and <pre> disallows that


I am using prescription glasses. That is what they are for. If you are using yours and still can't read small text, you need to visit your optometrist and get new ones.


What about us poor slobs whose vision is less than perfect where the 80-character line is too small even when it takes up the entire width of the display? The way zooming is implemented in the mobile browsers, we must horizontally scroll back and forth for every line we read.


I am using prescription glasses. That is what they are for. Maybe you should use yours?


I don't think my glasses can correct my vision well enough to be able to read 80-char-wide text on a smartphone in portrait mode. I can't say for sure because I don't own a smartphone--partly because I worry about being able to read web pages on it without regularly resorting to the aforementioned tedious horizontal scrolling.

I always wear my glasses when I'm using my iPad, which is in landscape mode almost all the time (the exceptions being the rare apps like Uber's that won't do landscape mode).


How should a browser differentiate between a <pre> of hard-wrapped prose (that it could reflow in reader mode) and a <pre> of code?


In an email you can't of course since an email could contain one, the other, or a mix. And, in fact, most mailing list emails I read these days do contain code and code reviews.

However, if the user clicked the "reader mode" button, that's a good sign the user thinks this is reflowable text. Firefox's reader mode figures this out. Safari's doesn't.


Or you might use Reader Mode on a programming blog, which has both prose and code, and you wouldn’t want to change the code in that case.


Just FYI, Firefox Reader mode works great with it.


"Great" is a bit of a stretch as there are random short lines where they were originally hard-wrapped. But it certainly does make it readable.


It's off-the-shelf MHonArc[1]. If implementing a decent mailing list archive were a prerequisite to launching a business, no business would ever be launched.

[1]: https://www.mhonarc.org/


edit:

i assumed the parent did not know how to do that, i tried locally and it seemed to work, but i did not pay attention to the text

original:

on the left side of the url input field you'll find "AA" (the first A smaller than the second), tap that

then, near the bottom of the pop-up menu you'll have "Show Reader", tap that

if you're not happy with the text as displayed then, you can go back to the "AA" menu and change the options


[flagged]


oh shit, i'm sorry


This should work in reader mode: https://pastebin.com/raw/hanm8mgG


Thanks!


Works fine for me, on iOS, in safari.


It's a mailing list archive. Use a real computer.


How dare I use my phone before I get to my computer.


>The process of recovering the EC2 instance state consists of two steps: First, reading all of the metadata headers from S3; and second, "replaying" all of those operations locally. (These cannot be performed at the same time, since the use of log-structured storage means that log entries are "rewritten" to free up storage when data is deleted; log entries contain sequence numbers to allow them to be replayed in the correct order, but they must be sorted into the correct order after being retrieved before they can be replayed.)

Far be it from me to tell anyone how to write software, but why build a database on top of S3 when you can just chuck the metadata into RDS with however much replication you want?

The backups themselves should be in S3, but using S3 as a NoSQL append-only database seems unwise.

This would benefit from being further from the metal.


Forget about software or data architecture. S3 is the most reliable data storage mechanism in the world, and insanely simpler than a relational database. There is no operational failure mode to S3, other than "region went down". There is no instance to go down, no replication to fail, no worry about whether there's enough capacity for writes or too many connections, no thought to a schema, no migrations to manage, no storage to grow, no logs to rotate, no software to upgrade on a maintenance window. Plus S3 is versioned, has multiple kinds of access control built in, is a protocol supported by many vendors and open source projects, and is (on AWS) strongly read-after-write consistent. I would also argue (though I don't have figures) that it's faster than RDS. Almost guaranteed it's cheaper. And it's integrated into many services in AWS making it easier to build more functionality into the system without programming.

On a less technical note: Always avoid the fancy option when it makes sense. (From a veteran of building and maintaining large scale high performance high availability systems)


The "fancy" option here is trying to act S3 to act like database instead of simple blob storage...


Fancy would be writing an application to talk to RDS, creating an RDS instance, creating a database, creating whatever IAM link is needed for auth into the db so you don't need a second set of credentials, creating a schema, creating columns with different data types, and then modifying the application to handle edge cases for the different data types, logic to insert, update, delete rows, select items, yadda yadda yadda.

Dumb is `aws s3 cp` and being done in 5 minutes.


I envy a job where you can think of a database as something fancy and hard to do...


[flagged]


This comment is unnecessarily antagonistic, and you are consistently and willfully missing all the points in your replies.

He could have set up a more complex architecture and paid much more in hosting costs over the years to overengineer the solution. What would the benefits be? It might have avoided this one outage or saved a few hours restoring the data. The drawbacks? Much more time developing and maintaining the solution and higher subscription costs for users.

The solutions you are familiar with and comfortable with are perfectly valid. But you are falling into the trap of thinking "what I'm familiar with and comfortable with is the only valid answer and everyone else is wrong and stupid".


Please see https://news.ycombinator.com/item?id=36897868 and please stop posting in the flamewar style to HN. Regardless of how right you are / how much smarter you are or feel you are, it's exactly what the rules here ask you not to do. We're trying for a very different quality of conversation here.

https://news.ycombinator.com/newsguidelines.html


You're right, that was uncalled for. I apologize for lowering the quality of the site.


Thanks for the kind reply.


I'm only wildly guessing here, but most likely the "cloud storage" backing all those managed databases is actually S3-like blob storage under the hood.


The database engine implements ACID and crash recovery so you don't have to. This is precisely what failed in this outage.


>S3 is the most reliable data storage mechanism in the world

S3 is not the problem here. The problem is building a database on top of S3, and having to reimplement all the consistency, atomicity, transactions etc. on top.

>no thought to a schema, no migrations to manage

There is, in fact, always a schema. Some people choose to ignore it's there, to their detriment.

>Always avoid the fancy option when it makes sense.

It's not the 1980s. Postgres is not fancy, and Greenspunning it is a mistake.

>Almost guaranteed it's cheaper.

Cheaper than a 26-hour outage?


> having to reimplement all the consistency, atomicity, transactions etc. on top.

Most of those problems are moot if you're only ever writing from a single head node. If all your data is strictly ordered and you have no meaningful concurrency, this is a far, far simpler system.


Did I fall into a timewarp into the 70's? How on Earth, by what sane standard, is a Postgres instance too complex? If you're `fopen`ing files as a "database" you are wasting your time and lowering the world's economic productivity.

Complex is Greenspunning a database and having it blow up in your face and cause a twenty-six hour outage. You never hear about such things with Postgres because Postgres is rock-solid.


I'm not defending the choice to use S3. I probably wouldn't have made the same choice. But I am pointing out that it's empirically wrong to say that storing data in flat files necessitates the considerations of an ACID-capable RDBMS.

But to your point, if your system requires less than a thousand lines of code to open a file, do basic parsing and processing (which no data storage system is going to do anyway), and write the output to another file, I personally can't say that Postgres or MySQL or any other solution is really worth the effort/cost to build and maintain. In the system being discussed, the benefits of an RDBMS simply don't matter: any strongly consistent key-value store would work.

> Complex is Greenspunning a database and having it blow up in your face and cause a twenty-six hour outage.

S3 didn't cause the outage, and from the look of it, neither did the code that processes the files. It was an application logic problem which caused issues during the restore process, and this would have been an issue regardless.

You could make an argument that the recovery being slower than it could have been was a problem, but it's wild to say that and imply that traditional databases have no performance cliffs. Especially when dealing with corruption or data recovery. Raw file storage will never have a Postgres transaction id wraparound incident (see: Sentry outage for most of a day in 2015, MailChimp/Mandrill for over a day in 2019) or have to rebuild a critical index.

In this case there was a hard coded concurrency limit with S3 of 250 outstanding requests. Bumping that up to 3000 would have been easy and reasonable (S3 rate limits at 5000). How confident would you be that your database can performantly handle a backfill during recovery? Have you provisioned enough iops? Are you running an RDS instance with only a burstable vCPU limit? To say Postgres is "rock-solid" (and make no mistake, I am a Postgres fanboy) dismisses the many and varying ways that it can fail in unusual and surprising ways.


[flagged]


Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.


> having to reimplement all the consistency, atomicity, transactions etc. on top.

Did you miss where I said it's read-after-write strongly consistent?


Having those things at the base layer is not the same as having them in higher-level behaviour. This is why God invented SQL transactions.


I never knew Donald Chamberlin and Raymond Boyce were God, it doesn't sound right but I don't know enough about theology to refute it.


>Cheaper than a 26-hour outage?

AFAICT, going by HN there's been roughly 30 hours of non-availability over 15 years. RDS didn't even support Postgres when Tarsnap was released.

EDIT: Tarsnap predates RDS.


>Far be it from me to tell anyone how to write software, but why build a database on top of S3 when you can just chuck the metadata into RDS with however much replication you want?

Cost and reliability?

* Using S3 as a simple database is generally going to be much cheaper than RDS.

* If you turn on point in time restore, then losing data stored in S3 is not a possibility worth worrying about on a practical level for most people. RDS replication is easy enough to use, but adds more cost and a little bit of extra infra complexity.


>Cost

It's a bad trade. Thousands of hours of a high human capital computer scientist vs. a few tens of dollars a month for RDS.

>Reliability

Empirically false: none of this would have happened if Tarsnap used Postgres instead of a home-spun database.


>> Cost

> It's a bad trade.

Maybe. But that's the reason. You never acknowledged that advantage in your question so it needed to be emphasized


It never occurred to me that anyone would need it explained to them that RDS is cheaper than the time of any software engineer.

The opportunity cost of building your own database is 10,000x the cost of running RDS for a year.


That sort of logic doesn't really apply here because:

* RDS costs obviously scale linearly with ongoing time and probably scale linearly with the total amount of data being backed up. So depending on the revenue of the business, these extra costs could easily end up outweighing the (notional) cost of the time saved, which is mostly a one-off expense.

* The cost of a software engineer's time is notional in the context of a one-person business. The author of Tarsnap isn't going to be able to employ fewer than zero additional software engineers to maintain Tarsnap because of the time saved by using RDS.


[flagged]


Perhaps I'm biased, but I read every part like:

> I realized that this was introduced by some code I wrote in 2014: Occasionally Tarsnap users need to move a machine between accounts,

with an implicit "and I fixed that code now" or "and I will fix that tomorrow when I get enough sleep". Let's hope he writes later a follow up explaining the details.

> [a] postmortem on the front page of HN

Is that bad? I upvoted this almost instantly, then went to the comment section to upvote cperciva if he was here, then I read the full post and verified my first upvote was correct.


The outage was caused by a hardware failure and (I assume) the lack of any redundancy. Using RDS wouldn't have made a difference as far as I can see.


RDS can have replication.

But more than that: servers should be stateless! A server going down should never take down your business.

If you use Postgres, and stateless servers, then if a server goes down it's no problem, it gets rebooted and there may be other servers and a load balancer to pick up the load. If Postgres goes down, you have a replica, or it gets rebooted, and Postgres always recovers from crashes (in my experience), and if it doesn't you have PITR.

AWS has everything under the sun to prevent this kind of thing happening. This is a 1990's outage. This didn't have to happen.


The hardware failure was on the server running the application code, so RDS replication wouldn’t have helped. You’re right of course that this failure points to a lack of redundancy – but that’s a separate issue from choosing S3 vs. RDS as the data layer.

By the way, S3 is insanely reliable and in fact more reliable than a replicated RDS setup. So switching from S3 to RDS would almost certainly reduce the basic reliability of the data layer, however many conveniences it might bring.


Once again, the problem is not S3, it is reinventing a database on top of S3, the logic of which runs on EC2.


Once again, no, that’s not the problem.

PostgreSQL and RDS are quite a bit more than just a log-structured data store, and are not prima facie the correct solution for this problem domain, regardless of how much arrogant ignorance you bring to bear on the debate.


You broke the site guidelines badly in more than one place in this thread. I realize you're trying to defend someone's work against what you feel is unfair criticism, but breaking the site guidelines yourself, with swipes and name-calling and flamewar, is exactly the wrong way to do this.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


You’re not wrong, and thank you for holding me to account.

In retrospect, I’d delete and/or edit the comments if I could.


No need to delete - the only thing we care about is fixing things going forward. Thanks for the kind reply!


[flagged]


Would you please stop posting flamewar comments and breaking the site guidelines? You've been doing that repeatedly and badly in this thread. We end up having to ban such accounts, and I don't want to ban you.

Fortunately it doesn't look from your recent comments that you've been in the habit of posting this way, so it should be easy to fix.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.


Just … stop. This is the first outage in 11 years.

You’re being unnecessarily arrogant and antagonistic up and down this thread

You’re not smarter than everyone else here, you don’t have better or more perfect knowledge, and you almost certainly wouldn’t have built a better or more reliable system.


There are some established patterns for S3 as a database. It's extremely common in "data lakes" (throw data of various schemas in and use a tool that can parse at query time).

There's client libraries like Delta Lake that implement ACID on S3.

Much of the Grafana stack uses S3 for storage (Mimir/metrics, Loki/logs, Tempo/traces).

That said, I'm not sure about the implementation Tarsnap uses--whether it's completely ad hoc or based on other patterns/libraries.


> This would benefit from being further from the metal.

How, exactly, is that a good thing?


How is not rolling your own database a good thing? Mainly because the business of tarsnap is 1) encrypted 2) backups, not building a database storage engine.


Implementing client-side encrypted, deduplicated, snapshot-enabled backups with server-mediated access control inherently requires building a minimal storage engine to represent your opaque log-structured data.

Embedding the log-structured representation of user data in Postgres would increase complexity and overhead without offering significant resiliency or recoverability advantages — in fact, quite the opposite.


It’s cute that you think implementing client-side encrypted, deduplicated backups doesn’t involve building a database storage engine.


FWIW Tarsnap was launched in 2008, the initial RDS for MySQL was launched in 2009.


You can always self-host Postgres.


The VM crashed, corrupting the file system. This could have made a Postgres database unrecoverable. For rock solid reliability you need more than a database instance.


Make two of them.


Keeping some kind of Postgres cluster running for over a decade seems like a lot of work. tarsnap seems to require roughly no maintenance.


[flagged]


What you see is someone who is actually willing and able to learn from any mistakes during this outage, no matter how small. That degree of attention to detail is exactly what I would expect from Colin. Novelty accounts created with the express purpose of slinging crap however are the equivalent of heckling in a theater, they don't contribute and in this case seem to be motivated by malice.

I do tech DD for a living and pretty much every company could do better if and when something goes wrong, but rarely do companies extract the maximum of learnings from an outage. That is what should impress you rather than to perceive it as a negative.

Note that most companies don't make any information about outages public and note that if and when they do it is usually heavily manipulated to make them look good. Colin could have easily done the same thing and the fact that he didn't deserves your respect, not your scorn. Consider the fact that even the best make mistakes. I'm aware of a very big name company that lost a ton of customer data through an interesting series of mishaps that all started with a routine test and not a peep on their website or in the media. Tens of thousands of people and hundreds of customers affected. And yet, you probably would trust them with your data precisely because they are not as honest as Tarsnap.


There are multiple comments in the post-mortem about what should - in hindsight - have been done instead and I think it's fair to expect that those things -will- get done reasonably soon.

Pretty much all ops problems come down to the interaction of multiple mistakes that hadn't previously been an issue - GCP and AWS post-mortems tend to show exactly that, although usually with somewhat less detail.

So I'd expect that any equivalent service has a similar number of gremlins hiding in their infrastructure and procedures, and I'd suggest to anybody reading this that a 43 minute old account that was created just to post the comment I'm replying to is perhaps not the most reliable judge of competency or otherwise on the part of M. Percival.


> This post-mortem just lists mistake after mistake, but gives no indication as to what the maintainer will do to prevent this in the future.

Each to their own - I myself wouldn't expect that from a comprehensive "what didn't go smoothly" list such as this.

Clearly Colin is aware of every point listed and no doubt is already mentally dot pointing procedural changes and additional guard rails to ease recovery in future outages and to ensure no data is lost (which appears to be the primary goal here).


> It was an honest post-mortem that revealed far too much incompetency to trust this service.

That's why post-mortems are heavily sanitized. Or not posted publicly.



