'Check Your Backups Work' Day (checkyourbackups.work)
461 points by ParadisoShlee on Feb 1, 2017 | 147 comments



Imagine if you're the CEO of GitLab seeing this right now. I think showing compassion and solidarity may be a better response. The problem of restoring non-existing backups should be treated as a more serious problem in our industry. This happens too often, and not because the people who set the backups up were careless, but because catching errors in a backup that isn't working can take an unreasonable amount of time, and even then it could still just stop working one day.

We should probably treat this issue more like a disease such as high blood pressure: you don't know you have it, but it is probably doing irreparable damage to your internal organs. If we had no name for the disease or understanding of it, we would just die at an earlier age without an obvious cause.

Let's identify the diseases of this sort in our industry and work on prevention, diagnosis and treatment of them instead of just saying you should have been more careful.


I have nothing but respect for the GitLab team... offering live notes for the recovery is stunning.


Agreed! Even when they screw things up, they make people benefit from it! Kudos for such good spirit and dedication to your customers.

Everybody screws up from time to time in our industry. And when this happens, you have two types of people: those who try to hide it, and those like the GitLab team who communicate as fast as they can because they respect their customers. Paradoxically, to me, it creates more trust than it damages!


Thanks for the kind words. I'm sorry for letting our users down. We'll ask the 5 why's https://en.wikipedia.org/wiki/5_Whys We need to go from the initial mistake (wrong machine, solved by better hostname display and colors), to the second (not having a recent backup), to the third (not testing backups), to the fourth (not having a script for backup restores), to the fifth (nobody in charge of data durability and no written plan). The solutions above are just guesses at this point; we'll dive into this in the coming days and will communicate what we will do in a blog post.


Good morning (posted from a throwaway for reasons I'll describe).

I feel for you greatly here, and I commend your openness about how the data restoration caused 6 hours of data loss. I too work in a critical area where even minutes of lost DB data is bad.

We just had our own test event recently. We make sure that we can fail everything over and run on all secondaries. I found out how that worked; we failed. The problem with this is that I found out after the fact. Due to the secrecy, not even the teams knew why things failed the way they did. I had to piece it together from disjoint hearsay, and now I believe I have a coherent picture.

So yes, when I read your post mortem and RCA, it reminded me greatly of what happened here as well. But we can all learn from your example. As for me, I'm posting this from a throwaway due to the likely threat to my job.


I agree that 6 hours is way too much.


Just know that your transparency here in this situation has put you leaps and bounds ahead of other vendors in my mind.

Thank you for sharing.


I admire the response and quick action, but after reading this, I get the strong feeling GitLab et al didn't realize they were running a real business, with real projects and real people who trusted them to be a good custodian of their data.


I would also ask: why are we running our own Postgres setup and not using RDS? And then: why do we not have production Postgres DBAs on staff to do this rather than an engineer?


They want to have more control over the database, the ability to move providers, etc., which is why they're not using RDS, as was explained on the live stream.


make sure everyone is doing ok, don't let them beat themselves up over it

all the love and support in the world towards the team


Thanks mozair. Everyone is OK but sad. I sent this tweet earlier https://twitter.com/sytses/status/826598260831842308

Thanks for the support, we've received a lot of kind reactions and are very grateful for them.


If you do the 5 why's and are honest you'll discover the real root cause is not having a backup and restore procedure that is programmatic.

All outages are blameless. It's always a process failure or lack of a proper system or tool.


Completely agree. This is 1000x better than the typical say-nothing corporate PR response we usually get. Kudos for their honesty and openness, even in a somewhat embarrassing situation such as this.


it is also very, very risky from a liability perspective, both for insurance purposes and third-party indemnification. That's why most companies will keep as silent as possible.


I've never heard of anyone successfully suing AWS/Google/Azure over lost data or downtime.

And presumably even if your notes are kept private, it'll still come out in discovery if you end up in court?

What are the risks here? I'd be interested to learn more.


> I've never heard of anyone successfully suing AWS/Google/Azure over lost data

Those companies have huge legal departments writing iron-clad contracts and stocking a lot of very sharp knives for any such eventuality. Smaller companies make for much easier targets.

> even if your notes are kept private, it'll still come out in discovery

Those notes might have been lost by then - that is, if their existence at any given time is even known. It is perfectly reasonable for people working quickly after an outage, to not actually write down every step or observation they make.

> What are the risks here?

Not prison, but you might be forced to pay a bunch if someone manages to get a ruling against you for negligence or the like.


> I've never heard of anyone successfully suing AWS/Google/Azure over lost data or downtime.

Doesn't mean it never happens.


Imagine how deep their pockets are. Now imagine how deep your pockets would need to be to go toe to toe with them.

Sure, plenty of uber-large Incs are "in the cloud", but with mission-critical (read: high value, worth suing over) applications and data? That's a pretty small universe.


I could not agree more, @soheil. Let me also say that Sid, aka @sytse, is not only a top-notch CEO but also a hard-working, caring individual who truly cares not only about the product but about the whole community, so, so much.


Wholeheartedly agree.

Incidents are inevitable and it's important to have a proper RCA/Service-Disruption process in place to handle those.

As mentioned on the other thread, maybe GitLab doesn't have enough operational/SRE expertise in-house yet, but that was the case in every fast-growing company I've worked for in the last decade.


> Imagine if you're the CEO of GitLab seeing this right now

I kind of hope the CEO of Gitlab isn't reading HN right now


If he's getting some much needed sleep right now, he'll probably read it tomorrow. The CEO of GitLab is really active on HN, and a pretty classy guy.


Thanks. Just landed in Europe for Git Merge. It is heartwarming to read some of the comments here. We realize that we only get this leniency once and we'll learn from this and communicate our lessons in a few days from now.


The lesson here has been learned over and over again, painfully.

For every backup system you have, test restores, from scratch, periodically. The more critical the data is to your business, the more frequently and more automated you want those backup checks.

Of course you also want procedures to try to prevent tired admins from deleting production databases. To the greatest degree possible, implement systems to prevent manual tampering with production data; often there's an alternative. But databases can get broken or corrupt without admin interference, so preventing manual database removal or corruption is not an ultimate solution.


Why is it that you (GitLab) and every other company have to learn from your own mistakes again and again instead of learning from others? Just a cursory glance at a few log files would have told you that your backups were not happening. Never mind verifying actual backups...

Do you have DBAs? Are they completely inept at their job? Don't answer.


Yours is an entirely appropriate response. While I appreciate the gushing solidarity in the rest of this thread, non-working backups really is an elementary IT error, accurately attributable to incompetence.


Hindsight is always 20/20. Berating people who now KNOW mistakes were made accomplishes nothing.


Backup is elementary. It should not need to be learned nor manifest itself as a hindsight revelation. Frankly GitLab seems more and more like a mom and pop shop that learned Rails over a weekend. They did not even know if they had functioning backups before shit hit the fan.


I bet the software you're making is as successful as theirs. Don't answer...


I didn't mean it as a slight against him, just kidding :)



> We should probably treat this issue more like a disease such as high blood pressure: you don't know you have it, but it is probably doing irreparable damage to your internal organs.

That's why we need awareness days.


Precisely this.

Things that aren't tested can pretty much be counted on to fail a non-trivial percentage of the time. If it's going to be business critical, then it needs to be tested. (Lesson hopefully learned here)

By the same token, things which aren't monitored can also be counted upon to fail a non-trivial percentage of the time and even worse, go unnoticed. (Another teachable point here)

...but people being what they are tend to learn these things the hard way unless we've got concrete, real-world incidents like this for them to learn from.


This page did a lot to raise my awareness of my own inadequate backup situation. Any hurt feelings the CEO has because of this page will be outweighed by the greater good of the message being spread.


> even then it could still just stop working one day

This is where automated testing helps. My mail server has a sister VM out in the wild that once a day picks up the latest backup from the offsite-online backups and restores it, sending me a copy of the last message received according to the data in that backup. If I don't get the message, restoring the backup failed. If the message looks too old then making or transferring the backup failed.

My source control services and static web servers do something similar. None of the shadow copies are available to the world, though I can see them via VPN to perform manual checks occasionally, and if something nasty happens they are only a few firewall and routing rules away from being used as DR sites (they are slower, as in their normal just-testing-the-backups operation they don't need nearly the same resources as the live copies, but slow is better than no!).

This won't catch everything of course, but it catches many things that not doing it would not. The time spent maintaining the automation (which itself could have faults, of course) is time well spent if done intelligently. For a system as large in scale as GitLab's, a daily full restore is probably not practical, so a more selective heuristic will need to be chosen if you are operating at such a scale. My arrangement still needs some manual checking, and sometimes I'm too busy or just forget, so again it isn't perfect, but the risk of making it more clever and inviting failure that way is at this point worse than the risk of my being lazy at exactly the wrong time.

One thing my arrangement doesn't test is point-in-time restores (because sometimes the problem happened last week, so last night's backup is no use) but there is a limit to how much you can practically do.
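For illustration, a minimal sketch (in Python) of that kind of restore-then-report check; the paths, addresses and thresholds are hypothetical, and it assumes the nightly restore into a maildir has already happened:

    #!/usr/bin/env python3
    # Minimal sketch of a "prove the restore worked" report, run on the sister
    # VM after the nightly restore. Paths, addresses and thresholds are
    # hypothetical.
    import os, smtplib, time
    from email.message import EmailMessage

    RESTORED_MAILDIR = "/srv/restore-test/Maildir/cur"  # where the restore lands
    REPORT_TO = "admin@example.net"
    MAX_AGE_HOURS = 30  # older than this and the backup itself is stale

    def newest_message(path):
        files = [os.path.join(path, f) for f in os.listdir(path)]
        files = [f for f in files if os.path.isfile(f)]
        return max(files, key=os.path.getmtime) if files else None

    newest = newest_message(RESTORED_MAILDIR)
    msg = EmailMessage()
    msg["From"] = "restore-test@example.net"
    msg["To"] = REPORT_TO

    if newest is None:
        msg["Subject"] = "RESTORE TEST: no messages found - restore probably failed"
        msg.set_content("The restored maildir is empty.")
    else:
        age_h = (time.time() - os.path.getmtime(newest)) / 3600
        status = "OK" if age_h <= MAX_AGE_HOURS else "STALE"
        msg["Subject"] = f"RESTORE TEST {status}: newest message is {age_h:.1f}h old"
        with open(newest, "rb") as f:
            msg.set_content(f.read(2000).decode(errors="replace"))  # excerpt only

    # If this mail never arrives, the restore (or the report itself) failed.
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)

The point is simply that the proof arrives through a channel you will actually notice: no fresh report means either the restore or the reporting broke, and both deserve attention.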

> The problem of restoring non-existing backups should be treated as a more serious problem in our industry

It is by people that care about it, but not enough people care, and too many people see the resources needed to get it right as an expense rather than an investment for future mental health.

It isn't just non-existent backups. Any backup arrangement could be subject to corruption either through hardware fault, process fault, or deliberate action (the old "they hacked us and took out our backups too" - I really must get around to publishing my notes on soft-offline backups...).

> Let's identify the diseases of this sort in our industry

Apathy mainly.

The people who care most are either naturally paranoid (like me), have lost important data at some point in the past so know the feeling first hand (thankfully not me, though in part thanks to having a backup strategy that worked), or have had to have the difficult conversation with another party (sorry, I can't magic your data back for you, it really is gone) and watch the pitiful expressions as they beg for the impossible.

The only way to enforce the correct due diligence is to make someone responsible for it; it is more a management problem than a technical one, because the technical approaches needed pretty much all exist and for the most part are well studied and documented.

Of course to an extent you have to accept reasonable risks. It is usually not practical to do everything that could be done, and understandable human error always needs to be accounted for as do "acts of Murphy". But someone needs to be responsible for deciding what sort of risk to take (by not doing something, or doing something less ideal) rather than them just being taken by general inaction.


> picks up the latest backup from the offsite-online backups and restores it, sending me a copy of the last message received according to the data in that backup.

This is sensible, but one problem with this is developing "blindness" for messages you receive every day. For some recurring tasks you can automate a bit further. Instead of receiving a success message every day, receive a failure message the first day there's an absence of success message. A number of tools and services exist for setting this up, one is linked in my HN profile (shameless plug ;-)
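A rough sketch of that "absence of success" pattern, assuming the backup job touches a stamp file on success and this check runs from a different machine (the path, threshold, address and mail command are placeholders):

    #!/usr/bin/env python3
    # Dead man's switch sketch: the backup job touches STAMP on success; this
    # script only speaks up when that stops happening. Run it somewhere other
    # than the backup host itself. Paths and addresses are placeholders.
    import os, subprocess, sys, time

    STAMP = "/var/backups/last-success.stamp"  # backup job runs: touch $STAMP
    MAX_AGE_SECONDS = 26 * 3600                # a bit more than a day of slack

    try:
        age = time.time() - os.path.getmtime(STAMP)
    except FileNotFoundError:
        age = float("inf")

    if age > MAX_AGE_SECONDS:
        subprocess.run(
            ["mail", "-s", "BACKUP MISSED: no success stamp in over a day",
             "admin@example.net"],
            input=f"Last success was {age/3600:.1f} hours ago.\n",
            text=True,
        )
        sys.exit(1)  # non-zero so cron/monitoring can also notice

Hosted versions of the same idea just replace the stamp file with a ping URL that the backup job hits on success.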


> This is sensible, but one problem with this is developing "blindness" for messages you receive every day.

This can be an issue, I do get lazy about checking them when I'm busy.

The next step will be to automate checking them a bit more, rather than stopping the positive messages. A simple script that, for the mail example, logs in via POP/IMAP, checks for the last notification and checks the relevant timestamps. Similar for other services (is the last commit in the repo from more than 24 hours ago?). Then I get a simple "all OK" or not. I still have to run the script, perhaps hooking it to CGI and setting it as my browser home page to further remove my laziness from the equation. I don't want to just be told when something is wrong, as I'll never really trust that something going wrong won't block warning messages - but I do want to make checking as easy as possible.

I may play with your service and add it as a secondary set of checks though!
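For reference, a rough sketch of the kind of aggregated "all OK or not" check described above; the hostnames, credentials, subject line, repo path and thresholds are all made up:

    #!/usr/bin/env python3
    # Aggregated status check sketch: age of the newest restore-test report
    # over IMAP, plus age of the last commit in a mirrored repo. Everything
    # named here (host, account, mailbox, repo path, thresholds) is
    # hypothetical.
    import email.utils, imaplib, subprocess, time

    def newest_report_age_hours():
        with imaplib.IMAP4_SSL("imap.example.net") as imap:
            imap.login("checks@example.net", "app-password-here")
            imap.select("INBOX", readonly=True)
            _, data = imap.search(None, 'SUBJECT "RESTORE TEST"')
            ids = data[0].split()
            if not ids:
                return float("inf")
            _, msg_data = imap.fetch(ids[-1], "(BODY.PEEK[HEADER.FIELDS (DATE)])")
            date_hdr = msg_data[0][1].decode().split(":", 1)[1].strip()
            sent = email.utils.parsedate_to_datetime(date_hdr).timestamp()
            return (time.time() - sent) / 3600

    def last_commit_age_hours(repo="/srv/mirrors/projects.git"):
        out = subprocess.run(["git", "-C", repo, "log", "-1", "--format=%ct"],
                             capture_output=True, text=True, check=True)
        return (time.time() - int(out.stdout.strip())) / 3600

    problems = []
    if newest_report_age_hours() > 26:
        problems.append("no recent restore-test report")
    if last_commit_age_hours() > 24:
        problems.append("mirror repo has no commit in the last 24 hours")

    print("ALL OK" if not problems else "NOT OK: " + ", ".join(problems))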


> This is where automated testing helps.

Now that I'm developing email stuff, I constantly get the existential crisis of "how do I make sure my testing and alerting infrastructure is working?"

Really, how can you be sure your restoration tests are running correctly and their messages get all the way to you?

I think I'll add periodic positive messages for my restoration procedures. Stuff like (if success and day % 23 == 0 then email_something). Still doesn't guarantee the test is correct.


> Really, how can you be sure your restoration tests are running correctly and their messages get all the way to you?

I get a daily message, so I know if things aren't getting sent or getting through. I've been wary of setting up anything that only sends problem messages, because regularly getting the "all OK" messages proves that at least that part of the infrastructure is working.


The problem with that is when you start getting too many "all OK" messages.

Getting a message should mostly be a surprise. One daily message is OK for me, but if so, it may be good for it to cover more than just backups. A daily digest of "everything is up and running" may indeed be something very nice to have.

And how do you ensure a team will always have someone surprised not to get a daily message? Maybe send it to many people, or require some acknowledgement, and create a warning over the warning channel if there isn't one. (And you certainly aren't going to send those positive messages over the warning channel, because it requires a completely different kind of action.)


That is why I'm thinking of scripting up something to scan the status messages and present a list (in the console or on a web page), colour highlighted (normal for info only, green for all OK, red for error or missed message, big red bold (flashing?) for an issue still not resolved from the previous day). That automates checking, but I've still got the individual updates in case that fails. Maybe I'll make the output of this my homepage and/or console login banner.
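As a tiny illustration of that colour-highlighted list, assuming status messages are collected into a simple "timestamp level message" log (the file and format are made up):

    #!/usr/bin/env python3
    # Colour-highlighted status list sketch: read "timestamp level message"
    # lines and print them with ANSI colours, shouting about anything that has
    # been broken for more than a day. The log file and format are made up.
    import time

    COLOURS = {"info": "\033[0m", "ok": "\033[32m", "error": "\033[31m"}
    BOLD_RED, RESET = "\033[1;31m", "\033[0m"
    DAY = 86400

    with open("/var/log/status-messages.log") as f:
        for line in f:
            ts, level, message = line.rstrip("\n").split(" ", 2)
            age = time.time() - float(ts)
            if level == "error" and age > DAY:
                print(f"{BOLD_RED}STILL BROKEN ({age/3600:.0f}h): {message}{RESET}")
            else:
                print(f"{COLOURS.get(level, RESET)}{message}{RESET}")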

> And how do you ensure a team will always have someone surprised not to get a daily message?

For a team you have a management problem not a technical one! Somebody needs to be responsible for it in the end, plain and simple.

On a less authoritarian note a technical helper might perhaps be to have a screen, always on, showing a dashboard of statuses, where everyone can see it? Make it beep incessantly when a new warning arrives until someone acknowledges it. If you acknowledge a message either deal with it (if you can) or notify whoever needs to be told so they can take action.

People will still sometimes ignore or miss the warnings. Nothing and no one is perfect.

But I'm only dealing with my own collection of personal data and projects here, so it isn't that scale of problem - at work I used to be responsible for infrastructure as well as my actual job, but we no longer operate at a scale where that is at all practical, so I have to trust that someone else is dealing with it (though I'm not that trusting; I do sometimes take time out to check things myself where I have access).


> This is where automated testing helps. My mail server has a sister VM out in the wild that once a day picks up the latest backup from the offsite-online backups and restores it, sending me a copy of the last message received according to the data in that backup. If I don't get the message, restoring the backup failed. If the message looks too old then making or transferring the backup failed.

That's awesome. I've always tested my backups by having shitty, cheap servers so bad that restoring from backup happens about once every few months.

Going to an automated restoration test is a great idea, though also a lot of work for most people.

Just checking, manually or automatically, that your backups are occurring and are of a reasonable size is probably sufficient for most operations, and would have caught most if not every case in the GitLab instance.


I think this is a big slap to the company when they need to talk to their enterprise customers. I am not sure how we can really prevent this from happening by implementing an industry standard.


You know, I remember reading years ago that during the launch preparation for Apollo 16(?), they did a routine pressure test of the Command Module atop the rocket a few days before launch. The technician in charge made the simple error of forgetting to open a pressure release valve during the test, which led to an overpressurisation of the CM, causing significant damage, including separation of the heat shield.

The ENTIRE Saturn 5 rocket had to be wheeled back to the hangar and the CM dismantled and rebuilt, resulting in a month-long delay of the launch.

When the Apollo launchpad manager was asked if he had fired the technician in question, the answer was allegedly "Nope. He is the one guy on the next launch team that I know will NEVER make the same mistake again."

I am willing to bet that Gitlab is now a company that will never slacken off their backup checking in the future.


>When the Apollo launchpad manager was asked if he had fired the technician in question, the answer was allegedly "Nope. He is the one guy on the next launch team that I know will NEVER make the same mistake again."

This same general story appears in different forms with million dollar trading errors, etc. But I have to wonder if it's really that good of a lesson. If you truly take it to heart, you could potentially keep people on the team that are legitimately incompetent which could result in another catastrophic failure.


And if one starts applying your principle at a workplace, you end up with the whole company being afraid of doing anything at all, to avoid being fired.


No, there is a level of competency where once you make mistakes below that, firing is the right move.

If a surgeon keeps killing patients because he keeps forgetting to wash his hands, then he needs to be fired.

If a developer skips all company policies and deploys directly to production without a good reason, then he needs to be fired.

There are reasonable mistakes, and then there are errors resulting from reckless actions indicative of a larger problem with the person's view of their work.


It is not hard to distinguish between someone who makes a rare, but totally understandable simple mistake, and someone who regularly screws up significantly more often than everyone else. Being tolerant of mistakes makes you a good employer. Being tolerant of regular incompetence makes you a bad one.


Make one mistake, it's an expensive training lesson.

Make the same mistake again, you get fired.


That is the wrong question. The real question is what they did to ensure that this mistake cannot be made again by anybody. In this variation, the real question is what changes were made to the design so that the system cannot be over-pressurised in that way by mistake.


Sometimes you can put in all the checks and balances you wish, but humans will just make a, well, human mistake.

All commercial aircraft have checklists in place, and a two person cross checking system in the cockpit, as well as mechanical means to detect potential problems - but aircraft still every now and then land with their gear still up etc.

Does it make economic sense for an organisation that has invested possibly millions of dollars in training a pilot to immediately sack him/her for a costly mistake like this? Sure, disciplinary action and possible demotion within the ranks is a given, but will you get rid of the one person who will be almost fanatical about never landing with the gear up again?

If they consistently make technical mistakes and are sloppy in their airmanship, then sure, fire away. But it is going to be pretty much impossible to design a 'system' that can prevent human errors from creeping in.


Indeed, I just started planning for a side project to solve this by automating as much of the backup sanity checking procedure as possible.


Is it too much to ask for a few days dedicated to this? I mean, pretend that your critical infrastructure goes up in smoke - can you recover, and how quickly? For most companies it simply means that you try to restore the machines from your backups. You can do this while everything else continues as normal.


oh I thought GitLab was testing because of this day.


Not referring to GitLab's incident, but the general problem with this stuff is that too often people treat something like "make a backup" too literally.

As in, they think the act of generating a backup file is the last stage of the process and they are done with it. Maybe they go the extra mile and throw it in a cronfile too.

What you have to do is to consider any and all backups non-existent until you have a complete backup strategy.

In other words you have to appreciate that "having a backup" is a means to an end and another way of saying "being able to successfully recover from data loss".

So just generating a file is not sufficient.

You have to complete a successful test restore before you can call it a backup.

You have to have heart-beat measures in place to make sure you will not be impacted by a silent failure (example: check $last_successful_backup >= last X hours).

You have to periodically manually check that your automated checks work ("simulate" backups not being generated and wait for alert).

Far too often people don't appreciate the depth and the weight that a phrase like "backup plan" carries.

So when you ask them to "take care of backups", they will go run pg_dump and mysqldump and say "It's done". No goddamn it, it's not done.


There are levels. Obviously the best setup is 3 backups on 3 different continents (in the future one of the backups will be on Mars), with regular tests that each site can restore from scratch.

Even the minimum level, an untested backup to the same disk, is better than what most people have. That untested backup is there and protects against accidental file deletion - which will eventually become a test for most people. If the disk fails it is also something to work with. If you tell a disaster recovery service that there is a backup, they can use that redundancy to their advantage: odds are the physical damage isn't to both the backup and the real data, and they only have to recover one. Even if the damage is to both, there is something to work with.

As we move up the ladder there is more and more. An untested backup has data - give the team a few years and we can recover it. We might have to recreate the restore from scratch, but there is something.

Remember though, the farther down the ladder you are, the more expensive recovery can be. If you have tested backups on 3 continents, recovery only costs a couple hours of downtime - an actuary can put an exact dollar cost on this. If the price is too high, you can invest in redundancy on the live system. If you have an untested backup, it might be years before your team can recover it: millions of dollars in labor to recover the data, and several years of no/reduced business while they recover it. (Hint: the company will go out of business because it cannot afford to recover the data.)


The graver problem is that programmers, and humans in general, don't realise the gravity of recklessness until the moment shit hits the fan. Startups begin with general imprudence towards checks and processes, which is understandable, but even in the growing stages the idea of incorporating them is snubbed because of 'priorities'. "Time is precious and there are more important problems to solve in the way."

Unauthenticated MongoDB on the default port? The likelihood of a person port-scanning the entire web just doesn't strike anyone as a real threat, and then, someday, someone does exactly that. I guess this is one important reason to bring at least some experienced people on board, because there is a good chance they can give a better perspective on the seriousness of such issues.


In my day we used to develop "Disaster Recovery" programs. They were massive, and we tested on a regular basis including renting massive systems from IBM and flying the team to the IBM data center to run a full restore of everything. End business users had to login afterwards and sign off.

I understand we "live in a different world" is the favorite motto these days. But do we really? If anything data is bigger, more complex, not in one place, you can't just ship a truck load of tapes and three people somewhere to test.

IMHO the more we try to reinvent technology the more we realize some of the things we all felt were weighing us down were actually smart ideas brought on not by fear but by real life experience.

And the pendulum will swing once more here and back again at some point in the future.


I don't recognise the 'different world' you describe. Practiced, disciplined DR is a key part of modern software engineering at large and small companies across the startup/enterprise spectrum (in my experience). The popularity of tools like Netflix's Chaos Monkey/Gorilla/Kong serves as testament to that.

That's not to say all companies do it (it seems GitLab didn't) but the tone of your comment doesn't reflect a lot of people's experiences.


I think the key is knowing that this is routine at a subset of shops while most others are winging it to various degrees — and always has been.

If you worked at a responsible place decades ago you might be aghast at what happens at a random sampling of companies now, but the same would have been true at a random sampling decades ago. The difference is that unless you were a customer or the outage was especially prominent you probably never would have heard about it.



A couple of months ago, in a public chatroom for DBAs, one of GitLab's engineering leads made some very insulting comments about Facebook engineers & eng practices. Pretty ironic.

I want to feel bad for GitLab, but it's really, really hard when they hire people like that.


Oh wow, some interesting stuff in here. Looks like they use MySQL as a queue for scheduling the ORC Peons? Would have loved to hear more about why they did that.


Author here.

We use it because it works well for us. We've put a lot of work into making MySQL scale for us to the point where it's a very well supported system and one of the main choices for a lot of storage decisions.

We even use MySQL as a queue for Facebook Messenger. More details about this:

https://www.youtube.com/watch?v=eADBCKKf8PA

https://code.facebook.com/posts/820258981365363/building-mob...


Thanks! Standardizing on MySQL because of internal expertise is definitely a great reason.


GitHub also migrated persistent data out of Redis and into MySQL, with their expertise in MySQL as one of the motivating factors.

https://githubengineering.com/moving-persistent-data-out-of-...


Because it works perfectly fine? I tend to write anyone off who scoffs at simple database-as-queue designs without understanding what the scaling requirements are. You can use a database as a job queue for 10s of thousands of jobs per day without any sweat.


Please forgive me, but I don't understand your hostility.

I understand that there are perfectly legitimate reasons for using a database as a queue. If you frequently need to look at and rearrange your jobs while they're in flight, chances are a pure FIFO structure probably doesn't work that well for you anyways. If the enqueue is contingent on a transaction committing, probably makes a lot of sense for the job to be in the same database. You don't need to tell me -- I've seen more than a few in production.

But an actual message queue also "works perfectly fine" based on the information provided, and I would imagine that a company like Facebook probably already has a few of those lying around. It would have been a conscious choice to use MySQL as a queue.

I swear I'm actually, seriously just curious!


Scoffing? GP said it was "interesting stuff" and that they'd "love to hear more". What's wrong with that?


We've done this every day for 2 years now. Anyone can do it (and should). You don't need to be Facebook's size.


> After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

Everyone is talking about backups, but why not about this? How is it even possible to delete the production database by accident? Why does he have SSH access there? Why do they test their database replication in production? Why are they fire-fighting by changing random database settings _in production_?

I know that all of this is common practice. I am questioning it in general.


> Why does he have SSH access there?

Because when something goes wrong he has to be able to discover what it was.

> Why do they test their database replication in production?

Many places do that. With free databases there is simply no reason to, but the practices are inherited from the non-free databases best practices arsenal.

People, if you are using postgres and have a cluster in production, make sure you also have the script that creates this cluster in a VCS, and that it is able to create an identical cluster on virtual machines on your own computer.

> Why are they fire-fighting by changing random database settings _in production_?

Oh, that's because they are fire-fighting. The problem is that when you get a problem you know nothing about, you don't really know what to replicate in another environment to reliably test things there. The most you can do is verify that your changes aren't harmful; the real test is always in production.


How would you resolve a production data replication issue in an environment other than production, or without access to the system?


I would start by giving them more descriptive names... calling them db1 and db2 is a sure way to trip someone up some day.


This is already on their todo in the same document.


I think having one day a year is a bit sparse. Some startups start and shut down within a year. Apart from checking backups after any big code change related to backups, I think backups should be checked quarterly.

It takes no more than a couple of hours most of the time, and as the wise saying goes, "An ounce of prevention is worth a pound of cure."


I think that once a quarter is better, but if your startup shuts down at the end of the year, you probably don't need to worry about it.


Except the reason they had to shut down might be because they never checked their backups and then they lost a critical amount of data and it turned out that the backups were indeed no good.


I don't think any startup knows that they're going to shutdown within the year. No one would take the time if they knew they were going to shut down soon.


Correct. So save some concerns (e.g. weekly verification of backups) until after a year.


Indeed, if one day a year is what it takes for you to remember to care about backups, you probably shouldn't be involved with backups. This could happen to anyone, from a person losing their home photo collection to a hospital missing critical data within a patient management system - I know because I've been involved in, or seen first hand, both of these exact scenarios. It's what we learn from our mistakes that defines our future, not the people telling us we made them.


Check your backups daily. The whole process is automated and shouldn't cost much.


Then we'll have a check your check your backups day where you make sure your automation isn't making sure the sky is blue.


It's in there now. Row counts per table and checksums are written out with the backup for every table. If things don't match up, alarm in a loud and noisy way.


The point is that you could have a bug in the verification process that returns "all good" when it really isn't.


I have a SELECT of the most recent updated_at fields for a few tables emailed to me so I can eyeball the values


That's a neat idea, I might look at that, thank you.


My point is that there are sanity checks built into the process. It's not difficult. There doesn't need to be a human verifying things there. Is the row count increasing over the previous backup? Check. Is the checksum different? Check. Is it non-zero? Check. And on and on.

Stop making excuses and automate testing your backups.
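A sketch of those checks, assuming each backup run writes a small JSON manifest of per-table row counts and checksums next to the dump (the file layout and manifest format are invented for the example):

    #!/usr/bin/env python3
    # Compare today's backup manifest against yesterday's: non-zero dump,
    # row counts not shrinking, checksums actually changing. The manifest
    # format {"dump_bytes": N, "tables": {"users": {"rows": R, "checksum": C}}}
    # and the paths are invented for this sketch.
    import json, sys

    def load(path):
        with open(path) as f:
            return json.load(f)

    prev = load("/backups/2017-01-31/manifest.json")
    curr = load("/backups/2017-02-01/manifest.json")

    problems = []
    if curr["dump_bytes"] == 0:
        problems.append("dump is zero bytes")

    for table, info in curr["tables"].items():
        before = prev["tables"].get(table)
        if before is None:
            continue  # new table, nothing to compare against
        if info["rows"] < before["rows"]:
            problems.append(f"{table}: row count shrank {before['rows']} -> {info['rows']}")
        if info["rows"] == before["rows"] and info["checksum"] == before["checksum"]:
            problems.append(f"{table}: identical to yesterday - did the backup actually run?")

    if problems:
        print("BACKUP CHECK FAILED:\n" + "\n".join(problems))
        sys.exit(1)  # let cron/monitoring raise the alarm
    print("backup manifest checks passed")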


So, when your check process does not report because it failed to run, how do you know? Do you have a monitor for your monitor? Does that monitor have a monitor?

There is no reason not to automate your backup tests, but there is also no reason not to eyeball that the check is actually working from time to time.


Yes, the restore process checks in daily; alarms are thrown if it doesn't check in.

MySQL is also pretty good about not starting if the data is corrupted for whatever reason. The process not starting is a pretty obvious alarm, too.

Checks are cheap. Better to automate your logic and invest in good monitoring and automation.



Heh, "Don't be an April Fool" reminded me of the old annual "Internet Spring Cleaning" [0] ritual.

[0]: http://www.snopes.com/holidays/aprilfools/cleaning.asp


I wonder if anyone would fall for it now...


Along the same lines as https://xkcd.com/1053/, I believe so.


International Backup Awareness Day is EVERY day.

https://blog.codinghorror.com/international-backup-awareness...


One time at work I accidentally triggered delete on an RDS CloudFormation stack. It was not fun. Automatic backup from AWS is useless, because automatic snapshots are removed as soon as the RDS instance is removed, unless you tell AWS to make a final snapshot. We didn't have that flag in the stack template at the time, so ugh.

Oh, how did I delete the stack? I was using the mobile app and was trying to look at the status of the CFN stack, but the app was laggy and my finger pressed the wrong button... sigh. The other interesting thing is that I checked the status because the previous night I had changed my RDS instance to provisioned IOPS (it took 8 hours and failed, too). I felt sad and guilty, but at the same time I felt whatever, because the upgrade didn't go through, so perhaps this accident was all meant to be...


Ouch.

Doubly ouch that it seems that there's no confirmation dialog with a 5-second countdown before you can hit Yes, or whatever.


For critical stuff IMO there needs to be more than just a confirmation dialog. The user needs to be transitioned into a totally different state of mind from the usual click, click, click, click.

E.g. forcing someone to manually type the characters D E L E T E before allowing deletion of something potentially important.

Either that or everything should have multiple levels of undo. Everything.


There's certainly some stuff in AWS that requires you to type the name of the thing you're deleting into a text field in order to delete it, but I suspect that's purely a UI check -- there wouldn't be so much point in requiring the name twice in the underlying API.

So if you've got a dud client implementation then you're going to lose the check.

One way to do stuff like this is to have separate roles for read-only and read-write access. I pay a lot more attention to what I'm doing on the rare occasions I assume permissions to change things.


I've since removed the app from my phone. It's buggy, and you're right, I really do expect them to have a dialog both on PC and mobile...


Suggested workflow:

Make a backup, restore to test environment, run checksums, anonymize, release test environment.

That way each and every backup is tested both for integrity and for the ability to rebuild a working environment from it.
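A rough orchestration sketch of that workflow; every command name here is a placeholder for whatever your environment actually uses:

    #!/usr/bin/env python3
    # Orchestration sketch for: backup -> restore to test env -> checksums ->
    # anonymize -> release test env. All script names are placeholders.
    import subprocess, sys

    STEPS = [
        ["./backup.sh", "--target", "/backups/latest"],        # make the backup
        ["./restore.sh", "/backups/latest", "--env", "test"],  # rebuild test env
        ["./checksums.sh", "--env", "test"],                   # integrity check
        ["./anonymize.sh", "--env", "test"],                   # scrub personal data
    ]

    for step in STEPS:
        print("running:", " ".join(step))
        if subprocess.run(step).returncode != 0:
            # a failed restore test is exactly what you want to hear about loudly
            sys.exit(f"backup verification failed at: {' '.join(step)}")

    # hand the (anonymized) environment to whoever needs a fresh test database,
    # then release it
    subprocess.run(["./release_test_env.sh", "--env", "test"])
    print("backup restored, verified and anonymized")

Wiring the failure path into your alerting is what turns this from a nice-to-have into an actual test of the backups.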

In my practice insufficient backup is still (unfortunately) a very common occurrence.

On another note: just having stuff stored with triple replication in the cloud is emphatically NOT a backup.

And it also helps if the same people that have access to the live environment do not have write access to the backups, but that's only feasible past a certain team size.


Ooohh something similar happened in our company some time ago:

We had a MongoDB server (with a read-replica) in our production environment (this was an ec2 instance with MongoDB).

One day, a dev accidentally deleted the main collection in the DB during a night coding session. The next morning, when we realized that, we went straight to the daily backups that we had been doing. It turned out that for some reason the backups of the previous 2 or 3 days had not worked.

We had to get into Mongo-OpLog (which was on only because of the read-replica) and reconstruct the missing 3 days from it.

That was fucking scary.



While we're discussing the importance of backups, I would like to pause for a minute to think about a common systemic failure model that simply making backups doesn't solve.

I've realized in the past that in day-to-day situations you are more likely to lose data because of momentary programmer carelessness. Examples: deleting the backup version of a folder, deleting the copy on the wrong server, etc. This seems similar to what happened at GitLab.

How to protect oneself from this failure mode? Can we design a better system than to assume that every human command is well considered? (Sort of like guard rails, when purging backups)


I've tended to make backups read only so as to minimise the impact of accidental deletions.


It's exemplary to be so honest.

The CEO should laugh with us a bit and be proud to have inspired a day named after their hiccup, thanks to their transparency.

I'm totally compassionate but we should never lose our sense of humour!


I would like to take this moment to impart to all of you concerned about your business backups to make sure you are enabling your systems administrator with the budget, tools, personnel, and backing from management he/she needs to get the job done.

I have seen far too often one-man miracle teams swimming in technical debt, solving problems constantly but failing to have the time to play the political games needed to push for the kinds of changes they need. Obviously small operations setups are different, but for example, I've seen a ~250-person, 6-branch business have 1 senior and 1 junior part-time sysadmin, while his requests for a budget and personnel were constantly denied, and so he said his backup system worked but he knew it wasn't as good as he could make it. He eventually quit in frustration. He was a great sysadmin but didn't play enough politics, and therefore he failed and his management failed him, all the while jeopardizing the business. Please don't do this to your sysadmin.

CTOs and CIOs, please take a moment to ask your sysadmin what things they need they haven't been able to convince you of yet, and see if you can compromise or otherwise try to lend their arguments importance.

In all but the leanest of SV web startup land, sysadmins are the backbone that keeps your company running. Don't neglect or forget them.

If you do, one day that backup may fail, or a cryptovariant will hit the server, and although you will scapegoat your sysadmin, it will have truly been your fault.


It's one thing to schedule a backup.

Quite another to test the backup's restorability.

Most of us back up. Very few test that the backup works as intended. I need to do better at the latter.


Or you know, just confirming the backups actually ran!?


Really wish "backups" as a term would be replaced with a term that means the data was copied and proven to be fully accessible. Any ideas what that term might be, and why?


Recoverables?


disaster-safe


clone


I feel a special need to congratulate and offer a tip of the hat to the GitLab team for their transparency during this outage. Excellent work!


This actually inspired me to set up my daily / monthly backups, which I had not done since upgrading to Devuan almost half a year ago. Fortunately, I had already written a blog post [1] about backups, so setting myself back up and adding a cron task took under 30 min with a fresh disk.

[1] https://thatgeoguy.ca/blog/2013/12/26/encrypted-backups-in-d...


I discovered my Dev server DB backups weren't working after reading this. I used to consider my dev server just a place where I tossed test code to mimic production, but forgot that my DB server running on it was the only copy of potentially months of work! I have it dump nightly to my desktop now which also gets backed up locally and to the cloud.


Too bad "Check your site is using HTTPS" day, or "Check your website meta data is setup for sharing" day, or "Check your site is legible on mobile" day weren't first.

Look, shit happens... we don't need to make fun of people for it. We all cut corners at times... when we are lucky, nobody notices. When we aren't...



Would extend this to: "keep your personal data backed up too" - and for low-value or low-security data, it's easily done with consumer commercial services, since they're designed to allow you to be lazy and have no backup discipline.


Mean-Time-Between-RM-RF.


> So in other words, out of 5 backup/replication techniques deployed, none are working reliably or set up in the first place.


With something like Borg where you can just mount your backups and look at them normally it's fairly easy to see whether they're ok / include what you wanted.

Of course, backing a whole platform up is more complex, and things like databases normally require custom scripting (dump -> backup dump, eg. pg_dumpall | borg create ... -).


> Our backups to S3 apparently don't work either: the bucket is empty

Ouch. I can't even imagine how that feels. This is why even despite monitoring and paging scripts, I still have an event to check my company's backups weekly. Now I don't feel so paranoid.


GitLab's behavior is a testament to success!

No hiding, no euphemizing; their live doc stream actually made me question what I do on my own systems. Looks like convergent evolution in some parts, like prompt changes.

Thanks Gitlab.


From someone who works at GitLab, thank you for your kind words! Our infrastructure team is working hard!


This has always been my worst fear. Solidarity to the people at gitlab dealing with this no doubt incredibly stressful situation. <3


When stuff "just works", you don't need to check your backups. I fully trust my iPhone's iCloud backups, my Time Machine backups, and my cloud rsyncs. Time Machine also lets me know if they get corrupted, or if I haven't backed up in a while. That's how backups should work - an adage of "you don't have backups unless you check them" just won't work for most people.


You might want to check your Time Machine backups by hand from time to time using the tmutil[1] tool.

My Time Machine backups in the past have been missing gigabytes of data, without it telling me anything about it. And not just volatile data, like temp files or caches, but photos, music and documents.

[1]: http://osxdaily.com/2012/01/21/compare-time-machine-backups-...


I use Time Machine for quick recovery stuff, and then Arq Backup to send to remote storage.


Big difference between consumer-level backup of phones and PCs and enterprise backup for servers crammed with custom software.


Why should "check your backups day" apply to only enterprise backup? I've seen it repeated as gospel many times on Hacker News to folks who've lost backups that it was their own fault for not checking them. My point is we should focus on a software solution to making this more robust, and not blaming people.


I'm not saying it should apply only to enterprise backup, it's just that as you pointed out, the consumer-oriented backup solutions already work pretty reliably, making a "check your backups day" unlikely to matter to the average consumer (nevermind that given daily interaction with the gamut of Google's and Apple's services, consumers frequently "check their backups" in the course of their daily activities).

I think easy, good server-level backup software would be great. The problem is that enterprise servers are usually highly customized, part of a large, unique architecture, and contain a lot of proprietary, confidential, and potentially personal, legally protected data. That makes it a lot harder to get a one-size-fits-all backup solution set up, which means that the onus of reliable backups will, of necessity, rest upon the company's administrators.

It is very sad when no one checks backups. This bites companies every day and it's usually easy to sympathize, but there's no excuse for it. GitLab needs to perform a serious review of its processes.

I've checked out the GitLab job listings that get published on HN Who's Hiring and other places regularly (I'm currently 100% remote with my current employer and like to track other 100% remote employment opportunities). They have a salary calculator/estimator and personally, I was really underwhelmed with the values it would put out. That calculator makes a city price index adjustment from the base salary and contains a statement that says GitLab prefers to hire people who live in less expensive cities. I also remember feeling that their interview process sounded a little overbearing.

It may be time for GitLab to consider upping the ante on its recruitment procedures and adding some more experienced people to the ranks.


Having Time Machine report to you is 'checking' as well. Nobody said anything about any manual labour as far as I can tell.


In the thread about the Gitlab backups, manual checking (not "labour") was suggested at least twice.


Time Machine backups are great for a lot of use cases, but beware to not trust your virtual machines to Time Machine [0].

You might end up disappointed if you do.

[0] https://kb.vmware.com/kb/1013628


I've seen the occasional report of Time Machine failure: https://news.ycombinator.com/item?id=10681776


"Check your backups work day" is like "Earth day" - it should be every day.


Aaaand just found my personal backups at home haven't been running for about a month.

ALWAYS CHECK !


Well, I guess that explains my MIA repos earlier today.


I check my backups on Feb 29th ;)


Typo in the page: should be @gitlab, not @gitlib.


PR was pulled. Thanks.


Can't someone working for the company I bank with accidentally delete my debt and its backups? :P


That gave me the chuckles. I'll put it in the calendar. Should be a real thing.


It's nice to see that people still use Microsoft Word for web design.


Man I thought I was having a bad day. If you get spam from $fakeUserName@studyswami.com you have my profound apologies. Now I have to figure out how to mea culpa to Gmail (and everyone else) for the "why the hate?" protests. Ugh.



