Postmortem of database outage of January 31

ky738 · on Feb 11, 2017

RIP the engineer

YorickPeterse · on Feb 11, 2017

Still here, and doing just fine.

sslalready · on Feb 11, 2017

I love your tag line here on HN! Some 15 yrs ago I was hired as a UNIX administrator at some larger company. Despite being fresh from school I already had plenty of experience from spending the 90's hacking and programming on whatever UNIX system fyodor or rootshell.com had an exploit for. When the DBA was leaving for vaccation they didn't hesitate in letting me take over his daily routines. On this particular summer day, I had a simple job: dump the production database and load the data into the test environment. I had to make sure the dump was finished before 4pm when the daily production run started (which, FWIW, continued well into the evening). This was an Oracle shop so I believe the commands were mostly "exp" and "imp" -- with the caveat that the imp command would need an additional parameter to select the test database instead of the production one that was the default.

Yeah, you see where this is going. The prod dump finished in time and shortly before leaving work I started importing the data. Then I sat around for a while before I realized I had forgotten about that additional "use the test environment" parameter -- and now I was importing a several hours old dump into the production database while the daily production run was running. I had to call company execs and explain the catastrophe to them, who in turn had to call in the vendor that sold us the system. Those were some pretty scary hours for a 20 yr old kid. Luckily it was just a matter of aborting the production run, reload the prod dump and then reschedule the production run for the day.

The next day I had to start my day at the vendor's place to get some shaming, but also a good piece of advice - "always say destructive things out loud before doing them". Then they continued to tell me stories of people they had worked with who really messed things up, and we all had some good, evil laughs.

Mistakes build experience, and hard learned lessons even more so. You now have a pretty good conversation starter to put on your CV. Personally I'd rather hire someone who was a "removal" specialist over someone who hadn't learned the skill yet. :)

I believe both GitLab and the community in general will come out stronger from this incident. Thank you all for being so transparent about it.

fiedzia · on Feb 11, 2017

>Personally I'd rather hire someone who was a "removal" specialist over someone who hadn't learned the skill yet.

Oh absolutely. Problem is that gitlab didn't do that. And they will most likely have to learn many more lessons before become reliable.

tuyguntn · on Feb 11, 2017

Probably this fail makes Gitlab more reliable in the future. What doesn't kill you makes you stronger.

boulos · on Feb 11, 2017

I enjoy your updated profile.

AsyncAwait · on Feb 11, 2017

Actually, I'll be surprised if he hasn't received any offers by now. I would perhaps specifically hire him to deal with databases as I am pretty sure he's never going to make this mistake again.

YorickPeterse · on Feb 11, 2017

I haven't received any offers so far. I don't intend on leaving GitLab any time soon either.

illumin8 · on Feb 11, 2017

I have to say - if they were using a managed relational database service, like Amazon's RDS Postgres, this likely would have never happened. RDS fully automates nightly database snapshots, and ships archive logs to S3 every 5 minutes, which gives you the ability to restore your database to any point in time within the last 35 days, down to the second.

Also, RDS gives you a synchronously replicated standby database, and automates failover, including updating the DNS CNAME that the clients connect to during a failover (so it is seamless to the clients, other than requiring a reconnect), and ensuring that you don't lose a single transaction during a failover (the magic of synchronous replication over a low latency link between datacenters).

For a company like Gitlab, that is public about wanting to exit the cloud, I feel like they could have really benefited from a fully managed relational database service. This entire tragic situation could have never happened if they were willing to acknowledge the obvious: managing relational databases is hard, and allowed someone with better operational automation, like AWS, to do it for them.

renaudg · on Feb 11, 2017

This is usually true, except when it's not :

I have personally experienced a near-catastrophic situation 3 years ago, where 13 out of 15 days' worth of nightly RDS MySQL snapshots were corrupt and would not restore properly.

The root cause was a silent EBS data corruption bug (RDS is EBS-based), that Amazon support eventually admitted to us had slipped through and affected a "small" number of customers. Unlucky us.

We were given exceptional support including rare access to AWS engineers working on the issue, but at the end of the day, there was no other solution than to attempt restoring each nightly snapshot one after the other, until we hopefully found one that was free of table corruption. The lack of flexibility to do any "creative" problem-solving operations within RDS certainly bogged us down.

With a multi-hundred gigabyte database, the process was nerve-wracking as each restore attempt took hours to perform, and each failure meant saying goodbye to another day's worth of user data, with the looming armageddon scenario that eventually we would reach the end of our snapshots without having found a good one.

Finally, after a couple of days of complete downtime, the second to last snapshot worked (IIRC) and we went back online with almost two weeks of data loss, on a mostly user-generated content site.

We got a shitload of AWS credits for our trouble, but the company obviously went through a very near-death experience, and to this day I still don't 100% trust cloud backups unless we also have a local copy created regularly.

koolba · on Feb 11, 2017

> We got a shitload of AWS credits for our trouble, but the company obviously went through a very near-death experience, and to this day I don't 100% trust cloud backups unless we also have a local copy created regularly.

Cloud backups, and more generally all backups should be treated like nuclear profliferation treaties: Trust, but verify!

If your periodically restore your backups you'll catch this kind of crap when it's not an issue, rather than when shit had already hit the fan.

andruby · on Feb 11, 2017

Years ago I had my side project server hacked twice. I've been security and backup paranoid ever since.

At my current startup, we have triple backup redundancy for a 500GB pg database:

1/ A postgres streaming replication hot standby server (who at this moment doesn't serve reads, but might in the future)

2/ WAL level streaming backups to aws s3 using WAL-E, which we automatically restore every week to our staging server

3/ Nightly logical pg_dump backups.

9 months ago we only had option 3 and were hit with a database corruption problem. Restoring the logical backup took hours and caused painful downtime as well as the loss of almost a day of user generated content. That's why we added options 1 and 2.

I can't recommend WAL-E enough for an additional backup strategy. Restoring from a wal (binary) backup is ~10x faster in our usecase (YMMV) and the most data you can loose is about 1 minute. As an additional bonus you get the ability to rollback to any point in time. This has helped us to recover user deleted data.

We have a separate Slack #backups channel where our scripts send a message for every succesful backup, along with the backup size (MB's) and duration. This helps everyone to check if backups ran, and if size and duration are increasing in an expected way.

Because we restore our staging on a weekly basis, we have a fully tested restore script, so when a real restore is needed, we have a couple of people who can handle the task with confidence.

I feel like this is about as "safe" as we should be.

voidlogic · on Feb 11, 2017

I agree, the backup that isn't used to populate a stage/qa instance right after being taken is untrushworthy.

koolba · on Feb 11, 2017

Even before that there are steps you can take. For example if you take a Postgres backup with pg_dump, you can run pg_restore on it to verify it.

If a database isn't specified, pg_restore will output the SQL commands to restore the database and the exit code will be zero (success) if it makes it through the entire backup. That lets you know that the original dump succeeded and there was no disk error for whatever was written. Save the file to something like S3 as well as the sha256 of it. If the hash matches after you retrieve you can be pretty damn sure that it's a valid backup!

Otherwise you get the blind scripts like GitLab had where pg_dump fails. No exit code checking. No verification. No beuno!

saganus · on Feb 11, 2017

Are there any guidelines on how often you should be doing restore tests?

It probably depends on the criticality of the data, but if you test say every 2 week, you can still fall in the OPs case, right?

At what size/criticality should you have a daily restore test? maybe even a rolling restore test? so you check today's backup, but then check it again every month or something?

koolba · on Feb 11, 2017

Ideally it should be immediately after a logical backup.

For physical backups (ex: wal archiving), a combination of read replicas that are actively queried against, rebuilt from base backups on a regular schedule, and staging master restores lesa frequent yet still regular schedule, will give you a high level of confidence.

Rechecking old backups isn't necessary if you save the hashes of the backup and can compare they still match.

x0x0 · on Feb 11, 2017

Not immediately (imo); you should push the backup to wherever it's being stored. Then your db test script is the same as your db restore script: both start by downloading the most recent backup. The things you'll catch here are, eg, the process that uploads the dump to s3 deciding to time out after uploading for an hour, NOT fail, silently truncate, and instead exit with 0!

illumin8 · on Feb 11, 2017

Wow, I'm sorry you experienced that. This points to the importance of regularly testing your backups. I hope AWS will offer an automated testing capability at some point in the future.

In the meantime, I hope you've developed automation to test your backups regularly. You could just launch a new RDS instance from the latest nightly snapshot, and run a few test transactions against it.

bogomipz · on Feb 11, 2017

This is certainly true of all backups to an extent though not just the cloud. Back in the day of backing up to external tape storage it was important to test restores in case heads weren't calibrated or were calibrated differently between different tape machines etc.

I am curious did you manage to automate an restore smoke test after going through this?

mikiem · on Feb 11, 2017

Snapshots are not backups, although many people use them as backups and believe they are good backups. Snapshots are snapshots. Only backups are backups.

debaserab2 · on Feb 11, 2017

What is the difference exactly?

cookiecaper · on Feb 11, 2017

A snapshot could be a backup depending on what you're calling a snapshot, but yeah, in general, to be a backup things need to have these features:

1. stored on separate infrastructure so that obliteration of the primary infrastructure (AWS account locked out for non-payment, password gets stolen and everything gets deleted, datacenter gets eaten by a sinkhole, etc.) doesn't destroy the data.

2. offline, read-only. This is where most people get confused.

Backups are unequivocally NOT a live mirror like RAID 1, slightly-delayed replication setup like most databases provide, or a double-write system. These aren't backups because they make it impossible to recover from human errors, which include obvious things like dropping the wrong table, but also less obvious things, like a subtle bug that corrupts/damages some records and may take days or weeks to notice. Your standbys/mirrors are going to copy both of obvious and non-obvious things before you have a chance to stop them.

This is one of the most important things to remember. Redundancy is not backup. Redundancy is redundancy and it primarily protects against hardware and network failures. It's not a backup because it doesn't protect against human or software error.

3. regularly verified by real-world restoration cases; backups can't be trusted until they're confirmed, at least on a recurring, periodic basis. Automated alarms and monitoring should be used to validate that the backup file is present and that it is within a reasonable size variance between human-supervised verifications. Automatic logical checksums like those suggested by some other users in this thread (e.g., run pg_restore on a pg_dump to make sure that the file can be read through) are great too and should be used whenever available.

4. complete, consistent, and self-contained archive up to the timestamp of the backup. Differenced backups count as long as the full chain needed for a restoration is present.

This excludes COW filesystem snapshots, etc., because they're generally dependent on many internal objects dispersed throughout the filesystem; if your FS gets corrupted, it's very likely that some of the data referenced by your snapshots will be corrupted too (snapshots are only possible because COW semantics mean that the data does not have to be copied, just flagged as in use in multiple locations). If you can export the COW FS snapshot as a whole, self-contained unit that can live separately and produce a full and valid restoration of the filesystem, then that exported thing may be a backup, but the internal filesystem-local snapshot isn't (see also point 1).

fh973 · on Feb 11, 2017

Backups protect against bugs and operator errors and belong on a separate storage stack to avoid all correlation, ideally on a separate system (software bugs) with different hardware (firmware and hardware bugs), in a different location.

chousuke · on Feb 11, 2017

The purpose of a backup is to avoid data loss in scenarios included in your risk analysis. For example, your storage system could corrupt data, or an engineer could forget a WHERE clause in a delete, or a large falling object hits your data center.

Snapshots will help you against human error, so they are one kind of backup (and often very useful), but if you do not at least replicate those snapshots somewhere else, you are still vulnerable to data corruption bugs or hardware failures in the original system. Design your backup strategy to meet your requirements for risk mitigation.

gregmac · on Feb 11, 2017

I'd also add not just different location but different account.

If your cloud account, datacenter/Colo or office is terminated, hacked, burned down, or swallowed by a sink hole.. You don't want your backups going with it.

Cloud especially: even if you're on aws and have your backups on Glacier+s3 with replication to 7 datacenters in 3 continents... If your account goes away, so do your backups (or at least your access to them).

illumin8 · on Feb 11, 2017

RDS snapshots are backups. They are copied to S3 storage, which is replicated across 3 datacenters within a region.

YorickPeterse · on Feb 11, 2017

RDS, or any hosted database solution, is not some kind of silver bullet that solves all problems. While it's true it takes care of backups automatically, it does also restrict you in terms of what you can do.

For example, you can't load custom extensions into RDS. Also, to the best of my knowledge RDS does not support a hot standby replica you can use for read-only queries, and replication between RDS and non RDS is also not supported. This means you can't balance load between multiple hosts, unless you're OK with running a multi-master setup (of which I'm not sure how well this would play out on RDS).

Most important of all, we ship PostgreSQL as part of our Omnibus package. As a result the best way of testing this over time is to use it ourselves, something we strive to do with everything we sihp. This means we need to actually run our own things. Using a hosted database would mean we wouldn't be using a part of what we ship, thus not being able to test it over time.

phonon · on Feb 11, 2017

> Also, to the best of my knowledge RDS does not support a hot standby replica you can use for read-only queries

RDS has very nice Read Replicas.

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_R...

For HA you can use High Availability (Multi-AZ).

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concep...

dcosson · on Feb 11, 2017

But to be fair if you enable both you're paying for 3 servers total instead of 2, you can't read from the HA standby. (I'd imagine there are reasons not to do that anyway, but you don't even have the option of making that compromise to save on the cost)

illumin8 · on Feb 11, 2017

With RDS Aurora, any of your read replicas can be promoted to read/write master in the event of a failure of your primary master. This happens in 10-15 seconds, and is very fast.

So, you can get the benefit of up to 15 read replicas, and not have to pay for an extra standby server that is sitting idle.

phonon · on Feb 11, 2017

Well, you can use the Read Replica only, and if you have an outage on the primary, promote the Read Replica to recover... (will take a few minutes though.)

In general though-- TANSTAAFL

YorickPeterse · on Feb 11, 2017

Aha, I probably overlooked read replicas (haven't used RDS in about 1.5 years).

cjbprime · on Feb 11, 2017

That said, it did take AWS a disturbingly long time to add support for Postgres read replicas.

grzm · on Feb 11, 2017

Why disturbingly long? As opposed to frustratingly long? Or something else? What does the time it took Amazon to add such support imply to you?

cjbprime · on Feb 11, 2017

You're welcome to read frustratingly. I didn't think hard about the word choice.

I suppose disturbingly is meant to imply "It was frustrating to me, and I think it would be frustrating to anyone in the situation of seriously using Postgres on RDS[1], and perhaps it ought to even decrease their opinion of the RDS team's ability to prioritize and ship features that are production-ready".

Does that make sense?

[1]: There was no workaround for getting a read replica. RDS doesn't allow you to run replication commands. So your options were "Don't use Postgres on RDS, or don't run queries against up-to-date copies of databases." There was never any announcement of when read replicas were coming. It was arguably irresponsible of them to release Postgres on RDS as a product and then wait a year to support read replicas, which is a core feature that other DB backends had already.

grzm · on Feb 11, 2017

Without insight into what the AWS RDS team workload and priorities are, I think it's unfair to use a term like disturbingly. Sure, as a user, we want features to be rolled out as quickly as possible. From what I've seen, Postgres RDS support has been slow, but consistently getting better: nothing to warrant suggesting Amazon isn't serious about continuing to improve their Postgres offering. That would be disturbing. Or data loss failures. Slower-than-I'd-like roll-out of new features? Frustrating.

By all means, RDS isn't perfect. It doesn't suit my current needs. But I understand that getting these things to work in a managed way that suits the needs of most customers is not an easy task. I'll remain frustrated in some small way until RDS does suit my needs. I hope they continue to add features to give customers more flexibility. And from what I've seen, they likely will.

carterehsmith · on Feb 11, 2017

>> RDS does not support a hot standby replica you can use for read-only queries

This is not true anymore.

I set up two read-only RDS replicas, one in a different AWS region, and another in the same region, for read-only queries, just by clicking in AWS console.

wahnfrieden · on Feb 11, 2017

You can use the failover standby replica for reads with Aurora at least. And you can manually via MySQL set up replication with non RDS, just not via AWS APIs.

nilved · on Feb 11, 2017

RDS also comes with its own set of tradeoffs. There is no free lunch, and the cloud is just another word for someone else's server. There are reasons Gitlab opposes that.

discodave · on Feb 11, 2017

In the meantime solution architects and sales people from AWS are going to run around with annotated copies of this public post-mortem to enterprises and say "look, RDS would have solved x,y,z and we can do that for you if you pay us"

:)

carterehsmith · on Feb 11, 2017

>> the cloud is just another word for someone else's server.

No. The cloud (AWS, GCE, Azure etc) is not "just" like your own server.

Just consider some basic details - you pay someone else to worry about things like power outages, disk failures, network issues, other hardware failures, and so on.

ovao · on Feb 11, 2017

I think that's a little pedantic. The point he was making is that, conceptually speaking, the cloud is comprised of servers not unlike the servers you run yourself. The difference, obviously, is who runs them, the manner in which they're run, the exact manner in which they're utilized by you, etc., but they are still just servers at the bottom of the stack.

carterehsmith · on Feb 11, 2017

"The point he was making is that, conceptually speaking, the cloud is comprised of servers"

But... that "point" is trivial.

Did anyone ever claim that cloud servers are made of magic pixie dust? No.

The real "point" is that, cloud = hardware + service, with service > 0.

As the OP describes, GitLab tries to do their own service (because service is expensive... it is), and they find out, the hard way, that the "service" part is not easy at all.

Amazon & Microsoft & Google run millions of servers each, so they can afford to hire really good people, and establish really good procedures, and so on.

That, I think, is a better point

not_kurt_godel · on Feb 11, 2017

snackai · on Feb 11, 2017

You are completely right. There are reasons to oppose the cloud, but maybe they should focus on improving their systems before moving out of the cloud. At this point in time it is clear that GitLab lacks the talent to run everything themselves. I mean 5 backups worthless or lost? You can't let interns write your backups system. After all backup is a large portion of their product.

angry_octet · on Feb 11, 2017

The worst part of the whole episode, even worse than 'deleted the active database by accident', was '(backups) were no one's responsibility'. This is not an oversight by an individual engineer, but an aspect of the management and company culture. It shows they lack processes derived from requirements. Lots on introspection required from gitlab at this point.

cookiecaper · on Feb 11, 2017

Yes. This should be treated as a serious management failure. Blame does not lie with the individual who made a simple mistake; it lies with the supervisory structure that allowed simple mistakes such as this to result in major data loss (and, as discussed in yesterday's thread [0], has made a series of other serious strategic mistakes that have likely caused them to end up with such inadequate internal hierarchies).

Something like this is not a mere oversight on the part of technical leadership; it's either negligence or incompetence. Whoever is responsible for GitLab's server infrastructure should be having very serious thoughts right now.

[0] https://news.ycombinator.com/item?id=13607890

manojlds · on Feb 11, 2017

That's why I was critical of the other article on HN recently that evangelized their remote work.

AsyncAwait · on Feb 11, 2017

This has nothing to do with remote work, it has to do with having dedicated devops people on staff.

kshitij_libra · on Feb 11, 2017

Smaller companies that do not have enough senior/good technical guys that they can afford for whatever reason they have... benefit greatly from the cloud. 1 master, no read partitions, weird backup policies and the saviour of the day is some engineers lucky manual snapshot. That sucks. It's better people start with cloud and manage when they are really confident.

Operyl · on Feb 11, 2017

It's worth noting that compared to a good number of recent ish startups, GitLab now has (I believe) more than 160 or so employees. Someone could've owned a recurring task to work on backup processes (and I imagine, now, someone (or likely multiple people)).

overcast · on Feb 11, 2017

This likely would have never happened, if one of their one hundred and sixty employees, just took the time to make sure backups were setup at all. You also need to be a sufficiently large enough organization to warrant the prices that the "cloud" services demand. As stated below, cloud computing is just someone else's server somewhere, and they are making lots of money doing it. Unless you need that level of scalability, and processing, then it's not worth it. I think Gitlab stated their entire PostgreSQL database was only a few hundred gigabytes. That's not exactly huge.

RhodesianHunter · on Feb 11, 2017

"As stated below, cloud computing is just someone else's server somewhere. Unless you need that level of scalability, and processing, then it's not worth it."

I keep seeing people throw this around as if it's God's truth and it frustrates the hell out of me. It may be the case for your organization but everywhere I have worked (from startups to Fortune 500) the cloud allowed our engineers to focus on our product rather than infrastructure maintenance and contributed massively to our success.

cookiecaper · on Feb 11, 2017

The cloud provides convenience, which absolutely does have [some] value. That value usually does not approach the actual cost incurred for companies of a reasonable size (including, IMO, GitLab), but it does exist and it means that everyone should be able to find a smattering of uses for cloud.

The issue I think is that so many people just go balls-to-the-wall 1000% AWS and consider it a done deal, which is terrible, and then go around and telling everyone else they should do the same thing, which is also terrible.

The fact is that you can't just lay the responsibility for all of this in Amazon's lap. We'd be even less impressed if GitLab's excuse was "Yeah, we had the Amazon nightly snapshots enabled, so we only lost 19 hours of data" (whoever coincidentally took the backup 6 hours before the incident should get enough of a bonus to make his GitLab salary market-competitive!).

Amazon does start you out with some OK-ish defaults, which is better than allowing someone with 4 days of experience to set everything up, but ultimately that's not going to mean much in unskilled hands.

When it comes down to it, every company still need someone internal to take responsibility for their infrastructure; that means backups, security, permissions, performance, hardware, and yes, cost. If your company already has someone with those responsibilities, giving that person $500k to hire a few hardware jockeys is going to be much better than giving Amazon $3M to be the sole host for all of your infrastructure. If your company doesn't have anyone with these responsibilities, it needs to get on the stick, as GitLab has clearly demonstrated to us this month.

memracom · on Feb 11, 2017

RDS PostgreSQL is like the Hotel California, you can check in any time you want, but you can never leave. Maybe it is OK as a simple data store for a single app, but not for a real database. I gained a lot of my knowledge of PostgreSQL internals by helping my company get off of RDS and onto a dedicated EC2 instance solution. RDS imposes too many limitations.

Also, your snapshot backup solution is trivial to implement on EC2 or anywhere else for that matter. But it is not easy to do it right in some scenarios. Read https://www.postgresql.org/docs/9.6/static/backup-file.html for details. LVM or ZFS are likely needed under the db layer.

RhodesianHunter · on Feb 11, 2017

"Maybe it is OK as a simple data store for a single app, but not for a real database."

Currently working at company number 2 with large (many terabytes) databases on RDS and can safely say this is horse shit.

The amount of time and energy it allows our engineers to spend on our actual products instead of database management is worth all of the extra cost and lock in and then some.

Edit: I just realized that you were talking about Postgres on RDS in particular. I don't have experience with Postgres so you may well be right.

_Codemonkeyism · on Feb 11, 2017

I'm interested in

a. How many hours would you guess you are saving a month?

b. What takes by a factor of 10 less time on RDS than doing it by hand? What task sees the largest time saved?

Because I always wonder when reading this, what am I missing? What haven't we done? Were we lucky? We were running MySQL and Postgres for multi hundred million EUR companies with millions of users and we did not spend a lot of effort into managing them.

RhodesianHunter · on Feb 16, 2017

1. Zero effort backups.

2. Zero effort re-deployment from backups.

3. Almost zero effort encryption at rest.

4. Zero effort hot backups, automatic fail-overs, and multiple datacenter deployments

5. Low effort migrations of massive amounts of data between DBs when someone inevitably wants to refactor something

6. Zero effort logging and log aggregation

7. Almost zero effort alerting of issues via sms/email/other

I could go on but I'm on my way to work...

When you're paying engineers north of 150K all of this adds up, and I'd much rather throw the money at Amazon to handle this and pay the engineers to focus on our actual product.

pgaddict · on Feb 11, 2017

We're in the business of PostgreSQL support, and some of our customers use RDS for various reasons. Not having to care about the deployment is one of the usual goals of using a managed environment, but considering they subsequently go and buy support from a third party might be a sign of something.

Of course, my view is biased because we only hear about the issues - there might be a 100x more people using RDS without any issues, and we never hear about them.

In general, the pattern we see is that people start using RDS, and they're fairly happy because it allows them to build their product and RDS more or less works. Then they grow a bit over time, and something breaks.

Which brings me to two main RDS issues that we run into - lack of insight, and difficulty when migrating from RDS with minimum downtime. Once in a while we run into an issue where we wish we could connect over SSH and get some data from the OS, attach gdb to the backend, or something like that. Or using auto_explain. Not even mentioning custom C extensions that we use from time to time ...

RhodesianHunter · on Feb 16, 2017

"and difficulty when migrating from RDS with minimum downtime"

They're simply uninformed then. AWS database migration service makes zero downtime migrations trivial between just about any major databases (mysql, oracle, postgres, aurora, sqlserver, etc.)

atmosx · on Feb 12, 2017

Could you describe the limitations you encounter in the RDS PSQL setup vs a self-managed?

seldo · on Feb 11, 2017

Funny you should mention a managed relational database service; Instapaper uses one of those and had more than 12 hours of downtime this week: http://blog.instapaper.com/post/157027537441

No database solution is totally reliable. If storing data is my primary job, like it is GitLab's, I'd like to have as much control of it as possible.

illumin8 · on Feb 11, 2017

Let's just say that Instapaper's outage was self-inflicted. You don't see them blaming their cloud provider, do you? People make mistakes, and even with a managed relational database service, you can still make mistakes.

The difference is that Instapaper was able to restore from backups, because their managed service performed them properly. The archive data is taking longer to restore, but that's due to design decisions Instapaper made.

imperialdrive · on Feb 11, 2017

I'm typing off the top of my head, but didn't they have like 400GB of database? That would probably take 27 hours to get fully available via S3 at 32,000/kbps which is about what s3 will provide for first time hits in my experience.

bdcravens · on Feb 11, 2017

RDS can restore a point-in-time snapshot in a couple of hours on databases over 1TB (speaking from experience)

qaq · on Feb 11, 2017

RDS has severe performance limitations as in you can't provision more than 30K IOPS which is about 1/2 the performance of low end consumer SSD and about 1/20 the performance of a decent PCI-E SSD. You way better of running on decent dedicated hardware for the DB.

illumin8 · on Feb 11, 2017

You can get 500K random reads per second and 100K random writes per second using RDS Aurora.

If you truly need more than 30K IOPs, I would recommend leveraging read-replicas, a Redis cache, and other solutions before just "throwing money at the problem" and purchasing a million IOPs.

qaq · on Feb 11, 2017

how is buying a quality hardware component for a few K throwing "money at the problem" vs paying thousands a month for a very subpar product?

illumin8 · on Feb 12, 2017

You can't just buy a single enterprise-grade NVMe SSD and call it a day. Are you planning on buying enough to populate at least 2-3 servers with multiple devices, then setting up some type of synchronous replication across them? What type of software layer are you going to use to provide high availability for your data? DRBD? How are you going to manage all of the different failure modes (failed SSD, network partition, split brain, etc.)? How are you going to test it?

I'm afraid you are seriously underestimating the operational capabilities required to successfully operate a highly-available, distributed, SSD storage layer.

qaq · on Feb 13, 2017

Nope but if I pay a few K for a service I expect it to scale to perfomance at least comparable to a very low end device. Why would I use DRBD in a Postgres cluster? I am not underestimating anything I am simply pointing out that RDS is overpriced and crappy service. A proper setup & operation of a postgres cluster is manageable task. What you totally can not manage on AWS is the risks of having single tenant that is using 30% of resources or having lengthy multi AZ outages due to bugs in extremely complex control layer etc.

yeukhon · on Feb 11, 2017

I will add my inputs on RDS. I gave this comment on the GitLab incident thread. I actually managed to delete an RDS cloudformation stack by accident. The night before I pushed an update to Cloudformation to convert storage class to provisioned IOPS. Next morning I woke up really early, drove my girlfriend to work. While waiting in the car I wanted to check the status of the update so I went on AWS mobile app to check. Mind you I have iPhone 7, but the app was very slow and laggy. As I was scrolling down to find out the failure. But there was a lag between the screen render and my click. Damn. I clicked on delete. Yeah, fucking delete. No confirmation. It went through. No stop button.

There was no backup because the cfn template I built at the time did not have the flag that said take a final snapshot. If you do not take the final snapshot (via console, api, cfn) you are doomed: all the auto snapshots taken by AWS are deleted upon the removal of the RDS instance.

This was our staging db for one of our active projects which I and the dev team spent about a month working to get to staging and was under UAT. Fuck. I told my manager and he understood the impact so he just let me get started on rebuilding. The next morning I got the DB up and running since luckily I compiled my runbook when I first deployed it yo staging. But it was not fun because the data is synced via AWS DMS from our on premise Oracle db so I needed to get sign off from a number of departments.

So I learned my first lesson with RDS - make sure final snapshot flag is enabled (for EC2 user, please remind yourself anything stored on ephemeral storage are going to be loss upon a hard VM stop/start operation, so backup!!!).

I also learned that RDS is not truly HA in the case of upgrading servers, both minor and major upgrade. I've tested major upgrade and saw DB connection unavailable up to 10 min. In some minor version upgrades both primary and secondary had to be taken down.

Other small caveats such as auto minor version upgrade, maintenance windows, retention for automated snapshot are only up to 35 days, event logs in RDS console doesn't last for more than a day, converting to provisioned IOPS can be expensive are just some small annoyance or ugh kind of things I would encourage folks to pay close attention to. Oh yeah, also manual snapshots have to be managed by yourself, kind of obvious but there is no life cycle policy... building a read replica can take up to a day in my first attempt of ever creating a read replica.

Of course now I learned these lessons so we have auto and manual snapshots and a better schedule. I encourage you take the ownership of the upgrade even for minor version so you know how to design your applications to be better at fault tolerance.....in the end hing I liked RDS the most is the extensive free CW metrics available. I also recommended people not to use the mobile app and if you do, setup a read-only role / IAM user. The app is way too primitive and laggy. I still enjoy using RDS, the service is stable and quick to use, but just make sure you have the habit of backuping and take serious ownership and responsibility of the database.

illumin8 · on Feb 11, 2017

There is no magic silver bullet that will let you upgrade a database without some minor amount of downtime. RDS minimizes this as much as possible by upgrading your standby database, initiating a failover, then creating a new standby. Clients will always be impacted because you have to, by definition, restart your database to be running the new version.

You can select your maintenance window, and you can defer updates as long as you want - nobody will force you to update, unless you check the "auto minor version update" box.

Please don't blame AWS for your lack of understanding of the platform. They try to protect you from yourself, and the default behavior of taking a final snapshot before deleting an instance is in both CloudFormation and the Console. If you choose to override those defaults, don't blame AWS.

yeukhon · on Feb 12, 2017

You come across strong. No one is blaming AWS.

cookiecaper · on Feb 11, 2017

>So I learned my first lesson with RDS - make sure final snapshot flag is enabled (for EC2 user, please remind yourself anything stored on ephemeral storage are going to be loss upon a hard VM stop/start operation, so backup!!!).

This bit us once. Someone issued a `shutdown -h now` out of habit in an instance that was going for reboot, and it came back without its data, because "shutdown" is the same as "stop", and "stop" on ephemeral instances means "delete all my data". Since the command was issued from inside the VM, no warning or message that would've appeared on the EC2 console was displayed.

Amazon's position on ephemeral storage was shockingly unacceptable and unprofessional. They claimed they had to scrub the physical storage as soon as the stop button was pressed for security purposes, which is a complete cop-out. Of course they can't reallocate that chunk of the disk to the next instance while your stuff is on it, but they could've implemented a small cooldown period between stoppage, scrubbing, and reallocating the disk so that there would at least be a panic button and/or so accidental reboots-as-shutdowns don't destroy data. The only reason they didn't do that is because they didn't want to need to expand their infrastructure to accommodate it. Very sloppy, and not at all OK. That's not how you treat customer data.

Fortunately, AWS has moved on; I don't think that any new instances can be created with ephemeral storage anymore. Pure EBS now.

>I also learned that RDS is not truly HA in the case of upgrading servers, both minor and major upgrade. I've tested major upgrade and saw DB connection unavailable up to 10 min. In some minor version upgrades both primary and secondary had to be taken down.

You need multi-AZ for true HA. Failover within the same AZ has a small delay, as you've noted.

>I still enjoy using RDS, the service is stable and quick to use, but just make sure you have the habit of backuping and take serious ownership and responsibility of the database.

As many others in this thread have said, AWS and other cloud providers aren't a silver bullet. Competent people are still needed to manage these sorts of things. GitLab most likely would not have fared any better under AWS.

illumin8 · on Feb 11, 2017

Don't blame AWS because you don't understand what ephemeral storage is.

There is a significant security reason why they blank the ephemeral storage. How would you feel if a competitor got the same physical server as you, and was able to read all of your data? AWS takes great lengths to protect customer data privacy in a shared, multi-tenant environment. They are very public through their documentation about how this works, so I think it's a bit negligent to blame them because you don't understand the platform.

cookiecaper · on Feb 11, 2017

Did you read my post? I understand what ephemeral storage is, and that giving another instance access to that physical device without scrubbing it is insecure. That's not the point. There's no reason that AWS needs to delete that data the instant a stop command is issued.

AWS gets paid the big bucks to abstract such concerns away in a pleasant manner. The device with customer data can sit in reserve, attached to the customer's account, for a cooldown period (of maybe 24 hours?) that would allow the customer to redeem it. AWS could even charge a fee for such data redemptions to compensate for the temporary utilization of the resource, or they could say ephemeral instances will always cost your use + 1 day. They can put a quota on the number of times you can hop ephemeral nodes.

They could do basically anything else, because basically anything else is better than accidentally deleting data that you need due to a counterintuitive vendor-specific quirk that conflicts with established conventions and habits and then being told "Sorry, you should've read the docs better."

This is an Amazon-specific thing that bucks established convention and converts the otherwise-harmless habits of sysadmins into potential data loss events. It's very bad to do this ever (looking at you, killall Linux v killall Solaris), but it's especially bad to do it on a new platform like AWS where you know lots of people are going to be carrying over their established habits and learning the lay of the land. It is not reasonable for Amazon to tell the users that they just have to suck it up and read the docs more thoroughly next time.

This is not like invoking rm on your system or database root, which is a multi-decade danger that everyone is aware of and acclimated to accounting for, and which has multiple system-level safeguards in place to prevent it: user access control, safe-by-default versions of rm that have been distributed with most major distributions lately, etc., and for which thorough backup and replication solutions exist to provide remedies when inevitable accidents do happen.

The point is that just instantly deleting that data ASAP and providing 0 chance for recovery is wanton recklessness, and there's no excuse for it. Security is not an excuse because there's no reason they have to reallocate the storage the instant the node is stopped.

If such deletions could only be triggered from the EC2 console after removing a safeguard similar to Termination Protection, that may be more reasonable, but allowing a shutdown command from the CLI to destroy the data is patently irresponsible.

Good system design considers that humans will use the system, that humans make mistakes, and it will provide safeguards and forgiveness. Ephemeral storage fails on all of those fronts. Yes, technically, it's the user's fault for mistakenly pressing the buttons that make this happen. But that doesn't matter. The system needs to be reasonably safe. AWS's implementation of ephemeral storage is neither safe nor reasonable.

Amazon has done a good job of tucking ephemeral storage away. It used to be the default on certain instance sizes. As another commenter points out, it now requires one to specifically launch an AMI with instance-backed storage. It's good that they've made it harder to get into this mess, but it's bad that they continue to mistreat customers this way, especially when their prices are so exorbitant.

illumin8 · on Feb 12, 2017

So, the solution to some customers not understanding the economics and functionality of ephemeral storage is to charge all customers for a minimum of 25 hours of use, even if they only use the instance for a single hour? That seems crazy.

Look, AWS is trying to balance the economics of a large, shared, multi-tenant platform. It would be great if they had enough excess capacity around to keep ephemeral instance hardware unused for 24 hours after the customer terminates or stops the instance, but frankly, that's an edge case, and they would be forcing other customers to subsidize your edge case by charging everyone more.

cookiecaper · on Feb 12, 2017

>So, the solution to some customers not understanding the economics and functionality of ephemeral storage

Let me stop you there. In our case, it wasn't that we didn't understand what ephemeral storage was or how it functioned, or that it would get cleared if the instance was stopped (though I've frequently met people who are confused over whether instance storage gets wiped when a machine is stopped or when it's terminated; it gets wiped when an instance is stopped).

The issue was that someone typed "sudo shutdown -h now" out of habit instead of "sudo shutdown -r now" (and yes, something like "sudo reboot" should've been used instead to prevent such mistakes). Stopping an instance, which is what happens when you "shut down", can have other ramifications that are annoying, like getting a different IP address when it's started back up, but those annoyances are usually pretty easy to recover from, not a big deal. Much different ball park from getting your stuff wiped.

Destroying consumer data IS a big deal. It's ALWAYS a big deal. If your system allows users to destroy their data without being 1000% clear about what's happening, your system's design is broken. High-cost actions like that should require multiple confirmations.

Even the behavior of the `rm` command has been adjusted to account for this (though it could be argued that it hasn't been adjusted far enough); for the last several years, an extra flag has been required to remove the filesystem root.

>is to charge all customers for a minimum of 25 hours of use, even if they only use the instance for a single hour? That seems crazy.

One of several potential solutions. It doesn't seem crazy to me; at least, not in comparison to making a platform with such an abnormal design that something which is an innocent, non-destructive command everywhere else can unexpectedly destroy tons of data.

The ideal solution would be for Amazon to fix their design so that this is fully transparent to the user. Instance storage should be transmitted into a temporary EBS disk on shutdown and automatically re-applied to a new instance store when it's spun back up (it's OK if this happens asynchronously). The EBS disk would follow conventional EBS disk termination policies; that data shouldn't be deleted except at times that the EBS root disk would also be deleted (typically on instance termination, unless special action is taken to preserve it).

That could be an optional extension, but it should be on by default -- that is, you could start an instance store at a lower cost per hour if you disabled this functionality, similar to reduced redundancy storage in S3, etc. Almost every company would be thrilled to pay the extra few cents per hour to safeguard against the accidental destruction of virtually any quantity of data that might be important.

>Look, AWS is trying to balance the economics of a large, shared, multi-tenant platform. It would be great if they had enough excess capacity around to keep ephemeral instance hardware unused for 24 hours after the customer terminates or stops the instance, but frankly, that's an edge case, and they would be forcing other customers to subsidize your edge case by charging everyone more.

A redemption fee would punish the user who made the mistake for failing to account for Amazon's flawed design. Under this model, such fees should be at least high enough to make up the cost incurred by Amazon in keeping the hardware idle.

This way Amazon can punish people who impugn upon its bad design choices by making them embarrass themselves before their bosses when they have to explain why the AWS bill is $300 higher this month or whatever, and the data won't be gone. Winners all around.

illumin8 · on Feb 13, 2017

A redemption fee is a good idea, but it would still take engineering effort to build such a feature, so the opportunity cost is that other features customers need wouldn't get built.

Another thing I'd like to point out is that you really need to plan for ephemeral storage to fail. All it takes is a single disk drive failure in your physical host, and you've lost data. If you are using ephemeral storage at all, you should definitely have good, reliable backups, or the data should be protected in other ways (like HDFS replication).

mortar · on Feb 11, 2017

> I don't think that any new instances can be created with ephemeral storage anymore. Pure EBS now.

Still around, can be launched with an instance-store backed AMI:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceS...

AlisdairO · on Feb 11, 2017

Not sure how it works on cloudformation, but in the console and API you have to explicitly skip the final snapshot.

nhumrich · on Feb 11, 2017

I know about the daily snapshots, but didn't know about the archive logs. Is this something I have to enable? How do I get the logs and how do I restore using them?

illumin8 · on Feb 11, 2017

It's automatic. Go ahead and launch a new instance, restoring to a point in time (that's how you do restores in RDS). Notice that it gives you a calendar day/date/time fields where you can select the recovery point down to the second. This is enabled by replaying the archive logs to get you to the exact point in time.

nhumrich · on Feb 13, 2017

Thank you! I never realized I could do this for some reason. I always restored from snapshots.

robodale · on Feb 11, 2017

Jesus christ, thanks for the infomercial.

bitshepherd · on Feb 11, 2017

A great number of issues can be attributed to the selection of Azure as the platform of choice. That said, a little bird told me that the decision was largely a cost factor. "You get what you pay for" never rang more true.

tedunangst · on Feb 11, 2017

But none of the issues were azure/cost related except for the slow recovery? I mean, neither AWS nor GCE can make you notice youre not getting cron mail.

teej · on Feb 11, 2017

Setup a cloudwatch alarm and have cron send a keepalive email to it..? Without looking, I'm pretty sure that does exactly what you want.

tedunangst · on Feb 11, 2017

And also entirely possible to do with azure. AWS doesn't make you do that.

Also, depending on precise nature of the dmarc fuckup, wouldn't help anyway. Cloud watch receiving mail doesn't guarantee that you will receive mail.

ronack · on Feb 11, 2017

Yes, I recall seeing a ticket that referenced Gitlab using Azure because it was heavily subsidized. My company uses Azure for much the same reason, and my experience has been largely positive.

MichaelGG · on Feb 11, 2017

Is Azure cutting deals beyond the usual free 60k over a year or two or whatever it is for cool startups? Azure seems significantly more expensive in general, problems and slowness aside.

ronack · on Feb 11, 2017

Azure offers $500k for YC startups and $360k for "lesser" accelerators. It's by far the most generous offer out there.

KayEss · on Feb 11, 2017

The engineers still seem to have a physical server mindset rather than a cloud mindset. Deleting data is always extremely dangerous and there was no need for it in this situation.

They should have spun up a new server to act as secondary the moment replication failed. This new server is the one you run all of these commands on, and if you make a mistake you spin up a new one.

Only when the replication is back in good order do you go through and kill the servers you no longer need.

The procedure for setting up these new servers should be based on the same scripts that spin up new UAT servers for each release. You spin up a server that is a near copy of production and then do the upgrade to new software on that. Only when you've got a successful deployment do you kill the old UAT server. This way all of these processes are tested time and time again and you know exactly how long they'll take and iron out problems in the automation.

MichaelRenor · on Feb 11, 2017

This type of thing always sounds good and all, but the reality is people get desperate and emotional when their website is down and everyone wants it up ASAP.

KayEss · on Feb 11, 2017

I certainly don't disagree with that, but if you have this automated it is also the fastest way to get it back up and running. Besides, the site wouldn't have been down if they had this.

jjirsa · on Feb 11, 2017

Desperate and emotional is no way to run a business

IA21 · on Feb 12, 2017

I agree but engineers are people, not machines. They're bound to make mistakes under pressure.

sneak · on Feb 12, 2017

If that's true, the process is wrong.

lomnakkus · on Feb 11, 2017

> They should have spun up a new server to act as secondary the moment replication failed.

In a perfect world everything is cluster-ready &c at the outset. In this world it usually... isn't.

EDIT: ... and I'd posit that such cluster-readiness actually isn't worth it most of the time.

fredsir · on Feb 11, 2017

You don't think it's worth it most the time because of the hassle of setting up and managing a cluster, or because clusters in and of it self is not necessary for most?

lomnakkus · on Feb 11, 2017

The latter. I think it's a 1% business case, basically. I mean, if we can get 80% benefit without excessive cost, then it's obviously a good idea. (And I do use Ansible/Docker and the like, but it's not entirely without friction... which is where the cost/benefit analysis comes in.)

EDIT: Obviously, if you really need clustering, then you need it, but IME people tend to overestimate their needs drastically. Everybody wants to be Big Data, but almost nobody actually is.

kuschku · on Feb 11, 2017

That’s a nice idea, if you’re willing to pay the massive extra cost to actually rent all those overpriced systems.

For me, personally, going from cloud servers to rented dedicated servers cut my bill by 93% – more than an order of magnitude. At same performance.

In fact, it’d be cheaper to run 10x as many dedicated servers than to use cloud solutions for me.

theptip · on Feb 13, 2017

It cut your cloud services bill by 93%, but how much did it increase your engineering bill by?

If your engineering time is free, then this calculation is complete. Otherwise it is not.

Does that 93% saving pay for a DB engineer, or enough of your developers' time to build the same quality of redundancy as you'd get with a DBaaS?

This calculus is going to be different for every DB and every company, but the OpEx impact of switching to dedicated servers is a bit more complex than you suggest above.

kuschku · on Feb 13, 2017

(a) I’m talking about projects I host in my free time (b) My server budget is fixed.

So, for me the choice was between "use cloud tools, and get performance worse than a raspberry pi", or "run dedicated, and get more performance and storage and traffic than I need, and actually the ability to run my stuff".

For less than the price of a Netflix subscription I’m able to run services that can handle tenthousands of concurrent users, and have terabytes of storage (and enough traffic that I never have to worry about that).

And the cost of setting it up was for me a few days.

For me it was a decision between being able to run services, or not being able to run them at all.

theptip · on Feb 14, 2017

Sure, hobby/spare-time projects are one of the cases where it's perfectly reasonable to self-host; often it's fun to learn about the underlying tools by rolling your own db, and doing so can save you some cash (at the expense of your own time).

However, that paradigm is not really applicable to GitLab's OpEx calculation; they have to pay their engineers ;)

kuschku · on Feb 14, 2017

Even for GitLab, when you have 1-2 orders of magnitude price difference, it might be more affordable to hire a DBA.

You have to remember GitLab is a 100% remote company, so they can hire DBAs from anywhere on the planet.

The cloud obviously makes a lot more sense if you have US electricity prices and Silicon Valley wages.

theptip · on Feb 14, 2017

Yes, it might be more affordable. They seem to think it is, as they have chosen to go with self-hosted.

My point is simply that your posts above didn't address the complexity of their calculation, as they didn't factor the costs of switching to self-hosted.

jjirsa · on Feb 11, 2017

The fact that they're using a single master software is already an antiquated concept in 2017

meowface · on Feb 11, 2017

>Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.

I could feel the sweat drops just from reading this.

I'd bet every one of us has experienced the panicked Ctrl+C of Death at some point or another.

atombender · on Feb 11, 2017

Brings back memories, though not of anything I did. Quoting a comment I made on HN recently in a different thread:

---

Back in 2009 we were outsourcing our ops to a consulting company, who managed to delete our app database... more than once.

The first time it happened, we didn't understand what, exactly, had caused it. The database directory was just gone, and it seemed to have gone around 11pm. I (not they!) discovered this and we scrambled to recover the data. We had replication, but for some reason the guy on call wasn't able to restore from them -- he was standing in for our regular ops guy, who was away on site with another customer -- so after he'd struggled for a while, I said screw it, let's just restore the last dump, which fortunately had run an hour earlier; after some time we were able to get a new master set up, and although we had lost one hour of data, it was fortunately from a quiet period with very little writes. Everyone went to bed around 1am and things were fine, the users were forgiving, and it seemed like a one-time accident. The techs promised that setting up a new replication slave would happen the next day.

Then, the next day, at exactly 11pm, the exact same thing happened! This obviously pointed to a regular maintenance job as being the culprit. It turns out the script they used to rotate database backup files did an "rm -rf" of the database directory by accident. Again we scrambled to fix. This time the dump was 4 hours old, and there was no slave we could promote to master. We restored the last dump, and I spent the night writing and running a tool that reconstructed the most important data from our logs (fortunately we logged a great deal, including the content of things users were creating). I was able to go bed around 5am. The following afternoon, our main guy was called back to help fix things and set up replication. He had to travel back to the customer, and the last things he told the other guy was: "Remember to disable the cron job".

Then at 10pm... well, take a guess. Kaboom, no database. Turns out they were using Puppet for configuration management, and when the on-call guy had fixed the cron job, he hadn't edited Puppet; he'd edited the crontab on the machine manually. So Puppet ran 15 mins later and put the destructive cron job back in. This time we called everyone, including the CEO. The department head cut his vacation short and worked until 4am restoring the master from the replication logs.

We then fired the company (which filed for bankruptcy not too long after), got a ton of money back (we threatened to sue for damages), and took over the ops side of things ourselves. Haven't lost a database since.

lojack · on Feb 11, 2017

I'm no sysadmin, and I know mistakes are inevitable and all... but I find this kind of mistake is unlikely to come from me. I feel as though a lot of developers are too nonchalant about production boxes. I think one or two close calls where I nearly did this exact thing served as a good wakeup call for me.

Steps I personally take to avoid this:

- Avoid prod boxes like the plague

- Set up a prompt (globally) to make it extremely obvious that you're in production. Something like a red background and black text saying "PRODUCTION"

- When changing data in production (DB's, config, etc) write a script (or just commands to copy and paste) and have that peer reviewed. If anything doesn't go to plan, treat it as a red flag. This serves a dual purpose of having a quick record of your actions without hunting through logs.

- Never ever leave open sessions

- Avoid prod boxes. This is important enough for me to say twice. Most of the time it can be avoided, especially if you use configuration management tools and write tools to perform common operations.

Now, lets just cross my fingers I don't jinx myself :-)

StreamBright · on Feb 11, 2017

This sounds really cool. How about you are working for amazon or google and you have 1M production boxes? A nice prompt wont save you, change management will. Writing down the exact steps and reading it yourself and get others to read it and you execute it line by line. In my experience this is much better approach than avoid production boxes or have a red prompt and also scales to larger infrastructures. If something goes sideways (and in some cases it will) you can pinpoint the root cause quickly.

DigitalJack · on Feb 11, 2017

It may be unlikely to come from you, but it's for damn sure that it'll never happen to said engineer again.

Also, I would make sure to have a different prompt than default for non-prod systems too. That way you know to be suspicious if it hasn't been changed from default.

viraptor · on Feb 11, 2017

I don't think most of your points really apply though. They were setting up replication in production, so they had to work on production boxes. Setting prompt to say just "production" wouldn't help for the same reason. Production was intended.

Peer review though - yes. That could help. I wouldn't say "I'm unlikely to make that mistake" - it's likely to go on the famous last words list...

lojack · on Feb 11, 2017

Sure, maybe the point about "PRODUCTION" may not have applied, I was mainly commenting on OPs post about how we've all been in a situation and the steps I take to avoid such situations. I'm curious about your arguments about the remaining 3/4 of the steps I take and how those wouldn't have applied to this situation?

cookiecaper · on Feb 11, 2017

The red PS1 would've clearly indicated to the engineer that he was typing `rm -rf ...` on the _master_, not the secondary. This assumes that the master and secondary would have differing prompts based on their relative importance.

viraptor · on Feb 11, 2017

That would help, but that's not what OP advocated. Sure, you can improve on those ideas. I was mainly pointing out that saying "it's unlikely to happen to me" was a bit dangerous and too sure, if most of the reasons do not apply to the situation.

lojack · on Feb 11, 2017

Would the steps I describe prevent actions taken in the GitLab incident? I would never make no assumptions to that. Maybe. Maybe not. Did I say following those steps would make it unlikely to happen to you? No. That's why I prefaced it with "I'm not a sysadmin." Would it prevent cases described by the person I was responding to? Absolutely. Not 100% of the time, but some percentage of the time.

So, I'll say it more clearly, and you can mark my words. It's unlikely I'll ever log into a production system, type the wrong command, and do something bad as a result.

Could I deploy code that does very bad things to production? Yes. It'll probably happen to me. Is that the situation described above? No.

I treat logging into a production system as if one wrong move could result in me losing my job. Why? Because one wrong move could result in me losing my job. I'm not joking when I say I avoid logging into a production system like the plague. It's unlikely to happen to me because its extremely rare for me to put myself in a situation where I could let this happen. There's almost always better alternatives that I'll resort to, well before doing anything like this.

chrismorgan · on Feb 11, 2017

I messed up an XP computer at home with `cd D:\backups\something; del /s * ` many years ago; `cd` without the /D flag doesn’t change the drive, so although D:\backups\something was the working directory on the D: drive, the working directory was still C:\WINDOWS\system32, and cmd.exe was running as administrator.

Fortunately disks were slower back then, so it hadn’t deleted too many files when I interrupted it, and the computer was able to be recovered without too much inconvenience.

rectang · on Feb 11, 2017

For me, it was when I meant to execute this...

    rm -rf ~/foo

... but executed this instead:

    rm -rf ~ /foo

cjbprime · on Feb 11, 2017

My worst data loss was:

    $ tar cvfz mbox outbox mbox.tar.gz

The argument order is backwards -- the output file is supposed to be first, then the input files.

On my system, this overwrote my full mailbox with a gzipped copy of my outbox, and a complaint that the mbox.tar.gz input file didn't exist.

That's right, the worst data loss happened while I was trying to take a backup. :(

chuckdries · on Feb 11, 2017

What were you trying to do? What's the outbox argument for?

dasil003 · on Feb 11, 2017

He was trying to backup his mbox and outbox.

BinaryIdiot · on Feb 11, 2017

I did the Windows equivalent once a long time ago (I think it's deltree?) and I did it on a university computer system. It cleared out a TON of files and the computer itself pretty much stopped working. I had to hard turn it off.

Fortunately the University was using some tool that can re-image a computer each time it boots before hitting Windows so starting it back up and all the deleted system and application files were back.

overcast · on Feb 11, 2017

Probably Ghost server, that's why schools do this.

dsavinkov · on Feb 11, 2017

this is the reason why I always use quotes before specifying folders with rf -rf :)

memracom · on Feb 11, 2017

I always type

ls /stuff/wherever/*

Then examine the output to see if that is the stuff I really want to delete and if it is,

up-arrow Ctrl-A right right backspace backspace rm -rf enter

Never deleted the wrong stuff again in 30 years of doing that

okbake · on Feb 11, 2017

Something else that is really useful in these situations (in bash at least) is alt-* (alt-shift-8). It will expand a directory or glob into all effected top level files/directories.

For example, it will expand `ls *` to `ls foo bar baz`, etc

was_boring · on Feb 11, 2017

I always cd to the parent directory, ls with tab complete to check, the rm with the same tab complete.

bartvk · on Feb 12, 2017

Great tip, thanks!

meowface · on Feb 11, 2017

For me it was a simple `rm -rf .`. Thought I was in ~/somefolder/, was actually in ~...

I stopped it after about 3 seconds, but that was enough to do critical damage.

a_t48 · on Feb 11, 2017

Misread that as "The engineer was terminated" at first. Poor guy.

haldean · on Feb 11, 2017

Potential outage prevention plan: put an alias on all production machines that emails HR to schedule a disciplinary meeting every time you run `rm -rf`.

INTPenis · on Feb 11, 2017

At a small web host, early in my career, I once saw the boss blurr past my desk towards the server room. Throw open the big vault door and disappear inside.

Turns out he had accidentally executed an rm of the home dir on a major web server in the background so in panic, instead of killing the right pid, he just ran to the server and pulled the power cords. :D

Ended up restoring a few home dirs from tape.

atmosx · on Feb 11, 2017

Great to have a full-featured, professional post-mortem. Incidentally I work at a company that suffered data loss because of this outage and we're looking for ways to move out of GL.

My 2 cents... I might be the only one, but I don't like the way GL handled this case. I understand transparency as a core value and all, but they've gotten a bit too far.

IMHO this level of exposure has far-reaching, privacy implications for the ppl who work there. Implications that cannot be assessed now.

The engineer in question might have not suffered a PTSD, but some other engineer might haven been. Who knows how a bad public experience might play out? It's a fairly small circle, I'm not sure I would like to be part of a company that would expose me in a similar fashion, if I happen to screw up.

On the corporate side of things there is a saying in Greek: "Τα εν οίκω μη εν δήμω" meaning don't wash your dirty linen in public. Although they're getting praised by bloggers and other small-size startups, in the end of the day exposing your 6-layer broken backup policy and other internal flaws in between, while being funded at the tune of 25.62M in 4 rounds, does not look good.

sytse · on Feb 11, 2017

Hi Panagiotis. I'm glad to hear you like the postmortem. I'm very sorry your company suffered data loss. If you want to move from GitLab.com please know that you can easily export projects and import them on a self-hosted instance https://gitlab.com/help/user/project/settings/import_export.... (and if in the future we regain your trust you can also go the other way).

It is not our intent to have one of our team members implicated by the transparency. That is why we redacted their name to team-member-1 and in any future incidents we'll do the same. It should be their choice to be identified or not. We are very aware of the stress that such a mistake might cause and the rest of the team has been very supportive.

I agree that we don't look good because of the broken backup policy. The way to fix that is to improve our processes. We recognize the risk to the company of being transparent, but your values are defined by what you do when it is hard.

JasonSage · on Feb 11, 2017

This is a perfect response.

Every day I'm growing more to like GitLab. It took me way too long to realize that GitLab has a singular focus to change how people create and collaborate.

A person purely motivated on principle to see a specific change is going to find a way to make it happen. The hard part with such ideological ventures is that you have to have the business sense to make it sustainable. I'm gradually learning to recognize both aspects present in GitLab.

When you're guided on principle, it's much easier to accept losses here and there in the right way...

> If you want to move from GitLab.com please know that you can easily export projects and import them on a self-hosted instance (and if in the future we regain your trust you can also go the other way).

...and be able to stay focused on the bigger picture! Some customers were going to react this way no matter what. Sytse's response here characterizes GitLab's response as a whole here—we know we did wrong here, we learned from it, and we're going to be able to do a better job here on out regardless of whatever the fallout from the incident is.

Sytse, I love what you're doing and I look forward to seeing your continued resilience and dedication to your goal. The world needs more businesses like this.

sytse · on Feb 11, 2017

Thanks for the kind words Jason. It has been a long week and reading this at 7:30am on a Saturday is an unexpected gift.

atmosx · on Feb 11, 2017

Thanks for the reply.

> It is not our intent to have one of our team members implicated by the transparency. That is why we redacted their name to team-member-1 and in any future incidents we'll do the same.

Great, good to know. I wish all the success in the world to you and everyone involved with Gitlab.

sytse · on Feb 11, 2017

Thank you very much!

AsyncAwait · on Feb 11, 2017

> your values are defined by what you do when it is hard.

Precisely.

Most companies would stay as quiet about this as possible, you guys remained transparent and this is why I'll remain a customer.

sytse · on Feb 11, 2017

Thanks for supporting us.

gr2020 · on Feb 10, 2017

Reading this, the thing that stuck out to me was how remarkably lucky they were to have the two snapshots. The one from 6 hours earlier was there seemingly by chance, as an engineer had created it for unrelated reasons. And for both the 6- and 24-hour snapshots, it seems just lucky that neither had any breaking changes made to them by pre-production code (they _were_ dev/staging snapshots, after all).

I'm glad it all worked out in the end!

sytse · on Feb 10, 2017

We too are glad we had those snapshots. And while it was the worst thing that ever happend at GitLab it is humbling to know that it could have been worse.

animex · on Feb 10, 2017

What do you think would have happened if you had total data loss/failure?

sb8244 · on Feb 11, 2017

I don't work for gitlab.

Isn't that a put you out of business moment? I guess there is enterprise installations but the main website would be toast.

I guess there is the chance there were still older backups taken intentionally or by chance that would have been worse but recoverable from

gtaylor · on Feb 11, 2017

Most of their income probably comes from customers that run their own GitLab Enterprises installs. This would have really sucked for all of their non paying users, though.

greenrd · on Feb 10, 2017

GitHub also lost a bunch of PRs and issues sitewide early in their history. They claimed to have restored all the PRs from backup, but I was pretty sure I had opened a PR and it never came back. I emailed support and they basically told me tough luck.

ancarda · on Feb 11, 2017

>Unfortunately DMARC was not enabled for the cronjob emails, resulting in them being rejected by the receiver. This means we were never aware of the backups failing, until it was too late.

At my dayjob, we gradually stopped using email for almost all alerts, instead we have several Slack channels like #database-log where errors to MySQL go. Any cron jobs that fail post in #general-log. Uptime monitoring tools post in #status. So on...

Email has so much anti-spam stuff like DMARC that make it less reliable your mail will be delivered. For something failing like a backup or database query, it's too important to have potentially not reach someone who can make sure it gets fixed.

My 2 cents.

gl- · on Feb 11, 2017

This is a step in the right direction but still misses a big part of it IMO: push versus pull notifications. If the agent stops functioning correctly or someone makes a config change, the alerts just stop and no one notices.

At the very least you want some kinda dead-man's switch that gets pissed if it's seen no events in the last x amount of time. Ideally you want to be polling the box in a stateful way; although with ephemeral nodes & flexible infra being all the rage that's fallen to the side a bit lately.

ancarda · on Feb 12, 2017

Absolutely, that's a great idea!

You could also check for evidence a run has been successful, although that does depend on what you're doing exactly.

For our backup system, we're going to build an audit cron job on our main server that checks all our Azure containers to see if each server has pushed a file lately. It'll alert us if a file hasn't been uploaded in a few days or if it's smaller than a few MB (which is suspiciously small; we'd expect a few hundred MB for mysqldump+files).

mkopinsky · on Feb 12, 2017

How do you set cron to post to Slack?

ancarda · on Feb 12, 2017

We use Monolog (PHP library) that is able to post to Slack.

Messages in Monolog, like syslog have a level attached, so DEBUG, INFO, NOTICE and WARNING will only be written to a log file on disk. Anything higher, so ERROR, CRITICAL, ALERT or EMERGENCY will write to Slack (as well as log to disk). This means we only get notified of things failing and we can go on the server and see everything from DEBUG upwards which lets us mentally step through the cron job's run.

It's a very cool library. https://github.com/Seldaek/monolog

You can see the handlers here: https://github.com/Seldaek/monolog/tree/master/src/Monolog/H... which includes Slack, HipChat, IFTTT, Pushover, etc...

MichaelRenor · on Feb 11, 2017

> Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.

I can only image this engineer's poor old heart after the realization of removing that directory on the master. A sinking, awful feeling of dread.

I've had a few close calls in my career. Each time it's made me pause and thank my luck it wasn't prod.

coderintherye · on Feb 11, 2017

Thanks so much for the post and transparency Gitlab! We had just finished recovering from our own outage (stemming for a power loss and subsequent cascading failures) and were scheduled to do our post-mortem on 2/1 so the original document was a refreshing and reassuring read.

sytse · on Feb 11, 2017

Glad to hear it was of use to you.

aabajian · on Feb 11, 2017

This is an outstanding writeup, but I wonder if it glosses over the real problem:

>>The standby (secondary) is only used for failover purposes.

>>One of the engineers went to the secondary and wiped the data directory, then ran pg_basebackup.

IMO, secondaries should be treated exactly as their primaries. No operation should be done on a secondary unless you'd be OK doing that same operation on the primary. You can always create another instance for these operations.

voidlogic · on Feb 11, 2017

>When we went to look for the pg_dump backups we found out they were not there. The S3 bucket was empty, and there was no recent backup to be found anywhere. Upon closer inspection we found out that the backup procedure was using pg_dump 9.2, while our database is running PostgreSQL 9.6 (for Postgres, 9.x releases are considered major). A difference in major versions results in pg_dump producing an error, terminating the backup procedure.

Yikes. One common practice that would have avoided this is by using the just taken backup to populate stage. If the restore fails pages go out. If integration tests that run after a successful restore/populate fail- pages go out.

Live and learn I guess.

_Marak_ · on Feb 11, 2017

I've noticed a lot of other positive activity and press for Gitlab for in the past month.

It's unfortunate they had this technical issue, but it's good to see others ( besides Github ) operating in this space. I should give Gitlab a try sometime.

sperglord · on Feb 11, 2017

We just switched from Github to Gitlab for our private repos. The choice (based upon cost alone) was between them and Bitbucket, and the professional way that this was handled and the transparent communication was really nice to see.

pradeepchhetri · on Feb 11, 2017

Just want to add here that using tools like safe-rm[1] across your infrastructure would help in preventing data losses by running rm on unintended directories.

[1]: https://launchpad.net/safe-rm

jsperson · on Feb 11, 2017

>An ideal environment is one in which you can make mistakes but easily and quickly recover from them with minimal to no impact.

This is a great attitude. Too often opportunity cost isn't considered when making rules to protect folks from doing something stupid.

yarper · on Feb 11, 2017

It's amazing how quickly it descends into "one of the engineers" did x or y. Who was steering this ship exactly?

It's really simple to point the finger and try to find a single cause of failure - but it's a fools errand - comparable to finding the single source behind a great success.

jdavis703 · on Feb 11, 2017

Do you expect management to be staring over your shoulder every time you do some kind of `rm` on a production server? With great power comes great responsibility.

yarper · on Feb 11, 2017

People get tired, sick, frustrated, panic - part of being a responsible engineer is accepting you're as fallible as the next person and building in protection against your own errors.

However, if "the engineer" that caused this happens to read this, the above is not a sign that you should quit the profession and become a hermit. A chain of events caused this, you just happened to be the one without a chair to sit in when the music stopped.

CaptSpify · on Feb 11, 2017

One of the coolest things I've read about is how airlines do root cause analysis. If you get to a point where a human can mess up the situation like this, it's considered a systemic issue. Mobile now, but can try to find it later

EDIT: https://dvikan.no/ntnu-studentserver/reports/A%20Human%20Err...

> That is, much like falling dominoes, Bird and others (Adams, 1976; Weaver, 1971) have described the cascading nature of human error beginning Human Error Perspective 39 with the failure of management to control losses (not necessarily of the monetary sort) within the organization.

JorgeGT · on Feb 11, 2017

In general, in aviation, the existence of any single point of failure (SPOF) is considered a systemic issue, be it a single human who can fail (anyone can faint or have a heart attack), a single component, or a single process. That's why there are not only redundant systems, but redundant humans and even redundant processes for the same task (you can power the control surfaces hydraulics through the engines, then through the aux power unit, then through the windmill... ).

If a design contains an SPOF, then it's a bad design and should not be approved until the SPOF is removed by adequate redundancy or other means.

damagednoob · on Feb 11, 2017

Management is often at fault for not giving engineers the resources to do their job properly. How much of the 'Improving Recovery Procedures' were already highlighted but ignored? Were they pressured to deliver other features instead of bedding down some of their operations procedures?

I'm not saying this is the case here but it's all too easy to blame someone for making a mistake. Even the most experienced make mistakes but reducing your MTTR is often overlooked in favour of other seemingly more pressing concerns.

_ph_ · on Feb 11, 2017

It was not the managements job to prevent the engineer from typing "rm". It was the managements job to make sure that typing "rm" would not result in big data loss. This is, assuming the engineer was not already the highest ranking technical person in the company.

I am very happy about their open post-mortem, so that anyone can learn from it. Reading it, it looks to me that the "rm" was not the cause of the disaster, it just triggered it. The real problem was the whole setup, which failed. And that is something, which falls under managements responsibility.

chadcmulligan · on Feb 11, 2017

doesn't everyone alias rm to rm -i on prod?

likewise all tty's have red backgrounds on prod.

nathancahill · on Feb 11, 2017

I'm not sure if you're making an equally sarcastic point as your parent or not..

chadcmulligan · on Feb 11, 2017

nope serious. I was at a place that the dba did exactly what happened at gitlab but in sql

select * from table > script

@script.

(drop all the tables)

It was in prod, he thought it was a dev db, the backups had never worked. After this the edict was all terminals for prod will be red. A simple solution

pfarnsworth · on Feb 11, 2017

90% of all outages are caused by human error. That's why change management solutions were so big 10 years ago, trying to get rid of the human element of changes in an enterprise.

angry_octet · on Feb 11, 2017

They way to deal with that is not to crush the people, but to analyse the ways human decision making fails and to provide assistance. More often than not the failures occur because the people had not been trained correctly, which is a management problem not a PEBKAC one (how many junior systems people have been trained in how to recover from cascading disk/network/database failures?). ITIL is just a tool for ensuring failures which no one understands because understanding has been devalued.

isoos · on Feb 11, 2017

sytse and GitLab folks: thank you for the transparency.

sytse · on Feb 11, 2017

You're welcome. Thanks for all the kind responses we received https://twitter.com/i/moments/826818668948549632

samat · on Feb 10, 2017

Am I missing something or didn't they mention 'test recovery, not backups'?

sytse · on Feb 11, 2017

Two of the issues linked from the article deal with testing the backups:

- Automated testing of recovering PostgreSQL database backups https://gitlab.com/gitlab-com/infrastructure/issues/1102

- Build Streaming Database Restore https://gitlab.com/gitlab-com/infrastructure/issues/1152

XorNot · on Feb 11, 2017

The backup situation stands out to me as a problem no one has really adequately solved. Verifying a task has happened in a way where the notifications are noticed is actually a really hard problem that it feels like we collectively ignore in this business.

How do you reliably check if something didn't happen? Is the backup server alive? Did the script work? Did the backup work? Is the email server working? Is the dashboard working? Is the user checking their emails (think: wildcard mail sorting rule dumping a slight change in failure messages to the wrong folder).

And the converse answer isn't much better: send a success notification...but if it mostly succeeds, how do you keep people paying attention to it when it doesn't (i.e. no failure message, but no success message)?

The best answer I've got, personally, is to use positive notifications combined with visibility - dashboard your really important tasks with big, distinctive colors - use time based detection and put a clock on your dashboard (because dashboards which mostly don't change might hang and no one notice).

nodesocket · on Feb 11, 2017

My main question is still:

>> Why did replication stop? - A spike in database load caused the database replication process to stop. This was due to the primary removing WAL segments before the secondary could replicate them.

Is this a bug/defect in PostgreSQL then? Incorrect PostgreSQL configuration? Insufficient hardware? What was the root cause of Postgres primary removing the WAL segments?

cookiecaper · on Feb 11, 2017

The cause is bad configuration.

PgSQL, Mongo, and MySQL all use a transaction stream like this for replication and they all have to put some kind of cap on it or risk running out of disk space, but the cap should be made sufficiently large to allow automatic resumption of disconnected slaves without manual redumping, except in extraordinary circumstances. Log retention should be long enough to last at least a long weekend so that someone can come in and poke the DB back into action on Tuesday morning, but preferably more like 1 week. Alarms should be configured to fire well before replication lag gets anywhere near the log expiration timeout.

In particular, PostgreSQL has a feature that allows automatic WAL archiving (i.e., it confirms that the WAL has been successfully shipped to a separate system before it removes it from the master) and a feature called "replication slots" that ensures that all WALs are kept if a regular subscriber is offline. If either of these features had been correctly configured, there would've been no need to do a full resync; the secondary database would've come back and immediately picked up where it left off.

Additionally, if one must resync the full database (and I've had to do this many times), tools like pg_basebackup and innobackupex are basically required to consistently perform the process of pulling the master dumps, and the old (unsynced) data directory should be allowed to linger until the full master snapshot has been fully confirmed and is ready to resync. It's very reckless to go around removing binary data directories until you're certain that the new stuff is running, even if you're "just on the replicant".

With pg_basebackup, you run it on the replicant server and it streams down the files, no need to log into the master server at all. With innobackupex, you need to have read access to the master's binary data directory, but should achieve this safely through something like a read-only NFS mount. mydumper is a possible alternative to innobackupex that tries to capture the binlog coords and doesn't require any direct access to the host beneath the database server.

scurvy · on Feb 11, 2017

+1 for replication slots. Just remember to remove the slot if you decommission a server; otherwise storage on the master will grow forever.

innobackupex works fine locally on the server, streaming out to netcat or ssh on the remote side. Nothing wild like read only NFS required. It also copies all binlogs. Mydumper is pretty old at this point and doesn't do most of the things innobackupex can. I wouldn't recommend it.

nodesocket · on Feb 11, 2017

Wow, thanks. This is like the best answer I've ever seen. You absolutely nailed it.

Are you by any chance looking for any DevOps/Ops consulting? I just founded my third startup Elastic Byte (https://elasticbyte.net) and always looking for smart people. We're a consulting startup that helps companies manage their cloud infrastructure.

cookiecaper · on Feb 11, 2017

I'm flattered, but I'm sure there are better options out there, and I'm swamped as it is anyway. ;) I'll keep it in mind and reach out if something changes.

I do think that's a great startup, though, and this post-mortem and incident only proves how badly it's needed. A lot of people already think they're getting something like what you're offering when they sign up with a cloud provider.

nodesocket · on Feb 11, 2017

No worries. You're not in the bay by chance are you? Coffee/beer on me if so.

ploxiln · on Feb 11, 2017

There's always some limit. At $PREVIOUS_JOB I think it was at least 48 hours, probably over 72 hours, of replication log retention (usually measured in GiB though). So it's surprising that in GitLab's case it must have been less than 6 hours (IIRC from the original google doc the slave had more than 4 hours replication lag due to load, initially ...)

scaryclam · on Feb 12, 2017

There was a nice response from the PostgreSQL guys here: http://blog.2ndquadrant.com/dataloss-at-gitlab/

There is a bug that might have been hit, but it appears as though there were other issues at play as well.

dancryer · on Feb 13, 2017

Can't help but notice that the new backup monitoring tool suggests that the latest PGSQL backup is almost six days old...

Is that correct? http://monitor.gitlab.net/dashboard/db/backups?from=14859419...

nierman · on Feb 11, 2017

yes, wal archiving would have helped (archive_command = rsync standby ...), but it's also very easy in postgres 9.4+ to add a replication slot on the master so that wal is kept until it is no longer needed by the standby. simply reference the slot in the standby's recovery.conf file.

definitely monitor your replication lag--or at least disk usage on the master--with this approach (in case wal starts piling up there).

nstj · on Feb 11, 2017

@sytse were you in contact with MS/Azure during the restore? If so did they offer any assistance, e.g in speeding up restoration disk speed etc

Achshar · on Feb 11, 2017

Does anyone have a link to the YouTube stream they's talking about? Can't seem to find it on their channel. And the link in the doc is redirecting to the live link [1] which doesn't list the stream.

[1] - https://www.youtube.com/c/Gitlab/live

cmatija · on Feb 11, 2017

Yup,

Thanks for taking the interest to check it out.

It's an unlisted YT video, so that's why it might be hard to find.

Here it is: https://www.youtube.com/watch?v=nc0hPGerSd4

Achshar · on Feb 12, 2017

Thanks! I'll check it out.

jsingleton · on Feb 11, 2017

TIL GitLab runs on Azure. If your CI servers or deployment targets are also on Azure then the latency should be pretty low (assuming you get the correct region). Good to know.

I moved from AWS to Azure years ago. Mainly because I run mostly .NET workloads and the support is better. I've recently done some .NET stuff on AWS again and am remembering why I switched.

AlexCoventry · on Feb 11, 2017

Thank you for this informative postmortem and mitigation outline.

Are any organizational changes planned in response to the development friction which led to the outage? It seems to have arisen from long-standing operational issues, and an analysis of how prior attempts to address those issues got bogged down would be very interesting.

oli5679 · on Feb 10, 2017

I found this entertaining, even if they did later admit that it was a hoax:

http://serverfault.com/questions/587102/monday-morning-mista...

tschellenbach · on Feb 11, 2017

Shouldn't the conclusion of this post mortem be a move to a managed database service like RDS? The database doesn't sound huge, RDS is affordable enough, sounds to me that you spend less money and have better uptime and sleep by moving away from this in-house solution.

carterehsmith · on Feb 11, 2017

According to this article:

> https://www.theregister.co.uk/2016/11/14/gitlab_to_dump_clou...

...they were moving away from the cloud, to their own servers.

sytse · on Feb 11, 2017

We changed our minds. We're working on a post in https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/... but we wanted to post this first.

carterehsmith · on Feb 11, 2017

It would be great for all of us if you provided your reasoning: to cloud, or not to cloud :) Thanks