Google Cloud Downtime Postmortem (cloud.google.com)
346 points by tosh on July 19, 2018 | 82 comments



> One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout.

I wish they had elaborated more on what type of bug it was that wasn't caught by testing / initial rollout. Either the tests must be poorly written or the bug must be very subtle.

On an unrelated note: kudos to Google for publishing this postmortem, and I hope this becomes an industry-wide practice. I also wish they would publish (a belated) one about Google+ and their throng of messaging apps over the years.


With all due respect, it's not a postmortem, it's an advert. It doesn't really say anything other than "We had a problem, we fixed it." There are virtually no technical details in there other than "something would restart spontaneously, which shifted the load somewhere else". Maybe I'm a bit jaded by the Cloudflare and AWS writeups, but this really isn't anything special or worthwhile reading.


I dunno, it looks comprehensive enough to me. What were you looking for, exactly? Source code? Packet traces? Pew-pew maps?

A useful PM includes a summary, impact analysis, root cause analysis, and a comprehensive (and realistic) set of measures that will prevent recurrence.

After reading what they provided, I have a reasonable understanding of what went wrong (sufficient for me to plan my own safeguards if necessary), a useful measure of the team's response and remediation capabilities, and enough information to judge my comfort with their preventative measures.

Anything else is just cleverly-disguised marketing.


It's pretty bare-bones next to a postmortem like https://www.epicgames.com/fortnite/en-US/news/postmortem-of-...


I feel that's way too much info.


To your point, all postmortems are advertisements. This is just an unconvincing "me too" ad. Amazon's and Cloudflare's detailed, high-quality write-ups give technical people more confidence in supporting the products, as well as more interest in joining the company.

All postmortems are ads, but not all ads are effective.


This is a self-congratulatory, lazy... ad.


Given my recent experiences with Google, I have to say that I'm impressed that they supplied even that much detail.


That's an interesting long-term strategy: keep the bar so low that the slightest step clears it.


You've just described most people's career plans.


Very little detail. I'm sure they're worried about disclosing something about their stack that would open them up to more vulnerabilities, which is understandable, but not explaining why they didn't catch it in their testing... that's something they could disclose without any risk.

It's easier to disclose details of a physical infrastructure outage or a bug in someone else's code/product than a bug in your own code.


In this case, they might be hard-pressed to go into details, since it was related to an unreleased feature:

« These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout. »

...but yeah, they definitely could have done better.


Another reason postmortems can be vague is that, for each detail you add, you might need to add even more background details. Then it becomes a recursive problem.

In this case, the "configuration change" could be a feature flag in the L2 GFEs, something in the L1 GFEs that changed L1->L2 requests in minor ways, or maybe something else entirely (since it's partially security-related: a dynamic LOAS handshake change to use different ciphers? So many possibilities). At the end of the day, though, it's still a specific permutation of all possible features and flags that hadn't been vetted before. Given how large Google's monorepo is, it's not impossible for one of your many dependencies, even indirect ones, which might be configured by another team entirely, to have subtle time bombs that only trigger well after you have built and deployed the code.

Having been on the other side, I know that, for every detail added, a bunch more questions come up.


Is there a write-up that explains the technologies you mentioned in your post? LOAS, L1 GFE, L2 GFE... I've never heard of those.


LOAS is how Stubby (gRPC) connections on the internal network are secured; it looks a bit like mTLS and is partially being open-sourced as ALTS: https://cloud.google.com/security/encryption-in-transit/appl...

GFE (Google Front End) is what you connect to when you access any Google service through your browser. Think nginx or even ELB. It's a load-balancing reverse proxy, as well as a WAF. It's mentioned here https://cloud.google.com/security/infrastructure/design/ and probably in the SRE book. This report might be the first time Google has mentioned in public that there can be two levels of GFEs, but I remember at least one service using such a setup many, many years ago.


L1 and L2 GFE are the Level 1 and Level 2 Google Front Ends which the post mortem talks about. I'm not familiar with LOAS.


Totally agree with you on this one. While they keep the full technical details to themselves and off the page, they could at least give a link or something where interested people can read what went on.

Also, if I gave a similarly vague root cause in my environment, I'd be laughed out of the room. We absolutely need to state what went wrong, what the immediate fix was, what the permanent fix is, and whether it's done (or whether it requires downtime / a restart).


Google internally will totally have a document with stack traces and code snippets.

They have just chosen to only post a brief summary.


Actually "We had a problem, we fixed it, we made sure it won't happen again."

That last bit would be more marketing except that it's true - I've used Google Appengine for years, and the rare outages are always unique issues.

I'd love to see comparative numbers, but my impression is that this focus on improvement has led to Google's uptime being a lot better than AWS's or Azure's.


I agree. The AWS S3 outage postmortem had a much more specific description of the technical problem (even though it amounted to bad parameters being passed to a script).


It's the perfect doc to show my leadership team, who want to know what happened, but aren't very technical.


To be fair, nowhere on the linked page does it say that this is a postmortem. The submitter called it that.


> On an unrelated note: kudos to Google for publishing this postmortem, and I hope this becomes an industry-wide practice.

I think most already do this. I have seen AWS also publish detailed postmortems for outages like this. Ex: https://aws.amazon.com/message/41926/


I started tracking them a while back, didn't stick with it for very long.

https://github.com/macintux/Service-postmortems


This is a super-high-level summary, not a postmortem. In fact, it doesn't say anything about the bug, why there was no test for it, what repair items have been done, etc.

Amazon's and Microsoft's postmortems are much more to the point; one can actually learn from them and avoid making the same or a similar mistake.


I will note that the word "postmortem" appears nowhere on the page, which is dedicated to status updates about the incident in question.


It reads to me like the bug required a particular feature to be enabled, and the feature wasn't enabled when the software was rolled out; then the feature was globally knife-switched to the "on" position without an orderly staged rollout.

I actually think it looks pretty bad that one of the action items in this report is to make a feature dashboard for GFEs. They've been saying they will do that for years, and the team that operates GFEs is considered the most elite of all SRE teams. Most famous outages of Google products have been caused by bogus configurations pushed to GFEs or network devices in front of them.


It's sort of like that extra disk in your RAID 1 setup at home.

Compared to a single SSD the performance improvement doesn't really show (for desktop loads)...

However when things come to a rare (but somewhat inevitable) and screeching halt and that mirror has one of the copies shatter beyond recognition... that's when the doubled price proves that it was worth it.

Such a dashboard would invariably also add load and complexity (both failure points) to the system, but outwardly most users would be unaware of its existence.


As it says in the note, the bug was in a feature that was latent but not yet being used. Then a configuration change started hitting the feature.

With highly redundant systems such as this, you generally need multiple layers of things going wrong all at once to notice an issue. This was the case here as well.


  Either the tests must be poorly written or
  the bug must be very subtle.
The easiest way to make a web service work in testing but fail in production is a problem with the settings that are necessarily different between environments.

For example, you must have different database credentials between test and production, and you must limit who can read the production credentials. If the production credentials are malformed, a service that worked in test will fail in production.

And the same applies to your SSL certificates, your settings for enabled/disabled features, your flashy markers that stop people mistaking production for test....
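
For instance, here's a minimal sketch (the names and environment variables are hypothetical, not anything from the GFE incident) of loading per-environment settings and failing fast when production credentials are malformed:

    import os

    # Hypothetical sketch: settings are read from environment variables named
    # TEST_DB_HOST, PROD_DB_HOST, etc., so test and production necessarily differ.
    REQUIRED_KEYS = ("DB_HOST", "DB_USER", "DB_PASSWORD")

    def load_settings(env: str) -> dict:
        settings = {key: os.environ.get(f"{env.upper()}_{key}") for key in REQUIRED_KEYS}
        missing = [key for key, value in settings.items() if not value]
        if missing:
            # A failure here only shows up in the environment whose variables
            # are broken -- i.e. it can pass in test and still fail in prod.
            raise RuntimeError(f"{env}: missing or empty settings: {missing}")
        return settings

    if __name__ == "__main__":
        load_settings(os.environ.get("APP_ENV", "test"))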


Sometimes bugs are triggered by very specific data or queries (aka "queries of death"). GFEs have many tests, so I'd go for the subtle bug.


I suppose every detected "query of death" should be included in the next version of the service's test suite.
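
A minimal sketch of that pattern with pytest; the corpus and the handler here are hypothetical stand-ins, not anything from the actual GFE:

    import pytest

    # Hypothetical corpus of previously observed "queries of death"; each new
    # incident adds an entry so the crash can never silently come back.
    QUERIES_OF_DEATH = [
        "q=" + "\x00" * 4096,    # e.g. a null-byte flood
        "depth=" + "9" * 100,    # e.g. an absurdly large recursion depth
    ]

    def handle_request(query: str) -> int:
        """Stand-in for the real request handler; returns an HTTP status code."""
        try:
            # Real parsing and serving logic would live here.
            return 414 if len(query) > 2048 else 200
        except Exception:
            return 500

    @pytest.mark.parametrize("query", QUERIES_OF_DEATH)
    def test_query_of_death_is_survivable(query):
        # The service must answer with some status rather than crash or restart.
        assert 100 <= handle_request(query) < 600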


I bet there is already a regression test for this one.


> ... this bug had not been detected in either of testing and initial rollout.

It makes me so happy[0] that the big smart corps have the same problems that us plebs have.

I tell people that testing can never precisely duplicate the production environment, but do they believe me?

Also, this is an argument for feature switches vs staging environments.

[0] But not in a schadenfreude sense.


Another possibility is that the change was not tested at all or very minimally tested -- and they would not want to admit to that publicly.


Amazing that they handled the downtime in such a short time span. Boosts my confidence to actually use GCP. Huge props!

Meanwhile... waiting for the Amazon Prime Day postmortem.


+1'ed you on the props to GCP, but the Amazon Prime Day issue was only on their retail site, right? Do vendors there have an SLA, or is there otherwise an obligation to publish a post-mortem on the incident? I think it's different when it's a platform/service provider, but the Prime Day outage only hurt Amazon. My recollection is that AWS service outages have had prompt and thorough published incident reports.


I have seen SaaS products provide incident reports to customers, but those were products with SLAs.

It depends on the Amazon retail site in this case. Netflix has published postmortems in the past, but then again, Netflix is very good at blogging about (and open-sourcing) their engineering efforts.

https://medium.com/netflix-techblog/a-closer-look-at-the-chr...


> I think it's different when it's a platform/service provider but the prime day outage only hurt Amazon.

Have you not used Amazon? It's where half the country does their shopping. Many businesses live and die on Amazon just as much as they do on AWS.


It may come as a surprise, but I have used Amazon. Both on a professional and personal basis! (I say this tongue in cheek).

I know that businesses which depend on Amazon.com as a primary sales channel were affected, but they don't pay Amazon to sell their product (maybe Pro merchants are the exception). I think they owe us Prime members an outage report even more on that basis.

In any case, I think it would be a good idea for them to write an incident report, but don't think it's comparable.

edit: I give. Like I said, I think it's different than AWS/GCP outages, but your point is a good one and I think it's a good idea for them to publish a report on it. I look forward to seeing it someday.


> but they don't pay Amazon to sell their product

They certainly do.

As with AWS, there are various services offered, each with their own pricing structure.

* Payment mechanism (Amazon Pay)

* E-commerce interface (Sell on Amazon)

* Services marketplace (Selling Services on Amazon)

* Advertising (Advertise on Amazon)

* Warehouse/logistics provider (Fulfillment by Amazon)

Amazon, eBay, or your own ecommerce website problems can cost exactly as much business as AWS, GCP, or your own data center problems.

https://services.amazon.com/


Do you mean that businesses "live and die" based on the ability to sell stuff on Amazon (obvious), or to buy stuff on Amazon (much less obvious, potentially interesting)?

If the latter, I wonder whether such customers would like to see Amazon introduce some sort of lower-level "Amazon purchasing API" that would continue to function even when the website doesn't, and which doesn't include any of the features that could topple the site (mostly, no paginated browse/search result API—you would have to already know the ID of the product you're buying.)


How many businesses died as a result of the prime day outage?


None. Sales were way above those of a typical day.

The outage was way over-hyped.


How many businesses died as part of the S3 outage of 2017?


I don't think there's an obligation for any company to publish a post mortem. But if a company does, it shows that they're accountable and that they care about their customers.


> Meanwhile.. waiting for the Amazon prime day post mortem.

Why are you comparing a cloud provider with a retail site?

There was an issue in AWS Frankfurt yesterday; I'm waiting for the post-mortem on that.


My guess is that Amazon's post mortem is basically "we misjudged how many people would be hitting our site". Not that interesting or noteworthy.


> Google engineers were alerted to the issue within 3 minutes and began immediately investigating

As users of their service, our engineers were notified less than 30 seconds after the issue started. Given that GCP had a large impacted population, how is it that it took them so much longer to acknowledge it?

> The GFE development team was in the process of adding features to GFE to improve security and performance. These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout.

Something going down after a deployment is the most common source of issues. Monitoring system KPIs for abnormalities after a rollout, and performing an almost instant automatic rollback when one appears, should be common practice. Also, doesn't GCP perform dark launches and partial launches? Launch to 1%, check KPIs, increase to 5%, and so on?
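
Something like the following sketch, where all names, thresholds, and callables are hypothetical stand-ins for whatever the real deployment system exposes:

    import time

    ROLLOUT_STAGES = (0.01, 0.05, 0.25, 1.0)   # hypothetical traffic fractions
    ERROR_RATE_THRESHOLD = 0.02                # hypothetical KPI threshold
    SOAK_SECONDS = 600                         # hypothetical soak time per stage

    def progressive_rollout(set_traffic_fraction, read_error_rate, rollback):
        """Ramp a release up stage by stage, rolling back on a KPI anomaly."""
        for fraction in ROLLOUT_STAGES:
            set_traffic_fraction(fraction)
            time.sleep(SOAK_SECONDS)            # let the KPIs settle at this stage
            if read_error_rate() > ERROR_RATE_THRESHOLD:
                rollback()                      # near-instant automatic rollback
                raise RuntimeError(f"KPI regression at {fraction:.0%}; rolled back")
        # Reaching this point means the release is at 100% with healthy KPIs.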


Maybe they were on reddit. Seriously though, 30 seconds vs 3 minutes is a long time in the reliability world but not atrocious.


It's the difference between being at home, sober, and with a laptop handy vs. being actively logged in and ready to go at all times.

In terms of quality of life for people on call, it's an entirely different world. And in a setup where your oncall engineers are extremely highly skilled and have all the choice in the world in terms of where to work, that little bit of respect of their time is a necessary investment.


Did they learn nothing from the Azure of yore? You don't roll out a change globally, ever.

While the postmortem is appreciated, I'd rather they just didn't roll out changes globally.


I'm very impressed that they can go from deploying a fix (12:44) to it being effective (12:49) in just 5 minutes.


Reading the post mortem, I'm guessing it's more like it took them 30 minutes to figure out what was happening and then they rolled back the deploy and it was fixed.

Edit: Reading further down, they actually admit that it was just a rollback:

> At 12:44 PDT, the team discovered the root cause, the configuration change was promptly reverted, and the affected GFEs ceased their restarts


I love GCP's postmortems: they're open, honest, and insightful, and I wish I could get my company to OK us releasing details like this when we have outages. It's part of the reason I personally like GCP more than AWS (and certainly Azure; those guys don't admit to shit regardless of how bad the outage is).

Edit: Wow, downvotes because I like transparency from my cloud hoster, super interesting...


You're getting downvotes because this was not a transparent and open report. It was vague and was more advertise-y than postmortem-y.

Not saying Amazon is perfect by any means either, but there's a lot of room for improvement. Good postmortems give everyone ideas on how to solidify their own processes and prevent other issues. This was just fluff.


A little offtopic but, why is it called a postmortem?

When I was first introduced to the concept of incident reports, it was under the name of postmortem; I worked for a mainly English-speaking company then and didn't think twice about it. But earlier this week I mentioned it to a colleague and he found it a rather macabre term for something like an incident report. When you think of it, nothing really died (maybe some engineers died a little inside because their design was not as 100% reliable as they thought). For the rest it was just a temporary state, nothing permanent like death. All other uses (e.g. medical) of this word seem to relate strictly to death.

Maybe it is because incident reports just sound too formal, or is there an etymology of this term in the IT world?


It's called a 'post mortem' in the medical world because the patient died. It involves a high level of inspection into what went wrong and what can be learned to prevent it. I assume the term was adopted from there into project management.

It might make a little more sense in the world of shipping software in retail boxes, where products/projects had a 'done' date. The project is dead; what contributed to its demise? Or you might generalize death into failure, and that's why we use the term instead of post-incident.


Not all post-mortems are for failures/dead products which might add even more confusion. For instance Gamasutra runs a game development post-mortem section where developers of popular (and unpopular) games can hop on and describe difficult situations they've encountered, how they did what they did, why they did that thing, etc.

It's really fun to read.

http://www.gamasutra.com/features/postmortem/

edit: Wow, there hasn't been one since 2014. I wonder why this died out. There's 10 pages of them since 2007..


Ironic. Apparently we need a postmortem postmortem


The earliest I know of: https://books.google.com/books?id=mskkmVkpIUcC&lpg=PP1&pg=PP...

I've seen the act of analyzing project failures by this name in software engineering/management ever since.

Maybe it was a bit more relevant in the days of large waterfall software project management, where failure often meant the end of a project with no product launched. Sometimes after a "death march": https://en.wikipedia.org/wiki/Death_march_(project_managemen...

But it does seem natural to me that it has been carried over to current days, and applied to analyzing failures in the context most relevant to modern software development.

I never stopped to think about the weirdness of the term's application. It feels so natural to me that I'd suggest we all start calling a fixed service "resurrected". As in: "Google Cloud Global Load Balancers were resurrected at 13:19." (-:


That does seem to make sense. The term carries the weight of a thorough investigation, so I get why it stuck as something more powerful than a mere incident report.


The term 'post-mortem' doesn't appear anywhere on that page. It was added by the OP in the title. In reality it should be called an incident report.

If we want to be technical, a post-mortem in the tech world is commonly used to outline failures that occurred during a normal event (e.g. a software release), not a random production issue.


Well the outage itself died. Hopefully. Otherwise it would just be an incident status update.


This was amusing for me, because I am JUST starting to test out Google's cloud offerings and got hit by this outage on basically day 1. Luckily, it wasn't too long before I figured out it was them and not me.


We've been on Appengine for years, and one of the un-intuitive benefits is that when something goes wrong, all you can do is sit back and tell your customers "a team of the best sysadmins in the world are working on it right now".

I've met many people who hate that abdication of responsibility, and would prefer to be heroically hacking solutions at 3am when their database replication fails. Maybe they feel they have to be punished for failing their customers.

My advice is keep testing GC, because in my experience it is very reliable. And once you realise that, the peace of mind is awesome.


Bobby Tables?


Recently ran into an issue with a provider's CSV export not quoting some of their text fields. It was a minor annoyance: the data in the text field was user input containing commas, so importing into Excel mangled the columns.

I then made the comment that someone should sign in using Bobby Drop Tables for a user name. The silence in the room quickly reminded me that I was not in the company of other developers. Such a waste.
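
For what it's worth, a minimal sketch of the fix on the exporter's side (hypothetical data; Python's csv module, like any real CSV writer, quotes fields that contain the delimiter):

    import csv
    import io

    rows = [
        ["id", "comment"],
        [1, "user input with, embedded, commas"],  # the kind of field that broke the import
    ]

    buf = io.StringIO()
    # QUOTE_MINIMAL (the default) quotes any field containing the delimiter,
    # quote character, or a newline, so embedded commas no longer shift columns.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    print(buf.getvalue())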


I believe the load balancer outage only affected global load balancers, not regional load balancers. Is that still accurate?


My understanding is yes; I only use regional, which encountered no outages. This hugely affected appengine users. I've not encountered a lot of GKE users that use Global LBs yet (I've run in GKE for roughly 4 years).


A good write up, thanks.

I remember a configuration change rolled out by an automated system causing a problem on GCP a few years ago; it's an interesting area that's probably quite hard to fix.


I'm an SRE for a large software company. We call it an "incident retrospective" for this reason.


It sounds like the impact wasn't limited to particular users or regions. Why would they deploy the configuration worldwide at the same time?


It's a global load balancer.


Bet Tesla are glad they made the switch to MapBox and Valhalla months ago.


Genuine question: what's Valhalla? Google returned something Norse.


Tack the "Tesla" keyword onto your search: "Tesla Valhalla".


That does it, but you can also find more detailed information on their GitHub page: https://github.com/teslamotors/valhalla


Basically, due to the lack of customer service, Google Cloud is not for serious business, just casual side projects. Reading through this, and the article about people getting their servers cut off, made my stomach lurch.


All major cloud platforms have occasional unplanned downtime. AWS had an outage last year that took many sites offline for hours. A single instance of Google having such unplanned downtime is meaningless without more datapoints.

As for whether it's useful for serious business. Well. Proof by example? This has sixteen pages of case studies: https://cloud.google.com/customers/. That is by no means all of GCP's serious customers.


That sort of proves my point :( Unless you are a big fish with marketing potential, you're nothing to Google.


We have been using GCP for 6 years (and are one of the companies in the linked case studies), and I must say that their customer service is good.

I think I've had over 30 interactions with their support and key account managers, regarding everything from billing to minor issues with services to just asking for advice. They have always delivered.

The expectation from us has been that we pay the $150/month support package fee.



