Google Cloud Downtime Postmortem (cloud.google.com)
346 points by tosh on July 19, 2018 | 82 comments



> One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout.

I wish they had elaborated more on what type of bug it was that wasn't caught by testing / initial rollout. Either the tests must be poorly written or the bug must be very subtle.

On an unrelated note: kudos to Google for publishing this postmortem, and I hope this becomes an industry-wide practice. I also wish they would publish (a belated) one about Google+ and their throng of messaging apps over the years.


With all due respect, it's not a postmortem, it's an advert. It doesn't really say anything other than "We had a problem, we fixed it." There are virtually no technical details in there other than "something would restart spontaneously, which shifted the load somewhere else". Maybe I'm a bit jaded by the Cloudflare and AWS writeups, but this really isn't anything special or worthwhile reading.


I dunno, it looks comprehensive enough to me. What were you looking for, exactly? Source code? Packet traces? Pew-pew maps?

A useful PM includes a summary, impact analysis, root cause analysis, and a comprehensive (and realistic) set of measures that will prevent recurrence.

After reading what they provided, I have a reasonable understanding of what went wrong (sufficient for me to plan my own safeguards if necessary), a useful measure of the team's response and remediation capabilities, and enough information to judge my comfort with their preventative measures.

Anything else is just cleverly-disguised marketing.


It's pretty bare-bones next to a postmortem like https://www.epicgames.com/fortnite/en-US/news/postmortem-of-...


I feel that's way too much info.


To your point, all postmortems are advertisements. This is just an unconvincing "me too" ad. Amazon's and Cloudflare's detailed, high-quality write-ups give technical people more confidence in supporting the products, as well as more interest in joining the company.

All postmortems are ads, but not all ads are effective.


This is a self-congratulatory, lazy... ad.


Given my recent experiences with Google, I have to say that I'm impressed that they supplied even that much detail.


That's an interesting long-term strategy: keep the bar so low that the slightest step clears it.


You've just described most people's career plans.


Very little detail. I'm sure they're worried about disclosing something about their stack that would open them up to more vulnerabilities, which is understandable, but not explaining why they didn't catch it in their testing... that's something they could disclose without any risk.

It's easier to disclose details of a physical infrastructure outage or a bug in someone else's code/product than a bug in your own code.


In this case, they might be hard-pressed to go into details, since it was related to an unreleased feature:

« These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout. »

...but yeah, they definitely could have done better.


Another reason postmortems can be vague is that, for each detail you add, you might need to add even more background details. Then it becomes a recursive problem.

In this case, the "configuration change" could be a feature flag in the L2 GFEs, something in the L1 GFEs that changed L1->L2 requests in minor ways, or maybe something else entirely (since it's partially security-related: a dynamic LOAS handshake change to use different ciphers? So many possibilities). At the end of the day, though, it's still a specific permutation of all possible features and flags that hadn't been vetted before. Given how large Google's monorepo is, it's not impossible for one of your many dependencies, even indirect ones, which might be configured by another team entirely, to have subtle time bombs that only trigger well after you have built and deployed the code.

Having been on the other side, I know that, for every detail added, a bunch more questions come up.


Is there a write-up that explains the technologies you mentioned in your post? LOAS, L1 GFE, L2 GFE... I've never heard of those.


LOAS is how Stubby (gRPC) connections on the internal network are secured; it looks a bit like mTLS and is partially being open-sourced as ALTS: https://cloud.google.com/security/encryption-in-transit/appl...

GFE (Google Front End) is what you connect to when you access any Google service through your browser. Think nginx or even ELB. It's a load-balancing reverse proxy, as well as a WAF. It's mentioned here https://cloud.google.com/security/infrastructure/design/ and probably in the SRE book. This report might be the first time Google has mentioned in public that there can be two levels of GFEs, but I remember at least one service using such a setup many, many years ago.


L1 and L2 GFE are the Level 1 and Level 2 Google Front Ends which the post mortem talks about. I'm not familiar with LOAS.


Totally agree with you on this one. While they keep the full technical details to themselves and off the page, they could at least give a link or something where interested people can read what went on.

Also, if I gave a similarly vague root cause in my environment, I'd be laughed out of the room. We absolutely need to state what went wrong, what the immediate fix was, what the permanent fix is, and whether it's done (or whether it requires downtime / a restart).


Google internally will totally have a document with stack traces and code snippets.

They have just chosen to only post a brief summary.


Actually "We had a problem, we fixed it, we made sure it won't happen again."

That last bit would be more marketing except that it's true - I've used Google Appengine for years, and the rare outages are always unique issues.

I'd love to see comparative numbers, but my impression is that this focus on improvement has led to Google's uptime being a lot better than AWS's or Azure's.


I agree. The AWS S3 outage postmortem had a much more specific description of the technical problem (even though it amounted to bad parameters being passed to a script).


It's the perfect doc to show my leadership team, who want to know what happened, but aren't very technical.


To be fair, nowhere on the linked page does it say that this is a postmortem. The submitter called it that.


> On an unrelated note: kudos to Google for publishing this postmortem, and I hope this becomes an industry-wide practice.

I think most already do this. I have seen AWS also publish detailed postmortems for outages like this. Ex: https://aws.amazon.com/message/41926/


I started tracking them a while back, didn't stick with it for very long.

https://github.com/macintux/Service-postmortems


This is a super-high-level summary, not a postmortem. In fact, it doesn't say anything about the bug, why there was no test for it, what repair items have been done, etc.

Amazon's and Microsoft's postmortems are much more to the point; one can actually learn from them and avoid making the same or a similar mistake.


I will note that the word "postmortem" appears nowhere on the page, which is dedicated to status updates about the incident in question.


It reads to me like the bug required a particular feature to be enabled, and the feature wasn't enabled when the software was rolled out; then the feature was globally knife-switched to the "on" position without an orderly staged rollout.

I actually think it looks pretty bad that one of the action items in this report is to make a feature dashboard for GFEs. They've been saying they will do that for years, and the team that operates GFEs is considered the most elite of all SRE teams. Most famous outages of Google products have been caused by bogus configurations pushed to GFEs or network devices in front of them.


It's sort of like that extra disk in your RAID 1 setup at home.

Compared to a single SSD the performance improvement doesn't really show (for desktop loads)...

However when things come to a rare (but somewhat inevitable) and screeching halt and that mirror has one of the copies shatter beyond recognition... that's when the doubled price proves that it was worth it.

Such a dashboard would invariably also add load and complexity (both failure points) to the system, but outwardly most users would be unaware of its existence.


As it says in the note, the bug was in a feature that was latent but not yet being used. Then a configuration change started hitting the feature.

With highly redundant systems such as this, you generally need multiple layers of things going wrong all at once to notice an issue. This was the case here as well.


  Either the tests must be poorly written or
  the bug must be very subtle.
The easiest way to make a web service work in testing but fail in production is a problem with the settings that are necessarily different between environments.

For example, you must have different database credentials between test and production, and you must limit who can read the production credentials. If the production credentials are malformed, a service that worked in test will fail in production.

And the same applies to your SSL certificates, your settings for enabled/disabled features, your flashy markers that stop people mistaking production for test....
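
For instance, here's a minimal sketch (the names and environment variables are hypothetical, not anything from the GFE incident) of loading per-environment settings and failing fast when production credentials are malformed:

    import os

    # Hypothetical sketch: settings are read from environment variables named
    # TEST_DB_HOST, PROD_DB_HOST, etc., so test and production necessarily differ.
    REQUIRED_KEYS = ("DB_HOST", "DB_USER", "DB_PASSWORD")

    def load_settings(env: str) -> dict:
        settings = {key: os.environ.get(f"{env.upper()}_{key}") for key in REQUIRED_KEYS}
        missing = [key for key, value in settings.items() if not value]
        if missing:
            # A failure here only shows up in the environment whose variables
            # are broken -- i.e. it can pass in test and still fail in prod.
            raise RuntimeError(f"{env}: missing or empty settings: {missing}")
        return settings

    if __name__ == "__main__":
        load_settings(os.environ.get("APP_ENV", "test"))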


Sometimes bugs are triggered by very specific data or queries (aka "queries of death"). GFEs have many tests, so I'd go for the subtle bug.


I suppose every detected "query of death" should be included in the next version of the service's test suite.
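
A minimal sketch of that pattern with pytest; the corpus and the handler here are hypothetical stand-ins, not anything from the actual GFE:

    import pytest

    # Hypothetical corpus of previously observed "queries of death"; each new
    # incident adds an entry so the crash can never silently come back.
    QUERIES_OF_DEATH = [
        "q=" + "\x00" * 4096,    # e.g. a null-byte flood
        "depth=" + "9" * 100,    # e.g. an absurdly large recursion depth
    ]

    def handle_request(query: str) -> int:
        """Stand-in for the real request handler; returns an HTTP status code."""
        try:
            # Real parsing and serving logic would live here.
            return 414 if len(query) > 2048 else 200
        except Exception:
            return 500

    @pytest.mark.parametrize("query", QUERIES_OF_DEATH)
    def test_query_of_death_is_survivable(query):
        # The service must answer with some status rather than crash or restart.
        assert 100 <= handle_request(query) < 600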


I bet there is already a regression test for this one.


> ... this bug had not been detected in either of testing and initial rollout.

It makes me so happy[0] that the big smart corps have the same problems that us plebs have.

I tell people that testing can never precisely duplicate the production environment, but do they believe me?

Also, this is an argument for feature switches vs staging environments.

[0] But not in a schadenfreude sense.


Another possibility is that the change was not tested at all or very minimally tested -- and they would not want to admit to that publicly.


Amazing that they handled the downtime in such a short time span. Boosts my confidence to actually use GCP. Huge props!

Meanwhile... waiting for the Amazon Prime Day postmortem.


+1'ed you on the props to GCP, but the Amazon Prime Day issue was only on their retail site, right? Do vendors there have an SLA, or is there otherwise an obligation to publish a post-mortem on the incident? I think it's different when it's a platform/service provider, but the Prime Day outage only hurt Amazon. My recollection is that AWS service outages have had prompt and thorough published incident reports.


I have seen SaaS products provide incident reports to customers, but those were products with SLAs.

It depends on the Amazon retail site in this case. Netflix has published postmortems in the past, but then again, Netflix is very good at blogging about (and open-sourcing) their engineering efforts.

https://medium.com/netflix-techblog/a-closer-look-at-the-chr...


> I think it's different when it's a platform/service provider but the prime day outage only hurt Amazon.

Have you not used Amazon? It's where half the country does their shopping. Many businesses live and die on Amazon just as much as they do on AWS.


It may come as a surprise, but I have used Amazon. Both on a professional and personal basis! (I say this tongue in cheek).

I know that businesses which depend on Amazon.com as a primary sales channel were affected, but they don't pay Amazon to sell their product (maybe Pro merchants are the exception). I think they owe us Prime members an outage report even more on that basis.

In any case, I think it would be a good idea for them to write an incident report, but don't think it's comparable.

edit: I give. Like I said, I think it's different than AWS/GCP outages, but your point is a good one and I think it's a good idea for them to publish a report on it. I look forward to seeing it someday.


> but they don't pay Amazon to sell their product

They certainly do.

As with AWS, there are various services offered, each with their own pricing structure.

* Payment mechanism (Amazon Pay)

* E-commerce interface (Sell on Amazon)

* Services marketplace (Selling Services on Amazon)

* Advertising (Advertise on Amazon)

* Warehouse/logistics provider (Fulfillment by Amazon)

Amazon, eBay, or your own ecommerce website problems can cost exactly as much business as AWS, GCP, or your own data center problems.

https://services.amazon.com/


Do you mean that businesses "live and die" based on the ability to sell stuff on Amazon (obvious), or to buy stuff on Amazon (much less obvious, potentially interesting)?

If the latter, I wonder whether such customers would like to see Amazon introduce some sort of lower-level "Amazon purchasing API" that would continue to function even when the website doesn't, and which doesn't include any of the features that could topple the site (mostly, no paginated browse/search result API—you would have to already know the ID of the product you're buying.)


How many businesses died as a result of the prime day outage?


None. Sales were way above those of a typical day.

The outage was way over-hyped.


How many businesses died as part of the S3 outage of 2017?


I don't think there's an obligation for any company to publish a post mortem. But if a company does, it shows that they're accountable and that they care about their customers.


> Meanwhile.. waiting for the Amazon prime day post mortem.

Why are you comparing a cloud provider with a retail site?

There was an issue in AWS Frankfurt yesterday; I'm waiting for the post-mortem on that.


My guess is that Amazon's post mortem is basically "we misjudged how many people would be hitting our site". Not that interesting or noteworthy.


> Google engineers were alerted to the issue within 3 minutes and began immediately investigating

As users of their service, our engineers were notified less than 30 seconds after the issue started. Given that GCP had a large impacted population, how is it that it took them so much longer to acknowledge it?

> The GFE development team was in the process of adding features to GFE to improve security and performance. These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout.

Something going down after a deployment is the most common source of issues. Monitoring system KPIs for abnormalities after a rollout, and performing an almost instant automatic rollback when one appears, should be common practice. Also, doesn't GCP perform dark launches and partial launches? Launch to 1%, check KPIs, increase to 5%, and so on?
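
Something like the following sketch, where all names, thresholds, and callables are hypothetical stand-ins for whatever the real deployment system exposes:

    import time

    ROLLOUT_STAGES = (0.01, 0.05, 0.25, 1.0)   # hypothetical traffic fractions
    ERROR_RATE_THRESHOLD = 0.02                # hypothetical KPI threshold
    SOAK_SECONDS = 600                         # hypothetical soak time per stage

    def progressive_rollout(set_traffic_fraction, read_error_rate, rollback):
        """Ramp a release up stage by stage, rolling back on a KPI anomaly."""
        for fraction in ROLLOUT_STAGES:
            set_traffic_fraction(fraction)
            time.sleep(SOAK_SECONDS)            # let the KPIs settle at this stage
            if read_error_rate() > ERROR_RATE_THRESHOLD:
                rollback()                      # near-instant automatic rollback
                raise RuntimeError(f"KPI regression at {fraction:.0%}; rolled back")
        # Reaching this point means the release is at 100% with healthy KPIs.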


Maybe they were on reddit. Seriously though, 30 seconds vs 3 minutes is a long time in the reliability world but not atrocious.


It's the difference between being at home, sober, and with a laptop handy vs. being actively logged in and ready to go at all times.

In terms of quality of life for people on call, it's an entirely different world. And in a setup where your oncall engineers are extremely highly skilled and have all the choice in the world in terms of where to work, that little bit of respect of their time is a necessary investment.


Did they learn nothing from the Azure of yore? You don't roll out a change globally, ever.

While the postmortem is appreciated, I'd rather they just didn't roll out changes globally.


I'm very impressed that they can go from deploying a fix (12:44) to it being effective (12:49) in just 5 minutes.


Reading the post mortem, I'm guessing it's more like it took them 30 minutes to figure out what was happening and then they rolled back the deploy and it was fixed.

Edit: Reading further down, they actually admit that it was just a rollback:

> At 12:44 PDT, the team discovered the root cause, the configuration change was promptly reverted, and the affected GFEs ceased their restarts


I love GCP's postmortems: they're open, honest, and insightful, and I wish I could get my company to OK us releasing details like this when we have outages. It's part of the reason I personally like GCP more than AWS (and certainly Azure; those guys don't admit to shit regardless of how bad the outage is).

Edit: Wow, downvotes because I like transparency from my cloud hoster, super interesting...


You're getting downvotes because this was not a transparent and open report. It was vague and was more advertise-y than postmortem-y.

Not saying Amazon is perfect by any means either, but there's a lot of room for improvement. Good postmortems give everyone ideas on how to solidify their own processes and prevent other issues. This was just fluff.


A little offtopic but, why is it called a postmortem?

When I was first introduced to the concept of incident reports, it was under the name of postmortem; I worked for a mainly English-speaking company then and didn't think twice about it. But earlier this week I mentioned it to a colleague and he found it a rather macabre term for something like an incident report. When you think of it, nothing really died (maybe some engineers died a little inside because their design was not as 100% reliable as they thought). For the rest it was just a temporary state, nothing permanent like death. All other uses (e.g. medical) of this word seem to relate strictly to death.

Maybe it is because incident reports just sound too formal, or is there an etymology of this term in the IT world?


It's called a 'post mortem' in the medical world because the patient died. It involves a high level of inspection into what went wrong and what can be learned to prevent it. I assume the term was adopted from there into project management.

It might make a little more sense in the world of shipping software in retail boxes, where products/projects had a 'done' date. The project is dead; what contributed to its demise? Or you might generalize death into failure, and that's why we use the term instead of post-incident.


Not all post-mortems are for failures/dead products which might add even more confusion. For instance Gamasutra runs a game development post-mortem section where developers of popular (and unpopular) games can hop on and describe difficult situations they've encountered, how they did what they did, why they did that thing, etc.

It's really fun to read.

http://www.gamasutra.com/features/postmortem/

edit: Wow, there hasn't been one since 2014. I wonder why this died out. There's 10 pages of them since 2007..


Ironic. Apparently we need a postmortem postmortem


The earliest I know of: https://books.google.com/books?id=mskkmVkpIUcC&lpg=PP1&pg=PP...

I've seen the act of analyzing project failures by this name in software engineering/management ever since.

Maybe it was a bit more relevant in the days of large waterfall software project management, where failure often meant the end of a project with no product launched. Sometimes after a "death march": https://en.wikipedia.org/wiki/Death_march_(project_managemen...

But it does seem natural to me that it has been carried over to current days, and applied to analyzing failures in the context most relevant to modern software development.

I never stopped to think about the weirdness of the term's application. It feels so natural to me that I'd suggest we all start calling a fixed service "resurrected". As in: "Google Cloud Global Load Balancers were resurrected at 13:19." (-:


That does seem to make sense. The term carries the weight of a thorough investigation, so I get why it stuck as something more powerful than a mere incident report.


The term 'post-mortem' doesn't appear anywhere on that page. It was added by the OP in the title. In reality it should be called an incident report.

If we want to be technical, a post-mortem in the tech world is commonly used to outline failures that occurred during a normal event (e.g. a software release), not a random production issue.


Well the outage itself died. Hopefully. Otherwise it would just be an incident status update.


This was amusing for me, because I am JUST starting to test out Google's cloud offerings and got hit by this outage on basically day 1. Luckily, it wasn't too long before I figured out it was them and not me.


We've been on Appengine for years, and one of the un-intuitive benefits is that when something goes wrong, all you can do is sit back and tell your customers "a team of the best sysadmins in the world are working on it right now".

I've met many people who hate that abdication of responsibility, and would prefer to be heroically hacking solutions at 3am when their database replication fails. Maybe they feel they have to be punished for failing their customers.

My advice is keep testing GC, because in my experience it is very reliable. And once you realise that, the peace of mind is awesome.


Bobby Tables?


Recently ran into an issue with a provider's CSV export not quoting some of their text fields. It was a minor annoyance: the data in the text field was user input containing commas, so importing into Excel mangled the columns.

I then made the comment that someone should sign in using Bobby Drop Tables for a user name. The silence in the room quickly reminded me that I was not in the company of other developers. Such a waste.
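
For what it's worth, a minimal sketch of the fix on the exporter's side (hypothetical data; Python's csv module, like any real CSV writer, quotes fields that contain the delimiter):

    import csv
    import io

    rows = [
        ["id", "comment"],
        [1, "user input with, embedded, commas"],  # the kind of field that broke the import
    ]

    buf = io.StringIO()
    # QUOTE_MINIMAL (the default) quotes any field containing the delimiter,
    # quote character, or a newline, so embedded commas no longer shift columns.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    print(buf.getvalue())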


I believe the load balancer outage only affected global load balancers, not regional load balancers. Is that still accurate?


My understanding is yes; I only use regional, which encountered no outages. This hugely affected appengine users. I've not encountered a lot of GKE users that use Global LBs yet (I've run in GKE for roughly 4 years).


A good write up, thanks.

I remember a configuration change rolled out by an automated system causing a problem on GCP a few years ago; it's an interesting area that's probably quite hard to fix.


I'm an SRE for a large software company. We call it an "incident retrospective" for this reason.


It sounds like the impact wasn't limited to particular users or regions. Why would they deploy the configuration worldwide at the same time?


It's a global load balancer.


Bet Tesla are glad they made the switch to MapBox and Valhalla months ago.


Genuine question: what's Valhalla? Google returned something Norse.


Tack the "Tesla" keyword onto your search: "Tesla Valhalla".


That does it, but you can also find more detailed information on their GitHub page: https://github.com/teslamotors/valhalla


Basically, due to the lack of customer service, Google Cloud is not for serious business, just casual side projects. Reading through this, and the article about people getting their servers cut off, made my stomach lurch.


All major cloud platforms have occasional unplanned downtime. AWS had an outage last year that took many sites offline for hours. A single instance of Google having such unplanned downtime is meaningless without more datapoints.

As for whether it's useful for serious business. Well. Proof by example? This has sixteen pages of case studies: https://cloud.google.com/customers/. That is by no means all of GCP's serious customers.


That sort of proves my point :( Unless you are a big fish with marketing potential, you're nothing to Google.


We have been using GCP for 6 years (and are one of the companies in the linked case studies), and I must say that their customer service is good.

I think I've had over 30 interactions with their support and key account managers, regarding everything from billing to minor issues with services to just asking for advice. They have always delivered.

The expectation from us has been that we pay the $150/month support package fee.



