> One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout.
I wish they had elaborated on what type of bug it was that wasn't caught by testing or the initial rollout. Either the tests must be poorly written or the bug must be very subtle.
On an unrelated note: kudos to Google for publishing this postmortem; I hope it becomes an industry-wide practice. I also wish they would publish a (belated) one about Google+ and their throng of messaging apps over the years.
With all due respect, it's not a postmortem, it's an advert. It doesn't really say anything other than "We had a problem, we fixed it." There are virtually no technical details in there other than "something would restart spontaneously, which shifted the load somewhere else". Maybe I'm a bit jaded by Cloudflare and AWS write-ups, but this really isn't anything special or worthwhile reading.
I dunno, it looks comprehensive enough to me. What were you looking for, exactly? Source code? Packet traces? Pew-pew maps?
A useful PM includes a summary, an impact analysis, a root-cause analysis, and a comprehensive (and realistic) set of measures that will prevent recurrence.
After reading what they provided, I have a reasonable understanding of what went wrong (sufficient for me to plan my own safeguards if necessary), a useful measure of the team's response and remediation capabilities, and enough information to judge my comfort with their preventative measures.
Anything else is just cleverly-disguised marketing.
To your point, all postmortems are advertisements. This is just an unconvincing "me too" ad. Amazon's and CloudFlare's detailed, high-quality write-ups give technical people more confidence in supporting their products, as well as more interest in joining the company.
All postmortems are ads, but not all ads are effective.
Very little detail. I'm sure they're worried about disclosing something about their stack that would open them up to more vulnerabilities, which is understandable, but not explaining why they didn't catch it in their testing... that's something they could disclose without any risk.
It's easier to disclose details of a physical infrastructure outage or a bug in someone else's code/product than a bug in your own code.
In this case, they might be hard-pressed to go into details, since it was related to an unreleased feature:
« These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout. »
...but yeah, they definitely could have done better.
Another reason postmortems can be vague is that, for each detail you add, you might need to add even more background details. Then it becomes a recursive problem.
In this case, the "configuration change" could be a feature flag in the L2 GFEs, something in the L1 GFEs that changed L1->L2 requests in minor ways, or maybe something else entirely (since it's partially security-related: a dynamic LOAS handshake change to use different cyphers? So many possibilities). At the end of the day, though, it's still a specific permutation of all possible features and flags that hadn't been vetted before. Given how large Google's monorepo is, it's not impossible for one of your many dependencies, even indirect ones, which might be configured by another team entirely, to have subtle time bombs that only trigger well after you have built and deployed the code.
Having been on the other side, I know that, for every detail added, a bunch more questions come up.
GFE (Google Front End) is what you connect to when you access any Google service through your browser. Think nginx or even ELB. It's a load-balancing reverse proxy, as well as a WAF. It's mentioned here: https://cloud.google.com/security/infrastructure/design/ and probably in the SRE book. This report might be the first time Google has mentioned in public that there can be two levels of GFEs, but I remember at least one service using such a setup many, many years ago.
Totally agree with you on this one. Even if they keep the full technical details to themselves and off the page, they could at least give a link or something where interested people can read what went on.
Also, if I gave a similar root cause in my environment, I'd be laughed off. We absolutely need to provide what went wrong, what the immediate fix was, what the permanent fix is, and whether it's done (or whether it requires downtime / a restart).
Actually "We had a problem, we fixed it, we made sure it won't happen again."
That last bit would be more marketing except that it's true - I've used Google App Engine for years, and the rare outages are always unique issues.
I'd love to see comparative numbers, but my impression is that this focus on improvement has led to Google's uptime being a lot better than AWS's or Azure's.
I agree. The AWS S3 outage postmortem had a much more specific description of the technical problem (even though it amounted to bad parameters being passed to a script).
This is a super-high-level summary, not a postmortem. In fact, it doesn't say anything about the bug, why there was no test for it, what repair items have been done, etc.
Amazon's and Microsoft's post mortems are much more to the point; one can actually learn from them and avoid making the same or similar mistakes.
It reads to me like the bug required a particular feature to be enabled, and the feature wasn't enabled while the software was rolling out; then the feature was globally knife-switched to the "on" position without an orderly staged rollout.
I actually think it looks pretty bad that one of the action items in this report is to make a feature dashboard for GFEs. They've been saying they will do that for years, and the team that operates GFEs is considered the most elite of all SRE teams. Most famous outages of Google products have been caused by bogus configurations pushed to GFEs or network devices in front of them.
It's sort of like that extra disk in your RAID 1 setup at home.
Compared to a single SSD the performance improvement doesn't really show (for desktop loads)...
However when things come to a rare (but somewhat inevitable) and screeching halt and that mirror has one of the copies shatter beyond recognition... that's when the doubled price proves that it was worth it.
Such a dashboard would invariably also add load and complexity (both failure points) to the system, while outwardly most users would be unaware of its existence.
As it says in the note, the bug was in a feature that was latent but not yet being used. Then a configuration change started hitting the feature.
With highly redundant systems such as this, you generally need multiple layers of things going wrong all at once to notice an issue. This was the case here as well.
> Either tests must be poorly written or the bug must be very subtle.
The easiest way to make a web service work in testing but fail in production is a problem with the settings that are necessarily different between environments.
For example, you must have different database credentials between test and production, and you must limit who can read the production credentials. If the production credentials are malformed, a service that worked in test will fail in production.
And the same applies to your SSL certificates, your settings for enabled/disabled features, your flashy markers that stop people mistaking production for test....
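To make it concrete (a tiny hypothetical sketch; none of these names come from the report): a service that parses an environment-specific secret passes every test against the well-formed test value and still falls over on a malformed production one.

    import os

    # Hypothetical example: the DSN comes from an environment-specific secret,
    # so test and production exercise different values.
    def load_db_settings():
        dsn = os.environ["DATABASE_DSN"]   # e.g. "postgres://user:pass@host:5432/app"
        host_part = dsn.split("@")[1]      # raises IndexError if the "@" is missing
        return {"dsn": dsn, "host": host_part.split(":")[0]}

    # Tests run against a well-formed test DSN and pass; a typo in the
    # production secret only blows up after deployment.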
+1'ed you on the props to GCP, but the Amazon Prime Day outage was only on their retail site, right? Do vendors there have an SLA, or is there otherwise an obligation to publish a post-mortem on the incident? I think it's different when it's a platform/service provider, but the Prime Day outage only hurt Amazon. My recollection is that AWS service outages had prompt and thorough published incident reports.
I have seen SaaS products provide incident reports to customers, but those were products with SLAs.
It depends on the Amazon retail site in this case. Netflix has published postmortems in the past, but then again Netflix is very good at blogging about (and open-sourcing) their engineering efforts.
It may come as a surprise, but I have used Amazon. Both on a professional and personal basis! (I say this tongue in cheek).
I know businesses that depend on Amazon.com as a primary sales channel were affected, but they don't pay Amazon to sell their product (maybe Pro merchants are the exception). I think they owe us Prime members an outage report even more on that basis.
In any case, I think it would be a good idea for them to write an incident report, but I don't think it's comparable.
edit: I give. Like I said, I think it's different from AWS/GCP outages, but your point is a good one, and I think it's a good idea for them to publish a report on it. Looking forward to seeing it someday.
Do you mean that businesses "live and die" based on the ability to sell stuff on Amazon (obvious), or to buy stuff on Amazon (much less obvious, potentially interesting)?
If the latter, I wonder whether such customers would like to see Amazon introduce some sort of lower-level "Amazon purchasing API" that would continue to function even when the website doesn't, and which doesn't include any of the features that could topple the site (mostly, no paginated browse/search result API—you would have to already know the ID of the product you're buying.)
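Purely as a thought experiment (nothing below is a real Amazon API; every name is made up), such a stripped-down purchasing endpoint might look something like:

    from dataclasses import dataclass

    # Hypothetical, illustrative interface only -- not a real Amazon API.
    @dataclass
    class PurchaseRequest:
        product_id: str     # caller must already know the ID; no browse/search
        quantity: int
        payment_token: str  # pre-authorized payment method
        address_id: str     # pre-registered shipping address

    def place_order(req: PurchaseRequest) -> str:
        """Would return an order ID, touching only the ordering pipeline and
        none of the heavyweight browse/search/recommendation machinery."""
        raise NotImplementedError("illustrative sketch only")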
I don't think there's an obligation for any company to publish a post mortem. But if a company does, it shows that they're a company that's accountable and shows that they care about their customers.
> Google engineers were alerted to the issue within 3 minutes and began immediately investigating
As users of their service, our engineers were notified in under 30 seconds when the issue started. Given that GCP had a large population impacted, how is it that it took them so much longer to acknowledge it?
> The GFE development team was in the process of adding features to GFE to improve security and performance. These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout.
Something going down after a deployment is the most common source of issues. Monitoring system KPIs for abnormalities after a rollout, and performing a near-instant auto-rollback when one appears, should be common practice. Also, doesn't GCP perform dark launches and partial launches? Launch to 1%, watch the KPIs, increase to 5%, and so on?
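Roughly what I'd expect a guarded rollout to look like (a sketch with made-up stage sizes, KPI, and threshold, not anything GCP actually uses):

    # Illustrative sketch of a staged (canary) rollout with a KPI guard.
    STAGES = [1, 5, 25, 50, 100]     # percent of traffic on the new config
    ERROR_RATE_THRESHOLD = 0.01      # abort if the error rate exceeds 1%

    def staged_rollout(apply_to_percent, measure_error_rate, roll_back):
        for percent in STAGES:
            apply_to_percent(percent)
            if measure_error_rate() > ERROR_RATE_THRESHOLD:
                roll_back()          # near-instant auto-rollback on a KPI regression
                return False         # stop the rollout here
        return True                  # all stages passed the KPI check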
It's the difference between being at home, sober, and with a laptop handy vs. being actively logged in and ready to go at all times.
In terms of quality of life for people on call, it's an entirely different world. And in a setup where your on-call engineers are extremely highly skilled and have all the choice in the world in terms of where to work, that little bit of respect for their time is a necessary investment.
Reading the post mortem, I'm guessing it's more like it took them 30 minutes to figure out what was happening and then they rolled back the deploy and it was fixed.
Edit: Reading further down, they actually admit that it was just a rollback:
> At 12:44 PDT, the team discovered the root cause, the configuration change was promptly reverted, and the affected GFEs ceased their restarts
I love GCP’s postmortems - they’re open, honest, and insightful, and I wish I could get my company to OK releasing details like this when we have outages. It’s part of the reason I personally like GCP more than AWS (and certainly Azure; those guys don’t admit to shit regardless of how bad the outage is).
Edit: Wow, downvotes because I like transparency from my cloud hoster, super interesting...
You're getting downvotes because this was not a transparent and open report. It was vague and was more advertise-y than postmortem-y.
Not saying Amazon is perfect by any means either, but there's a lot of room for improvement. Good postmortems give everyone ideas on how to solidify their own processes and prevent other issues. This was just fluff.
A little off-topic, but why is it called a postmortem?
When I was first introduced to the concept of incident reports, it was under the name of postmortem; I worked for a mainly English-speaking company then and didn't think twice about it. But earlier this week I mentioned it to a colleague, and he found it a rather macabre term for something like an incident report. When you think of it, nothing really died (maybe some engineers died a little inside because their design was not as 100% reliable as they thought), but for the rest it was just a temporary state, nothing permanent like death. All other uses (e.g. medical) of this word seem to relate strictly to death.
Maybe it's because "incident report" just sounds too formal, or is there an etymology of this term in the IT world?
It's called a 'post mortem' in the medical world because the patient died. It involves a high level of inspection into what went wrong and what can be learned to prevent it. I assume the term was adopted from there into project management.
It might make a little more sense in the world of shipping software in retail boxes, where products/projects had a 'done' date. The project is dead; what contributed to its demise? Or you might generalize death into failure, and that's why we use the term instead of post-incident.
Not all post-mortems are for failures/dead products which might add even more confusion. For instance Gamasutra runs a game development post-mortem section where developers of popular (and unpopular) games can hop on and describe difficult situations they've encountered, how they did what they did, why they did that thing, etc.
I've seen the act of analyzing project failures by this name in software engineering/management ever since.
Maybe it was a bit more relevant in the days of large waterfall software project management, where failure often meant the end of a project with no product launched. Sometimes after a "death march": https://en.wikipedia.org/wiki/Death_march_(project_managemen...
But it does seem natural to me that it has been carried over to current days, and applied to analyzing failures in the context most relevant to modern software development.
Never stopped to think about the weirdness of the term's application. It feels so natural to me that I'd suggest we all start calling a fixed service "resurrected". As in: "Google Cloud Global Load Balancers were resurrected at 13:19." (-:
That seems to make sense indeed. It does carry the weight of a thorough investigation, so I get why it stuck as a more powerful term than a mere incident report.
The term 'post-mortem' doesn't appear anywhere on that page. It was added by the OP in the title. In reality it should be called an incident report.
If we want to be technical, a post-mortem in the tech world is commonly used to outline failures that occurred during a normal event (i.e. a software release) not a random production issue.
This was amusing for me, because I am JUST starting to test out Google's cloud offerings and got hit by this outage on basically day 1. Luckily, it wasn't too long before I figured out it was them and not me.
We've been on App Engine for years, and one of the unintuitive benefits is that when something goes wrong, all you can do is sit back and tell your customers "a team of the best sysadmins in the world is working on it right now".
I've met many people who hate that abdication of responsibility, and would prefer to be heroically hacking solutions at 3am when their database replication fails. Maybe they feel they have to be punished for failing their customers.
My advice is keep testing GC, because in my experience it is very reliable. And once you realise that, the peace of mind is awesome.
Recently I ran into an issue with a provider's CSV export not quoting some of its text fields. It was a minor annoyance in that the data in the text field was user input in which the user had used commas; importing into Excel then mangled the columns because of the extra commas.
I then joked that someone should sign in using Bobby Drop Tables as a user name. The silence in the room quickly reminded me I was not in the company of other developers. Such a waste.
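For what it's worth, the fix on the provider's side is just to quote fields; here's a minimal Python sketch with made-up data using the standard csv module:

    import csv, io

    # Made-up rows; the second field contains a comma typed by the user.
    rows = [["id", "comment"], ["1", "Doe, Jane ordered 3 items"]]

    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    print(buf.getvalue())
    # The field with the embedded comma comes out quoted
    # ("Doe, Jane ordered 3 items"), so Excel keeps it in one column.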
My understanding is yes; I only use regional LBs, which encountered no outages. This hugely affected App Engine users. I've not encountered a lot of GKE users that use Global LBs yet (I've run in GKE for roughly 4 years).
I remember a configuration change rolled out by an automated system causing a problem on GCP a few years ago; it's an interesting area that's probably quite hard to fix.
Basically, due to the lack of customer service, Google Cloud is not for serious business, just casual side projects. Reading through this and the article about getting their servers cut off made my stomach lurch.
All major cloud platforms have occasional unplanned downtime. AWS had an outage last year that took many sites offline for hours. A single instance of Google having such unplanned downtime is meaningless without more datapoints.
As for whether it's useful for serious business. Well. Proof by example? This has sixteen pages of case studies: https://cloud.google.com/customers/. That is by no means all of GCP's serious customers.
We have been using GCP for 6 years (and are one of the companies in the linked case studies), and I must say that their customer service is good.
I think I've had over 30 touches with their support and key account managers, on everything from billing to minor issues with services to just asking for advice. They have always delivered.
The expectation on us has been that we pay the $150/month support package fee.