With all due respect, it's not a postmortem, it's an advert. It doesn't really say anything other than "We had a problem, we fixed it." There are virtually no technical details in there other than "something would restart spontaneously, which shifted the load somewhere else". Maybe I'm a bit jaded by Cloudflare and AWS write-ups, but this really isn't anything special or worth reading.
I dunno, it looks comprehensive enough to me. What were you looking for, exactly? Source code? Packet traces? Pew-pew maps?
A useful PM includes a summary, impact analysis, root cause analysis, and a comprehensive (and realistic) set of measures that will prevent recurrence.
After reading what they provided, I have a reasonable understanding of what went wrong (sufficient for me to plan my own safeguards if necessary), a useful measure of the team's response and remediation capabilities, and enough information to judge my comfort with their preventative measures.
Anything else is just cleverly-disguised marketing.
To your point, all postmortems are advertisements. This is just an unconvincing "me too" ad. Amazon's and Cloudflare's high-quality, detailed write-ups give technical people more confidence in the products they support, as well as more interest in joining those companies.
All postmortems are ads, but not all ads are effective.
Very little detail. I'm sure they're worried about disclosing something about their stack that would open them up to more vulnerabilities, which is understandable, but not explaining why they didn't catch it in their testing... that's something they could disclose without any risk.
It's easier to disclose details of a physical infrastructure outage or a bug in someone else's code/product than a bug in your own code.
In this case, they might be hard-pressed to go into details, since it was related to an unreleased feature:
« These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout. »
...but yeah, they definitely could have done better.
Another reason postmortems can be vague is that, for each detail you add, you might need to add even more background details. Then it becomes a recursive problem.
In this case, the "configuration change" could be a feature flag in the L2 GFEs, something in the L1 GFEs that changed L1->L2 requests in minor ways, or maybe something else entirely (since it's partially security-related: a dynamic LOAS handshake change to use different ciphers? So many possibilities). At the end of the day, though, it's still a specific permutation of all possible features and flags that hadn't been vetted before. Given how large Google's monorepo is, it's entirely possible for one of your many dependencies, even an indirect one configured by another team entirely, to contain subtle time bombs that only trigger well after you've built and deployed the code.
Having been on the other side, I know that, for every detail added, a bunch more questions come up.
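To make that "unvetted permutation" point concrete, here's a deliberately simplified, hypothetical sketch in Go (all flag and function names are made up, and this has nothing to do with actual GFE code) of how a feature that's merged but not yet enabled can carry a bug that only fires when a later configuration change flips it on:

```go
// Hypothetical illustration of a "time bomb" behind a flag: the new code path
// is compiled in and deployed but never exercised until a later config change
// enables it, at which point a latent bug takes the whole process down.
package main

import (
	"flag"
	"fmt"
)

var useNewRewrite = flag.Bool("use_new_rewrite", false,
	"enable the (unreleased) request-rewrite feature")

// rewriteCache is declared but never initialized: harmless while the
// flag stays off, fatal (nil-map write panic) the moment it's turned on.
var rewriteCache map[string]string

func handleRequest(path string) string {
	if *useNewRewrite {
		rewriteCache[path] = "rewritten" // panics: assignment to entry in nil map
		return rewriteCache[path]
	}
	return path
}

func main() {
	flag.Parse() // here the "config change" is simply passing --use_new_rewrite=true
	for _, p := range []string{"/search", "/mail"} {
		fmt.Println(handleRequest(p))
	}
}
```

With the flag at its default, the new path is dead code and every test that doesn't set it passes; the first config push that enables it crashes the process, which is roughly the shape of failure the report describes.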
GFE (Google Front End) is what you connect to when you access any Google service through your browser. Think nginx or even ELB: it's a load-balancing reverse proxy as well as a WAF. It's mentioned here https://cloud.google.com/security/infrastructure/design/ and probably in the SRE book. This report might be the first time Google has publicly mentioned that there can be two layers of GFEs, but I remember at least one service using such a setup many, many years ago.
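For anyone unfamiliar with the pattern, here's a minimal sketch of the general "layered reverse proxy" idea in Go (ports and targets are made up; it says nothing about how GFE is actually implemented):

```go
// Minimal sketch of a two-layer reverse-proxy setup, loosely analogous to an
// L1 frontend forwarding to an L2 frontend, which forwards to a backend.
// Hostnames and ports here are invented for illustration.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newProxy returns a reverse proxy that forwards all requests to target.
func newProxy(target string) *httputil.ReverseProxy {
	u, err := url.Parse(target)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	// Layer 2: receives traffic from layer 1 and forwards it to a backend.
	go func() {
		log.Fatal(http.ListenAndServe(":8082", newProxy("http://127.0.0.1:9000")))
	}()

	// Layer 1: edge-facing, forwards to layer 2. A real frontend would also
	// terminate TLS, balance across many layer-2 instances, and apply WAF rules.
	log.Fatal(http.ListenAndServe(":8081", newProxy("http://127.0.0.1:8082")))
}
```

In a real deployment each layer balances across many instances of the next, which is why a crash loop in one layer shifts its share of the load onto its peers, the behavior the incident report alludes to.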
Totally agree with you on this one. Even if they keep the full technical details to themselves and off the page, they could at least give a link or something where interested people can read what actually went on.
Also, if I gave a similarly vague root cause in my environment, I'd be laughed out of the room. We absolutely have to state what went wrong, what the immediate fix was, what the permanent fix is, and whether it's done (or whether it requires downtime / a restart).
Actually "We had a problem, we fixed it, we made sure it won't happen again."
That last bit would just be more marketing, except that it's true: I've used Google App Engine for years, and the rare outages are always unique issues.
I'd love to see comparative numbers, but my impression is that this focus on improvement has led to Google's uptime being a lot better than AWS's or Azure's.
I agree. The AWS S3 outage postmortem had a much more specific description of the technical problem (even though it amounted to bad parameters being passed to a script).