With all due respect, it's not a postmortem, it's an advert. It doesn't really say anything other than "We had a problem, we fixed it." There are virtually no technical details in there other than "something would restart spontaneously, which shifted the load somewhere else". Maybe I'm a bit jaded by Cloudflare and AWS write-ups, but this really isn't anything special or worth reading.
I dunno, it looks comprehensive enough to me. What were you looking for, exactly? Source code? Packet traces? Pew-pew maps?
A useful PM includes a summary, impact analysis, root cause analysis, and a comprehensive (and realistic) set of measures that will prevent recurrence.
After reading what they provided, I have a reasonable understanding of what went wrong (sufficient for me to plan my own safeguards if necessary), a useful measure of the team's response and remediation capabilities, and enough information to judge my comfort with their preventative measures.
Anything else is just cleverly-disguised marketing.
To your point, all postmortems are advertisements. This is just an unconvincing "me too" ad. Amazon's and Cloudflare's high-quality, detailed write-ups give technical people more confidence in the products they support, as well as more interest in joining those companies.
All postmortems are ads, but not all ads are effective.
Very little detail. I'm sure they're worried about disclosing something about their stack that would open them up to more vulnerabilities, which is understandable, but not explaining why they didn't catch it in their testing... that's something they could disclose without any risk.
It's easier to disclose details of a physical infrastructure outage or a bug in someone else's code/product than a bug in your own code.
In this case, they might be hard-pressed to go into details, since it was related to an unreleased feature:
« These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout. »
...but yeah, they definitely could have done better.
Another reason postmortems can be vague is that, for each detail you add, you might need to add even more background details. Then it becomes a recursive problem.
In this case, the "configuration change" could be a feature flag in the L2 GFEs, something in the L1 GFEs that changed L1->L2 requests in minor ways, or maybe something else entirely (since it's partially security-related: a dynamic LOAS handshake change to use different ciphers? So many possibilities). At the end of the day, though, it's still a specific permutation of all possible features and flags that hadn't been vetted before. Given how large Google's monorepo is, it's entirely possible for one of your many dependencies, even an indirect one configured by another team entirely, to contain subtle time bombs that only trigger well after you've built and deployed the code.
Having been on the other side, I know that, for every detail added, a bunch more questions come up.
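To make that "unvetted permutation" point concrete, here's a deliberately simplified, hypothetical sketch in Go (all flag and function names are made up, and this has nothing to do with actual GFE code) of how a feature that's merged but not yet enabled can carry a bug that only fires when a later configuration change flips it on:

```go
// Hypothetical illustration of a "time bomb" behind a flag: the new code path
// is compiled in and deployed but never exercised until a later config change
// enables it, at which point a latent bug takes the whole process down.
package main

import (
	"flag"
	"fmt"
)

var useNewRewrite = flag.Bool("use_new_rewrite", false,
	"enable the (unreleased) request-rewrite feature")

// rewriteCache is declared but never initialized: harmless while the
// flag stays off, fatal (nil-map write panic) the moment it's turned on.
var rewriteCache map[string]string

func handleRequest(path string) string {
	if *useNewRewrite {
		rewriteCache[path] = "rewritten" // panics: assignment to entry in nil map
		return rewriteCache[path]
	}
	return path
}

func main() {
	flag.Parse() // here the "config change" is simply passing --use_new_rewrite=true
	for _, p := range []string{"/search", "/mail"} {
		fmt.Println(handleRequest(p))
	}
}
```

With the flag at its default, the new path is dead code and every test that doesn't set it passes; the first config push that enables it crashes the process, which is roughly the shape of failure the report describes.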
GFE (Google Front End) is what you connect to when you access any Google service through your browser. Think nginx or even ELB: it's a load-balancing reverse proxy as well as a WAF. It's mentioned here https://cloud.google.com/security/infrastructure/design/ and probably in the SRE book. This report might be the first time Google has publicly mentioned that there can be two layers of GFEs, but I remember at least one service using such a setup many, many years ago.
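For anyone unfamiliar with the pattern, here's a minimal sketch of the general "layered reverse proxy" idea in Go (ports and targets are made up; it says nothing about how GFE is actually implemented):

```go
// Minimal sketch of a two-layer reverse-proxy setup, loosely analogous to an
// L1 frontend forwarding to an L2 frontend, which forwards to a backend.
// Hostnames and ports here are invented for illustration.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newProxy returns a reverse proxy that forwards all requests to target.
func newProxy(target string) *httputil.ReverseProxy {
	u, err := url.Parse(target)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	// Layer 2: receives traffic from layer 1 and forwards it to a backend.
	go func() {
		log.Fatal(http.ListenAndServe(":8082", newProxy("http://127.0.0.1:9000")))
	}()

	// Layer 1: edge-facing, forwards to layer 2. A real frontend would also
	// terminate TLS, balance across many layer-2 instances, and apply WAF rules.
	log.Fatal(http.ListenAndServe(":8081", newProxy("http://127.0.0.1:8082")))
}
```

In a real deployment each layer balances across many instances of the next, which is why a crash loop in one layer shifts its share of the load onto its peers, the behavior the incident report alludes to.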
Totally agree with you on this one. Even if they keep the full technical details to themselves and off the page, they could at least give a link or something where interested people can read what actually went on.
Also, if I gave a similarly vague root cause in my environment, I'd be laughed out of the room. We absolutely have to state what went wrong, what the immediate fix was, what the permanent fix is, and whether it's done (or whether it requires downtime / a restart).
Actually "We had a problem, we fixed it, we made sure it won't happen again."
That last bit would just be more marketing, except that it's true: I've used Google App Engine for years, and the rare outages are always unique issues.
I'd love to see comparative numbers, but my impression is that this focus on improvement has led to Google's uptime being a lot better than AWS's or Azure's.
I agree. The AWS S3 outage postmortem had a much more specific description of the technical problem (even though it amounted to bad parameters being passed to a script).