
With all due respect, it's not a postmortem, it's an advert. It doesn't really say anything other than "We had a problem, we fixed it." There are virtually no technical details in there other than "something would restart spontaneously, which shifted the load somewhere else". Maybe I'm a bit jaded by Cloudflare and AWS writeups, but this really isn't anything special or worthwhile reading.



I dunno, looks comprehensive enough to me. What were you looking for, exactly? Source code? Packet traces? Pew-pew maps?

A useful PM includes a summary, impact analysis, root cause analysis, and a comprehensive (and realistic) set of measures that will prevent recurrence.

After reading what they provided, I have a reasonable understanding of what went wrong (sufficient for me to plan my own safeguards if necessary), a useful measure of the team's response and remediation capabilities, and enough information to judge my comfort with their preventative measures.

Anything else is just cleverly-disguised marketing.


It's pretty bare-bones next to a postmortem like https://www.epicgames.com/fortnite/en-US/news/postmortem-of-...


I feel that's way too much info.


To your point, all postmortems are advertisements. This is just an unconvincing "me too" ad. Amazon's and Cloudflare's detailed, high-quality write-ups give technical people more confidence in the products they support, as well as more interest in joining those companies.

All postmortems are ads, but not all ads are effective.


This is a self-congratulatory, lazy... ad.


Given my recent experiences with Google, I have to say that I'm impressed that they supplied even that much detail.


That’s an interesting long-term strategy: keep the bar so low that the slightest step clears it.


You've just described most people's career plans.


Very little detail. I'm sure they're worried that disclosing something about their stack would open them up to more vulnerabilities, which is understandable, but not explaining why they didn't catch it in testing... That's something they could disclose without any risk.

It's easier to disclose details of a physical infrastructure outage or a bug in someone else's code/product than a bug in your own code.


In this case, they might be hard-pressed to go into details, since it was related to an unreleased feature:

« These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout. »

...but yeah, they definitely could have done better.


Another reason postmortems can be vague is that, for each detail you add, you might need to add even more background details. Then it becomes a recursive problem.

In this case, the "configuration change" could be a feature flag in the L2 GFEs, something in the L1 GFEs that changed L1->L2 requests in minor ways, or maybe something else entirely (since it's partially security-related: a dynamic LOAS handshake change to use different ciphers? So many possibilities). At the end of the day, though, it's still a specific permutation of all possible features and flags that hadn't been vetted before. Given how large Google's monorepo is, it's not impossible for one of your many dependencies, even indirect ones, possibly configured by another team entirely, to contain subtle time bombs that only trigger well after you have built and deployed the code.
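
To make the "time bomb" idea concrete, here's a minimal sketch (in Go; the flag name and crash are invented for illustration, since the real GFE configuration isn't public) of how code that's compiled in but disabled can sail through testing and initial rollout, then take the process down once a configuration change flips the flag:

    package main

    import "fmt"

    // Hypothetical flag; name and layout are invented for illustration.
    var flags = map[string]bool{
        "experimental_routing": false, // compiled into the binary, not yet enabled
    }

    // serve only hits the dormant bug when the untested flag combination is
    // active, so it survives testing and initial rollout untouched.
    func serve() {
        defer func() {
            if r := recover(); r != nil {
                fmt.Println("process would restart here:", r)
            }
        }()
        if flags["experimental_routing"] {
            var cfg *struct{ backend string } // never initialized on this path
            fmt.Println(cfg.backend)          // nil dereference -> crash/restart
        }
        fmt.Println("request served")
    }

    func main() {
        serve()                              // flag off: everything looks healthy
        flags["experimental_routing"] = true // the later "configuration change"
        serve()                              // flag on: the latent bug finally fires
    }

The point isn't the specific bug; it's that the dangerous path is never exercised until the flag flips in production.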

Having been on the other side, I know that, for every detail added, a bunch more questions come up.


Is there a write-up that explains the technologies you mentioned in your post? LOAS, L1 GFE, L2 GFE, ... never heard of those.


LOAS is how Stubby (gRPC) connections on the internal network are secured; it looks a bit like mTLS and is partially being open-sourced as ALTS: https://cloud.google.com/security/encryption-in-transit/appl...

GFE (Google Front End) is what you connect to when you access any Google service through your browser. Think nginx or even ELB. It's a load-balancing reverse proxy, as well as a WAF. It's mentioned here https://cloud.google.com/security/infrastructure/design/ and probably in the SRE book. This report might be the first time Google has mentioned in public that there can be two levels of GFEs, but I remember at least one service using such a setup many, many years ago.
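
For a rough mental model of the two-layer shape, here's a toy sketch in Go (addresses and layering are made up and say nothing about how GFEs are actually implemented):

    package main

    import (
        "fmt"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        // Application backend.
        go http.ListenAndServe("127.0.0.1:9000", http.HandlerFunc(
            func(w http.ResponseWriter, r *http.Request) {
                fmt.Fprintln(w, "hello from the backend")
            }))

        // "Second layer" proxy: routes requests to the backend.
        backendURL, _ := url.Parse("http://127.0.0.1:9000")
        go http.ListenAndServe("127.0.0.1:8081", httputil.NewSingleHostReverseProxy(backendURL))

        // "First layer" proxy: terminates client connections, forwards to the second layer.
        l2URL, _ := url.Parse("http://127.0.0.1:8081")
        http.ListenAndServe("127.0.0.1:8080", httputil.NewSingleHostReverseProxy(l2URL))
    }

Hitting http://127.0.0.1:8080/ traverses both proxies before reaching the backend; the real GFEs obviously do far more (TLS termination, load balancing, WAF rules), but the chaining is the same idea.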


L1 and L2 GFE are the Level 1 and Level 2 Google Front Ends that the postmortem talks about. I'm not familiar with LOAS.


Totally agree with you on this one. Even if they keep the full technical details to themselves and off the page, they could at least give a link or something where interested people can read what went on.

Also, if I gave a root cause like this in my environment, I'd be laughed off. We absolutely need to state what went wrong, what the immediate fix was, what the permanent fix is, and whether it's done (or whether it requires downtime / a restart).


Google internally will totally have a document with stack traces and code snippets.

They have just chosen to only post a brief summary.


Actually "We had a problem, we fixed it, we made sure it won't happen again."

That last bit would be more marketing, except that it's true: I've used Google App Engine for years, and the rare outages are always unique issues.

I'd love to see comparative numbers, but my impression is that this focus on improvement has led to Google's uptime being a lot better than AWS's or Azure's.


I agree. The AWS S3 outage postmortem had a much more specific description of the technical problem (even though it amounted to bad parameters being passed to a script).


It's the perfect doc to show my leadership team, who want to know what happened, but aren't very technical.


To be fair, nowhere on the linked page does it say that this is a postmortem. The submitter called it that.



