Why is everyone banging on about this? It's a blog post from the same day, a dec...

tyingq · on June 9, 2021

I wasn't "banging on", I was answering why the article didn't mention the cause...because the source didn't either.

jffry · on June 9, 2021

Plus it would be weird to present just that specific information, outside of the context of a post mortem / failure chain analysis type discussion.

tyingq · on June 9, 2021

That's true, though they are also saying things like "We created a permanent fix for the bug and began deploying it at 17:25." "Permanent fix" sort of implies they understood the issue really well.

jffry · on June 9, 2021

That's my point though. Even though they may understand the immediate flaw in their code that caused the issue, there's not much use (for them or their customers) in just talking in detail about that specific flaw.

I'd go so far as to argue that the specifics of the flaw are immaterial right now. At this stage, the important thing is that they have identified a specific code change that was the proximate cause of the issue, and have a mitigation in place. Contrast this with more mysterious and hard-to-track-down failures. ("We are working to understand why our systems are down and will post another update in 30 minutes.")

What will take time, and the thing which will be interesting, is failure tree analysis. (You might hear the phrases "failure chain" or "root cause", but IMO it's quite rare for things to be so linear.) That can help identify opportunities to improve processes at many different levels of the product lifecycle.

Humans are fallible, and there's no way we can write bug-free software, so the solution has to be more robust than "hope that every member of our organization never makes a mistake again."

tyingq · on June 9, 2021

Yes, I was saying I would have avoided words like "permanent fix", because they set unrealistic expectations.

notacoward · on June 9, 2021

Everyone is "banging on" because there are important lessons to be learned from such incidents, and people want to learn. They hunger for more details about the generalizable aspects of the bug, even if a full post mortem that also covers internal processes etc. might take longer to do. Having participated in many post mortems, in many roles, for systems just as complex, I believe it's entirely possible to provide that information the next (not same) day. Is that still setting the bar too high? Perhaps. Fastly deserves kudos for providing even the level of information that they have, since that's already above the pathetic industry standard, but I don't think there's anything bad about wanting more. Defensiveness is the enemy of effective post mortems.