
Why is everyone banging on about this? It's a blog post from the same day. A decent post mortem takes a while to put together, and assuming the bug isn't fully patched across their entire CDN, why would they post the information?



I wasn't "banging on", I was answering why the article didn't mention the cause...because the source didn't either.


Plus it would be weird to present just that specific information, outside of the context of a post mortem / failure chain analysis type discussion.


That's true, though they are also saying things like "We created a permanent fix for the bug and began deploying it at 17:25." "Permanent fix" sort of implies they understood the issue really well.


That's my point though. Even though they may understand the immediate flaw in their code that caused the issue, there's not much use (for them or their customers) in just talking in detail about that specific flaw.

I'd go so far as to argue that the specifics of the flaw are immaterial right now. At this stage, the important thing is that they have identified a specific code change that was the proximate cause of the issue, and have a mitigation in place. Contrast this with more mysterious and hard-to-track-down failures ("We are working to understand why our systems are down and will post another update in 30 minutes").

What will take time, and what will be interesting, is failure tree analysis. (You might hear the phrase "failure chain" or "root cause", but IMO it's quite rare for things to be so linear.) That can help identify opportunities to improve processes at many different levels of the product lifecycle.

Humans are fallible, and there's no way we can write bug-free software, so the solution has to be more robust than "hope that every member of our organization never makes a mistake again".


Yes, I was saying I would have avoided words like "permanent fix", because it sets unrealistic expectations.


Everyone is "banging on" because there are important lessons to be learned from such incidents, and people want to learn. They hunger for more details about the generalizable aspects of the bug, even if a full post mortem that also covers internal processes etc. might take longer to do. Having participated in many post mortems, in many roles, for systems just as complex, I believe it's entirely possible to provide that information the next (not same) day. Is that still setting the bar too high? Perhaps. Fastly deserves kudos for providing even the level of information that they have, since that's already above the pathetic industry standard, but I don't think there's anything bad about wanting more. Defensiveness is the enemy of effective post mortems.



