
No, there's a lot more that goes into handling incidents that affect large production systems: getting things back up as fast as possible, coordination, communication, and getting the right action items out of it. There are tradeoff decisions that need to be made while executives and big customers are picking up the phone.

This kind of tooling is what arises in an effort to automate and streamline incident response. When you're operating at Netflix's scale, each minute is precious, and if a tool manages to save 45 seconds on each incident, it can be quite valuable.




The incident workflow about halfway down reads to me like a lingoed-up version of "create a bug, escalate, put a Slack (etc.) URL in the bug, send the bug to blamees/on-duty people, message boss(es), finish the fix and push, schedule a meeting for the next day." Having read the rest of the article, it turns out I guessed reasonably well. I mean, there are decades of tooling behind this very use case, and at the end of the day it's possible to hook out to Slack from RT, too. But they're not using RT, true.

https://rt-wiki.bestpractical.com/wiki/WorkFlow#Modeling_Wor...

I don't have a problem with the work -- like I said, it's a persistent use case -- just with the way it's described here: with puffery, and as if it weren't one. And the thinness of my skin about this is not the issue!


I'm on a small team, and we had to get a custom CIC for all these reasons, just at a much smaller scale.



