Dispatch – Open-source release of Netflix's crisis management framework (medium.com/netflixtechblog)
128 points by m0hit on Feb 24, 2020 | 18 comments



You can also read it here: https://outline.com/p3PBUY

Medium really annoys me, because I can't even scroll without disabling my adblocker.


I have Safari configured to open all medium.com pages directly in Reader Mode. Takes care of all the noise/interruptions they throw up on the screen.


How do you do that?


- Open a medium URL.

- In your toolbar, click the site settings icon (might need to edit your toolbar if it's not there).

- Check: "When visiting this website [X] Use Reader when available"


This is brilliant! Thank you.


Link doesn't work for me.


Weird, it failed for me too. I refreshed and then it worked. I wonder why.


I use Outline about once a day on average, and for the past few months it has needed a refresh somewhere between 10% and 20% of the time. I assumed that had to do with them fetching the page, so I'm surprised a cached GET failed.


Took two tries for me, too.


do not trust a link from some stranger on "hacker"news!


Incidents basically represent engineering culture in extremis. Seeing how large orgs manage incidents really says a lot about culture. It's interesting to see Netflix go so far to automate what amounts to trivial amounts of manual labor in (hopefully) rare instances. It says a lot about how they think about making mistakes and about the developer experience of working through a crisis.


So much of this is generalisable to just running a project or a company (comms, collecting metadata, making smart automation decisions to save time and avoid duplicated effort).

There is a deep business transformation lurking here. As a post here says, Netflix clearly has "just automate it all" at its heart.


Python, VueJS and Postgres.

That right there is my favorite stack for prototyping. Though, admittedly, I only say that because none of my prototypes have taken off (yet).


Fun to see `sentry.io` as one of the dependencies, kind of an interesting level of recursion on an incident mgmt app


So in other words another over-complicated ticket system


No, there's a lot more that goes into handling incidents that affect large production systems: getting things back up as fast as possible, coordination, communication, getting the right action items out of it. There are tradeoff decisions that need to be made, and executives and big customers are picking up the phone.

This kind of tooling is what arises in an effort to automate and streamline incident response. When you're operating at Netflix's scale, each minute is precious, and if a tool manages to save 45 seconds on each incident, it can be quite valuable.


The incident workflow about halfway down reads to me like a lingoed-up version of "create a bug, escalate, put a Slack (etc.) URL in the bug, send the bug to blamees/ondutys, message boss(es), finish fix and push, schedule a meeting for the next day." Having read the rest of the article, I find that I guessed reasonably well. I mean, there are decades behind this very use-case, and at the end of the day it's possible to hook out to Slack from RT, too. But they're not using RT, true.

https://rt-wiki.bestpractical.com/wiki/WorkFlow#Modeling_Wor...

I don't have a problem with the work (like I said, it's a persistent use-case). It's just the way it's described here, as if it weren't, and with puffery. And the thin-ness of my skin with this is not the issue!
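
A minimal sketch of the workflow paraphrased above, written as an ordered checklist in Python. Everything in it is an assumption made for illustration: `Incident`, `run_workflow`, and the step labels are hypothetical names, and nothing here is taken from Dispatch's or RT's actual code or APIs.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical step labels, paraphrasing the comment above.
# They are not Dispatch's or RT's real workflow states.
WORKFLOW_STEPS = [
    "create a bug",
    "escalate",
    "put a chat URL in the bug",
    "send the bug to the on-duty people",
    "message boss(es)",
    "finish fix and push",
    "schedule a meeting for the next day",
]


@dataclass
class Incident:
    """Hypothetical record that only keeps a log of completed steps."""
    title: str
    log: List[str] = field(default_factory=list)

    def step(self, action: str) -> None:
        # A real tool would call out to a tracker or chat system here;
        # this sketch only records that the step happened.
        self.log.append(action)


def run_workflow(title: str) -> Incident:
    """Walk one incident through every step, in order."""
    incident = Incident(title)
    for action in WORKFLOW_STEPS:
        incident.step(action)
    return incident


if __name__ == "__main__":
    incident = run_workflow("example outage")
    for number, action in enumerate(incident.log, start=1):
        print(f"{number}. {action}")
```

The point of the sketch is only that the sequence is linear and fixed, which is why it maps onto an existing ticket system's workflow hooks as easily as onto a dedicated tool.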


We're a small team and we had to get a custom CIC for all these reasons, just at a much smaller scale.



