Open-Sourcing Our Incident Response Documentation (pagerduty.com)
182 points by kungfudoi on Jan 3, 2017 | 8 comments



Interesting. I ran the support function for some of the world's busiest gambling sites' backends. A lot of this looks very familiar!

Probably the most significant thing I learned was the power of automation through Incident Models. I spent 7 months of my own time, full time, writing them for the previous two years' major incidents. This changed my life, as I simply stopped getting called, and juniors only escalated to me when the docs were faulty.


This is the most powerful thing you can do in a business or for support; you have a set of scripts for the most common cases and actions and you're basically just ticking boxes to get through the process. Then you can use your brainpower for the really hard shit.

It's definitely a game-changer, and it's sad to see that most orgs don't do a post-mortem or incident model report at all.


Any details you can share on the kinds of things you automated and the tools you used?


Oh, the automation I'm speaking of there was the automation of human behaviour through following a script/checklist. It helped everyone - the juniors felt more confident about escalating, and they learned from it too.

(I'd read The Checklist Manifesto, which I couldn't recommend enough, BTW).

In more trad automation, I got a little obsessed by automating the construction of environments, but in a human-readable way:

http://ianmiell.github.io/shutit/

and Docker (wrote this book):

https://www.amazon.com/Docker-Practice-Ian-Miell/dp/16172927...

See also:

https://medium.com/@zwischenzugs

I should really write up those experiences; the passage of time has given me more perspective on them. I don't work in ops anymore :)


That's a really well written, extensive guide. Thanks to the pagerduty folks for sharing this.


That is an incredibly well documented guide. I develop monitoring software and am part of an active on call roster so I've seen both sides. I'm surprised how much information overlaps.


Thanks for the documentation. Is it available on GitHub instead of a zip file?


Never mind. After some digging, I found them here - https://github.com/PagerDuty/incident-response-docs



