Open-Sourcing Our Incident Response Documentation

zwischenzug · on Jan 3, 2017

Interesting. I ran the support function for some of the world's busiest gambling sites' backends. A lot of this looks very familiar!

Probably the most significant thing I learned was the power of automation through Incident Models. I spent 7 months of my own time, full time, writing them for the previous two years' major incidents. This changed my life, as I simply stopped getting called, and juniors only escalated to me when the docs were faulty.

_csoo · on Jan 3, 2017

This is the most powerful thing you can do in a business or for support; you have a set of scripts for the most common cases and actions and you're basically just ticking boxes to get through the process. Then you can use your brainpower for the really hard shit.

It's definitely a game-changer, and it's sad to see that most orgs don't do a post-mortem or incident model report at all.

brsanthu · on Jan 3, 2017

Any details you can share about the kinds of things you automated and the tools you used?

zwischenzug · on Jan 3, 2017

Oh, the automation I'm speaking of there was the automation of human behaviour through following a script/checklist. It helped everyone - the juniors felt more confident about escalating, and they learned from it too.

(I'd read The Checklist Manifesto, which I couldn't recommend enough, BTW).

In more trad automation, I got a little obsessed by automating the construction of environments, but in a human-readable way:

http://ianmiell.github.io/shutit/

and Docker (wrote this book):

https://www.amazon.com/Docker-Practice-Ian-Miell/dp/16172927...

See also:

https://medium.com/@zwischenzugs

I should really write up those experiences; the passage of time has given me more perspective on them. I don't work in ops anymore :)

remh · on Jan 3, 2017

That's a really well-written, extensive guide. Thanks to the PagerDuty folks for sharing this.

UseStrict · on Jan 4, 2017

That is an incredibly well-documented guide. I develop monitoring software and am part of an active on-call roster, so I've seen both sides. I'm surprised how much of the information overlaps.

the_arun · on Jan 3, 2017

Thanks for the documentation. Is it available on GitHub instead of as a zip file?

the_arun · on Jan 3, 2017

Never mind. After some digging, I found them here: https://github.com/PagerDuty/incident-response-docs