Interesting. I ran the support function for some of the world's busiest gambling sites' backends. A lot of this looks very familiar!
Probably the most significant thing I learned was the power of automation through Incident Models. I spent 7 months of my own time, full time, writing them for the previous two years' major incidents. This changed my life, as I simply stopped getting called, and juniors only escalated to me when the docs were faulty.
This is the most powerful thing you can do in a business or for support: you have a set of scripts for the most common cases and actions, and you're basically just ticking boxes to get through the process. Then you can use your brainpower for the really hard shit.
It's definitely a game-changer, and it's sad to see that most orgs don't do a post-mortem or incident model report at all.
Oh, the automation I'm speaking of there was the automation of human behaviour through following a script/checklist. It helped everyone - the juniors felt more confident about escalating, and they learned from it too.
(I'd read The Checklist Manifesto, which I can't recommend enough, BTW).
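For anyone wondering what an incident model looks like as a checklist, here's a rough sketch in Python. It is not the tooling I actually used. The `Step`, `IncidentModel` and `run_checklist` names and the example steps are all made up for illustration. The idea is only that each step pairs an instruction with an expected outcome, and the first step that doesn't match is the point to escalate.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    instruction: str  # what the on-call person should do
    expected: str     # what they should see if the step worked


@dataclass
class IncidentModel:
    name: str
    steps: List[Step] = field(default_factory=list)


def run_checklist(
    model: IncidentModel, observe: Callable[[Step], bool]
) -> Tuple[bool, List[Tuple[int, str, bool]]]:
    """Walk the steps in order.

    observe(step) returns True when the expected outcome was seen.
    Returns (resolved, log); resolved is False at the first step that
    did not match, which is the cue to escalate.
    """
    log = []
    for number, step in enumerate(model.steps, start=1):
        ok = observe(step)
        log.append((number, step.instruction, ok))
        if not ok:
            return False, log  # stop here and escalate with the log
    return True, log


# Hypothetical model; the steps are invented for illustration.
model = IncidentModel(
    name="payment queue backlog",
    steps=[
        Step("Check queue depth on the dashboard", "depth is above threshold"),
        Step("Restart the consumer service", "service reports healthy"),
        Step("Confirm queue depth is falling", "depth drops within 5 minutes"),
    ],
)

# Simulate a run where the restart does not bring the service back.
resolved, log = run_checklist(model, lambda s: not s.instruction.startswith("Restart"))
print("resolved" if resolved else "escalate", log)
```

The log matters as much as the result: when a junior escalates, whoever picks it up can see exactly which step stopped matching the docs.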
In more trad automation, I got a little obsessed by automating the construction of environments, but in a human-readable way:
That is an incredibly well documented guide. I develop monitoring software and am part of an active on call roster so I've seen both sides. I'm surprised how much information overlaps.