Hacker News new | past | comments | ask | show | jobs | submit login

(I'm an SRE at Google. My opinions are my own.)

WoMs are a training exercise, intended to build familiarity with systems and how to respond when oncall. A typical WoM format is a few SREs sat in a room, with a designated victim who is pretending to be oncall. The person running the WoM will open with an exchange a bit like this (massively simpified):

"You receive a page with this alert in it showing suddenly elevated rpc errors (link/paste)" "I'm going to look at this console to see if there was just a rollout" "Okay, you see a rollout happened about two minutes before the spike in rpc errors" "I'll roll that back in one location" "rpc errors go back to normal in that location" ...etc

(Depending on the team and quality of simulation available, some of this may be replaced with actual historical monitoring data or simulated broken systems)

The "chaos monkey" tool, as I understand it, is intended to maintain a minimum level of failure in order to make sure that failure cases are exercised. I've never been on a team which needed one of those: at sufficient scale and development velocity, the baseline rate of naturally occurring failures is already high enough. We do have some tools like that, but they're more commonly used by the dev teams during testing (where the test environment won't be big enough to naturally experience all the failures that happen in production).




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: