
Is anyone able to compare and contrast Google's "Wheel of Misfortune" with Netflix's "Chaos Monkey", both in terms of the systems that enable them and the operational practices around them?



They're unrelated. Wheel of Misfortune is just a role-playing replay of a previous incident as a training exercise. Someone will grab (or simulate) logs and dashboards from the incident and then play GM for the Wheel of Misfortune at a future team meeting. Someone who isn't familiar with the incident will be designated "on-call". They'll state what they want to do, and the GM will tell or show them what they would see when they do those things.

Chaos Monkey is actually taking down production systems to make sure the system as a whole stays up when those individual pieces fail. Google does have (manual, not automatic) exercises doing similar things, called DiRT (Disaster Recovery Testing), but it's not related to the SRE training exercise.
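
To make the distinction concrete, the core of a Chaos-Monkey-style tool is roughly this. It's a minimal sketch only: the cloud client and its methods are hypothetical, and the real tool adds scheduling, opt-in groups, and safety limits.

    import random

    def unleash_monkey(cloud, group_name, dry_run=True):
        # Pick one instance from a service group at random and terminate it,
        # so the surrounding system is forced to prove it tolerates the loss.
        # `cloud` is a hypothetical API client, not any real library.
        instances = cloud.list_instances(group=group_name)
        if len(instances) < 2:
            # Never kill the only instance; the point is to test redundancy.
            return None
        victim = random.choice(instances)
        if dry_run:
            print(f"Would terminate {victim.id} in {group_name}")
        else:
            cloud.terminate_instance(victim.id)
        return victim

The point of running something like this continuously is that losing an instance becomes routine, so redundancy and failover get exercised all the time rather than only during real outages.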

(standard disclaimer: Google employee, not speaking for company, all opinions are my own, etc.)


(I'm an SRE at Google. My opinions are my own.)

WoMs are a training exercise, intended to build familiarity with systems and how to respond when oncall. A typical WoM format is a few SREs sitting in a room, with a designated victim who is pretending to be oncall. The person running the WoM will open with an exchange a bit like this (massively simplified):

"You receive a page with this alert in it showing suddenly elevated rpc errors (link/paste)" "I'm going to look at this console to see if there was just a rollout" "Okay, you see a rollout happened about two minutes before the spike in rpc errors" "I'll roll that back in one location" "rpc errors go back to normal in that location" ...etc

(Depending on the team and quality of simulation available, some of this may be replaced with actual historical monitoring data or simulated broken systems)
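
Purely for illustration, a GM could script a scenario like the one above as a simple lookup from the victim's stated action to what the GM reveals. Real WoMs are usually just a doc plus the GM improvising, so this structure and everything in it is made up.

    # Hypothetical scenario script a WoM "game master" might follow.
    # Each entry maps an action the on-call victim might take to what
    # the GM reveals in response. None of this is a real Google tool.
    SCENARIO = {
        "opening_page": "Elevated RPC error rate on the frontend (link to graph)",
        "responses": {
            "check recent rollouts": "A rollout finished ~2 minutes before the error spike.",
            "check error breakdown": "Errors are concentrated in one location.",
            "roll back in one location": "RPC errors return to normal in that location.",
        },
        "fallback": "Nothing obviously interesting there; what else would you check?",
    }

    def gm_reply(action: str) -> str:
        # The GM matches the victim's stated action against the script,
        # falling back to a generic nudge if it isn't covered.
        for key, reveal in SCENARIO["responses"].items():
            if key in action.lower():
                return reveal
        return SCENARIO["fallback"]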

The "chaos monkey" tool, as I understand it, is intended to maintain a minimum level of failure in order to make sure that failure cases are exercised. I've never been on a team which needed one of those: at sufficient scale and development velocity, the baseline rate of naturally occurring failures is already high enough. We do have some tools like that, but they're more commonly used by the dev teams during testing (where the test environment won't be big enough to naturally experience all the failures that happen in production).


Chaos Monkey is more like Google's DiRT

http://queue.acm.org/detail.cfm?id=2371516


and Dust


That hasn't been a thing for several years; the names were consolidated. Also, a good chunk of DiRT is now continuous and automated (though not autonomous).

Disclaimer: I work at Google and ran the DiRT team for a few years, including incident management itself.



