Chaos Engineering Upgraded

romanhn · on Sept 26, 2015

Great to see the principles codified. We practice them at PagerDuty with our Failure Fridays - https://www.pagerduty.com/blog/failure-friday-at-pagerduty. Seeing the production system handle a data center failure gracefully during practice gives us real confidence that we can handle an unplanned region outage (and we have, in fact).

roghummal · on Sept 26, 2015

PagerDuty - YC S10

burnte · on Sept 25, 2015

They practice something I grabbed onto early in my computing life: error handling is critical. There's pretty much one way for an application to succeed, and that's by doing what the user told you to do without error. However, there are potentially an infinite number of ways to fail, and it's important to think about that early on. I spent the better part of Wednesday debugging what should have been very simple DB stuff because things were failing silently, with no errors at all.
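
To make the silent-failure point concrete, here is a minimal sketch in Python with sqlite3. The table, the function names, and the duplicate-insert scenario are all made up for illustration; this is not the code I was actually debugging.

```python
import logging
import sqlite3

logger = logging.getLogger("db")


def insert_user_silently(conn, name):
    """Anti-pattern: swallow the error, so the caller never learns the write failed."""
    try:
        conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    except sqlite3.Error:
        pass  # the failure disappears here


def insert_user_loudly(conn, name):
    """Log the failure with context, then re-raise so the caller sees it."""
    try:
        conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    except sqlite3.Error:
        logger.exception("insert into users failed for name=%r", name)
        raise


def demo():
    # Hypothetical schema: a UNIQUE column so a duplicate insert fails.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT NOT NULL UNIQUE)")
    insert_user_loudly(conn, "alice")
    insert_user_silently(conn, "alice")  # duplicate: fails, but nobody is told
    try:
        insert_user_loudly(conn, "alice")  # duplicate: logged and re-raised
    except sqlite3.IntegrityError as exc:
        row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        return row_count, type(exc).__name__
    return None
```

Both duplicate inserts fail in the same way, but only the second version leaves a log entry and an exception to start debugging from.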

akavel · on Sept 25, 2015

Interestingly, I believe this leads to the "let it crash" philosophy of Erlang (and apparently Akka?) -- see e.g.: http://c2.com/cgi/wiki?LetItCrash, https://lwn.net/Articles/191059/.

I do believe it's a very good and important approach, in that I don't really know of a better one. Still, I've learned that it also seems to have its own weird and counter-intuitive risks -- vide Systemantics: https://en.wikipedia.org/wiki/Systemantics#System_failure. As I understand the premise: it's easy to over-rely on one's "fail-safe" measures, to the point where the "regular" mode of operation is allowed to deteriorate such that the fail-safes are actually running the system (or are simply hit too often); then the failure of a fail-safe takes the system down, and unexpectedly ("Why, when we had so many and such good fail-safes!").
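
One way to notice that kind of drift is to count how often the fail-safe path actually runs. A rough sketch in Python, where the class name, the `primary`/`fallback` callables, and the 5% threshold are all assumptions of mine, not anything taken from Erlang, Akka, or Systemantics:

```python
class FallbackMonitor:
    """Wraps a primary call with a fallback and tracks how often the fallback runs."""

    def __init__(self, max_fallback_ratio=0.05):
        # Hypothetical threshold: above this share of calls, the fail-safe
        # is effectively running the system and someone should look.
        self.max_fallback_ratio = max_fallback_ratio
        self.calls = 0
        self.fallbacks = 0

    def call(self, primary, fallback):
        """Run primary(); if it raises, count the event and run fallback()."""
        self.calls += 1
        try:
            return primary()
        except Exception:
            self.fallbacks += 1
            return fallback()

    @property
    def fallback_ratio(self):
        return self.fallbacks / self.calls if self.calls else 0.0

    def is_degraded(self):
        """True when the fail-safe is hit more often than the threshold allows."""
        return self.fallback_ratio > self.max_fallback_ratio
```

The idea is only that `is_degraded()` gets alerted on, so a fail-safe that quietly becomes the normal path shows up as a problem before it fails too.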

burnte · on Sept 27, 2015

Oh, I'm all for the "let it crash" ideology; just make it crash in a loud and verbose manner, so I can figure out what went wrong. :)

fahimulhaq · on Sept 26, 2015

On the same subject, Facebook also turned off one of their data centers last year. http://bit.ly/1gX6E6h