Chaos Engineering Upgraded (netflix.com)
140 points by vquemener on Sept 25, 2015 | 6 comments



Great to see the principles codified. We practice them at PagerDuty with our Failure Fridays - https://www.pagerduty.com/blog/failure-friday-at-pagerduty. Seeing the production system handle a data center failure gracefully during practice gives us real confidence that we can handle an unplanned region outage (and we have, in fact).


PagerDuty - YC S10


They practice something I grabbed onto early in my computing life: error handling is critical. There's pretty much one way for an application to succeed, and that's by doing what the user told it to do without error. However, there are potentially an infinite number of ways to fail, and it's important to think about that early on. I spent the better part of Wednesday debugging what should have been very simple DB stuff because things were failing silently, with no errors at all.
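
A minimal sketch of the silent-versus-loud distinction, assuming Python and the standard `sqlite3` module. The `users` table and the function names are hypothetical, chosen only for illustration, and are not taken from the DB code mentioned above.

```python
import logging
import sqlite3

logger = logging.getLogger("db")


def insert_user_silent(conn, name):
    """Anti-pattern: swallows every database error, so the caller never hears about it."""
    try:
        conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    except sqlite3.Error:
        pass  # the failure disappears here


def insert_user_loud(conn, name):
    """Logs the failure with a traceback, then re-raises so the caller has to deal with it."""
    try:
        conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    except sqlite3.Error:
        logger.exception("insert into users failed for name=%r", name)
        raise


def demo():
    # Hypothetical schema: a UNIQUE column makes the second insert fail.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT NOT NULL UNIQUE)")
    insert_user_loud(conn, "alice")
    insert_user_silent(conn, "alice")  # duplicate: fails, but nobody notices
    try:
        insert_user_loud(conn, "alice")  # same failure, now visible
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None
```

With the silent version the duplicate insert simply does nothing visible. With the loud version the same failure shows up in the log and as an exception at the call site.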


Interestingly, I believe this leads to the "let it crash" philosophy of Erlang (and apparently Akka?) -- see e.g.: http://c2.com/cgi/wiki?LetItCrash, https://lwn.net/Articles/191059/.

I do believe it's a very good and important approach, in that I don't really know of a better one. On the other hand, I've learned that it also seems to have its own weird and counterintuitive risks -- vide Systemantics: https://en.wikipedia.org/wiki/Systemantics#System_failure. As I understand the premise: it's easy to over-rely on one's "fail-safe" measures, to the point where the "regular" mode of operation is allowed to deteriorate such that the fail-safes are actually running the system (or are simply hit too often); then the failure of a fail-safe takes the system down, and unexpectedly so ("Why, when we had so many and such good fail-safes!").


Oh, I'm all for the "let it crash" ideology, just make it crash in a loud and verbose manner. Let me figure out what went wrong. :)


On the same subject, Facebook also turned off one of their data centers last year. http://bit.ly/1gX6E6h



