Principles of Chaos Engineering (2018) (principlesofchaos.org)
133 points by archielc on June 14, 2019 | 16 comments



We built a system called "friendly fire" that nukes a server every 10 minutes. It has changed the mindset of all engineers and made our infrastructure missile-proof.

Funnily enough, it also improved our latencies a lot (which I guess is mostly due to memory leaks and the like).
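For the curious, a minimal sketch of what such a loop could look like, assuming a hypothetical host inventory and an SSH-based kill; a real setup would call the cloud or orchestrator API instead:

    package main

    import (
        "log"
        "math/rand"
        "os/exec"
        "time"
    )

    func main() {
        // Hypothetical inventory; a real setup would pull this from service discovery.
        hosts := []string{"app-01", "app-02", "app-03"}

        ticker := time.NewTicker(10 * time.Minute)
        defer ticker.Stop()

        for range ticker.C {
            victim := hosts[rand.Intn(len(hosts))]
            log.Printf("friendly fire: terminating %s", victim)

            // Placeholder kill: swap in your cloud or orchestrator API call.
            if err := exec.Command("ssh", victim, "sudo", "poweroff").Run(); err != nil {
                log.Printf("failed to terminate %s: %v", victim, err)
            }
        }
    }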


We used to boot ~3x the servers we needed, run a hard load on them for a while, performance-test them, and kill the "weakest" two-thirds. You can end up with nodes on iffy hardware, with greedy neighbors (or all on the same physical box), and culling them that way gives you significant performance improvements.

Of course this was a decade ago, but I think the fundamentals are still sound, as far as being skeptical about the quality and longevity of your nodes in a virtual environment.
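For illustration, the culling step might look something like this sketch; the benchmark scores and the terminate hook are placeholders for a real load test and cloud API:

    package main

    import (
        "log"
        "sort"
    )

    type node struct {
        name  string
        score float64 // higher is better, e.g. requests/sec under sustained load
    }

    // keepBestThird sorts nodes by benchmark score and terminates the slowest two-thirds.
    func keepBestThird(nodes []node, terminate func(node)) []node {
        sort.Slice(nodes, func(i, j int) bool { return nodes[i].score > nodes[j].score })
        keep := len(nodes) / 3
        if keep == 0 {
            keep = 1
        }
        for _, n := range nodes[keep:] {
            log.Printf("killing weak node %s (score %.0f)", n.name, n.score)
            terminate(n)
        }
        return nodes[:keep]
    }

    func main() {
        fleet := []node{{"n1", 950}, {"n2", 400}, {"n3", 910}, {"n4", 300}, {"n5", 880}, {"n6", 350}}
        kept := keepBestThird(fleet, func(n node) { /* cloud API call goes here */ })
        log.Printf("keeping %d nodes: %v", len(kept), kept)
    }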


At my previous job we did that for DB nodes on AWS. Still definitely a known technique.


That's awesome. It has probably helped your overall uptime numbers too. We couldn't get SLT approval to implement this, so we had to settle for war games in dev/stage.


It reminds me of the feeling of competition: you might fail during a tournament, but you know exactly where you stand afterwards.


The following link shows how we do Chaos Engineering in TiDB, an open source distributed database:

https://www.pingcap.com/blog/chaos-practice-in-tidb/

As for the fault injection tools we are using:

- Kernel fault injection: the fault injection framework included in the Linux kernel, which you can use to implement simple fault injection for testing device drivers.

- SystemTap: a scripting language and tool for diagnosing performance and functional problems.

- Fail: gofail for Go and fail-rs for Rust (a hand-rolled sketch of the failpoint idea follows below).

- Namazu: a programmable fuzzy scheduler for testing distributed systems.

We also built our own automatic chaos platform, Schrodinger, to automate all these tests and improve both efficiency and coverage.
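For readers unfamiliar with failpoints, here is a hand-rolled illustration of the idea behind gofail and fail-rs (this is not their actual API): production code calls a named hook, and tests arm that hook to inject an error.

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    var (
        mu         sync.RWMutex
        failpoints = map[string]error{}
    )

    // enable arms a named failpoint with the error it should inject.
    func enable(name string, err error) {
        mu.Lock()
        defer mu.Unlock()
        failpoints[name] = err
    }

    // inject returns the armed error for a failpoint, or nil if it is disarmed.
    // Sprinkle calls to it at interesting places (before a disk write, an RPC, ...).
    func inject(name string) error {
        mu.RLock()
        defer mu.RUnlock()
        return failpoints[name]
    }

    // writeRecord is a stand-in for real storage code with a failpoint in it.
    func writeRecord(data string) error {
        if err := inject("before-write"); err != nil {
            return err
        }
        fmt.Println("wrote:", data)
        return nil
    }

    func main() {
        fmt.Println(writeRecord("a")) // failpoint disarmed: write succeeds

        enable("before-write", errors.New("injected disk error"))
        fmt.Println(writeRecord("b")) // failpoint armed: write fails
    }

In a test you would arm the failpoint and then exercise the code path; real failpoint libraries typically layer probabilities, panics, and sleeps on top of this basic mechanism.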


I have not used it, but I have heard this is a very useful tool: https://github.com/Netflix/chaosmonkey


Sure, though at a very early stage the people running the test can often just do this manually. Don't be put off by having to set up the whole suite: a lot of the value of chaos engineering comes from randomly removing bits of infrastructure by hand (a d6 and a lookup table work fine). The value comes from what you learn when infrastructure is terminated.
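The d6-and-lookup-table approach really can be that simple; a throwaway sketch, with hypothetical targets:

    package main

    import (
        "fmt"
        "math/rand"
    )

    func main() {
        // Hypothetical lookup table: each face of the die maps to one manual action.
        targets := [6]string{
            "kill one web server",
            "kill one background worker",
            "restart the cache",
            "take down a DB replica",
            "block traffic to a third-party API",
            "add 500ms of latency at the load balancer",
        }
        roll := rand.Intn(6) // 0..5, i.e. a d6
        fmt.Printf("rolled a %d: %s\n", roll+1, targets[roll])
    }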


What would be advised in the following situation if one wanted to follow chaos engineering principles:

- there's a service that needs config data from a DB on another node to initialize itself and become useful. Should the service die if it can't connect to the DB on startup (so that the error propagates), or should it start and retry indefinitely until the DB connection is established, returning an error code to its consumers until then?
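For concreteness, the retry-indefinitely option might look like this sketch; loadConfig is a stand-in for the real DB call, and the service would report "not ready" until it returns:

    package main

    import (
        "errors"
        "log"
        "time"
    )

    // waitForConfig keeps retrying with exponential backoff until the config
    // loads, instead of exiting. Until it returns, the service should report
    // "not ready" and answer consumers with an error code.
    func waitForConfig(loadConfig func() error) {
        backoff := time.Second
        for {
            err := loadConfig()
            if err == nil {
                return // ready: flip the readiness probe and start serving
            }
            log.Printf("config DB unreachable, retrying in %s: %v", backoff, err)
            time.Sleep(backoff)
            if backoff < 30*time.Second {
                backoff *= 2
            }
        }
    }

    func main() {
        attempts := 0
        // Stand-in for the real "read config from the DB" call:
        // fails twice, then succeeds.
        loadConfig := func() error {
            attempts++
            if attempts < 3 {
                return errors.New("connection refused")
            }
            return nil
        }
        waitForConfig(loadConfig)
        log.Println("config loaded, service is ready")
    }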


I don't think deciding what to do in the failure case is part of chaos engineering.

Identifying that the service needs to do something, and whether it actually does it, is part of chaos engineering, e.g. by turning off the DB for a bit and seeing what it does.


Basically it randomly kills prod nodes and forces engineers to consider this when designing their services. This is one of those "so stupid it's genius" ideas, kinda like mailing DVDs rather than having people go to a movie store.


Other useful materials:

- Chaos Monkey Guide for Engineers https://www.gremlin.com/chaos-monkey/

- Recent HN discussion on Resilience Engineering: Where do I start? https://news.ycombinator.com/item?id=19898645


If you've never run a chaos experiment, how do you square blast radius with running in prod?

It seems like this setup works great if built in from the get-go, but is incredibly painful and possibly dangerous if you're starting with existing applications.


Set up a parallel deployment, run the experiments there. Document the failures in as granular a way as possible, decree that future deployments aren't allowed to add to the set of known failures. Assign e.g. 20% time to fixing the known problems. When confidence passes a threshold, start running the experiments in prod.

It's basically the strangler pattern[1]. It is painful, but can be made arbitrarily safe.

[1] https://news.ycombinator.com/item?id=19122973



I see no mention of AFL, which seems like a fitting tool for the topic.

Also the term 'antifragile' (lightly controversial) comes to mind.



