We built a system called "friendly fire" that nukes a server every 10 minutes. It has changed the mindset of all engineers and made our infrastructure missile-proof.
Funnily enough it also improved our latencies a lot (which I guess is mostly due to memory leaks et al.)
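(Not the commenter's actual implementation, but the core loop is tiny. A minimal sketch in Go, assuming a hypothetical hosts.txt inventory and SSH access to each node; the kill command and file name are placeholders.)

```go
// Minimal "friendly fire" style killer: every 10 minutes, pick a random host
// from an inventory file and hard-reboot it. The orchestrator is expected to
// replace the victim.
package main

import (
	"bufio"
	"log"
	"math/rand"
	"os"
	"os/exec"
	"time"
)

func loadHosts(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var hosts []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		if line := s.Text(); line != "" {
			hosts = append(hosts, line)
		}
	}
	return hosts, s.Err()
}

func main() {
	for {
		hosts, err := loadHosts("hosts.txt") // hypothetical inventory file
		if err != nil || len(hosts) == 0 {
			log.Fatalf("no hosts to shoot at: %v", err)
		}
		target := hosts[rand.Intn(len(hosts))]
		log.Printf("friendly fire: rebooting %s", target)

		// Hard-reboot the victim; failures here are logged, not fatal.
		if out, err := exec.Command("ssh", target, "sudo", "reboot", "-f").CombinedOutput(); err != nil {
			log.Printf("kill failed for %s: %v (%s)", target, err, out)
		}
		time.Sleep(10 * time.Minute)
	}
}
```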
We used to boot ~3x the servers we needed, run a hard load on them for a while, performance test them, and kill the "weakest" 2/3rds. You can end up with nodes on iffy hardware or with greedy neighbors (or all on the same physical box), and culling them this way yields significant performance improvements.
Of course this was a decade ago, but I think the fundamentals are still sound, as far as being skeptical about the quality and longevity of your nodes in a virtual environment.
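(A rough sketch of that over-provision-and-cull idea: benchmark every freshly booted node, keep the fastest third, terminate the rest. The benchmark and terminate functions here are placeholders, not any real provider API.)

```go
// Boot ~3x the nodes you need, load-test each one, keep the top third.
package main

import (
	"fmt"
	"sort"
	"time"
)

type node struct {
	id      string
	latency time.Duration // lower is better
}

// benchmark is a stand-in for running a real load test against the node.
func benchmark(id string) time.Duration {
	// ... run synthetic load, measure p99 latency, etc. ...
	return 0
}

// terminate is a stand-in for the cloud provider's terminate call.
func terminate(id string) { fmt.Println("terminating", id) }

// cullWeakest benchmarks all candidate nodes, keeps the fastest third,
// and terminates everything else.
func cullWeakest(ids []string) []string {
	nodes := make([]node, 0, len(ids))
	for _, id := range ids {
		nodes = append(nodes, node{id: id, latency: benchmark(id)})
	}
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].latency < nodes[j].latency })

	keep := len(nodes) / 3
	for _, n := range nodes[keep:] {
		terminate(n.id)
	}
	survivors := make([]string, 0, keep)
	for _, n := range nodes[:keep] {
		survivors = append(survivors, n.id)
	}
	return survivors
}

func main() {
	fmt.Println(cullWeakest([]string{"n1", "n2", "n3", "n4", "n5", "n6"}))
}
```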
That's awesome. It has probably helped your overall uptime numbers too. We couldn't get SLT approval to implement this, so we had to settle for war games in dev/stage.
- Kernel Fault Injection: the fault injection framework included in the Linux kernel, which you can use to implement simple fault injection for testing device drivers.
- SystemTap: a scripting language and tool for diagnosing performance or functional problems.
- Fail: gofail for Go and fail-rs for Rust (the failpoint idea is sketched after this list).
- Namazu: a programmable fuzzy scheduler for testing distributed systems.
We also built our own automatic chaos platform, Schrodinger, to automate all these tests and improve both efficiency and coverage.
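(The failpoint idea behind gofail/fail-rs, boiled down to a hand-rolled Go sketch. This is not either library's actual API, just the general mechanism: a named failpoint that can be armed from an environment variable to force an error on a specific code path.)

```go
// Hand-rolled failpoints: arm named points via FAILPOINTS="a,b,c" and have
// instrumented code paths return injected errors when their point is armed.
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
	"sync"
)

var (
	mu    sync.RWMutex
	armed = map[string]bool{}
)

// init arms failpoints listed in a comma-separated env var,
// e.g. FAILPOINTS="write-snapshot,apply-entry".
func init() {
	for _, name := range strings.Split(os.Getenv("FAILPOINTS"), ",") {
		if name != "" {
			mu.Lock()
			armed[name] = true
			mu.Unlock()
		}
	}
}

// failpoint returns an injected error when the named point is armed.
func failpoint(name string) error {
	mu.RLock()
	defer mu.RUnlock()
	if armed[name] {
		return errors.New("failpoint triggered: " + name)
	}
	return nil
}

func writeSnapshot() error {
	if err := failpoint("write-snapshot"); err != nil {
		return err // simulate the disk write failing here
	}
	// ... real snapshot logic ...
	return nil
}

func main() {
	fmt.Println(writeSnapshot())
}
```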
Sure, though at a very early stage the people running the test can often just do this manually. Don't be put off by having to set up the whole suite; a lot of the value of Chaos Engineering can be had by randomly removing bits of infrastructure by hand (a d6 and a lookup table work fine). The value comes from what you learn when infrastructure gets terminated.
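(The "d6 and a lookup table" version, taken literally. The targets are hypothetical; the point is that manual chaos needs almost no tooling.)

```go
// Roll a die, do whatever the table says.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	table := map[int]string{
		1: "kill one web server",
		2: "kill one worker node",
		3: "restart the cache",
		4: "drop the replica DB",
		5: "block traffic between AZ-a and AZ-b",
		6: "do nothing (control run)",
	}
	roll := rand.Intn(6) + 1
	fmt.Printf("rolled %d: %s\n", roll, table[roll])
}
```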
What would be advised in the following situation if one wanted to follow chaos engineering principles:
- there's a service that needs config data from a DB on another node in order to initialize itself and become useful
- should the service die if it can't connect to the DB on startup (so that the error propagates), or should it start and retry indefinitely until the DB connection is established, returning an error code to its consumers until then?
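(A rough sketch of the second option, assuming an HTTP service and a placeholder loadConfig call: start immediately, retry the config DB with backoff in the background, and answer 503 until the config arrives, so the failure stays visible to consumers and to monitoring.)

```go
// Serve 503s until the config DB becomes reachable, retrying with backoff.
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var ready atomic.Bool

// loadConfig is a placeholder for fetching config from the remote DB.
func loadConfig() error {
	// ... connect to the DB on the other node, read config ...
	return nil
}

func main() {
	go func() {
		backoff := time.Second
		for {
			if err := loadConfig(); err == nil {
				ready.Store(true)
				return
			} else {
				log.Printf("config DB not reachable, retrying in %s: %v", backoff, err)
			}
			time.Sleep(backoff)
			if backoff < time.Minute {
				backoff *= 2 // exponential backoff, capped at one minute
			}
		}
	}()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "config not loaded yet", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```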
I don't think what to do during the failure case is part of chaos engineering.
Identifying that it needs to do something, and whether it actually does it, is part of chaos engineering, e.g. by turning off the DB for a bit and seeing what the service does.
Basically it randomly kills prod nodes and forces engineers to account for this when designing their services. This is one of those "so stupid it's genius" ideas, kinda like mailing DVDs rather than having people go to a movie store.
Set up a parallel deployment, run the experiments there. Document the failures in as granular a way as possible, decree that future deployments aren't allowed to add to the set of known failures. Assign e.g. 20% time to fixing the known problems. When confidence passes a threshold, start running the experiments in prod.
It's basically the strangler pattern[1]. It is painful, but can be made arbitrarily safe.
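(One way to enforce "no new failure classes": a small CI-style check, assuming a hypothetical failure report produced by each chaos run, that fails the build if any observed failure class is outside the documented allowlist.)

```go
// Fail the pipeline when a chaos run surfaces a failure class that isn't in
// the documented set of known failures.
package main

import (
	"fmt"
	"os"
)

// knownFailures is the documented set that deployments may not grow.
var knownFailures = map[string]bool{
	"db-failover-drops-writes": true,
	"cache-cold-start-latency": true,
}

func checkRun(observed []string) error {
	for _, f := range observed {
		if !knownFailures[f] {
			return fmt.Errorf("new failure class introduced: %s", f)
		}
	}
	return nil
}

func main() {
	// observed would come from the chaos run's report; hard-coded here.
	observed := []string{"db-failover-drops-writes", "orphaned-queue-consumer"}
	if err := checkRun(observed); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("no new failure classes")
}
```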