Chaos Monkey Guide for Engineers (gremlin.com)
195 points by vinnyglennon on Feb 16, 2019 | 19 comments



At the BBC a few years ago we completely replaced our Chaos Monkey setup with a "chaos-lambda". It's extremely simple to set up and saves us a chunk of cash in compute and maintenance. I'm sure Chaos Monkey is more featureful, but a lot of the time you don't really need those features. If you just need something that ticks on a cron and kills instances at random across your estate, and you have a good number of AWS accounts, have a look.

https://github.com/bbc/chaos-lambda
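
The core idea is tiny. A rough sketch of the concept (not the repo's actual code; the handler name, permissions and trigger here are assumptions) is a scheduled Lambda that terminates one random in-service instance per auto scaling group:

    import random
    import boto3

    # Sketch only: assumes boto3, IAM permissions for
    # autoscaling:DescribeAutoScalingGroups and ec2:TerminateInstances,
    # and an EventBridge/CloudWatch schedule rule as the cron trigger.
    # Pagination of the ASG listing is omitted for brevity.
    def handler(event, context):
        autoscaling = boto3.client("autoscaling")
        ec2 = boto3.client("ec2")
        groups = autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]
        for group in groups:
            in_service = [i["InstanceId"] for i in group["Instances"]
                          if i["LifecycleState"] == "InService"]
            if not in_service:
                continue
            victim = random.choice(in_service)
            print("terminating %s from %s" % (victim, group["AutoScalingGroupName"]))
            ec2.terminate_instances(InstanceIds=[victim])

Everything else (opt-in tags, probabilities, notifications) is bookkeeping layered on top of a loop like that.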

It's always good to see this "chaos culture" being promoted, as it drives best engineering practices in terms of architecting, testing and monitoring resilient systems. However, it's also interesting to see how at times this space gets inflated, packaged and sold as this big complicated thing that requires "chaos engineers" to implement (almost like "agile" got inflated into a standalone industry :)). It's just a set of good practices that engineers can adopt to improve critical systems.


Honestly, I've never worked in an environment with a stack mature and scalable enough, and with roles within the team so clearly divided, for Chaos Monkey to be of any use. Hectic deployments, an "it's not my job" attitude, feature creep, and a litany of minor errors and misconfigurations took its place.


Perhaps if you had introduced Chaos Monkey, those other problems would have gotten solved as a solution to the Chaos (tm). :)


No. That’s entirely not the way it works. Chaos Engineering works in companies with significant scale and sufficient maturity, and the purpose is to discover design or implementation failures in distributed systems before they happen during big rushes where you’re hopefully making lots of profit.

If you lack sufficient maturity or you are not large enough, you will actually lose more than you gain by trying to implement chaos engineering. It's like teaching a toddler to use a propane torch. You just don't do it, or everything you love in the world will burn.


Self-reply to admit that I shouldn't lecture my grandma on how to suck eggs.


Maybe I should have added a disclaimer with the :) at the end.

Thanks for the laugh though. Your phrasing was awesome.


I'm telling you from experience that that's wrong. We did it on a project with just 15 or 20 developers, starting from day one of development. Literally the first thing we installed was Chaos Monkey, before doing anything else, even before getting Jenkins running.


I think it was a Netflix engineer at a conference who once said that you shouldn't run Chaos Monkey if you expect it to fail. If you don't think it will work, go fix the places you already know to be rotten first.

Your comment was probably in jest, but it provided a nice platform for mine :)


It was indeed in jest. And there’s a good chance I was the Netflix engineer who told you that.

To be fair, though, when we launched it at Netflix we knew it would break things. But we also knew we had the corporate will to deal with it.


> But we also knew we had the corporate will to deal with it.

That's really the key. If you don't have that, it's not going to work out in the long term.


If the GP introduced a Chaos Monkey, he would probably get fired by the same people that are not fixing those problems.


Corporations are groups of people, policies, and technologies. Of these, people are the hardest to work with.


Yeah, I don't need a chaos monkey; I've got myself.


Older versions of chaos monkey didn’t require spinnaker and still work perfectly fine.

Also, if you're using this, the right time to start doing it is in development, account-wide. It's mildly obnoxious for devs to lose things like Jenkins servers and bastions, but you very quickly find single points of failure and start to engineer around them. Usually making your stuff resilient isn't that much work, but it's hard to know what you need to do until you test it.

If you’re randomly terminating everything in dev and testing and staging, you should already be battle hardened by the time you get into production.
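
If you do want to keep the carnage scoped to non-production while it runs account-wide, a tag filter is usually enough. Rough boto3 sketch (the "Environment" tag and its values are assumptions, not anything Chaos Monkey prescribes):

    import random
    import boto3

    # Sketch: terminate one random running instance carrying a
    # non-production Environment tag. Tag name and values are illustrative.
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(Filters=[
        {"Name": "tag:Environment", "Values": ["dev", "test", "staging"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])["Reservations"]
    candidates = [i["InstanceId"]
                  for r in reservations for i in r["Instances"]]
    if candidates:
        ec2.terminate_instances(InstanceIds=[random.choice(candidates)])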


I built a similar, much more generic thing to test networks with various failure modes [0]. It creates proxies for TCP/HTTP/APIs and handles those streams however you want: cutoff, slow-down, dropout, or DIY handlers. Everything can be configured with JSON, routed by API endpoint, and modulated live through an admin API. Process clustering is supported.

0. https://github.com/NHQ/netmorphic-1
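
For a sense of how little is needed for the basic latency/dropout cases, here is a rough Python sketch of the same idea (not netmorphic's actual API; host, port and probabilities are placeholders): a TCP proxy that delays each chunk and occasionally kills the connection.

    import asyncio
    import random

    # Placeholders, not a real config format.
    UPSTREAM = ("127.0.0.1", 8080)   # the real service
    LISTEN = ("0.0.0.0", 9000)       # clients connect here instead
    LATENCY = (0.05, 0.5)            # seconds added per chunk
    DROP_PROBABILITY = 0.01          # chance of cutting the stream per chunk

    async def pipe(reader, writer):
        try:
            while True:
                data = await reader.read(4096)
                if not data:
                    break
                if random.random() < DROP_PROBABILITY:
                    break                                      # abrupt dropout
                await asyncio.sleep(random.uniform(*LATENCY))  # slow down
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle(client_reader, client_writer):
        up_reader, up_writer = await asyncio.open_connection(*UPSTREAM)
        await asyncio.gather(pipe(client_reader, up_writer),
                             pipe(up_reader, client_writer))

    async def main():
        server = await asyncio.start_server(handle, *LISTEN)
        async with server:
            await server.serve_forever()

    asyncio.run(main())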


Something that I don't think gets talked about or addressed enough is partial failures. A whole instance dying is pretty much a solved problem; most people are aware of it and know how to deal with it, and if it isn't addressed, people are just knowingly taking their chances at this point. A server being removed is an on-or-off kind of thing. Something like sporadic latency or corrupted data can cause some surprising and unpredictable issues.


Chaos Monkey does that too. We used to have a separate monkey called Latency Monkey that induced network latency, because as you aptly point out, detecting if something is down is a lot easier than detecting if it is slow or intermittently down. Netflix has since incorporated those failure modes into chaos monkey: https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-...


What is a recommended Chaos Monkey guide/method on Kubernetes, i.e. without running Spinnaker?


Here are several chaos engineering efforts that work with Kubernetes: https://landscape.cncf.io/category=chaos-engineering&format=...
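
If you just want the bare-bones equivalent of Chaos Monkey without installing any of them, deleting a random pod on a schedule gets you most of the way. Rough sketch with the official Kubernetes Python client (the namespace is a placeholder; none of this is taken from those projects):

    import random
    from kubernetes import client, config

    def kill_random_pod(namespace="default"):   # namespace is a placeholder
        config.load_kube_config()               # or load_incluster_config() in-cluster
        v1 = client.CoreV1Api()
        pods = v1.list_namespaced_pod(namespace).items
        running = [p for p in pods if p.status.phase == "Running"]
        if running:
            victim = random.choice(running)
            print("deleting pod %s" % victim.metadata.name)
            v1.delete_namespaced_pod(victim.metadata.name, namespace)

You'd probably run something like this from a CronJob and scope it with RBAC to the namespaces you actually want it touching.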



