Netflix Chaos Monkey Upgraded (netflix.com)
271 points by dustinmoris on Oct 19, 2016 | 84 comments



I wonder what the reasoning was for having version 2 only terminate instances (vs. burning up CPU, taking disks offline, etc.)? I assume it's something to do with what Chaos Monkey is NOT trying to solve (i.e. eating up CPU is caught elsewhere by another system and is out of scope for Chaos Monkey now). Just trying to think it through...


See my other reply in this thread for our reasoning. https://news.ycombinator.com/item?id=12744567


Was wondering the same thing... I know in our environment, unexpected events on a box cause more problems than entire server failures. We design around servers coming in and out; specific processes failing in random ways is harder to design around.


If you can handle a server failing and you have good reporting, you can handle many of those random issues by simply rebooting the affected server. So I can see them taking out things like "load up the memory" or "load up the CPU". Logic errors (bad RAM, corrupted packets, high packet loss) are another story, but I don't know if V1 did those.


Seems to me those kinds of issues would be a good way to test your monitoring, for instance verifying that the appropriate people are notified, or automatic action is taken, within a reasonable amount of time after the CPU on the box starts spiking.


We have another system for outlier detection that can kill instances that start behaving badly in terms of CPU, response time, error rates, etc.

http://techblog.netflix.com/2015/07/tracking-down-villains-o...


I would assume that terminating is easy via the AWS API, whereas some of the other things need a process on the instance. You shouldn't really be connecting to boxes directly over SSH if you do DevOps correctly, so maybe they blocked port 22 to enforce this.
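For illustration, termination really is just one API call from the outside, with no agent on the box. A minimal sketch using the aws-sdk-go v1 EC2 client (the region and instance ID are placeholders, and this is not Chaos Monkey's actual code):

    package main

    import (
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/ec2"
    )

    func main() {
        // Region and instance ID are placeholders for this sketch.
        sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
        svc := ec2.New(sess)

        // Terminating an instance is a single call; nothing has to run on the box itself.
        out, err := svc.TerminateInstances(&ec2.TerminateInstancesInput{
            InstanceIds: aws.StringSlice([]string{"i-0123456789abcdef0"}),
        })
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(out)
    }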


What do you mean by "do devops correctly" to avoid SSH on boxes? (I'm a developer, not devops.)


"Devops" has bazillions of meanings, but avoiding (human) ssh to production boxes is a generally sound principle these days because our infrastructures are becoming harder to understand by poking at boxes one or two at a time now even for forensic analysis.


So logging in to a server to check a logfile (assuming I don't or can't do centralized logging) is considered anti-devops?

Edit: Sorry, responded to the wrong parent, sigh.


It's just a matter of scale. If you are at the scale of Netflix, VMs are probably black boxes that are too complex to inspect individually, and logs are output somewhere else anyway. Plus, the problem may involve the interaction of several VMs, the network, or other components together.


At this scale, you basically have to have centralized logging. When you have thousands of parallel instances of a single application, searching logs box-by-box just isn't practical.

Consider also that if you're elastically scaling EC2 instances and you need logs off an instance that's since been terminated, too late! That disk is gone. So again, you need a central log service.


Parsing logs isn't the only way to troubleshoot.


It's not even the best way, just sometimes it's the only way.


Immutable infrastructure [1] is preferred. If an SSH login is detected, you must assume something on the server has changed and deviated from the baseline.

[1] https://www.oreilly.com/ideas/an-introduction-to-immutable-i...


If deployment is automated and "clean", images get baked into machines and they just start. For instance, we use Ansible against the machine itself on boot, so we don't really need SSH access to it: everything is automated (but we keep it open to troubleshoot anything that may happen).


Everything goes through Spinnaker now, which in turn supports all clouddriver providers, including AWS, GCP, Azure, and Kubernetes. Resource limits should be set by instance type. That's more of an application-level thing than the infrastructure level, which is what Chaos Monkey is supposed to simulate.


I actually agree with you about immutable infrastructure, as I work at implementing it.

But that's dangerously close to declaring a "One True Way", which this certainly is not: so much of this is still evolving, across a wide variety of situations and circumstances.


To test a distributed system, there's not a lot of value in simulating all the many different conditions that can happen on a machine. From the system's perspective, you don't really care what happens on a single computer.

The conditions you need to simulate are (a) the machine being abruptly gone or (b) the machine still accepting requests, but being very slow in returning them and maybe (c) machine returning incorrect results. Seen from the outside, everything that can happen is usually (a) or (b), and with ECC memory and reasonable software (?) hopefully never (c).
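For what it's worth, a minimal sketch of injecting (a) and (b) from the caller's side, as a hypothetical Go http.RoundTripper wrapper (not Netflix's FIT; the rates and delay are made up):

    package main

    import (
        "errors"
        "math/rand"
        "net/http"
        "time"
    )

    // chaosTransport wraps an http.RoundTripper and randomly simulates
    // (a) the remote machine being abruptly gone, or
    // (b) the remote machine being very slow to respond.
    type chaosTransport struct {
        next      http.RoundTripper
        failRate  float64       // probability of simulating (a)
        slowRate  float64       // probability of simulating (b)
        slowDelay time.Duration // extra latency injected for (b)
    }

    func (c *chaosTransport) RoundTrip(req *http.Request) (*http.Response, error) {
        switch r := rand.Float64(); {
        case r < c.failRate:
            return nil, errors.New("chaos: simulated connection failure")
        case r < c.failRate+c.slowRate:
            time.Sleep(c.slowDelay)
        }
        return c.next.RoundTrip(req)
    }

    func main() {
        // Callers are expected to tolerate both the injected errors and the added latency.
        client := &http.Client{Transport: &chaosTransport{
            next:      http.DefaultTransport,
            failRate:  0.05,
            slowRate:  0.10,
            slowDelay: 2 * time.Second,
        }}
        _, _ = client.Get("http://example.com/")
    }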


I think it could be because it doesn't provide value to the development teams. Why chase a memory leak or a CPU bug if it's just being caused by your fault-testing app?

Preventing the negative effects of random machines disappearing, though, is a challenge that involves good ops, devs, even UI/UX I would imagine, and it's closer to something that users actually experience the downside of in real life.


I posted this piece of news to my team Slack at work, and a colleague of mine wrote: "we don't need chaos monkey, we have developers for that".

While being funny, it also holds a lot of truth. I guess Netflix can hire really top-notch devs who do not accidentally force downtime on their software.


Developers cause software downtime, but they usually don't cause infrastructure downtime, which is what Chaos Monkey does. Cloud VMs have failure rates somewhere around 1-2% depending on who you ask. This is low enough that you can ignore it most of the time, but it'll come back and bite you hard later. Chaos Monkey artificially forces that failure rate high enough that you'll notice problems immediately and fix them before they become too engrained into your architecture.


We have different tools for different kinds of failures.

Chaos Monkey helps ensure that you are resilient to single instance failure. Kong helps ensure that we are resilient to region failure.

Most of the developer-induced pain (which is the most frequent source of pain) happens at the service level -- a bad code push that somehow made it through canary, accidentally doing something you shouldn't, misconfiguring something, etc. For tolerating service-level failures, we use different tools that minimize the fallout of the failure injection. Specifically, FIT (and the soon-to-be-revealed ChAP). These tools allow us to be more surgical in our injection of failure and tie that into our telemetry solutions.

We only inject failures we expect to be resilient to. Sadly, that is a subset of the failures that people cause ;)


The difference is scale. Only a handful of companies run at Netflix scale.

I wouldn't trust developers to do what Chaos Monkey does at such a scale, no matter how good you think they are.


The joke is funny, but it is actually shockingly difficult to make nodes kill themselves instead of doing something far more malignant, which is the typical case for bugs. That is, a zombie, just like in real life, can be worse than a corpse (since we are being funny and all). Chaos Monkey puts two in the head.

I have had issues killing errant JVMs and Rackspace nodes (yes sadly we are still on Rackspace).

I can understand why 2.0 is much more focused given the plethora of monitoring solutions.


I don't think even hiring top-notch devs would make this reality completely go away. Tools like this would likely keep them in check too. And yeah, basically the same response on our internal chat here x)


From my experience, this is naive. It's funny, but really naive.

The bigger your application stack is (micro-services, API calls, network calls), the more failure modes you need to test, and there's no way to "trust" developers to do it themselves.

Also, Netflix hires a lot of top-notch developers and their infrastructure is pretty awesome.

[I don't work @ Netflix, just a devops dude :)]


That rings so true. But on the flip side, it might be that because Netflix has a chaos monkey and all its services need to be resilient to failures, accidental downtime isn't that big of an issue to them.

Their developers can still make the same mistakes that we do, but their architecture is better designed to handle that. Just a thought.


Netflix had a worldwide outage a few weeks ago that lasted hours, so yes, they need these kinds of tools :)


Another useful tool is https://github.com/gaia-adm/pumba - like Chaos Monkey, but just for Docker containers. The coolest part for us was emulating networking problems between containers (packet loss, unavailability, etc.).


Blockade is another tool to check out in this area: http://blockade.readthedocs.io/en/latest/


What do you use for high availability with Docker?


HA containers for us means smart orchestration tools. We did not want to lock ourselves into Docker-only infrastructure (even now rkt is a very compelling alternative), and wanted an orchestrator/scheduler that is focused entirely on that job. Outside of Swarm, Mesos & co appeared too intrusive, and Nomad is quite narrow in what it does. So we picked Kubernetes and are very happy with it.


I'd love to see this used in the Jepsen tests


AFAIR Jepsen is only meant to test the CAP aspects of a system, not how it behaves otherwise.


Parent may have meant that it would be cool to see a write up on how this tool can be used to simulate network conditions between nodes in a Jepsen test using only containers on a single docker host.


Yep, that's what I meant.


First, a shameless plug for an alternative implementation: https://github.com/BBC/chaos-lambda

Seems unfortunate that it requires coupling with Spinnaker, although I can see how it helps with the cluster definition features.

Edit: I'll add that we've been using the original chaos monkey and chaos lambda extensively in production for some time with very few problems.


Interesting that all of the resource burning features have been removed, I wish they had expanded on the reasons why. I always found those to be the most differentiating features of Chaos Monkey. Did they just not get a lot of use internally at Netflix?


Resource exhaustion manifests as latency or failure. We inject latency and failure using FIT, so we can limit the "blast radius". When you are testing these failure modes, you are really testing the interaction between microservices, and this requires a bit more precision and sophistication.

Source: I'm on the Chaos team here at Netflix.



New life goal: Join the "Chaos Team"


Yes, I wonder that too - complete node failure is the nicest failure that can happen. (See e.g. http://danluu.com/limplock/, because Dan Luu's site is always excellent.)

"We rewrote the service for improved maintainability" seems an important part of this blog post.


I wonder if they have a service that automatically detects all those failures and terminates the entire node instead?


I was thinking along these lines as well. Possibly they have other checks and balances in place for less-than-death situations, like high CPU usage, and those situations result in another mechanism killing the server, which achieves the same thing this solution provides.



Maybe internally such situations translate to a termination anyway.


Perhaps it was too much of a liability.


This is awesome. Since Chaos Monkey now leverages Spinnaker, you can run it against any clouddriver provider. Looking forward to trying this out with Kubernetes. I believe Spinnaker treats namespaces as regions. Eventually it would be cool to simulate masters or even entire federated clusters going down, to test Kubernetes scheduling resilience.


Every time this is in the news I get a feeling of awe for operations teams having confidence enough to deploy this. I've usually been in small teams with the feature mill factor turned up way too high.


I plan to run it on staging while running load test suites. Not brave enough to run it in production yet.


If you don't use a tool like this, entropy will take care of taking your machines down for you. Only then, it won't be a regularly rehearsed part of "normal operations", so you might find yourself up a creek without a paddle.


Yeah, sure, but then it won't be the fault of whoever (me?) thought using Chaos Monkey in production was a good idea.

Is it good for the organization? Yes. Good for the guy pushing it? Very possibly very not.


Beyond external termination services like Chaos Monkey, what are good examples of software that purposely increase internal nondeterminism or failure injection in production?

Go has a race detector mode, but it is an optional debug feature with a performance cost. The Linux kernel's jiffy clock starts counting from -5 minutes (i.e. five minutes before wraparound), so drivers must handle clock rollover correctly; instead of an uncommon "once every 48 days" event, it happens shortly after every boot. Firefox has a chaos debug mode that does things like randomize thread priorities and simulate short socket reads, but that also has performance costs.
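The jiffies trick is easy to copy in your own code: start a wrapping counter a few ticks before overflow and always compare with wraparound-safe arithmetic, so the rollover path gets exercised right away. A hedged Go sketch of the idea (not the kernel's code):

    package main

    import "fmt"

    // ticks wraps like a 32-bit jiffy counter. Starting it near the
    // wraparound point means rollover happens shortly after startup,
    // so code that compares tick values naively breaks early and loudly.
    var ticks uint32 = ^uint32(0) - 300 // ~300 ticks before overflow

    // tickAfter reports whether a is after b, correctly across wraparound
    // (the same idea as the kernel's time_after macro).
    func tickAfter(a, b uint32) bool {
        return int32(a-b) > 0
    }

    func main() {
        deadline := ticks + 100 // set before the wrap
        for i := 0; i < 400; i++ {
            ticks++ // the counter wraps past zero during this loop
        }
        // Naive comparison says the deadline hasn't passed (wrong, because
        // ticks wrapped); the wraparound-safe comparison gets it right.
        fmt.Println("naive:", ticks > deadline, "safe:", tickAfter(ticks, deadline))
    }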


Go intentionally introduces randomness when reading maps so developers don't write code dependent on order.
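A quick demonstration; run this a few times and the printed order changes between runs:

    package main

    import "fmt"

    func main() {
        m := map[string]int{"a": 1, "b": 2, "c": 3, "d": 4}
        // The Go runtime randomizes the iteration start point, so this
        // loop prints the keys in a different order on different runs.
        for k, v := range m {
            fmt.Println(k, v)
        }
    }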


Ironically, some Go developers began depending on the randomized map ordering and were surprised when it changed! :)

runtime: hashmap iterator start position not random enough #8688

https://github.com/golang/go/issues/8688


The Linux kernel also recently added a debugging option that when enabled, instead of just probing a device it performs a probe / remove / probe sequence to ensure that device removal works.


Has anyone else deployed a Chaos Monkey in production?

I can imagine it would be a tough sell to the CEO. :)


Chaos Monkey is sort of like Advanced Continuous Deployment. Most shops are still struggling with the basics. You can't even think of trying to sell running this to the C-level until you've proven that you can at least walk (automated deployment and rollback).

I remember reading years and years ago about bandit algorithms... this kind of ops work is at a level that's found only in a few different companies.


Yes.

The "sell" can be tricky for some people, until your first production issue. Never let a crisis go to waste.

Machines go away all the time in the cloud. This tool increases the frequency so you can ensure your system handles it gracefully.

Some people believe their system can tolerate this class of failures, but without continuous validation, that is more of a hope than a certainty.


In my opinion, it doesn't count unless it's in production.

Why? Your customers use your production environment, not your test environment. Something will cause loss of an instance for you:
* Mistaken termination
* AWS retirement (and you missed the email)
* Cable trip in the data center
* <Something else we can come up with>
* <This list goes on>

So, vaccinate against the loss of an instance cratering your service. Give your prod environment a booster shot (with Chaos Monkey or something like it) every hour of every day. Then, when anything from the above list happens, your infrastructure handles it gracefully and without intervention. Continued booster shots ensure that this stability continues through config changes, software version changes, OS changes, tooling changes, etc.

I think the better question is "Why wouldn't you do this?"



How so? The benefits are worth it, and I doubt any CEO would argue against having fault-tolerant code :)

You catch bugs, and no one says you can't run Chaos Monkey in staging or a similar environment if it really is a tough sell.


The drawbacks of potentially causing downtime, and therefore potentially driving away customers and gaining an image of unreliability, can be much more damaging than not using it in the first place. Customer image means quite a bit.


Agreed, and it shouldn't be hard to explain the benefits even to non-technical people. It's like doing a fire drill: if you do it frequently, then when the actual fire happens you will know what to do. Similarly with infrastructure: it might not be good at handling rare events, but once these events are no longer rare, you will learn to handle them.

The biggest issue IMO is explaining the need to make things more resilient. Actually, the technical people (mainly developers) might be the biggest obstacle, because it adds more work for them (with no visible benefit to them, because when the application fails it's ops who get woken up).


"Chaos Monkey even periodically terminates itself."

Whoa, that's meta.


Just wait till it starts killing developers to test your bus factor.


Inspired by Chaos Monkey, I introduced a malloc chaos mechanism into our codebase: https://reviewboard.asterisk.org/r/4463/

Although designed originally to catch places where malloc failure wasn't being handled, it can also be used to randomly trigger other off-nominal portions of the code that might not otherwise be tested.
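The same idea carries over to other languages. A hedged Go analogue (not the linked Asterisk patch; CHAOS_FAIL_RATE is a made-up knob) that randomly fails a fallible call so the error-handling path actually gets exercised:

    package main

    import (
        "errors"
        "fmt"
        "math/rand"
        "os"
        "strconv"
    )

    // chaosRate comes from a hypothetical CHAOS_FAIL_RATE env var (0.0 to 1.0);
    // when unset it parses to zero, so injection is off by default.
    var chaosRate, _ = strconv.ParseFloat(os.Getenv("CHAOS_FAIL_RATE"), 64)

    // allocBuffer stands in for any fallible call (malloc, file open, RPC, ...).
    // With chaos enabled it randomly returns an error, forcing callers to
    // prove they handle the off-nominal path.
    func allocBuffer(n int) ([]byte, error) {
        if chaosRate > 0 && rand.Float64() < chaosRate {
            return nil, errors.New("chaos: injected allocation failure")
        }
        return make([]byte, n), nil
    }

    func main() {
        buf, err := allocBuffer(1 << 20)
        if err != nil {
            fmt.Println("handled gracefully:", err)
            return
        }
        fmt.Println("allocated", len(buf), "bytes")
    }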


I can see that Chaos Monkey adds selective pressure to ensure that systems evolve into a state where they can handle unexpected server outages.

But isn't there a danger that it also encourages maladaptations that come to rely on being regularly restarted by Chaos Monkey? I'm particularly thinking that you might evolve a lot of resource leaks that go unnoticed so long as Chaos Monkey is on the job.


Holy fuck that is some small font size. And the paragraphs aren't in paragraph tags... just hanging out between <div>'s with some <br>'s to keep them company.


Following in line with the jokes, it's like they looked at my company's production environment and said "Let's make that a tool!"


It looks like it's working too well. The site is unreachable for me.


Every time that URL comes up people try to access it from https but the site is only available from http... Fix your Firefox, it's clearly at fault here.


It's NoScript's new-ish defaults at fault.

See https://news.ycombinator.com/item?id=12256720


"Every time"?

If one random guy's complaining about a URL being unreachable and you're already seeing a pattern... is it at all possible the users aren't at fault?


To be fair it does happen frequently enough. I wouldn't say every time but often on Netflix posts. A bit of a sampling for you:

https://news.ycombinator.com/item?id=12269411

https://news.ycombinator.com/item?id=12217900

https://news.ycombinator.com/item?id=12038367

https://news.ycombinator.com/item?id=11771714


In Thaxll's defense, I've seen this a lot with Netflix blog posts on HN.


It's up.


Down for me: Secure Connection Failed

The connection to techblog.netflix.com was interrupted while the page was loading.


Looks like some sequence of events convinces Firefox that HSTS is set, and it always rewrites the request as https (which is not supported by the server) after that.


It's only available on http, not https.


I am receiving a DNS error when trying to reach the URL

C:\Windows\System32>nslookup http://techblog.netflix.com/ 8.8.8.8

Server: google-public-dns-a.google.com

Address: 8.8.8.8

* google-public-dns-a.google.com can't find http://techblog.netflix.com/: Non-existent domain


I'm not on Windows, but I'm pretty sure your problem here is the "http://" - try just:

nslookup techblog.netflix.com 8.8.8.8


remove the http:// part and try again?



