Netflix: Lessons We’ve Learned Using AWS (netflix.com)
256 points by fukumoto on Dec 16, 2010 | 62 comments



Sounds like the AWS architecture caused Netflix to write better code (read: more durable, more fault-tolerant). Fewer assumptions baked into the code, and it will be easier to port to a new data center/cloud architecture if AWS doesn't meet their needs.

As Netflix continues to scale, these changes will make managing that growth much easier.

A lot of you seem to take this post as being negative toward the AWS architecture. I take it more as a good collection of common things that you need to watch out for in distributed environments, specifically the danger of assumptions about your current infrastructure, which may change dramatically as you scale.


Their "Chaos Monkey" approach reminds me of an excellent paper on "Crash Only Software": http://goo.gl/dqDII

The best way to test the uncommon case is to make it more common.


Please do not use URL shorteners when they are not needed. They are obfuscating.

Actual URL is http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.... and that is a direct link to a PDF.


When pg implements proper hyperlinking that doesn't clutter up my text with an ugly URL, I'll stop using a shortener. I understand the positives and negatives of using a shortener, but the positive impact on readability seems to outweigh the negatives when I'm confronted with an ugly link longer than the text I use to introduce it. Sorry if that offends you.

I should have noted that my link was a PDF, though. My apologies for that oversight.


The HN community has long used footnotes to help with readability [1].

[1] http://www.google.com/search?q=%22[1]%20http%22+site:ycombin...


Interesting paper, but for future reference, please post non-shortened links here. Thanks!


It would be fantastic to see Chaos Monkey open-sourced, but it's probably a very domain-specific app that only works in the context of their infrastructure.

Some great ideas to be gleaned from the paper you've provided - thanks!


It'd be incredibly easy to implement a v1 as a small shell script. AWS has "security groups", which are generally used to split up the roles of machines. (They're used for firewall rules, but in practice having these serve as role metadata is quite useful as well.)

Chaos Monkey is just a script that runs "ec2-terminate-instances" commands on nodes in certain security groups at a certain rate, and boots replacement machines at the same rate.
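
Something like this minimal Python sketch captures the idea (boto3 is a modern SDK, obviously not what Netflix used; the security group name, region, and kill interval are made up, and booting replacements is assumed to be handled by an Auto Scaling group rather than the script):

    import random
    import time

    import boto3  # modern AWS SDK, used here purely for illustration

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

    def running_instances(group_name):
        # IDs of running instances belonging to the given security group
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "instance.group-name", "Values": [group_name]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        return [i["InstanceId"]
                for r in resp["Reservations"]
                for i in r["Instances"]]

    def unleash_monkey(group_name="web", interval_seconds=3600):
        # Every interval, terminate one random instance in the group.
        # Replacement capacity is assumed to come from an Auto Scaling group.
        while True:
            victims = running_instances(group_name)
            if victims:
                victim = random.choice(victims)
                print("Chaos Monkey terminating", victim)
                ec2.terminate_instances(InstanceIds=[victim])
            time.sleep(interval_seconds)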

Of course, the devil is in the details, but at a high level this wouldn't be too specific to the domain the infrastructure is built around.


The complexity of a system like the one Netflix probably runs comes from the number of instances and, most importantly, probably the number of accounts they have. I strongly doubt all the instances they have are running under a single AWS account.

Agreed that implementing a similar system for a single AWS account is not going to be very difficult.


This reads like the 'fallacies of distributed computing' paper (http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Comput...).

While the likelihood of failure (or added latency, disruptive upstream changes, etc.) is greater in a large-scale distributed environment you do not control than in your home-grown datacenter, those scenarios are just facts of life in distributed environments.

An awesome side effect of hosting an app in a cloud environment is that you must face up to those fallacies immediately or they'll eat you alive.


Paying large engineering costs upfront is "awesome"?


Ignoring the fallacies of distributed computing in web applications is akin to technical debt. You're going to have to pay it eventually, and as the metaphor suggests, it's cheaper to pay it off sooner rather than later.


That's probably true for Netflix, but it may not be for people with a short runway who want to "fail fast".


There are lots of advantages and disadvantages to hosting in a "cloud" environment, so I'm not trying to paint it as rainbows and ponies.

If a company has a short runway (i.e. not a lot of cash in the bank), however, hosting in the cloud essentially means renting capacity that you do not need to maintain. Sounds good to me.

It's a fun pendulum to watch swing though, I'll admit. Some companies, with small runways, host in the cloud because it's rented capacity without too much administration. They may grow wildly, then notice they are paying the "cloud tax" and could save considerable money by hosting themselves. They grow some more, see how much money and attention they need to put into infrastructure and how slow it is for their business to increase capacity. In order to keep up with growth and maintain focus on their core business, they move back to the cloud.


Huh?

Netflix is a going concern, and has been for quite some time. Even the streaming business is a few years old.

It may be an up front investment since the full benefits may take some time to accrue, but it is an investment made based on a great deal more information than a typical startup has.

Further, it sounds like an investment that is providing them with an immediate and valuable benefit: they now have a foundation for ramping up their streaming business. This is going to be a big growth area, and it is one that they are well positioned to succeed in. Or, you could flip it around: failure to secure a strong position in the streaming business will be the death of the company, and a squandering of the business equity they have been building, since their founding, to be well positioned for this transition.

And finally, if you have doubts about the value of making up-front engineering investment in order to reduce forecastable operating costs (and avoid what is probably an even larger up-front investment in datacenter buildout), well, what are you doing here?


I'm pretty sure "session-based memory management" should be "memory-based session management", ie they kept user session state in memory.


Maybe they rewrote malloc to make an HTTP call to a session provider which keeps track of the allocated memory on the machine? :)


I want a Chaos Monkey, too!

Actually, that was my first reaction, but after thinking for a moment, it isn't really a reliable way to test. If you make changes to something, you don't know for sure whether the Chaos Monkey hit while you were testing a certain thing. Proper unit tests would seem to be a lot more useful.


It's not meant to be a reliable way to test. Unit tests wouldn't help the problems that Chaos Monkey is meant to solve.

Chaos Monkey is meant to force apps to go into failure modes "in the real world". If they see anything unexpected happening after a service is killed or degraded, they can investigate.

It sure beats having a service go down unexpectedly for the first time six months in, without ever having tested that scenario in production.


Unit tests and fault injection tests catch different but overlapping sets of errors. Chaos Monkey is a terrific idea and I commend Netflix for implementing it early on. If you're not willing to randomly kill hosts, then you're not confident that your distributed system really works.

This excellent paper[1] by James Hamilton (then MS, now AWS) recommends never doing clean shutdowns on applications - just kill them. Unless you have a lot of persistent state to manage this is a great idea. If it takes hours to migrate multiple TB off your storage host, maybe not such a good idea.

1. http://www.usenix.org/event/lisa07/tech/full_papers/hamilton...


How would you write unit tests to test against services not responding and routing issues between your different instances?


That's what mock objects are for. You mock the object that would normally call the remote system, have it artificially fail, and use that to test clients of that object.
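
A minimal sketch of that pattern in Python with unittest.mock; popular_titles and fetch_titles are hypothetical stand-ins for whatever object wraps the remote call in your code:

    import unittest
    from unittest import mock

    # Hypothetical client code: fall back to a cached list when the remote service fails.
    def popular_titles(service, cache):
        try:
            return service.fetch_titles()
        except ConnectionError:
            return cache

    class PopularTitlesTest(unittest.TestCase):
        def test_falls_back_to_cache_when_service_is_down(self):
            service = mock.Mock()
            service.fetch_titles.side_effect = ConnectionError("service unreachable")
            self.assertEqual(popular_titles(service, ["cached title"]), ["cached title"])

    if __name__ == "__main__":
        unittest.main()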


So you only test those cases you have thought of as able to fail (which is compulsory to test anyway). The purpose of the Chaos Monkey is to trigger situations you haven't dreamed of.

OK, it ain't perfect. Things usually don't fail in a uniformly distributed way, and you can't be sure you won't see a different kind of failure in production, but it sounds useful (yeah, and pretty cool) nonetheless.


Why are people replying to me as if I ever said that Chaos Monkey was a bad idea? I was responding to a question about how to unit test remote system failures. Obviously that's not the only testing you're going to do.


The problem is "artificially fail". You need to ensure that the systems work in production even after real failures.


SQLite [1] uses a test harness to do this:

"SQLite responds gracefully to memory allocation failures and disk I/O errors. Transactions are ACID even if interrupted by system crashes or power failures. All of this is verified by the automated tests using special test harnesses which simulate system failures."

[1] http://www.sqlite.org/about.html


If Chaos Monkey hits during a test, you just run the test again. Multiple tests are a good thing, after all, and the failure might expose a weakness you wouldn't have seen otherwise.


Basically the gist is: You need to be prepared for anything to stop working at any time.

The tone of this post indicates to me that the criticism and problems experienced by Netflix with AWS are understated, which I can understand given their position as a flagship AWS customer, etc.


You need to be prepared for halting failures and variable performance. Moving from higher-end, dedicated hardware to lower-end, shared hardware means an increase in the failure rate and in the variance of your latencies. In exchange you get elastic scaling, you convert capital costs to operational costs (no more building datacenters), and you hand the "undifferentiated heavy lifting" to AWS. The posts aren't, IMO, understating problems but demonstrating the tradeoffs that come with AWS' core value proposition.


Including your whole AWS account, if Amazon doesn't like you for some reason, or if you attract too many attacks and disrupt their network.


Interesting: "The Chaos Monkey’s job is to randomly kill instances."

Another way to say "If it ain't tested, it's broken".


Hardware is always going to fail eventually. Moving to AWS caused Netflix to write better code to deal with these failures.

Failures were always going to happen, even in their own datacentre. What they have now is a more fault-tolerant system which should have less downtime overall.


If you do decide to adopt your very own pet Chaos Monkey in your next project, make sure you ARE able to gracefully degrade your service in case of failures. Otherwise your customers will see the monkey in action, manifested by "we'll be back shortly" messages. It's easier said than done, since a lot of the time we all forget to write (or feel too lazy to write, or have no idea how to properly handle) the "else" branches for errors, unavailable services, or unreachable databases.

Otherwise, good idea. It forces you to think about the perils of a distributed environment from the very beginning, as opposed to leaving it as an afterthought.


Lesson they have not yet learned: including an HTML title tag in their Blogger templates.


I love the idea of setting up a fully working system on AWS, then repeating all traffic from your live site over to it to see how it stands up under load.

No need to simulate traffic for testing purposes. Here's our actual traffic. All of it.

Nice.


Reading both this entry and the one that explained why they went with AWS, I'm left confused about why they ever went to AWS in the first place.


Having built a non-trivial production system, I have a love-hate relationship with AWS. I love the possibilities and potential, but I'm deeply disappointed when it fails to live up to the hype.

It's just a matter of alternatives. There's frankly no other IaaS out there that can match the features of EC2 (e.g. EBS, elastic IPs). These guys are unstoppable; they come out with more features every week.


Yeah, EC2 is the best cloud but it's a shame that it's still missing features that ESX 2.0 had (and Amazon has shown no interest in fixing it).


Wes, what features are you looking for?


Not the original poster, but here's some stuff I really wish AWS had:

a) Ability to map an elastic IP to an ELB. The CNAME thing it uses now has way too many drawbacks [1].

b) Ability to make an RDS instance a slave to or a master for a normal MySQL instance. This would make it possible to use RDS as a backup to our ordinary DB infrastructure using normal MySQL replication (and eventually, vice versa)

c) Retention periods for EBS snapshots. You can get this yourself by writing a few simple scripts (see the sketch after this list), but it would be really nice if snapshots could simply be labelled "delete after 90 days".

d) A cross-availability zone, synchronously replicated EBS volume. This is probably fairly specific to our use case, but it would be neat if AWS natively provided something like DRBD.

[1] http://blog.pagerduty.com/2010/08/31/load-balancers-need-sta...
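
For what it's worth, the "few simple scripts" version of (c) can be about this small; a Python/boto3 sketch, where the region and the 90-day cutoff are illustrative assumptions and pagination is ignored for brevity:

    from datetime import datetime, timedelta, timezone

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)

    # Only look at snapshots owned by this account (first page only, for brevity).
    for snap in ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]:
        if snap["StartTime"] < cutoff:
            print("Deleting", snap["SnapshotId"], "from", snap["StartTime"])
            ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])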


It's not me in particular, but if you read the EC2 forum going back to 2006 people have been asking for full virtualization, multiple IPs, multicast, non-NAT, reverse ARP, shared EBS, etc.


Disk I/O performance that doesn't suck.

Let us add DevPay instances to an ELB.

More ram.

Elastic private IP addresses.

Change security groups of running instances.

I have a lot more, but those are my big ones.


While it is true that you cannot add new security groups to a running instance or remove existing ones, you can modify the rules of an existing security group on a running instance: http://aws.amazon.com/articles/1145?_encoding=UTF8&jiveR....
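
Roughly, the idea (a hedged boto3 sketch, not the API from the linked article; the group name and port are made up): you edit the group's rules, and every running instance in that group is affected immediately, since groups are referenced rather than copied.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

    # Open port 443 on an existing group; instances already running in this
    # group pick up the new rule without any change to the instances themselves.
    # (In a VPC you would pass GroupId instead of GroupName.)
    ec2.authorize_security_group_ingress(
        GroupName="web",  # hypothetical group name
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        }],
    )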


Do you work with AWS?


Me too. This feels like "oooh, shiny!" from high levels, not engineering needs. You shouldn't outsource your core competency, and this feels like what they are doing.

I say "feel" advisedly. I don't have inner knowledge. It's just that the explanation up to this point doesn't seem all that compelling; for what seems like rather dubious benefits it seems like they've taken on an awful lot of risks they have little ability to manage themselves.


This came up in the first entry. Netflix would apparently dispute your take on their core competency.

http://techblog.netflix.com/2010/12/four-reasons-we-choose-a...

"The problems [Amazon] are trying to solve are incredibly difficult ones, but they aren’t specific to our business. Every successful internet company has to figure out great storage solutions, hardware failover, networking infrastructure, etc."

Their third argument for virtualized infrastructure was "We're not very good at predicting customer growth or device engagement", which strikes me as a compelling problem to solve. Having to change their approach to managing service dependencies and failures doesn't seem like too high a price to pay.


Yes, I read the first entry. The use case that particularly came to mind when I said it seemed like they were outsourcing their core competency was actually the location of data centers and other such physical constraints, like peering arrangements. They're stuck with Amazon's decisions on that front. Granted they're probably large enough to get Amazon to do whatever it is they need to do, heck they can probably all but simply purchase a new data center and announce to Amazon that Amazon now has a new data center to manage :), but it still seems like an odd choice of control to release.


"the location of data centers and other such physical constraints, like peering arrangements"

This is not Netflix's core competency either. This is the core competency of a CDN. If you have problems like this, you should almost certainly hire one, and architect accordingly.

Netflix has long since done so:

http://blog.streamingmedia.com/the_business_of_online_vi/201...


Actually Netflix now uses Level3's (level3.com) CDN: http://www.level3.com/index.cfm?pageID=491&PR=958


Notice how they say "primary content delivery network": primary, not sole.

Netflix actually uses many CDNs, including Akamai and L3.


No, they still use Akamai (dig cdn-0.nflximg.com to verify), but they are planning on moving to Level3 (assuming the Comcast/Level3 fight doesn't sour that deal too badly).


Sounds like the initial investment in re-engineering their software to handle distributed environments well turned out to be quite large, which is why it does sound a bit like they brought the troubles upon themselves. But presumably they now have two advantages: the ability to scale horizontally, and software adapted to distributed environments (which means that technically they are not limited to AWS, in case problems arise with Amazon's service). Sounds like the foundation of their house is stronger now, which was probably the whole point of the exercise.


I think they alluded to it in the first post. When they rolled out Xbox 360 streaming, they severely underestimated demand. It was nearly impossible to get a full-quality stream for 2-3 months after the rollout.

Now with AWS, they can spin up additional instances at will.


Note that Netflix does not use AWS for streaming.


Can you please provide a reference?


In short: Because restaurants don't farm their own crops or raise their own livestock. They cook.


Analogies merely confuse (they always have that superficial "aha!" moment, yet never bear scrutiny). If it must be, however, in this case it's more like a restaurant that farmed out its kitchen and its wait staff but carefully crafted the menu, chose the silverware, and did all of the marketing.


That is an interesting business proposal.


It seems to be the franchise model: you buy the land and hire the grunts, we'll give you the Big Book O' Brand and handle the marketing.


I'll bet other companies (e.g. Heroku, Dropbox) that use AWS/EC2 would have similar things to say.

I did have this one question, being a guy with an IT background: they expected stability? Really? I always expect host/app/system failure, and am pleasantly surprised when it doesn't happen.


My take on their experience was that they had only sort of abstractly thought about stability, and then they severely underestimated the variance in performance. Before moving to Amazon, stability was something you worked at in the lower levels, e.g., hardware, power, etc. Once you move to Amazon, stability cannot be controlled at a lower level, so you have to make your application code more robust.

With your own IT infrastructure, you'd logically put the focus on improving the reliability of your hardware stack, rather than just focusing on handling those issues in software.


One difference is that Heroku and similar are fairly young companies, whose infrastructure was largely built in "the cloud" and for "the cloud." Whatever preconceptions the builders had about the characteristics of the underlying environment, they started learning the reality and adapting to it relatively early in the project.

Netflix is a sizable and successful business. It's more than a decade old. Their streaming business launched only 4 years ago: more than 6 months before EC2's general availability and almost 2 years before EBS. It would be even longer before any of it was a reasonable choice for a company of their size, launching a new service built off their existing customer base, one that had to succeed to allow the company to navigate a major transition it had been planning for years.

The similarities and differences in their experiences would be interesting and probably informative, particularly if considered against the vastly different starting points.



