Hacker News
Netflix is Down (gigaom.com)
76 points by bavidar on Dec 25, 2012 | 78 comments



You'd think Netflix would have learned by now to move back to their own gear.

EDIT: Downvote away; it's practically dogma on HN to use AWS. How much downtime are people willing to tolerate for a "superior" technology? Sure, Amazon AWS has some great ideas and tech, but you might as well give up if your business depends on EBS in us-east-1 at all.


The gear isn't the problem: the dependency on a single data center is the problem. It requires a lot of software engineering effort to maintain a service that works when a data center suddenly goes away. To be fault-tolerant, Netflix has to do this engineering regardless of whether or not they own the servers. But if they use Amazon, they don't have to actually fix the servers when they break, freeing up engineering resources to fix the software.

(Why doesn't Amazon offer transparent replication? Because the price of replication is unbelievably high and most people can't believe how high it is. If you want to write to 12 replicas around the world, budget five seconds for your transaction to complete. Compare this to a MySQL database on an SSD that can do millions of writes on the same entity group in the same amount of time.

This doesn't even include the cost: 12x the storage cost, and 12x the bandwidth cost for data going from your frontend to all of the backends.)


You're blowing this a little out of proportion.

Replicating a few petabytes of static videos is not rocket science, nor is it cost-prohibitive for a company the size of Netflix. Nor is engineering a system that can withstand a datacenter outage, especially for a system as trivial as Netflix's, which is largely read-only. Thousands of systems of higher complexity are engineered to that standard, and many of them are much larger than Netflix.

You rarely hear of Google outages, or iTunes, or Youtube, or [insert six dozen other popular brands], do you? Yes it can happen to the best of them, but the EC2 outages are really piling up lately.


> You rarely hear of Google outages, or iTunes, or Youtube, or [insert six dozen other popular brands], do you? Yes it can happen to the best of them, but the EC2 outages are really piling up lately.

To me, this makes perfect sense. Google products are almost always designed to tolerate data center failures, and so you don't hear about data center failures affecting Google products. On the other hand, Joe's Random EC2 app is not designed for the same high availability, and so it's down whenever Amazon is. (This, incidentally, is a reasonable trade-off for most people. A few hours of downtime a year is often much cheaper than paying engineers to ensure that those few hours are five minutes or less.)

So the flaw here is not Amazon. The flaw is relying on one cluster of computers to handle all your computing needs. Amazon could make their product high-replication by default, but it would be so slow and expensive that nobody would use it. So the task is kicked to the application developers instead of the platform developers, and you should consider holding them responsible for the downtime.

(Incidentally, you can compare EC2 to AppEngine here. How often is your favorite AppEngine app down? Less than your favorite EC2 app, probably, because AppEngine pretty much forces the high-replication datastore, even though the semantics are strange for developers used to traditional single-homed or master/slave architectures. And if you read the AppEngine discussion groups, you'll see that users raise most of the concerns that I describe here: "Why is AppEngine more expensive than EC2?", "Why is AppEngine slower than EC2?", and so on.)

Now, why does Netflix's video serving go down when an EC2 cell does? I have absolutely no idea. I do know that they do more than just stream bytes, though; they have to authenticate users, track what they're watching, and apply DRM to the streams. So there's that.

If you want fault-tolerant video distribution, look no further than the Pirate Bay. Although their servers are frequently seized, the video bits keep flowing. That's by design, not by accident.


> You rarely hear of Google outages

http://techcrunch.com/2012/12/10/gmail-experiences-a-widespr...

> ...or iTunes

http://appleinsider.com/articles/12/11/19/itunes-match-down-...

> ...or Youtube

http://abcnews.go.com/blogs/technology/2012/10/youtube-goes-...

Everyone goes down, but EC2 has been buggy as hell from what I can tell.

And before everyone says Netflix should "just" move or add more bandwidth...

http://www.nbcnews.com/technology/technolog/netflix-uses-32-...

They represent a huuuge amount of bandwidth usage!


So work with eyeball networks and move your gear to the edge; dumping it all in AWS and hoping for the best is failing (at best). Do all the auth/billing/recommendations/transcoding in AWS, and push all the video content out to boxes at the ISPs.


That video-delivery bandwidth is streamed from a CDN outside AWS.


> Especially one as trivial as Netflix which is largely read-only.

I stopped reading after this, since you obviously don't understand how complex their architecture really is.


I work for a media streaming site (rank <30k on Alexa) and I assure you the architecture doesn't change much when you scale it up to rank 99 (Netflix).

"Trivial" may sound harsh but in terms of large scale systems it really doesn't get much simpler than serving static files. Obviously their auxiliary services (billing, content ingestion, social etc.) are nowhere near trivial. But there's no reason for the core services that are required to get the catalog and video to most[1] devices to not be damn near 100% available.

I'm not talking down Netflix for having an outage anyway. Shit happens, and afaik their overall track record is not bad at all. I only replied to jrockway's claim that multi-datacenter redundancy would require inconceivable engineering effort or amounts of money; neither is true.

[1] Excluding those that need realtime transcoding, absurd DRM schemes or similar.


How can you assure us of what the #99 system is like when you admit to only working on the ~30k system? That's not what "assure" means.


We actually talk to other people in the industry sometimes, imagine that...


I'm never one to trivialize another company's problems so would hesitate to say Netflix's case is easy, but you cannot counter this statement by saying their architecture is complex. I can build a complex architecture for "Hello World" if I wanted to. Complex architecture != complex problem.


Netflix: spitting out 1 of 120 different encoded versions of a video to a client device after it's been authenticated.

So, you've got an authentication layer, which can pass a token to the browser or physical hardware client access device. This should be very lightweight, no? Even with millions of users, this database shouldn't be enormous.

And then you've got your encoded libraries. This should be data spit off platters on CDN networks. CDNs aren't hard anymore. Cloudfront is cheap, or you can go with Akamai, Limelight, Level3, etc. Properly built, you should never NOT be able to serve from somewhere in the CDN, even if the serving location isn't optimal.

Yes, this is HEAVILY simplified. I've left out recommendations, their encoding process/infrastructure, etc. This is not atomic data that is being replicated in realtime; these are videos that are lazily encoded, stored, and then served on demand.

It should not be this hard for this use case.


Just poking in on this: does anyone have a graph of outages over time? I'm wondering if it has been increasing or decreasing, and therefore whether Amazon is getting better at managing its demand. Bonus points if it's compared against the size of the customer base over the same period.


Because systems only crash when they're maintained by other people?


Assumption: your business uses Amazon's EBS (because, heh, you need to store your data somewhere between EC2 instance reboots/creation/destruction). EBS is down. What do you do? Hope AWS engineers get it back up fast? You can't do a damn thing about its reliability; you're stuck with whatever reliability Amazon has decided to deliver (which I think we can agree is much less than a revenue-generating company can depend on).

When you build your own system, you decide how reliable you want it. Otherwise, you can use Amazon and just live with the continual problems.


> When you build your own system, you decide how reliable you want it.

Ha! You mean that you discover how hard and expensive it is to build your own reliable infrastructure or how hard it is to hire people to do it for you.


No offense, but I used to work at a DoE lab on data-taking from the Large Hadron Collider CMS detector.

It's not as expensive as you'd think. We were dealing with hundreds of petabytes of data, and customers much more demanding than your consumer paying $16/month to stream movies.


How big was your team? How much were your IT costs?


10 people; 4 people on the actual sysadmin/operations side, 6 people who managed the higher application layer for the distributed filesystem, job management, etc.

If you're just talking the spinning disk and servers, probably ~$5-6MM/year. If you include the StorageTek tape archives, add in another $5MM-10MM.

Netflix operating income for FY2011 was $376 million.


It's worth asking "Compared to what?"

Unless my memory of their business model is incorrect, you pay a subscription fee for access to their stuff. Same as their competitors; in the case of the one I know best, the DirecTV my parents use, heavy rain knocks out the K-band-based transmission. Cable companies are also known for providing something less than five nines of uptime.

Netflix might be quite willing to accept these occasional outages in exchange for whatever else they perceive AWS as providing them.


How many AWS outages has Netflix survived?


Soon it will be news when the N. Virginia services are actually up! I don't want to be too hard on Amazon, because what they've built is pretty amazing, but I really have to wonder what's different about the N. Virginia site... and why they're not having similar failures at the other sites.


Based on some rough second-hand estimation [1], it is an order of magnitude larger than any other AZ and is about as big as all of them combined.

[1]: https://huanliu.wordpress.com/2012/03/13/amazon-data-center-...


What is an "AZ"? Availability Zone? "AZ" is not a very Google-friendly acronym. <:)


You guessed correctly!


If I remember correctly, the N. Virginia site is the default for EC2 instances, and is much much larger than the other availability zones. I've even heard that it's bigger than all the other zones combined.

However, I'd love for someone to corroborate my statement.


That's certainly true for the public IP spaces. It's possible that the other regions have a higher proportion of instances in VPC which don't have public IP addresses, or a higher proportion of larger instance types, though.


Not all instance types are available in all zones, and some of the smaller ones are only available in this one.


us-east-1 is much, much larger than the other regions. Many more services use it, which is why when it has issues you actually notice the effects.


Netflix is down because Amazon is down, so we had to watch shows on Amazon (prime) instead. Funny how that works!


I had the same experience. When watching movies with my SO tonight, I explained that Netflix uses AWS and that AWS was experiencing problems. We started watching a movie on Amazon prime and she asked "So I guess Amazon Video isn't using AWS?"


I tried to watch a movie rented from Amazon over Roku last night (not Prime) and it was unwatchable. It would get stuck on some scene for seconds to minutes, occasionally letting loose a bit of audio.

The movie was Inception. Maybe there's some irony in there.


Someone's Christmas Eve just went south. Godspeed.


Seriously. Instead of getting angry at Amazon I'm going to wish them a happy Christmas Eve and hope their collective pagers go quiet soon.


http://status.aws.amazon.com/

Amazon CloudSearch (N. Virginia) - CloudSearch issues

"6:00 PM PST We are continuing to investigate the issue. Domain creation and indexing operations continue to be unavailable. Changes to existing domains may have severe delays in being processed."

Amazon Elastic Compute Cloud (N. Virginia) - Elastic Load Balancer issues

"5:49 PM PST We continue to work on resolving issues with the Elastic Load Balancing Service in the US-EAST-1 region. Traffic for some ELBs are currently experiencing significant levels of traffic loss."


This really sucks for the engineers. Not a fun way to spend xmas at all.


If you're Netflix, I'm sure you have staff/engineers at the office 24/7 during this time of year specifically. At least, I would hope they do.


Having been in a similar position in the past, I actually feel bad for all the people at amazon and netflix that'll need to work late tonight...


They are getting OT/Holiday pay I'm sure. So they will be ok.


Compensation isn't actually the point here...


I hope y'all have Home Alone on DVD…


The net says it is mostly on the East Coast, but I am in California just a few miles from Netflix HQ and it is down for me on my iPad. I should try the Apple TV, since they mentioned that it is working on other devices. I will be curious to hear the post-mortem of the outage from both AWS and Netflix. The timing, I am sure, doesn't help, with Christmas Eve around the corner and probably many engineers not reachable quickly.


Interesting to see that the outage is in fact all over the place as the comments below are showing: http://www.isitdownrightnow.com/netflix.com.html


Apple TV in St. Paul, MN failed, so I don't think the East Coast thing is true. Parents are not happy campers. I get the feeling all iOS devices are included.


If Netflix could deactivate clients based on device size, that'd be pretty neat. People on tablets (Netflix on my Nexus 7 is out) would usually be watching stuff alone (kids may huddle around an iPad though), and for Christmas, a movie on a large screen is probably more important to keep going.

Makes me think that apps should consider using push notifications to tell consumers about issues; e.g., instead of a cryptic error number, say something like "Netflix servers are overloaded at the moment; Netflix may not work on your mobile device."

Not sure if that will piss off people more, but something like that instead of an error number would probably be better.


Interesting that Heroku rode this one out. Any Heroku engineers want to tell us your LB setup? Do you use ELB?


Heroku did not ride this one out. I have an app right now behind a failed ELB.

https://status.heroku.com/incidents/479

Edit: Make that two apps.


Oh right. I just checked Heroku.com.


heroku.com is up for me (near NYC) as are the Heroku-hosted sites we have at work.


It's not every app for me either. It appears to be a subset of apps that use the ssl:endpoint add-on to implement HTTPS.


And it's not all ssl:endpoint sites either. We've been up through the whole thing, which is critical seeing as we're a gifting company that provides an easy last-minute gift!


Great timing!


Amazon movie rental over Roku is cutting in and out, and I can't watch Netflix (over Roku) at all. Colorado.


A pedantic note: in the title, "EVE" should be capitalized as "Eve".


Are they insured against a contingency of this calibre? Does Amazon provide/offer such insurance?


Amazon EC2 has a Service Level Agreement (SLA) [1] that guarantees 99.95% uptime in a rolling 365-day period and provides for service credits. Of course, a huge customer like Netflix can and probably did negotiate its own SLA terms.

[1]: http://aws.amazon.com/ec2-sla/


I was wondering if they'd pay damages and how would that be calculated.

I don't think free service for a while would cut it with a contingency of this sort. Maybe someone has first-hand information of their contract (or any other big player) and can answer to this publicly.


I don't know the Netflix contract but I've been involved with a few big (telco) SLAs. The penalties are usually calculated with a points-system and a multiplier that raises according to the duration/impact of an outage.

E.g. the first 15 minutes of an outage may cost 1 point per minute, 15-60 minutes 2 points, and so on. You also have multipliers for the severity (partial or full outage, customer impact, affected countries, etc.), time-of-day, and so on.

Collected points may then later be traded in for dollars or a nice lawsuit.

Corporate lawyers love to go nuts on these things, an enterprise SLA can easily span a hundred pages of legalese.


You would think they would have learned to build redundancy into these types of systems.


They have been working hard on this very issue:

http://techblog.netflix.com/2012/07/chaos-monkey-released-in...

It'll be interesting to see what failed when they put out a statement.


Usually Netflix does well when AWS has an outage... at least that was the case in the past, so it really piqued my curiosity as to what's different between this outage and the others. In addition, it will be interesting to know whether the timing (just before Christmas) affected the responsiveness of the engineers/teams at both Amazon and Netflix.


The difference is that elastic load balancers are the things that actually implement the redundancy in AWS, and if one goes down and another can't be immediately started in its place, no traffic will reach the instances that sit behind it.

I believe AWS has been working on a solution to this, but thus far hasn't released anything.


Clearly the work of "The Grinch who Stole Christmas...."


I just got finished watching something on Netflix?


You don't seem too sure of yourself.


Cloud with a single point of failure. Merry XMas!


How is anyone supposed to enjoy the holidays with their families if they aren't consuming media?


1) some families like to watch stuff together

2) not all netflix subscribers have families


Fair enough. I for one enjoy expressing anti-television sentiments, with and without company.



Good holidays to you, then.


You too! Sorry if my above absurdism comes off as offensive.


We're fresh out of chestnuts here, but thankfully Halo 4 matchmaking is online.


What is the purpose of Xmas Eve if you spend it watching TV just like any other eve? Just saying.


For many of us, watching a movie on TV is anything but just another evening.

Everyone is different, and no one needs to justify their time to anyone.


A lot of people (including my family) watch Christmas movies and gorge themselves on food and drink.


Doing normal things, but with family.



