Hacker News
Netflix is Down (gigaom.com)
76 points by bavidar on Dec 25, 2012 | 78 comments



You'd think Netflix would have learned by now to move back to their own gear.

EDIT: Downvote away; it's practically dogma on HN to use AWS. How much downtime are people willing to tolerate for a "superior" technology? Sure, Amazon AWS has some great ideas and tech, but you might as well give up if your business depends on EBS in us-east-1 at all.


The gear isn't the problem: the dependency on a single data center is the problem. It requires a lot of software engineering effort to maintain a service that works when a data center suddenly goes away. To be fault-tolerant, Netflix has to do this engineering regardless of whether or not they own the servers. But if they use Amazon, they don't have to actually fix the servers when they break, freeing up engineering resources to fix the software.

(Why doesn't Amazon offer transparent replication? Because the price of replication is unbelievably high and most people can't believe how high it is. If you want to write to 12 replicas around the world, budget five seconds for your transaction to complete. Compare this to a MySQL database on an SSD that can do millions of writes on the same entity group in the same amount of time.

This doesn't even include the cost: 12x the storage cost, and 12x the bandwidth cost for data going from your frontend to all of the backends.)


You're blowing this a little out of proportion.

Replicating a few petabytes of static videos is not rocket science, nor is it cost-prohibitive for a company the size of Netflix. Nor is engineering a system that can withstand a datacenter outage, especially for a system as trivial as Netflix's, which is largely read-only. Thousands of systems of higher complexity are engineered to that standard, and many of them are much larger than Netflix.

You rarely hear of Google outages, or iTunes, or Youtube, or [insert six dozen other popular brands], do you? Yes it can happen to the best of them, but the EC2 outages are really piling up lately.


> You rarely hear of Google outages, or iTunes, or Youtube, or [insert six dozen other popular brands], do you? Yes it can happen to the best of them, but the EC2 outages are really piling up lately.

To me, this makes perfect sense. Google products are almost always designed to tolerate data center failures, and so you don't hear about data center failures affecting Google products. On the other hand, Joe's Random EC2 app is not designed for the same high availability, and so it's down whenever Amazon is. (This, incidentally, is a reasonable trade-off for most people. A few hours of downtime a year is often much cheaper than paying engineers to ensure that those few hours are five minutes or less.)

So the flaw here is not Amazon. The flaw is relying on one cluster of computers to handle all your computing needs. Amazon could make their product high-replication by default, but it would be so slow and expensive that nobody would use it. So the task is kicked to the application developers instead of the platform developers, and you should consider holding them responsible for the downtime.

(Incidentally, you can compare EC2 to AppEngine here. How often is your favorite AppEngine app down? Less than your favorite EC2 app, probably, because AppEngine pretty much forces the high-replication datastore, even though the semantics are strange for developers used to traditional single-homed or master/slave architectures. And if you read the AppEngine discussion groups, you'll see that users raise most of the concerns that I describe here: "Why is AppEngine more expensive than EC2?", "Why is AppEngine slower than EC2?", and so on.)

Now, why does Netflix's video serving go down when an EC2 cell does? I have absolutely no idea. I do know that they do more than just stream bytes, though; they have to authenticate users, track what they're watching, and apply DRM to the streams. So there's that.

If you want fault-tolerant video distribution, look no further than the Pirate Bay. Although their servers are frequently seized, the video bits keep flowing. That's by design, not by accident.


> You rarely hear of Google outages

http://techcrunch.com/2012/12/10/gmail-experiences-a-widespr...

> ...or iTunes

http://appleinsider.com/articles/12/11/19/itunes-match-down-...

> ...or Youtube

http://abcnews.go.com/blogs/technology/2012/10/youtube-goes-...

Everyone goes down, but EC2 has been buggy as hell from what I can tell.

And before everyone says Netflix should "just" move or add more bandwidth...

http://www.nbcnews.com/technology/technolog/netflix-uses-32-...

They represent a huuuge amount of bandwidth usage!


So work with eyeball networks and move your gear to the edge; dumping it all in AWS and hoping for the best is failing (at best). Do all the auth/billing/recommendations/transcoding in AWS, and push all the video content out to boxes at the ISPs.


That video-delivery bandwidth is streamed from a CDN outside AWS.


> Especially one as trivial as Netflix which is largely read-only.

I stopped reading after this, since you obviously don't understand how complex their architecture really is.


I work for a media streaming site (rank <30k on Alexa) and I assure you the architecture doesn't change much when you scale it up to rank 99 (Netflix).

"Trivial" may sound harsh but in terms of large scale systems it really doesn't get much simpler than serving static files. Obviously their auxiliary services (billing, content ingestion, social etc.) are nowhere near trivial. But there's no reason for the core services that are required to get the catalog and video to most[1] devices to not be damn near 100% available.

I'm not talking down Netflix for having an outage anyway. Shit happens, and afaik their overall track record is not bad at all. I only replied to jrockway's claim that multi-datacenter redundancy would require inconceivable engineering effort or amounts of money; neither is true.

[1] Excluding those that need realtime transcoding, absurd DRM schemes or similar.


How can you assure us of what the #99 system is like when you admit to only working on the ~30k system? That's not what "assure" means.


We actually talk to other people in the industry sometimes, imagine that...


I'm never one to trivialize another company's problems so would hesitate to say Netflix's case is easy, but you cannot counter this statement by saying their architecture is complex. I can build a complex architecture for "Hello World" if I wanted to. Complex architecture != complex problem.


Netflix: spitting out 1 of 120 different encoded versions of a video to a client device after it's been authenticated.

So, you've got an authentication layer, which can pass a token to the browser or physical hardware client access device. This should be very lightweight, no? Even with millions of users, this database shouldn't be enormous.

And then you've got your encoded libraries. This should be data spit off platters on CDN networks. CDNs aren't hard anymore. Cloudfront is cheap, or you can go with Akamai, Limelight, Level3, etc. Properly built, you should never NOT be able to serve from somewhere in the CDN, even if the serving location isn't optimal.

Yes, this is HEAVILY simplified. I've left out recommendations, their encoding process/infrastructure, etc. This is not atomic data that is being replicated in realtime; these are videos that are lazily encoded, stored, and then served on demand.

It should not be this hard for this use case.


Just poking in on this: does anyone have a graph of outages over time? I'm wondering if it has been increasing or decreasing, and therefore whether Amazon is getting better at managing its demand. Bonus points if it's compared against the size of the customer base over the same period.


Because systems only crash when they're maintained by other people?


Assumption: your business uses Amazon's EBS (because, heh, you need to store your data somewhere between EC2 instance reboots/creation/destruction). EBS is down. What do you do? Hope AWS engineers get it back up fast? You can't do a damn thing about its reliability; you're stuck with whatever reliability Amazon has decided to deliver (which I think we can agree is much less than a revenue-generating company can depend on).

When you build your own system, you decide how reliable you want it. Otherwise, you can use Amazon and just live with the continual problems.


> When you build your own system, you decide how reliable you want it.

Ha! You mean that you discover how hard and expensive it is to build your own reliable infrastructure or how hard it is to hire people to do it for you.


No offense, but I used to work at a DoE lab on data-taking from the Large Hadron Collider CMS detector.

It's not as expensive as you'd think. We were dealing with hundreds of petabytes of data, and customers much more demanding than your consumer paying $16/month to stream movies.


How big was your team? How much were your IT costs?


10 people; 4 people on the actual sysadmin/operations side, 6 people who managed the higher application layer for the distributed filesystem, job management, etc.

If you're just talking the spinning disk and servers, probably ~$5-6MM/year. If you include the StorageTek tape archives, add in another $5MM-10MM.

Netflix operating income for FY2011 was $376 million.


It's worth asking "Compared to what?"

Unless my memory of their business model is incorrect, you pay a subscription fee for access to their stuff. Same as their competitors; in the case of the one I know best, the DirecTV my parents use, heavy rain knocks out the K-band-based transmission. Cable companies are also known for providing something less than five nines of uptime.

Netflix might be quite willing to accept these occasional outages in exchange for whatever else they perceive AWS as providing them.


How many AWS outages has Netflix survived?


Soon it will be news when the N. Virginia services are actually up! I don't want to be too hard on Amazon, because what they've built is pretty amazing, but I really have to wonder what's different about the N. Virginia site... and why they're not having similar failures at the other sites.


Based on some rough second-hand estimation [1], it is an order of magnitude larger than any other AZ and is about as big as all of them combined.

[1]: https://huanliu.wordpress.com/2012/03/13/amazon-data-center-...


What is an "AZ"? Availability Zone? "AZ" is not a very Google-friendly acronym. <:)


You guessed correctly!


If I remember correctly, the N. Virginia site is the default for EC2 instances, and is much much larger than the other availability zones. I've even heard that it's bigger than all the other zones combined.

However, I'd love for someone to corroborate my statement.


That's certainly true for the public IP spaces. It's possible that the other regions have a higher proportion of instances in VPC which don't have public IP addresses, or a higher proportion of larger instance types, though.


Not all instance types are available in all zones, and some of the smaller ones are only available in this one.


us-east-1 is much, much larger than the other regions. Many more services use it, which is why when it has issues you actually notice the effects.


Netflix is down because Amazon is down, so we had to watch shows on Amazon (prime) instead. Funny how that works!


I had the same experience. When watching movies with my SO tonight, I explained that Netflix uses AWS and that AWS was experiencing problems. We started watching a movie on Amazon prime and she asked "So I guess Amazon Video isn't using AWS?"


I tried to watch a movie rented from Amazon over Roku last night (not Prime) and it was unwatchable. It would get stuck on some scene for seconds to minutes, occasionally letting loose a bit of audio.

The movie was Inception. Maybe there's some irony in there.


Someone's Christmas Eve just went south. Godspeed.


Seriously. Instead of getting angry at Amazon I'm going to wish them a happy Christmas Eve and hope their collective pagers go quiet soon.


http://status.aws.amazon.com/

Amazon CloudSearch (N. Virginia) - CloudSearch issues

"6:00 PM PST We are continuing to investigate the issue. Domain creation and indexing operations continue to be unavailable. Changes to existing domains may have severe delays in being processed."

Amazon Elastic Compute Cloud (N. Virginia) - Elastic Load Balancer issues

"5:49 PM PST We continue to work on resolving issues with the Elastic Load Balancing Service in the US-EAST-1 region. Traffic for some ELBs are currently experiencing significant levels of traffic loss."


This really sucks for the engineers. Not a fun way to spend xmas at all.


If you're Netflix, I'm sure you have staff/engineers at the office 24/7 during this time of year specifically. At least, I would hope they do.


Having been in a similar position in the past, I actually feel bad for all the people at amazon and netflix that'll need to work late tonight...


They are getting OT/Holiday pay I'm sure. So they will be ok.


Compensation isn't actually the point here...


I hope y'all have Home Alone on DVD…


The net says it is mostly on the East Coast, but I am in California just a few miles from Netflix HQ and it is down for me on my iPad. I should try the Apple TV, since they mentioned that it is working on other devices. I will be curious to hear the post-mortem of the outage from both AWS and Netflix. The timing, I am sure, doesn't help, with Christmas Eve around the corner and probably many engineers not reachable quickly.


Interesting to see that the outage is in fact all over the place as the comments below are showing: http://www.isitdownrightnow.com/netflix.com.html


Apple TV in St. Paul, MN failed, so I don't think the East Coast thing is true. Parents are not happy campers. I get the feeling all iOS devices are included.


If Netflix could deactivate clients based on device size, that'd be pretty neat. People on tablets (Netflix on my Nexus 7 is out) would usually be watching stuff alone (kids may huddle around an iPad though), and for Christmas, a movie on a large screen is probably more important to keep going.

Makes me think that apps should consider using push notifications to tell consumers about issues; e.g., instead of a cryptic error number, say something like "Netflix servers are overloaded at the moment; Netflix may not work on your mobile device."

Not sure if that will piss off people more, but something like that instead of an error number would probably be better.


Interesting that Heroku rode this one out. Any Heroku engineers want to tell us your LB setup? Do you use ELB?


Heroku did not ride this one out. I have an app right now behind a failed ELB.

https://status.heroku.com/incidents/479

Edit: Make that two apps.


Oh right. I just checked Heroku.com.


heroku.com is up for me (near NYC) as are the Heroku-hosted sites we have at work.


It's not every app for me either. It appears to be a subset of apps that use the ssl:endpoint add-on to implement HTTPS.


And it's not all ssl:endpoint sites either. We've been up through the whole thing, which is critical seeing as we're a gifting company that provides an easy last-minute gift!


Great timing!


Amazon movie rental over Roku is cutting in and out, and I can't watch Netflix (over Roku) at all. Colorado.


A pedantic note: in the title, "EVE" should be capitalized as "Eve".


Are they insured against a contingency of this calibre? Does Amazon provide/offer such insurance?


Amazon EC2 has a Service Level Agreement (SLA) [1] that guarantees 99.95% uptime in a rolling 365-day period and provides for service credits. Of course, a huge customer like Netflix can and probably did negotiate its own SLA terms.

[1]: http://aws.amazon.com/ec2-sla/


I was wondering if they'd pay damages and how would that be calculated.

I don't think free service for a while would cut it with a contingency of this sort. Maybe someone has first-hand information of their contract (or any other big player) and can answer to this publicly.


I don't know the Netflix contract but I've been involved with a few big (telco) SLAs. The penalties are usually calculated with a points-system and a multiplier that raises according to the duration/impact of an outage.

E.g. the first 15 minutes of an outage may cost 1 point per minute, 15-60 minutes 2 points, and so on. You also have multipliers for the severity (partial or full outage, customer impact, affected countries, etc.), time-of-day, and so on.

Collected points may then later be traded in for dollars or a nice lawsuit.

Corporate lawyers love to go nuts on these things, an enterprise SLA can easily span a hundred pages of legalese.


You would think they would have learned to build redundancy into these types of systems.


They have been working hard on this very issue:

http://techblog.netflix.com/2012/07/chaos-monkey-released-in...

It'll be interesting to see what failed when they put out a statement.


Usually Netflix does well when AWS has an outage... at least that was the case in the past, so it really piqued my curiosity as to what's different between this outage and the others. In addition, it will be interesting to know whether the timing (just before Christmas) affected the responsiveness of the engineers/teams at both Amazon and Netflix.


The difference is that elastic load balancers are the things that actually implement the redundancy in AWS, and if one goes down and another can't be immediately started in its place, no traffic will reach the instances that sit behind it.

I believe AWS has been working on a solution to this, but thus far hasn't released anything.


Clearly the work of "The Grinch who Stole Christmas...."


I just got finished watching something on Netflix?


You don't seem too sure of yourself.


Cloud with a single point of failure. Merry XMas!


How is anyone supposed to enjoy the holidays with their families if they aren't consuming media?


1) some families like to watch stuff together

2) not all netflix subscribers have families


Fair enough. I for one enjoy expressing anti-television sentiments, with and without company.



Good holidays to you, then.


You too! Sorry if my above absurdism comes off as offensive.


We're fresh out of chestnuts here, but thankfully Halo 4 matchmaking is online.


What is the purpose of Xmas Eve if you spend it watching TV just like any other eve? Just saying.


For many of us, watching a movie on TV is anything but just another evening.

Everyone is different, and no one needs to justify their time to anyone.


A lot of people (including my family) watch Christmas movies and gorge themselves on food and drink.


Doing normal things, but with family.



