Amazon EC2 Outage Takes Down Foursquare, Instagram, Quora, Reddit, Etc (techcrunch.com)
71 points by Vexenon on Aug 9, 2011 | 20 comments



Maybe I'm oversimplifying things, but why haven't these companies distributed their compute resources across various facilities and cloud providers, enabled instant failover, and tested this before outages like these?


It costs engineering time to do so. Time that could otherwise be used to build features, better protect against more common failures, attract users, etc.

Amazon probably has ~5hrs/year of complete failure of a region. Figure, conservatively, it would take 3 months of engineering time to protect against that, plus a 'continuing' cost of 1/2 a week per month to maintain that protection. You'd also have to (at least) double your provisioned capacity (which may include a larger ops team, etc). Assuming your servers cost $20k/month and devs cost $100/hr (both fully loaded), we're talking about ~$340,000 to prevent 5 hours of downtime (just for the first year).
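A rough back-of-envelope version of that arithmetic, as a sketch only; every figure below is the assumption stated above, not real AWS pricing:

    # First-year cost of full redundancy, using the parent's assumed figures
    dev_rate = 100                      # $/hr, fully loaded
    initial_eng_hours = 3 * 4 * 40      # ~3 months of engineering time
    ongoing_eng_hours = 0.5 * 40 * 12   # ~half a week per month, for a year
    extra_capacity = 20000 * 12         # doubling $20k/month of servers

    first_year = dev_rate * (initial_eng_hours + ongoing_eng_hours) + extra_capacity
    downtime_prevented = 5              # assumed hours of region failure per year

    print(first_year)                       # 312000 -- roughly the ~$340k above,
                                            # before the larger ops team etc.
    print(first_year / downtime_prevented)  # 62400  -- break-even downtime cost, $/hr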

If downtime costs you more than $50K/hr, then it might make sense to be that fault tolerant. Otherwise, there might be better places for a startup to spend its (limited) resources.


Not to mention that it's very easy to increase overall downtime by introducing all the extra complexity this kind of redundancy can bring.


It costs engineering time to simply choose Amazon in the first place. You could spin up VPSes at Rackspace or dedicated machines elsewhere for less money, and have a local, reliable, fast hard drive.

Instead, to go with Amazon you have to architect for Amazon: not counting on your EC2 instances to be up all the time, accounting for their local fast storage going away, and accounting for how EBS, which is persistent, is slow.

The alternative is, get the enterprise version of Riak, purchase dedicated nodes in two data centers, tell them about each other. (no engineering required.)

If engineering resources are the most precious commodity, it seems AWS is the more expensive option.


This is not an easy thing to do, especially if you use EC2 in conjunction with EBS volumes, which is a typical setup. EBS volumes are created in a particular availability zone ("facility") and can only be accessed from that same zone. Therefore you need to distribute not only computing resources but also data, which is significantly harder. So even distributing across multiple Amazon data centers is not that simple. To distribute across different cloud providers you would have to rewrite large chunks of code for each provider or come up with some way to "abstract away" cloud providers. Either way it would be a nightmare to manage and is likely not worth it for a typical startup. But yes, this CAN be done.
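A minimal sketch of that zone constraint, assuming boto3 (which postdates this thread) and an illustrative region and volume ID:

    # EBS volumes live in exactly one availability zone (boto3 sketch;
    # the region and IDs below are illustrative, not from the thread)
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    for vol in ec2.describe_volumes()["Volumes"]:
        print(vol["VolumeId"], vol["AvailabilityZone"])

    # A volume can only attach to an instance in its own zone, so moving the
    # data to another zone means snapshotting and recreating the volume there:
    # snap = ec2.create_snapshot(VolumeId="vol-0123example")
    # ec2.create_volume(SnapshotId=snap["SnapshotId"], AvailabilityZone="us-east-1d")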


To do that you basically need to forgo all the benefits of cloud computing and think of it like hosting at two traditional datacentres.

It's not unsolvable, but as others have mentioned it's probably not worth the time or money needed to have a live standby. There will be a level of failover speed that is worth having, but that might be "if EC2 is gone we can recover in 48 hours with no more than 6 hours of data missing" so the contingency will not kick in for a short EC2 outage.


I fail to see how this is related to "forgo the benefits of cloud computing"... Does your startup have the resources to manage multiple datacenters by itself (such as one in the US and one in the EU)? Isn't this a clear benefit of cloud computing (and a pretty big one too)?


I should have been clearer - you forgo the benefits of cloud computing for the functionality related to having a second cloud installation to fail over to. Each cloud will have all the benefits, but between them you need to come up with a way to synchronize applications and data that isn't just "create a cloud-backed relational DB and point all the app servers to it".


They will, now :)


It's a testament to how successful Amazon has been in its cloud offering. We're used to sites going down for one reason or another. What's weird is that Amazon's success has made all these failures so correlated. It's a strange feeling when many sites you like all fail at once.


My t1.micro instance in us-east-1b seems to be up and running just fine as far as I can tell.


Well there goes all the parts of the Internet I'm interested in. Time to go read a book.


Did someone say business continuity? It can be costly, but it could save your business. Stop saving that VC money and start saving your business.


And here I was thinking the change I just rolled out to our EC2 instances had boned our test environment. Two failures in a week? Does not really inspire confidence right now :(


It's back now.


reddit is currently working for me.


LOL, the cloud.


This may be a stretch...but anything to do with the Verizon line workers strike?


We were thinking of migrating our service to EC2 from our dedicated SoftLayer server. Now we will probably stick with our current setup.


My reaction: shocking (but not really).




