Large-scale Amazon EC2 Outage (amazon.com)
260 points by bradly on Aug 9, 2011 | 141 comments



Two large, lengthy outages within a few months of each other? I hate to piggy-back on Amazon's follies, but it's times like these that make me love my Linode boxes. Total downtime over the last 5 years? 5 hours. I really couldn't ask for a better hosting company.

Edit: The 5 hours are what I've noticed from either personal servers or those of friends/acquaintances. YMMV.


It's worth pointing out that I am in the US-East with 8 beefy instances and my site stayed up during the entirety of both outages, so don't assume you're getting better uptime than all AWS customers.


I'm guessing you haven't been affected by any of the Fremont (Hurricane Electric DC) outages? I have a lot of Linode boxes and love their service, but I'm still splitting load across different providers as well.


Yep, the Fremont DC has been down at least 3 times in the past year. I still love Linode, but I'll probably move my (personal) VPS to another of their facilities.


I am in fact planning to move to Dallas later this week. 60ms of latency just isn't worth it.


Do you know how long the total downtime was?


Linode is pretty good about posting accurate Reason for Outage (RFO) updates: http://status.linode.com/

I don't keep track of the totals, but if you search their forums, I seem to recall someone who had monitored them with Pingdom or Wormly (or equivalent) for a fair period of time.


Thanks. I went through and counted up the downtime and came up with 16.5 hours down for Fremont in the past year (from today).


I remember the outages at the Fremont location. Every outage this year has been related to power issues. The company running the Fremont datacenter is Hurricane Electric. I hope Hurricane Electric gets things sorted soon at the Fremont datacenter. Every time I think about the company name "Hurricane Electric", I think "now there's a company name that inspires confidence." The name "Disaster Service" must have already been taken.


Yep, I'm on Linode, too. Love it. Linode guarantees three nines but in practice it gets close to four nines. EC2 looks like it's falling below three nines for some regions this year, even though it guarantees 99.95%.

http://en.wikipedia.org/wiki/Nines_(engineering)
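
For anyone who wants to put numbers on the nines, here is a rough sketch of the arithmetic. It assumes a 365-day year and treats availability as simple uptime over the whole year; the function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year permitted at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(downtime_hours: float) -> float:
    """Availability percentage implied by a number of down hours in one year."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


# Three nines, the 99.95% figure mentioned above, and four nines.
for pct in (99.9, 99.95, 99.99):
    print(f"{pct}% allows {downtime_hours_per_year(pct):.2f} hours down per year")
```

By the same arithmetic, the 16.5 hours counted for Fremont upthread would come to roughly 99.81% for that year, a bit under three nines.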


You don't get what you don't pay for. "Linode works for me" is just anecdotal. I suspect Linode has smaller margins than Amazon, though, and tries to pass savings onto their customers despite not having Amazon's economies of scale.

An Amazon small instance has 1.7 GB of memory, and 160 GB of "local instance storage" (which Amazon will wipe when your instance goes down), and "1 EC2 unit" (a single core on a 1.2 GHz 2007 Xeon processor). It will cost you about $60 a month.

A similar plan on Linode will give you 1.5 GB RAM, and 60 GB storage. They will also throw in bandwidth and persistent storage, though.

Amazon tends to give you more flexibility (on demand prices, spot prices, dedicated prices, bigger instances, smaller instances, persistent storage, cheap temporary storage, and so on). But it's not as easy to use as Linode.

Linode targets people who want a cloud server. Amazon has a number of uses.


An Amazon "small" instance in no way compares to a Linode 1.5GB server. Linode beats the pants off of Amazon at every price point so badly that it's hard to even compare the two. A 4GB Linode performs faster than an 8GB EC2 "large" instance. Only the Amazon "High-CPU" instances come close, so you'd need to compare an EC2 "medium" with a Linode 2GB to make it even close to fair. The Linode also comes with 8 virtual cores, the EC2 unit with 2.

The real trouble with Amazon is the erratic performance of the EBS system which can kill an otherwise fast system. Linode's I/O does seem to be much more consistent and predictable.

Amazon's storage capacity and the cost of storage is unmatched, though. If you have a mountain of data, Linode will be prohibitively expensive.


I must have picked a bad time to become a Linode customer. My node in Fremont has been up for two days and has already seen two network outages.


Overall they have great service. Only their Fremont location has been a bit touchy this year. I've been with them since the days they used UML before Xen, and if it's any consolation, I have experienced at least one downtime in each DC (but this was over the course of many years now). Unless you plan for distributed systems, expect some failures from time to time no matter who the provider is.


From what I've encountered, and what I've read, Fremont has been the most patchy Linode DC - they've had two outages there recently. One just a day or two ago, and another in early May. Apparently there was a significant outage late last year too.

p.s. The last two weren't network outages - they were total power failures. Your 'node would have been hard-booted as a result.


Just a fluke. I've been a Linode customer since 2008 and have managed multiple Linodes (on the East coast) and as far as I can tell, Linode has given me more than four nines.


I noticed the same outage. I signed up a week ago and was moving some stuff to the Fremont site. Maybe I should have picked Dallas.


I see a lot of Linode customers here. Where are all the Rackspace users? Or are we a rarity on HN?


I run a handful of virtual machines on Rackspace's private cloud. And while their customer service tends to be dire, the VPSes themselves are pretty solid. Desperately needs more features though - with a private back-end network between the servers (rather than shared!?) it'll be appropriate for many more applications.


Yep. A year and a half so far with my Linode. Not so much as a minute of downtime.


Linode down for 1 hour 1 day ago.


Yes and it was particularly painful since my blog was on HN's front page when my VPS went down


I'm with you there yo.


It gets better - I hope nobody was relying on their AWS DB snapshots for backups - I just had a note from Amazon to say that one or more of my EU-West database snapshots had missing blocks (due to an EBS software error) and had been removed.


Yep just had the same email - I've lost all bar one EBS snapshot. Don't think I can trust them anymore, I wish I had more time to move our infrastructure elsewhere!

They've essentially swiss-cheesed all our backups.

Email copy follows....

Hello,

We've discovered an error in the Amazon EBS software that cleans up unused snapshots. This has affected at least one of your snapshots in the EU-West Region.

During a recent run of this EBS software in the EU-West Region, one or more blocks in a number of EBS snapshots were incorrectly deleted. The root cause was a software error that caused the snapshot references to a subset of blocks to be missed during the reference counting process. This process compares the blocks scheduled for deletion to the blocks referenced in customer snapshots. As a result of the software error, the EBS snapshot management system in the EU-West Region incorrectly thought some of the blocks were no longer being used and deleted them. We've addressed the error in the EBS snapshot system to prevent it from recurring.

We have now disabled all of your snapshots that contain these missing blocks. You can determine which of your snapshots were affected via the AWS Management Console or the DescribeSnapshots API call. The status for any affected snapshots will be shown as "error."

We have created copies of your affected snapshots where we've replaced the missing blocks with empty blocks. You can create a new volume from these snapshot copies and run a recovery tool on it (e.g. a file system recovery tool like fsck); in some cases this may restore normal volume operation. These snapshots can be identified via the snapshot Description field which you can see on the AWS Management Console or via the DescribeSnapshots API call. The Description field contains "Recovery Snapshot snap-xxxx" where snap-xxx is the id of the affected snapshot. Alternately, if you have any older or more recent snapshots that were unaffected, you will be able to create a volume from those snapshots without error. For additional questions, you may open a case in our Support Center: https://aws.amazon.com/support/createCase

We apologize for any potential impact this might have on your applications.

Sincerely, AWS Developer Support
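
To make the failure mode in that email concrete, here is a toy sketch of a cleanup pass that compares blocks scheduled for deletion against blocks referenced by snapshots. This is not Amazon's code; the function, the snapshot IDs and the block IDs are all hypothetical. It only shows how missing a subset of references makes live blocks look unused.

```python
def blocks_safe_to_delete(candidates, snapshots):
    """Return the candidate blocks that no known snapshot references."""
    referenced = set()
    for block_ids in snapshots.values():
        referenced.update(block_ids)
    return set(candidates) - referenced


# Hypothetical state: two snapshots sharing block b2.
snapshots = {
    "snap-a": {"b1", "b2"},
    "snap-b": {"b2", "b3"},
}
candidates = {"b1", "b3", "b4"}  # blocks scheduled for deletion

# Correct run: every snapshot's references are counted, so only b4 goes.
correct = blocks_safe_to_delete(candidates, snapshots)

# Simulated bug: the references held by snap-b are missed entirely,
# so b3 looks unused even though snap-b still needs it.
seen_by_buggy_run = {"snap-a": snapshots["snap-a"]}
buggy = blocks_safe_to_delete(candidates, seen_by_buggy_run)

wrongly_deleted = buggy - correct
print(sorted(wrongly_deleted))
```

In this toy run the block that snap-b still needs is the one that gets removed, which is the "swiss cheese" effect described above.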


are your backups too large to be elsewhere completely? i'm working on a site on appengine, and although i can't back up all data, i can copy the critical stuff (the user accounts) to servers elsewhere.

[i realise this is a little off-topic, but what do other app-engine users do?]

[edit: wasn't being critical, just trying to understand what others do and why]


They're not mission critical failures, but that's not the point to me - I pay a monthly fee to have those snapshots there, and they just carved holes in them. Disappointing to say the least, even though it's not actually taken down any of our instances.


I’m glad to hear that you didn’t lose anything critical. But snapshotting EBS volumes is not backing them up. If a copy is stored using the same service, and is therefore susceptible to all the same catastrophes that might befall that service, then by definition it’s not a backup. I always set up a local read slave of the database server with hot-swappable hard drives that get rotated through a safe deposit box or fireproof safe. The only way I can trust that things are being backed up properly is to do it myself.


My site (hosted on Heroku and RDS) was inaccessible, as was heroku.com ... but the AWS status website said everything was OK. Heroku status also said everything was OK.

What is the point of those status dashboards if they are not actually monitoring the health of the cloud in real time?

Google App Engine's status dashboard quickly returns to green the minute an outage goes away, hiding the overall unreliability from view.


I've found http://search.twitter.com far more reliable than any status dashboard for finding out if there is an issue or not.


one should build a crowdsourced status checker to really know what happens.


http://downrightnow.com/ monitors Twitter, their own user reports, official status RSS feeds, etc. to calculate the probability that a site is down.


My business is on 15+ AWS instances and we are completely down and all hosts are unreachable.


As of 9:59 PM CST it appears we are back up. Additionally, all our instances are running normally. It appears to have been a network connectivity issue for AWS.


I'm moderately embarrassed to admit that I learned about this outage when both the Facebook games I play went down at the same time. Both back up now though.


I thought it was silly how everyone was going apeshit the last time this happened. Even here on HN people were falling over themselves proclaiming the end of the cloud. But you know what, I kind of like seeing outages like this because it means Amazon is (hopefully) going to reduce the risk of a bug like this ever happening again which in the end just makes the platform more reliable in the future.


To be fair the last one was a lot longer than a 1/2 hour and even though it was a small percentage there was permanent data loss. 0.06% of 100 PB is 60 TB which is still a lot of data. I have no idea how much data they store in one DC, maybe somebody could hazard a real guess.
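
A quick check of that arithmetic, assuming decimal units (1 PB = 1000 TB). The 100 PB total is only the guess made above, not a known figure.

```python
TB_PER_PB = 1000  # assumption: decimal units


def lost_tb(total_pb: float, loss_pct: float) -> float:
    """Terabytes lost when loss_pct percent of total_pb petabytes is lost."""
    return total_pb * TB_PER_PB * loss_pct / 100


print(f"{lost_tb(100, 0.06):.1f} TB")
```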


> the last one was a lot longer than a 1/2 hour and even though it was a small percentage there was permanent data loss

There was permanent data loss on EBS volumes, which are explicitly stated to be vulnerable to data loss.


> the last one was a lot longer than a 1/2 hour

About 25% of EU west instances have been down for about 36 hours now...


Oh! I did not know that.


I would like to run my app on whatever AWS's status board is running on :-P


http://status.aws.amazon.com/rss/EC2.rss

http://reports.panopta.com/cloudharmony-borderless/server/57...

various ec2 clients affected:

- dotcloud

- dropbox

- engine yard

- foursquare

- heroku

- hootsuite

- instagram

- kicksend

- netflix

- pagerduty

- reddit

- twilio


pagerduty down?! I don't know if I should be sarcastic here and laugh, or just worry...


Their website might have gone down but I'm told that their underlying service is redundant across multiple availability zones.

Perhaps they should do the same for the website just to instill extra confidence, though.


I didn't get any SMSes or calls from PagerDuty even though I see 4 incidents triggered in the PD web UI. Although that could be due to Twilio going down.


We (PagerDuty) were down for a bit. We will move off of AWS asap once the dust settles.


Why move off, why not just replicate your service to another provider as well?


There's a good reason for that: when AWS has problems, we are under very heavy load because many of our customers are on AWS. The last thing we want to do at that point is an emergency flip to a secondary provider.


Is it just their user facing website that's hosted on EC2? Or do the pages originate from EC2 as well?


Did Netflix actually go down? They're supposed to be pretty Chaos Monkey proof.

http://www.codinghorror.com/blog/2011/04/working-with-the-ch...


Yeah, I was surprised too because their infrastructure is redundant across AZs if I remember correctly, but Netflix was completely down for me.


I experienced intermittent issues. It tried retrieving the content to stream and failed. This lasted for about 20-30 mins. Was using wii streaming during that time.


Reddit is intermittent too.


I wonder how many 9's Amazon is down to? It's really a shame because it's such a good service in theory.

But it really is time for Amazon to start thinking about some kind of automatic failover system.


3 out of 4 us-east-1 AZs are unavailable from the outside. The AZ that is working is the one that has chronic availability issues, so there is no way to recover service in that zone.


There's an AZ with "chronic" availability issues? Where do I find out more about this?


I should have been more specific, I was in a rush =)

The original us-east-1b has chronic capacity issues, meaning that it is always at or near its capacity. AWS refuses to sell instance reservations for this AZ, and it's often difficult/impossible to launch new instances in this zone.

Keep in mind that AZs are remapped for all accounts newer than a certain point in time (say, 1 yr ago). Your 1b may not be my 1b.


Reddit gives: An error occurred while processing your request. Reference #97.374a7b5c.1312858006.1a17105

Must be this.


That's why I came here after I found out reddit was down :P


Why are these big-name companies still relying on exactly ONE web host? They are concentrating their failures in a single cloud host. They definitely need to abstract cloud deployment behind something like the libcloud and jclouds APIs so they can switch between multiple cloud deployments on a dime. As for storage, they should have replication set up on other cloud hosts as well.
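
A minimal sketch of what "abstracting deployment" could look like: try each provider in order and fall through to the next one on failure. Everything here is hypothetical (the ProviderError class, the launcher functions, the provider names); a real version would put libcloud or jclouds driver calls behind the launcher callables.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a launcher when its provider cannot start a node."""


# A launcher takes a node name and returns a provider-specific node id.
Launcher = Callable[[str], str]


def launch_with_failover(
    name: str, launchers: Sequence[Tuple[str, Launcher]]
) -> Tuple[str, str]:
    """Try each (provider, launcher) pair in order; return the first success."""
    errors = []
    for provider, launch in launchers:
        try:
            return provider, launch(name)
        except ProviderError as exc:
            errors.append(f"{provider}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-in launchers for illustration only.
def primary_down(name: str) -> str:
    raise ProviderError("region unreachable")


def secondary_ok(name: str) -> str:
    return f"node-{name}"


provider, node_id = launch_with_failover(
    "web1", [("primary", primary_down), ("secondary", secondary_ok)]
)
print(provider, node_id)
```

The launch call is the easy part. Keeping images, configuration and data in sync across providers is where most of the cost sits, as other comments here point out.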



Remember there is large EC2 EU West Outage that isn't fully fixed. About 25% of EC2 instances have been down for about 36 hours now...


It looks like the outage was only 30 minutes long. At least that is Pingdom's measurement of the downtime on my EC2 East 1 servers.


It depends on what you mean by outage. Instances with EBS volumes attached have problems even 24 hours later.


Please refrain from judging reliability of cloud providers unless you have a representative sample. By representative sample I mean several hundreds of instances.

That your instance runs 99.9% does not say anything, you are lucky. That your instance dies in two weeks does not say anything, you are unlucky.
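
A rough illustration of why one instance tells you little. It assumes each instance fails independently with the same probability over some period, which is a strong assumption since real outages are correlated, and the 5% figure is made up.

```python
def prob_no_failures(p_fail: float, n_instances: int) -> float:
    """Chance that none of n independent instances fails, given per-instance p_fail."""
    return (1 - p_fail) ** n_instances


p = 0.05  # hypothetical per-instance failure probability for the period

# With one instance you will most likely see nothing go wrong.
print(f"1 instance:    {prob_no_failures(p, 1):.2%} chance of a clean record")

# With a few hundred, a clean record would be very strong evidence.
print(f"300 instances: {prob_no_failures(p, 300):.2e} chance of a clean record")
```

Under these assumptions a single customer with a spotless record is the expected outcome even for a provider that loses one instance in twenty, so the anecdote cannot distinguish a good provider from a mediocre one.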


Why don't people just build in redundancy and use multiple locations, or keep a backup server at their business that can at least somewhat take over? It seems silly that all these people rely solely on EC2 and don't have their own hardware anymore.


They could get the same thing by simply reading Amazon's recommendations and having EC2 servers in multiple regions as well. Most people either don't need reliability or failed to plan for it - the only difference with EC2 is that it's easier to add redundancy when you don't have to deal with physical hardware, network connections, etc.


I hear all these things about the need to run in multiple regions, etc. Why do I have to worry about this? Why isn't there an easy off-the-shelf solution? Honestly, isn't there an easier way to have redundancy?


It's a hard problem because there are many different things which can cause downtime: S3 is simple and redundant but also limited so they constrain the engineering problem to something manageable.

If you need anything more complex than delivering static content, you have to spend the effort thinking about what's important to you and how you're comfortable spending and/or compromising to meet that reliability goal. The answers for this tend to be pretty specific to your business needs and tech stack.


Data synchronization between data centers is tough. There's A LOT of data moving in those data centers all at once, and it isn't easy to mirror that across the country in real time.


That's why the good lord invented eventually consistent data synchronization models. If your use case can tolerate it, use it.
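
One small example of an eventually consistent model is a grow-only counter (a G-Counter CRDT): each site increments only its own slot, and merging takes the per-site maximum, so replicas converge no matter what order the merges arrive in. The class and the site names below are illustrative, not taken from any particular product.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by per-slot maximum."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n: int = 1) -> None:
        # A replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Taking the max is commutative, associative and idempotent,
        # which is what lets replicas sync lazily and still converge.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


east = GCounter("us-east")
west = GCounter("us-west")
east.increment(3)  # writes accepted locally, no cross-country round trip
west.increment(2)

# Later, the sites exchange state in either order.
east.merge(west)
west.merge(east)
print(east.value(), west.value())
```

The trade-off is that between merges the two sites can report different totals, which is the part your use case has to tolerate.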


There is an off the shelf solution - Google App Engine.


Huh. My piddly little website hosted on a micro instance is down. This kind of sucks, not that I get a lot of hits, but I have a few things hosted there that local stuff depends on.


I think Amazon can fix this problem simply by raising us-east prices to be on par with us-west. At least then there would not be quite the incentive to make the foolish choice of running all your instances out of one region. (I suppose they could lower prices in us-west to match as well.)

But in any case, I expect the PAAS people like heroku to at least start to step up their game. My apps on Google App Engine are up and I can run clojure and ruby there too.


My servers (on Engine Yard) were down for a few minutes but appear to be back up again.


I pay the extra money for US-west which seems to have a much better track record.


rtomayko: Everyone hates AWS when it fails in exactly the way they clearly state it will in the fine print. It's not free. You have to engineer for it.


Looks like Pagerduty is hosted on ec2? Really?


100% green for me


scroll down.


Appears to be back up, at least for us.


Yes, my site is up now too.


US-East is the Big Lots of EC2 zones.


Netflix also seems to be affected.


Is this pure network issue (ie: strictly connectivity), or were instances rebooted?


Seems like network only. I have an instance that was unavailable via SSH. When things came back up I still had 81 days of uptime.


Is this why Reddit is down?


yup.


This will likely kick Heroku into high gear with jumping off EC2.


Netflix is down as well.


Looks like it is up now. (7:58 pacific). I can access netflix.


11:17 pm EST - netflix.com does not work (though shows some minor signs of life).


is this related to the one from the other day?

DUH - doesn't seem to be - that was west coast, IIRC, and the only issue I see now is in Virginia.


Netflix is down.


We've been down for 3 days now.


All my boxes just came back up


Looks like we're back.


Twilio is down. Bummer


damn, http://status.twilio.com/ - calls are still going through, just a few extra rings before my service receives the http request from twilio... gotta love twilio they rock - looks like no call recordings...


netflix down on my PC and their app is down on my TV.


It's back up now.


Dotcloud is down


Back up now.


Our minecraft server on linode is down.


Linode relies on Amazon? My Linode's still up...


I'm a fool, it's mojang's authentication.


I'm sure Amazon will fix this issue, but the question it brings to my mind is: why are so many people using EC2? Most startups need hosting, not an elastic compute cloud. EC2 makes sense for someone who needs to spin up 1,000 workers, pull data out of S3, process it, store the results in S3 and then spin down the workers when it is done.

But startups need hosts that are up 24/7. EC2 doesn't give you any guarantee of uptime, and if it goes down the local (fast) disk is ephemeral. Yet the EBS alternative, which is backed up, is very slow.

The basic VPS offering that Linode, Rackspace, and just about everyone else provides isn't available at all from Amazon (near as I can see). Yet this is what startups need: local disks, a small monthly fee, and up all the time.

So, Amazon requires extra engineering to account for nodes going down more often and to make ephemeral disks reliable or EBS performant. It also puts you on the path to lock-in, since so many of Amazon's services have their own unique APIs. It isn't exactly cheap, near as I can tell, when compared to, say, dedicated hosting in Germany.

Its advantages exist elsewhere- if you need to spin up a bunch of machines with an API you can do this at rackspace cloud. And... that's about the only advantage I'm able to think of. Personally, I'm architecting to have some overcapacity built in and to survive a spike, because I use that excess capacity in the off hours for heavy lifting. I wouldn't try to bring up extra nodes in the morning and shut them down in the evening anyway... and I doubt that many startups are really doing that.

Possibly, I'm missing something. I tend to forget about features that aren't compelling to me, but are compelling to others. So maybe there's something that's important to these startups.


Having used EC2 for web hosting I can affirm that you are exactly right. Web hosting on EC2 is a challenge and quite expensive. From my experience EC2 is much better suited for large data storage, distributed computing, and any other use case that can tolerate EC2 quirks, hiccups, and outages.


S3 is a big part of it. If you're storing lots of data, a "local disk" just isn't enough. Now you're spending your time solving problems Amazon already solved.

Occasional downtime just isn't that big of an issue for many startups, who are still looking for product-market fit.


I see what you're saying, and I failed to illuminate that I'm coming from the perspective of someone who is working with a cluster of riak nodes. Riak is a ring topology much inspired by the Amazon S3 design. I'd never trust anything to just one spinning rust device, sure. I was taking a distributed storage architecture (using whatever open source platform you prefer) for granted. But I recognize that S3 predates many of them.


If you expect to grow your start-up quickly, it makes sense to use a provider that allows you to scale. It is very costly to maintain two separate infrastructures for different providers.


I think I'm misunderstanding your point. Why would you need two different infrastructures? Anything you can host on EC2 you can host on Rackspace Cloud, but without being ephemeral. Or you can order servers by the dozen from other providers.

Maybe you're talking about ordering servers by the hundreds, but are successful startups like foursquare doing that? Seems a few a week is what they'd add, and you could do that with most other hosts, without the added complexity of Amazon.


The difficulty of scaling is not in getting the hardware, but in the operational burden and frequency of failure. With EBS, machine and disk failures are effectively irrelevant and you can instantly migrate to another machine. You don't need to worry much about data loss. If a disk breaks in one of your machines, EBS keeps working. If the machine breaks, the data is still in EBS. If EBS breaks, the back-up is still in S3.

Dedicated hosting and rackspace cloud may be cheaper, but if you expect to be in a situation where failure becomes frequent, you'd better be on EC2.


With AWS, the development, operational and financial burdens are higher, and the frequency of failure is much higher.

EBS is not a unique feature, as there are multiple equivalents of EBS, along with higher level methods of replicating data, namely, couchDB, riak, cassandra, etc.

But I think you've answered my question- people use AWS because they don't realize this.


If you are starting up, where are you going to get the capital, expertise, and people to operate and maintain those systems? You have better things to do.


I love it when people tell me spending more money and time and effort on something that is less flexible, reliable or performant is somehow saving money, time and effort. I think you do not understand what you are talking about and you're giving me marketing spin, in fact that must be the case. You're simply wrong. I've explained it, you're not listening.

Of course the kicker is that this is in a thread about how a bug in Amazon means that the EBS volumes that were allegedly being backed up, actually weren't.


As tybris's comment indicates, the allure is primarily that you can configure the system to automatically spin up new instances on-the-fly as load increases, in case your startup gets on the front page of reddit.

Because most startups want to gratify that vanity, they set up on EC2 because they want to prepare for the "inevitable" deluge of users without paying for an adequate pipe or server configuration around-the-clock.
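
The scaling decision itself can be tiny. Here is a sketch of the rule an autoscaler might apply, with made-up capacity numbers and limits; real setups usually add cooldowns and look at averaged metrics rather than a single reading.

```python
import math


def desired_instances(
    requests_per_sec: float,
    capacity_per_instance: float,
    min_instances: int = 1,
    max_instances: int = 20,
) -> int:
    """Instances needed to serve the current load, clamped to a floor and a ceiling."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


# Hypothetical numbers: each instance handles about 100 requests per second.
for load in (50, 950, 10_000):
    print(load, "req/s ->", desired_instances(load, 100), "instances")
```

The ceiling matters as much as the floor: without it, a traffic spike from the front page turns directly into a bill.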


Imagine launching your startup on the day Amazon AWS goes down? On another note... why don't these power outages ever take Amazon.com down? C'mon, eat your own dog food!!!


Amazon specifically tells you to set up multiple servers in multiple availability zones. They probably follow their own advice, and, as a result, don't go down.

I've talked to some people at AWS about this, and the reason they have availability zones is that they don't want to charge you the speed cost of syncing data between zones if your app doesn't need 100% uptime. Generalized replication slows down your app. AWS gives you the option of not having replication or bringing your own.


Multiple AZs went down this time. Only way to have survived it would have been to be multi-region which is...harder.


The last two outages have affected multiple availability zones in the US East region. To really account for it, you'd need instances in different regions.
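
A back-of-the-envelope way to see why multi-AZ alone did not help here. The first function assumes replicas fail independently; the second adds a shared failure that takes every replica down together, like a region-wide event. All the numbers are hypothetical.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of N replicas when their failures are fully independent."""
    return 1 - (1 - single) ** replicas


def with_shared_failure(single: float, replicas: int, shared_down: float) -> float:
    """Same, but a shared event takes all replicas down for a fraction shared_down of the time."""
    return (1 - shared_down) * combined_availability(single, replicas)


# Two replicas at 99.9% each look excellent on paper if they are independent...
print(f"independent:      {combined_availability(0.999, 2):.6f}")

# ...but a region-wide event that is down 0.1% of the time caps the result.
print(f"shared 0.1% down: {with_shared_failure(0.999, 2, 0.001):.6f}")
```

Once the zones share a failure mode, adding more zones in the same region barely moves the number, which is the argument for separate regions.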


...and only Amazon can afford to follow their own advice. Multiple AZ hosting ain't cheap. Most CEO/CIO/CTO types spit out their coffee when they see the costs of fully redundant hosting in the cloud at which point they decide "For that price we can afford to be down for 24 hours."


Very few businesses need 100% uptime. As long as you have good recovery strategies, and exercise them routinely, you should be set. When was the last time you ran a failover simulation? do your ops guys know what to do? are there clear lines of communication as to the status of the event?

Outages are hard to avoid, but the pain can be lessened if your customers are aware of the recovery progress and you can deliver on your recovery time goals. Nothing is worse than being down, and leaving customers in the dark to start rumors that your guys are not even aware of the problem.


Amazon.com doesn't run on the same, precisely speaking, infrastructure. My understanding from other Amazon employees was that it (EC2) started as an internal tool.


What? You mean without a backup plan to substitute in for the platform with known reliability issues? Yeah I guess that would suck.

Edit re your edit on dogfooding: supposedly they use the same tech, just not the same servers, so their AZs are probably separate from EC2's. Also, they're in all of their geographical locations, not just us-east.


Do you have a source for Amazon.com running on AWS or AWS technology? As far as I know, they've never gone on the record with that.


According to http://en.wikipedia.org/wiki/Amazon_Cloud the Amazon.com retail website has been running on AWS since November 2010.


Wow - looks like you're right! I dug into the presentation (link below, important slide is #33) and Amazon are now claiming that all their web servers run on AWS. It's only their webservers (not their DBs, load-balancers or "services" which I presume means anything stateful). Of course, the co-location option for 'difficult' services is only available to Amazon, and before the switch-over in November they presumably weren't using AWS nearly so heavily, but this is at least a step in the right direction.

http://www.slideshare.net/AmazonWebServices/2011-aws-tour-au...


they still use different AS for their own retail stuff and cloud services. see radb/bgp


I'm on my celly so I can't, but if I remember after I land I'll try and dig up where I read about it.


Amazon.de went down when the outage started, but they quickly recovered.


Dropbox is up but stored content seems to be inaccessible. Yay for single points of failure!


Using dropbox is a bad idea overall in my opinion. They lied to their customers about what they can and cannot view in your account. They lied to you. They intentionally told you that they could not view your files when in fact they can. If that's not enough, they had a critical security vulnerability (log in with no password) for four hours, proving that their systems fail open. Finally, if all this wasn't bad enough yet, they do not encrypt your files when they store them[1].

[1] Technically they do encrypt the files, but the keys are right next to them on the same infrastructure. Doesn't do any more good than not encrypting them.


From a security-conscious perspective it is a bad idea, but too many people undervalue noncritical data and don't secure it like they should (that's part of why infosec is such big business). A prevalent lack of security sense is basically the only reason Dropbox is still in business.

But from a HN perspective they've received YC cash so your post is currently voted <= 0.


This really isn't a problem if you aren't storing anything sensitive, especially if you aren't even paying for it.


For the bits that need to be protected there is Tarsnap (http://www.tarsnap.com/). The client is open source, so you can check for yourself that things are encrypted before going on the wire.


I'm not convinced they were intentionally misleading. I was never 'lied' to; I understood what their system delivered.


My servers are down in east us.



This type of comment is not desired here. Take it to reddit.


This site is already, effectively, /r/technology combined with /r/politics.

I like this comment.



