Amazon EC2 and RDS in US-EAST zone down (amazon.com)
90 points by akhkharu on June 29, 2012 | 94 comments



Worst part of this outage: paying for a multi-az RDS instance and having failover totally, completely, fail.


I'm paying like $2,300 a month and even something basic like failover isn't working. I'm not happy.


At $2300/month you could redundantly colo or lease VERY powerful servers in 3-4 data centers around the country.


Except when you have to factor in all the plane flights to replace a broken HDD. And the risk of not making it in time when it breaks.


Most colo facilities let you buy hands-on time from their techs, or include a small amount per month for things like hard drive/RAM swaps.


Yeah, I don't think I'd go with less than RAID-6 (or full system redundancy plus 1 drive redundancy in each). Rebuilds just take too long, even with an in-chassis spare on RAID-5.

Unfortunately, Areca is really the only controller I've found that is well supported and does RAID-6 fast.


Would those be managed at that price? Because it's a hell of a lot more expensive when you factor in the cost of devops to make sure it stays working and fails over properly.


Poor inherited architecture, working to scale out greatnonprofits.org horizontally but it will be a while before we get there.

I have nothing against colo, but I don't really have time to run around the country checking on servers.


I feel for you :-(

Amazon is not cheap, and they have failed way too many times in recent memory.

But the api, oh the api - it's crack, and I can't live without it.


I know what you mean. I have a lot of issues with AWS, but the AWS console is exactly what my manager needs so he can do things himself. Simple things, such as AWS load balancing, fail when we get any decent amount of traffic.


[deleted]


I suspect it's the "all the things you can do with it" part, not the format. Using the SDKs, you don't see any of the underlying ugliness anyway.


Thanks for clarifying my statement. Boto ftw.


Luckily my RDS wasn't affected, but ELB merrily sent traffic to the affected zone for 30 minutes. (Either that or part of the ELB was in the affected zone and was not removed from rotation.)

We pay a lot to stay multi-AZ, and it seems Amazon keeps finding ways to show us its single points of failure.
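
(If an ELB keeps routing to a bad zone, you can pull that zone out of rotation yourself while AWS sorts things out. A minimal sketch, assuming boto 2.x; the load balancer name and zone label are placeholders:)

    import boto.ec2.elb

    # connect to the region and look up the load balancer by name
    conn = boto.ec2.elb.connect_to_region('us-east-1')
    lb = conn.get_all_load_balancers(load_balancer_names=['my-elb'])[0]

    print(lb.availability_zones)        # zones currently in rotation
    lb.disable_zones(['us-east-1a'])    # stop routing traffic to the affected zone
    # later, once the zone recovers:
    # lb.enable_zones(['us-east-1a'])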


Do we all agree that we are completely over AWS-EAST now? It's NOT worth the cost savings.


The Oregon (us-west-2) region is the same price as the Virginia (us-east-1) region.


That sucks badly.

Similar thing happened to me a while ago with a vendor. When your management team summons you to ask why the hell their site is down, you can't point fingers at the vendor if their marketing literature says it doesn't go down.

Sticky situation.


Can't you tell management that it isn't as reliable as they claim?


I did. Unfortunately in the financial services industry, believing it means taking responsibility for it.


Unless you host your data in several alternative dimensions, so that the same events wouldn't transpire in all of them, why not assume you'll encounter the occasional outage?


If only people understood that fact. Unfortunately few do.


Did/does your standby replica in another AZ have any instance notifications stating there was a failure? The outage report claims there were EBS problems in only one AZ.


No, nothing unusual with our standby replica. It's not even clear if it was the standby or our primary that was in the affected AZ.

Multi-AZ RDS does synchronous replication to the standby instance -- I'm guessing something broke in there. Hopefully AWS will update with a post-mortem as they usually do. Lots of frustrated Multi-AZ RDS customers on their forums.


Yeah, unfortunately it looks to be an EBS problem, and if the underlying EBS volume housing your primary DB instance takes a dump, that's going to cause replication to fall over too.


Multi-AZ RDS deployment is supposed to protect you from that though. That's why it's 2x the price. We should have failed over to a different AZ w/o EBS issues.


If your source EBS volume is horked, then you aren't going to be replicating any data to your backup host while the volume is messed up (since your source data is unavailable). EBS volumes also don't cross or fail over between AZ boundaries.

Maybe there was something bad with your replication server before the outage? It's hard to guess without knowing exactly what was happening at the time...


I don't think you're familiar with how Multi AZ RDS works: http://aws.amazon.com/rds/faqs/#36

The whole point is to protect you from problems in one AZ by keeping a hot standby in another AZ. It doesn't matter whether it's due to EBS, power, etc. This is one of the primary reasons to use RDS instead of running MySQL yourself on an instance.
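
(For anyone unfamiliar: Multi-AZ is just a flag at instance creation time, and RDS manages the standby and failover for you. A minimal sketch, assuming boto 2.x; the identifiers and credentials are placeholders:)

    import boto.rds

    conn = boto.rds.connect_to_region('us-east-1')
    db = conn.create_dbinstance(
        id='mydb',                    # placeholder instance identifier
        allocated_storage=100,        # GB
        instance_class='db.m1.large',
        master_username='admin',
        master_password='secret',     # placeholder credentials
        multi_az=True)                # RDS keeps a synchronous standby in another AZ
    print(db.status)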


Yes... what also sounds plausible is that, since this was an EBS outage, the underlying EBS volume wasn't detected as being unavailable (if it in fact did become unavailable), so no failover to your other RDS server was initiated.


Every time (two out of two), by the time I click on an "X is down" link, the service/website is working again. Surely there is a better platform for alerting about outages than ycombinator?


I was down for approximately three hours this morning. I don't know when this submission was posted, but I made one shortly after discovering the outage myself.

Either way, if you're using RDS, even if this didn't affect you, it's discussion-worthy. I was affected, and we're building a not-yet-launched product that allows us the time to consider "Is Amazon really where we want to be?". The more failure I'm aware of, the more informed that decision is.


Pingdom does a good job of it, if you point it at a public-facing web site you particularly care about. I'm not affiliated with them; I've just been woken up by them.


Individual availability zones can be identified using the API.

   ec2-describe-reserved-instances-offerings --region
will tell you what the zone's identifier is.

After you list the permanent identifiers, you can match them up to find out if your us-east-1a matches my -1d.

This Alestic article [0] shows how to label them all.

[0] "Matching EC2 Availability Zones Across AWS Accounts" http://alestic.com/2009/07/ec2-availability-zones


Keep in mind AZs are different per account. My us-east-1b is not necessarily your us-east-1b (as someone reminded me on twitter just now).


I got notified by Pingdom that my domain was down before AWS had any info on that status page of theirs. IMHO, they should improve the latency of their alerts.


Same here. In fact, the AWS dashboard was still showing 2/2 checks passed for some 20 minutes after Pingdom told me my site was down.

Then the AWS dashboard finally updated and told me that 3 minutes ago my instances became unreachable. That is pretty poor. AWS should be able to know right away and email me themselves.


I've learned to ignore the "checks passed" status for quite a while now, especially for servers behind a load balancer.


SNS sent me an e-mail of my instance alarms pretty quickly.

EDIT: My status checks were slow to update like the sibling comment stated, although the alarms that measure system resources triggered almost immediately when everything blew up. I think the status checks refresh at a certain interval, but those aren't really meant for real-time monitoring AFAIK.
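
(A minimal sketch of that setup, assuming boto 2.x: a CloudWatch alarm on an instance metric that publishes to an SNS topic, which then e-mails you when it fires. The instance ID, topic ARN, and thresholds are placeholders:)

    import boto.ec2.cloudwatch
    from boto.ec2.cloudwatch import MetricAlarm

    cw = boto.ec2.cloudwatch.connect_to_region('us-east-1')
    alarm = MetricAlarm(
        name='high-cpu-i-12345678',
        namespace='AWS/EC2',
        metric='CPUUtilization',
        statistic='Average',
        comparison='>',
        threshold=90,
        period=60,
        evaluation_periods=3,
        dimensions={'InstanceId': 'i-12345678'},    # placeholder instance
        alarm_actions=['arn:aws:sns:us-east-1:123456789012:ops-alerts'])  # placeholder topic
    cw.put_metric_alarm(alarm)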


By over fifteen minutes in my case. Possibly thirty. WTH.


EC2 comes with a free Chaos Monkey service. It's called EC2.

I know, they're trying to make it reliable and they've got a bunch of very hard problems to solve. That doesn't change the fact that sometimes some of my servers just permanently stop responding to pings until I stop-start them, or get crazy-slow I/O, or get hit by these once-in-a-while-and-always-at-night outages.

It's great when you suddenly need a hundred more servers, though.


I feel like you can't really say you're in the green when you still have customers unable to use your service. My instance is still stuck in failover.

"9:39 AM PDT Networking connectivity has been restored to most of the affected RDS Database Instances in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. We are continuing to work on restoring connectivity to the remaining affected RDS Database Instances."


Absolutely agree - that's just silly. Their status page is close to useless.


I'm running in us-east-1 and my EC2 instances and EBS volumes are still responding ok for the moment...

Fingers crossed (just deployed to AWS less than 2 weeks ago).


It's not entirely down as I can still access my instances. I'm in us-east-1b.


Your us-east-1b might be my us-east-1a.


9:32 AM PDT Connectivity has been restored to the affected subset of EC2 instances and EBS volumes in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. Some of the affected EBS volumes are still re-mirroring causing increased IO latency for those volumes.


I'm still seeing issues: some instances aren't starting, and others I'm still not able to connect to. So I'm not sure what they are talking about.


For what it's worth, my small website is online again.


Anyone have any details on why us-east-1 seems to be less reliable than the other regions? Is it the oldest?


According to this calculation (which attempted to probe all the racks in EC2), over 70% of EC2 lives in us-east.

http://huanliu.wordpress.com/2012/03/13/amazon-data-center-s...


I'm under the impression it's the most used.


It probably is the most used, being a cheaper alternative to us-west, but are you suggesting it fails more because it is used more? It does seem that the big AWS outages (in the US) have been concentrated in us-east. I have wondered if it's just because us-east is newer, so they haven't had as much time to work things out, or whether the us-west team is a little better?

edit: btw, I am not dismissing "used more" as a valid theory. More use = more hardware = more complexity, which could lead to more failures.


There are two different us-west regions. One in Oregon (priced the same as us-east) and one in California.


My theory is "used more".


It's the oldest, yes.


I'm curious why no public PaaS spans multiple AWS regions.


1) Because AWS East is so much cheaper (and none of us like spending money).
2) AppFog actually is multi-region (and multi-IaaS as well).


Oregon is priced the same as AWS East, but it seems to have a smaller set of boxes; I've gotten errors in the past about not having any more servers to allocate.


Same in that they are both AWS and sometimes generate errors - yes. Not the same in that East has had four significant outages in the last 16 months and West has not.


I'd tolerate multi-AZ as a baseline.

Thanks for AppFog -- I hadn't heard of them, but will check them out.


Interestingly enough, not only is EBS down, but ELB cannot register instances even if they are not EBS-based and are completely operational.

I have some live instances running without EBS disks that I cannot place behind the ELB, as it is not working.


I have some live instances running without EBS disks that I cannot place behind the ELB, as it is not working.

ELBs are sometimes EBS backed.


Issue #3298392 for EC2 this month. This is ridiculous, so many websites rely on EC2 and it's proving to be extremely unreliable. Cloud computing is definitely not the answer to everything it would seem.


    Cpu0 : 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st

The EBS subsystem is completely unreachable; I/O wait times are tanked across the board for me (I'm in US-EAST-1).
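
(What %wa measures, for the curious: the iowait share of CPU time from /proc/stat. A rough, Linux-only sketch of sampling it yourself; not tied to EBS specifically:)

    import time

    def cpu_times():
        with open('/proc/stat') as f:
            return [int(x) for x in f.readline().split()[1:]]   # aggregate "cpu" line

    before = cpu_times()
    time.sleep(5)
    after = cpu_times()
    deltas = [a - b for a, b in zip(after, before)]
    # /proc/stat field order: user nice system idle iowait irq softirq steal ...
    print("iowait over 5s: %.1f%%" % (100.0 * deltas[4] / sum(deltas)))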


What zone? I really wish Amazon would provide that info, instead of saying that it only affects one zone.


AFAIK zones are randomized. 1a for me is 1d for you.


That's really interesting if it's true. I had never heard this before. Thanks for the tip.


Ahh. Now that you mention it, I think I recall reading that before. It struck me as weird though because 1a was obviously the first one and 1e was recently added. So, would they rebalance my labels in that case?


Do you know why this is?


It was done due to badly written tools and scripts firing up instances in Availability Zone A every time.


Probably to prevent folks from all stacking up in a single AZ.

Just think if someone posted a blog post saying "I've noticed that EBS performance is far better in 1d vs. 1a".


Yeah, for me, 1d experiences the lowest load of all zones. According to the pricing history for spot instances, 1d experiences the fewest price spikes compared to 1a and 1b. I'd be interested to see if other users have noticed the same thing for their zones.
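
(A hedged sketch of pulling that spot price history with boto 2.x, if you want to compare zones for your own account; the dates and instance type are placeholders:)

    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    for zone in ('us-east-1a', 'us-east-1b', 'us-east-1d'):
        history = conn.get_spot_price_history(
            start_time='2012-06-01T00:00:00Z',
            end_time='2012-06-29T00:00:00Z',
            instance_type='m1.large',
            product_description='Linux/UNIX',
            availability_zone=zone)
        prices = [h.price for h in history]
        if prices:
            print("%s max spot price: $%s" % (zone, max(prices)))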


I find that random(5) is the best performing. For okay but consistent performance, random(5) is decent, but you should definitely avoid random(5) due to high load.


Well, they also return errors if a zone is out of capacity. That seems like it would guard against the issue a bit. But maybe they just don't want to have to field loads of questions about that.


Supposedly it's for load-balancing purposes. (Most people spin up their machines in 1a.)


Humans are predictable - it wouldn't spread the load across zones very well.


I believe it's also a security measure.


Unfortunately, there is no meaningful way for them to say which zone because zone labels are different for each user.


I have instances in two different zones - both are down, although I don't know if AWS's randomization means that my 1a and 1d are actually located in the same logical zone.


The zones map differently per account, but if you've launched instances in different zones for the same account, you are for sure in different AZs.


Both my MySQL master (I'm not using RDS) and Redis Master servers are affected and are located in zone us-east-1a.


Why do you have two masters in the same AZ?


It is affecting me in us-east-1b.


Good time to consider Google's Compute Engine as an alternative? What will we call it, GCE?


Currently it is a limited beta. Also, it looks to be more expensive.


Actually, if you do the normalization to make it apples to apples (and adjust for the difference in RAM) it looks price competitive. My numbers make it look slightly more expensive than AWS EAST (teh suck) and slightly less expensive than AWS WEST.


us-west-2 (Oregon) has identical pricing to us-east-1 (Virginia).


ahh Acronyms


I suggest that until Amazon uses RDS for their own database, you don't either...


Mine is okay


dotcloud was down also but it's now up (they rely on EC2).


My instances are not down either. I will back them up now in case things go bad.


Two out of four of my instances in us-east-1e are unreachable.


My instances in us-east-1c are fine.


Goat rodeo.


Yep - Forums are exploding



