Our instances in ap-southeast-2 were out for around 12 hours. We used multiple availability zones and it didn't prevent downtime at all. The difference between the AWS and Google outage responses is striking. AWS was down for 12+ hours for some customers, forced each customer to chase service-level credits, and signed off the postmortem with a nameless, faceless "-The AWS Team". Not one person at AWS was willing to take responsibility for this failure.
Whereas Google was recently down for less than 18 minutes. A VP at Google sent an email advising all affected customers, posted continuous updates to their status page, sent a further apology email at the conclusion, posted a service credit exceeding the SLA to all customers in the zone (without forcing customers to chase it themselves with billing), and lastly wrote one of the best-written postmortems I've ever seen. AWS has much to learn from Google about how to handle outages properly.
I independently monitor availability of 150 public cloud services and only observed 1.73 hours of downtime for this event. This is the first EC2 outage I've observed in any region for over 6 months. According to my stats, in 2015 EC2 was highly available, with 6 of 9 regions (including every US and EU region) having no outages and a total average service availability of 99.998% (78% of the downtime was in sa-east-1). I haven't observed a single outage in us-west-2 or eu-west-1 in nearly 3 years. Compare that to 16-33 minutes of downtime in every region for GCE in 2015, and a total average availability of 99.995%. Additionally, since 2013 I've never observed a global EC2 outage like the 4/11/2016 GCE event.
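For a rough sense of what those availability percentages mean in wall-clock terms, here's a quick back-of-the-envelope conversion (plain Python; the only inputs are the figures quoted above, and the printed numbers are approximate):

```python
# Rough conversion of annual availability percentages to implied downtime.
# The percentages are the ones quoted above; this is illustrative arithmetic only.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_from_availability(availability_pct):
    """Return (hours, minutes) of downtime implied by an annual availability %."""
    downtime_hours = (1 - availability_pct / 100.0) * HOURS_PER_YEAR
    return downtime_hours, downtime_hours * 60

for label, pct in [("EC2 2015 (avg)", 99.998), ("GCE 2015 (avg)", 99.995)]:
    hours, minutes = downtime_from_availability(pct)
    print(f"{label}: {pct}% -> ~{minutes:.1f} min/year ({hours:.2f} h)")

# EC2 2015 (avg): 99.998% -> ~10.5 min/year (0.18 h)
# GCE 2015 (avg): 99.995% -> ~26.3 min/year (0.44 h)
```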
Most of the criticism seemed to be centered around communication/PR so statistics don't really address that. Definitely important when considering a provider though.
While the numbers are nice, if some outages only impact a subset of customers and your monitoring accounts aren't among them, it's hard to determine how good your monitoring data really is. If he was impacted by a 12-hour outage and you only show ~2 hours, that's a really significant difference.
I guess it really depends on how you monitor and how comprehensive it is. Do you monitor from multiple ISPs on different network paths in multiple regions/countries? Do you monitor each of the services under different load conditions and monitor multiple accounts? Sometimes "up" only tells part of the story.
2 hours aligns with the postmortem, 12 hours does not. The monitoring is based on a sampling of instances running in each service and region. Outages are verified from multiple network paths. While not comprehensive, over the years this has been generally accurate because most outage events have been network or power related, thus impacting most or all instances in the affected data center. There have been some isolated hardware failures, but they are rare.
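As a rough illustration of that kind of sampling (not the actual tooling), something like the following polls a test instance per region and logs its state once a minute; the probe URLs are placeholders, and a real setup would also probe from several independent network paths and accounts:

```python
# Minimal sketch of per-region sampling: poll a health endpoint on a test
# instance in each region and log UP/DOWN once a minute. URLs are hypothetical.
import time
import urllib.request

PROBES = {
    "ap-southeast-2": "http://probe-syd.example.com/health",  # placeholder
    "us-west-2": "http://probe-pdx.example.com/health",       # placeholder
}

def is_up(url, timeout=5):
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except Exception:
        return False

while True:
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    for region, url in PROBES.items():
        print(now, region, "UP" if is_up(url) else "DOWN")
    time.sleep(60)  # one sample per minute per region
```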
I may be mistaken, but 2 hours only aligns with their post mortem for 80% of the impacted instances. By their own account there were instances impacted until 8 AM, and a small number even after that.
Your monitoring sounds really comprehensive. That's a very cool way to advertise the service it's built on. Do you monitor service providers for outages that reduce capacity or increase latencies but otherwise the service is "up"?
The stats are simply to provide some context. I agree the criticism is centered around communication/PR, but it also seems to imply the outage was of the same magnitude or worse than GCE's global outage, which is inaccurate. Per HA best practices, multi-region load balancing or failover would have averted downtime during this event, whereas it would not have during the global GCE outage. Had this been a global outage, I think the postmortem and customer outreach would have been much better.
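For anyone curious what that multi-region failover looks like in practice, here's a minimal sketch using Route 53 failover records via boto3. The hosted zone ID, domain, ELB DNS names and health check ID are all placeholders, and this is just one way to set it up:

```python
# Sketch of multi-region DNS failover with Route 53 failover records.
# Zone ID, domain, ELB DNS names and health check ID are placeholders.
import boto3

r53 = boto3.client("route53")

def failover_record(name, elb_dns, role, health_check_id=None):
    rr = {
        "Name": name,
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": elb_dns}],
    }
    if health_check_id:
        rr["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rr}

r53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE",  # placeholder
    ChangeBatch={"Changes": [
        failover_record("app.example.com.", "syd-elb.example.com",
                        "PRIMARY", health_check_id="hc-placeholder"),
        failover_record("app.example.com.", "pdx-elb.example.com",
                        "SECONDARY"),
    ]},
)
```

When the health check on the primary (Sydney) endpoint fails, Route 53 starts answering with the secondary region's record instead, so traffic shifts without waiting on anyone paging through a console.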
> We used multiple availability zones and it didn't prevent downtime at all.
Can you explain this a little more? Amazon says this only affected one AZ, and they specifically note:
For this event, customers that were running their applications across multiple
Availability Zones in the Region were able to maintain availability throughout
the event.
Apart from one internal project which mistakenly had all its app server instances in -2b (ooops!) - all my production mobile app backends are spread across the 3 Sydney AZs. That's a few dozen EC2 app servers across about 15 projects.
My monitoring reported a worst case of 57 seconds of degraded connectivity - which was an instance in -2b going offline and the ELB not taking it out of the rotation very quickly; the app running on it saw interruptions, but only while waiting for the timeouts. Crashlytics and GA crash reporting didn't bat an eyelid... I had under 70 users active at the time, 1/3rd of them may have seen a minute or less of loading spinner if they'd fired off a UI-blocking API call during those 57 seconds. I'm not looking _super_ closely, but nothing I'm monitoring apart from EC2 - like RDS, S3, ELB, SNS - showed _any_ glitches (I'd _probably_ have caught even single-digit-second problems for _some_ of that...)
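For what it's worth, ~57 seconds is roughly what you'd expect from fairly relaxed classic ELB health check settings (check interval times unhealthy threshold, plus per-check timeouts). A sketch of tightening that via boto3 follows; the ELB name is a placeholder and the numbers are just one plausible choice, not a recommendation from AWS:

```python
# Sketch: tighten a classic ELB health check so a dead instance is pulled
# from rotation faster. With a 30 s interval and an unhealthy threshold of 2,
# removal can take about a minute; the values below cut that to ~10-15 s.
import boto3

elb = boto3.client("elb", region_name="ap-southeast-2")

elb.configure_health_check(
    LoadBalancerName="my-app-elb",    # placeholder
    HealthCheck={
        "Target": "HTTP:80/health",   # the app must serve this path
        "Interval": 5,                # seconds between checks
        "Timeout": 3,                 # per-check timeout (must be < Interval)
        "UnhealthyThreshold": 2,      # failed checks before removal
        "HealthyThreshold": 3,        # passing checks before re-adding
    },
)
```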
I'm actually quite happy with how everything went - we don't go to any particular heroic lengths to ensure HA or uptime, we just follow recommended best practice, and at least in this outage, that worked out fine for us (except for that project where all the app servers were in -2b, and I'm happy to wear that as our fuckup)
Agreed. I have a couple of clients that have a pretty substantial AWS spend, but the cost of switching to Azure is too high compared to the difference in offerings. You don't want to spend tens of thousands of dollars of developer time and risk switching datacenters for a small improvement.
It's difficult to change when you have reserved instances. A business decision yesterday might not be the best decision today. That being said, I think overall AWS is very good but Google is definitely starting to create some real competition which is positive for everyone.
I wonder whether, given the scale of AWS and certain AWS customers, a very large AWS customer would have the pull to say "Amazon, fire Fred" if AWS signed a postmortem with "-Fred". Just curious whether anonymous postmortems are company policy at certain places and why that might be.
Do you have proof? Your claims directly contradict what was written in their post mortem both in terms of time and scope. AWS might be using nice wording but I can't see them lying that much about the scope of the incident.
This also baffles me. Here are the times converted to AEST:
Sun June 5th
3:25 PM -- Initial Power outage
4:42 PM -- Instance launching in unaffected AZs restored
4:46 PM -- Power Restored
6:00 PM -- 80% of instances recovered
7:49 PM -- DNS recovered
Mon June 6th
1:00 AM -- Almost all instances recovered
Heh - I love the image in my head of the flywheel providing a few extra seconds of power to the coffee urn in the Blackwoods warehouse out the back and to all the fan heaters and big screen TVs in Toongabbie - just as Foxtel, Dominos, and Channel 9's Nagios dashboards all start turning red and their ops staff phones start beeping.
>> The specific signature of this weekend’s utility power failure resulted in an unusually long voltage sag (rather than a complete outage)
It is false to assume that the state of the electrical supply is either on or off. This may come as a surprise, but not to me. In 2008, Eskom (South Africa's electricity supplier) experienced similar faults. The mains supply voltage is 220v here. At one point, some devices in my house started to fail, while others, such as lights, continued to work, but significantly dimmer. We measured 180v at the plugs. There were similar outages in my area last year, where an outright cut-off was preceded by voltage drops. This outage is interesting because it is an example of a bug owing to false assumptions!
There have also been incidents where certain cables have been stolen [1], and that has caused the opposite: voltage spikes.
[1] I couldn't tell you which, or what kind, but I remember it has something to do with "the neutral"
It doesn't sound like they were treating it as binary; they have breakers in place for brownout detection too -- those breakers just weren't triggered fast/early enough in the brownout.
I love reading about problems like these, it's great that Amazon is forthcoming about them. There's always some new wrinkle.
E.g. in this case, in normal operation, power from the utility power grid spins a flywheel. When the grid fails, the flywheel provides a holdover until Amazon's diesel generators can start.
But in this failure the voltage from the grid sagged, rather than going away completely. The breaker isolating the flywheel from the grid didn't open quickly enough. So power from the flywheel was sent out to the grid. It didn't succeed in powering the grid for very long. Oops.
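A toy model of that failure mode, if it helps: an undervoltage breaker that only opens after the voltage stays below a threshold for some hold time will trip late (or never) on a sag that hovers near that threshold. All the numbers below are invented for illustration and have nothing to do with AWS's actual settings:

```python
# Toy model: an undervoltage breaker that opens only after the voltage has
# stayed below a threshold for a hold time. A clean outage trips it quickly;
# a shallow sag that sits above the threshold never trips it at all.
# All thresholds and traces are made up for illustration.
NOMINAL_V = 230.0
TRIP_THRESHOLD_V = 0.80 * NOMINAL_V   # open breaker below 80% of nominal
TRIP_DELAY_S = 0.5                    # voltage must stay low this long
SAMPLE_S = 0.1                        # sample spacing of the trace

def breaker_trip_time(voltage_trace):
    """Return the time at which the breaker opens, or None if it never does."""
    low_since = None
    for i, v in enumerate(voltage_trace):
        t = i * SAMPLE_S
        if v < TRIP_THRESHOLD_V:
            low_since = t if low_since is None else low_since
            if t - low_since >= TRIP_DELAY_S:
                return t
        else:
            low_since = None
    return None

hard_outage = [230.0] * 10 + [0.0] * 40      # complete loss of supply
shallow_sag = [230.0] * 10 + [190.0] * 40    # sag that stays above threshold

print("hard outage trips at:", breaker_trip_time(hard_outage))  # ~1.5 s
print("shallow sag trips at:", breaker_trip_time(shallow_sag))  # None
```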
Something else must have failed or was not properly configured, because backup diesel generators should kick in after 2-15 seconds of voltage drop, regardless of the flywheel. The flywheel is used in critical systems to cover only that <1m gap.
Amazon addresses that in their report. Each UPS is fed by generator power and grid power. Because the UPSes had been forced to try to supply the grid, and because they are giant spinning weights that you really don't want to go wrong and kill someone or destroy property, they did a safety inspection before powering them back up, which meant a delay before the facility could be supplied by generator power.
I'm a bit dubious about their "if you used multi-AZ you'll be fine" when my multi-AZ Elastic Beanstalk application had multiple outages of over an hour. Methinks the load balancers aren't as magical as they'd like to make out.
> Methinks the load balancers aren't as magical as they'd like to make out.
Agreed. I'm still put off by the fact that ELBs specifically can not handle a sudden spike in traffic orders of magnitude higher than the previous rate. They fall on their face, bad. If you expect a spike like that to happen, you literally have to submit a ticket and ask them to pre-warm your ELB...
I knew that this was a big event when it happened last Sunday, because the AWS service status page had a yellow triangle rather than a green tick. Usually when they have an outage, they just put a tiny blue 'i' on the green tick...