
I independently monitor the availability of 150 public cloud services and only observed 1.73 hours of downtime for this event. This is the first EC2 outage I've observed in any region in over 6 months. According to my stats, in 2015 EC2 was highly available, with 6 of 9 regions (including every US and EU region) having no outages and a total average service availability of 99.998% (78% of all downtime occurred in sa-east-1). I haven't observed a single outage in us-west-2 or eu-west-1 in nearly 3 years. Compare this to 16-33 minutes of downtime in every region for GCE in 2015, and a total average availability of 99.995%. Additionally, since 2013 I've never observed a global EC2 outage like the 4/11/2016 GCE event.
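For intuition, those availability percentages can be converted into expected downtime per year. A minimal sketch (assuming a 365-day, 8,760-hour year; the function name is my own):

```python
def annual_downtime_minutes(availability_pct: float, hours_per_year: float = 8760) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * hours_per_year * 60

# The figures quoted above:
print(f"99.998% -> {annual_downtime_minutes(99.998):.1f} min/yr")  # ~10.5 min/yr
print(f"99.995% -> {annual_downtime_minutes(99.995):.1f} min/yr")  # ~26.3 min/yr
```

The 99.995% figure works out to roughly 26 minutes per year, consistent with the 16-33 minutes of per-region GCE downtime mentioned above.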

https://cloudharmony.com/status-for-aws




Most of the criticism seemed to be centered around communication/PR, so statistics don't really address that. They're definitely important when considering a provider, though.

While the numbers are nice, if some outages only impact a subset of customers and your monitoring accounts aren't among them, it's hard to determine how good your monitoring data really is. If he was impacted by a 12-hour outage and you only show ~2 hours, that's a really significant difference.

I guess it really depends on how you monitor and how comprehensive it is. Do you monitor from multiple ISPs on different network paths in multiple regions/countries? Do you monitor each of the services under different load conditions and monitor multiple accounts? Sometimes "up" only tells part of the story.


2 hours aligns with the postmortem, 12 hours does not. The monitoring is based on a sampling of instances running in each service and region. Outages are verified from multiple network paths. While not comprehensive, over the years this has been generally accurate because most outage events have been network or power related, thus impacting most or all instances in the affected data center. There have been some isolated hardware failures, but they are rare.
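The multi-path verification described above could look something like this quorum check (a hypothetical sketch of my own, not CloudHarmony's actual implementation): an outage is only recorded when probes from a majority of independent network paths agree the target is unreachable, filtering out false positives caused by a single bad route.

```python
def confirmed_outage(probe_results: list[bool], quorum: float = 0.5) -> bool:
    """probe_results: True means that probe location saw the target as down.

    Returns True only when more than `quorum` of the independent
    network paths agree, so one flaky route doesn't count as an outage.
    """
    down = sum(probe_results)
    return down / len(probe_results) > quorum

confirmed_outage([True, True, False])   # True: 2 of 3 paths agree it's down
confirmed_outage([True, False, False])  # False: likely a local network issue
```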


I may be mistaken, but 2 hours only aligns with their postmortem for 80% of the impacted instances. By their own account there were instances impacted until 8AM, and a small number even after that.

Your monitoring sounds really comprehensive. That's a very cool way to advertise the service it's built on. Do you monitor service providers for outages that reduce capacity or increase latencies while the service is otherwise "up"?


The stats are simply to provide some context. I agree the criticism is centered around communication/PR, but some of it also seems to imply the outage was of the same magnitude or worse than GCE's global outage, which is inaccurate. Per HA best practices, multi-region load balancing or failover would have averted downtime during this event, whereas it would not have during the global GCE outage. Had this been a global outage, I think the postmortem and customer outreach would have been much better.


You're right. It's not quite as bad as apples to oranges but comparing them without contextualizing is hardly fair.





