
I independently monitor the availability of 150 public cloud services and only observed 1.73 hours of downtime for this event. This is the first EC2 outage I've observed in any region in over 6 months. According to my stats, in 2015 EC2 was highly available, with 6 of 9 regions (including every US and EU region) having no outages and a total average service availability of 99.998% (78% of all downtime occurred in sa-east-1). I haven't observed a single outage in us-west-2 or eu-west-1 in nearly 3 years. Compare this to 16-33 minutes of downtime in every region for GCE in 2015, and a total average availability of 99.995%. Additionally, since 2013 I've never observed a global EC2 outage like the 4/11/2016 GCE event.
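For intuition, those availability percentages can be converted into expected downtime per year. A minimal sketch (assuming a 365-day, 8,760-hour year; the function name is my own):

```python
def annual_downtime_minutes(availability_pct: float, hours_per_year: float = 8760) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * hours_per_year * 60

# The figures quoted above:
print(f"99.998% -> {annual_downtime_minutes(99.998):.1f} min/yr")  # ~10.5 min/yr
print(f"99.995% -> {annual_downtime_minutes(99.995):.1f} min/yr")  # ~26.3 min/yr
```

The 99.995% figure works out to roughly 26 minutes per year, consistent with the 16-33 minutes of per-region GCE downtime mentioned above.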

https://cloudharmony.com/status-for-aws




Most of the criticism seemed to be centered around communication/PR, so statistics don't really address that. They're definitely important when considering a provider, though.

While the numbers are nice, if some outages only impact a subset of customers and your monitoring accounts aren't among them, it's hard to determine how good your monitoring data really is. If he was impacted by a 12-hour outage and you only show ~2 hours, that's a really significant difference.

I guess it really depends on how you monitor and how comprehensive it is. Do you monitor from multiple ISPs on different network paths in multiple regions/countries? Do you monitor each of the services under different load conditions and monitor multiple accounts? Sometimes "up" only tells part of the story.


2 hours aligns with the postmortem, 12 hours does not. The monitoring is based on a sampling of instances running in each service and region. Outages are verified from multiple network paths. While not comprehensive, over the years this has been generally accurate because most outage events have been network or power related, thus impacting most or all instances in the affected data center. There have been some isolated hardware failures, but they are rare.
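The multi-path verification described above could look something like this quorum check (a hypothetical sketch of my own, not CloudHarmony's actual implementation): an outage is only recorded when probes from a majority of independent network paths agree the target is unreachable, filtering out false positives caused by a single bad route.

```python
def confirmed_outage(probe_results: list[bool], quorum: float = 0.5) -> bool:
    """probe_results: True means that probe location saw the target as down.

    Returns True only when more than `quorum` of the independent
    network paths agree, so one flaky route doesn't count as an outage.
    """
    down = sum(probe_results)
    return down / len(probe_results) > quorum

confirmed_outage([True, True, False])   # True: 2 of 3 paths agree it's down
confirmed_outage([True, False, False])  # False: likely a local network issue
```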


I may be mistaken, but 2 hours only aligns with their postmortem for 80% of the impacted instances. By their own account there were instances impacted until 8AM, and a small number even after that.

Your monitoring sounds really comprehensive. That's a very cool way to advertise the service it's built on. Do you monitor service providers for outages that reduce capacity or increase latencies while the service is otherwise "up"?


The stats are simply to provide some context. I agree the criticism is centered around communication/PR, but some of it also seems to imply the outage was of the same magnitude or worse than GCE's global outage, which is inaccurate. Per HA best practices, multi-region load balancing or failover would have averted downtime during this event, whereas it would not have during the global GCE outage. Had this been a global outage, I think the postmortem and customer outreach would have been much better.


You're right. It's not quite as bad as apples to oranges but comparing them without contextualizing is hardly fair.





