
Edit: <snipped> out my rant.

It's been a long day because of this. Just going to leave it at that.




They have some services that are "global", i.e., not tied to a given region. Those services' requests are actually processed all over the place, but South Central is a big datacenter, apparently the 9th biggest in the world. When it lost cooling and shut down, everything routed around it as planned... but that caused so much extra traffic that it overwhelmed the connections to other datacenters. The backlog of requests is tremendous, of course, so even after they got South Central back up, all the other datacenters were way over their traffic capacity. They've got the datacenter back up and are now restoring storage and storage-dependent services.
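For anyone wondering why things stay swamped even after the datacenter comes back: every client that timed out is retrying, and if they all retry in lockstep the backlog just hammers the recovering region again. The standard client-side mitigation is exponential backoff with jitter. A minimal, generic Python sketch (nothing Azure-specific; TransientError is a stand-in for whatever "try again later" error your SDK raises):

    import random
    import time

    class TransientError(Exception):
        """Stand-in for whatever 'try again later' error your SDK raises."""

    def call_with_backoff(request, base=0.5, cap=60.0, max_attempts=8):
        """Retry a flaky call with capped exponential backoff plus full jitter.

        Jitter spreads the retries out in time, so a recovering service sees
        a smoother trickle instead of the whole backlog arriving at once.
        """
        for attempt in range(max_attempts):
            try:
                return request()
            except TransientError:
                if attempt == max_attempts - 1:
                    raise  # out of attempts, surface the error
                # Sleep somewhere in [0, min(cap, base * 2**attempt)).
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))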

Honestly it's hard to imagine a good mitigation for this. "Build more datacenters" is already happening as fast as it can. "Keep enough spare capacity around to handle losing one of the biggest datacenters in the world" is pretty unreasonable.

If you, as a customer, are uptime-focused enough that it's worth paying extra, then the sensible practice has always been cross-cloud infrastructure and failover, at least since the Amazon Easter failure of 2011. That's what giants like Netflix do.
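Concretely, the cheapest form of cross-cloud failover is the same service deployed behind two providers plus a health check that decides which one to talk to (usually done in DNS or a global load balancer rather than in every client). A toy Python sketch with made-up endpoint URLs, just to show the shape of it:

    import urllib.request

    # Hypothetical endpoints for the same service deployed in two clouds.
    ENDPOINTS = [
        "https://api.primary-cloud.example.com/healthz",
        "https://api.secondary-cloud.example.com/healthz",
    ]

    def pick_endpoint(endpoints=ENDPOINTS, timeout=2.0):
        """Return the first endpoint whose health check answers 200 OK."""
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except OSError:
                continue  # unhealthy or unreachable, try the next cloud
        raise RuntimeError("no healthy endpoint in any cloud")

Real setups do this with DNS failover or anycast plus pre-provisioned capacity on the secondary, which is where the "paying extra" part comes in.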


> "Keep enough spare capacity around to handle losing one of the biggest datacenters in the world" is pretty unreasonable.

Err what?

It's entirely reasonable to expect Azure to handle the loss of a single DC and not have a 14+ hour global outage. I don't care how big the DC is, losing one should not take out the world, especially not for the length of time this one has been going on.


Indeed. This article by AWS VP James Hamilton gives a unique insight into how Amazon approaches the problem of sizing data centers for redundancy:

https://perspectives.mvdirona.com/2017/04/how-many-data-cent...


Here I was hoping this was a reference to James Mickens: https://blogs.microsoft.com/ai/james-mickens-the-funniest-ma...


Agreed.

With that, though, it sounds like the size of this datacenter is way out of scale compared to the rest of their DCs. They are really going to need to break apart the services they host there to make sure that DC-to-DC and region-to-region failover works correctly.


> It's entirely reasonable to expect Azure to handle the loss of a single DC and not have a 14+ hour global outage.

(I apologize if the following sounds snarky. I don't mean it that way, I just can't find better wording.)

Microsoft has repeatedly violated my sense of "reasonable" in the past, including in recent times with Windows 10. Therefore this kind of glitch isn't very shocking to me.


> Honestly it's hard to imagine a good mitigation for this.

Besides the one that AWS and GCP have implemented? That is, to have at least N+1 datacenters? Actually, I think N+1 is the old Google prod regime. I suspect that GCP is at least N+1 per continental region, and I'd be surprised if AWS isn't as well.
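The N+1 arithmetic is also why this is expensive: to survive the loss of any one site, the remaining sites have to absorb its traffic, so everything runs with permanent headroom. A quick back-of-the-envelope sketch:

    def max_safe_utilization(n_sites, failures_to_tolerate=1):
        """Fraction of each site's capacity usable in steady state while
        still absorbing the load when `failures_to_tolerate` sites die."""
        surviving = n_sites - failures_to_tolerate
        return surviving / n_sites

    # 4 sites in a region, tolerate 1 loss -> each runs at <= 75% capacity.
    print(max_safe_utilization(4))   # 0.75
    # Only 2 sites -> each has to idle half its capacity all the time.
    print(max_safe_utilization(2))   # 0.5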


But you are basically saying that Azure cannot deliver the redundancy they charge their customers for?


Just want to point out that once Azure launches their submersible datacenter units Azure Functions may literally become dead in the water.


One of the demos at Build was Azure Stack running on oil rigs, so you're not far off.


I, for one, salute you, sir. Well played.



