Microsoft Azure Outage

pkorzeniewski · on Nov 19, 2014

Azure support is probably the worst one I had ever deal with. When my account (and service itself) stopped working, I haven't received any email. When I tried to sign in, all I got was some generic error saying "There's something wrong with your account". My services of course were down and I couldn't do ANYTHING. I've contacted the support to learn that my account has been blocked (!) because there was some suspicious (!!) activity going on. What the... No, they couldn't tell me what exactly it was. I've exchanged emails back and forth with the support for several days to learn nothing new, my account and services were still disabled and I was more than pissed off. From that day I hate Azure and I advise anyone against using it, because such situation is absolutely unacceptable.

breischl · on Nov 19, 2014

Interesting, I actually just got one of those "suspicious activity" emails last week. But they just threatened to shut down my services, they haven't actually done it (yet).

It's still pretty obnoxious, because they basically said "there's something suspicious going on, we won't tell you what, but you better fix it or we'll shut you down." Gee, thanks. After a week of daily emails they finally responded with a network trace... that showed we're doing some outbound HTTP calls. That's it - they redacted everything except for the ports and the first two octets of the destination IP addresses. So very helpful, and certainly looks suspicious... </sarc>

Someone1234 · on Nov 19, 2014

Wait, what? Outbound HTTP is enough to get your account shut down? The hell! What if your web-app is utilising any API in the world, the vast majority are over HTTP/S utilising JSON or XML (and sometimes you need back end API access rather than client API access, like updating a product database).

breischl · on Nov 19, 2014

Officially, no. Actually we've been doing a lot of outbound HTTP for years with no problems. But somebody (they won't tell me who) complained, and they "investigated" and decided that something we're doing looks malicious, but won't tell me exactly what. They just sent me an unhelpful network trace.

It's still possible there is something malicious going on, but they've been stunningly unhelpful in finding it. Mostly they've just asserted that there's a needle somewhere in our haystack, and we'd better go find it. They've been threatening to shut us down, but haven't actually done it.

donatj · on Nov 19, 2014

I had a similar issue with azure where one of my drives just up and disappeared. Took days(!) to get it back. I'd never use Azure again.

blablabla123 · on Nov 19, 2014

There is probably a lot to complain about Azure, especially when it comes to proactive communication, dirty hacks and Windows weirdness. But their support was actually really nice to me...

Touche · on Nov 19, 2014

I had a similar situation with DigitalOcean. Droplets were taken offline with no notification whatsoever, only answered after I submitted a ticket. This has happened to me twice.

debacle · on Nov 19, 2014

Why were they taken offline? I had a droplet taken offline for security reasons once and they were very communicative and responsive.

Touche · on Nov 19, 2014

Same reason. Did they send you a notification that it was going to happen? I did not, and only discovered it when the application being hosted on the droplet wasn't loading. Luckily it was just a personal tool.

debacle · on Nov 19, 2014

They opened a ticket and sent me an email before they turned off the NIC.

atmosx · on Nov 19, 2014

Days? What do you "days"???? Azure was down for like 9 hours.

garretraziel · on Nov 19, 2014

I think that he is not referring to this incident.

tiagocesar · on Nov 19, 2014

I love how deleted comments actually never disappear.

photorized · on Nov 19, 2014

I feel like an idiot. MS featured my Azure startup today, quoting me about overall stability etc (which has been the case for us, until today). They then proceeded to go down, taking all our production systems with them.

(yes we do have AWS, too)

Sigh.

photorized · on Nov 19, 2014

Microsoft, you've got to be kidding me. Just tried opening a billing ticket, completed the forms in detail, attached screenshots, clicked Submit...

'Unable to Submit Request We are unable to complete the incident submission process at this time. Please refer to this page for phone numbers to call for Azure support.'

Encosia · on Nov 19, 2014

For what it's worth, I've been watching several Azure-hosted sites that I control and they've been coming back online sequentially (and are all back online now). Whatever they're fixing, it seems to be taking some time, but is progressing steadily at a good pace in the last hour.

photorized · on Nov 19, 2014

Mine have been gradually coming back, too.

Timing couldn't have been any better for me. Some Alanis material there:

https://twitter.com/bizspark/status/534858596748906496

Now I'll have to distribute between AWS and Azure, too.

photorized · on Nov 19, 2014

They all have outages periodically, in my experience.

Encosia · on Nov 19, 2014

Yeah, that's the main thing to take away from these events. I've never seen a system (cloud, local, co-located, or otherwise) with 100% uptime, despite every effort to the contrary. Even sites like Facebook and Google have downtime. In the last few years that I've been using it, Azure has been at least as stable as other cloud providers.

That doesn't help with the awful timing though. Ouch. I just Buffer-retweeted your BizSpark tweet above, scheduled for tomorrow. Maybe a little bump now that you're back up and running will help ease the pain...

photorized · on Nov 19, 2014

Thank you.

icantthinkofone · on Nov 19, 2014

For nine hours?

toomuchtodo · on Nov 19, 2014

Why not stick with AWS? (Just curious!)

osipov · on Nov 19, 2014

Check out Softlayer. It has been more reliable than AWS in my experience.

toomuchtodo · on Nov 19, 2014

Really? I manage several hundred VMs and their associated EBS volumes on AWS and we've had 0 problems. Also no problems with S3. Ever.

razzberryman · on Nov 19, 2014

They should run their support system on AWS since they're likely to get a lot of ticket requests if Azure is down. :)

higherpurpose · on Nov 19, 2014

Azure has had quite a few outages this year. I'd say it's already lower than that 99.999 percent uptime or w/e they are advertising.

rbanffy · on Nov 19, 2014

According to https://cloudharmony.com/status-1year-for-azure, they didn't reach 99.99%.

toyg · on Nov 19, 2014

0.001% of 365 days is 8.76 hours. So yeah, shot for the year; but of course they'll do some "hollywood timekeeping" (or just ignore the matter altogether) and keep advertising...

kazoolist · on Nov 20, 2014

0.001% of 365 days = 5.26 minutes (http://www.wolframalpha.com/input/?i=0.001%25+of+365+days)

cddotdotslash · on Nov 19, 2014

And the year isn't even over yet!

cddotdotslash · on Nov 19, 2014

Just curious - if you have AWS too, then why did it take everything down? Can't you just swap the DNS?

photorized · on Nov 19, 2014

To answer your question: some parts of our SaaS (e.g. data gathering/processing) are on both AWS and Azure, but the customer-facing portal web app is 100% on Azure (in two regions North East and North Central), so we couldn't just swap the DNS.

We're changing that now, will need to replicate across different cloud providers, too. We're changing a lot because of last night's outage.

philwelch · on Nov 19, 2014

Does DNS propagate quickly enough to alleviate an outage or is it just a matter of ensuring that you recover within a few hours rather on waiting on an outage resolution that might take longer?

Alternately, can't you just have multiple A records to distribute your load across cloud platforms and just drop the one for whichever platform is having an outage?

latch · on Nov 19, 2014

From experience with multi-datacenter setups, if you set a 60second TTL on your DNS records, you'll see 95%+ of traffic get the update within 5 minutes.

Also, you can associate multiple addresses with a record. It's up to the client to retry on failure, but all browsers do (as far as I know)

cube00 · on Nov 19, 2014

Wouldn't that kill DNS if everyone did that considering it relies on caching for performance across the world?

latch · on Nov 19, 2014

I'm no expert, but no. Most big sites rely on a fairly short TTL.

It's a thick layer of caches. Your browser, OS, router, ISP, and a bunch of intermediaries can cache the DNS. So even at 60s, you get good cache hits (the busier, the more true that is, of course)

Also, the update can always happen asynchronously. You and 9999 people ask your ISP for Facebook's IP. It serves all of you a slightly stale IP and asynchronously fetches a new one (thus turning 10000 requests into 1). AKA: thundering heard problem.

DNS mostly uses UDP, which is more efficient for the server and harder to DOS (the server doesn't have to maintain state per request).

Finally, # of requests is usually (always?) a factor in the price of DNS services. So the cost is borne by the clients, not the service providers. And since DNS hosting is seemingly profitable, I assume they're more than happy to build up the infrastructure to deal with additional requests.

jbergens · on Nov 19, 2014

But there is a difference between a big site and many small sites regarding DNS caching. If 10 big sites has 1 million requests each within an hour most requests will be cached. If 1 million small sites has 10 requests each within an hour most request will NOT be cached but forcing a cache-refill. I think that might strain the DNS infrastructure.

silverbax88 · on Nov 19, 2014

The idea of cloud storage being down is less of an issue - I don't like it, but I understand it. What bothers me about this is:

1. I was never notified of the outage. I noticed it myself when attempting to log into one of my VMs and then started looking for status updates. Sadly, the best status updates I got were here on Hacker News.

2. When my servers did come back up, at least one of my IP addresses had changed, which meant I had to update all of the relevant DNS entries (which, as everyone here no doubt knows, can take up to 48 hours to propagate). I was never notified of this change in any way.

Maarten88 · on Nov 19, 2014

2. Azure does not guarantee that you keep your ip address by default. You should configure a cname if you use Azure Websites or get a reserved ip address, available with Cloud Services

silverbax88 · on Nov 19, 2014

Actually, I am supposed to have a fixed IP. And I have CNAMES configured. I will review to determine if there is some other way I can set this up, but my issue is that I would never allow my products to be offline for hours without notifying my customers.

coreysa · on Nov 20, 2014

First, I am really sorry about the impact the incident had on your service. Specifically, for your changing IP, are you currently using the reserved IP feature. You can reserve both your external IP and your internal IP.

External IP: http://msdn.microsoft.com/en-us/library/azure/dn690120.aspx

Internal IP: http://msdn.microsoft.com/en-us/library/azure/dn630228.aspx

badgersandjam · on Nov 19, 2014

Also as I found out, set the TTL on your CNAME very low.

Also this is a PITA if you use the @ entry in your DNS.

mnutt · on Nov 19, 2014

Why should you have to keep the CNAME low if it always resolves to the same domain on Azure?

badgersandjam · on Nov 19, 2014

Only because some caching DNS resolvers are broken and keep the referral cached for the TTL of the CNAME record.

shaydoc · on Nov 19, 2014

Do you having monitoring setup on your cloud vm, because you should have alerts triggering emails to notify you when this is happening.

Secondly, you are using an IP address and expecting that to be static? The recommended approach is to use a CNAME so you don't hit that issue, alternatively, you can have up to 5 Reserved-IPs per subscription and attach that Reserved-IP to your VM : New-AzureReservedIP from powershell

Edit : see http://azure.microsoft.com/blog/2014/05/14/reserved-ip-addre...

silverbax88 · on Nov 19, 2014

Yes, I do. That's why I was checking my VMs. This still doesn't absolve my cloud server provider from notifying me when they have a global system failure.

yourad_io · on Nov 19, 2014

> which, as everyone here no doubt knows, can take up to 48 hours to propagate

I think that's been largely dispelled.

I can't find the link right now unfortunately but I remember a post looking into DNS propagation realities from either this or last year, and they found that overwhelming majority of DNS servers they tried (99%+) respected the TTLs set exactly as they should. *

My personal rule of thumb is, if it hasn't propagated within an hour, I need to look at it again because I messed up.

Tools like this [1]are invaluable when you're paranoid about whether your new record has propagated.

[1] https://www.whatsmydns.net

* ugh. Does anyone know which post I'm talking about? My google-fu is failing me hard.

vijaykiran · on Nov 19, 2014

Perhaps not the link you are looking for - but there was some discussion on here sometime ago - https://news.ycombinator.com/item?id=3397253

hosay123 · on Nov 19, 2014

Sadly this is a common situation. They appear to hold off making any public notice (including often on their own "service dashboards") until support forums are screaming with upset users.

Google App Engine has had numerous outages like this, the only one I can find any public documentation for being a 6 hour outage in 2012: http://googleappengine.blogspot.co.uk/2012/10/about-todays-a... (and let's not forget the old-style Datastore corruption incident, where every App Engine user got to manually merge split-brain database tables after a messed up failover)

dewitt · on Nov 19, 2014

Hi hosay123,

The App Engine team has a proactive policy about posting about downtime:

https://groups.google.com/forum/#!forum/google-appengine-dow...

Since the team highlights basically anything that looks like it is impacting customers, the issues don't always warrant a stand-alone blog post, but you'll notice that generally speaking the last post in each thread is a full public post-mortem with diagnosis and remediation.

Let me know if there's more you think might be useful for you as a GAE customer. Thanks!

blablabla123 · on Nov 19, 2014

I'm not trying to be evil but I used to migrate a small website away from GAE because of continuing issues and very limited communication from Google.

Azure isn't perfect but their support is way more responsive, at least to us eventhough we don't have any fancy support contract. Eventhough my GAE experience was 2-3 yrs ago, I have to say that Azure has way less issues.

patwhite · on Nov 19, 2014

So, the worst part about this is that zero communication has come out of Microsoft - we first started seeing issues on Sunday and filed a ticked, had an open ticket while this larger outage happened, and haven't gotten a single email saying there's an outage. I found out about it from, sigh, buzzfeed.

Question - are AWS or GCE better at proactively messaging when there's an outage?

crb · on Nov 19, 2014

Google's operations groups are not only regularly updated during an outage, there's a root-cause analysis with remediation and prevention information posted a couple of days after any issue.

See https://groups.google.com/forum/#!forum/gce-operations and https://groups.google.com/forum/#!forum/google-appengine-dow....

nkvoll · on Nov 19, 2014

I've never ever received a message from AWS when they've had outages that have been affected us significantly. On the contrary, there's been multiple cases where we've experienced issues, contacted them and it's taken a few hours before they realize they're actually having infrastructure problems. Many of these don't even get an entry on their service status pages. So there's still a lot of room for improvement on AWS's side of things as well.

bad_user · on Nov 19, 2014

I can confirm this. I remember once when half of the Internet was down and the status reported for EC2 was yellow - experiencing some minor issues :-)

And I find out about it by yelling at Heroku - they told me that Amazon is having issues before Amazon's status turned yellow.

kalleboo · on Nov 19, 2014

Usually when AWS has an outage they have a nice green circle but with a small blue "i" next to it that you need a loupe to see. Extremely dishonest.

jread · on Nov 19, 2014

I run a site that monitors cloud service availability. Based on VMs and Blob storage containers I maintain and monitor, the outage affected every US Azure region with 1-2 hours of downtime: https://cloudharmony.com/status-for-azure

razzberryman · on Nov 19, 2014

Wow, comparing that to AWS is staggering!

https://cloudharmony.com/status-for-aws

etha · on Nov 19, 2014

It's not as if AWS has never gone down (http://aws.amazon.com/message/65648/). It just hasn't had a major outage in the last 30 days.

rbanffy · on Nov 19, 2014

Comparing https://cloudharmony.com/status-1year-for-azure with https://cloudharmony.com/status-1year-for-aws tells a different story.

jread · on Nov 19, 2014

I don't recall an AWS outages of this magnitude. Most cloud outages, including AWS tend to be data center/AZ specific.

maxsec · on Nov 19, 2014

you need to add in the new Australian Clusters

zaroth · on Nov 19, 2014

That status table with all the randomly located green checks is painful to look at... I guess a green check in the 'Global' column implies a green check in all location specific columns? But what about all the rows which have no Global green check, but most columns are still empty? Are those regions where the service is not deployed? Can we gray out those boxes or something if they are 'N/A'?

Also, funny if you try to zoom out in Chrome to see the whole thing, the row headers get out of alignment.

Why would I want to 'X' out specific rows/columns in the table? It was so complicated to begin with, someone thought adding more complication through end-user customization was a good idea? I just noticed, you can even expand some of the rows...

Seriously, a status page should tell you either "It's up" or "What's down". It's not even showing history over time, this is just a snapshot. The text at the top directly contradicts the icons in the table, making the whole thing even more ridiculous.

The footnote at the bottom is the best, "The Australia Regions are available only to customers with billing addresses in Australia and New Zealand." Thanks for that useful nugget! /s

tawy1 · on Nov 19, 2014

History: http://azure.microsoft.com/en-us/status/#history

unclesaamm · on Nov 19, 2014

Running Chrome 38.0.2125.111 m, zoomed out row headers look fine

ballstothewalls · on Nov 19, 2014

Are you confusing row headers and column headers? I have the same version as you and the row headers got funky when zoomed out.

razzberryman · on Nov 19, 2014

http://i.imgur.com/Uv81QYi.png

_bbks · on Nov 19, 2014

The most damaging part to me is that "All good! Everything is running great." message on the status page.

Mistakes happen, services go down, I can get over that. What matters is how its dealt with. At the moment I would not want to be an Azure customer dealing with 9 hours+ downtime whilst MS are saying everything is great. At the very least change it to "Having some issues" or similar!

teovall · on Nov 19, 2014

The postmortem for this should make for a good read. How does storage go down in eleven regions at once?

ohyesyodo · on Nov 19, 2014

Just apply same buggy network patch to all DCs at once? They use software networking so causing something like this should be easy. Or mess up network routing for *.blob.core.windows.net which pretty much all of Azure relies on.

icebraining · on Nov 19, 2014

Isn't applying the same patch everywhere at once a major anti-pattern?

ohyesyodo · on Nov 20, 2014

Turns out this was exactly what happened - they applied a buggy patch to all data centers at once by mistake.

davis · on Nov 19, 2014

photorized · on Nov 19, 2014

I suspect a human config error. I don't see how else multiple services, in multiple regions, can all be affected at once.

Looking forward to the post mortem.

inglor · on Nov 19, 2014

Our sites have been down for more than 3 hours now.

EDIT2: Now the databases are down, this is costing us a lot of money. EDIT: Just went up again.

It would be great if anyone knows how to mitigate these in the future - what can I do to protect myself against this in the future? (Except leave Azure)

joshuak · on Nov 19, 2014

Major outages should absolutely weigh into your decisions as to what platform to use. That being said you can mitigate the effect of instability by engineering your app to failover to other availability zones or even to another cloud platform (depending on your app) if the entire platform goes down.

Obviously there is a segnifigant cost associated with engineering this level of cross platform redundancy which is why reliability is an important factor in making your platform choices. If you can tolerate some downtime, you can be more flexible, otherwise it will costs one way or the other.

In any case you should consider having a user notification site setup on a completely different service (or two) so that when things go wrong you can redirect everyone to that site to keep your customers informed. This is especially important when you have partial outages that could create inconstancies in your database or application state if you where to continue to allow users to interact with it in a degraded state.

inglor · on Nov 19, 2014

Thanks! This is very helpful.

Our big hosted site is hosted in Europe is actually working but our blogs and a news website are both down. We offer a paid service at 600$ a year and if the main site was down it would be very bad for our reputation.

Our DNS points to Azure on all these domains and things are hosted as "Azure Web Site" - how would notifications work if Azure itself is failing? Would I need to proxy the traffic through elsewhere?

Are there any services that solve this problem for me? I really don't mind paying a few dollars every month and not worry about this.

joshuak · on Nov 19, 2014

There are any number of uptime, and ping services that you can google for. This can raise the alarm in a timely fashion when your site, or parts of your site go down, and then you decide how to handle those issues.

You may also want to google for DNS failover services, to help you automatically redirect traffic in more catastrophic failure cases. There are offerings from google[1], AWS[2], and others.

[1]: https://cloud.google.com/dns/docs

[2]: http://aws.amazon.com/route53/

jfroma · on Nov 19, 2014

Our main cluster is on azure west us but we have another cluster on amazon east and route53 on top of that. When the main clusters fails, route53 switch to secondary, so we where not affected at all this time.

The only manual step was to delay the switch back until our vms where working fine and had all resources. We do this changing route53 health check to one that is always failing.

We had also to purge our crashed mongo nodes because the journal was broken.

https://auth0.com/availability-trust/img/auth0-infrastructur...

barkingllama · on Nov 19, 2014

On site disaster recovery? Off site disaster recovery? Split your hosting between multiple providers?

It really depends on how much risk you're willing to accept, and how much that is worth to you. It can be quantified via revenue lost, but reputation is much harder to put a number on.

inglor · on Nov 19, 2014

If I had to quantify this - 3 hours * 3 people who can't work and publish posts + about a week of marketing costs for damaged rep (apologies, PR, ads for exposure). I'd say that for the very least this cost us at least 1000$ and probably north of 3000$.

This is not the first time this has happened in the last two months (after a relatively reliable year). The problem is I'm not sure any other hosting provider would do any better.

freehunter · on Nov 19, 2014

So the question becomes, would putting a DR site on AWS or Google cost more than the $3000 this outage cost you? If the answer is no, wouldn't it be worth architecting to not put all of your eggs in one basket?

Be mad at the service provider if they don't live up to the number of nines they promised. Be mad at yourself if you expected more nines than they can deliver.

duncans · on Nov 19, 2014

Look into Cloudflare. They can act as a kind of reverse proxy to keep static stuff online. Obviously doesn't help if the transactional part of the site/database goes down, but end users will see a friendly message rather than it timing out.

toomuchtodo · on Nov 19, 2014

As others have mentioned, multiple cloud providers, service checks, and withdrawing bad providers at DNS.

photorized · on Nov 19, 2014

Azure + AWS

sphildreth · on Nov 19, 2014

So much for the idea of 99.999% uptime with the magical "cloud" buzzword. I noticed during this downtime in North America that Word Online wasn't functioning as my daughter tried to use it to do some homework.

shaydoc · on Nov 19, 2014

If you want 99.999% uptime, use Azure TrafficManager and set up failover loadbalancing to different datacenters.

We have failover loadbalancing running between multiple datacenters, no issue here!

edit : 99.99%

equoid · on Nov 19, 2014

The SLA mentions 99.9% or 99.99% for Database connectivity. Where does 99.999% come from?

Do Microsoft say this about Traffic Manager or are you suggesting you have to pay for extra services to get the advertised reliability figure?

matthewmacleod · on Nov 19, 2014

So much for the idea of 99.999% uptime with the magical "cloud" buzzword

Who was selling that to you? Because I'm pretty sure it wasn't Microsoft…

nnx · on Nov 19, 2014

They are selling 99.99% availability over a monthtly cycle for their Storage service. 99.95% connectivity for their Virtual Machine service.

http://azure.microsoft.com/en-us/support/legal/sla/

9 hours of downtime means they are down to at most 98.75% for this cycle.

sphildreth · on Nov 19, 2014

Oh its 99.9% from Microsoft see http://azure.microsoft.com/en-us/support/legal/sla/ which means they get 9h a year to hit the SLA see http://uptime.is/

tankenmate · on Nov 19, 2014

Except most people use monthly periods for their SLAs

kelvin0 · on Nov 19, 2014

You should try Google's App Engine (paid premium account) tech support when your critical files disappear. Can't be any worse than this ... That's the problem with these hosted cloud solutions, your systems are at the mercy of the bad tech support. Try explaining that to your own customers ...

nnx · on Nov 19, 2014

Actual link to status page: http://azure.microsoft.com/en-us/status/#current

(not that convenient to copy paste the OP link from a mobile device)

gwgwegewg · on Nov 19, 2014

Microsoft are refusing to help us with our downed servers because we don't have a support contract. The outage is their issue not ours!!

ownagefool · on Nov 19, 2014

If your app is down, it sounds very much like it's your problem.

While you're obviously going to be unhappy with downtime, this is a genuine part of calculation you should have made when you decided to outsource all your eggs into one basket.

codeshaman · on Nov 19, 2014

As more and more services and apps depend on 'the cloud', I'm wondering, how many of them would survive a major cloud outage: the cloud company going bankrupt, stock market crash or economic meltdown, a malware exploiting a major server-side bug (like heartbleed or shellshock, but worse) wiping or encrypting the data on the infrastructure/user machines.

How much of the user's data would be forever lost in such an event ?

The other aspect is privacy - in theory, all user's data can be stored and accessed forever, eg. 20 years from now, when the reincarnation of someone like Stalin comes to power.

Anyway, the point I'm trying to make is that we should design our services or apps with this in mind - the cloud can and will fail from time to time, maybe forever. So, if possible, use the cloud as a 'bonus' feature, a means to back up data and store user's data offline for when the dark day comes at least the user still has his data.

maccard · on Nov 19, 2014

> The other aspect is privacy - in theory, all user's data can be stored and accessed forever, eg. 20 years from now, when the reincarnation of someone like Stalin comes to power.

Is havin your stuff stores locally any more secure in that situation. If someone wants your data they'll knock on your door and beat you and your family until you give it to them

jacalata · on Nov 19, 2014

If you have the only copy, you can destroy it.

rbanffy · on Nov 19, 2014

They will still beat you up until you produce the data you have destroyed, which is until they get tired of beating you up. You could keep some decoy data you produce in such situations, preferably before the beating starts.

coldtea · on Nov 19, 2014

Reality call: ANY and ALL Cloud services, be it Google, Azure, AWS etc, will be down for hours at some point every few years.

ExpiredLink · on Nov 19, 2014

Reality call: ANY and ALL services, be it local or remote, will be down for hours at some point every few years.

Drakim · on Nov 19, 2014

If it's my own fault, I can at least curse my own lack of knowledge and expertise, and I can strive to do better in the future.

When the cloud is down, all we can do is fiddle our thumbs and hope it doesn't happen again. Or maybe we could send an angry letter to Microsoft, and hope somebody reads it.

cmdkeen · on Nov 19, 2014

It's about abstracting away the cost of it being your own fault. Realistically the cost of employing enough people, and buying enough hardware, to provide anything close to 99.X% uptime is much more than punting that over to a Cloud Provider.

vidarh · on Nov 19, 2014

I've found very few cases where "punting that over to a cloud provider" has been remotely cost effective for base load. It's gotten closer over the years, but the gap is still massive for all but some very specific types of workloads.

It's great for convenience, and it's great for managing without certain skillsets that may be hard to obtain, and it's great for temporary capacity, but it's not cheap.

cmdkeen · on Nov 19, 2014

It's not cheap to have any confidence of any uptime realistically at all. The thing is that most people either live without that guarantee, or just get lucky enough not to care. It becomes problematic if you've made promises to others about uptime that are built on a house of sand.

Unless your base load cloud costs are more than the cost of full time, ready at a moments notice, experienced ops people you don't get close to any guarantee of uptime by non-managed hosting. The salary cost alone of that is substantial, let alone hardware spread across multiple locations. My firm pays at least 7 figures a year on IT ops and don't come close to 99.9% uptime across everything.

LLWM · on Nov 19, 2014

Sure. But in the end, what do you care about more? Better results? Or personal responsibility for those results? It's the same argument as self-driving cars.

grey-area · on Nov 19, 2014

From experience, that's not true. Things typically go wrong when something changes, or more rarely when something runs out of space/memory. Big cloud providers change stuff all the time, and typically do break things now and then. When they do all your can do is wait.

If you're using your own servers, or even VPS, you do have control over infrastructure, and can plan for changes and mitigate problems quickly, and you can run for years without downtime if nothing is changing significantly. Depending on your staff, funding, etc that might be attractive or not. Each has its own advantages, and disadvantages.

rlpb · on Nov 19, 2014

When it's local, you can control your own at-risk periods. For example: you can avoid doing risky work when an important company deadline is looming.

jedgrant · on Nov 19, 2014

Regretting the decision to go with Azure. Talk about terrible timing. We have media outlets interested in our site, we send info and the site is dead. Talk about a crap first impression.

craigvn · on Nov 19, 2014

It is totally frustrating, but at the end of the day similar outages happen with all cloud providers.

razzberryman · on Nov 19, 2014

Looks like Azure has been down more hours in the past week than AWS has been down all year.

https://cloudharmony.com/status-1week-for-azure

https://cloudharmony.com/status-1year-for-aws

ZoF · on Nov 19, 2014

Can you point to an example of this happening on AWS in multiple regions simultaneously?

dangrossman · on Nov 19, 2014

Yes. The day half the internet seemed to be down. It took Amazon more than 24 hours to recover, and having your services in multiple availability zones did not shield you from the failure.

http://aws.amazon.com/message/65648/

tarblog · on Nov 19, 2014

This seems to have only affected one region. Am I missing something?

dangrossman · on Nov 19, 2014

Yes. It started as a failure in one region, and propagated to others as it overloaded the "control plane" -- the stuff that runs "the cloud", and EBS tried to replicate "failed" disks to the point that Amazon ran out of disk space in the cluster. At the time, I was paying for RDS Multi-AZ which runs your database in multiple availability zones at once with hot failover if the primary goes offline. It failed to fail over despite that. Many large sites went down for a very long time that day, and people couldn't spawn replacement instances even in other AZs than the one the failure started in.

jedberg · on Nov 19, 2014

You're confusing region with AZ. They've never had a multi-region outage (yet).

parhamn · on Nov 19, 2014

It was one region, multiple availability zones. You're right Multi-AZ != Multi-region (for things like Sandy and natural disasters) BUT they have mostly separate infra which does make multi-az failures very unlikely. Due to its simplicity (with VPC and stuff) some people (perhaps wrongly) treat multi-az like multi-region.

ZoF · on Nov 19, 2014

Ah, cool, thanks for the info.

Interesting to think about the potentially compounding failure modes these services are dealing with.

I'll have to look that incident up to check out their postmortem vs. what Microsoft ends up putting out.

Thanks again ~

photorized · on Nov 19, 2014

I feel your pain. My startup was featured on TechNet today, I was quoted saying how great Azure worked for us so far... lots of folks were checking us out, and then Azure went down hard, taking our production systems with them. Talk about negative publicity.

toddgardner · on Nov 19, 2014

Our VMs and websites on USEast are unreachable, however our storage seems to be working fine. There is something very backwards with how they are communicating this outage.

Beached · on Nov 19, 2014

This may be greater then just west Europe. I personally have servers in US East that are unreachable, and there are a few reports of others in US region reporting partial unavailability for the US based servers.

I wonder how many customers Azure just lost do to their unexpected 2 day fiasco

ceejayoz · on Nov 19, 2014

> I wonder how many customers Azure just lost do to their unexpected 2 day fiasco

Amazon had a number of EBS fiascoes and survived just fine. I'd expect Azure to do the same.

LamaOfRuin · on Nov 19, 2014

Each of Amazon's high profile failures did have many people formulating previously non-existent escape plans though, and there are now several alternatives in this space that can offer the same scale.

It's obviously not going to destroy anyone's business, but there is a lot more competition than there used to be.

chuckouellet · on Nov 19, 2014

FYI We had to reboot some virtual machines via the portal due to the issues in East US yesterday but since then they all work.

Beached · on Nov 19, 2014

After hours of no response on the microsoft forums, VM's just started working again. I didnt change a thing, just poof.

inglor · on Nov 19, 2014

We've been noticing ups and downs for the last few hours of our VM powering an important database in West Europe.

Seriously considering another layer above azure to mitigate this in the future. Very disappointing to see.

At least initially their status indicated they're handing the problem but lately it's just been "All Good" and they said they resolved it on twitter but it's not at 100% yet: http://azure.microsoft.com/en-us/status/

joshmlewis · on Nov 19, 2014

Someone else mentioned routing to divert traffic to working data centers. That might be an option for you.

Varcht · on Nov 19, 2014

Oh no, did we break the status page too? Sorry Azure team, really didn't mean to pile on!

bengali3 · on Nov 19, 2014

keeping the load light? <html><head></head><body>The page cannot be displayed because an internal server error has occurred.</body></html>

elpool2 · on Nov 19, 2014

Yup! Azure websites and Storage are down in multiple regions.

andrea_s · on Nov 19, 2014

VMs too... At least for western europe

syassami · on Nov 19, 2014

Storage, Websites and Visual Studio Online - Multiple Regions - Partial Service Interruption 5 mins agoStarting at 19 Nov 2014 00:52 UTC we are experiencing a connectivity issue to Azure Services including Storage, Websites and Visual Studio Online. The next update will be provided in 60 minutes.

wenbert · on Nov 19, 2014

Well, their status page is telling lies.

bursteg · on Nov 19, 2014

Storage is the source of the outage, and most of the services rely on it, so they are all impacted.

plasma · on Nov 19, 2014

Still down even 2 hours later, regardless of the status page saying its OK.

jmnicolas · on Nov 19, 2014

Judging by how cloud services "frequently" go down when everything is normal, it makes me wonder what would happen in case of a real problem (volcano eruption, social unrest, nuclear disaster, alien invasion ...). I still don't get the cloud infatuation, and no you don't have to get off my lawn, I'm "only" 36 (yeah I know, in IT I'm already a dinosaur).

freehunter · on Nov 19, 2014

What would happen to your own datacenter in case of a similar disaster? Your servers would go down and you would spin up from your disaster recovery site. Cloud doesn't mean you don't need a DR plan anymore.

Put your servers in different regions, use Azure/Google, BlueMix/AWS, or even hybrid cloud, do something. Have a DR plan.

jmnicolas · on Nov 19, 2014

I'm thinking as the little guy here : not data center but personal computers.

If the disaster strikes my region, I probably have better things to do than IT things (like running for my life :-).

But with the cloud the disaster could be thousand of kilometers away and still affect me. That's the problem with the cloud : why should I stop working in my remote French town because there's a landslide in Ireland (or wherever they put the European cloud data centers) ?

I don't say the cloud doesn't have it's uses (especially as a redundant backup far far away) but the all cloud model has way more risks than what people think ... and vendors don't rush to explain that.

I'm one of those guy that think the future will be more and more harsh for the western civilization (think collapse of the Soviet Union). There will be less money for everything, infrastructure in particular, things will fail and you will have to deal with it locally and the DIY way.

gregd · on Nov 19, 2014

We used to have a locked "oh shit" box. I was supposed to put our DR (which we did actually have) and a host of other things in it (it was suggested we even put a loaded gun in it) to get by with in the case of a total disaster. We were supposed to then ship it off to Iron Mountain. That oh shit box sat empty for years on the premises...

jmnicolas · on Nov 20, 2014

You must have really interesting stories to tell, but I guess it's classified.

scientist · on Nov 19, 2014

See also https://news.ycombinator.com/item?id=8627630

us0r · on Nov 19, 2014

My VMs are down. This much be something major.

plasma · on Nov 19, 2014

I think because the disks are backed by blob storage.

iancarroll · on Nov 19, 2014

Seems to be back up now, my site (https://ian.sh) was down for a while.

scoj · on Nov 19, 2014

There really isn't anything I can do either. My VM isn't back up yet. I'd go to sleep and just expect it to be online in the morning (when it really matters), but I'm afraid a drive won't reattach or something like that. Meanwhile, twiddling thumbs...hit F5...twiddle thumbs...)

csbowe · on Nov 19, 2014

The page cannot be displayed because an internal server error has occurred.

Their error pages are less graceful than mine.

duedl0r · on Nov 19, 2014

come on! give them some slack.. they probably aren't very experienced at managing their linux servers! ;)

hyperliner · on Nov 19, 2014

Clearly neither did the developers of these apps which are now down who thought of spending as few pennies as possible and save a few other pennies with load balancing failover, and then expecting magic!

jsudhams · on Nov 19, 2014

This is why i have server class refurbished machine handy as working backup so that you can restore if ther service is not restored with in few minutes. Or have another copy of vm/db in other provider like rackspace or something

sspies · on Nov 19, 2014

Do you run multi-region or maybe multi-provider setups? How do you migrate your instances from failed regions to healthy ones? How do you route users to the healthy regions? DNS? Do you think anycast could be an alternative?

NicoJuicy · on Nov 19, 2014

My website, my webapplication for member management + my clients are down :s, i really don't like this...

Didn't receive any calls yet, but i don't think that will take long.

silverbax88 · on Nov 19, 2014

This is nice. My web site server IP was changed when the server came back up. So now I have to update all of the site DNS settings.

ohyesyodo · on Nov 19, 2014

Hmm. You should be using CNAME records rather than IP addresses. Or are you using the new fixed IP features?

NinjaTime · on Nov 20, 2014

Disgusting Virtual service

Disgusting management interface

Abysmal support

Way to fuck up a mustard sandwich Microsoftie

We moved everything we had away from that Virus named Azure.

scoj · on Nov 19, 2014

My VM is still down (US East). Is anyone else still experiencing issues?

photorized · on Nov 19, 2014

My VMs appear to be up mostly, but they are primarily in North Central.

scoj · on Nov 19, 2014

Thanks, I just restarted it and it took a while (5 minutes or so), but after that, it appears fine.

Nmachine · on Nov 19, 2014

"Everything is running great"

smoyer · on Nov 19, 2014

It's obviously "All Good!"

ExpiredLink · on Nov 19, 2014

Just a data dump for the NSA. Nothing serious!

Aoyagi · on Nov 19, 2014

So did anyone receive a call "The cloud is down"? Or at least an e-mail?

superuser2 · on Nov 19, 2014

Every time this happens, ask yourself... Are you outage-proof? Do you have a rational reason to believe that internally-managed infrastructure would never have a problem like this?

damian2000 · on Nov 19, 2014

I'm guessing the reason that this site is down I was trying to load ... http://www.dotnetrocks.com/

csbowe · on Nov 19, 2014

Maybe they unknowingly upgraded to Intel's latest SSDs in their storage array. https://news.ycombinator.com/item?id=8626928

BeeDunc · on Nov 19, 2014

This outage exposes the clowns that actually chose Azure as their cloud provider. If you use AMZN and it goes down, at least you're in good company, with the likes of Netflix, Twitter, Instagram, and so on. It's like yeah, I'm big like they are. So what, it went down, so is Netflix.

What does your client/customer think of you being on Azure? That you chose the crappy solution because your low-tech infrastructure still uses windows, which does not carry a lot of tech cred.

keithwarren · on Nov 19, 2014

Over 80% of the Fortune 500 run on Azure.

20% of Azure VMs are Linux.

You are not well informed.

brongondwana · on Nov 19, 2014

"run on" - I suspect you're being fed a unicode pile of poo here.

More likely the have _something_ which runs on Azure. Fortune 500s are, pretty much by definition, quite large - and probably have tons of departments and sub departments. And at least one of those departments probably has a task of trying out new things, like Azure, by running something on it.

What surprises me is that nearly 20% of Fortune 500s _don't_ have something running on Azure.

(I wonder what percentage "run on" Amazon)

VSpike · on Nov 19, 2014

Agreed. When I was looking at cloud providers recently I noticed that most of them make a claim along the lines of "$X percent of {Fortune 500,FTSE 100} companies use $OURPRODUCT" where $X is > 50%. My conclusions were:-

* Most major companies use more than one cloud provider

* "Use" is a very loose term here. It could mean anything from "the accounts team in some branch office uses S3 to back up their Sage data (or uses an online backup service that uses S3 in the back end)" to "they run their main product on our infrastructure".

curiousDog · on Nov 19, 2014

Ignoring the fortune companies, most of Microsoft's own services like Office 365 run on Azure. That's a pretty big bet right there.

apapli · on Nov 19, 2014

Actually I don't think they do - to the best of my knowledge Office 365 didn't have any downtime as a result of this outage. And Yammer stayed up, so I assume they haven't yet migrated from AWS...

nullrouted · on Nov 19, 2014

Someone is lying to you. Can you provide your source for this?