Verizon has no excuse for its planned cloud outage (infoworld.com)
109 points by tshtf on Jan 10, 2015 | 36 comments



Why does Verizon have an enterprise cloud? And why is anyone using it?

Is it somehow related to their main business? Do the servers run deep within their network allowing quicker access for mobile users? What is their unique selling proposition?

They have a whole "why Verizon" section[1]. I don't feel like it answers this question.

[1] http://cloud.verizon.com/why-verizon


Verizon Enterprise is the service lineup that they inherited from acquiring MCI some years ago. MCI, of course, was kind of a Big Deal as far as the internet goes, way back when.

http://en.wikipedia.org/wiki/Verizon_Enterprise_Solutions


Telcos' logic lately seems to be: folks are using the network for something exciting; we know the network part, so let's expand into the exciting thing that uses the network. Hence cloud services, streaming video, app stores. All failures so far.

It wouldn't be such a problem if they did a good job at their core business, but as we learn more about Google's and Amazon's networks, it seems like the telcos may fail there too.


They bought Terremark.


This article misses the point of the cloud. The argument for "live migrations" and the lambasting of AWS for bouncing servers over the Xen exploit are both completely misguided.

The point of the Cloud is not that a server never restarts or has downtime. It's that your app runs on many servers in different AZs and regions such that any one server failing is not going to have a real impact and can be easily and quickly replaced.

AWS, when bouncing their servers, did it one AZ and one region at a time. Because of that, if you ran across multiple AZs you got to test your failure scenarios without being down.
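
Roughly, "run across multiple AZs" is nothing exotic. Here's a minimal boto3 sketch of spreading a fleet across every available zone in a region (the AMI ID and instance type are placeholders):

    # Minimal sketch: spread identical instances across every available AZ in a
    # region, so a one-AZ reboot wave (or failure) never takes out the whole fleet.
    # Assumes boto3 credentials are configured; AMI ID and instance type are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    zones = [z["ZoneName"]
             for z in ec2.describe_availability_zones()["AvailabilityZones"]
             if z["State"] == "available"]

    for zone in zones:
        ec2.run_instances(
            ImageId="ami-xxxxxxxx",      # placeholder AMI
            InstanceType="t2.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )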

The author's focus on other cloud providers having restarts makes it sound like he's saying "Verizon is bad, but in bad company". There is no comparison to be made between taking down your whole cloud, and losing a server here and there.


Spot on. A random server could go down at any moment, either from a hardware failure or because your cloud provider decides it needs maintenance. The latter actually gives you an up-front warning; the former is just "bang". So any well-designed system, whether it runs on a cloud or on your own hardware, should be able to deal with this, since it's the reality of servers: sometimes one breaks.

All of the bigger clouds (Amazon, Rackspace, SoftLayer) handle their security-related restarts this way: one area/region/DC at a time, carefully making sure that well-designed systems don't notice any downtime.

What Verizon does is a totally different thing and something most "important but not life threatening" systems do not design for: Taking down the whole cloud with all servers at the same time.


What hasn't really been said well is that there are constant incidents going on in large cloud providers, and you don't need a maintenance period like this to experience failures. I work in a place where people are still used to individual machines being up constantly, addressed by IP alone, and aren't practicing high-availability architectures in their cloud environments. I had to explain to an architect that VMs can simply hiccup in AWS or any other provider, and that just relaunching the instance is fine. The idea of random failures may be foreign, but I was certainly a bit surprised that someone would have such a rudimentary understanding of infrastructure as to not be familiar with the idea of "have you tried turning it off and on again?"
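
In AWS terms, "turning it off and on again" is usually just terminating the sick instance and letting an Auto Scaling group bring up a fresh one. A rough boto3 sketch, with a placeholder instance ID:

    # Rough sketch: if an instance fails its status checks, stop debugging it and
    # terminate it; the Auto Scaling group it belongs to launches a replacement.
    # Assumes boto3 credentials are configured; the instance ID is a placeholder.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # placeholder

    resp = ec2.describe_instance_status(InstanceIds=[instance_id],
                                        IncludeAllInstances=True)
    statuses = resp["InstanceStatuses"]

    if statuses and statuses[0]["InstanceStatus"]["Status"] == "impaired":
        ec2.terminate_instances(InstanceIds=[instance_id])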

With that said, I have zero hope that most enterprises will manage a decently solid failover strategy for their cloud applications, because an awful lot of them can barely get anything set up on even one cloud provider in the most basic layout. Half the folks I'm aware of don't even have cross-AZ redundancy, let alone cross-region failover. When people are turning off 3-4 VMs to save money, you're in no position to think about cross-provider load balancing.

Most of the enterprises I'm familiar with would throw Verizon more and more money in the blind hope that it'll make their service more reliable. They'll do this to an incredible dollar amount, partly because it'll still be cheaper than their old, traditional in-house IT, which cost multiple orders of magnitude more but delivered way worse availability. Stuff like this scheduled maintenance shows how foolish it is to believe that someone will magically become ultra-reliable by virtue of you paying them to be.


Two whole days of zero service. In this one moment I have lost all expectations of good service from Verizon's cloud.

It doesn't look like customers were given much warning either. This story was originally published on the 6th of Jan. Could you imagine trying to find an alternate hosting setup by the weekend if you have any kind of availability expectations? It seems like madness to me. And even if you did move to another host to cover these 48 hours of downtime, how likely are you to move the majority of your business over to AWS, Google Cloud, Azure, etc.?

The lack of notice on this seems to be a bigger issue to me than the fact that Verizon is taking their whole cloud out of service for 48 straight hours.


Two days? You practically have to move to a different host. And if I did that, I wouldn't bother moving back. If they have any customers left after this it'll be a miracle.

Then again, I don't know why you'd think Verizon was a good hosting provider in the first place.


At the start of Google Cloud they had planned outages lasting weeks. Now, you could (with a bit of hassle) move your instance to another region, but it was very annoying and off-putting.


Really? Is there a link about this? I can't find anything.


http://blogs.gartner.com/lydia_leong/2013/11/14/google-compu...

This is a lesson about the difference between hosting your own apps and hosting customers. All internal Google apps are designed to survive failure of an entire datacenter, so it sounds like they reuse that mechanism for maintenance rather than the riskier practice of maintaining a datacenter while it's live.

Getting back to Verizon, I work in an "enterprise" and scheduled multi-day datacenter maintenance tends to happen about once a year. Many of Verizon's customers would probably have no problem taking down their infrastructure for a weekend, but they feel helpless when someone else does it to them.


I can't find a source, this is only what I remember. I was an early beta tester of GCE and I think these were scheduled shortly after public launch.


According to the other commenter, it was that individual zones might be offline for a week. That makes sense. A whole cloud being offline for a week doesn't, but yeah, sometimes datacenters need new network fabrics.


If I was a hosting company and needed a 2 day outage, I'd probably just shut down my business and offer to cover some expenses for users to migrate whilst apologizing profusely.


Why would you ever have expectations of good service from anything Verizon?


The big print giveth, and the small print taketh away. When the reality fails to live up to the marketing and hype, folks are naturally going to be upset. However, those of us who actually work in IT for a living expect downtime and architect for it. In a perfect world, the reaction from Verizon's customers would have been one of mild annoyance: "Oh well, guess we'll fail over to our secondary for that weekend."

Anyone who drank the 'zero downtime' kool-aid will hopefully treat this as a (deserved) wake-up call that marketing does not excuse poor engineering on their part. Don't like the fact that your 'immortal' cloud provider goes down in planned (and sometimes unplanned) ways? Buy your own iron and do it yourself. It's the only way you can be completely 100% certain it's done right.


Two hours is sufferable. Two days is insane. Not everyone has the engineering talent or time to have auto-deploy multi-cloud failover.

Not to mention that if you buy in to certain services, say, any AWS architecture beyond EC2, failover to another provider becomes a lot closer to impractical if not impossible.

Two hours is sufferable...


> if you buy in to certain services, say, any AWS architecture beyond EC2, failover to another provider becomes a lot closer to impractical if not impossible.

Yes. I say this all the time and continue to be amazed that it's such a foreign idea to many cloud users. When you stick to the infrastructure-as-a-service offerings, like EC2, and steer clear of cloud vendors' proprietary platform services, you avoid costly vendor lock-in, and at very little added engineering cost.

Granted, each organization will value vendor independence differently, but I suspect many organizations don't give enough consideration to worst-case scenarios.
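
One cheap way to keep that option open is to hide the provider behind a small interface of your own, so the rest of the app never imports a cloud SDK directly. A toy sketch; the classes are illustrative stand-ins, not a real library:

    # Toy sketch: keep vendor-specific code behind one small interface so the rest
    # of the app never imports a cloud SDK directly. The classes are illustrative
    # stand-ins, not a real library.
    from abc import ABC, abstractmethod

    class ObjectStore(ABC):
        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...

        @abstractmethod
        def get(self, key: str) -> bytes: ...

    class S3Store(ObjectStore):
        def __init__(self, bucket: str):
            import boto3                 # the vendor SDK stays inside this class
            self._bucket = bucket
            self._s3 = boto3.client("s3")

        def put(self, key, data):
            self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

        def get(self, key):
            return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

    class LocalStore(ObjectStore):
        """Stand-in for 'some other provider' -- or a test double."""
        def __init__(self):
            self._blobs = {}

        def put(self, key, data):
            self._blobs[key] = data

        def get(self, key):
            return self._blobs[key]

    def make_store(provider: str) -> ObjectStore:
        return S3Store("my-bucket") if provider == "aws" else LocalStore()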


This is where good sysadmins, and teams thereof, earn their high salaries. You shouldn't have to suffer through a two-day outage, but it happens. A good sysadmin will insist on multi-provider hosting, advise you on how not to get locked into $vendor's precious snowflake cloud implementation, and handle myriad other things.


Yup. Hurricane Sandy should have been enough to remind folks of this.

Natural (and unnatural) disasters happen, and you have to be prepared for this contingency. There are many tools out there that can help with this, but a competent sysadmin will do a hundred times more to ensure that your business continues when the unthinkable happens.

Heck, only today one of Amazon's new datacenters on the east coast had a 3-alarm fire[1]. We didn't end up having any instances fail, but we did notice some of our services having problems that coincided with the fire, and we were ready to fail over the moment instances started dropping.

[1] http://money.cnn.com/2015/01/09/technology/amazon-data-cente...
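
For what it's worth, the "ready to fail over" part doesn't have to be fancy; a primary/secondary DNS failover record covers a lot of cases. A sketch using Route 53 via boto3 (the zone ID, health check ID, domain, and IPs are placeholders):

    # Sketch: primary/secondary DNS failover with Route 53. If the health check on
    # the primary fails, Route 53 answers with the secondary, which can live in
    # another region or another provider entirely.
    # Hosted zone ID, health check ID, domain, and IPs are placeholders.
    import boto3

    r53 = boto3.client("route53")

    def failover_change(identifier, role, ip, health_check_id=None):
        record = {
            "Name": "app.example.com.",
            "Type": "A",
            "SetIdentifier": identifier,
            "Failover": role,            # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record}

    r53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",   # placeholder
        ChangeBatch={"Changes": [
            failover_change("primary", "PRIMARY", "198.51.100.10",
                            health_check_id="00000000-0000-0000-0000-000000000000"),
            failover_change("secondary", "SECONDARY", "203.0.113.10"),
        ]},
    )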


> those of us who actually work in IT for a living expect downtime and architect for it.

Does Verizon not work in IT for a living? Why does their architecture not allow them to switch over to a backup despite having many days of warning?


I'm not saying Verizon is blameless here, but it's a matter of 'would you rather be right, or happy?' I personally will take 'happy' more often than not. You're absolutely right, Verizon shouldn't find a 48h downtime window acceptable. However, their customers who defined their business requirements to include a separate-vendor DR platform sure are (comparatively) happy right now.


Don't confuse 'zero downtime' with TWO DAYS. Nobody expects true 'zero downtime', and a few minutes-to-hours-long outages throughout the year adding up to a couple of days may be tolerated, but... two straight days? No.

Hey, let's try a little experiment on the subject.

I'll sit here and do nothing starting right now, simulating unscheduled downtime, for the next two minutes.

You sit there and do nothing starting right now, simulating unscheduled downtime, for the next two days.

ETA: two days of downtime divided across 365 days a year works out to nearly 8 minutes per day, or almost an hour per week ... any provider down that much and that often would lose customers fast.
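
For anyone checking the math, the only input is the announced 48-hour window:

    # Downtime budget implied by a single 48-hour outage in a year.
    outage_hours = 48.0
    hours_per_year = 365 * 24

    down = outage_hours / hours_per_year
    print(f"uptime:   {100 * (1 - down):.2f}%")           # ~99.45%
    print(f"per day:  {down * 24 * 60:.1f} minutes")      # ~7.9 minutes
    print(f"per week: {down * 7 * 24 * 60:.0f} minutes")  # ~55 minutes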


Or look at it this way: if Verizon's service functions perfectly for the next year, they'll still only have roughly 99.45% uptime over that period. That's pretty bad -- even most commodity web hosting providers do better than that.


The website I host on my home computer has better uptime than that.


The truth is that most websites and services actually can be down for two days and still survive.

The culture of technology rightfully abhors downtime, and that's a good thing because it's our job to keep things up. But in reality there are very few companies who cannot survive 2 days of downtime. Sony Pictures is still in business, for example, despite an effective downtime of weeks.


Multi-day planned downtime for maintenance is even fairly common in Big Old Companies, especially for customer- and employee-facing systems. They just turn the thing off for a weekend, warn users in advance, and that's that. My bank's online banking seems to use that strategy. A few times a year, it tells me: sorry, but due to planned maintenance you'll have to check back Sunday evening. It's kind of surprising to me every time, but seems to be common. Some big companies do it with employee infrastructure like corporate email or network drives too. Some kind of system migration (or even physical server move) will mean on a random Saturday there's no email or file access. Announced in advance, sure, but it's still down for the day.

Verizon telling you that some random servers are going to be down for two days admittedly might be worse than taking one of your own systems down for a maintenance window, especially if you didn't expect in advance that such a thing was even a possibility and made the mistake of putting really-can't-be-down stuff on Verizon's cloud.


I'm doing a rollout for a Fortune 100 company pretty soon, and they built in an outage of a few hours. It's literally because they won't spring for one extra server running Apache, even with six months of lead time.


Downtime costs money; frequently, a LOT of money. It's not a question of survival. Severe downtime losses should be compensated; this is why SLAs exist. I can't put my team on involuntary furlough, citing shitty infrastructure choices.


> Although I never expect 100 percent uptime, planned outages aren't needed if the cloud platforms are designed correctly. It's quite possible to do live migrations without a server reset these days, yet the cloud providers seem to be missing the boat here.

It is slightly more complicated than that.

> There should be no cloud outages, ever. Got that?

Really? That's not how the real world works. This "article" is pretty crap.


Yeah, I agree the hyperbolic sentences are over the top. The only way for anyone to reliably get 100% uptime is to use two or three cloud providers or their own dedicated boxes somewhere. Even AWS doesn't provide 100% uptime guarantees. I do think the author's underlying point, that good design should lead to fewer and smaller planned outages, is fair. But his statements go way too far.


Two days down, by a world-scale provider, is also over the top.

Nobody expects 100% uptime. They also don't expect 100% down for days.


Understood. His main point is that no cloud provider should ever have any downtime, which is idiotic; no reputable company would ever promise zero downtime.


True story. Amazon is all about telling people to plan for failure and build around it in your architecture.

Also, nearly nothing will give you 100% uptime. That's fault tolerance and almost nobody has it.


Two DAYS




