[flagged] GCE down in all regions (cloud.google.com)
320 points by vox_mollis on April 12, 2016 | 175 comments



I've been overall very impressed with the direction of Google Cloud over the last year. I feel like their container strategy is much better than Amazon's ECS in that the core is built on open source technology.

This can wipe away a lot of goodwill, though. A worldwide outage is catastrophic and embarrassing. AWS has had some pretty spectacular failures in us-east (which has a huge chunk of the web running within it), but I'm not sure that I can recall a global outage. To my understanding, these systems are built specifically not to let failures spill over to other regions.

Ah well. Godspeed to anyone affected by this, including the SREs over at Google!


I'm totally impressed with gcloud. Slick, smooth interface. Cheap pricing. The fact that the UI spits out API examples for what you're doing is really cool. And it's oh-so-fast. (From what I can tell, gcloud's SSDs are either 10x faster than AWS's or a tenth of the cost.)

And this is coming from a guy who really dislikes Google overall. I was working on a project that might qualify for Azure's BizSpark Plus (they give you something like $5K a month in credit), and I'd still rather pay for gcloud than get Azure for free.


Same, was considering GCP for the future, but this is bad. I'm not using them without some kind of redundancy with another provider. I hope they write a good post-mortem, these are always interesting at large scale.


How bad is it really? They started investigating at 18:51, confirmed a problem in asia-east1 at 19:00, the problem went global at 19:21, and was resolved at 19:26.

They posted that they will share results of their internal investigation.

That kind of rapid response and communication is admirable. There will be problems with cloud services - it's inevitable. It's how cloud providers respond to those problems that is important.

In this situation, I am thoroughly impressed with Google.


It's bad because it concerns all their regions at the same time, while competing providers have mitigations against this in place. AWS completely isolates its regions for instance [1], so they can fail independently and not affect anything else. That Google let an issue (or even a cascade of problems) affect all its geographic points of presence really shows a lack of maturity of the platform. I don't want to make too many assumptions, and that specific problem could have affected AWS in the same way, so let's wait for more details on their part.

The response times are what's expected when you are running one of the biggest server fleets in the world.

1: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-re...


Expecting that the problems that happen everywhere else won't happen with a cloud provider is a pipe dream. They might be better at it because of scale, but no cloud provider can always be up. It happened at Amazon, and now it's happened at Google. Eventually, finding a provider that never went down will be like finding an airline that never crashed.

Operating across regions decreases the chances of downtime, it does not eliminate them.

> The response times are what's expected when you are running one of the biggest server fleets in the world.

That may be true, but actually delivering on that expectation is a huge positive. And more than having the right processes in place, they had the right people in place to recognize and deal with the problem. That's not a very easy thing to make happen when your resources cross global borders and time zones.

Look at what happened with Sony and Microsoft - they were both down for days and while Microsoft was communicative, Sony certainly was not. Granted, those were private networks, but the scale was enormous and they were far from the only companies affected.


> It happened at Amazon

AWS has never had a worldwide outage of anything (feel free to correct me). It's not about finding "the airline that never crashed", it's finding the airline whose planes don't crash all at the same time. It's pretty surprising coming from Google because 15 years ago they already had a world-class infrastructure, while Amazon was only known for selling books on the Internet.

Regarding the response times, I recognize that Amazon could do better on the communication during the outage. They tend to wait until there is a complete failure in an availability zone to put the little "i" on their green availability checkmark, and not signal things like elevated error rates.


Here's an example from this thread: http://status.aws.amazon.com/s3-20080720.html


I stand corrected, my statement was too broad.

AWS had two regions in 2008 [1]. That was 8 years ago, and I think you would agree that running a distributed object storage system across an ocean is a whole different beast than ensuring individual connectivity to servers in 2016.

1: https://aws.amazon.com/about-aws/global-infrastructure/


> AWS completely isolates its regions

Yeah... just don't look too closely under the covers. AWS has been working towards this goal but they aren't there yet. If us-east-1 actually disappeared off the face of the earth AWS would be pretty F-ed.


Our servers didn't go off, just lost connectivity. Same has happened to even big providers like Level3. Someone leaks routes or something and boom, all gone.

I'd be surprised if AWS didn't have a similar way to fail, even if it hasn't happened yet. This is obviously a negative for gcloud, no doubt, but it's hardly omg-super-concerning. I'm sure the post-mortem will be great.


Actually, according to the status report, they confirmed that the issue affected all regions at 19:21 and resolved it by 19:27. That's six minutes of global outage.

Disclaimer: I work for Google (not on Cloud).


The outage took my site down (on us-central1-c) at 19:13, according to my logs, so it was already impacting multiple regions by then. (I have been using GCP since 2012 and love it.)


Thank you, I missed that on my first reading - I saw the status update was posted at 19:45, not the content within it stating the issue was resolved at 19:27. I updated my parent comment.


I concur. The response was first rate.

Behind the scenes, I'm sure they will iterate on failure prevention and risk analysis.


Absolutely. GCP has been fantastic.


Amazon S3 went down globally on July 20, 2008: http://status.aws.amazon.com/s3-20080720.html


Sadly, I think a global outage was more acceptable in 2008 than it is now...

Knowing Google though, they'll learn their lesson on how to improve their entire workflow right quick.


Keep in mind, S3 was still a very new project at that point. Launched March 2006 and the first of its kind.

https://en.wikipedia.org/wiki/Amazon_S3


Can you talk about this? I have been spectacularly unsuccessful at using ECS (and currently run my VMs on a vanilla Debian ec2 instance)


Switching from ECS to GKE (Google Container Engine) currently. Both seem overcomplex for the simpler cases of deploying apps (and provide a lot of flexibility in return), but I have found the performance of GKE (e.g. time for configuration changes to be applied, new containers booted, etc) to be vastly superior. The networking is also much better, GKE has overlay networking so your containers can talk to each other and the outside world pretty smoothly.

GKE has good commandline tools but the web interface is even more limited than ECS's is - I assume at some point they'll integrate the Kubernetes webui into the GCP console.

GKE is still pretty immature though, more so than I realized when I started working with it. The deployments API (which is a huge improvement) has only just landed, and the integration with load balancing and SSL etc is still very green. ECS is also pretty immature though.


The problem is that GCP doesn't offer an RDS-style managed service with PostgreSQL, and external vendors are mostly more costly than AWS RDS, especially for customer homepages where you want to run on managed infrastructure as cheaply as possible.


This is sad for sure. The new Cloud SQL (second-generation MySQL) is really good, and if you use a DB-agnostic ORM you can probably make MySQL work for quite a while. It's sad to lose access to all the new PG features though, and I would love it if Google expanded their Cloud SQL offerings.


This is what made me use AWS as well.


I'm admittedly biased, but have you checked out Docker Cloud? http://cloud.docker.com


While not Docker Cloud specifically, when we eyeballed UCP we found it very underwhelming when pitted against Kubernetes.

To us it appeared to be yet another in a sea of orchestration tools that give you a very quick and impressive "Hello World" but then fail to adapt to real-world situations.

This is what Kubernetes really has going for it: every release adds more blocks and tools that are useful, composable, and targeted at real-world use (and allow many of us crazies to deal with the oddball, quirky behavior our fleet of applications may have), not just a single path for how applications would ideally work.

This has generally been a trend with Docker's tooling outside of Docker itself, unfortunately. Similarly, docker-compose is great for our development boxes, but nowhere near useful for production. And it doesn't help that Docker's enterprise offerings still steer you towards docker-compose and the like.


Not to bash, but the page you linked is classic Docker - it says literally nothing about what "Docker Cloud" is.

"BUILD SHIP & RUN, ANY APP, ANYWHERE" is the slogan they repeat everywhere, including here, and it means even less everytime they do it. What IS Docker Cloud? Is it like Swarm? Does it use Swarm? What kinds of customers is Docker Cloud especially good at helping? All these mysteries and more, resolved never.


I hadn't heard of it, actually. However it doesn't seem to support GCP which removes it from contention for us unfortunately.


So am I (I'm a YC alum), but RDS is too important for us to move away from it. Let me put it this way: if you had an RDS equivalent in Docker Cloud, lots of people would switch. Docker is more popular than you know.

Heroku should be an interesting learning example for the many new-age cloud PaaS offerings I'm seeing. Heroku's database hosting has always been key to adoption, to the extent that lots of people keep using it even after they move their servers to bare metal. The consideration and price sensitivity around data is very different from app servers.


I believe this is Tutum, which they bought some time ago. I tried Tutum with Azure before. After deleting the containers from the Tutum portal, it did not clean everything up in Azure; the storage Tutum created is still sitting in my Azure account today. LOL.


Docker Cloud still requires BYO cloud, however.


For the record, the Kubernetes dashboard comes pre-installed on all masters on GKE. So the UI is there, albeit not integrated into the Gcloud console.


Seconded--I can tell the ECS documentation is trying to help, but the foreign task/service/cluster model + crude console UI keeps telling me to let my workload ride on EC2 and maybe come back later.


What I figured out much later was that ECS is a thin layer on top of a number of AWS services: an AMI that I can use myself, EC2 VMs that I can run myself, and security groups + IAM roles that I can create on my own.

But the way they have built the ECS layer is very, very, VERY bad... and I have an unusually high threshold for documentation pain.


I work on Convox, an open source PaaS. Currently it is AWS only. It sets up a cluster correctly in a few minutes. Then you have a simple API - apps, builds, releases, environment and processes - to work with. Under the hood we deploy to ECS but you don't have to worry about it.

So I do agree that ECS is hard to use but with better tooling it doesn't have to be.

I'm also a big fan of how GKE is shaping up.


Spotify is down due to this, which is, uh, pretty hilarious https://news.spotify.com/us/2016/02/23/announcing-spotify-in...


If Spotify wanted to be really sneaky, some amount of downtime might be good for them financially.

The bulk of their revenue comes from customers who subscribe on a per-month basis, while they pay out royalties on a per-song-played basis. This outage is reducing the amount they have to pay, and if the outage-elasticity-of-demand is low enough they could (hypothetically) come out ahead!


> while they pay out royalties on a per-song-played basis

I believe this is inaccurate. They pay out royalties on a share-of-all-plays basis, don't they?[0] So an outage wouldn't reduce the payout amount, it would just slightly alter the balance of payments for individual rightsholders.

[0]http://www.spotifyartists.com/spotify-explained/#royalties-i...: "That 70% is split amongst the rights holders in accordance with the popularity of their music on the service. The label or publisher then divides these royalties and accounts to each artist depending on their individual deals... Spotify does not calculate royalties based upon a fixed “per play” rate."
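To make the pool structure concrete, here's a toy model of share-of-plays royalties (the numbers and function names are my own illustration, not Spotify's actual accounting):

  # Toy model: a fixed share of revenue is split by each rightsholder's
  # fraction of total plays. All figures are made up for illustration.
  monthly_revenue = 1_000_000
  royalty_pool = 0.70 * monthly_revenue  # the "70%" from the linked page

  def artist_payout(artist_plays, total_plays):
      return royalty_pool * artist_plays / total_plays

  # Normal month: 100k total plays, our artist has 1k of them.
  print(artist_payout(1_000, 100_000))  # 7000.0

  # Outage month: all plays drop 10%, shares are unchanged,
  # so the payout (and the total outlay) is the same.
  print(artist_payout(900, 90_000))     # 7000.0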


Rather, since they don't have an SLA with end users but do get credits from Google, they still earn money.


But the reputation damage remains.


I see your point. Based on @SpotifyStatus [0], it wasn't uncommon for Spotify to have service disruptions before the move, though.

[0]: https://twitter.com/SpotifyStatus


I bet they and Google will have a positive post written up about it tomorrow though. :)


To be fair, AWS had some significant downtime in the second half of last year.


Not across multiple regions, though. It's not trivial to make an application cross-region, but at least there is a way to engineer around an outage, unlike this one.


I think it's a great idea to diversify not just regions, but providers.

The future is for per-application virtual networks that are agnostic to the underlying hosting provider. These networks work as an overlay, which means that your applications can be moved through providers without changing their architecture at all. You could even shut it down in provider A and start it in provider B without any changes at all.

At Wormhole[1] we have identified this problem and solved it.

[1]: https://wormhole.network


I was getting really frustrated at the gym when the Spotify app wouldn't work. Didn't expect to find the answer here.


And they won't be down again in a long, long time.


Were Google services (search, email, Drive, apps) impacted at all?


Google Custom Search was down for a similar time period. (Outage lasted longer than GCE, but seems likely related.)


No, they don't use gcloud to host their own apps. Yes, that's ridiculous.


Which is exactly why I won't use GCE. If Google isn't confident enough to use it themselves, then neither am I.

The fact that Amazon dogfoods AWS is a major advantage for them.


Even if Amazon.com uses AWS (the extent of which seems it may be a mixture of marketing hype and urban legend), there are many ways AWS could fail that affects customers but leaves Amazon.com unaffected.


It would be hilarious if they turned out to host their services on AWS or Azure.


I remember people speculating that Spotify most likely received a big discount for doing this. Guess you get what you pay for ;)


Google provides 99.95% - https://cloud.google.com/appengine/sla

That's about 17 such 15-minute breaks per year, i.e. an allowance for one small (or one large but quickly fixed) screwup per month :)


99.95% is correct, but the Compute Engine SLA is actually here: https://cloud.google.com/compute/sla

Today's incident did not impact App Engine at all.

(Disclaimer: I work in Google Cloud Support.)


This allows unlimited small outages:

"Downtime Period" means, for an Application, a period of five consecutive minutes of Downtime. Intermittent Downtime for a period of less than five minutes will not be counted towards any Downtime Periods.


It seems to be working fine, I just tested.


Outage lasted about 16 minutes


Happy to say I didn't notice; I'm using a 517-song offline playlist ("EVE Online" by Michael Andrew) for my programming work.


That sounds awesome, care to share?


Sorry for the late reply - the link is https://open.spotify.com/user/1231239981/playlist/3ka1SYnv2b... . Hope you see this post and enjoy it.

I see from the downvotes that my reply must have come across as off topic to the GCE issue; however, since Spotify came up as a "victim", I felt it worth mentioning that Spotify Premium has offline playlists that let users weather network issues of any kind. Also, for me personally, big playlists of quality music like this one are fantastic for my work.


That was nuts. Interested to read the post-mortem on this one. Our site went down as well. What could cause a sudden all-region meltdown like that? Aren't regions supposed to be more isolated to prevent this type of thing?

Seems to have only been down for about 10 minutes, so I'm thinking some sort of mis-configuration that got deployed everywhere...they were working to fix a VPN issue in a specific region right before it went down...


To quote @DEVOPS_BORAT:

To make error is human. To propagate error to all server in automatic way is devops.


So, you're saying the servers are aladeen?



Our website was down as well, for 16 minutes. My guess is that it was a bad route that got pushed out simultaneously (probably not the intention). It happened once before, sometime last year, if I remember correctly. We'll have to wait and see what the definitive cause was, though.


Sounds like a routing issue.

Maybe someone pushed the wrong BGP routes, hence why the quick fix and the initial issue with Cloud VPN.

Source: Totally guessing.


Unless the bug somehow exists across every provider at once, this is exactly why it's good to consider multiple cloud services for failover.


And where do you put your system that directs traffic to one cloud and/or another... and what happens when that goes down?


You get an AS number, and announce your own IP space. DNS failover only sort-of works.

Or you subscribe to a "GSLB" service where they do this for you for a significant fee. Or you use a "man-in-the-middle as a service" system like Cloudflare, who do this at an extremely reasonable and/or free cost.

Of course, you still have to deal with the risk of route leaks, BGP route flapping/dampening, and other things which can take your IP addresses offline despite the fact you are multihoming with different carriers in different locations.

So perhaps you set up IP addresses on different ASNs and use both DNS- and IP-based failover.

But then you find a bug somewhere in your software stack which makes all of this redundancy completely ineffective. So you just take your ball, go home and cry.


Kind of the point... adding more possibilities for failure, at increased complexity and expense isn't always worth it... and I'd say usually isn't.


You put it in all your clouds, with low TTL DNS entries pointing at all those instances (or the closest one geographically maybe). Then if you're really paranoid you use redundant DNS providers as well.
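A rough sketch of that idea, assuming a hypothetical update_records call standing in for whatever your DNS provider's API actually looks like (endpoints and addresses are placeholders):

  # Health-check-driven DNS failover across providers (sketch, not a real API).
  import urllib.request

  ENDPOINTS = {
      "gce": ("203.0.113.10", "https://gce.example.com/healthz"),
      "aws": ("198.51.100.10", "https://aws.example.com/healthz"),
  }

  def healthy(url, timeout=3.0):
      try:
          return urllib.request.urlopen(url, timeout=timeout).status == 200
      except OSError:
          return False

  def live_addresses():
      # Publish every healthy provider as an A record with a short TTL
      # (say 60s) so clients re-resolve quickly after a failure.
      return [ip for ip, url in ENDPOINTS.values() if healthy(url)]

  # update_records("www.example.com", live_addresses(), ttl=60)  # provider-specific call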


And then you discover that there are a LOT of craptastic DNS resolvers, middle boxes, AND ISP DNS servers out there that happily ignore or rewrite TTLs. With a high-volume web service, you can have a 1 minute TTL, change your A records, and still see a lovely long tail of traffic hitting the old IP for HOURS.


The point was that adding another point for potential failure still won't reduce the chance of failure... it's just something else that can and will break.

In any case, failures happen, and most systems are better off being as simple as possible and accepting the unforeseen failures than trying to add complexity to overcome them.


I wonder if it was an issue with their Maglevs[0]?

[0] http://research.google.com/pubs/pub44824.html


So in all seriousness, how do folks deal with this?

In this case, it ended up being a multi-region failure, so your only real solution is to spread it across providers, not just regions.

But I imagine it's a similar issue to scaling across regions, even within a provider. We can spin up machines in each region to provide fault tolerance, but we're at the mercy of our Postgres database.

What do others do?


Most people just deal with it and accept that their site will go down for 20 minutes every 3-4 years or so, even when hosting on a major cloud, because:

1) the cost of mitigating that risk is much higher than the cost of just eating the outage, and

2) their high traffic production site is routinely down for that long anyway, for unrelated reasons.

If you really, really can't bear the business costs of an entire provider ever going down, even that rarely (e.g. you're doing life support, military systems, big finance), then you just pay a lot of money to rework your entire system into a fully redundant infrastructure that runs on multiple providers simultaneously.

There really aren't any other options besides these two.


This here is right on.

I will add that if you can afford the time and effort, it's good to design your system from the beginning to work on multiple providers without many issues. That means trying as hard as you can to use as few provider-specific services as possible (RDS, DynamoDB, SQS, BigTable, etc.). In most cases, pjlegato's 1) will still apply.

But you get a massive side-benefit (main benefit, I think) in cost. There are huge bidding wars between providers and if you're a startup and know how to play them off each other, you could even get away with not having to pay hosting costs for years. GC, AWS, Azure, Rackspace, Aliyun, etc, etc are all fighting for your business. If you've done the work to be provider-agnostic, you could switch between them with much less effort and reap the savings.


If you are doing life support, military systems, or HA big finance, then you are quite likely to be running on dedicated equipment, with dedicated circuits, and quite often highly customized/configured non-stop hardware/operating systems.

You are unlikely to be running such systems on AWS or GCE.


And that's why IBM is still in the server business: There's nothing like a mainframe when it comes to uptime.


HP also has some good products in the highly available space - http://h20195.www2.hp.com/v2/getpdf.aspx/4aa4-2988enw.pdf , likely from their acquisition of Tandem.


> likely from their acquisition of Tandem.

Yep. Those were originally Itanium-only, so their success was somewhat… limited, compared to IBM's "we're backwards compatible to punch cards" mainframes.

Only recently did Intel start porting mission-critical features like CPU hot-swap over to Xeons so they can finally let the Itanic die; hopefully we're going to see more x86 devices with mainframe-like capabilities.


IBM also owns Softlayer which is a great cloud provider for the more traditional VM/dedicated servers architecture.


And have similar failure rates. Human errors are inevitable.


> even when hosting on a major cloud

Hosting on anything, anywhere, really. Even if one builds clusters with true 100% reliability running on nuclear power buried 100 feet underground, you still have to talk to the rest of the world through a network that can fall apart for a variety of reasons. If most of your users are on their mobile phones, they might not even notice outages.

At some point adding an extra 9 to the service availability can no longer be justified for the associated cost.


Also, 20 minutes every 3 years is 5-9s anyways.


> Most people just deal with it and accept that their site will go down for 20 minutes every 3-4 years or so, even when hosting on a major cloud

If THAT is what I get for Google Compute Engine's prices, I could just as well use OVH's cloud -- uptime isn't worse, and the price is a lot lower.


Depends entirely on your business, but what I do is just tolerate the occasional 15-minute outage. There's increasing cost to getting more 9's on your uptime, and for me, engineering a system that has better uptime than Google Cloud does, by doing multi-cloud failover, is way out of favorable cost/benefit territory.


That's the only sane thing to do.

It is impossible to ensure 100% uptime, and it gets increasingly harder to approach that as you put more separate entities between yourself and the client. The thing is, you'll be blamed for problems that aren't in your control and aren't really related to your service, but to the customer. For example, local connectivity, phone data service, misbehaving DNS resolvers, packet mangling routers, mis-configured networks, mis-advertised network routes, etc. Every single one of those examples can happen before the customer traffic even gets to the internet, much less where you have your servers housed.

All you can do is accept that there will be problems attributed to your service, rightly or not, work to mitigate and reduce the problems you can, and learn that it's not the end of the world.


One answer is to evaluate the uptime you get with a single cloud provider and figure out if it meets your needs. This outage means that for the year, GCE will have at most a .999948 == four and a half "nines" of uptime. From a networkworld article in 2015: http://www.networkworld.com/article/2866950/cloud-computing/... And 2015: http://www.networkworld.com/article/3020235/cloud-computing/...

The article quotes Ben Treynor as saying that Google aimed for and hit 99.95% uptime, which is 4.3 hours of downtime per year.

My guess is that, despite cloud outages being painful, many applications are probably going to meet their cost/SLO goals anyway. Going up from 4 9s starts to get very expensive very quickly.
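For anyone who wants to sanity-check those figures, the arithmetic is just this (a rough back-of-envelope calculation, not an official SLA accounting):

  MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

  def availability(downtime_minutes):
      return 1 - downtime_minutes / MINUTES_PER_YEAR

  print(availability(27))           # ~0.99995 -- the ".999948" above assumes ~27 min of downtime
  print(0.0005 * MINUTES_PER_YEAR)  # a 99.95% target allows ~262.8 minutes per year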


Most people and customers are tolerant of 15 minutes downtime here and there once or twice a year. Sure, there will be loudmouths who claim to be losing thousands of dollars in sales or business, but they're usually lying and/or not customers you want to have. They'll probably leave to save $1 per month with another vendor.

It sucks but the days of "ZOMG EBAY/MICROSOFT/YAHOO DOWN!!11!" on the cover/top of slashdot and CNET are gone. Hell, slashdot and CNET are basically gone.


IMHO, the next wave is likely multi-cloud. Enterprises that require maximum uptime will likely run infrastructure that spans multiple cloud providers (and optionally one or more company controlled data centers).

OneOps (http://oneops.com) from WalmartLabs enables a multi-cloud approach. Netflix Spinnaker also works across multiple cloud providers.

DataStax (i.e. Cassandra) enables a multi-cloud approach for persistent storage.

DynomiteDB (disclaimer: my project) enables a multi-cloud approach for cache and low latency data.

Combine the above with microservices that are either stateless or use the data technologies listed above and you can easily develop, deploy and manage applications that continue to work even when an entire cloud provider is offline.


Get enough things running on multi-cloud, and you could potentially see multi cloud rolling failures, caused by (for example) a brief failure in service A leading to a gigantic load-shift to service B...


This assumes all the software in your stack, and all the software you deploy, is completely bug-free. While rare, bugs that have been in production for a long time can surface only when you hit certain conditions. Now all your services are down. 100% is impossible.

Also, if there is a problem with one component of your stack that could have run off a cloud service, chances are Google or Amazon will fix your edge condition much quicker than you.


By pretending it's still the golden age of the internet and using physical servers in those locations. You might have to hire some admins, though ;).


Hey now, we do want 100% uptime but let's not get hasty.


Guys, I've got it- instead of locking ourselves in with one vendor's platform and being subject to their mismanagement and arbitrary rules, why don't we buy our own hardware and do it ourselves?

We can free up our OpEx budget too! My sales rep sent me a TCO that shows it is way cheaper to run a data center than to pay a cloud subscription!

I'm calling the CFO!


Great idea! Can't wait for our own mismanagement and arbitrary rules! I always wanted to play hobbyist technical infrastructure specialist.


Will your own hardware have the same reliability as Google's?


I think it was sarcasm......


AWS had several major outages in the past, especially between 2009 and 2012. In some cases it was not only downtime, but also data loss, which is the hardest part. There are 8,760 hours in a year; if you are down for a total of less than 8.7 hours, you're in the >99.9% uptime category (also called "three 9s"). Four 9s (99.99%) is considered a nice plus. Very few businesses really need that.

However, uptime is one thing. Data loss is a different beast entirely.

AFAIK, this one for Google is only downtime. Being able to keep GCE up for all but a few hours of the year means more than 99.9% availability, which is what most customers need.

Operational excellence, or the ability to have your cloud up and running, comes only with a large customer set; Google is now gaining a lot of significant customers (Spotify, Snapchat, Apple, etc), and therefore I expect them to learn what's needed over the coming months.

2016 will be ok. 2017, in my view, will be near perfect.

If Google wants to differentiate themselves from AWS, they should offer an SLA on data integrity (at a premium, obviously). That's how you get thousands of enterprise customers.

Shameless plug: I've also extensively written about AWS, Azure and GCE here: https://medium.com/simone-brunozzi/the-cloud-wars-of-2016-3f...


Great timing for that new book :)


Seems kinda misleading of Google to claim repeatedly that they are hosted on the same infrastructure as GCP, and then not go down with it.

EDIT: Switched from "dishonest" to "misleading"; while it's abundantly clear that Google doesn't run on GCP, GCP feels like a second-class citizen to Google because you just cannot get Google uptime with it.


Google and GCP run on the same infrastructure, but this was a GCP problem, not a problem with that common infrastructure.


(I'm a Google SRE, I'm on the team that dealt with this outage)

This did impact common infrastructure. Some (non-cloud) Google services were impacted. We've spent years working on making sure gigantic outages are not externally visible for our services, but if you looked very closely at latency to some services you might have been able to see a spike during this outage.

My colleagues managed to resolve this before it stressed the non-cloud Google services to the point that the outage was "revealed". If this was not mitigated, the scope of the outage would have increased to include non-cloud Google services.


There's a lot of infrastructure at Google. And the claim is correct: GCP and Google sit on top of the same servers, the same backend systems. Are they on exactly the same servers? No, of course not -- there are a few servers out there :-)


This was a global network outage, so we can't talk about the "exact same servers". There seems to be an implication that Google runs on GCP, and yet a global network outage somehow affected every GCP customer but that one.



I do not envy the current on-call rotation.


I don't envy anyone with an on-call job.


I was so glad to get away after 5 years of 24/7/365. I had to drive home 5 hours from holiday once, leaving the rest of the family behind, spend 20 minutes sorting stuff out and drive back - the untold joy of pre-cloud startups :)


What do you recommend? I figure that if you're working on something without oncall, probably no one cares about it anyway. I'd rather have a good rotation than no rotation.


Staffing such that on-call is handled by presently-in-office staff. This is, as I understand, pretty much what Google does. When you're in the office, you're in the office, but when you're not, you're not. Having global coverage means ops in several timezones, and this is what Google accomplishes.

Not knowing when, at any time, your phone or pager will go off wears on you in interesting ways over time.


It depends on the team and type of oncall rotation for the service. My team (a SWE team) has its own oncall rotation as we don't have dedicated SREs for all of our services.

Since we're US-based only, the oncall person will have pager duty while they sleep. Our pager can be a bit loud at night due to the nature of our services, so it's definitely not for everyone (luckily it's optional).


Is this at Google?

I'll note you're SWE not SRE. I'm talking mostly about dedicated Ops crew on pager.

It's one thing if you're responding to pages resulting from other groups' coding errors or failure-to-build sufficiently robust systems. Another if you're self-servicing.

One of my own "take this job and shove it" moments came after pages started rolling in at 2am, bringing me on-site until 6am. I headed back for sleep, showed up that afternoon and commented on the failure of any of the dev team to answer calls/pages/texts (site falling over, I had exceptionally limited access capabilities and was new on team). Response was shrugs.

Mine was "That wasn't your ass being hauled out of bed. See ya."


The opinions stated here are my own, not necessarily those of Google.

Yes, it is at Google. Our important and high visibility bits have SREs that help monitor our services (SREs actually approached us to take over some bits that were more important).

Google has a lot of oncall people who are never going to go into a data center (most Googlers never see a data center). So there are lots of oncall rotations with an SLA that can still be handled from bed if the page comes at 2am.

(I sadly can't give any examples)


This is not generally true for at least the big SRE-supported services at Google. I don't know what every team does, but my team's oncall shift (for example) is 10am-10pm, Mon-Thu or Fri-Sun. Another office covers the 10pm-10am part of the US day.


That's for first-response ops, though. What if a code change is needed to recover, or something else that goes beyond the playbooks?

I guess today is a perfect example, I wonder how many out of hours engineers got paged.


Then Dev gets to deal with its own shit.


The magic of devOps is carting two pagers around.


There are still a few things in the world that aren't internet services.


Only a few though.


Games, mobile apps, desktop apps like Photoshop, Office, IntelliJ, etc., and some shrink-wrapped server-side apps. But you are right, some of these products are starting to have an online component as well.


Interestingly, nowadays all but the last one tend to require a service to be available.


I recommend others doing on call so I don't have to. I'm not an ops person, though I probably wouldn't mind some of the job, and hate being on call. I did it for a year at my current workplace (as a dev). All the problems I was capable of fixing I automated away, and got really annoyed that others didn't do the same for their areas of expertise. In hindsight, we probably should have had separate rosters for separate areas to encourage ownership, but we were a very small team (6 or so).


Developing software that other people deploy, as opposed to running a service people use directly, is pretty great for doing something people care about without being on call.

At my last job (comfortably small enterprise software shop), we had customers with more employees deploying and running our product than we ourselves had engineers. The only people who were first-line pageable were IT and the one engineer maintaining our demo server, which we eventually shut down.


There has to be a certain level of karma/schadenfreude in this happening the same week they are pushing their SRE book... Did they handle it well? It seems so, but a lot of their book is about an ounce of prevention being worth a pound of on-call pagers going off.


Prevention is a big part of SRE, but an equally big part is formalizing a process to learn from the inevitable outages that come with running a large, complex, distributed system built by fallible humans.

You figure out what went wrong and fix it, of course, but more importantly, you figure out where your existing systems and processes (failover, monitoring, incident response, etc.) did and didn't work, and you improve them for the next time.


Ask HN: Is anyone using different cloud providers for failover and what's your DNS configuration?

Do any cloud providers allow announcing routes for anycast DNS?


As a long time service provider network engineer I appreciate network clueful companies and recommend none and packet.net.

Vultr advertises and supports anycast, if you're looking for a multi-location VPS provider. Others will do BGP, but it's a sales process.


Yes, I'm using two different providers and sync them up using master-master replication.

For DNS I use DNSMadeEasy's DNS Failover feature that automatically fails over to a different IP address when it's unable to ping the server.


I, too, would be interested in info on any cloud providers that support anycast.


I'm no networking expert but packet.net has a page on this: https://www.packet.net/bare-metal/network/anycast/


I see the benefit of using anycast for your DNS, but is anycast actually a better option than DNS load balancing for my site? The idea behind using anycast is to use at least two different providers, so having only packet.net doesn't really cut it. Also, I can do DNS load balancing with any provider by using something like Azure's Traffic Manager, so I struggle to see the advantages.


It's very uncommon. In my experience the database becomes the issue.


If you're using something like Cassandra (C*), it's pretty easy to replicate data to multiple zones in multiple clouds. RethinkDB has a similar replication system... there are many others as well.

Not all data models fit into a NoSQL database though; some work better in a relational store. Caching and running read-only or partially up can be another approach.

Designing your data around this may be impractical for some applications, and on the small scale, likely more costly than necessary. Most systems can tolerate 15-30 minutes of downtime every couple years because of an upstream provider.
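As a concrete illustration of the Cassandra approach mentioned above, here's a minimal sketch using the DataStax Python driver; the contact points and datacenter names ("gce_dc", "aws_dc") are assumptions that would have to match your own snitch configuration:

  from cassandra.cluster import Cluster

  # Contact points in two different clouds (addresses are placeholders).
  cluster = Cluster(["10.0.0.10", "172.16.0.10"])
  session = cluster.connect()

  # NetworkTopologyStrategy keeps full replicas in each "datacenter"
  # (here: one per cloud provider), so either side can keep serving
  # traffic if the other provider is unreachable.
  session.execute("""
      CREATE KEYSPACE IF NOT EXISTS app
      WITH replication = {
          'class': 'NetworkTopologyStrategy',
          'gce_dc': 3,
          'aws_dc': 3
      }
  """)

Reads and writes at LOCAL_QUORUM then keep latency within one provider while the other datacenter catches up in the background.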


So what Google services went down with this? Looks like they are not eating their own dog food?


Has Amazon.com ever been affected by a AWS outage?

edit: at least it's impacted some of their ancillary services before http://www.theregister.co.uk/2015/09/20/aws_database_outage/


Amazon.com depends on AWS but not all the services and not necessarily in the critical path of serving up pages on the main website.

E.g. All the servers run on EC2 but they don't really use ELB or EBS. S3 is used heavily throughout the company but some of the newer services not so much.


A lot of Google's own services predate these public offerings. So my guess is even if they're on the same or similar technologies, they may be separate systems.


Google is self-hosted, but they might not use the same hardware GCE uses.


different hardware isn't really part of the equation - it's more that most of google's internal systems aren't _on_ gce, but _adjacent to_ gce. There's a cloud beneath that cloud, so to speak.


Most of Google does not run on Google Cloud.


Maybe things like Gmail have multi-cloud failover, say to AWS?


Google Custom Search also seems to have gone down globally today. Likely related; although GCE is back up, CSE is still out, leaving many sites without an international search feature.


I did not choose GCE because it does not have a server in Asia Pacific, while Microsoft Azure, DigitalOcean, and AWS do. Sorry, correction: what I mean is South East Asia.


They have three zones in Asia Pacific[0].

[0] https://cloud.google.com/compute/docs/zones#available


Sorry, I mean South East Asia. https://azure.microsoft.com/en-us/status/


kahwooi possibly meant South East Asia? Latency to GCE (Taiwan) from Australia is a fair bit worse than to AWS (Sydney/Singapore) for example.


i migrated from aws to google yesterday. fml


Time Warner Cable's DNS server went down at roughly the same time (in Austin). I'm hoping that's just a coincidence.


Correct me if I'm wrong, but it looks like Cloud VPN went down, not all of GCE.

FYI, I run about 15 servers in Asia, US East, and Europe on GCE with external monitoring, and I didn't get a peep from my error-checking emails during that timeframe.


I know this will get downvoted, but clouds suck and this is just one more manifestation of why. Unless you have a very spiky workload, save yourself long-term pain and don't go this route (applies if your monthly AWS/GCE/Azure bill is over a few K).


If you're in a position where you can spin up servers and get the job done, that means you're just using the cloud as rented servers, in which case you are absolutely right.

If however you're using the cloud as intended, and using all of the services it actually provides, I highly doubt you could run 23 data centers around the world with databases and firewalls and streaming logging and all the other stuff they provide at even a fraction of the cost.


In reality they provide remarkably little, with a lot of strings attached :). Let's take a service as basic as transport: care to compare the cost?


I'm not seeing why not. Your data center could go down for a myriad of reasons (ISP goes down, HDs, tripping on power cable, etc). If that happens you're pretty much screwed. You could compensate by having multiple data centers with different infrastructure providers. If you do, you're probably spending more than the few K you referenced in your post.

Yes, it's bad that apparently all of the regions failed. Google will hear about it. People will get in trouble. But a screwup at this level is rare. If you use a cloud, or even a VPS provider like Linode, you get auto-failover and someone who is contractually obligated to deal with failures.


Or the FBI raids the colo and rips out everything that looks like a computer because another tenant was operating a Silk Road clone.


You are paying a penalty in complexity, latency, and poor tenant isolation when running on "cloud infrastructure", and when things blow up you have no recourse.


Do you have any examples of poor tenant isolation in AWS, GCE, or Azure?

Cloud complexity is also lower because you don't have to worry about power, cooling, upstream connectivity, capacity budgeting, etc. If 99.9-99.95% availability is fine for your application then you probably don't have to worry about your provider either.


On AWS, Netflix consumes enough resources that if they spike 40-50%, everyone else is screwed. The software required to run a cloud like AWS is orders of magnitude more complex than what the average project would need, and that complexity results in major screwups. Both major AWS outages were due to control-plane issues; the second was the result of a massive Netflix migration that triggered throttles for everyone in the affected AZs, and those throttles had been put in place because of the earlier major outage that lasted for many hours.


> Do you have any examples of poor tenant isolation in AWS, GCE, or Azure?

I hate to feed a troll, but ...

Noisy neighbors are a problem all the way from sharing a server via VMs to top-of-rack switches.

And if you try hard enough, you can always escape your VM and "read somebody else's mail."


five nines = 45?


No; 999.99% uptime.


1000% = 10 days per year


So much for the SRE book they published last week.


16-17 minutes of down time isn't all that bad if you consider the SLA for GCE is 99.95%: https://cloud.google.com/compute/sla

So they can have 262 minutes of down time a year and still be within their SLA.


Hmmm... would this have affected Netflix?


(rant) Yes, traffic to Netflix increased significantly as Spotify was down.


It's fairly well known that netflix runs primarily on AWS.

So probably no.


Cheers - apologies for the ignorant question. (feel rather silly!)


How come Netflix is working on IPv6 then, when AWS does not offer it?


ELBs do IPv6 at the edge and everything else (ELB->EC2) is IPv4.


Though notably, EC2Classic ELBs only


Netflix runs on AWS.



