I've been overall very impressed with the direction of Google Cloud over the last year. I feel like their container strategy is much better than Amazon's with ECS, in that the core is built on open-source technology.
This can wipe away a lot of goodwill, though. A worldwide outage is catastrophic and embarrassing. AWS has had some pretty spectacular failures in us-east (which has a huge chunk of the web running within it), but I'm not sure that I can recall a global outage. To my understanding, these systems are built specifically not to let failures spill over to other regions.
Ah well. Godspeed to anyone affected by this, including the SREs over at Google!
I'm totally impressed with gcloud. Slick, smooth interface. Cheap pricing. The fact that the UI spits out API examples for doing what you're doing is really cool. And it's oh-so-fast. (From what I can tell, gcloud's SSD is 10x faster than AWS's, or 1/10th the cost.)
And this is coming from a guy who really dislikes Google overall. I was working on a project that might qualify for Azure's BizSpark Plus (they give you like $5K a month in credit), and I'd still prefer to pay for gcloud than get Azure for free.
Same here. I was considering GCP for the future, but this is bad. I'm not using them without some kind of redundancy with another provider. I hope they write a good post-mortem; these are always interesting at large scale.
How bad is it really? They started investigating at 18:51, confirmed a problem in asia-east1 at 19:00, the problem went global at 19:21, and was resolved at 19:26.
They posted that they will share results of their internal investigation.
That kind of rapid response and communication is admirable. There will be problems with cloud services - it's inevitable. It's how cloud providers respond to those problems that is important.
In this situation, I am thoroughly impressed with Google.
It's bad because it concerns all their regions at the same time, while competing providers have mitigations against this in place. AWS completely isolates its regions for instance [1], so they can fail independently and not affect anything else. That Google let an issue (or even a cascade of problems) affect all its geographic points of presence really shows a lack of maturity of the platform. I don't want to make too many assumptions, and that specific problem could have affected AWS in the same way, so let's wait for more details on their part.
The response times are what's expected when you are running one of the biggest server fleets in the world.
Expecting that a cloud provider won't have the problems that happen everywhere else is a pipe dream. They might be better at it because of scale, but no cloud provider can always be up. It happened at Amazon, and now it's happened at Google. Eventually, finding a provider that never went down will be like finding the airline that never crashed.
Operating across regions decreases the chances of downtime, it does not eliminate them.
> The response times are what's expected when you are running one of the biggest server fleets in the world.
That may be true, but actually delivering on that expectation is a huge positive. And more than having the right processes in place, they had the right people in place to recognize and deal with the problem. That's not a very easy thing to make happen when your resources cross global borders and time zones.
Look at what happened with Sony and Microsoft - they were both down for days and while Microsoft was communicative, Sony certainly was not. Granted, those were private networks, but the scale was enormous and they were far from the only companies affected.
AWS has never had a worldwide outage of anything (feel free to correct me). It's not about finding "the airline that never crashed", it's finding the airline whose planes don't crash all at the same time. It's pretty surprising coming from Google because 15 years ago they already had a world-class infrastructure, while Amazon was only known for selling books on the Internet.
Regarding the response times, I recognize that Amazon could do better on the communication during the outage. They tend to wait until there is a complete failure in an availability zone to put the little "i" on their green availability checkmark, and not signal things like elevated error rates.
AWS had two regions in 2008 [1]. That was 7 years ago, and I think you would agree that running a distributed object storage system across an ocean is a whole different beast than ensuring individual connectivity to servers in 2016.
Yeah... just don't look too closely under the covers. AWS has been working towards this goal but they aren't there yet. If us-east-1 actually disappeared off the face of the earth AWS would be pretty F-ed.
Our servers didn't go off, just lost connectivity. Same has happened to even big providers like Level3. Someone leaks routes or something and boom, all gone.
I'd be surprised if AWS didn't have a similar way to fail, even if they haven't. This is obviously a negative for gcloud, no doubt, but it's hardly omg-super-concerning. I'm sure the post-mortem will be great.
Actually, according to the status report, they confirmed that the issue affected all regions at 19:21 and resolved it by 19:27. That's six minutes of global outage.
The outage took my site down (on us-central1-c) at 19:13, according to my logs, so it was already impacting multiple regions by 19:13. (I have been using GCP since 2012 and love it.)
Thank you, I missed that on my first reading - I saw that the status update was posted at 19:45, but not the content within it stating the issue was resolved at 19:27. I updated my parent comment.
Switching from ECS to GKE (Google Container Engine) currently. Both seem overcomplex for the simpler cases of deploying apps (and provide a lot of flexibility in return), but I have found the performance of GKE (e.g. time for configuration changes to be applied, new containers to boot, etc.) to be vastly superior. The networking is also much better: GKE has overlay networking, so your containers can talk to each other and the outside world pretty smoothly.
GKE has good commandline tools but the web interface is even more limited than ECS's is - I assume at some point they'll integrate the Kubernetes webui into the GCP console.
GKE is still pretty immature though, more so than I realized when I started working with it. The deployments API (which is a huge improvement) has only just landed, and the integration with load balancing and SSL etc is still very green. ECS is also pretty immature though.
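For a concrete sense of what "deploying apps" on GKE involves, here is a minimal sketch using the official Kubernetes Python client; the deployment name, labels, and image are placeholders of my own, and it assumes your kubeconfig already points at the GKE cluster (e.g. via gcloud container clusters get-credentials):

    # Minimal sketch: create a Deployment on a GKE cluster via the Kubernetes API.
    # Requires `pip install kubernetes`; all names and the image are placeholders.
    from kubernetes import client, config

    config.load_kube_config()          # reuse credentials set up by gcloud/kubectl
    apps = client.AppsV1Api()

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="hello-web"),
        spec=client.V1DeploymentSpec(
            replicas=2,
            selector=client.V1LabelSelector(match_labels={"app": "hello-web"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "hello-web"}),
                spec=client.V1PodSpec(containers=[client.V1Container(
                    name="web",
                    image="gcr.io/my-project/hello:1.0",   # placeholder image
                    ports=[client.V1ContainerPort(container_port=8080)],
                )]),
            ),
        ),
    )

    apps.create_namespaced_deployment(namespace="default", body=deployment)

The point being that everything GKE-specific is in provisioning the cluster; the deployment itself is plain Kubernetes, which is exactly the flexibility/complexity trade-off described above.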
The problem is that GCP doesn't offer an RDS-style service with PostgreSQL, and external vendors are mostly more costly than AWS RDS - especially for customer homepages where you want to run on managed stuff as cheaply as possible.
This is sad for sure. The new Cloud SQL (MySQL) 2.0 is really good, and if you use a DB-agnostic ORM you can probably make MySQL work for quite a while. It's sad to lose access to all the new PG features though, and I would love it if Google expanded their Cloud SQL offerings.
While not Docker Cloud specifically, when we eyeballed UCP we found it very underwhelming when pitted against Kubernetes.
To us it appeared to be yet another in a sea of orchestration tools that give you a very quick and impressive "Hello World", but then fail to adapt to real-world situations.
This is what Kubernetes really has going for it: every release adds more blocks and tools that are useful, composable, and targeted at real-world use (and allow many of us crazies to deal with the oddball, quirky behavior our fleet of applications may have), not just a single path for how applications would ideally work.
Unfortunately, this has generally been a trend with Docker's tooling outside of Docker itself.
Similarly docker-compose is great for our development boxes, but nowhere near useful for production.
And it doesn't help that Docker's enterprise offerings still steer you towards docker-compose and the like.
Not to bash, but the page you linked is classic Docker - it says literally nothing about what "Docker Cloud" is.
"BUILD SHIP & RUN, ANY APP, ANYWHERE" is the slogan they repeat everywhere, including here, and it means even less everytime they do it. What IS Docker Cloud? Is it like Swarm? Does it use Swarm? What kinds of customers is Docker Cloud especially good at helping? All these mysteries and more, resolved never.
So am I (I'm a YC alum)... but RDS is too important for us to move away from it.
Let me put it this way - if you had an RDS equivalent in Docker Cloud, lots of people would switch. Docker is more popular than you know.
Heroku should be an interesting learning example for the tons of new-age cloud PaaS offerings I'm seeing. Heroku's database hosting has always been key to adoption, to the extent that lots of people continue to use it even after they move their servers to bare metal. The consideration and price sensitivity around data is very different than for app servers.
I believe this is Tutum, which they bought some time ago. I tried Tutum before with Azure. After deleting the containers from the Tutum portal, it did not clean everything up from Azure. To this day, the storage created by Tutum is still sitting in my Azure storage. LOL.
Seconded--I can tell the ECS documentation is trying to help, but the foreign task/service/cluster model + crude console UI keeps telling me to let my workload ride on EC2 and maybe come back later.
What I figured out much later was that ECS is a thin layer on top of a number of AWS services - they use an AMI that I can use myself, EC2 VMs that I can run myself, and Security Groups + IAM roles that I can create on my own.
But the way they have built the ECS layer is very, very, VERY bad - and I have an unusually high threshold for documentation pain.
I work on Convox, an open source PaaS. Currently it is AWS only. It sets up a cluster correctly in a few minutes. Then you have a simple API - apps, builds, releases, environment and processes - to work with. Under the hood we deploy to ECS but you don't have to worry about it.
So I do agree that ECS is hard to use but with better tooling it doesn't have to be.
If Spotify wanted to be really sneaky, some amount of downtime might be good for them financially.
The bulk of their revenue comes from customers who subscribe on a per-month basis, while they pay out royalties on a per-song-played basis. This outage is reducing the amount they have to pay, and if the outage-elasticity-of-demand is low enough they could (hypothetically) come out ahead!
> while they pay out royalties on a per-song-played basis
I believe this is inaccurate. They pay out royalties on a share-of-all-plays basis, don't they?[0] So an outage wouldn't reduce the payout amount, it would just slightly alter the balance of payments for individual rightsholders.
[0]http://www.spotifyartists.com/spotify-explained/#royalties-i...: "That 70% is split amongst the rights holders in accordance with the popularity of their music on the service. The label or publisher then divides these royalties and accounts to each artist depending on their individual deals... Spotify does not calculate royalties based upon a fixed “per play” rate."
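To make the share-of-all-plays point concrete, here is a minimal sketch with made-up numbers (my own illustration, not Spotify's actual formula):

    # Pro-rata royalty model: the payout pool is a share of revenue, and each
    # rightsholder gets a slice proportional to their share of total plays.
    def prorata_payouts(total_revenue, plays_by_rightsholder, pool_share=0.70):
        pool = total_revenue * pool_share
        total_plays = sum(plays_by_rightsholder.values())
        return {holder: pool * plays / total_plays
                for holder, plays in plays_by_rightsholder.items()}

    # An outage scales everyone's play counts down roughly proportionally,
    # so the ratios (and thus the split) barely move; the pool itself is
    # driven by subscription revenue, not by how many songs were played.
    print(prorata_payouts(1_000_000, {"label_a": 600_000, "label_b": 400_000}))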
Not across multiple regions, though. It's not trivial to make an application cross-region, but at least there is a way to engineer around an outage, unlike this one.
I think it's a great idea to diversify not just regions, but providers.
The future is per-application virtual networks that are agnostic to the underlying hosting provider. These networks work as an overlay, which means your applications can be moved between providers without changing their architecture at all. You could even shut an application down in provider A and start it in provider B without any changes.
At Wormhole[1] we have identified this problem and solved it.
Even if Amazon.com uses AWS (the extent of which seems it may be a mixture of marketing hype and urban legend), there are many ways AWS could fail that affects customers but leaves Amazon.com unaffected.
"Downtime Period" means, for an Application, a period of five consecutive minutes of Downtime. Intermittent Downtime for a period of less than five minutes will not be counted towards any Downtime Periods.
I see from the downvotes that my reply must have been seen as kind of off-topic to the GCE issue. However, since Spotify came up as a "victim", I did feel it prudent to mention that Spotify Premium has offline playlists to allow users to weather network issues of any kind. Also, for me personally, big playlists of quality music like this one are fantastic for my work.
That was nuts. Interested to read the post-mortem on this one. Our site went down as well. What could cause a sudden all-region meltdown like that? Aren't regions supposed to be more isolated to prevent this type of thing?
Seems to have only been down for about 10 minutes, so I'm thinking some sort of mis-configuration that got deployed everywhere...they were working to fix a VPN issue in a specific region right before it went down...
Our website was down as well, for 16 minutes. My guess is that it was a bad route that was pushed out simultaneously (probably not the intention). It happened once before, sometime last year, if I remember correctly. We'll have to wait and see what the definitive cause was, though.
You get an AS number, and announce your own IP space. DNS failover only sort-of works.
Or you subscribe to a "GSLB" service where they do this for you for a significant fee. Or you use a "man-in-the-middle as a service" system like Cloudflare, who do this at an extremely reasonable and/or free cost.
Of course, you still have to deal with the risk of route leaks, BGP route flapping/dampening, and other things which can take your IP addresses offline despite the fact you are multihoming with different carriers in different locations.
So perhaps you set up IP addresses on different ASNs and use both DNS- and IP-based failover.
But then you find a bug somewhere in your software stack which makes all of this redundancy completely ineffective. So you just take your ball, go home and cry.
You put it in all your clouds, with low TTL DNS entries pointing at all those instances (or the closest one geographically maybe). Then if you're really paranoid you use redundant DNS providers as well.
And then you discover that there are a LOT of craptastic DNS resolvers, middle boxes, AND ISP DNS servers out there that happily ignore or rewrite TTLs. With a high-volume web service, you can have a 1 minute TTL, change your A records, and still see a lovely long tail of traffic hitting the old IP for HOURS.
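If you want to see this for yourself, it's easy to probe a few resolvers and compare the TTL they hand back against the one you configured. A minimal sketch assuming the dnspython package; the domain and resolver IPs are just placeholders:

    # Ask specific resolvers for an A record and report the TTL they return.
    # Run this right after changing a record to see who is still caching it.
    import dns.resolver  # pip install dnspython

    def observed_ttl(name, nameserver):
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [nameserver]
        answer = r.resolve(name, "A")   # dnspython >= 2.0; older versions use .query()
        return answer.rrset.ttl

    for ns in ["8.8.8.8", "1.1.1.1", "9.9.9.9"]:
        print(ns, observed_ttl("example.com", ns))

Of course the really problematic caches are the ISP resolvers and middleboxes you can't query directly, which is exactly the long tail being described.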
The point was that adding another point for potential failure still won't reduce the chance of failure... it's just something else that can and will break.
In any case, failures happen, and most systems are better off being as simple as possible and accepting the unforeseen failures than trying to add complexity to overcome them.
So in all seriousness, how do folks deal with this?
In this case, it ended up being a multi-region failure, so your only real solution is to spread it across providers, not just regions.
But I imagine it's a similar issue to scaling across regions, even within a provider. We can spin up machines in each region to provide fault tolerance, but we're at the mercy of our Postgres database.
Most people just deal with it and accept that their site will go down for 20 minutes every 3-4 years or so, even when hosting on a major cloud, because:
1) the cost of mitigating that risk is much higher than the cost of just eating the outage, and
2) their high traffic production site is routinely down for that long anyway, for unrelated reasons.
If you really, really can't bear the business costs of an entire provider ever going down, even that rarely (e.g. you're doing life support, military systems, big finance), then you just pay a lot of money to rework your entire system into a fully redundant infrastructure that runs on multiple providers simultaneously.
There really aren't any other options besides these two.
I will add that if you can afford the time and effort to do so, it would be good to design your system from the beginning to work on multiple providers without many issues. That means trying as hard as you can to use as few provider-specific things as possible (RDS, DynamoDB, SQS, BigTable, etc.). In most cases, pjlegato's 1) will still apply.
But you get a massive side-benefit (main benefit, I think) in cost. There are huge bidding wars between providers and if you're a startup and know how to play them off each other, you could even get away with not having to pay hosting costs for years. GC, AWS, Azure, Rackspace, Aliyun, etc, etc are all fighting for your business. If you've done the work to be provider-agnostic, you could switch between them with much less effort and reap the savings.
If you are doing life support, military systems, or HA big finance, then you are quite likely to be running on dedicated equipment, with dedicated circuits, and quite often highly customized/configured non-stop hardware/operating systems.
You are unlikely to be running such systems on AWS or GCE.
Yep. Those were originally Itanium-only, so their success was somewhat… limited, compared to IBM's "we're backwards compatible to punch cards" mainframes.
Only recently did Intel start to port over mission-critical features like CPU hotswap to Xeons so they can finally let the Itanic die; hopefully we're going to see more x86 devices with mainframe-like capabilities.
Hosting on anything, anywhere, really. Even if one builds clusters with true 100% reliability, running on nuclear power and buried 100 feet underground, you still have to talk to the rest of the world through a network which can fall apart for a variety of reasons. If most of your users are on their mobile phones, they might not even notice outages.
At some point adding an extra 9 to the service availability can no longer be justified for the associated cost.
Depends entirely on your business, but what I do is just tolerate the occasional 15-minute outage. There's increasing cost to getting more 9's on your uptime, and for me, engineering a system that has better uptime than Google Cloud does, by doing multi-cloud failover, is way out of favorable cost/benefit territory.
It is impossible to ensure 100% uptime, and it gets increasingly harder to approach that as you put more separate entities between yourself and the client. The thing is, you'll be blamed for problems that aren't in your control and aren't really related to your service, but to the customer. For example, local connectivity, phone data service, misbehaving DNS resolvers, packet mangling routers, mis-configured networks, mis-advertised network routes, etc. Every single one of those examples can happen before the customer traffic even gets to the internet, much less where you have your servers housed.
All you can do is accept that there will be problems attributed to your service, rightly or not, work to mitigate and reduce the problems you can, and learn that it's not the end of the world.
The article quotes Ben Treynor as saying that Google aimed for and hit 99.95% uptime, which is about 4.3 hours of downtime per year.
My guess is that, despite cloud outages being painful, many applications are probably going to meet their cost/SLO goals anyway. Going up from 4 9s starts to get very expensive very quickly.
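The arithmetic behind those numbers is worth keeping handy when you argue about how many nines to pay for:

    # Allowed downtime per year at various availability targets.
    HOURS_PER_YEAR = 365 * 24  # 8,760

    for target in (0.999, 0.9995, 0.9999):
        allowed = HOURS_PER_YEAR * (1 - target)
        print(f"{target:.2%} uptime -> {allowed:.2f} h/year ({allowed * 60:.0f} min)")

    # 99.90% uptime -> 8.76 h/year (526 min)
    # 99.95% uptime -> 4.38 h/year (263 min)   <- the ~4.3 hours quoted above
    # 99.99% uptime -> 0.88 h/year (53 min)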
Most people and customers are tolerant of 15 minutes downtime here and there once or twice a year. Sure, there will be loudmouths who claim to be losing thousands of dollars in sales or business, but they're usually lying and/or not customers you want to have. They'll probably leave to save $1 per month with another vendor.
It sucks but the days of "ZOMG EBAY/MICROSOFT/YAHOO DOWN!!11!" on the cover/top of slashdot and CNET are gone. Hell, slashdot and CNET are basically gone.
IMHO, the next wave is likely multi-cloud. Enterprises that require maximum uptime will likely run infrastructure that spans multiple cloud providers (and optionally one or more company controlled data centers).
OneOps (http://oneops.com) from WalmartLabs enables a multi-cloud approach. Netflix Spinnaker also works across multiple cloud providers.
DataStax (i.e. Cassandra) enables a multi-cloud approach for persistent storage.
DynomiteDB (disclaimer: my project) enables a multi-cloud approach for cache and low latency data.
Combine the above with microservices that are either stateless or use the data technologies listed above and you can easily develop, deploy and manage applications that continue to work even when an entire cloud provider is offline.
Get enough things running on multi-cloud, and you could potentially see multi cloud rolling failures, caused by (for example) a brief failure in service A leading to a gigantic load-shift to service B...
This assumes all the software your stack uses and everything you deploy is completely bug-free. While rare, bugs that have been in production for a long time can surface only when you hit certain conditions. Now all your services are down. 100% is impossible.
Also, if there is a problem with one component of your stack that could have run on a cloud service, chances are Google or Amazon will fix your edge case much quicker than you.
By pretending it's still the golden age of the internet and using physical servers in those locations. You might have to hire some admins, though ;).
Guys, I've got it- instead of locking ourselves in with one vendor's platform and being subject to their mismanagement and arbitrary rules, why don't we buy our own hardware and do it ourselves?
We can free up our OpEx budget too! My sales rep sent me a TCO that shows it is way cheaper to run a data center than to pay a cloud subscription!
AWS had several major outages in the past, especially between 2009 and 2012. In some cases, it was not only downtime but also data loss, which is the hardest part. There are 8,760 hours in a year; if you are down for a total of less than about 8.8 hours, you're in the >99.9% uptime category (also called "three 9s").
Four 9s (99.99%) is considered a nice plus. Very few businesses really need that.
However, uptime is one thing; data loss is a different beast entirely.
AFAIK, this one for Google is only downtime. Being able to keep GCE up for most of the year, except a few hours, means more than 99.9% availability, which is what most customers need.
Operational excellence, or the ability to have your cloud up and running, comes only with a large customer set; Google is now gaining a lot of significant customers (Spotify, Snapchat, Apple, etc), and therefore I expect them to learn what's needed over the coming months.
2016 will be ok. 2017, in my view, will be near perfect.
If Google wants to differentiate themselves from AWS, they should offer an SLA on data integrity (at a premium, obviously). That's how you get thousands of enterprise customers.
It seems kinda misleading for Google to claim repeatedly that they are hosted on the same infrastructure as GCP, and then not go down with it.
EDIT: Switched from "dishonest" to "misleading"; while it's abundantly clear that Google doesn't run on GCP, GCP feels like a second-class citizen to Google because you just cannot get Google uptime with it.
(I'm a Google SRE, I'm on the team that dealt with this outage)
This did impact common infrastructure. Some (non-cloud) Google services were impacted. We've spent years working on making sure gigantic outages are not externally visible for our services, but if you looked very closely at latency to some services you might have been able to see a spike during this outage.
My colleagues managed to resolve this before it stressed the non-cloud Google services to the point that the outage was "revealed". If this was not mitigated, the scope of the outage would have increased to include non-cloud Google services.
There's a lot of infrastructure at Google. And the claim is correct - GCP and Google run on top of the same servers, the same backend systems. Are they on exactly the same servers? No, of course not -- there are a few servers out there :-)
This was a global network outage, so we can't talk about the "exact same servers". The implication seems to be that Google runs on GCP, yet a global network outage somehow affected every GCP customer except Google.
I was so glad to get away after 5 years of 24/7/365. I had to drive home 5 hours from holiday once, leaving the rest of the family behind, spend 20 minutes sorting stuff out and drive back - the untold joy of pre-cloud startups :)
What do you recommend? I figure that if you're working on something without oncall, probably no one cares about it anyway. I prefer to have a good rotation than no rotation.
Staffing such that on-call is handled by presently-in-office staff. This is, as I understand, pretty much what Google does. When you're in the office, you're in the office, but when you're not, you're not. Having global coverage means ops in several timezones, and this is what Google accomplishes.
Not knowing when, at any time, your phone or pager will go off wears on you in interesting ways over time.
It depends on the team and type of oncall rotation for the service. My team (a SWE team) has its own oncall rotation as we don't have dedicated SREs for all of our services.
Since we're US-based only, it means the oncall person will have pager duty while they sleep. Our pager can be a bit loud at night due to the nature of our services, so it's definitely not for everyone (luckily it's optional).
I'll note you're SWE not SRE. I'm talking mostly about dedicated Ops crew on pager.
It's one thing if you're responding to pages resulting from other groups' coding errors or failure to build sufficiently robust systems. It's another if you're self-servicing.
One of my own "take this job and shove it" moments came after pages started rolling in at 2am, bringing me on-site until 6am. I headed back for sleep, showed up that afternoon and commented on the failure of any of the dev team to answer calls/pages/texts (site falling over, I had exceptionally limited access capabilities and was new on team). Response was shrugs.
Mine was "That wasn't your ass being hauled out of bed. See ya."
The opinions stated here are my own, not necessarily those of Google.
Yes, it is at Google. Our important and high visibility bits have SREs that help monitor our services (SREs actually approached us to take over some bits that were more important).
Google has a lot of oncall people who aren't going to go into a data center (most Googlers never see a data center). So there are lots of oncall rotations that still have an SLA but can be handled from bed if something happens at 2am.
This is not generally true for at least the big SRE-supported services at Google. I don't know what every team does, but my team's oncall shift (for example) is 10am-10pm, Mon-Thu or Fri-Sun. Another office covers the 10pm-10am part of the US day.
Games, mobile apps, desktop apps like Photoshop, Office, Intellij etc and some shrink wrapped server side apps. But you are right, some of these products are starting to have an online component as well.
I recommend others doing on call so I don't have to. I'm not an ops person, though I probably wouldn't mind some of the job, and hate being on call. I did it for a year at my current workplace (as a dev). All the problems I was capable of fixing I automated away, and got really annoyed that others didn't do the same for their areas of expertise. In hindsight, we probably should have had separate rosters for separate areas to encourage ownership, but we were a very small team (6 or so).
Developing software that other people deploy, as opposed to running a service people use directly, is pretty great for doing something people care about without being on call.
At my last job (comfortably small enterprise software shop), we had customers with more employees deploying and running our product than we ourselves had engineers. The only people who were first-line pageable were IT and the one engineer maintaining our demo server, which we eventually shut down.
There has to be a certain level of karma/schadenfreude in this happening in the week when they are pushing their SRE book... Did they handle it well? It seems so, but a lot of their book is about an ounce of prevention over a pound of on-call pagers going off.
Prevention is a big part of SRE, but an equally big part is formalizing a process to learn from the inevitable outages that come with running a large, complex, distributed system built by fallible humans.
You figure out what went wrong and fix it, of course, but more importantly, you figure out where your existing systems and processes (failover, monitoring, incident response, etc.) did and didn't work, and you improve them for the next time.
I see the benefit of using anycast for your DNS, but is anycast actually a better option than DNS load balancing for my site?
The idea behind using anycast is to use at least two different providers, so having packet.net only doesn't really cut it. Also, I can do DNS load balancing with any provider by using something like Azure's Traffic Manager, so I struggle to see the advantages.
If you're using something like Cassandra (C*), it's pretty easy to have replicated data to multiple zones in multiple clouds. RethinkDB has a similar replication system... there are many others as well.
Not all data models fit into a NoSQL database, though; some work better in a more relational store. Caching and running partially up in read-only mode can be another approach.
Designing your data around this may be impractical for some applications, and on the small scale, likely more costly than necessary. Most systems can tolerate 15-30 minutes of downtime every couple years because of an upstream provider.
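For what it's worth, the Cassandra side of that is just a keyspace replicated across logical datacenters, one per cloud. A minimal sketch with the Python driver; the contact point and datacenter names are placeholders for whatever your topology actually uses:

    # Keyspace replicated across two logical DCs, e.g. one in GCP and one in AWS.
    # Requires `pip install cassandra-driver`; IPs and DC names are placeholders.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])    # any reachable node in the cluster
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS app_data
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'gcp_us_central': 3,
            'aws_us_east': 3
        }
    """)

    # With LOCAL_QUORUM reads/writes on each side, either provider can drop
    # off entirely and the other keeps serving, at the cost of cross-cloud
    # replication traffic and eventual consistency between the two.
    cluster.shutdown()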
Amazon.com depends on AWS but not all the services and not necessarily in the critical path of serving up pages on the main website.
E.g. All the servers run on EC2 but they don't really use ELB or EBS. S3 is used heavily throughout the company but some of the newer services not so much.
A lot of Google's own services predate these public offerings. So my guess is even if they're on the same or similar technologies, they may be separate systems.
Different hardware isn't really part of the equation - it's more that most of Google's internal systems aren't _on_ GCE, but _adjacent to_ GCE. There's a cloud beneath that cloud, so to speak.
Google Custom Search also seems to have gone down globally today. Likely related; although GCE is back up, CSE is still out, leaving many sites without an international search feature.
I did not choose GCE because it does not have servers in Asia Pacific, while Microsoft Azure, DigitalOcean, and AWS do. Sorry, correction: what I mean is Southeast Asia.
Correct me if I'm wrong, but it looks like Cloud VPN went down, not all of GCE.
FYI I run about 15 servers in Asia, USAEast, and Europe on GCE with external monitoring and didn't get a peep from my error checking emails during that timeframe.
I know this will get downvoted, but clouds suck and this is just one more manifestation of why. Unless you have a very spiky workload, save yourself long-term pain and don't go this route (this applies if your monthly AWS/GCE/Azure bill is over a few thousand dollars).
If you're in a position where you can spin up servers and get the job done, that means you're just using the cloud as rented servers, in which case you are absolutely right.
If however you're using the cloud as intended, and using all of the services it actually provides, I highly doubt you could run 23 data centers around the world with databases and firewalls and streaming logging and all the other stuff they provide at even a fraction of the cost.
I'm not seeing why not. Your data center could go down for a myriad of reasons (your ISP goes down, HDs fail, someone trips on a power cable, etc.). If that happens, you're pretty much screwed. You could compensate by having multiple data centers with different infrastructure providers, but if you do, you're probably spending more than the few K you referenced in your post.
Yes, it's bad that apparently all of the regions failed. Google will hear about it. People will get in trouble. But a screw-up at this level is rare. If you use a cloud, or even a VPS provider like Linode, you get auto-failover and someone who is contractually obligated to deal with failures.
You are paying a penalty in complexity, latency, and poor tenant isolation when running on "cloud infrastructure", and when things blow up you have no recourse.
Do you have any examples of poor tenant isolation in AWS, GCE, or Azure?
Cloud complexity is also lower because you don't have to worry about power, cooling, upstream connectivity, capacity budgeting, etc. If 99.9-99.95% availability is fine for your application then you probably don't have to worry about your provider either.
On AWS, Netflix consumes enough resources that if they spike 40-50%, everyone is screwed. The software required to run a cloud like AWS is orders of magnitude more complex than what the average project would need, and it results in major screwups. Both major AWS outages were due to control-plane issues; the second was the result of a massive Netflix migration that triggered throttles for everyone in the affected AZs. Those throttles were put in place in the first place because of the earlier major outage that lasted for many hours.