Hacker News
Ongoing Incident in Google Cloud (cloud.google.com)
235 points by sd2k on Feb 27, 2023 | 105 comments



This affected us starting at 4:57am US/Pacific, with a significant drop in traffic through the HTTPS Global Load Balancer across all regions and Pub/Sub 502 errors, but there was nothing on the status page for another 45 minutes. Things returned to normal by 5:05am from what I can tell.


Yup, we saw the exact same symptoms, with some GCLBs returning 100% 502s (our upstream QPS graph looks scary, with 5 mins of 0 QPS).


This demonstrates yet again why global configurations, global services, and global anycast VIP routing should be considered an anti-pattern.

gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.


You can't really have 30+ fully independent regions running their own stack with different versions of apps and separate secrets, IP/routing and certificates in each. At some point you have to unify or it becomes either unmanageable or inconsistent.


Right. You want regions to be fully independent, yet the software stacks they are running to be fully synchronized and consistent. So there’s a tension. If there’s a sleeper bug that wakes only after it has been rolled out to every region, you’ve got a global outage. Given the increasing complexity of these systems, it will never be possible to find all of those in advance.


Most of GCP’s customers can’t, but independent regions are one of the benefits that a well architected cloud provider can give you to build on.


do you mean the cloud provider can't, or the customer can't?


But you can have 3. Why did you choose 30?

In my company we are split into 3 (US, EU, APAC), and we have the same issue with global outages for stuff we could have just managed regionally. For all the savings of the global architecture, they disappear each minute a client is down in a global outage because a guy thousands of kms away messed up.

You don't have to unify, at all. You don't unify with your competitors and the world has not exploded: why not compete internally between regions?


GCP needs to support 30 regions because... they're a cloud provider.


Then they could do a "glocal" model, with the 30 regions grouped into 3 semi-global groups of 10, so when there's a global outage it can only hit one of those?


How does this fit in with upcoming EU data sovereignty laws?


The underlying problem is that Google doesn't operate the world's DNS servers, but still wants to offer the best possible user experience as a global service. This means anycast VIP routing: not all DNS servers implement EDNS, but they still want SSL connections to terminate as close to users as possible.

As far as global services go though, it's easy enough to say "it should just not be possible", but how do you propose doing that in practice for a global service?

How is new config going to go out globally without being global? How do global services work if they're not global? How does DDoS protection work if you don't do it globally?

People make fun of "webscale" but operating Google is really difficult and complicated!



AWS US east 1 had significant downtime last year so I'm not sure what you're trying to say with that link. Would you mind expanding on your thoughts?


One region failing (especially us-east-1) is common, but it's very rare to see an AWS global outage.


This. us-east-1 is the oldest region IIRC and it has its share of issues. Back when I used to work mostly on AWS, zonal outages happened once in a while, but entire-region outages were rare, and global outages rarer still.

The global outage thing seems to be a consistent "feature" of GCP - how are we supposed to architect our deployments if the regional isolation model is not a bulwark for high availability on GCP?


> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

As I understand it, GCP is already designed to make global outages impossible. Obviously this outage shows that they messed up somehow and some global point of failure still remains. Looking forward to the post-mortem.


They've had many, many global outages through the years, so that's evidently not true: GCLB, IAM, GCS, and probably more I'm missing just off the top of my head. Then there's the constant stream of regional networking borks where your latency is suddenly 5x, which are not "global" but affect multiple regions.


Anecdata, but in my experience Google Cloud has been MUCH more solid than my time spent on AWS.


While a fair point it's in no way a counter argument to what the person above was saying. Having fewer outages is not the same as having no global outages.


Historically that has not been my experience at all, though to be fair GCP has cleaned up their act substantially in the past 1-2 years.


They had lots of global outages in past years, but in recent years they have become increasingly rare, presumably because of a move away from global points of failure


My knowledge level: can use AWS console to do < 5% of what is possible.

How much more work would Google create for themselves if they had not globalized their stack? Are we talking something like 5 subsets to manage instead of 1?


Most of it is cellular or regional, but there are a few critical global services. The global network load balancing, network QoS, and DDoS prevention are more functional because they are global (i.e. you couldn't replace them with equivalent regional versions), but they are often the cause of issues like this. There was a push a few years ago to ensure global services had at least 99.999% uptime or make them regional. This was a 48-minute outage, so it blows that five-nines budget for about 9 years.
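For reference, the error-budget arithmetic behind that last claim, as a rough sketch in Python (the 48-minute figure is the one quoted above):

  # Five nines allows roughly 5.26 minutes of downtime per year.
  minutes_per_year = 365.25 * 24 * 60                   # ~525,960 minutes
  allowed_downtime = minutes_per_year * (1 - 0.99999)   # ~5.26 min/year
  outage_minutes = 48                                   # duration quoted above
  print(f"five-nines budget: {allowed_downtime:.2f} min/year")
  print(f"budget consumed: ~{outage_minutes / allowed_downtime:.1f} years")
  # -> ~9.1 years of error budget spent in one incident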

Ex-googler, no particular knowledge of this event, information might be out of date.


The pattern for past large google outages has been:

1. Some networking-related service has global, non-standard (compared to the rest of the company) configuration

2. The relevant VP is aware and has decided not to change it because that change is quoted as impossible

3. Some change elsewhere happens that assumes standard configuration

4. The networking service breaks and causes a global outage

5. VP is told to fix it

6. Fix rolls out in weeks, because it wasn't as hard as they said before


Often "impossible" is based on constraints like "0 downtime" "100% planned rollout, rollback scenarios" etc.

These constraints get thrown to the wind when the downtime is already happening.


I was being a bit hyperbolic, but this is the real reason. However, the VPs in question often have the authority to approve changes that don't have rollback scenarios (for example), they just don't until the shit hits the fan.


Assuming good automation, most of the work comes in being able to run a second instance of something instead of just having one. The difference in work between “single point” and “multiple points” is a lot, but increasing the number of points beyond that isn’t too bad.

Of course, if you deploy a change to all of your separated stacks at once through some sort of automated pipeline, the separation doesn’t buy you much. It's easy to break everything simultaneously that way if there’s some difference between test and prod you didn’t realize was there.
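A minimal sketch of the safer alternative: roll the change out wave by wave with a bake period and health gate in between, instead of hitting every stack at once. The region names and the deploy/check_health helpers are hypothetical placeholders, not anything GCP-specific:

  import time

  # Progressively larger waves: canary first, then the rest.
  WAVES = [["eu-canary"], ["us-1", "us-2"], ["eu-1", "apac-1", "apac-2"]]

  def deploy(region: str) -> None:
      ...  # push the new config/binary to this region (placeholder)

  def check_health(region: str) -> bool:
      ...  # inspect error rates / SLO burn for this region (placeholder)
      return True

  for wave in WAVES:
      for region in wave:
          deploy(region)
      time.sleep(15 * 60)  # bake time before widening the blast radius
      if not all(check_health(r) for r in wave):
          raise SystemExit(f"rollout halted after wave {wave}; roll back")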


If you get into the nitty gritty of it, it doesn't really make sense. Are you going to have 5 different load balancer software stacks, with 5 different config file languages, causing each client (say Gmail) to have to implement their config 5 different ways? That's insane.


My biggest AWS surprise bill (so far!) was due to a bug in AWS console region switching.


> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

I reckon the only way to achieve that would be to have the same level of interoperability between regions as you would get between two distinct cloud providers.


From the messaging, this seems like a partial network outage.

Of course, at Google scale 'partial' is still very big.


> This demonstrates yet again why global configurations, global services, and global anycast VIP routing should be considered an anti-pattern.

And why enterprises clamoring for AWS to feature-match Google's global stuff (theoretically making I.T. easier) instead of remaining regionally isolated (making I.T. actually more resilient, without extra work if I.T. operators can figure out infra-as-code patterns) should STFU and learn themselves some Terraform, Pulumi, etc.

Also, AWS, if you're in this thread: stop with the recent cross-region coupling features already. Google's doing it wrong; explain that and be patient, and the market share will come back to you when they run out of the GCP subsidy dollars.


You really want to go through every region to find what VMs are running? Why can this not be a single page with all VMs listed?


The AWS console already has a single page where you can see how many EC2/networking resources you have in every region.
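For anyone who'd rather script it, a minimal boto3 sketch (assumes AWS credentials are already configured) that walks every enabled region and prints the instances in each:

  import boto3

  # Enumerate enabled regions, then list every EC2 instance in each one.
  ec2 = boto3.client("ec2", region_name="us-east-1")
  regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

  for region in regions:
      client = boto3.client("ec2", region_name=region)
      for page in client.get_paginator("describe_instances").paginate():
          for reservation in page["Reservations"]:
              for inst in reservation["Instances"]:
                  print(region, inst["InstanceId"], inst["State"]["Name"])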


  > gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

If that's what you really need, then distribute your assets across GCP, AWS, and DO. That likely means not using any cloud-specific features such as Lambda. AWS is actually really good in this regard, as SES and RDS are easily replaced with regular instances in other cloud providers, which possibly wrap some cloud-specific feature themselves.


But....cost.


Reliable, cheap, or ....? Pick N-1.


For reference / comparison, how many regional outages have there been? Did service outages get avoided due to running a workload in multiple regions?


Because copypasting from A to B is much safer...


There are three things that scare Google engineers enough to keep them up at night: a global network outage, a global power outage, and a global Chubby outage. Actually, they only really worry about that last one.


Even the public apt key for signing Google's cloud packages (https://packages.cloud.google.com/apt/doc/apt-key.gpg) is unavailable (returns 500 for me). This is insane.


This key was returning 500s some hours before the incident started; I hope it's unrelated.


inb4 it turns out an intern was tasked with updating the apt key, which caused a cascading outage of all their services


Downloading the key has been erroring since at least ~5pm PT yesterday, 2/27. It’s likely unrelated. Though I’d be unsurprised if the recent layoffs contributed to the situation.




As has happened many times throughout history (back to mainframes and thin clients of the 90s) there are swings/trends in how infrastructure is hosted.

Listening to the “All In Podcast” yesterday, even those guys were talking about revenue drops in the big cloud services and noting we’re currently in the midst of a swing back to self-hosting/co-location/whatever thinking, and migrations out.

IMHO those building greenfield solutions today should take a hard look at whether the default approach of the last ~10 years, “of course you build in $BIGCLOUD”, makes sense for the application - in many cases it does not.

It also has the added benefit of de-centralizing the internet a bit (even if only a little).


As others have mentioned, there was no revenue drop, there's been a reduction in growth. AWS's 20% growth rate is still very respectable, more than double the 9% growth rate the company had overall.

I would be hesitant to attribute slowed growth to a return to self hosting, it's much more likely that it's caused by companies dialing back their cloud growth after spending a few years going ham digitizing everything during the pandemic.


The parent still has a very strong point, considering that a drop in growth (not revenue) quickly translates into projects/features being cancelled. That's a good thing for failing fast from a start-up POV, but when I as a start-up need to make a bet about building on top of certain features, this adds to my cost/benefit calculation when deciding if I want to jump on new features (device shadows, digital twins, or whatever else is the latest innovation the cloud announces).

From that pov I expect my platform to behave like a utility (never change or only change with strict backward compatibility). That level of control simply is against the business model of the cloud.


But there are so many degrees of ratcheting back cloud costs before we get back to self-hosted.

Sure, companies are probably less interested in wacky new cloud features than they were before, but that means going back to basics like EC2 and RDS, which do function like utilities, not going back to their own data centers.


Ah yes, sorry, slower than expected growth was the data point. In my defense I had a screaming toddler in the car!

That said I think the point generally remains - one could argue slower than expected growth in cloud services is a revenue drop (in a way) vs expectations. The market responded accordingly[0] - "However, Azure growth is decelerating." Note that this is all including the explosion in "2023 AI hotness" which is almost certainly offsetting what would be larger losses due to the shift I'm arguing. As the All In Guys noted "you won't see a pitch deck without the letters AI in it" - and a good chunk of that is still going to cloud providers as (in my opinion) there are long tails to these changes and many existing solutions/applications getting "AI" slapped on them are effectively trapped in $BIGCLOUD.

Self-hosting AI is also significantly more difficult and more expensive upfront when you start looking at dealing with (typically) Nvidia hardware costs and software stack complexity. For many of these "pivots" to "something AI, we need to throw AI in this", I can definitely see the better-understood and initially faster and "cheaper" utilization of cloud services continuing until the AI trend stabilizes.

From what I could hear (and process) over the screaming the All In Guys presented the argument I tend to agree with - a resurgence of self-hosted infrastructure.

Companies are also dialing back cloud spend because they're realizing for many applications it's very expensive relatively and can actually be limiting compared to self-hosting[1]. Per usual when the cheap money and economic boom retracts they start actually looking at costs they were once happy to just keep writing checks for.

I'd like to reiterate there's a lot of calculation and strategy when it comes down to selecting infrastructure hosting. Again, I think we're in a period where there's a bit of a sea change/wakeup from the past decade of "of course you always build and host everything in $BIGCLOUD" - without even remotely considering alternatives. It's been the default for a while and it isn't as much anymore - and I'd argue that trend is accelerating. There is no "one size fits all".

[0] - https://www.investors.com/news/technology/msft-stock-microso...

[1] - https://www.linkedin.com/pulse/snapchat-earnings-case-runawa...


I think you're still jumping to conclusions to think that the ratcheting back is going to take any significant portion of the market all the way back to self-hosted. I suspect that companies are less willing to invest in fancy new platform features that drive more revenue than VPSs and managed DBs, but I have a very hard time believing that EC2 or RDS are flagging.


I should have been more clear on this - in terms of total install base I don't know that it's going to be "significant" in terms of customer count.

However, I do think it will be at least "noticeable" in terms of individual customers with large spend. Total GCP revenue in 2022 was roughly 65 billion, and Snap leaving would by itself be about 1.5% of that.

Especially looking at ML cases where cloud GPU pricing is wildly expensive: retail on-demand A100 instance pricing is at least $3/hr, which practically speaking, with the AWS pricing model, can be twice that all-in. This is for an instance with 32GB of RAM and 8 vCPUs, which for a lot of A100 use cases is useless. Need 32 vCPU and 256 GB of RAM? That's more like $20/hr.

A single A100 machine that's far more capable can be had from Dell for roughly $50k, which, even factoring in hosting based on colo pricing I've seen, has an ROI of ~15 months for constant usage. Compared with the equivalent cloud configuration (32 vCPU and 256GB of RAM, where the owned box still has vastly better performance), that ROI gets to less than six months.

Yes, the A100 is typically used for training (and cloud definitely still makes sense there), but more and more models require the performance and VRAM of a V100/H100 for inference (24/365 availability). Do it at any kind of scale/redundancy and the ROI catches up even faster. The cloud equivalent of this approach is reserved pricing, but over the 1-3yr term of a reserved instance vs. a hardware lease, self-hosting becomes almost comically more cost and performance effective. With the extra benefit of actually being more flexible.
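Back-of-the-envelope version of that break-even math, using the figures above; the colo cost is my own assumption, and the cloud rates are the ~$6/hr "all-in" and $20/hr numbers quoted earlier:

  hardware_cost = 50_000      # Dell A100 machine, figure quoted above
  colo_monthly = 500          # assumed colo + power cost (not from the comment)
  hours_per_month = 730

  for label, hourly in [("~$6/hr all-in A100", 6), ("32 vCPU / 256 GB tier", 20)]:
      cloud_monthly = hourly * hours_per_month
      months = hardware_cost / (cloud_monthly - colo_monthly)
      print(f"{label}: cloud ~${cloud_monthly:,.0f}/mo, break-even ~{months:.1f} months")
  # -> ~12.9 and ~3.5 months, in the same ballpark as the ~15-month and
  #    sub-six-month estimates above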

Financing and leasing are readily available, and with various tax incentives (like Section 179 leasing) you can pretty quickly pay for an FTE to manage the infra for you - which is probably a wash anyway, because at any kind of "real" scale or complexity you almost certainly already have dedicated human resources just to manage cloud. You don't ever need an employee to go to the hosting facility, because most will rack and provision your hardware for free. Combined with remote hands and standard warranty support, any (in my experience very rare) hardware failures just get handled.

I should note that this model almost eliminates the tendency for cloud spend to balloon to many times the anticipated/budgeted spend - the all too common story of "sticker shock" on bandwidth alone, which the clouds mark up ridiculously. Colo pricing and leases are fixed cost (with all-you-can-eat port-speed bandwidth included, or so cheap at 95th-percentile billing it's practically a rounding error).

I have significant experience at CTO level with both approaches (and hybrid, of course). In many situations the benefits of "self-hosting" vs cloud are dramatic.

The extremely effective marketing that has created and perpetuated an industry wide fear of self-hosting and hardware (especially with the "always cloud always" generation) is fading. I think the uptime and reliability promises of cloud are also fading - this thread started off with discussion of yet-another cloud outage. My background is in healthcare and telecom and I'm shocked at the cavalier attitude of just accepting these outages and being down while standing around helpless wondering when your big cloud will acknowledge, communicate, and resolve them. A few machines in a few rack units of space across a couple of facilities generally trounces cloud in reliability and uptime.

I love HN and the overall knowledge and quality of discussion here but when it comes to hardware and self-hosting many have completely drunk the cloud Kool-Aid and have zero experience with self-hosting - so no idea what they're talking about. Not saying you personally, just generally.


In isolated cases companies can reduce costs by self hosting. Usually this is a combination of very specialized requirements or shockingly technically competent early employees or founders.

However even in these exceptional cases there are hidden costs that will likely arise.

For most companies the very concept of self hosting is comical. This is a one way train.


By its very nature HN is pretty startup focused (you're talking about founders). When I say "greenfield" I'm mostly talking about a startup (and beyond) that's survived the first ~year of chaos/infant mortality - which should be well before the significant technical debt of the solution and the Hotel California nature of cloud take hold.

For an established real business there's almost no question.

Do you have any experience with self-hosting, and examples of hidden costs? In my experience with both, plus hybrid approaches, cloud has substantially more dramatic hidden costs. From a cost and pricing perspective cloud has many more foot-guns - it's routine at this point for people to report blowing past their monthly budgets and billing alerts literally overnight.


The parent comment is neither factual nor advisable.

You build greenfield in cloud precisely because it is greenfield and the utilization isn't well understood. Cloud options let you adjust and experiment quickly. Once a workload is well understood it's a good candidate for optimization, including a move to self managed hardware / on prem.

Buying hardware is a great option once you actually understand the utilization of your product. Just make sure you also have competent operators.


AWS is a 75 bln a year business still growing 20%+ YoY. It’ll break 100 bln this year. I would examine the numbers yourself.


I have - and the numbers show that much of the big cloud growth is in AI services. The "we need to throw in AI somewhere" concurrent trend is heavily bolstering what would otherwise be much more drastic retractions in growth.

I would argue that as the AI trend (eventually) wanes and many AI startups and projects within existing companies inevitably fail to materialize, the much longer and more general trend of migration out of $BIGCLOUD will become more drastic and obvious.

I don't buy individual stocks but I would happily bet a dinner on big cloud growth showing substantial reductions/losses in coming years as the overall situation stabilizes.


> I have - and the numbers show that much of the big cloud growth is in AI services. The "we need to throw in AI somewhere" concurrent trend is heavily bolstering what would otherwise be much more drastic retractions in growth.

Can you share where you got this? Which numbers? I didn't think AWS (or any cloud provider) released details of their operation at that level of granularity.


While the clouds don't break out revenue numbers at that level of granularity, the communication from many of the big cloud providers (to the investment/financial communities at least) acknowledges lower than expected revenue growth overall while pointing to the explosion in "AI" (ML) as a rapidly growing area that will (hopefully, to them) turn that ship around - with already realized promise. I tend to agree with this (obviously).

As one example, the headline here is actually "Microsoft points to AI to drive the next wave of cloud, revenues"[0]. While not hard revenue breakout numbers Microsoft is investing heavily in OpenAI and rapidly (already) integrating it everywhere they can - search, Azure, etc. The investment, utilization, and published pricing are the numbers I'm talking about here.

The real hard numbers and position that support my thesis (unsurprisingly) come from Nvidia[1]. Since at least May 2022 Nvidia has been touting revenues attributed to big cloud demand. Interestingly, Nvidia is rapidly moving to disintermediate big cloud by launching their own "AI cloud"[2]. It's going to be interesting to see how that shakes out...

Google has also been caught somewhat flat-footed and is simultaneously investing even more in "AI" with what will almost certainly be significant revenue opportunity with GCP. I don't follow AWS as closely but I don't see how they could be excluded from this trend - other than having first-mover advantage in cloud services with many more (at this point legacy) customers forever trapped in AWS.

I tend not to try to prognosticate on things like this but it's one of the rare instances I'm very confident in my thesis here. Obviously I'm just some random HN guy but like I said I'd make a friendly bet here.

People need to remember AWS is over 20 years old. There's not a good historical track record of any computing platform/architecture/approach maintaining a stranglehold much longer than that (except maybe Windows, which I'm not sure is comparable).

[0] - https://archive.is/tKLOz

[1] - https://www.reuters.com/technology/nvidia-forecasts-fourth-q...

[2] - https://www.crn.com/news/components-peripherals/nvidia-tease...


> IMHO those building greenfield solution today should take a hard look at whether the default approach from the last ~10 years “of course you build in $BIGCLOUD” makes sense for the application - in many cases it does not.

When one buys a house, they should take a hard look at whether the default approach of paying for utilities makes sense, versus generating their own power.

While that's a bit snarky, the reasoning is similar. You can:

* Use "bigcloud"(TM) with the whole kit: VMs, their managed services, etc * Use bigcloud, but just VM or storage * Rent VMs from a smaller provider * Rent actual servers * Buy your servers and ship to a colo * Buy your servers and build a datacenter

Every level you drop, you need more work. And it grows (I suspect not linearly). Sure, if you have all the required experts (or you rent them) you can do everything yourself. If not, you'll have to defer to vendors. You will pay some premium for this, but it's either that or payroll.

What also needs to be factored in is how static your system is. If a single machine works for your use-case, great.

One of the systems I manage has hundreds of millions of dollars in contracts on the line, thousands of VMs. I do not care if any single VM goes down; the system will kill it and provision a new one. A big cloud provider availability zone often spans across multiple datacenters too, each datacenter with their own redundancies. Even if an entire AZ goes down, we can survive on the other two (with possibly some temporary degradation for a few minutes). If the whole region goes down, we fallback to another. We certainly don't have the time to discuss individual servers or rack and stack anything.

It does not come cheap. AWS specifically has egregious networking fees and you end up paying multiple times (AZ to AZ traffic, NAT gateways, and a myriad services that also charge by GB, like GuardDuty). It adds up if you are not careful.

From time to time, management comes with the idea of migrating to 'on-prem', because that's reportedly cheaper. Sure, ignoring the hundreds of engineers that will be involved in this migration, and also ignoring all the engineers that will be required to maintain this on-premises, it might be cheaper.

But that's also ignoring the main reason why cloud deployments tend to become so expensive: they are easy. Confronted with the option of spinning up more machines versus possibly missing a deadline, middle managers will ask for more resources. Maybe it's "just" 1k a month extra (those developers would cost more!). It gets approved. 50 other groups are doing the same. Now it's 50k. Rinse, repeat. If more emphasis were placed on optimization, most cloud deployments could be shrunk spectacularly. The microservices fad doesn't help (your architecture might require it, but often the reason it does is that you want to ship your org chart, not anything technical).


> When one buys a house, they should take a hard loo at whether the default approach of paying for utilities makes sense, versus generating their own power.

Yes, people do. They install solar panels and use them to generate at least some of their own power. Near future battery tech might allow them to generate all of it if they get enough sunlight, in which case this will become a genuine question to answer: how much to install and maintain the panels and batteries over their lifetime, vs expected cost of purchasing power from utilities.

In a similar manner, cloud vs self hosting is a valid consideration that changes over time. We now have docker and similar tools which make managing your own infrastructure much easier than it was ten years ago. I fully expect even better tools will come out in the future so this consideration does change over time. Maybe in another ten years there'll be almost no benefit to using the cloud (except maybe as a CDN).


> In a similar manner, cloud vs self hosting is a valid consideration that changes over time. We now have docker and similar tools which make managing your own infrastructure much easier than it was ten years ago. I fully expect even better tools will come out in the future so this consideration does change over time.

Excellent point. AWS is 21 years old. Docker (essentially the foundation for most self-hosting these days) is 10 years old. I think we're going to see many more self-hosted K8s control planes (as one example). This isn't considering even more modern tools built on these fundamental components that make self-hosting even easier.


All in podcast mentioned growth slowing, but not revenue dropping.


Revenue drop? Google Cloud is still growing 30-40% year on year.


AWS also had 20% revenue growth last quarter.


Yeah. Google Cloud was the only one I knew off the top of my head. I'm sure other clouds (not just MSFT) are growing too. Maybe the second derivative is going down, but it's absolutely not even close to a drop.

Public cloud will grow a lot more. I'd expect a slowdown when they're all ~10x what they are now.


Ouch, some pain at Google today then. I hate to wake up on a Monday morning to this.

<3 To the engineers trying to fix it at the moment.


Google has follow-the-sun on-call rotations for its larger teams, so this hit the UK team just after lunch.


Ah so the rotation rotates to match the current rotation. Very smart.


I like the mental image of this being a very precise matching -- as the sun traces across the sky, the responsibility of on-call passes from desk to desk, town to town, country to country; two engineers on a boat in the Atlantic race to keep up with their rotation...


The logical conclusion is SREpiercer


The sun never sets on the Google empire


This is why any criticism of AWS reliability is meaningless to me. All the cloud providers go down - all of them. Either you are multi-cloud, or you run your own hardware, but these events are inevitable.


The amount of time you are down vs. up dictates your SLOs and SLAs. Criticism of how reliable one provider is vs. another is not only valid, it's backed by hundreds of millions of contractual dollars and credits every year. We spend tens of millions on AWS per year. We have several SLAs with them. Our ElastiCache SLA was breached once (localized to us, not the whole customer base) and we got credits commensurate with the amount of business we lost during that downtime period.

If one provider is down more than the others, the criticism is not only valid, it results in real business loss for the provider and its customers.

On multi-cloud: it's one way to reduce the amount of downtime you have, but it comes with a significant operational cost depending on how your application is architected and how the teams internal to your company are formed. It is totally practical for someone to bank on AWS' reliability until they're at a significant amount of traction or revenue where the added uptime of going multi-cloud is worth the investment. I know you're not saying this isn't the case (I think you're saying "do that if you're going to complain about one provider's uptime"), but I thought it was worth putting the context into the HN ether.


You definitely need to look at your SLA with your customers, but in my experience, multi-cloud isn't worth it. It's easier to be slightly less reliable, and throw your top-three cloud provider under the bus in the public post mortem. You'll probably cause bigger outages on your own in between provider outages, and multi-cloud adds another layer of complexity for things to go wrong.

Multi-cloud is saying you think you can manage Kafka across two or three clouds better than GCP can manage Pub/Sub.


> This is why any criticism of AWS reliability is meaningless to me.

Er, we absolutely can and should compare rates of problems and overall reliability.


> Either you are multi-cloud, or you run your own hardware

If you run your own hardware these events are inevitable too.


I've seen skepticism about GCP and AWS availability from people with a single 2U in a closet somewhere.

I know it's just a psychological thing about giving up "control", but I have to stifle a chuckle every time.


One aspect of that is the box in the closet is (in my experience anyway) either up or down. It fails more often, but it fails simpler.

In the cloud, even very small scale apps can run into weird situations like the app server is up, the database is down, and the cache is responding about 50% of the time.

If you don't account for that from the beginning, it can lead to your app displaying some bizarre stuff to users.

I haven't run a server locally in 13 years but I can see why some people would miss it.


I've worked in companies that had everything on prem and cloud companies. There are many nice things about cloud, but reliability is not one of them. Everything is a lot simpler on prem and fails a lot less in my experience. The downside being that scaling is harder. And it can be more expensive, depending on your size.


Right? I can pay extra to have two ISPs for upstream connection, but I have no idea how I'd get a second, totally redundant power connection to the closet in my basement. A UPS battery is only going to last so long, and so is generator fuel.


Inevitable != immune to criticism


> This is why any criticism of AWS reliability is meaningless to me.

Is anyone tracking reliability for these public providers? Would be curious how AWS compares to Azure and GCP. My experience is it's better, but we may have avoided Kinesis or whatever that keeps going down.


There's Cloudharmony, https://cloudharmony.com/status


> you run your own hardware

in multiple datacenters?


05:41 - 06:26 PT, 45 min total.

Not great, not terrible.


Yep. Of course there's no detail yet so we don't know what exactly was affected. All we can see is "Multiple services are being impacted globally" and a list of services (Build, Firestore, Container Registry, BigQuery, Bigtable, Networking, Pub/Sub, Storage, Compute Engine, Identity and Access Management) but there's no indication of what specifically was impacted. Could you still see status for your VMs, but not launch new ones? Was it mostly affecting only a couple regions? No idea. All we know is they're now below four nines in February for a handful of critical services.

Let's take a gander at incident history: https://status.cloud.google.com/summary

Cloud Build looks bad... three multi-hour incidents this year, four in fall/winter last year.

Cloud Developer Tools have had four multi-hour incidents this year, many last fall/winter.

Cloud Firestore looks abysmal... Six multi-hour incidents this year, one of them 23 hours.

Cloud App Engine had three multi-hour incidents this year, many in fall/winter last year.

BigQuery had three multi-hour incidents this year, many in fall/winter last year.

Cloud Console had five multi-hour incidents this year, many in fall/winter last year. (And from my personal experience, their console blows pretty much all the time)

Cloud Networking has had nine incidents this year, one of them was eight days long. What the fuck.

Compute Engine has had five multi-hour incidents this year, many last fall/winter.

GKE had 3 incidents this year, multiple the past winter.

Can somebody do a comparison to AWS? This seems shitty but maybe it's par for the course?


Ex-GCP here.

This is a pretty reductionist summary, e.g. the 8-day Cloud Networking incident root cause:

> Description: Our engineering team continues to investigate this issue and is evaluating additional improvement opportunities to identify effective rerouting of traffic. They have narrowed down the issue to one regional telecom service provider and reported this to them for further investigation. The connectivity problems are still mostly resolved at this point although some customers may observe delayed round trip time or longer latency or sporadic packet loss until fully resolved.

Still a big problem product-wise, but you're looking at a global incident history view without any region/severity filters.

The corresponding AWS service health dashboard makes it much harder to view this level of detail, but is also actually useful for someone asking "is product $xyz which I depend on in region $abc currently down or not"


It's weird, I did a cursory search and can't find people complaining about that 8 day long networking issue. I wonder if the latency was just barely out of SLO so people didn't notice? Or since it was a telecom problem, maybe it was part of one of the recent undersea cable outages so people weren't surprised enough to remark on it? Or maybe I'm just not searching well.

(full disclosure, work at Google but not on cloud stuff)


Outages at the hyperscalers can have a huge blast radius. Is anyone encountering other services with outages because they're built on GCP?


We are in us-central1 and didn't have an outage, so it appears not to have affected everyone.


I fantasize that it's a three-letter agency with a warrant making them either start shitting their chat logs or pulling drives and recovering them the hard way.

https://arstechnica.com/tech-policy/2023/02/us-says-google-r...


Sounds painful.


mail.google.com showed error messages for me intermittently during the past hour.


https://www.google.com/appsstatus/dashboard/incidents/5ML14k...

They claim the Gmail specific issues are resolved. We shall see...

Feb 27, 2023 2:03 PM UTC: We experienced a brief network outage with packet loss, impacting a number of Workspace services. The impact is over. We are investigating and monitoring.



Is it likely this outage still would have occurred even without their 12,000 layoffs in January?


Certainly a problem with a BGP misconfiguration. :)


Nope. A BGP misconfiguration would manifest in more broad/different ways.


Our workloads are fully functional, DK/EU


[flagged]


Solving this sort of thing is not about throwing more people at it. That would be brute force and not strategic. Instead, you want to architect systems like these in a way that strikes a good balance between resilience and things like cost/efficiency/etc.


Sure, but laying off a number of good SREs from the Internet traffic team that is responsible for the Google Load Balancer can't be helping this situation.


How many such people were laid off and what were their unique expertise?



