Cloudflare incident on October 30, 2023 (cloudflare.com)
93 points by todsacerdoti 11 months ago | 30 comments



This is an incredibly detailed and honest incident report. Good on Cloudflare for getting things running again so quickly, and identifying so many problem areas to improve.


Heh. They needed Workers KV running to get Workers KV running.


They say they needed to use a manual/break glass process to get things going again. Kinda scary to think Cloudflare could brick itself.


> Kinda scary to think Cloudflare could brick itself.

Never forget that essentially all interesting systems may brick themselves and will at least catastrophically fail every now and then. For all the sales pitches about "the cloud" providing resiliency, it fundamentally centralizes an ungodly share of the internet onto a handful of failure points.

What had been a frothy background noise of independent hosting and data center failures has become a global blackout risk, where worldwide operations break down and failures cascade through interdependent systems.

On a business level, it's hard to care about that, as you're not going to be held to blame if everything else goes into crisis at the same time as you, but on a societal level, it's kinda f---'d.

It's like moving all the agriculture onto the flood plain. You get maximum abundance most of the time, but when it fails in those inevitable floods, you at least better have kept some stockpiles stored uphill somewhere.


They don't say what the manual/break glass process was exactly, but I likened it to something like a `helm rollback` for a Kubernetes Helm release. Manual/break-glass doesn't have to mean a team of engineers logged into VM instances, `vi`-editing config files back to a previous state and `scp`-ing older binaries over; it probably just means there wasn't an automated build/deploy tool process to perform it.
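
For illustration only (the release name, namespace, and revision number below are made up, and we don't actually know what Cloudflare's process looked like), a manual Helm rollback is roughly:

    # list past revisions of the release, then roll back to a known-good one
    helm history api-gateway -n edge
    helm rollback api-gateway 41 -n edge --wait

The "break glass" part is presumably just that a human with elevated access runs something like this by hand instead of going through the normal build/deploy pipeline.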

> Kinda scary to think Cloudflare could brick itself.

I mean... they are providers of software that iterate on their product suite in a SaaS environment. I think this sentiment holds for pretty much any SaaS company. And I guess if everything is scary then nothing is really all that scary. Do you disagree?


I meant "brick" as distinct from "break", in the sense that their apparent circular dependencies could result in an unrecoverable failure scenario. Purely speculation of course, it's probably not a real risk.

That being said, it's a scary thought because they are different from other SaaS in that they run a significant portion of the internet.


There were no apparent circular dependencies. Services that Cloudflare offers depend on their KV store service. A KV store (key-value store) is a type of database. If you have an application that depends on a database, that's a one-way dependency, not a circular one.


I mean, near-brick situations happen with DNS/VM/auth quite often. It's not hard to imagine some dumbass creating a system that stores its own decryption keys inside itself because it "will be fine, the auth systems will never all be down".


You know how cell sites have gas generators and the gas line uses cell to communicate status? I've always wondered about that one.

Plus, would a diesel seller be able to take cash if a data center needed it to start up again? There are plenty more thoughts where that came from.


I don't know about cell towers, but all AT&T central offices have generators with massive fuel tanks that can run even if the gas supply is cut off. I would be surprised if the backup for cell towers didn't have similar systems.


My partner's father works for AT&T and he has related a story where at least one cell tower went offline because the office was ignoring the NO LINE POWER/BACKUP GEN FUEL LOW warning duo[0] and assumed all was good when the warnings went silent, when in fact the site had gone dark.

[0] A very rural installation that often experienced power cuts, so they had become numb to alarm 1. Somewhere in the mystical ticket/service system, the schedule to refuel the tower had gone awry, so it gradually sucked down its fuel over the course of many power interruptions and, voilà.


I worry a lot about those things. I am sending this to my colleagues.

It's too easy to set up new sysop tools on the shiny new cluster/filesystem/cloud that makes everything easy. But how do you fix that shiny yet complex thing if your tools are down?


The Ask thread about it while it was occurring: https://news.ycombinator.com/item?id=38074906


A continuing theme in these broad outages that keep happening is that these companies are dogfooding their own services to provide the same service to their customers.

I'm thinking of the AWS lambda outage taking out the control plane as a recent example.

I'm not sure they're wrong for doing this, and I'm not saying they are; I just keep noticing this same pattern of "one of our products broke, now all of our tooling is failing." This seems... wrong somehow?


So they should use another tool with its own set of failure conditions?


No, but maybe they should consider infrastructure where, if a customer-facing service goes down, their internal control mechanisms don't fail along with it, leaving break-the-glass procedures by on-call engineers that take half an hour to go live as the only backup.

It’s a big deal when CF services go down, this definitely impacted me, and I just expected more redundancy.

It seems like their rollback mechanism failed as well.


But isn’t this just a natural part of how distributed systems work? If one part fails it can have downstream effects that can cross multiple boundaries like end user clients crashing, or another business’s server crashing, or your own server crashing.


I'm not as concerned about a large part of internet traffic going through Cloudflare as many here are, but I think when a service like CF becomes more and more popular, it has to grow, and there is the irresistible impulse to do more complex things. Inevitably this complexity leads to outages/issues.


The honest, detailed, and nerd-centered write-ups from CF are why I keep investing in them and using their products. They have a long history of being extremely transparent. All I see from each "incident" is growth. This one just demonstrated the security practices in place. Imagine if it had been anything other than a 401 they were getting when a request tried to cross the production plane.

I suppose no response would be better (and harder to troubleshoot), but at least this shows they had some forethought.

Also, c'mon 30 minutes is a pretty damn good response time. I can't even get a pizza delivered that fast.


I’d be interested to know more about the “break glass” mechanism. Anyone know of any blog posts from CF on this topic?


Likely involves their most senior devs bypassing normal deployment procedures / access controls: in this case, to be able to edit their live production environment directly instead of relying on Workers KV and other impacted internal services (which a lot of their infrastructure depends on), which would then enable a proper rollback. The aim being to restore service as quickly as possible and minimize the customer impact.


Having worked with some large companies, I'd say this sounds likely. They'll typically have all changes occur in a testing environment and get pushed to prod via automation.


I assume that when they said that, they meant a hacky, manual, config-file-editing means of redirecting traffic.


> make build && ENV=prod make deploy


Our service was badly impacted, almost 100% outage. I'm not happy with Cloudflare declaring it as a minor incident and changing it to major after it was resolved. Why not be truthful and start with the correct status/severity?


You're assuming that the incident's impact is perfectly known from the very start. Suffice to say, that's not the case.


"This was a frustrating incident, made more difficult by Cloudflare’s reliance on our own suite of products."

I'm glad this was the case. We would have seen a different response from Cloudflare if they were not eating their own dogfood.


Cloudflare slowly becomes the blackwall.


> Workers KV is our globally distributed key-value store.

IOW our single point of failure failed


Incident report was solid.



