
More general questions I would consider asking:

1. It appears there was a safe path with more scrutiny and a fast path with less, and over time the fast path became routine. Are there other places where this pattern could develop or has already developed? Is the tradeoff between speed and scrutiny actually necessary? (i.e., could urgent updates reach production faster but still receive more scrutiny and testing, even if that happens after the fact?)

2. In a similar vein, if the system has a failsafe configuration (e.g. only changes that have passed the full barnyard, or configurations that have been running safely for some minimum amount of time), would it be plausible to automatically roll servers back to that configuration if they remain unresponsive for too long? (A rough sketch of what I mean follows this list.)

3. It seems as though there are multiple points (big WAF refactor, credential expiry, internal services dependent on working prod) where a sufficiently cynical engineer would say "I bet there's something here that could, if not bring down the site, at least ruin someone's day". Is there a suitable voice for this kind of cynicism? Eg, a red team or similar? If you were Murphy's Law incarnate, messing with Cloudflare's systems to achieve maximum mischief, where would you start?

4. I get the sense that there are many reliable and well-tested layers of safety, but is it common to test what happens if they fail anyway? Eg: let's pretend Cloudflare just got knocked out globally by a wizard spell, what do we do? Or let's say our staged rollout system gets completely bypassed because of solar flares, how bad is it? Beyond developing a procedure or training for these kinds of situations, are they actively simulated or practiced?
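
To make question 2 concrete, here's a rough sketch of the kind of node-local watchdog I have in mind. Everything in it is made up for illustration (the health endpoint, the config paths, and the systemd unit name are hypothetical, not anything Cloudflare actually runs); the logic is just "if this box has been unhealthy past some threshold, fall back on its own to the last configuration that passed full testing":

    // failsafe_watchdog.go -- illustrative only; the paths, endpoint, and
    // service name are hypothetical, not real Cloudflare tooling.
    package main

    import (
        "log"
        "net/http"
        "os"
        "os/exec"
        "time"
    )

    const (
        healthURL    = "http://127.0.0.1:8080/health"    // hypothetical local health check
        checkEvery   = 15 * time.Second                  // how often we probe it
        unhealthyFor = 5 * time.Minute                   // how long we tolerate failure
        goodConfig   = "/etc/proxy/last-known-good.conf" // config that passed full testing
        liveConfig   = "/etc/proxy/current.conf"         // config currently in use
    )

    func healthy() bool {
        client := &http.Client{Timeout: 3 * time.Second}
        resp, err := client.Get(healthURL)
        if err != nil {
            return false
        }
        defer resp.Body.Close()
        return resp.StatusCode == http.StatusOK
    }

    // rollback copies the last-known-good config over the live one and
    // restarts the (hypothetical) proxy service.
    func rollback() {
        data, err := os.ReadFile(goodConfig)
        if err != nil {
            log.Printf("cannot read failsafe config: %v", err)
            return
        }
        if err := os.WriteFile(liveConfig, data, 0o644); err != nil {
            log.Printf("cannot write live config: %v", err)
            return
        }
        if err := exec.Command("systemctl", "restart", "proxy.service").Run(); err != nil {
            log.Printf("restart failed: %v", err)
            return
        }
        log.Print("rolled back to last-known-good configuration")
    }

    func main() {
        var firstFailure time.Time
        for range time.Tick(checkEvery) {
            if healthy() {
                firstFailure = time.Time{} // recovered, reset the clock
                continue
            }
            if firstFailure.IsZero() {
                firstFailure = time.Now()
            }
            if time.Since(firstFailure) >= unhealthyFor {
                rollback()
                firstFailure = time.Time{}
            }
        }
    }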

If anything, I'd guess the root root cause here is a success failure, where the system has been so reliable for so long that the main reactions to it failing are disbelief and unpreparedness. I'm sure it wasn't funny at the time, but it gives me a chuckle to imagine the SREs speculating about Mossad quantum-tunnelling 0days or something because the idea of everything falling over on its own is so unthinkable. Meanwhile, those of us without so many 9s would jump straight to "I probably broke it again."




Overall I think these are very thoughtful, and I upvoted your comment.

However, I don't think this question is very fruitful:

> let's pretend Cloudflare just got knocked out globally by a wizard spell, what do we do?

The way you solve a production issue is you identify its cause and then contain, mitigate, or fix it. I don't think you'd learn anything useful from a drill where there's no specific cause.

Perhaps along the lines of what you're thinking of: something I could see being useful is to look at components you've already thought to implement a 'global kill' for, like WAF. You could run drills where every machine running WAF starts blackholing packets, maxing out RAM, or (as happened here) maxing out CPU: exactly the situations where you'd want to execute the 'global kill' in the first place. That way, you can make sure the 'global kill' switches are actually useful in practice. Something like that seems more grounded to me: it assumes something specific is going wrong rather than just "magic", while still avoiding too-specific assumptions about what can and can't go wrong.
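
To be concrete about the injection half of that (purely as an illustration, not a claim about any real Cloudflare tooling): even something as dumb as a bounded CPU burner would reproduce the runaway-CPU failure mode from this incident, and the built-in deadline means the drill cleans up after itself even if nobody manages to flip the kill switch in time:

    // cpu_burn_drill.go -- hypothetical game-day fault injector.
    // Saturates every core for a bounded window, then stops on its own, so
    // the drill is safe even if the 'global kill' never gets pulled.
    package main

    import (
        "context"
        "log"
        "runtime"
        "sync"
        "time"
    )

    func main() {
        const drillWindow = 10 * time.Minute // hard upper bound on the injected fault

        ctx, cancel := context.WithTimeout(context.Background(), drillWindow)
        defer cancel()

        cores := runtime.NumCPU()
        log.Printf("drill: maxing out %d cores for up to %s", cores, drillWindow)

        var wg sync.WaitGroup
        for i := 0; i < cores; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for {
                    select {
                    case <-ctx.Done():
                        return // deadline hit or drill cancelled
                    default:
                        // busy-loop: mimic the runaway-CPU failure the
                        // 'global kill' is supposed to contain
                    }
                }
            }()
        }
        wg.Wait()
        log.Print("drill: fault window elapsed, CPU released")
    }

The interesting part of the drill isn't the burner itself; it's measuring how long it takes a human (or an automated failsafe) to notice and actually execute the kill.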



