
More general questions I would consider asking:

1. It appears there was a safe path with more scrutiny and a fast path with less, and over time the fast path became routine. Are there other places where this pattern could develop or has already developed? Is the tradeoff between speed and scrutiny actually necessary? (i.e., could urgent updates reach production faster but still receive more scrutiny and testing, even if that happens after the fact?)

2. In a similar vein, if the system has a failsafe configuration (e.g. only changes that have passed the full barnyard, or configurations that have been running safely for some minimum amount of time), would it be plausible to automatically roll servers back to that configuration if they remain unresponsive for too long? (A rough sketch of what I mean follows this list.)

3. It seems as though there are multiple points (big WAF refactor, credential expiry, internal services dependent on working prod) where a sufficiently cynical engineer would say "I bet there's something here that could, if not bring down the site, at least ruin someone's day". Is there a suitable voice for this kind of cynicism? Eg, a red team or similar? If you were Murphy's Law incarnate, messing with Cloudflare's systems to achieve maximum mischief, where would you start?

4. I get the sense that there are many reliable and well-tested layers of safety, but is it common to test what happens if they fail anyway? Eg: let's pretend Cloudflare just got knocked out globally by a wizard spell, what do we do? Or let's say our staged rollout system gets completely bypassed because of solar flares, how bad is it? Beyond developing a procedure or training for these kinds of situations, are they actively simulated or practiced?
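
To make question 2 concrete, here's a rough sketch of the kind of node-local watchdog I have in mind. Everything in it is made up for illustration (the health endpoint, the config paths, and the systemd unit name are hypothetical, not anything Cloudflare actually runs); the logic is just "if this box has been unhealthy past some threshold, fall back on its own to the last configuration that passed full testing":

    // failsafe_watchdog.go -- illustrative only; the paths, endpoint, and
    // service name are hypothetical, not real Cloudflare tooling.
    package main

    import (
        "log"
        "net/http"
        "os"
        "os/exec"
        "time"
    )

    const (
        healthURL    = "http://127.0.0.1:8080/health"    // hypothetical local health check
        checkEvery   = 15 * time.Second                  // how often we probe it
        unhealthyFor = 5 * time.Minute                   // how long we tolerate failure
        goodConfig   = "/etc/proxy/last-known-good.conf" // config that passed full testing
        liveConfig   = "/etc/proxy/current.conf"         // config currently in use
    )

    func healthy() bool {
        client := &http.Client{Timeout: 3 * time.Second}
        resp, err := client.Get(healthURL)
        if err != nil {
            return false
        }
        defer resp.Body.Close()
        return resp.StatusCode == http.StatusOK
    }

    // rollback copies the last-known-good config over the live one and
    // restarts the (hypothetical) proxy service.
    func rollback() {
        data, err := os.ReadFile(goodConfig)
        if err != nil {
            log.Printf("cannot read failsafe config: %v", err)
            return
        }
        if err := os.WriteFile(liveConfig, data, 0o644); err != nil {
            log.Printf("cannot write live config: %v", err)
            return
        }
        if err := exec.Command("systemctl", "restart", "proxy.service").Run(); err != nil {
            log.Printf("restart failed: %v", err)
            return
        }
        log.Print("rolled back to last-known-good configuration")
    }

    func main() {
        var firstFailure time.Time
        for range time.Tick(checkEvery) {
            if healthy() {
                firstFailure = time.Time{} // recovered, reset the clock
                continue
            }
            if firstFailure.IsZero() {
                firstFailure = time.Now()
            }
            if time.Since(firstFailure) >= unhealthyFor {
                rollback()
                firstFailure = time.Time{}
            }
        }
    }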

If anything, I'd guess the root root cause here is a success failure, where the system has been so reliable for so long that the main reactions to it failing are disbelief and unpreparedness. I'm sure it wasn't funny at the time, but it gives me a chuckle to imagine the SREs speculating about Mossad quantum-tunnelling 0days or something because the idea of everything falling over on its own is so unthinkable. Meanwhile, those of us without so many 9s would jump straight to "I probably broke it again."




Overall I think these are very thoughtful, and I upvoted your comment.

However, I don't think this question is very fruitful:

> let's pretend Cloudflare just got knocked out globally by a wizard spell, what do we do?

The way you solve a production issue is you identify its cause and then contain, mitigate, or fix it. I don't think you'd learn anything useful from a drill where there's no specific cause.

Perhaps along the lines of what you're thinking of: something I could see being useful is to look at components you've already thought to implement a 'global kill' for, like WAF. You could run drills where every machine running WAF starts blackholing packets, maxing out RAM, or (as happened here) maxing out CPU: exactly the situations where you'd want to execute the 'global kill' in the first place. That way, you can make sure the 'global kill' switches are actually useful in practice. Something like that seems more grounded to me: it assumes something specific is going wrong rather than just "magic", while still avoiding too-specific assumptions about what can and can't go wrong.
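
To be concrete about the injection half of that (purely as an illustration, not a claim about any real Cloudflare tooling): even something as dumb as a bounded CPU burner would reproduce the runaway-CPU failure mode from this incident, and the built-in deadline means the drill cleans up after itself even if nobody manages to flip the kill switch in time:

    // cpu_burn_drill.go -- hypothetical game-day fault injector.
    // Saturates every core for a bounded window, then stops on its own, so
    // the drill is safe even if the 'global kill' never gets pulled.
    package main

    import (
        "context"
        "log"
        "runtime"
        "sync"
        "time"
    )

    func main() {
        const drillWindow = 10 * time.Minute // hard upper bound on the injected fault

        ctx, cancel := context.WithTimeout(context.Background(), drillWindow)
        defer cancel()

        cores := runtime.NumCPU()
        log.Printf("drill: maxing out %d cores for up to %s", cores, drillWindow)

        var wg sync.WaitGroup
        for i := 0; i < cores; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for {
                    select {
                    case <-ctx.Done():
                        return // deadline hit or drill cancelled
                    default:
                        // busy-loop: mimic the runaway-CPU failure the
                        // 'global kill' is supposed to contain
                    }
                }
            }()
        }
        wg.Wait()
        log.Print("drill: fault window elapsed, CPU released")
    }

The interesting part of the drill isn't the burner itself; it's measuring how long it takes a human (or an automated failsafe) to notice and actually execute the kill.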



