I've certainly deliberately downed an enormous number of tasks, though, as part of a cluster turn-down. I love the technique of requiring the operator to echo a key fact, but in the case you're describing I think the key fact is not how many tasks but that they're serving live traffic. So:
* You could ask the operator to echo the qps figure...but really any number other than zero is likely to be an error, so it can just error out in that case without needing the confirmation.
* Even if it is serving zero qps now, if it's not explicitly drained at the load balancer, downing it is likely to be a mistake. So even better to check that.
Only once in my career have I taken down jobs serving live traffic. (They were serving 100% errors.) It was deliberate, but even so I wouldn't have minded having to supply a --yes-i-know-im-downing-live-jobs.
edit: and if for some reason my assumption is wrong and downing undrained things becomes routine...well, you'd want to fix that, but as a short-term measure going back to confirming a number rather than the force option would be appropriate. It's certainly not good to have an override that's routinely used.
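To make that concrete, here's a minimal sketch of the kind of guard I have in mind. Everything in it is hypothetical (the helper functions, the flag name, the down_job call); it's not any real borg/borgcfg interface:

```python
import argparse
import sys


def get_serving_qps(job: str) -> float:
    """Hypothetical: ask monitoring for the job's current serving qps."""
    raise NotImplementedError


def is_drained_at_lb(job: str) -> bool:
    """Hypothetical: ask the load balancer whether the job is fully drained."""
    raise NotImplementedError


def down_job(job: str) -> None:
    """Hypothetical destructive call that actually takes the job down."""
    raise NotImplementedError


def main() -> int:
    parser = argparse.ArgumentParser(description="Turn down a job, with safety checks.")
    parser.add_argument("job")
    parser.add_argument("--yes-i-know-im-downing-live-jobs", dest="force",
                        action="store_true",
                        help="Override the live-traffic/drain checks.")
    args = parser.parse_args()

    qps = get_serving_qps(args.job)
    drained = is_drained_at_lb(args.job)

    # Any nonzero qps, or an undrained backend, is almost certainly a mistake:
    # refuse outright instead of asking the operator to echo a number back.
    if (qps > 0 or not drained) and not args.force:
        print(f"refusing to down {args.job}: qps={qps}, drained={drained}",
              file=sys.stderr)
        return 1

    down_job(args.job)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```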
The way we approached this on my SRE team was semi-manual with improved ergonomics. We embedded the live traffic graph in the turndown tool, so it would be right in your face before you took the destructive action. Of course it was always possible to go one level down on the tooling and do everything manually, but it wasn't the usual way.
Seems reasonable, but as you might have seen, rossjudson did accidentally-ish go to a lower layer: he wrote "never 'borg' when you meant to 'borgcfg'". And you're still relying on someone actually looking at the graph in their face, which isn't as sure a thing as it'd be if they had to echo something back, as Rachel is advocating.
(For the benefit of non-Googlers/Xooglers: borg is a lower-level tool mostly used when everything else has gone wrong and borgcfg is a higher-level, more routine tool. These days people often layer things on top of that as well, because we love piling up abstraction layers. This approach is completely successful because abstraction layers never leak and solve every problem without making anything hard to debug at all. /s)
In my ideal world, even the lowest layer a human ever uses would do safety checks by default. E.g., imagine if the job specification included "query this safety check service on change" and the borg tool (as part of querying the existing job on a cancel/rm command) discovered that and honored it. Most people/jobs would use a safety check that refuses to take down a job unless the load balancer reports that all relevant services have that job drained. The safety check service could also specify a confirmation prompt (similar to what Rachel is advocating) that could be customizable (like qps or percent of global capacity rather than just number of tasks). The safety check would be effective no matter what layer you use, and there'd be no good reason to use one that would cause prompt fatigue. The outage rossjudson described (and I know he's not the only one who has done exactly this!) would have been avoided.
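To illustrate the shape of that, here's a rough sketch; the job-spec field name, the check service API, and the prompt flow are all invented for the example rather than anything borg actually does:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    allow: bool
    reason: str
    # Optional prompt, e.g. "echo this job's share of global capacity to proceed".
    confirmation_prompt: Optional[str] = None


class SafetyCheckClient:
    """Hypothetical client for the safety check service named in the job spec."""

    def __init__(self, address: str):
        self.address = address

    def check_removal(self, job_name: str) -> CheckResult:
        # In the imagined design, the service consults the load balancer and only
        # allows removal once every relevant service reports this job drained.
        raise NotImplementedError


def actually_remove(job_name: str) -> None:
    """Hypothetical destructive call at the lowest layer."""
    raise NotImplementedError


def remove_job(job_spec: dict, confirm: Callable[[str], bool]) -> None:
    """Even the lowest-layer removal path honors the job's declared safety check."""
    check_addr = job_spec.get("safety_check_service")
    if check_addr:
        result = SafetyCheckClient(check_addr).check_removal(job_spec["name"])
        if not result.allow:
            raise RuntimeError(f"safety check refused removal: {result.reason}")
        if result.confirmation_prompt and not confirm(result.confirmation_prompt):
            raise RuntimeError("operator did not confirm removal")
    actually_remove(job_spec["name"])
```

The point of the sketch is that the check travels with the job itself, so it fires whether you go through borgcfg, borg, or whatever is layered on top that week.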
I really agree with your philosophy here, but I've never been able to perfect it in practice. The imperfection comes from the fact that there is inevitably some mapping of things to other things by name. I can ask a load balancer whether clients of a service are being sent to a named capacity or not (i.e. is the thing I want to remove "drained"), but that doesn't rule out the possibility that another service maps a different name to the same backend and I forgot to integrate that name with my automation. It's also impossible to rule out that a client exists which bypasses or ignores the advice of the load balancer. Having visibility into caller identity helps a lot with this kind of problem, but outside of Google there is a scary word called "cardinality" which prevents people from monitoring the whole caller×server space.
I agree you can never reach perfection. I expect there'd still be postmortems with "Our safety check was missing/bad" in the "what went wrong" section for various project-specific technical reasons. But I'd expect there to be (a) fewer such postmortems, and (b) an action item to fix the job's safety check service specification and audit the team's other ones, rather than the (IMHO inexcusable) "this tool doesn't support those, /shruggie, maybe schedule more training about which tool to use".