In another comment, I pointed out a mistake of mine that was a major factor in an outage.
I also screw up all the time in ways that would cause outages, except we have automated tests, tsan/asan, code reviews, a staging environment, various safety checks, experiment gates, pre-mortems, slow rollout procedures, an alert on-duty SWE and on-call SRE, etc.
Today one of my mistakes was caught early in the prod phase of our push. That's much later than I would like, but still before it did any real damage. I submitted the bad code last Wednesday and have been out sick with the flu (and caring for my preschool-aged kids) since then, so my awesome team handled my problem for me.