It reads to me like the bug required a particular feature to be enabled, and the feature wasn't enabled while the software was rolling out. Then the feature was globally knife-switched to the "on" position without an orderly staged rollout.
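For anyone unsure what "orderly staged rollout" would mean for a feature flag, here is a minimal sketch. This is not how GFEs actually do it; the stage fractions, the `gfe-N` server names, and the function names are all made up for illustration. The idea is just that each server gets a stable bucket, and the flag is turned on for a growing fraction of buckets rather than everywhere at once.

```python
import hashlib

# Hypothetical schedule: fraction of servers with the feature on at each stage.
STAGES = [0.01, 0.10, 0.50, 1.00]


def bucket(server_id: str, feature: str) -> float:
    """Map a (feature, server) pair to a stable value in [0, 1)."""
    digest = hashlib.sha256(f"{feature}:{server_id}".encode()).digest()
    # The first 4 bytes give an integer in [0, 2**32), so the result is always < 1.
    return int.from_bytes(digest[:4], "big") / 2**32


def is_enabled(server_id: str, feature: str, fraction: float) -> bool:
    """True if the feature is on for this server at the given rollout fraction."""
    return bucket(server_id, feature) < fraction


if __name__ == "__main__":
    servers = [f"gfe-{i}" for i in range(1000)]  # hypothetical server names
    for fraction in STAGES:
        on = sum(is_enabled(s, "new_feature", fraction) for s in servers)
        print(f"stage {fraction:.0%}: {on} of {len(servers)} servers enabled")
```

Because the bucket is deterministic, a server that is enabled at one stage stays enabled at every later stage, so you can pause between stages and watch for trouble before widening the blast radius. The knife switch is the degenerate case of jumping straight to a fraction of 1.0.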
I actually think it looks pretty bad that one of the action items in this report is to make a feature dashboard for GFEs. They've been saying they will do that for years, and the team that operates GFEs is considered the most elite of all SRE teams. Most famous outages of Google products have been caused by bogus configurations pushed to GFEs or network devices in front of them.
It's sort of like that extra disk in your RAID 1 setup at home.
Compared to a single SSD, the performance improvement doesn't really show (for desktop loads)...
However, when things come to a rare (but somewhat inevitable) screeching halt and one of the copies in that mirror shatters beyond recognition... that's when the doubled price proves it was worth it.
Such a dashboard would inevitably also add load and complexity (both failure points) to the system, but outwardly most users would be unaware of its existence.