There's a global status page, and then there's a local update for people with instances on an affected host. Past some threshold of hosts, the probability that some random host has an issue gets pretty high, just because of math. The local status update went out to people with instances on that machine.
Ordinarily, a single-host incident takes a couple minutes to resolve, and, ordinarily, when it's resolved, everything that was running on the host pops right back up. This single-host outage wasn't ordinary. Somehow, a containerd boltdb got corrupted, and it took something like 12 hours for a member of our team (themselves a containerd maintainer) to do some kind of unholy surgery on that database to bring the machine back online.
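For a sense of what that database even is (a minimal sketch, not the actual repair): containerd keeps its metadata in a bbolt file, and you can open it read-only with Go's `go.etcd.io/bbolt` and walk the top-level buckets to see whether the file is readable at all. The path below is containerd's default metadata location; everything else here is illustrative, not what our teammate actually did.

```go
package main

import (
	"fmt"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// containerd keeps its metadata here by default.
	path := "/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db"

	// Open read-only so we can't make a corrupted database any worse.
	db, err := bolt.Open(path, 0o600, &bolt.Options{
		ReadOnly: true,
		Timeout:  time.Second, // don't block forever on a held file lock
	})
	if err != nil {
		log.Fatalf("open %s: %v", path, err)
	}
	defer db.Close()

	// Walk the top-level buckets. If the B+tree pages are damaged,
	// this is often where reads start failing.
	err = db.View(func(tx *bolt.Tx) error {
		return tx.ForEach(func(name []byte, _ *bolt.Bucket) error {
			fmt.Printf("bucket: %s\n", name)
			return nil
		})
	})
	if err != nil {
		log.Fatalf("walk buckets: %v", err)
	}
}
```

bbolt also ships a CLI with an integrity check (`bbolt check <path>`), which does a fuller walk than this sketch. Actually repairing a file that fails those checks, which is what happened here, is a much gnarlier job than either.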
The runbook we have for handling and communicating single-host outages wasn't tuned for this kind of extended outage. It will be now. Probably we'll just paint the global status page when a single-host outage crosses some kind of time threshold.