Don't read into announcements like this too much. Status pages and outage notice...

oooyay · on Sept 15, 2023

I don't know how status pages work at Google, but I do work in reliability engineering and I sometimes make recommendations to update the status pages.

Some context before I go on: reliability is often measured by mapping critical features to services and their degradation. This gets more challenging as a feature starts to map to more than a couple of services and those services begin to have dependencies. When your reliability on average can be measured in its number of nines, as opposed to its significant preceding digits, your signal interpretation game has to step up significantly. These two situations make it infinitely more complex to state whether a given service degradation in a chain of services is truly having external customer impact at a given time. That's why a human needs to make the call to update the status page and why status page availability numbers are different from internal numbers.
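
To make the dependency point concrete, here is a minimal sketch of how availability compounds when a feature needs every service in a chain to be up. It assumes failures are independent, which is a simplification, and the chain and its numbers are made up for illustration.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a feature that needs every service in the chain.

    Assumes each service fails independently of the others, so the
    combined availability is the product of the individual ones.
    """
    return prod(availabilities)


# Hypothetical chain: four services, each looking healthy on its own.
checkout_chain = [0.9999, 0.9999, 0.9995, 0.999]
print(f"feature availability: {chain_availability(checkout_chain):.5f}")
```

The combined figure is lower than that of the weakest service in the chain, which is one reason a per-service number and a customer-facing number can disagree.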

I spend a good portion of nearly every sprint hunting down systemic issues that pop up across the ecosystem of services, from a bird's-eye view. Often, knowing whether external customer impact will be felt for a given series of errors relies heavily on knowing the current configuration of services in a chain, their graceful failure mechanisms, how failure manifests client-side, and whether that failure is critical to an SLA.
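
A rough sketch of that judgment as code, to show why it needs system knowledge rather than raw error counts. The `ServiceState` fields and the example services are hypothetical, and in practice these inputs are rarely known this cleanly.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    name: str
    degraded: bool          # currently reporting errors
    fails_gracefully: bool  # e.g. serves cached data or a default when degraded
    sla_critical: bool      # failure here breaks a feature covered by an SLA


def customer_impacting(chain):
    """Return names of services whose degradation likely reaches customers."""
    return [
        s.name
        for s in chain
        if s.degraded and not s.fails_gracefully and s.sla_critical
    ]


# Hypothetical chain: two services are erroring, only one should matter.
chain = [
    ServiceState("recommendations", degraded=True, fails_gracefully=True, sla_critical=False),
    ServiceState("payments", degraded=True, fails_gracefully=False, sla_critical=True),
    ServiceState("catalog", degraded=False, fails_gracefully=False, sla_critical=True),
]
print(customer_impacting(chain))
```

Both degraded services would show up as errors on a dashboard, but only the one without a graceful fallback on an SLA-covered path is a candidate for the status page.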

I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.

re-thc · on Sept 15, 2023

> I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.

The status page is tied to public SLAs = impact on $$$. Internally you can track anything. What's public is the problem.

oooyay · on Sept 15, 2023

No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.

The status page's purpose is generally to head off a flood of customer-reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.

re-thc · on Sept 15, 2023

> No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.

And how can I as a customer calculate this? We're not going to sue each time there's a breach of SLA to get the real data. Whatever the status page says will trigger customers to decide if they should claim SLA credits. A lower number (from a delayed update of the status page) will skip payouts or reduce them.

> The status page's purpose is generally to head off a flood of customer-reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.

That's what you assume and that's what it's supposed to be. It's long been abused otherwise. Amazon, for example, will require explicit approval to update the page. They and others have famously delayed updating the status page as late as they can get away with, often attempting to not even call it an outage. It will say something like "increased error rates".

oooyay · on Sept 15, 2023

Five nines of availability works out to about 5 minutes of downtime per year; you can calculate up and down from there. If you don't want to do the conversion from percentage to minutes, there are lots of calculators like this one: https://uptime.is/five-nines
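
The conversion is simple enough to do yourself. A minimal sketch, assuming a 365-day year (no leap days) and an availability figure expressed as a percentage; the function name is just illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime allowed within a period at the given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.2f} min/year")
```

Five nines comes out to roughly 5.26 minutes per year. Pass a different `period_minutes` (for example `30 * 24 * 60`) if your contract measures availability per month rather than per year.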

I wasn't assuming what status pages are used for, I was speaking to my experience working in reliability engineering. I can't speak to Amazon's practices as I've not worked there, but when I've seen this happen it's because we struggled to identify customer impact. The systems you're talking about are vast, and a single application, or even a subset of applications, reporting errors doesn't mean there's going to be customer impact. That's why I mentioned it usually takes a human who knows that system and its upstreams to know if there'll be customer impact from a particular error.

I'd encourage you to read the wording of an SLA in a contract. They're often very specific in terms of time and the features they cover. "Increased error rates" tells me you'll probably run into retry scenarios, which, depending on your contract, may not actually affect an SLA. Error rates are generally an SLO or an SLI, which are not contractually actionable.
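
As a rough illustration of why an elevated error rate may not surface to users when clients retry, here is a sketch that assumes each attempt fails independently with the same probability. That assumption often does not hold during a real incident, where errors tend to be correlated.

```python
def failure_rate_after_retries(error_rate, attempts):
    """Probability that every one of `attempts` tries fails.

    Assumes attempts fail independently with the same `error_rate`.
    """
    return error_rate ** attempts


# Hypothetical incident: 5% of requests fail, clients try up to 3 times.
print(f"{failure_rate_after_retries(0.05, 3):.6f}")
```

Under those assumptions a 5% per-request error rate becomes a much smaller user-visible failure rate, at the cost of extra latency and load from the retries.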

mdekkers · on Sept 15, 2023

> And how can I as a customer calculate this?

Either your shit works, or it doesn’t. You do monitor, don’t you?

re-thc · on Sept 15, 2023

> Either your shit works, or it doesn’t. You do monitor, don’t you?

That then becomes a he said she said problem with the vendor you're claiming against. Does everyone have time for it? You will submit the SLA credit claim and chances are unless it's WAY off you'll accept the vendor's nerfed version and move on. Something is better than nothing.

zug_zug · on Sept 15, 2023

I'm an SRE and I've seen it firsthand at multiple companies.

eddythompson80 · on Sept 15, 2023

Not sure why you’re being downvoted. Status pages for big companies are never hooked up to automation. It’s just bad PR to show red across the board.

If there is a networking outage, everything on a status page should be red, but that looks bad for PR. So you just set “networking outage” and leave everything else green, even though everything is realistically down.

aldarisbm · on Sept 15, 2023

it's also not only bad PR, but CSPs are subject to SLAs.