
They never attach it to the monitoring because monitoring systems usually generate a lot of false positives which affect their published SLA.



Then they should have a "?" status that can be triggered by automated systems that acknowledge that it looks to be an issue but that they are manually investigating.

If it's a false positive they just resolve it without it affecting SLA and if it's a real problem then us customers wouldn't have to debug our own stack for 2 hours before Microsoft informs us that they are the problem.

EDIT: I wonder how many man-years of extra debugging work their non-working status page has caused customers.
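
A rough sketch of how the "?" status suggested above could work. All names here (`Status`, `Component`, `alert_fired`, `resolve`) are made up for illustration and are not any vendor's actual status-page API. The assumption is that monitoring may flip a component to "?" on its own, while only a human decision adds minutes to the SLA tally.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    INVESTIGATING = "?"  # automated alert fired, humans are checking


@dataclass
class Component:
    """Hypothetical status-page entry for one monitored component."""
    name: str
    status: Status = Status.OK
    sla_downtime_minutes: int = 0

    def alert_fired(self) -> None:
        # Monitoring may set "?" without any approval step.
        if self.status is Status.OK:
            self.status = Status.INVESTIGATING

    def resolve(self, real_outage: bool, minutes: int) -> None:
        # A human closes the investigation; only a confirmed
        # outage counts against the SLA.
        if real_outage:
            self.sla_downtime_minutes += minutes
        self.status = Status.OK


storage = Component("storage")
storage.alert_fired()                            # customers see "?" right away
storage.resolve(real_outage=False, minutes=20)   # false positive, SLA untouched
storage.alert_fired()
storage.resolve(real_outage=True, minutes=45)    # real outage, 45 minutes counted
```

The point of the split is that customers get the early signal either way, and the SLA number only moves once someone has confirmed the problem.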


They never attach it to the monitoring because monitoring systems usually generate a lot of correct positives which affect their published SLA.

Works equally well. See the point?


Which means if one were to require monitoring and status pages to be connected, one of two things happens (for each monitored component):

(1) The monitoring gets altered to ignore tests that return false positives (at the expense of missing the alert when it represents a real outage).

(2) The monitoring gets fixed. It wasn't working for the sysadmins/operators anyway, since it had so many false positives that their "mental model" was essentially based on (1).

At least, where I've forced the issue of doing just this, that's exactly what happened. Since SLAs took a hit and that affected bonus payouts, monitoring got a lot better, as did overall team function. Once we truly realized how bad things were, we stopped doing workarounds and started fixing problems at a more fundamental level, which led to SLAs that were both accurate and excellent.

It helped bring attention to a hidden problem, which resulted in time being allocated to fix tests that constantly produced false positives and to evaluate whether each one should exist in the first place.


Which impacts economics because some customers surely got deals guaranteeing some amount of credits based on up/downtime as reported by the status page.

And so updates to the status page become political and locked behind senior management approvals, like at AWS.


Yeah, that's why SLA reports never include <30m downtimes. Convenient truth-bending.
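
To put rough numbers on the <30m exclusion, here is a small sketch. The outage list and the 30-day month are invented for illustration, and `availability` is a hypothetical helper, not any provider's real SLA formula.

```python
def availability(outage_minutes, period_minutes, ignore_under=0):
    """Percent uptime, skipping outages shorter than `ignore_under` minutes."""
    counted = sum(m for m in outage_minutes if m >= ignore_under)
    return 100.0 * (period_minutes - counted) / period_minutes


# Hypothetical month: three short outages and one long one.
outages = [25, 20, 29, 90]
period = 30 * 24 * 60  # minutes in a 30-day month

honest = availability(outages, period)
reported = availability(outages, period, ignore_under=30)
print(f"honest: {honest:.2f}%  reported: {reported:.2f}%")
```

With these made-up numbers, 74 of the 164 downtime minutes never show up in the report, so roughly 99.62% becomes roughly 99.79%.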



