
> Why involve a human at all?

To make a judgement call on whether the issue is severe enough to warrant the legal/financial risk of admitting your service is broken, potentially breaking customer SLAs.




Also prevents monitoring flukes and planned/transient-but-no-impact issues from showing up in dashboards.


If you're just going to lie, why have an SLA at all? It's like doing a clinical trial for a drug: a bunch of your patients die and you say "well, they were going to die anyway; it has nothing to do with the drug." If it's one person, maybe you can get away with that. When it's everyone in the experimental group, people start to wonder.

I have two arguments in favor of honest SLAs. One is that if customers suspect something is down, an honest status page gives them a piece of data with which to make their mitigation decisions. "A lot of our services are returning errors", check the auto-updated status page, "there may be an issue with network routes between AZs A and C". Now you know to drain out of those zones. If the status page says "there are no problems", you instead spend hours debugging the issue, costing yourself far more in your team's time than you spend on your cloud infrastructure in the first place. If having an SLA is what motivates the provider to hide the problem, it would be financially in your favor to not have the SLA at all. The SLA bounds your losses to what you pay for cloud resources, but your actual losses can be much higher: lost revenue, lost time troubleshooting someone else's problem, etc.
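
To put rough, made-up numbers on that asymmetry (a hypothetical 4-hour outage, a $5,000/month bill, a 10% service credit):

    SLA credit:     10% x $5,000/month              = $500
    Engineer time:  3 engineers x 4 hours x $150/hr = $1,800
    Lost revenue:   whatever the outage cost on top of that

The credit covers a fraction of the real loss, and covers nothing at all if the provider never admits there was an incident.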

The second is, SLA violations are what justify reliability engineering efforts. If you lose $1,000,000 a year to SLA violations, and you hire an SRE for $500,000 a year to reduce SLA violations by 75%, then you just made a profit of $250,000. If you say "nah there were no outages", then you're flushing that $500,000 a year down the toilet and should fire anyone working on reliability. That is obviously not healthy; the financial aspect keeps you honest and accountable.
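
Spelling that arithmetic out (all figures from the example above):

    Payouts before the hire:              $1,000,000 / year
    Payouts after a 75% reduction:          $250,000 / year
    Savings:        $1,000,000 - $250,000 = $750,000
    SRE salary:                             $500,000
    Net:              $750,000 - $500,000 = $250,000 profit

If the books say there were no outages, the savings line reads $0 and the hire looks like pure cost.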

All of this gets very difficult when you are planning your own SLAs. If everyone is lying to you, you have no choice but to lie to your customers. You can multiply together all the 99.5% SLAs of the services you depend on and give your customers a guarantee of 95%, but if the 99.5% you're quoted is actually 89.4%, then you can't meet your own guarantee. AWS can afford to lie to their customers (and, apparently, to Congress) without consequences. But you, small startup, can't. Your customers are going to notice, and they were already taking a chance going with you instead of some big company. This is a hard cycle to get out of. People don't want to lie, but they become liars because the rest of the industry is lying.
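
The compounding is easy to see if you assume, say, ten dependencies (the count is made up, but it roughly matches the 95% figure above):

    ten true 99.5% dependencies:   0.995^10         ~= 0.951  -> you can promise ~95%
    one of them really at 89.4%:   0.995^9 * 0.894  ~= 0.855  -> your "95%" is really ~85.5%

One dishonest number at the bottom of the stack blows the whole error budget you offered your own customers.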

Finally, I just want to say I don't even care about the financial aspect, really. The five figures spent on cloud expenses are nothing compared to the nights and weekends your team loses to debugging someone else's problem. They could have spent that time with their families or hobbies if the cloud provider had just said "yup, it's all broken, we'll fix it by tomorrow". Instead, they stay up late into the night looking for the root cause, find it 4 hours later, and still can't do anything except open a support ticket answered by someone who has to lie to preserve the SLA. They'll never get those hours back, and they turned out to be a complete waste of time.

It's a disaster, and I don't care if it's a hard problem. We, as an industry, shouldn't stand for it.


It's not just a hard problem; it's impossible to go from metrics to automatic announcements. Give me a detected scenario and I'll give you an explanation for the errors that has nothing to do with service health as seen by customers. From the basic "our tests have bugs", to "a reflected DDoS on internal systems triggered rate limiting on the test endpoint", to "removal of a deprecated instance type caused a spike in failed creations", to "BGP issues broke the test endpoints but nothing customer-visible", to...

You can't go from a metric to a diagnosis for a customer; there's just no 1:1 mapping, and the errors go in both directions (alarms with no customer impact, and customer impact with no alarms). AWS sucks with their status delays, but it's still better than exposing their raw internal monitoring.


I don't agree. I'm perfectly happy seeing their internal metrics.

I remember one night a long time ago while working at Google, I was trying a new approach for loading some data involving an external system. To get the performance I needed, I was going to be making a lot of requests to that service, and I was a little concerned about overloading it. I ran my job and opened up that service's monitoring console, poked around, and determined that I was in fact overloading it. I sent them an email and they added some additional replicas for me.

Now I live in the world of the cloud where I pay $100 per X requests to some service, and they won't even tell me if the errors it's returning are their fault or my fault? Not acceptable.



