Assume away, but to play devil's advocate, how much testing do you think is done...

kstrauser · on July 6, 2023

I don’t accept that. If those procedures are failing, then someone should be getting 5xx alerts. Before they started failing, unit and integration tests should’ve been raising alarms.

A site that remains persistently broken has a lot more wrong with it than just some software bugs.

dylan604 · on July 6, 2023

Just because a 5xx alert is sent doesn't mean that someone will do anything about it, even if they didn't just lay off the entire staff. I'm sure some PM triaged the 5xx notices and anything coming from close account just gets pushed to the bottom of the queue. Obviously, I have no knowledge if that's what actually happened. I'm just continuing the advocacy of the devil

kstrauser · on July 6, 2023

While that’s clearly true here, it’s a giant red flag if a PM is ignoring… a giant red flag.

csydas · on July 6, 2023

Why? It's extremely common that massive 5xx errors across a service are largely just ignored unless it reaches a certain threshold. This is what I've observed in virtually every single large project I've worked on or with, and it's already quite bad for non-subscription services. If it can be dismissed as "only single digit % of users are affected rarely", it doesn't get attention.

I'm not saying this is a good thing, but it's so common with paid software offerings that it's no longer a red flag for me; I just assume that blocking issues that can be presented as only affecting a small number of users will not be addressed, especially if they're considered temporary issues. AWS and Azure certainly aren't paying out a ton when they have outages or system errors that for all intents and purposes render the services offered or machines running on their infrastructure whenever they have issues.†

For me it's hard to escape the conclusion that many services are over-subscribing their Systems and Developer staff across too many things presented as a single service; identity management, compute services, storage services, payment processing, free tiers, and so much more under a single service. A systems error or login problem or the classic Google problem of a single flagged account irradiating everything it touches on the platform is all it takes to completely take down your operations.

And the SLAs/TOS/EULA are quite complex on these platforms; it's hard for me to think that most persons can truly understand what they are agreeing to when signing up for these platforms, including the companies offering the platforms; in my own experience dealing with Storage-aaS vendors who provide S3 storage, it's very hard to get straight answers on outages or massive 5xx situations. (Embarrassingly, said Storage-aaS vendors' support teams have said in no ambiguous way that a 5xx error responses is a _client side error_ that needs to be investigated by the maintainer of the client accessing the S3 services...and I maintain it's incorrect for said vendors to use 503/500 as a response for "please slow down" when we have the 504 "Slow Down" response with S3, or the classic 429 HTTP response could be served instead, both of which are actionable for client applications)

The services are too big and doing too much, and not always very well. And ignoring the "small" interruptions that basically prevent the users of the platforms from doing their work is the norm; if the issue doesn't hit at least double-digit percentages of users, my experience is that the platforms do not budge or move since it can be hand waved away. And the user recourse in such situations is basically non-existent; maybe you'll get some credit, maybe you won't, who knows? The platforms sure don't, despite the monstrous EULAs they ask you to agree to.

I don't know what the answer is, but I really cannot consider many of the platforms, regardless of whether its for work or personal social purposes, as reliable. If a platform is going to present itself as a backbone of modern internet/computing, it's really trying to claim it's a utility, but it doesn't want to behave like a utility, it wants to get more spending from the users; as long as this is the case, where the user capture efforts take higher priority than maintaining the services and using plausibly deniable tactics to eschew that responsibility, I cannot get excited or interested in platforms; I will use them as the projects I work on require the platforms, but if it were up to me, I'd not put everything on platforms and diversify as much as possible.

† You can [0] submit a request for credit, but I'm not sure how much credit is being issued this way; I will give AWS a small nod to the fact that at least for me, their SLA Guarantee page is "fairly" easy to read, but my issue is that it looks like they only will offer credit for a fairly small amount of specific situations, and it's unclear how many of these requests are honored. A very brief search for a few terms on how long/often AWS refunds occur returns mostly the official documentation pages without any statistics, and also a few posts from AWS forums with users asking; the first post on the few threads I checked though were mostly "here's what you could have done to avoid this" scenarios; maybe the users indeed were doing something quite wrong, but at the same time it's unnerving to not quickly find stories about the process working well and efficiently.

0 - https://aws.amazon.com/compute/sla/

shadowgovt · on July 6, 2023

Absolutely true, but cross reference: the new owning company is laying off Evernote's staff.

So in this case, yes, there is a lot more wrong.

intelVISA · on July 6, 2023

Tests..? For a SaaS? Surely not?