Zen and the Art of Reliability

amdelamar · on March 10, 2022

> 4. Fail small, reducing blast radius

One more thing not mentioned here is that a microservice architecture can naturally help isolate outages to small parts of your app/website, rather than taking it down entirely.

My team supports a large microservice system, and while there are definite drawbacks to the architecture, one of the major benefits is that it's never 100% down at any given time. Usually a prod incident will make one particular button flaky or one view/page fail to load. Some users won't even notice there's an outage. On-call is paged and can quickly roll back the squeaky microservice to a previously deployed version, and let an engineer investigate the root cause in a test environment later.
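To make the "one view fails, the rest still loads" idea concrete, here's a minimal sketch. It's not our actual code; `render_page`, `profile_service`, and `recommendations_service` are made-up names, and real services would be network calls with timeouts rather than local functions.

```python
from typing import Callable, Dict


def render_page(sections: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Fetch each page section from its own (hypothetical) service.

    A failure in one section is replaced with a fallback message
    instead of failing the whole page.
    """
    rendered = {}
    for name, fetch in sections.items():
        try:
            rendered[name] = fetch()
        except Exception:
            # Contain the failure: only this section degrades.
            rendered[name] = f"[{name} is temporarily unavailable]"
    return rendered


def profile_service() -> str:
    # Stand-in for a healthy microservice.
    return "profile: alice"


def recommendations_service() -> str:
    # Stand-in for the one microservice that is having an incident.
    raise TimeoutError("recommendations backend timed out")


page = render_page({
    "profile": profile_service,
    "recommendations": recommendations_service,
})
for name, content in page.items():
    print(f"{name}: {content}")
```

The profile section renders normally while the recommendations section shows the fallback, which is roughly what users see during the kind of incident described above.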

kaycebasques · on March 9, 2022

> In other words, we decided to measure only the systems we controlled. In retrospect that was naive. Our customers don’t care if we run the service that fails, or a vendor we use runs the service that fails. They care that they can’t use Zendesk to do their job.

Yes!

magicalhippo · on March 10, 2022

We have this at work all the time.

Our users don't care if it's their ERP that's failing to send data to our system, if it's the gov't system that fails to receive or reply, or if it actually is our system.

Regardless, they call our support first, 'cause that's where they feel the pain.

blakesterz · on March 9, 2022

Interesting to see how they do things. Of all the many things that we use at work, Zendesk is my favorite. It never gets in my way, does things exactly the way I want them, it's just great. GitHub is probably a close second. Slack and Basecamp are somewhere in the middle, with anything from Atlassian always being my least favorite.

gboss · on March 10, 2022

Interesting, we’ve definitely had problems with Zendesk on the implementation side. We’re really frustrated with the lack of customization of their chat client and just how massive it is in terms of asset sizes. Simple things like showing the actual name of the agent you’re about to chat with just aren’t possible. Having phone, chat and email support all in one place is nice though, and it's why we’re stuck with them.

blakesterz · on March 11, 2022

Ah, interesting. We don't use the chat thinger at all. We use the 2nd cheapest plan and don't use much beyond the tickets.

ram_rar · on March 10, 2022

Cloud services have come a long way. Not trying to diss the article, but scaling a CRUD app to 250k/sec is not as difficult as it used to be. It mainly comes down to how you manage state in your architecture.

Back when I was @ yahoo, serving 10k concurrent requests from a single server used to be such a big deal. Now, hardly anyone thinks about it. Most of the reliability/fault tolerance/auto scaling features come from the underlying AWS/GCP services. We just need to write a decent microservice to glue these things together and voila!