I used to get called all the time because our in-house infrastructure would fall...

majewsky · 2024-08-26T08:08:15 1724659695

Or, to be more specific, you don't like your company's implementation of their idea.

We have the same setup in my org, but we get to define alerts ourselves. All our own alerts are built so that they don't go off when the underlying infra is borked, only when there's something we can actually do at our level. We are kept honest because there is a big kerfuffle whenever an incident is reported by customers first (instead of being caught by alerting).

potamic · 2024-08-26T09:07:11 1724663231

What metrics do you alert on? How do you distinguish between an error due to a faulty database client and an error due to a database disk failure?

majewsky · 2024-08-26T09:48:12 1724665692

Taking my managed container image registry service as an example.

- The only critical alert that can actually page people is if the blackbox test fails. Every 30 seconds, it downloads a test image and if the contents don't match the expectation, an alert is raised (with some delay).

- Warning alerts are mostly for any errors being returned from background tasks, but these are only monitored during business hours.
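A minimal sketch of that kind of blackbox check, in Python. The class name, the SHA-256 comparison, and the "N consecutive failures" rule used to model the delay are all assumptions for illustration, not the actual implementation; `fetch` stands in for whatever downloads the test image.

```python
import hashlib
from typing import Callable


class BlackboxProbe:
    """Hypothetical probe: alert only after several consecutive failures."""

    def __init__(self, fetch: Callable[[], bytes], expected_sha256: str,
                 failures_before_alert: int = 4):
        self.fetch = fetch                      # downloads the test image
        self.expected_sha256 = expected_sha256  # digest of the known-good image
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def probe_once(self) -> bool:
        """Run one probe; return True if the alert should be firing."""
        try:
            digest = hashlib.sha256(self.fetch()).hexdigest()
            ok = digest == self.expected_sha256
        except Exception:
            # A failed download counts the same as wrong contents.
            ok = False
        # Any success resets the counter, so a single blip never pages anyone.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failures_before_alert
```

Calling `probe_once()` every 30 seconds with `failures_before_alert=4` would page roughly two minutes after the first bad result, assuming the failures are continuous.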

perfect_wave · 2024-08-26T14:39:02 1724683142

I don't see how that is separated from the underlying infra. If the network/server/some dependency goes down, the blackbox test will fail and you'll get paged.

silisili · 2024-08-26T19:04:13 1724699053

You can test for this. For example, we had routines that were called on repeated HTTP failures that would then get 5 or so of the top US websites. If those fail too, it moves from an application error to an infra one.
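A rough sketch of that idea in Python, not the original routine. The reference URLs are placeholders (the real list of sites is not given here), and the function names are made up for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable

# Placeholder reference sites; substitute whatever well-known sites you trust.
REFERENCE_URLS = ["https://example.com", "https://example.org"]


def http_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if any HTTP response comes back from the URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the network path works.
        return True
    except (OSError, ValueError):
        # DNS, connection, and timeout errors, or a malformed URL.
        return False


def classify_failure(reference_urls: Iterable[str] = REFERENCE_URLS,
                     is_reachable: Callable[[str], bool] = http_reachable) -> str:
    """Call this after repeated HTTP failures against your own service."""
    # If at least one unrelated site responds, the network is fine and the
    # problem is more likely in the application itself.
    if any(is_reachable(url) for url in reference_urls):
        return "application"
    return "infra"
```

Passing `is_reachable` as a parameter keeps the classification logic testable without touching the network.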

dullcrisp · 2024-08-26T09:30:25 1724664625

Define SLOs based on what can realistically be achieved with underlying infrastructure, only alert if those SLOs are breached?
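As a sketch of what that could look like, assuming the SLO is a simple success-ratio target over some window (the function name and the example numbers are made up for illustration):

```python
def slo_breached(successes: int, total: int, target: float) -> bool:
    """Return True if the observed success ratio is below the SLO target.

    `target` should already account for what the underlying infrastructure
    can realistically deliver, e.g. 0.995 rather than 1.0.
    """
    if total == 0:
        # No traffic in the window: nothing to judge, so do not alert.
        return False
    return successes / total < target
```

For example, with a 99.5% target, 990 successes out of 1000 requests would alert, while 999 out of 1000 would not.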

sgarland · 2024-08-26T10:42:49 1724668969

If your endpoint is failing, it might be you. If everyone’s endpoint is failing, it’s almost certainly not you.

latexr · 2024-08-26T11:38:34 1724672314

Pretty sure your parent poster meant a small overall team. As in, the company is small enough that everyone knows who everyone else is and there’s little to no bureaucracy to reach the right person.

Doesn’t seem like your case at all.