
I used to get called all the time because our in-house infrastructure would fall over and my apps would crash. It didn’t matter how many times I explained that my apps couldn’t run without good infra, that I wasn’t on that team, and that I had no access or authority to do anything… when my apps went down, I got called.

So actually, I really don’t like your idea.




Or, to be more specific, you don't like your company's implementation of their idea.

We have the same setup in my org, but we get to define alerts ourselves. All our own alerts are built so that they don't go off if the underlying infra is borked, and fire only if there's something we can actually do on our level. We are kept honest because there is a big kerfuffle when an incident is reported by customers first (instead of by our alerting).


What metrics do you alert on? How do you distinguish between an error due to a faulty database client and an error due to a database disk failure?


Taking my managed container image registry service as an example.

- The only critical alert that can actually page people is if the blackbox test fails. Every 30 seconds, it downloads a test image and if the contents don't match the expectation, an alert is raised (with some delay).

- Warning alerts are mostly for any errors being returned from background tasks, but these are only monitored during business hours.
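
A minimal sketch of the blackbox check described above, not the actual implementation: it downloads a test artifact, compares it with the expected contents, and only signals a page after several consecutive failures, which stands in for the "with some delay" part. The class name `BlackboxProbe`, the `fetch` callable, the SHA-256 comparison, and the `failures_before_page` threshold are all assumptions made for illustration.

```python
import hashlib
from typing import Callable


class BlackboxProbe:
    """Tracks consecutive failures of a download-and-compare check (hypothetical helper)."""

    def __init__(self, fetch: Callable[[], bytes], expected_sha256: str,
                 failures_before_page: int = 3) -> None:
        self.fetch = fetch                          # assumed: returns the test image bytes
        self.expected_sha256 = expected_sha256      # assumed: digest of the known-good image
        self.failures_before_page = failures_before_page
        self.consecutive_failures = 0

    def run_once(self) -> bool:
        """Run one check; return True when the failure streak is long enough to page."""
        try:
            digest = hashlib.sha256(self.fetch()).hexdigest()
            ok = digest == self.expected_sha256
        except Exception:
            # A failed download counts the same as mismatched contents.
            ok = False
        # Any success resets the streak, so a single blip never pages anyone.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failures_before_page
```

A scheduler would call `run_once()` on a fixed interval (every 30 seconds in the setup described above) and raise the critical alert when it returns `True`.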


I don't see how that is separated from the underlying infra. If the network/server/some dependency goes down, the blackbox test will fail and you'll get paged.


You can test for this. For example, we had routines that were called on repeated HTTP failures that would then get 5 or so of the top US websites. If those fail too, it moves from an application error to an infra one.
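
A rough sketch of that kind of routine, under stated assumptions: after repeated HTTP failures, probe a handful of well-known sites, and if none of them are reachable either, treat the problem as infrastructure rather than application. The URL list, the function names `is_reachable` and `classify_failure`, and the 5-second timeout are placeholders chosen for illustration, not what the original routines used.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Sequence

# Assumed reference sites; any set of independent, highly available sites would do.
REFERENCE_URLS = [
    "https://www.google.com",
    "https://www.wikipedia.org",
    "https://www.amazon.com",
]


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the site answered at all, even with an HTTP error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, so the network path works.
        return True
    except (OSError, http.client.HTTPException):
        # DNS failures, timeouts, refused connections, malformed responses.
        return False


def classify_failure(reference_urls: Sequence[str] = REFERENCE_URLS,
                     probe: Callable[[str], bool] = is_reachable) -> str:
    """Call after repeated failures of your own endpoint."""
    if any(probe(url) for url in reference_urls):
        # The outside world is reachable, so the fault is likely on our side.
        return "application"
    return "infrastructure"
```

The `probe` parameter is injectable so the classification logic can be tested without touching the network.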


Define SLOs based on what can realistically be achieved with underlying infrastructure, only alert if those SLOs are breached?
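
As a sketch of what that could look like, assuming the SLO is expressed as a success ratio over some measurement window: the function name `slo_breached` and the example target are hypothetical, and the target would be set at or below what the underlying infrastructure can realistically deliver.

```python
def slo_breached(good_requests: int, total_requests: int, target: float) -> bool:
    """Return True when the observed success ratio in the window is below the SLO target."""
    if total_requests == 0:
        # No traffic in the window means nothing to judge, so do not alert.
        return False
    return good_requests / total_requests < target
```

For example, with a target of 0.99, a window with 995 good requests out of 1000 does not alert, while 980 out of 1000 does.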


If your endpoint is failing, it might be you. If everyone’s endpoint is failing, it’s almost certainly not you.


Pretty sure your parent poster meant a small overall team. As in, the company is small enough that everyone knows who everyone else is and there’s little to no bureaucracy to reach the right person.

Doesn’t seem like your case at all.



