I used PagerDuty for more than a decade at my previous job. I didn't care much f...

compumike · on April 23, 2023

Anyone who has done real engineering would realize that this problem is a consequence of a design flaw with PagerDuty (or other alternatives with a similar API, where alerting is only triggered directly by a webhook).

If your design requires that the alerting service can receive a one-off affirmative "something's broken" packet, then yes, you are asking an inherently unreliable distributed system (i.e. the Internet!) to reliably deliver a critical message at a time when you know something is broken. Good luck. :)

Instead, if you use something like a periodic heartbeat (also known as a dead man's switch, inbound liveness monitor, or outbound HTTP probe -- all of which we support at Heii On-Call https://heiioncall.com/ out of the box), you can tolerate some occasional lost messages, regardless of whose end they are on.

Real reliable systems (for example, embedded systems) use periodic heartbeats and watchdogs, and are usually designed to be lenient to the occasional missed heartbeat. If the system being monitored is truly down, then enough consecutive heartbeats will be missed that some threshold is reached and the on-call person can be alerted (or a watchdog timer can reboot a system, etc).
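To make the threshold idea concrete, here is a minimal sketch of a heartbeat monitor that only alerts after enough consecutive heartbeats have been missed. The class name, the 60-second interval, and the threshold of 3 are all assumptions chosen for illustration, not anything taken from a particular product's API.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical dead man's switch: alert only after several missed heartbeats."""

    interval_s: float = 60.0   # expected time between heartbeats (assumed value)
    missed_threshold: int = 3  # consecutive misses tolerated before alerting (assumed value)
    last_seen_s: float = 0.0   # timestamp of the most recent heartbeat received

    def record_heartbeat(self, now_s: float) -> None:
        # Any heartbeat that gets through resets the count of missed intervals.
        self.last_seen_s = now_s

    def missed_count(self, now_s: float) -> int:
        # Number of whole intervals that have elapsed since the last heartbeat.
        elapsed = max(0.0, now_s - self.last_seen_s)
        return int(elapsed // self.interval_s)

    def should_alert(self, now_s: float) -> bool:
        # A single lost packet is tolerated; sustained silence is not.
        return self.missed_count(now_s) >= self.missed_threshold


monitor = HeartbeatMonitor(interval_s=60.0, missed_threshold=3)
monitor.record_heartbeat(now_s=0.0)
print(monitor.should_alert(now_s=125.0))  # two intervals missed, still below the threshold
print(monitor.should_alert(now_s=180.0))  # three intervals missed, threshold reached
```

The point of the design is that the monitor decides from silence rather than from a message, so nothing has to be delivered at the moment of failure.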

lokar · on April 23, 2023

Also, the system at Google is not in the path of the first page (that comes directly from the alert infra); the more complex system is only needed for escalation.

traceroute66 · on April 23, 2023

> But you know why PagerDuty does so well? Basically bulletproof reliability. 99% uptime won't cut it. 99.9% uptime won't cut it. You need to be as close as possible to 100% uptime, no excuses.

Ah, but as always, the million dollar question is... What does the PagerDuty small print say?

My guess is the small print is not "bulletproof reliability" or five-nines. I betcha their contract is full of exclusions and get-outs.

It's a bit like the famous Verizon 100% SLA. Any idiot knows it's not technically feasible, but the reason you pay the Verizon-tax is so they've got some cash to pay you the inevitable SLA claims.

folmar · on April 23, 2023

The small print is irrelevant, as the amount of money you get for an SLA breach is usually less than or equal to what you pay.

What matters is the execution and word of mouth. And PagerDuty is widely known to be rock solid.

kakwa_ · on April 25, 2023

Also, the cost of the monitored service being down far exceeds (hopefully) any compensation for an SLA breach.

Honestly, PD is probably the service I would complain about the least in our (large org) monitoring chain.

At least it's reliable, which cannot be said of New Relic, for example.

jorts · on April 24, 2023

It's actually pretty straightforward without a lot of get-outs. See Service Credits in the Service Level Agreement [1].

1: https://www.pagerduty.com/standard-service-level-agreement/

SkyPuncher · on April 23, 2023

Sure, but contracts just establish the formal baselines for "when we have to talk".

You still want reliability that's far higher than your contract baselines.

closeparen · on April 23, 2023

We use PagerDuty, but we also have an internal email/SMS based "paging backup" service and I have seen it in action 4 or 5 times in 7 years. PagerDuty isn't bulletproof.

dilyevsky · on April 23, 2023

Yeah I’ve seen a number of outages there so dunno what TP is about.

mads_quist · on April 23, 2023

That's a very helpful comment. Thanks. I might consider adding SMS / calls to the notification channels. I hadn't looked at it from the perspective that folks described here.

Regarding 99.99% uptime: that's very true and also a good hint.

My current stack is very robust - tried and tested in many other products that I worked on.

Another HN user also suggested including an uptime status page to build this kind of confidence.

perlgeek · on April 23, 2023

> My current stack is very robust - tried and tested in many other products that I worked in.

For uptime you also need to consider the availability of your hosting provider; you might even need a fallback installation at a different provider, or something along those lines.
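As a rough back-of-the-envelope sketch of why a second provider helps: if the two installations fail independently (a strong assumption that real outages often violate), the combined availability is one minus the probability that both are down at once. The function names and the 99% figures below are purely illustrative, not measurements of any real provider.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability of redundant installations, assuming independent failures."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= 1.0 - a  # probability that this installation is down
    return 1.0 - p_all_down


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per (365-day) year for a given availability."""
    return (1.0 - availability) * 365 * 24


single = 0.99                                  # one provider at 99% (illustrative)
redundant = combined_availability(0.99, 0.99)  # two independent providers at 99% each

print(f"single:    {downtime_hours_per_year(single):.1f} hours/year")
print(f"redundant: {downtime_hours_per_year(redundant):.2f} hours/year")
```

In practice the failover mechanism itself can fail, and shared dependencies such as DNS make failures correlated, so treat the result as an upper bound rather than a prediction.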