I used PagerDuty for more than a decade at my previous job. I didn't care much for the UI. But you know why PagerDuty does so well? Basically bulletproof reliability. 99% uptime won't cut it. 99.9% uptime won't cut it. You need to be as close as possible to 100% uptime, no excuses. PagerDuty isn't perfect, but it was one of the most reliable services we ever used.
I sincerely wish you luck with allquiet. I just want to make very sure you are aware of why people still pay for PagerDuty. To compete, you need to be looking at 99.99% uptime or better (ideally 99.999%, roughly 5 minutes of downtime a year), where 'uptime' is defined as the ability to exercise the entire notification stack. The moment someone's site has an outage and you aren't able to deliver the notification, you lose the customer and everyone they talk to.
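To put numbers on those uptime targets, here is a rough sketch of the arithmetic. The function name is made up for illustration, and it assumes a 365.25-day year with downtime spread over the whole year rather than measured per month or per incident.

```python
# Minutes in an average year (365.25 days); an assumption for this sketch.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime -> {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is why 99.9% (almost nine hours a year) and 99.999% (about five minutes) are very different promises.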
I also worry about in-app notifications, but that's well-covered by everyone else's comments.
PagerDuty is vulnerable. Their UI is garbage. But you need to have bulletproof uptime to take them down. It's a tough challenge and I wish you luck!
Anyone who has done real engineering would realize that this problem is a consequence of a design flaw in PagerDuty (or other alternatives with a similar API, where alerting is only triggered directly by a webhook).
If your design requires that the alerting service can receive a one-off affirmative "something's broken" packet, then yes, you are asking an inherently unreliable distributed system (i.e. the Internet!) to reliably deliver a critical message at a time when you know something is broken. Good luck. :)
Instead, if you use something like a periodic heartbeat (also known as a dead man's switch, inbound liveness monitor, or outbound HTTP probe -- all of which we support at Heii On-Call https://heiioncall.com/ out of the box), you can tolerate some occasional lost messages, regardless of whose end they are on.
Real reliable systems (for example, embedded systems) use periodic heartbeats and watchdogs, and are usually designed to be lenient to the occasional missed heartbeat. If the system being monitored is truly down, then enough consecutive heartbeats will be missed that some threshold is reached and the on-call person can be alerted (or a watchdog timer can reboot a system, etc).
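A minimal sketch of that missed-heartbeat threshold idea, to make it concrete. This is not how Heii On-Call or any particular product implements it; the class name, field names, and the 60-second / 3-miss numbers are all invented for illustration, and timestamps are passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical dead man's switch: alert only after N consecutive misses."""

    interval_s: float        # expected time between heartbeats
    missed_threshold: int    # consecutive missed heartbeats before alerting
    last_seen_s: float = 0.0  # timestamp of the most recent heartbeat

    def record_heartbeat(self, now_s: float) -> None:
        # Any heartbeat that arrives resets the count of missed intervals.
        self.last_seen_s = now_s

    def missed_count(self, now_s: float) -> int:
        # Number of whole intervals that have elapsed with no heartbeat.
        elapsed = now_s - self.last_seen_s
        return max(0, int(elapsed // self.interval_s))

    def should_alert(self, now_s: float) -> bool:
        return self.missed_count(now_s) >= self.missed_threshold


if __name__ == "__main__":
    monitor = HeartbeatMonitor(interval_s=60, missed_threshold=3)
    monitor.record_heartbeat(now_s=0)
    for t in (130, 180):
        print(f"t={t}s missed={monitor.missed_count(t)} alert={monitor.should_alert(t)}")
```

The point of the design is that one or two dropped heartbeats (on either end) are tolerated, while a system that is truly down keeps missing them until the threshold trips.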
Also, the system at Google is not in the path of the first page (that comes directly from the alert infra); the more complex system is only needed for escalation.
> But you know why PagerDuty does so well? Basically bulletproof reliability. 99% uptime won't cut it. 99.9% uptime won't cut it. You need to be as close as possible to 100% uptime, no excuses.
Ah, but as always, the million dollar question is... What does the PagerDuty small print say?
My guess is the small print is not "bulletproof reliability" or five-nines. I betcha their contract is full of exclusions and get-outs.
It's a bit like the famous Verizon 100% SLA. Any idiot knows it's not technically feasible, but the reason you pay the Verizon-tax is so they've got some cash to pay you the inevitable SLA claims.
We use PagerDuty, but we also have an internal email/SMS based "paging backup" service and I have seen it in action 4 or 5 times in 7 years. PagerDuty isn't bulletproof.
That's a very helpful comment. Thanks. I might consider adding SMS / calls to the notification channels. I hadn't looked at it from the perspective that folks described here.
Regarding 99.99% uptime: that's very true and also a good hint.
My current stack is very robust - tried and tested in many other products that I worked on.
Another HN user also suggested including an uptime status page to build this kind of confidence.
> My current stack is very robust - tried and tested in many other products that I worked on.
For uptime you also need to consider the availability of your hosting provider. You might even need a fallback installation at a different provider, or something along those lines.
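A back-of-the-envelope sketch of why a second provider helps. It assumes the providers fail independently and that failover between them works perfectly, which is optimistic in practice; the function name and the 99.9% figures are made up for illustration.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability of redundant installations, assuming independent failures.

    The service is down only when every installation is down at once,
    so the unavailabilities multiply.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


if __name__ == "__main__":
    single = 0.999  # hypothetical 99.9% availability per provider
    both = combined_availability(single, single)
    print(f"one provider:  {single:.4%}")
    print(f"two providers: {both:.4%}")
```

Under those assumptions two 99.9% providers get you to about 99.9999%, but correlated failures (shared DNS, a bad deploy pushed to both sites) will eat into that quickly.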