Hacker News

Implementing webhook deliveries is one of those things that's way harder than you would initially imagine. The two things I always look for in systems like this are:

1. Can it handle deliveries to slow or unreliable endpoints? In particular, what happens if the external server is deliberately slow to respond? Someone could be trying to take down your system by forcing it to deliver to slow-loading endpoints.

2. How are retries handled? If the server returns a 500, a good webhooks system will queue things up for re-delivery with an exponential backoff and try a few more times before giving up completely.

Point 1 only matters if you are delivering webhooks to untrusted endpoints - systems like GitHub where anyone can sign up for hook deliveries.

Point 2 is more important.

https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c... shows that the HTTP client can be configured with a timeout (which defaults to 10s https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c... )

From looking at https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c... it doesn't look like this system handles retries.

Retrying would definitely be a useful addition. PostgreSQL is a great persistence store for recording failures and retry attempts, so that feature would be a good fit for this system.




I think retries are sketchy. Sometimes they really are necessary, but sometimes the API can be designed around at-most-once semantics.

Even with a retry scheme, unless you are willing to retry indefinitely, there is always the problem of missed events to handle. If you need to handle that anyway, you might as well use the assumption to simplify the design and not attempt retries at all.

From a reliability perspective, it is hard to monitor whether delivery is working when the event rate is variable. Instead, designing the system around a synchronous endpoint the webhook server can call periodically to “catch up” has better reliability properties. Incidentally, this scheme handles at-most-once semantics pretty well, because the server can periodically catch up on all missed events.
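A sketch of what that catch-up endpoint could look like in Go; all names are hypothetical and the ordered event log is stubbed with a slice (in practice it would be a query like SELECT ... WHERE id > $1 ORDER BY id):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
)

type Event struct {
	ID      int64  `json:"id"`
	Payload string `json:"payload"`
}

// eventsSince returns every event with an ID greater than `after`,
// standing in for a query against an ordered event log.
func eventsSince(events []Event, after int64) []Event {
	var out []Event
	for _, e := range events {
		if e.ID > after {
			out = append(out, e)
		}
	}
	return out
}

// catchUpHandler lets a consumer poll for everything it missed:
// GET /events?after=<last ID it successfully processed>.
func catchUpHandler(events []Event) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		after, _ := strconv.ParseInt(r.URL.Query().Get("after"), 10, 64)
		json.NewEncoder(w).Encode(eventsSince(events, after))
	}
}

func main() {
	events := []Event{{1, "a"}, {2, "b"}, {3, "c"}}
	fmt.Println(len(eventsSince(events, 1))) // events 2 and 3
}
```

The consumer owns the cursor (`after`), so a crashed or slow consumer just catches up on its next poll; no server-side retry state is needed.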


That's a good call: having a "catch me up" API you can poll is a very robust way of dealing with missed events.

Implementing robust retries is a huge favor you can do for your clients though: knowing that every event will eventually be delivered, provided they return a 200 for accepted events and an error code for everything else, can massively simplify the code they need to write.


Hi there! I'm one of the authors of the pgstream project and found this thread very interesting.

I completely agree, retry handling can become critical when the application relies on all events being eventually delivered. I have created an issue in the repo (https://github.com/xataio/pgstream/issues/67) to add support for notification retries.

Thanks for everyone's feedback, it's been very helpful!


Really any time you're using a distributed system, you should ask yourself, do I need retries? Do I need circuit breakers? Do I need rate limits?

The answers are probably either "yes" or "yes, but not yet."
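A toy illustration of the circuit-breaker part in Go; this is a threshold-based sketch with made-up names, and real implementations also add a half-open state and time-based reset:

```go
package main

import "fmt"

// Breaker is a toy circuit breaker: after `threshold` consecutive
// failures it opens and rejects further calls until a success resets it.
type Breaker struct {
	failures  int
	threshold int
}

// Allow reports whether a call may proceed (the breaker is still closed).
func (b *Breaker) Allow() bool { return b.failures < b.threshold }

// RecordFailure counts one more consecutive failure.
func (b *Breaker) RecordFailure() { b.failures++ }

// RecordSuccess closes the breaker again.
func (b *Breaker) RecordSuccess() { b.failures = 0 }

func main() {
	b := &Breaker{threshold: 3}
	for i := 0; i < 3; i++ {
		b.RecordFailure()
	}
	fmt.Println(b.Allow()) // false
}
```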


>a good webhooks system will queue things up for re-delivery with an exponential backoff and try a few more times before giving up completely.

Microsoft's Graph API also requires consumers to re-register webhooks every N days, a kind of "heartbeat" to make sure the webhooks aren't sent to a dead server forever.


This is definitely an interesting idea. Pgstream does offer the ability to unsubscribe from the notifications (https://github.com/xataio/pgstream/blob/main/pkg/wal/process...), but it relies on the receiver actively making that request.

It can be annoying to have to re-subscribe every N days when your application is working as expected though. I wonder if one way to work around this could be to "blacklist" servers that have failed consistently for an amount of time. They could be deleted from the subscriptions table if they remain inactive once they've been blacklisted.

I have opened an issue in the repo to track this (https://github.com/xataio/pgstream/issues/68), since it would be a critical feature to avoid denial-of-service attacks.

Thanks for your feedback!


They should defo use Svix[1], I'll reach out to the blog author, this looks like a cool blog post and use-case.

1: https://www.svix.com





