I think retries are sketchy. Sometimes they really are necessary, but sometimes the API can be designed around at-most-once semantics.
Even with a retry scheme, unless you are willing to retry indefinitely, there is always the problem of missed events to handle. If you need to handle this anyway, it might as well use this assumption to simplify the process and not attempt retries at all.
From a reliability perspective, it is hard to monitor whether the delivery is working if there is a variable event rate. Instead, designing it to have a synchronous endpoint the web hook server can call periodically to “catch up” has better reliability properties. Incidentally, this scheme handles at-most-once semantics pretty well, because the server can periodically catch up on all missed events.
That's a good call: having a "catch me up" API you can poll is a very robust way of dealing with missed events.
Implementing robust retries is a huge favor you can do for your clients though: knowing that they'll get every event delivered eventually provided they return 200 codes for accepted events and error codes for everything else can massively simplify the code they need to write.
Hi there! I'm one of the authors of the pgstream project and found this thread very interesting.
I completely agree, retry handling can become critical when the application relies on all events being eventually delivered. I have created an issue in the repo (https://github.com/xataio/pgstream/issues/67) to add support for notification retries.
Thanks for everyone's feedback, it's been very helpful!
Even with a retry scheme, unless you are willing to retry indefinitely, there is always the problem of missed events to handle. If you need to handle this anyway, it might as well use this assumption to simplify the process and not attempt retries at all.
From a reliability perspective, it is hard to monitor whether the delivery is working if there is a variable event rate. Instead, designing it to have a synchronous endpoint the web hook server can call periodically to “catch up” has better reliability properties. Incidentally, this scheme handles at-most-once semantics pretty well, because the server can periodically catch up on all missed events.