Postgres Webhooks with Pgstream (xata.io)
160 points by mebcitto 18 days ago | 21 comments



The GitHub repo README gives a better sense of the capabilities of the pgstream system: https://github.com/xataio/pgstream?tab=readme-ov-file#archit...

For deployments at a more serious scale, it seems they support buffering WAL events into Kafka, similar to Debezium (the current leader for change data capture), to decouple the throughput of the replication slot reader on the Postgres side from the event handlers you deliver the events to.

Pgstream seems more batteries-included compared to using Debezium to consume the PG log; Kafka is optional with pgstream, and things like webhook delivery and OpenSearch indexing are packaged in, rather than being a “choose your own adventure” game in the Kafka Streams middleware ecosystem jungle. If their offering's constraints work for your use-case, why not prefer it over Debezium, since it seems easier? I'd rather write Go than a Java/JVM language if I need to plug into the pipeline.

However, at any sort of serious scale you'll need Kafka in there, and then I'm less sure you'd make use of the other plug-and-play stuff like the OpenSearch indexer or webhooks at all. Certainly, as volume grows, webhooks start to feel like a bad fit for CDC events at the level of a single row change; at least in my brief Debezium CDC experience, my consumer pulls batches of 1000+ changes at once from Kafka.

The other thing I don't see is transaction metadata. Maybe I shouldn't worry much about it (not many people seem to be concerned), but I'd like my downstream consumer to have delayed consistency with my Postgres upstream, which means I need to consume record changes with the same transactional grouping as in Postgres; otherwise I'll probably never be consistent in practice: https://www.scattered-thoughts.net/writing/internal-consiste...


That's a great summary!

You're right: at scale, Kafka allows you to parallelise the workload more efficiently, as well as supporting multiple consumers for your replication slot. However, you should still be able to use the plug-and-play output processors with Kafka just as easily (the internal implementation is abstracted; it works the same way with or without Kafka). We currently use the search indexer with Kafka at Xata, for example. As long as the webhook server has support for rate limiting, it should be possible to handle a large number of event notifications.

Regarding the transaction metadata: the current implementation uses the wal2json Postgres output plugin (https://github.com/eulerto/wal2json), and we don't include the transactional metadata. However, this is something that would be easy to enable and integrate into the pgstream pipeline if needed, to ensure transactional consistency on the consumer side.
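
For readers curious what that would look like: wal2json's format-version 2 already interleaves begin ("B") and commit ("C") records with the row changes, and its include-xids option stamps each record with the transaction ID. Below is a minimal Go sketch (not pgstream's implementation; the slot name and connection string are assumptions) that peeks a slot with those options:

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq" // Postgres driver
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/mydb?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // format-version 2 interleaves begin ("B") and commit ("C") records
        // with the row changes; include-xids stamps each record with its
        // transaction ID so a consumer can regroup changes per transaction.
        rows, err := db.Query(`
            SELECT lsn::text, xid::text, data
            FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL,
                'format-version', '2', 'include-xids', '1')`)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        for rows.Next() {
            var lsn, xid, data string
            if err := rows.Scan(&lsn, &xid, &data); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s xid=%s %s\n", lsn, xid, data)
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
    }

A consumer can then buffer everything between a "B" and its matching "C" and apply it as one unit, preserving the upstream transactional grouping the parent comment asked about.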

Thanks for your feedback!


Implementing webhook deliveries is one of those things that's way harder than you would initially imagine. The two things I always look for in systems like this are:

1. Can it handle deliveries to slow or unreliable endpoints? In particular, what happens if the external server is deliberately slow to respond - someone could be trying to crash your system by forcing it to deliver to slow-loading endpoints.

2. How are retries handled? If the server returns a 500, a good webhooks system will queue things up for re-delivery with an exponential backoff and try a few more times before giving up completely.

Point 1 only matters if you are delivering webhooks to untrusted endpoints - systems like GitHub where anyone can sign up for hook deliveries.

Point 2 is more important.

https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c... shows that the HTTP client can be configured with a timeout (which defaults to 10s: https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c...)

From looking at https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c... it doesn't look like this system handles retries.

Retrying would definitely be a useful addition. PostgreSQL is a great persistence store for recording failures and retry attempts, so that feature would be a good fit for this system.
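
For illustration, here is a minimal Go sketch of both properties discussed above: a hard per-attempt timeout so a deliberately slow endpoint can't stall the sender, plus exponential backoff on 5xx responses with a bounded number of attempts. All names are hypothetical, not pgstream's API:

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"
    )

    // A hard per-attempt timeout bounds how long a deliberately slow
    // endpoint can stall the sender (point 1 above).
    var client = &http.Client{Timeout: 10 * time.Second}

    func deliverWithRetry(url string, payload []byte, maxAttempts int) error {
        backoff := time.Second
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
            if err == nil {
                resp.Body.Close()
                switch {
                case resp.StatusCode >= 200 && resp.StatusCode < 300:
                    return nil // delivered
                case resp.StatusCode >= 500:
                    // server error: fall through and retry with backoff
                default:
                    // 4xx: the receiver rejected the event; retrying won't help
                    return fmt.Errorf("webhook %s rejected: %d", url, resp.StatusCode)
                }
            }
            if attempt < maxAttempts {
                time.Sleep(backoff) // 1s, 2s, 4s, ...
                backoff *= 2
            }
        }
        return fmt.Errorf("webhook %s: giving up after %d attempts", url, maxAttempts)
    }

    func main() {
        if err := deliverWithRetry("https://example.com/hook", []byte(`{"event":"row.updated"}`), 5); err != nil {
            fmt.Println(err)
        }
    }

In a real system you'd persist failed attempts (e.g. in a Postgres table, as suggested above) rather than sleeping in-process, so retries survive restarts.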


I think retries are sketchy. Sometimes they really are necessary, but sometimes the API can be designed around at-most-once semantics.

Even with a retry scheme, unless you are willing to retry indefinitely, there is always the problem of missed events to handle. If you need to handle this anyway, you might as well use that assumption to simplify the process and not attempt retries at all.

From a reliability perspective, it is hard to monitor whether delivery is working if there is a variable event rate. Instead, designing it to have a synchronous endpoint the webhook server can call periodically to “catch up” has better reliability properties. Incidentally, this scheme handles at-most-once semantics pretty well, because the server can periodically catch up on all missed events.
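
As a concrete sketch of that scheme (hypothetical names, assuming a monotonically increasing cursor such as an LSN): the event source exposes a synchronous catch-up endpoint, and the webhook receiver periodically posts the last cursor it has processed to fetch everything newer:

    package main

    import (
        "encoding/json"
        "net/http"
    )

    type Event struct {
        Cursor  uint64          `json:"cursor"` // monotonically increasing position (e.g. an LSN)
        Payload json.RawMessage `json:"payload"`
    }

    // eventsAfter would read from durable storage (e.g. a Postgres table
    // keyed by cursor); stubbed out here.
    func eventsAfter(cursor uint64, limit int) []Event { return nil }

    func catchUpHandler(w http.ResponseWriter, r *http.Request) {
        var req struct {
            After uint64 `json:"after"` // last cursor the caller has processed
        }
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        // Missed webhook pushes are harmless: the next poll returns them
        // anyway, which is why this pairs well with at-most-once delivery.
        json.NewEncoder(w).Encode(eventsAfter(req.After, 1000))
    }

    func main() {
        http.HandleFunc("/catch-up", catchUpHandler)
        http.ListenAndServe(":8080", nil)
    }

Because the consumer tracks its own cursor, monitoring reduces to checking that the poll succeeds on schedule, independent of the event rate.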


That's a good call: having a "catch me up" API you can poll is a very robust way of dealing with missed events.

Implementing robust retries is a huge favor you can do for your clients though: knowing that they'll eventually get every event delivered, provided they return 200 codes for accepted events and error codes for everything else, can massively simplify the code they need to write.


Hi there! I'm one of the authors of the pgstream project and found this thread very interesting.

I completely agree, retry handling can become critical when the application relies on all events being eventually delivered. I have created an issue in the repo (https://github.com/xataio/pgstream/issues/67) to add support for notification retries.

Thanks for everyone's feedback, it's been very helpful!


Really, any time you're using a distributed system, you should ask yourself: do I need retries? Do I need circuit breakers? Do I need rate limits?

The answers are probably either "yes" or "yes, but not yet."


>a good webhooks system will queue things up for re-delivery with an exponential backoff and try a few more times before giving up completely.

Microsoft's Graph API also requires consumers to re-register webhooks every N days, a kind of "heartbeat" to make sure the webhooks aren't sent to a dead server forever.


This is definitely an interesting idea. Pgstream does offer the ability to unsubscribe from the notifications (https://github.com/xataio/pgstream/blob/main/pkg/wal/process...), but it relies on the receiver actively making that request.

It can be annoying to have to re-subscribe every N days when your application is working as expected, though. I wonder if one way to work around this could be to "blacklist" servers that have failed consistently for a certain amount of time. They could be deleted from the subscriptions table if they remain inactive once they've been blacklisted.

I have opened an issue in the repo to track this (https://github.com/xataio/pgstream/issues/68), since it would be a critical feature to avoid denial-of-service attacks.

Thanks for your feedback!


They should defo use Svix [1]. I'll reach out to the blog author; this looks like a cool blog post and use-case.

1: https://www.svix.com


Xata is putting out some cool pg things. Is pgroll still looking good these days? Seemed pretty sexy last time I looked at it.


pgroll is under active development, feel free to let us know if you have any feature requests here: https://feedback.xata.io/


A quick google didn’t reveal much about pgstream’s delivery semantics. Do these webhooks get called at most once, or at least once? I hope the latter, but in the absence of information I’d guess it is the former.


It looks to me like it's at-most-once. This code here https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c... logs an error if a delivery fails but does not appear to queue it for a second attempt.


Sad! Suitable for driving “best effort” use cases like reactive UI updates or stale-while-revalidate cache refreshes, but not usable for “business logic” like updating user access lists, billing, sending emails, or maintaining a secondary index of any sort.


The current implementation of the webhook notifier will call the webhook URL at most once. There's no real reason for it beyond it being a first iteration of the webhook output plugin. I've created an issue in the repo to support retries, since this is a valuable feature (https://github.com/xataio/pgstream/issues/67).

Worth mentioning that the retry strategy is defined at the processor level. The OpenSearch indexer, for instance, does handle retries, making it "at least once" delivery.

Thanks for your feedback!


This is very nice! I came across this through an internal ref. It appears that there is a lot of movement in the CDC space lately - pgstream, sequinstream, bemi, pg_flo (I saw some guy on X posting videos of CDC tools too).

Are there real-world use cases being solved with this, or is it just a hobby tool?


We maintain two projects for a customer where data for the applications is held in secondary databases; I can't count how many sync issues we've had with these projects. Right now there is a continuous sync that processes tons of redundant data (with a certain latency due to the amount of data).

CDC would enable the processes to focus on what's changed instead of continually processing the entire database.

I feel that DB support isn't necessarily the biggest issue for adopting these things; rather, it's that event-driven systems are somewhat more complicated for "regular" developers to debug than request-response or loop-based code.

In that sense, delivering the events via webhooks is a good idea, since it allows them to work with their regular request-response debugging tools.


Curious if there's a reason this extension is written in Go and not Zig? They posted on the blog [1] a few months ago about their cool Zig library for making extensions, and I've been playing around with it. Is Zig just not mature enough?

[1]: https://xata.io/blog/introducing-pgzx


It's not an extension. It's a standalone app that connects to postgres and listens to a replication stream.


Postgres webhooks could be very useful for me. Thanks for sharing.



