The GitHub repo README gives a better sense of the capabilities of the pgstream system: https://github.com/xataio/pgstream?tab=readme-ov-file#archit...

For deployments at a more serious scale, it seems they support buffering WAL events into Kafka, similar to Debezium (the current leader for change data capture), to decouple the throughput of the replication slot reader on the Postgres side from the event handlers you deliver the events to.
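
To make the decoupling concrete, here’s a minimal Go sketch of the producer half of that pattern, my own illustration rather than pgstream’s actual code: the slot reader feeds a buffered channel, and a writer drains it into Kafka keyed by table so per-table ordering is preserved (the event shape and the pgstream.wal topic name are made up; the client is segmentio/kafka-go):

    // Illustration only: buffering WAL events into Kafka so the replication
    // slot reader isn't throttled by slow downstream handlers. The walEvent
    // shape and topic name are assumptions, not pgstream's actual types.
    package main

    import (
        "context"
        "encoding/json"
        "log"

        "github.com/segmentio/kafka-go"
    )

    type walEvent struct {
        LSN   string          `json:"lsn"`
        Table string          `json:"table"`
        Data  json.RawMessage `json:"data"`
    }

    func main() {
        w := &kafka.Writer{
            Addr:     kafka.TCP("localhost:9092"),
            Topic:    "pgstream.wal", // hypothetical topic
            Balancer: &kafka.Hash{},  // key by table -> per-table ordering
        }
        defer w.Close()

        events := make(chan walEvent, 1024) // filled by the slot reader (not shown)

        for ev := range events {
            payload, err := json.Marshal(ev)
            if err != nil {
                log.Fatalf("marshal: %v", err)
            }
            if err := w.WriteMessages(context.Background(), kafka.Message{
                Key:   []byte(ev.Table),
                Value: payload,
            }); err != nil {
                log.Fatalf("write to kafka: %v", err)
            }
        }
    }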

Pgstream seems more batteries-included than using Debezium to consume the PG log: Kafka is optional with pgstream, and things like webhook delivery and OpenSearch indexing are packaged in, rather than being a “choose your own adventure” game in the Kafka Streams middleware ecosystem jungle. If its constraints work for your use case, why not prefer it over Debezium, since it seems easier? I’d rather write Go than a Java/JVM language if I need to plug into the pipeline.

However, at any sort of serious scale you’ll need Kafka in there, and then I’m less sure you’d make use of the other plug-and-play stuff like the OpenSearch indexer or webhooks at all. Certainly as volume grows, webhooks start to feel like a bad fit for CDC events at the level of a single row change; at least in my brief Debezium CDC experience, my consumer pulls batches of 1000+ changes at once from Kafka.
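
For what it’s worth, the batch-pull shape looks roughly like this in Go (a sketch with segmentio/kafka-go; the topic, group and applyBatch are placeholders I made up):

    // Sketch: pulling CDC events from Kafka in batches instead of per-row
    // webhook deliveries. Topic and consumer group names are hypothetical.
    package main

    import (
        "context"
        "log"
        "time"

        "github.com/segmentio/kafka-go"
    )

    func main() {
        r := kafka.NewReader(kafka.ReaderConfig{
            Brokers: []string{"localhost:9092"},
            Topic:   "pgstream.wal",    // hypothetical
            GroupID: "downstream-sink", // hypothetical
        })
        defer r.Close()

        const batchSize = 1000
        for {
            // Collect up to batchSize messages, or whatever arrives within a second.
            batch := make([]kafka.Message, 0, batchSize)
            ctx, cancel := context.WithTimeout(context.Background(), time.Second)
            for len(batch) < batchSize {
                msg, err := r.FetchMessage(ctx)
                if err != nil {
                    break // deadline hit or reader closed
                }
                batch = append(batch, msg)
            }
            cancel()
            if len(batch) == 0 {
                continue
            }
            applyBatch(batch) // one bulk write to the downstream sink (not shown)
            if err := r.CommitMessages(context.Background(), batch...); err != nil {
                log.Fatalf("commit offsets: %v", err)
            }
        }
    }

    // applyBatch stands in for whatever bulk-write API the sink offers.
    func applyBatch(msgs []kafka.Message) {}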

The other thing I don’t see is transaction metadata. Maybe I shouldn’t worry much about it (not many people seem to be concerned), but I’d like my downstream consumer to have delayed consistency with my Postgres upstream, which means I need to consume record changes with the same transactional grouping as in Postgres; otherwise I’ll probably never be consistent in practice: https://www.scattered-thoughts.net/writing/internal-consiste...
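
What I’d want, roughly, is for the consumer to replay each upstream transaction as one downstream transaction. A hand-wavy Go sketch of the apply side (the change struct with pre-rendered SQL is a placeholder I made up, and it assumes the events already arrive grouped per upstream transaction):

    // Sketch: applying one upstream transaction's changes atomically downstream,
    // so the replica is always at some consistent transaction boundary.
    // The change type and its pre-rendered SQL field are made-up placeholders.
    package apply

    import "database/sql"

    type change struct {
        Kind   string // insert | update | delete
        Schema string
        Table  string
        SQL    string // pre-rendered statement for this change
        Args   []any
    }

    // applyTransaction writes all changes from one upstream transaction in a
    // single downstream transaction; either the whole group lands, or none of it.
    func applyTransaction(db *sql.DB, changes []change) error {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        for _, c := range changes {
            if _, err := tx.Exec(c.SQL, c.Args...); err != nil {
                tx.Rollback()
                return err
            }
        }
        return tx.Commit()
    }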




That's a great summary!

You're right: at scale, Kafka allows you to parallelise the workload more efficiently, as well as supporting multiple consumers for your replication slot. However, you should still be able to use the plug-and-play output processors with Kafka just as easily (the internal implementation is abstracted, so it works the same way with or without Kafka). We currently use the search indexer with Kafka at Xata, for example. As long as the webhook server supports rate limiting, it should be possible to handle a large volume of event notifications.
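
To make the rate-limiting point concrete, here’s a rough sketch of what a webhook receiver could look like with load shedding (not pgstream code; the route, limits and golang.org/x/time/rate choice are just for illustration):

    // Sketch: a webhook receiver that rate limits incoming CDC notifications,
    // returning 429 so the sender can back off and retry. Not pgstream code;
    // route and limits are arbitrary.
    package main

    import (
        "io"
        "log"
        "net/http"

        "golang.org/x/time/rate"
    )

    func main() {
        // Allow ~500 notifications/second with bursts of up to 1000.
        limiter := rate.NewLimiter(rate.Limit(500), 1000)

        http.HandleFunc("/webhooks/cdc", func(w http.ResponseWriter, r *http.Request) {
            if !limiter.Allow() {
                w.Header().Set("Retry-After", "1")
                http.Error(w, "rate limited", http.StatusTooManyRequests)
                return
            }
            body, err := io.ReadAll(r.Body)
            if err != nil {
                http.Error(w, "bad request", http.StatusBadRequest)
                return
            }
            handleEvent(body) // enqueue for processing (not shown)
            w.WriteHeader(http.StatusOK)
        })

        log.Fatal(http.ListenAndServe(":8080", nil))
    }

    func handleEvent(payload []byte) {}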

Regarding the transaction metadata: the current implementation uses the wal2json Postgres output plugin (https://github.com/eulerto/wal2json), and we don't include the transactional metadata. However, this is something that would be easy to enable and integrate into the pgstream pipeline if needed, to ensure transactional consistency on the consumer side.
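
For reference, if memory serves on the wal2json v1 format, enabling include-xids makes each transaction arrive as a single JSON document with the xid and its full list of changes, so a consumer could recover the grouping with something like this rough, untested Go sketch:

    // Rough sketch: decoding a wal2json (format v1) transaction document.
    // With include-xids enabled, each document carries the xid and the full
    // list of changes for that transaction, so the consumer can apply them
    // as one unit. Untested; field names follow the wal2json README.
    package main

    import (
        "encoding/json"
        "fmt"
        "log"
    )

    type wal2jsonTx struct {
        Xid    int64            `json:"xid"`
        Change []wal2jsonChange `json:"change"`
    }

    type wal2jsonChange struct {
        Kind         string   `json:"kind"` // insert | update | delete
        Schema       string   `json:"schema"`
        Table        string   `json:"table"`
        ColumnNames  []string `json:"columnnames"`
        ColumnValues []any    `json:"columnvalues"`
    }

    func main() {
        raw := []byte(`{"xid":12345,"change":[{"kind":"insert","schema":"public","table":"users","columnnames":["id","name"],"columnvalues":[1,"alice"]}]}`)

        var tx wal2jsonTx
        if err := json.Unmarshal(raw, &tx); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("xid=%d changes=%d\n", tx.Xid, len(tx.Change))
        for _, c := range tx.Change {
            fmt.Printf("  %s %s.%s %v\n", c.Kind, c.Schema, c.Table, c.ColumnValues)
        }
    }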

Thanks for your feedback!



