Bottled Water – Real-Time Integration of PostgreSQL and Kafka

SEJeff · on Nov 1, 2016

FYI I've used bottledwater and it can cause segfaults that take down the postgres database hard.

I ran into both of these:

https://github.com/confluentinc/bottledwater-pg/issues/53

https://github.com/confluentinc/bottledwater-pg/issues/61

They're fixed now, but this is very much beta stuff. It wasn't as magical as we'd hoped so we ended up ripping it out and switching the application to use SSE[1] which streamed into a kafka producer.

[1] https://en.wikipedia.org/wiki/Server-sent_events

threeseed · on Nov 1, 2016

Not really sure why you would do it like this.

Most systems do cache/index invalidation or HDFS archiving from the web app or a set of microservices that gets notified from the web app. So instead of pushing raw DB rows into Kafka they will push generic events e.g. "user deleted account" and let the various services work out how to respond to it.

In this model you have tightly coupled both the choice of DB as well as the DB schema to the events system.

Sounds like a bit of a nightmare if your iterate a lot on your architecture.

kasey_junk · on Nov 1, 2016

We are doing something very similar to enable new use cases on a large database that has many hard to change applications already talking to it. That is its a legacy migration approach. It gets a 'generic' form of the data into an eventing system all at once that would have taken an unreasonable amount of dev resources to get any other way. New producers can talk to the events system in a more straightforward way and consumers can get all the data they need, legacy and modern, from the same place.

That said, I'll warn you, we didn't find Bottled Water to be anywhere near robust enough for our use cases. It seems to make assumptions about the size of your database, the size of your transactions, and the downside risks of logical replication backing up that certainly didn't meet our requirements.

Jweb_Guru · on Nov 1, 2016

The architecture you describe is going to be extremely sensitive to all the usual asynchronous system problems--message omission, message duplication, and message reordering (you could fix all of them with enough effort and potential slowdown, but probably don't). Having tried it, I'd categorize your system as a nightmare too :)

olalonde · on Nov 1, 2016

https://en.wikipedia.org/wiki/Change_data_capture

It's mostly useful when you can't change the application code.

manojlds · on Nov 1, 2016

This ebook goes into the details of the why - http://www.confluent.io/wp-content/uploads/2016/08/Making_Se...

(authored by the creator of this tool)

bladecatcher · on Nov 1, 2016

This makes sense for e.g, when you want to make your database content available for search (via elasticsearch) - for this you may want to push the raw DB rows.

threeseed · on Nov 1, 2016

Again it doesn't really make sense because you are tying the physical DB schema to the search engine schema.

Your application domain model should be at the centre of your architecture not the physical database model. For example storing a User object rather than a row from a User table.

I understand why a database company would see a database as the centre of the world. But it really should be your application. Especially if you want to use PostgreSQL and InfluxDB for different domain types and yet have both indexed in ElasticSearch.

shawn-butler · on Nov 1, 2016

No.

The application domain model can be buggy and full of holes. The data store is the source of truth.

zepolen · on Nov 1, 2016

Couldn't agree more.

It's also pointless to try and convince someone who thinks otherwise either, I found it better to let the those developers shoot themselves in the foot and learn the hard way. It's the only way they'll learn.

rpedela · on Nov 1, 2016

You are right that the ES and DB schemas shouldn't be directly tied together. That is why I have specific DB functions that output the correct format for ES and only those functions need to change if the ES or DB schemas change. It is a much simpler way to do it then to have application worry about it in my opinion.

zepolen · on Nov 1, 2016

You're gonna have a lot of fun once you transcend beyond 'single central application' and have multiple apps interfacing with your data.

netghost · on Nov 1, 2016

This is probably also quite useful if you have a lot of logic in Postgres itself (ie: triggers), which can't be captured by the calling code.

anentropic · on Nov 1, 2016

and cache invalidation is notoriously easy to get right...

ZenoArrow · on Nov 1, 2016

Previous discussion: https://news.ycombinator.com/item?id=9427441

IIRC this works closely with Samza. There's a decent video by the author knocking around somewhere.

EDIT: This is the video I was thinking of:

https://m.youtube.com/watch?v=fU9hR3kiOK0

agentgt · on Nov 1, 2016

I thought about doing this approach but it looked like too much work (I was going to use the listen notify event support in Postgres).

Instead we just write directly to durable queues on RabbitMQ (there are some exceptions where extreme consistency and transactions are needed).

That's is we do web->queue->databases.

There are serious cons to this bus approach as well as pros like improvements in latency.

The bottle approach is nice because you can write old school CRUD style.

Maybe I can use parts of bottle to write to our rabbit bus for the parts that are CRUD because of consistency reasons.

pram · on Nov 1, 2016

Confluent has a newish program for taking streaming data from Kafka and sending it to multiple different types of stores called Kafka Connect: http://docs.confluent.io/3.0.0/connect/ https://github.com/confluentinc/kafka-connect-jdbc

It supports HDFS and ElasticSearch as well.

pimeys · on Nov 1, 2016

It's funny. We're in a need of something like this. I was thinking my options this morning and went to Hacker News just to see the link posted here. We already have the latest PostgreSQL and Kafka running. What a coincidence!

eddd · on Nov 1, 2016

MySQL has its own solution as well. https://github.com/zendesk/maxwell

zodPod · on Nov 1, 2016

Really? It's called Bottled Water? Are we really getting that ridiculous with our naming?

crudbug · on Nov 1, 2016

I am with you here .. I just made up "Krater"

raarts · on Nov 1, 2016

Relevant username.

zodPod · on Nov 2, 2016

Haha well it's not "City Building" or some ridiculously common English word. At least, not as far as I know.. lol it was a random series of noises that popped out of my head which, I'd argue, is better than naming something "Bottled Water" or "Hydrogen" or "Rust" lol