
Streaming has been the story for the past couple of decades already. But once you accept the value of an event-first architecture, the benefits of traditional RDBMSes become much weaker.



It’s been discussed for a long time, but the ecosystem is still spotty. Many databases have some kind of “live query” feature, for instance, but these tend to come with limitations or aren’t intended for production use.
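For example, the closest thing stock Postgres has to a live query is LISTEN/NOTIFY, and the gap shows immediately: you're told *that* something changed, not what the new result is. A minimal sketch in Python with psycopg2, assuming a hypothetical trigger on the watched table that calls pg_notify('orders_changed', ...):

  # Minimal "live query" over Postgres LISTEN/NOTIFY. Assumes a trigger
  # on the watched table fires pg_notify('orders_changed', ...).
  # Table and channel names are made up for illustration.
  import select
  import psycopg2
  import psycopg2.extensions

  conn = psycopg2.connect("dbname=app")
  conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
  cur = conn.cursor()
  cur.execute("LISTEN orders_changed;")

  while True:
      # Block until the connection's socket is readable, then drain
      # any pending notifications.
      if select.select([conn], [], [], 5) == ([], [], []):
          continue  # timeout, nothing changed
      conn.poll()
      while conn.notifies:
          conn.notifies.pop(0)
          # NOTIFY only says *that* something changed, not what the new
          # result is -- we still have to re-run the query ourselves.
          cur.execute("SELECT count(*) FROM orders;")
          print("orders count:", cur.fetchone()[0])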

A lot of work has taken place in the Kafka/Flink/Dataflow ecosystem, but that still leaves a lot of work for the developer compared to simply subscribing to query results.
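To make "a lot of work for the developer" concrete: even a trivial aggregate against a bare Kafka consumer looks roughly like the sketch below (made-up topic and field names, using confluent-kafka), and it still has none of the durable state, windowing, or rebalance handling a production pipeline needs:

  # Rebuilding a simple count by hand from a raw Kafka consumer.
  # Topic name, JSON encoding, and per-user counting are illustrative
  # assumptions, not anyone's actual schema.
  import json
  from collections import defaultdict
  from confluent_kafka import Consumer

  consumer = Consumer({
      "bootstrap.servers": "localhost:9092",
      "group.id": "order-counts",
      "auto.offset.reset": "earliest",
  })
  consumer.subscribe(["orders"])

  counts = defaultdict(int)  # in-memory only: lost on restart,
                             # wrong after rebalances, no windowing
  while True:
      msg = consumer.poll(1.0)
      if msg is None:
          continue
      if msg.error():
          continue  # real code needs actual error handling
      event = json.loads(msg.value())
      counts[event["user_id"]] += 1
      print(event["user_id"], counts[event["user_id"]])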

I do think a lot of work has been done, but it all needs to move up a few levels of abstraction.


> A lot of work has taken place in the Kafka/Flink/Dataflow ecosystem, but that still leaves a lot of work for the developer compared to simply subscribing to query results.

Personally, I find it much easier to write the code explicitly than to try to understand what a query planner is doing, especially if performance matters. (That's not to say there isn't plenty of room for improvement in the streaming world, but I'd rather have a helper library I can use on top of the low-level API than have to go through a parser and planner for every query, even when I know exactly what I want to do.) But I seem to be an outlier in this regard.
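A sketch of what such a helper library might look like: a tiny tumbling-window counter you can read top to bottom, with no planner in between. Everything here is hypothetical, not an existing library:

  # Hypothetical helper-library shape: explicit code, no query planner.
  import time
  from collections import defaultdict
  from typing import Callable, Dict, Iterable, Iterator, Tuple

  def windowed_count(events: Iterable[dict],
                     key: Callable[[dict], str],
                     window_secs: float = 60.0
                     ) -> Iterator[Tuple[float, Dict[str, int]]]:
      """Yield (window_start, counts) once per tumbling window.

      Simplification: a window only closes when the next event
      arrives after it expires; a real library would use a timer.
      """
      window_start = time.time()
      counts: Dict[str, int] = defaultdict(int)
      for event in events:
          counts[key(event)] += 1
          if time.time() - window_start >= window_secs:
              yield window_start, dict(counts)
              counts.clear()
              window_start = time.time()

  # Usage: plug in any event source, e.g. a Kafka consumer wrapped as
  # a generator -- you can see exactly what will run:
  # for start, counts in windowed_count(stream(), key=lambda e: e["user_id"]):
  #     print(start, counts)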


Yep. Exactly this.

Our focus (Hasura) is on the last mile, so that innovations on the data side (e.g. Materialize, ksqlDB, Timescale continuous aggregates) become “just obvious” to start building applications and services against.


Maybe if you're a really big company with petabytes of data, but I wouldn't say it's worth the operational complexity for 90% of tech companies out there. Seriously, design your schema so you have tables optimized for aggregate functions; you can use materialized views or populate them via triggers. Either is still going to be an order of magnitude less work than dealing with all the edge cases that are going to come up when you try to build a distributed data pipeline. That's death by a thousand cuts for any small-to-medium startup that doesn't already have a team of engineers experienced in making such systems work.
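For reference, a trigger-populated aggregate table in stock Postgres is a few lines of DDL. A sketch with made-up table names (EXECUTE FUNCTION needs Postgres 11+; older versions use EXECUTE PROCEDURE):

  # A trigger-maintained aggregate table -- the "tables optimized for
  # aggregate functions" idea. Schema and names are made up.
  import psycopg2

  ddl = """
  CREATE TABLE IF NOT EXISTS order_totals (
      user_id  bigint PRIMARY KEY,
      total    numeric NOT NULL DEFAULT 0
  );

  CREATE OR REPLACE FUNCTION bump_order_total() RETURNS trigger AS $$
  BEGIN
      INSERT INTO order_totals (user_id, total)
      VALUES (NEW.user_id, NEW.amount)
      ON CONFLICT (user_id)
      DO UPDATE SET total = order_totals.total + EXCLUDED.total;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  DROP TRIGGER IF EXISTS orders_bump_total ON orders;
  CREATE TRIGGER orders_bump_total
      AFTER INSERT ON orders
      FOR EACH ROW EXECUTE FUNCTION bump_order_total();
  """

  with psycopg2.connect("dbname=app") as conn:
      with conn.cursor() as cur:
          cur.execute(ddl)  # committed when the with-block exits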

Plus, you can stream DB changes from Postgres to Kafka for those edge cases where you really need it.
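That path usually means logical decoding plus a Debezium connector on Kafka Connect. A sketch of registering one via the Connect REST API; hosts, credentials, and table names are assumptions, and the config keys shown are the Debezium 2.x ones (1.x used database.server.name instead of topic.prefix):

  # Register a Debezium Postgres connector with Kafka Connect's REST
  # API -- the usual way to stream Postgres changes into Kafka.
  import requests

  connector = {
      "name": "app-postgres-cdc",
      "config": {
          "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
          "database.hostname": "localhost",
          "database.port": "5432",
          "database.user": "debezium",
          "database.password": "secret",
          "database.dbname": "app",
          "topic.prefix": "app",           # topics become app.<schema>.<table>
          "table.include.list": "public.orders",
          "plugin.name": "pgoutput",       # Postgres's built-in decoding plugin
      },
  }

  resp = requests.post("http://localhost:8083/connectors", json=connector)
  resp.raise_for_status()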

TL;DR: if you're at a startup and thinking of building a distributed system.. DON'T. Stick with a monolith and spin out services as NEEDED.


> Maybe if you're a really big company with petabytes of data, but I wouldn't say it's worth the operational complexity for 90% of tech companies out there. Seriously, design your schema so you have tables optimized for aggregate functions; you can use materialized views or populate them via triggers. Either is still going to be an order of magnitude less work than dealing with all the edge cases that are going to come up when you try to build a distributed data pipeline. That's death by a thousand cuts for any small-to-medium startup that doesn't already have a team of engineers experienced in making such systems work.

Funny, my experience is the exact opposite. Materialized views and triggers are death by a thousand cuts with all their edge cases. Whereas if you just use Kafka for everything from day 1, everything works consistently and you have no problems.
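"Kafka for everything from day 1" in practice means every state change is an event appended to a topic, and downstream views are rebuilt from it. A minimal producer sketch, with made-up topic and payload:

  # Every write is an event appended to a topic; consumers build
  # whatever views they need from it. Names and fields are made up.
  import json
  from confluent_kafka import Producer

  producer = Producer({"bootstrap.servers": "localhost:9092"})

  def record_order(order_id: str, user_id: str, amount: float) -> None:
      event = {"type": "order_placed", "order_id": order_id,
               "user_id": user_id, "amount": amount}
      # Key by user_id so all of a user's events stay ordered
      # within a single partition.
      producer.produce("orders", key=user_id, value=json.dumps(event))

  record_order("o-1", "u-42", 19.99)
  producer.flush()  # block until delivery is confirmed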


So who's managing the Kafka cluster? Are you running scheduled backups of it? What happens if the system is down while ingesting data? I can only imagine you have someone who's already very experienced in Kafka if you find that easier to manage than an RDBMS, which already has established hosted services available.


Running Kafka is fiddly, but a lot easier than running a true master-master RDBMS setup (which none of my employers have ever managed; at best, some have had a read replica and an untested shell script from five years ago that supposedly promotes that read replica to master if the master goes down). Backups are the same problem either way (and, while it's not a full replacement for actual backups, the fact that Kafka is clustered means you're protected from some failure scenarios out of the box in a way that you aren't with something like Postgres). And there are plenty of established hosted Kafka services if that's your preferred approach.



