
Very solid article. Even considering where it's published, it's jolly sensible.

I would like to jump in and say use Beam instead of dbt, but tbh that's bad advice. What the world needs is something open source with the incremental model of Beam, a fast incremental backend (thinking HTAP storage that mixes columns and rows automagically) and the ease and maintainability of dbt. There is just this massive hole. If some combination of tools could fill it, that would be the new LAMP stack for data.




Are you perhaps talking about something like https://materialize.com/ ? (btw, dbt now has some materialize compatibility)

Maybe Pravega and Beam working together? https://pravega.io/docs/v0.6.0/key-features/

Another option is something like Snowflake with tasks and streams. https://docs.snowflake.com/en/user-guide/tasks-intro.html

Or Snowflake with change streams, dbt and a scheduler in combination with lambda views. https://discourse.getdbt.com/t/how-to-create-near-real-time-...
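A rough sketch of the streams-and-tasks building blocks behind those Snowflake options (the table, stream, task and warehouse names are made up for illustration):

    -- capture inserts/updates on the raw table as a change stream
    create or replace stream raw_events_stream on table raw_events;

    -- every few minutes, fold any new changes into the prepared table
    create or replace task refresh_prepared_events
      warehouse = transform_wh
      schedule = '5 MINUTE'
      when system$stream_has_data('RAW_EVENTS_STREAM')
    as
      insert into prepared_events
      select event_id, payload, loaded_at from raw_events_stream;

    alter task refresh_prepared_events resume;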


>2. Run dbt in micro-batches

>Just don’t do it. Because dbt is primarily designed for batch-based data processing, you should not schedule your dbt jobs to run continuously. This can open the door to unforeseeable bugs.

Why not though? You can implement incremental models and run them continuously. Sure, it's more work, but what bugs does this cause?
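For concreteness, a minimal dbt incremental model looks roughly like this (the source and column names are made up); what actually breaks if you schedule it every few minutes?

    -- models/events_incremental.sql
    {{ config(materialized='incremental', unique_key='event_id') }}

    select event_id, user_id, payload, loaded_at
    from {{ source('raw', 'events') }}

    {% if is_incremental() %}
      -- only pick up rows newer than what's already in the target table
      where loaded_at > (select max(loaded_at) from {{ this }})
    {% endif %}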


Totally agree. While not using dbt specifically, I've done this on tables with billions of rows and it works perfectly. And even this can be combined with a lambda view, giving you the best of both worlds. Combining the two hides any latency from the incremental process, since that can take time.

But I did end up questioning why I needed to continuously microbatch when the lambda views are able to bridge the gap. It turned out that the lambda views were good enough that we could reduce the microbatching back to every 24hrs, and even that was just being overly cautious. 48hrs or more might have been good enough.
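For anyone unfamiliar with the pattern: a lambda view is roughly a union of the already-prepared table with the same transformation applied on the fly to whatever arrived after the last batch, something like this (table names made up):

    -- batch part: the table maintained by the (micro)batch process
    select * from prepared_events
    union all
    -- real-time part: same logic applied at read time to rows newer than the last batch
    select * from events_current_view
    where loaded_at > (select max(loaded_at) from prepared_events)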

It turned out that the costly part of the microbatching was really merging (inserts and updates, not append-only) the delta data back into the prepared table. Selecting, combining and consolidating new and historic data is extremely fast.
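That costly merge step is essentially a standard MERGE of the delta into the prepared table (names made up):

    merge into prepared_events t
    using events_delta d
      on t.event_id = d.event_id
    when matched then
      update set payload = d.payload, updated_at = d.updated_at
    when not matched then
      insert (event_id, payload, updated_at)
      values (d.event_id, d.payload, d.updated_at);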


Oh, you should check out Materialize. I feel Materialize is like dbt but with an ingestion layer and real-time materialized views.

To deliver that, you need to centralize data in the Materialize database, which is my main caveat. With dbt you can use any data warehouse.
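To make the comparison concrete: in Materialize the transformation itself is a SQL materialized view that is kept up to date as new data streams in, e.g. (assuming a source called events has already been created from Kafka or similar):

    -- incrementally maintained as new events arrive
    create materialized view user_event_counts as
    select user_id, count(*) as event_count
    from events
    group by user_id;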


Are you talking about Apache Beam? I happened to land a job where I learned how to create Dataflow pipelines using Apache Beam (Java) on GCP.

I'm a little worried that I might be investing my time in a skill/tool that isn't much in demand (instead of common backend/frontend development).


Apache Beam is quite a nice and flexible tool. If you prefer Scala, then you can try Scio: https://github.com/spotify/scio.



