
Very solid article. Even considering where it's published, it's jolly sensible.

I would like to jump in and say use Beam instead of dbt, but tbh that's bad advice. What the world needs is something open source with the incremental model of Beam, a fast incremental backend (thinking HTAP storage that mixes columns and rows automagically) and the ease and maintainability of dbt. There is just this massive hole. If some combination of tools could fill it, that would be the new LAMP stack for data.




Are you perhaps talking about something like https://materialize.com/ ? (btw, dbt now has some materialize compatibility)

Maybe Pravega and Beam working together? https://pravega.io/docs/v0.6.0/key-features/

Another option is something like Snowflake with tasks and streams. https://docs.snowflake.com/en/user-guide/tasks-intro.html

Or Snowflake with change streams, dbt and a scheduler in combination with lambda views. https://discourse.getdbt.com/t/how-to-create-near-real-time-...
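A rough sketch of the streams-and-tasks building blocks behind those Snowflake options (the table, stream, task and warehouse names are made up for illustration):

    -- capture inserts/updates on the raw table as a change stream
    create or replace stream raw_events_stream on table raw_events;

    -- every few minutes, fold any new changes into the prepared table
    create or replace task refresh_prepared_events
      warehouse = transform_wh
      schedule = '5 MINUTE'
      when system$stream_has_data('RAW_EVENTS_STREAM')
    as
      insert into prepared_events
      select event_id, payload, loaded_at from raw_events_stream;

    alter task refresh_prepared_events resume;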


>2. Run dbt in micro-batches

>Just don’t do it. Because dbt is primarily designed for batch-based data processing, you should not schedule your dbt jobs to run continuously. This can open the door to unforeseeable bugs.

Why not though? You can implement incremental models and run them continuously. Sure, it's more work, but what bugs does this cause?
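For concreteness, a minimal dbt incremental model looks roughly like this (the source and column names are made up); what actually breaks if you schedule it every few minutes?

    -- models/events_incremental.sql
    {{ config(materialized='incremental', unique_key='event_id') }}

    select event_id, user_id, payload, loaded_at
    from {{ source('raw', 'events') }}

    {% if is_incremental() %}
      -- only pick up rows newer than what's already in the target table
      where loaded_at > (select max(loaded_at) from {{ this }})
    {% endif %}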


Totally agree. While not using dbt specifically, I've done this on tables with billions of rows and it works perfectly. And even this can be combined with a lambda view, giving you the best of both worlds. Combining the two hides any latency from the incremental process, since that can take time.

But I did end up questioning why I needed to continuously microbatch when the lambda views are able to bridge the gap. It turned out that the lambda views were good enough that we could reduce the microbatching back to every 24hrs, and even that was just being overly cautious. 48hrs or more might have been good enough.
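For anyone unfamiliar with the pattern: a lambda view is roughly a union of the already-prepared table with the same transformation applied on the fly to whatever arrived after the last batch, something like this (table names made up):

    -- batch part: the table maintained by the (micro)batch process
    select * from prepared_events
    union all
    -- real-time part: same logic applied at read time to rows newer than the last batch
    select * from events_current_view
    where loaded_at > (select max(loaded_at) from prepared_events)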

It turned out that the costly part of the microbatching was really merging (inserts and updates, not append-only) the delta data back into the prepared table. Selecting, combining and consolidating new and historic data is extremely fast.
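That costly merge step is essentially a standard MERGE of the delta into the prepared table (names made up):

    merge into prepared_events t
    using events_delta d
      on t.event_id = d.event_id
    when matched then
      update set payload = d.payload, updated_at = d.updated_at
    when not matched then
      insert (event_id, payload, updated_at)
      values (d.event_id, d.payload, d.updated_at);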


Oh, you should check out Materialize. I feel Materialize is like dbt but with an ingestion layer and real-time materialized views.

To deliver that, you need to centralize data in the Materialize database, which is my main caveat. With dbt you can use any data warehouse.
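To make the comparison concrete: in Materialize the transformation itself is a SQL materialized view that is kept up to date as new data streams in, e.g. (assuming a source called events has already been created from Kafka or similar):

    -- incrementally maintained as new events arrive
    create materialized view user_event_counts as
    select user_id, count(*) as event_count
    from events
    group by user_id;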


Are you talking about Apache Beam? I happened to land a job where I learned how to create Dataflow pipelines using Apache Beam (Java) on GCP.

I'm a little worried that I might be investing my time in a skill/tool that isn't much in demand (instead of common backend/frontend development).


Apache Beam is quite a nice and flexible tool. If you prefer Scala, then you can try Scio: https://github.com/spotify/scio.



