Having used a variety of modern ETL frameworks in the past years, I consider wri...

knite · on May 30, 2020

There are too many different tools in the space. I've been heavily researching workflow / ETL frameworks this week, and even after culling the ones that seemed like poor fits, I'm still left with:

- https://github.com/getpopper/popper

- https://docs.pachyderm.com/

- https://github.com/lyft/flyte

- https://aws.amazon.com/step-functions/

- https://github.com/spotify/luigi

- https://docs.metaflow.org/

- https://github.com/dagster-io/dagster

- https://github.com/argoproj/argo

- https://github.com/prefecthq/prefect

ramraj07 · on May 29, 2020

The primary problem for me is spending so much time setting up these monsters to do what's basically a set of cron jobs. What's the simplest system out there that is highly available and can be deployed as easily as possible?

Another question: I strongly feel that pipeline definitions should live in the database, not in code. I keep coming back to that design pattern every time I start coding my own simple scheduling solution. Is there merit to this thought?

throwaway7281 · on May 29, 2020

Yes, cron is a bit undervalued: for one-off (well-locked) tasks it's perfectly fine to create a crontab entry. And simplicity is king. I feel people throw frameworks at problems where a simple shell or Go script run from cron would be enough.
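To make "well locked" concrete, here is a minimal sketch of a cron-driven script that skips a run when the previous one is still going. It assumes a Unix host (it relies on `fcntl.flock`); the lock path and the `nightly_export` job are hypothetical names used only for illustration.

```python
import fcntl
import os
import tempfile

# Hypothetical lock path; any location writable by the cron user would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "nightly_export.lock")


def run_locked(job, lock_path=LOCK_PATH):
    """Run job() only if no one else holds the lock; return True if it ran."""
    lock_file = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fail immediately instead of queueing.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # a previous run is still in progress
        job()
        return True
    finally:
        lock_file.close()  # closing the file also releases the lock


def nightly_export():
    # Placeholder for the real work.
    print("exporting...")


if __name__ == "__main__":
    run_locked(nightly_export)
```

A crontab entry along the lines of `0 2 * * * /usr/bin/python3 /opt/jobs/nightly_export.py` (paths are placeholders) would then start it every night at 02:00, and overlapping runs simply exit without doing anything.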

As for the pipeline definition: one goal is to have a notion of pipelines that is both comprehensive and declarative.

As for a database, what would you store there? Container image to run? Past execution data (e.g. output path, time, errors)?
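For the sake of discussion, this is roughly what I would imagine such a store holding. It is only a sketch using SQLite from the Python standard library; the table and column names are made up, and it covers just the items mentioned above (image to run, output path, time, errors).

```python
import json
import sqlite3

# Hypothetical schema: one table for definitions, one for execution history.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pipeline (
    name     TEXT PRIMARY KEY,
    schedule TEXT NOT NULL,   -- cron expression
    image    TEXT NOT NULL,   -- container image to run
    args     TEXT NOT NULL    -- JSON-encoded argument list
);
CREATE TABLE IF NOT EXISTS run (
    id          INTEGER PRIMARY KEY,
    pipeline    TEXT NOT NULL REFERENCES pipeline(name),
    started_at  TEXT NOT NULL,  -- ISO 8601 timestamp
    output_path TEXT,
    error       TEXT            -- NULL when the run succeeded
);
"""


def open_store(path=":memory:"):
    """Open (and if needed initialise) the pipeline store."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def define_pipeline(conn, name, schedule, image, args):
    """Insert a pipeline definition."""
    conn.execute(
        "INSERT INTO pipeline (name, schedule, image, args) VALUES (?, ?, ?, ?)",
        (name, schedule, image, json.dumps(args)),
    )


def record_run(conn, pipeline, started_at, output_path=None, error=None):
    """Append one execution record and return its row id."""
    cur = conn.execute(
        "INSERT INTO run (pipeline, started_at, output_path, error) "
        "VALUES (?, ?, ?, ?)",
        (pipeline, started_at, output_path, error),
    )
    return cur.lastrowid


def failed_runs(conn, pipeline):
    """Return (started_at, error) for every failed run, oldest first."""
    return conn.execute(
        "SELECT started_at, error FROM run "
        "WHERE pipeline = ? AND error IS NOT NULL ORDER BY started_at",
        (pipeline,),
    ).fetchall()
```

Execution history fits a database naturally. Whether the definitions themselves belong there rather than in a version-controlled file is the part I'm less sure about.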

The software world has many pipeline-y things, such as CI definitions, and these definitions usually live in configuration files.

What is difficult at times is tracking which tasks are done. Is the output a file, a new row in some database, many files, many rows, or something else?
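For the simplest case, where the output is a single file, one common convention is "the task is done if its output exists". A minimal sketch of that idea follows; `run_once` and `is_done` are hypothetical helper names, and the write goes through a temporary file so that a crashed run never leaves a half-written output that looks finished.

```python
import os
import tempfile


def is_done(output_path):
    """A task counts as done when its output file exists."""
    return os.path.exists(output_path)


def run_once(output_path, produce):
    """Write produce() to output_path unless it already exists.

    Returns True if the task ran, False if it was skipped.
    """
    if is_done(output_path):
        return False
    directory = os.path.dirname(os.path.abspath(output_path))
    # Write next to the final path so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(produce())
        # The output becomes visible only once it is complete.
        os.replace(tmp_path, output_path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial file, then re-raise
        raise
    return True
```

The same check gets harder for the other cases listed above: with many files or many rows there is no single thing whose existence means "finished", so some explicit marker (a success file, a status row) is usually needed.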