There are too many different tools in the space. I've been heavily researching workflow / ETL frameworks this week, and even after culling the ones that seemed like poor fits, I'm still left with:
The primary problem for me is spending so much time setting up these monsters to do what's basically a set of cron jobs. What's the simplest system out there that can be highly available and deployed with minimal effort?
Another question: I strongly feel that pipeline definitions should live not in code but in the database. I keep coming back to that design pattern every time I start writing my own simple scheduling solution. Is there merit to this idea?
Yes, cron is a bit undervalued: for one-off (well-locked) tasks it's perfectly fine to create a crontab entry. And simplicity is king. People throw frameworks at problems where a simple shell or Go script in a cron job would be enough.
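The "well-locked" part can be done with plain `flock`, so overlapping runs never pile up. A minimal sketch (paths and names are illustrative, not from the thread):

```shell
# Crontab entry: run every 15 minutes, but skip this run entirely
# if the previous one still holds the lock (-n = non-blocking):
# */15 * * * * flock -n /tmp/etl.lock /usr/local/bin/etl.sh >> /var/log/etl.log 2>&1

# The same guard, inlined at the top of the script itself:
exec 9>/tmp/etl.lock            # open (and create) the lock file on fd 9
if ! flock -n 9; then
    echo "previous run still active, exiting" >&2
    exit 0
fi
echo "doing work"               # safe section: at most one instance runs
```

The non-blocking `-n` flag makes an overlapping run exit immediately rather than queue up, which is usually what you want for periodic jobs.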
As for the pipeline definition: one goal is to have a notion of pipelines that is both comprehensive and declarative.
As for a database, what would you store there? Container image to run? Past execution data (e.g. output path, time, errors)?
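To make the question concrete, here is one made-up SQLite sketch splitting the two things you might store: definitions (what to run, on what schedule, in which image) and execution history (output path, timing, errors). All table and column names are assumptions for illustration:

```shell
db=/tmp/pipelines.db
sqlite3 "$db" <<'SQL'
-- What to run: the declarative pipeline definition.
CREATE TABLE IF NOT EXISTS pipelines (
    name     TEXT PRIMARY KEY,
    schedule TEXT NOT NULL,            -- cron expression
    image    TEXT NOT NULL,            -- container image to run
    enabled  INTEGER NOT NULL DEFAULT 1
);
-- What happened: past execution data.
CREATE TABLE IF NOT EXISTS runs (
    id          INTEGER PRIMARY KEY,
    pipeline    TEXT NOT NULL REFERENCES pipelines(name),
    started_at  TEXT NOT NULL,
    finished_at TEXT,
    exit_code   INTEGER,
    output_path TEXT,
    error       TEXT
);
INSERT OR REPLACE INTO pipelines (name, schedule, image)
VALUES ('nightly-export', '0 2 * * *', 'registry.example/export:latest');
SQL
sqlite3 "$db" "SELECT name, schedule FROM pipelines;"
```

Keeping the definition table separate from the run-history table lets you edit pipelines without touching audit data, and makes "show me every failed run of X" a one-line query.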
The software world has many pipeline-y things, such as CI definitions, and those definitions usually live in configuration files.
What is difficult at times is tracking completed tasks. Is the output a file, a new row in some database, many files, many rows, or something else?
If I may ask, what questions do you find most difficult to solve in the context of real-world ETL setups?