Hi there, Burak here. I built an open-source data copy tool called ingestr (https://github.com/bruin-data/ingestr).
I've built quite a few data warehouses, both for companies I worked at and for consultancy projects. One of the more common pain points I observed was that everyone had to rebuild the same data ingestion piece over and over again, and each in a different way:
- some wrote the ingestion code from scratch, to varying degrees
- some used off-the-shelf data ingestion tools like Fivetran or Airbyte
I have always disliked both of these approaches, for different reasons, but never got around to working on what I'd imagine to be the better way forward.
The solutions that required writing code to copy the data carried quite a bit of overhead: how to generalize them, what language/library to use, where to deploy, how to monitor, how to schedule, etc. I ended up figuring out answers for each of these, but the process always felt suboptimal. I like coding, but for more novel problems than copying a table from Postgres to BigQuery. There are libraries like dlt (awesome lib btw, and awesome folks!), but that still required me to write, deploy, and maintain the code.
Then there are solutions like Fivetran or Airbyte, where there's a UI and everything is managed through it. While it was nice not having to write code to copy the data, I still had to either pay an unknown, hard-to-predict amount of money to these vendors or host Airbyte myself, which put me roughly back at square one (since I want to maintain as little tech as possible myself). Nothing was versioned, people changed things in the UI and broke connectors, and what worked yesterday didn't work today.
I had a bit of spare time a couple of weeks ago and wanted to take a stab at the problem. I had been thinking about standardizing the process for quite some time, and dlt had abstractions that let me quickly prototype a CLI that copies data from one place to another. I made a few decisions (that I hope I won't regret in the future):
- everything is a URI: every source and every destination is represented as a URI
- only one thing is copied at a time: a single command copies a single table, not a full database with an unknown number of tables
- incremental loading is a must, but it doesn't have to be super flexible: I decided to support full-refresh, append-only, merge, and delete+insert strategies, because I believe this covers 90% of the use cases out there.
- it is CLI-only and can be configured with flags & environment variables so that it can be automated quickly, e.g. dropped into GitHub Actions and run daily.
The result ended up being `ingestr` (https://github.com/bruin-data/ingestr).
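To make those decisions concrete, a single run looks roughly like this. The URIs and table names below are placeholders, and the README has the full set of flags and accepted values:

```
# copy one table from Postgres into BigQuery, merging on an incremental key
# (placeholder URIs/tables; see the README for the exact flags and options)
ingestr ingest \
  --source-uri 'postgresql://admin:secret@localhost:5432/appdb' \
  --source-table 'public.events' \
  --dest-uri 'bigquery://my-project?credentials_path=/path/to/service_account.json' \
  --dest-table 'raw.events' \
  --incremental-strategy merge \
  --incremental-key updated_at
```

Since everything is just flags and environment variables, the same command can be dropped into a scheduled GitHub Actions workflow (or any cron runner) and run daily.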
I am pretty happy with how the first version turned out, and I plan to add support for more sources & destinations. ingestr is built to be flexible across source and destination combinations, and I plan to introduce more non-DB sources such as Notion, GSheets, and custom APIs that return JSON (which I am not sure exactly how I'll do yet, but I'm open to suggestions!).
To be perfectly clear: I don't think ingestr covers 100% of the data ingestion/copying needs out there, and it doesn't aim to. My goal is to cover the most common scenarios with a decent set of trade-offs, so that they can be solved easily without writing code or managing infra. There will be more complex needs that require engineering effort, and that's fine.
I'd love to hear your feedback on how ingestr could serve your data copying needs better. Looking forward to hearing your thoughts!
Best,
Burak
I've been getting a huge amount of useful work done over the past few years sucking data from other systems into SQLite files on my own computer - I even have my own small db-to-sqlite tool for this (built on top of SQLAlchemy) - https://github.com/simonw/db-to-sqlite
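For example, a typical invocation from its README pulls every table from a Postgres database into a local SQLite file:

```
# copy all tables from a Postgres database into a local SQLite file
db-to-sqlite "postgresql://localhost/myblog" blog.db --all
```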