Show HN: I built an open-source data copy tool called ingestr (github.com/bruin-data)
156 points by karakanb 9 months ago | 48 comments
Hi there, Burak here. I built an open-source data copy tool called ingestr (https://github.com/bruin-data/ingestr)

I have built quite a few data warehouses, both for companies I worked at and for consultancy projects. One of the more common pain points I observed was that everyone had to rebuild the same data ingestion bit over and over again, each in a different way:

- some wrote the ingestion code from scratch, to varying degrees

- some used off-the-shelf data ingestion tools like Fivetran / Airbyte

I have always disliked both of these approaches, for different reasons, but never got around to working on what I'd imagine to be the better way forward.

The solutions that required writing code for copying the data came with quite a bit of overhead: how to generalize them, what language/library to use, where to deploy, how to monitor, how to schedule, etc. I ended up figuring out solutions for each of these, but the process always felt suboptimal. I like coding, but for more novel problems than copying a table from Postgres to BigQuery. There are libraries like dlt (awesome lib btw, and awesome folks!), but that still required me to write, deploy, and maintain the code.

Then there are solutions like Fivetran or Airbyte, where there's a UI and everything is managed through it. While it was nice not having to write code for copying the data, I still had to either pay some unknown/hard-to-predict amount of money to these vendors or host Airbyte myself, which is roughly back to square one (for me, since I want to maintain the least amount of tech myself). Nothing was versioned, people were changing things in the UI and breaking the connectors, and what worked yesterday didn't work today.

I had a bit of spare time a couple of weeks ago and I wanted to take a stab at the problem. I have been thinking of standardizing the process for quite some time already, and dlt had some abstractions that allowed me to quickly prototype a CLI that copies data from one place to another. I made a few decisions (that I hope I won't regret in the future):

- everything is a URI: every source and every destination is represented as a URI

- there can be only one thing copied at a time: it'll copy only a single table within a single command, not a full database with an unknown number of tables

- incremental loading is a must, but doesn't have to be super flexible: I decided to support full-refresh, append-only, merge, and delete+insert incremental strategies, because I believe this covers 90% of the use-cases out there.

- it is CLI-only, and can be configured with flags & env variables so that it can be automated quickly, e.g. drop it into GitHub Actions and run it daily.

The result ended up being `ingestr` (https://github.com/bruin-data/ingestr).
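
To give a rough idea, a run looks something like this (the connection details and table names here are placeholders, and the exact flag names are in the README, so double-check there):

    # copy a single Postgres table into BigQuery; both sides are plain URIs
    ingestr ingest \
        --source-uri 'postgresql://admin:admin@localhost:5432/mydb' \
        --source-table 'public.events' \
        --dest-uri 'bigquery://my-project?credentials_path=/path/to/service_account.json' \
        --dest-table 'raw.events'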

I am pretty happy with how the first version turned out, and I plan to add support for more sources & destinations. ingestr is built to be flexible with various source and destination combinations, and I plan to introduce more non-DB sources such as Notion, GSheets, and custom APIs that return JSON (which I am not sure exactly how I'll do yet, but I'm open to suggestions!).

To be perfectly clear: I don't think ingestr covers 100% of the data ingestion/copying needs out there, and it doesn't aim to. My goal is to cover the most common scenarios with a decent set of trade-offs, so that they can be solved easily without having to write code or manage infra. There will be more complex needs that require engineering effort by others, and that's fine.

I'd love to hear your feedback on how ingestr can serve data-copying needs better. Looking forward to hearing your thoughts!

Best, Burak




I was surprised to see SQLite listed as a source but not as a destination. Any big reasons for that or is it just something you haven't got around to implementing yet?

I've been getting a huge amount of useful work done over the past few years sucking data from other systems into SQLite files on my own computer - I even have my own small db-to-sqlite tool for this (built on top of SQLAlchemy) - https://github.com/simonw/db-to-sqlite


I do use the dlt library to support as many sources & destinations as possible, and they do not support SQLite as of today. I am interested in supporting SQLite simply because I love it as well, so that's definitely on the roadmap.

db-to-sqlite looks lovely, I'll see if I can learn a thing or two from it!


looks like dlt (which this is a wrapper around) doesn't support it as a destination

https://dlthub.com/docs/dlt-ecosystem/destinations/


one of the dltHub founders here - we aim to address this in the coming weeks


I used sqlite-utils to create a tool that can merge and split SQLite files:

https://github.com/chapmanjacobd/library?tab=readme-ov-file#...


Firstly, congrats :) (Generalized) ingestion is a very hard problem because any abstraction you come up with will always have some limitations where you might need to fall back to writing code with full access to the 3rd-party APIs. But in some cases generalized ingestion is definitely much better than re-writing the same ingestion piece, especially for complex connectors. Take a look at CloudQuery (https://github.com/cloudquery/cloudquery), an open-source, high-performance ELT framework powered by Apache Arrow (so you can write plugins in any language). (Maintainer here)


couldn't agree more! I see ingestr more as a common-scenario solution rather than a general solution that covers all cases, kinda like how I reach for shell one-liners instead of writing an application in another language. I guess there's space for both approaches.

I'll definitely take a look at CloudQuery, thanks a lot for sharing!


Hi Burak. I have been testing ingestr using a source and destination Postgres database. What I'm trying to do is copy data from my Prod database to my Test database. I find that when using replace I get additional dlt columns added to the tables as hints. It also does not work for a defined primary key, only natural keys, and composite keys do not work. Can you tell me the basic, minimal functionality that it supports? I would love to use it to keep our Prod and Test databases in sync, but it appears that the functionality I need is not there. Thanks very much.


Hi there, thanks a lot for your comment and for trying it out. Do you mind joining our Slack community via the link in the readme or creating a GitHub issue so that we can dive into this? I'd love to understand what doesn't work and provide fixes.


This looks pretty cool! What was the hardest part about building this?


hey, thanks!

I guess there were a couple of things that I found as tricky:

- deciding on the right way to represent sources and destinations was hard; before landing on URIs I thought of using config files, but that would have added extra complexity

- the platforms had different quirks concerning different data types

- dlt stores state on its own, which means that re-runs don't start from scratch after changing the incremental strategy and instead require a full refresh; it took me quite some time to figure out exactly how to work with that

I think among these the hardest part was to get myself to build and release it, because I had it in my mind for a long time and it took me a _long while_ to build and share it :)


Do you think you'll add local file support in the future? Also, do you have any plans to make reading from a source parallel? For example, connectorx uses an optional partition column to read chunks of a table concurrently. Cool how it's abstracted.


I have just released v0.1.2 which supports CSV destinations with the URI format `csv://path/to/file.csv`, hope that's helpful!
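
Something along these lines should dump a table to a local file (placeholder names; the exact flags are documented in the README):

    # export a single Postgres table to a local CSV file
    ingestr ingest \
        --source-uri 'postgresql://admin:admin@localhost:5432/mydb' \
        --source-table 'public.events' \
        --dest-uri 'csv://exports/events.csv' \
        --dest-table 'events'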


I am working on file support right now as a destination to begin with. I believe I should get local files as well as S3-compatible sources going by tonight.

Reading the sources in parallel is an interesting idea, I'll definitely take a look at it. ingestr supports incremental loads by a partitioned column, but there's no parallelized partition reading at the moment.
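
For context, an incremental run looks roughly like this (placeholder names again; the exact flag names and supported strategies are listed in the README):

    # append-only incremental load keyed on an updated_at column
    ingestr ingest \
        --source-uri 'postgresql://admin:admin@localhost:5432/mydb' \
        --source-table 'public.events' \
        --dest-uri 'duckdb:///local.duckdb' \
        --dest-table 'raw.events' \
        --incremental-strategy 'append' \
        --incremental-key 'updated_at'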

Thanks a lot for your comment!


I second this!


Looks interesting. Clickhouse seems to be conspicuously missing as a source and destination, although I suppose Clickhouse can masquerade as Postgres: https://clickhouse.com/docs/en/interfaces/postgresql

Edit: there's an issue already: https://github.com/bruin-data/ingestr/issues/1


I am very interested in data ingestion. I develop a desktop data wrangling tool in C++ (Easy Data Transform). So far it can import files in various formats (CSV, Excel, JSON, XML, etc.), but I am interested in being able to import from databases, APIs, and other sources. Would I be able to ship your CLI as part of my product on Windows and Mac? Or can someone suggest another approach to importing from lots of data sources without coding them all individually?


hmm, that's an interesting question, I don't know the answer to be honest. are you able to run external scripts on the device? if so, you might be able to install & run ingestr with a CSV destination (which I released literally 2 mins ago), but that seems like a lot of work as well, and will probably be way slower than your C++ application.

Maybe someone else has another idea?


I can start a CLI as a separate process. But ingesting to CSV and then reading the CSV would be slow. Maybe it would be better to ingest into DuckDB or in memory in Arrow memory format. If anyone has any other suggestions, I am all ears.


I like the idea of encoding complex connector configs into URIs!


Perhaps OP re-invented it, but it's been around for a long time in the Java world via JDBC URLs. See, for example, this writeup: https://www.baeldung.com/java-jdbc-url-format


I don't think I invented anything tbh, I just relied on SQLAlchemy's URI formats, and I decided to abuse it slightly for even more config.


Glad to hear that! I am not 100% sure if it’ll look pretty for all platforms but I hope it’ll be an okay base to get started!


This looks awesome. I had this exact problem just last week and had to write my own tool in Go to perform the migration. After creating the tool I thought this must be something others would use - glad to see someone beat me to it!

I think it's clever to keep the tool simple and only copy one table at a time. My solution was to generate code based on an SQL schema, but it was going to be messy and require more user introspection before the tool could be run.


thanks a lot for your comment, glad to hear we converged on a similar idea! :)


This looks pretty cool. Is there any schema management included or do schema changes need to be in place on both sides first?


It does handle schema evolution wherever it can, including inferring the initial schema automatically from the source and destination, which means there's no need for manual schema changes anywhere; it will keep the two sides in sync wherever possible.


Any thoughts on how this compares to Meltano and their Singer SDK? We use it at $DAYJOB because it gives us a great hybrid: enough standardization that we don't have to treat things differently downstream, while still letting us customize.


If you can add CSV as both a source and a destination, it will increase the usefulness of this product manifold.

There are many instances where people either have a CSV that they want to load into a database, or want a specific database table exported to CSV.


I have just released v0.1.2 which supports CSV destinations with the URI format `csv://path/to/file.csv`, hope that's helpful!


I agree, I am looking into it right now!


Similarly, Google Sheets might also be a popular endpoint.


on it!


Also released local CSV files as a source in v0.1.3.


Looks really interesting and definitely a use case I face over and over again. The name just breaks my brain, I want it to be an R package but it’s Python. Just gives me a mild headache.


Looks great, Burak! Appreciate your contribution to the open-source data ecosystem!


thanks a lot Peter!


Is there a reason CSV (as a source) isn't supported? I've been looking for exactly this type of tool, but that supports CSV.

CSV support would be huge.

Please please please provide CSV support. :)


I released v0.1.2 with the CSV destination literally minutes ago, I'll take a look at CSV as a source!

just so that I have a better understanding, do you mind explaining your use case?


I have just released support for local CSV files as a source in v0.1.3, let me know if this helps! :)


Sweet! Will take a look at this immediately!


Hi Burak, I saw cx_Oracle in the requirements.txt but the support matrix did not mention it. Does this mean Oracle support is coming, or is it a typo?


I added it as an experimental source a few hours ago, but I haven't had the chance to test it, which is why I haven't put it into the support matrix yet. Do you mind trying it out if you do use Oracle?


I'd love to see support for ODBC, any plans?


Do you mean SQL Server? If that's the case, ingestr is already able to connect to Microsoft SQL Server and use it both as a source and a destination.
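
The URI follows the usual SQLAlchemy-style format, roughly like this (placeholder credentials, and the driver query parameter depends on which ODBC driver is installed on your machine):

    # example SQL Server connection URI, usable as the source or destination URI
    mssql://user:password@host:1433/mydb?driver=ODBC+Driver+18+for+SQL+Server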


DB2, as if it's not an existing DB in the real world.


it's on my roadmap for sure!



