Show HN: I built an open-source data copy tool called ingestr (github.com/bruin-data)
156 points by karakanb 9 months ago | 48 comments
Hi there, Burak here. I built an open-source data copy tool called ingestr (https://github.com/bruin-data/ingestr)

I have built quite a few data warehouses, both for companies I worked at and for consultancy projects. One of the more common pain points I observed was that everyone had to rebuild the same data ingestion bit over and over again, each in a different way:

- some wrote the ingestion code from scratch, to varying degrees

- some used off-the-shelf data ingestion tools like Fivetran / Airbyte

I have always disliked both of these approaches, for different reasons, but never got around to working on what I'd imagine to be the better way forward.

The solutions that required writing code for copying the data came with quite a bit of overhead: how to generalize them, what language/library to use, where to deploy, how to monitor, how to schedule, etc. I ended up figuring out solutions for each of these, but the process always felt suboptimal. I like coding, but for more novel problems than copying a table from Postgres to BigQuery. There are libraries like dlt (awesome lib btw, and awesome folks!), but that still required me to write, deploy, and maintain the code.

Then there are solutions like Fivetran or Airbyte, where there's a UI and everything is managed through it. While it was nice not having to write code for copying the data, I still had to either pay some unknown/hard-to-predict amount of money to these vendors or host Airbyte myself, which is roughly back to square one (for me, since I want to maintain the least amount of tech myself). Nothing was versioned, people were changing things in the UI and breaking the connectors, and what worked yesterday didn't work today.

I had a bit of spare time a couple of weeks ago and I wanted to take a stab at the problem. I have been thinking of standardizing the process for quite some time already, and dlt had some abstractions that allowed me to quickly prototype a CLI that copies data from one place to another. I made a few decisions (that I hope I won't regret in the future):

- everything is a URI: every source and every destination is represented as a URI

- there can be only one thing copied at a time: it'll copy only a single table within a single command, not a full database with an unknown number of tables

- incremental loading is a must, but doesn't have to be super flexible: I decided to support full-refresh, append-only, merge, and delete+insert incremental strategies, because I believe this covers 90% of the use-cases out there.

- it is CLI-only, and can be configured with flags & env variables so that it can be automated quickly, e.g. drop it into GitHub Actions and run it daily.

The result ended up being `ingestr` (https://github.com/bruin-data/ingestr).
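
To give a rough idea, a run looks something like this (the connection details and table names here are placeholders, and the exact flag names are in the README, so double-check there):

    # copy a single Postgres table into BigQuery; both sides are plain URIs
    ingestr ingest \
        --source-uri 'postgresql://admin:admin@localhost:5432/mydb' \
        --source-table 'public.events' \
        --dest-uri 'bigquery://my-project?credentials_path=/path/to/service_account.json' \
        --dest-table 'raw.events'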

I am pretty happy with how the first version turned out, and I plan to add support for more sources & destinations. ingestr is built to be flexible with various source and destination combinations, and I plan to introduce more non-DB sources such as Notion, GSheets, and custom APIs that return JSON (which I am not sure exactly how I'll do yet, but I'm open to suggestions!).

To be perfectly clear: I don't think ingestr covers 100% of the data ingestion/copying needs out there, and it doesn't aim to. My goal is to cover the most common scenarios with a decent set of trade-offs, so that they can be solved easily without having to write code or manage infra. There will be more complex needs that require engineering effort by others, and that's fine.

I'd love to hear your feedback on how ingestr can serve data-copying needs better. Looking forward to hearing your thoughts!

Best, Burak




I was surprised to see SQLite listed as a source but not as a destination. Any big reasons for that or is it just something you haven't got around to implementing yet?

I've been getting a huge amount of useful work done over the past few years sucking data from other systems into SQLite files on my own computer - I even have my own small db-to-sqlite tool for this (built on top of SQLAlchemy) - https://github.com/simonw/db-to-sqlite


I do use the dlt library to support as many sources & destinations as possible, and they do not support SQLite as of today. I am interested in supporting SQLite simply because I love it as well, so that's definitely on the roadmap.

db-to-sqlite looks lovely, I'll see if I can learn a thing or two from it!


looks like dlt (which this is a wrapper around) doesn't support it as a destination

https://dlthub.com/docs/dlt-ecosystem/destinations/


one of the dltHub founders here - we aim to address this in the coming weeks


I used sqlite-utils to create a tool that can merge and split SQLite files:

https://github.com/chapmanjacobd/library?tab=readme-ov-file#...


Firstly, congrats :) (Generalized) ingestion is a very hard problem because any abstraction you come up with will always have some limitations where you might need to fall back to writing code with full access to the 3rd-party APIs. But in some cases generalized ingestion is definitely much better than re-writing the same ingestion piece, especially for complex connectors. Take a look at CloudQuery (https://github.com/cloudquery/cloudquery), an open-source, high-performance ELT framework powered by Apache Arrow (so you can write plugins in any language). (Maintainer here)


couldn't agree more! I see ingestr more as a common-scenario solution rather than a general solution that covers all cases, kinda like how I reach for shell one-liners instead of writing an application in another language. I guess there's space for both approaches.

I'll definitely take a look at CloudQuery, thanks a lot for sharing!


Hi Burak. I have been testing ingestr using a source and destination Postgres database. What I'm trying to do is copy data from my Prod database to my Test database. I find that when using replace I get additional dlt columns added to the tables as hints. It also does not work for a defined primary key, only natural keys, and composite keys do not work. Can you tell me the basic, minimal functionality that it supports? I would love to use it to keep our Prod and Test databases in sync, but it appears that the functionality I need is not there. Thanks very much.


Hi there, thanks a lot for your comment and for trying it out. Do you mind joining our Slack community via the link in the readme or creating a GitHub issue so that we can dive into this? I'd love to understand what doesn't work and provide fixes.


This looks pretty cool! What was the hardest part about building this?


hey, thanks!

I guess there were a couple of things that I found as tricky:

- deciding on the right way to represent sources and destinations was hard; before landing on URIs I thought of using config files, but that would have added extra complexity

- the platforms had different quirks concerning different data types

- dlt stores state on its own, which means that re-runs don't start from scratch after changing the incremental strategy and instead require a full refresh; it took me quite some time to figure out exactly how to work with that

I think among these the hardest part was to get myself to build and release it, because I had it in my mind for a long time and it took me a _long while_ to build and share it :)


Do you think you'll add local file support in the future? Also, do you have any plans to make reading from a source parallel? For example, connectorx uses an optional partition column to read chunks of a table concurrently. Cool how it's abstracted.


I have just released v0.1.2 which supports CSV destinations with the URI format `csv://path/to/file.csv`, hope that's helpful!
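
Something along these lines should dump a table to a local file (placeholder names; the exact flags are documented in the README):

    # export a single Postgres table to a local CSV file
    ingestr ingest \
        --source-uri 'postgresql://admin:admin@localhost:5432/mydb' \
        --source-table 'public.events' \
        --dest-uri 'csv://exports/events.csv' \
        --dest-table 'events'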


I am working on file support right now as a destination to begin with. I believe I should get local files as well as S3-compatible sources going by tonight.

Reading the sources in parallel is an interesting idea, I'll definitely take a look at it. ingestr supports incremental loads by a partitioned column, but there's no parallelized partition reading at the moment.
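
For context, an incremental run looks roughly like this (placeholder names again; the exact flag names and supported strategies are listed in the README):

    # append-only incremental load keyed on an updated_at column
    ingestr ingest \
        --source-uri 'postgresql://admin:admin@localhost:5432/mydb' \
        --source-table 'public.events' \
        --dest-uri 'duckdb:///local.duckdb' \
        --dest-table 'raw.events' \
        --incremental-strategy 'append' \
        --incremental-key 'updated_at'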

Thanks a lot for your comment!


I second this!


Looks interesting. Clickhouse seems to be conspicuously missing as a source and destination, although I suppose Clickhouse can masquerade as Postgres: https://clickhouse.com/docs/en/interfaces/postgresql

Edit: there's an issue already: https://github.com/bruin-data/ingestr/issues/1


I am very interested in data ingestion. I develop a desktop data wrangling tool in C++ (Easy Data Transform). So far it can import files in various formats (CSV, Excel, JSON, XML, etc.), but I am interested in being able to import from databases, APIs, and other sources. Would I be able to ship your CLI as part of my product on Windows and Mac? Or can someone suggest another approach to importing from lots of data sources without coding them all individually?


hmm, that's an interesting question, I don't know the answer to be honest. are you able to run external scripts on the device? if so, you might be able to install & run ingestr with a CSV destination (which I released literally 2 mins ago), but that seems like a lot of work as well, and will probably be way slower than your C++ application.

Maybe someone else has another idea?


I can start a CLI as a separate process. But ingesting to CSV and then reading the CSV would be slow. Maybe it would be better to ingest into DuckDB or in memory in Arrow memory format. If anyone has any other suggestions, I am all ears.


I like the idea of encoding complex connector configs into URIs!


Perhaps OP re-invented it, but it's been around for a long time in the Java world via JDBC URLs. See, for example, this writeup: https://www.baeldung.com/java-jdbc-url-format


I don't think I invented anything tbh, I just relied on SQLAlchemy's URI formats, and I decided to abuse it slightly for even more config.


Glad to hear that! I am not 100% sure if it’ll look pretty for all platforms but I hope it’ll be an okay base to get started!


This looks awesome. I had this exact problem just last week and had to write my own tool in Go to perform the migration. After creating the tool I thought this must be something others would use - glad to see someone beat me to it!

I think it's clever to keep the tool simple and only copy one table at a time. My solution was to generate code based on an SQL schema, but it was going to be messy and require more user introspection before the tool could be run.


thanks a lot for your comment, glad to hear we converged on a similar idea! :)


This looks pretty cool. Is there any schema management included or do schema changes need to be in place on both sides first?


It does handle schema evolution wherever it can, including inferring the initial schema automatically from the source and destination, which means there's no need for manual schema changes anywhere; it will keep the two sides in sync wherever possible.


Any thoughts on how this compares to Meltano and their Singer SDK? We use it at $DAYJOB because it gives us a great hybrid: enough standardization that we don't have to treat things differently downstream, while still letting us customize.


If you can add CSV as both a source and a destination, it will increase the usefulness of this product manifold.

There are many instances where people either have a CSV that they want to load into a database, or want a specific database table exported to CSV.


I have just released v0.1.2 which supports CSV destinations with the URI format `csv://path/to/file.csv`, hope that's helpful!


I agree, I am looking into it right now!


Similarly, Google Sheets might also be a popular endpoint.


on it!


Also released local CSV files as a source in v0.1.3.


Looks really interesting and definitely a use case I face over and over again. The name just breaks my brain, I want it to be an R package but it’s Python. Just gives me a mild headache.


Looks great, Burak! Appreciate your contribution to the open-source data ecosystem!


thanks a lot Peter!


Is there a reason CSV (as a source) isn't supported? I've been looking for exactly this type of tool, but that supports CSV.

CSV support would be huge.

Please please please provide CSV support. :)


I released v0.1.2 with the CSV destination literally minutes ago, I'll take a look at CSV as a source!

just so that I have a better understanding, do you mind explaining your use case?


I have just released support for local CSV files as a source in v0.1.3, let me know if this helps! :)


Sweet! Will take a look at this immediately!


Hi Burak, I saw cx_Oracle in the requirements.txt but the support matrix did not mention it. Does this mean Oracle support is coming, or is it a typo?


I added it as an experimental source a few hours ago, but I haven't had the chance to test it, which is why I haven't put it into the support matrix yet. Do you mind trying it out if you do use Oracle?


I'd love to see support for ODBC, any plans?


Do you mean SQL Server? If that's the case, ingestr is already able to connect to Microsoft SQL Server and use it both as a source and a destination.
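
The URI follows the usual SQLAlchemy-style format, roughly like this (placeholder credentials, and the driver query parameter depends on which ODBC driver is installed on your machine):

    # example SQL Server connection URI, usable as the source or destination URI
    mssql://user:password@host:1433/mydb?driver=ODBC+Driver+18+for+SQL+Server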


DB2, as if it's not an existing DB in the real world.


it's on my roadmap for sure!



