DuckDB 1.0.0 (duckdb.org)
187 points by nnx 9 months ago | 31 comments



> DuckDB Labs, the company that employs DuckDB’s core contributors, has not had any outside investments, and as a result, the company is fully owned by the team. Labs’ business model is to provide consulting and support services for DuckDB, and we’re happy to report that this is going well. With the revenue from contracts, we fund long-term and strategic DuckDB development with a team of almost 20 people. At the same time, the intellectual property in the project is guarded by the independent DuckDB Foundation. This non-profit foundation ensures that DuckDB will be around long-term under the MIT license.

This seems like an excellent structure for long-term protection of the open source project. What other projects have taken this approach?


I thought this exact thing after reading the post. I can't imagine a better, practical structure:

DuckDB Labs: The core contributors. Instead of developing features that will be behind a paywall, they provide support and consulting. Awesome.

DuckDB Foundation: A non-profit that ensures DuckDB remains MIT licensed. Perfect.

We actually just published a post on how to replace your warehouse with DuckDB. It's certainly not a good move for every company using something like Snowflake, but it was the right move for us.

https://www.definite.app/blog/duckdb-datawarehouse


How do you ensure shares don't get diluted as people leave and new ones get hired?

I remember this was an issue at an ESOP I worked for. They had to pressure former employees and buy back shares from them. If you have too many shareholders in the US, your corporate status changes.

Granted, this is a nonprofit, so maybe the rules are different, but the fundamental problem remains.


A 10% option pool, set aside at the beginning, should help resolve this, but employees also need to be educated that dilution is just part of life in a startup.


Do any data scientists here use duckdb daily? Keen to hear your experiences and comparisons to other tools you used before it.

I love tools that make life simpler. I've been toying with the idea of storing 1TB of data in S3 and querying it using duckdb on an EC2. That's really old/boring infrastructure, but it's hugely appealing to me, since it's so much simpler than what I currently use.
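
Something like this is roughly what I have in mind (untested sketch; the bucket, region, and column names are placeholders, and it assumes the data is stored as Parquet):

  import duckdb

  con = duckdb.connect()                      # in-memory database on the EC2 box
  con.sql("INSTALL httpfs;")                  # HTTP/S3 support
  con.sql("LOAD httpfs;")
  con.sql("SET s3_region = 'us-east-1';")     # plus credentials via the usual s3_* settings

  # query the data in place; only the needed columns/row groups are fetched
  top = con.sql("""
      SELECT user_id, count(*) AS events
      FROM read_parquet('s3://my-bucket/events/*.parquet')
      GROUP BY user_id
      ORDER BY events DESC
      LIMIT 10
  """).df()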

Would love to hear of others' experiences with duckdb.


We just wrote a post[0] very similar to what you're thinking. Let me know if you have any questions.

0 - https://www.definite.app/blog/duckdb-datawarehouse


Storing the data as Parquet and using Hive path partitioning, you can get passable batch performance, assuming your queries are not mostly scans and aggregates across large portions of the data; regex matching on columns takes on the order of ten seconds, for example.
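
For example (a rough sketch; the path layout and column names are invented), the year=/month= directories become filterable columns, so non-matching partitions are never read:

  import duckdb

  duckdb.sql("""
      SELECT event_type, count(*) AS n
      FROM read_parquet('s3://bucket/events/year=*/month=*/*.parquet',
                        hive_partitioning = true)
      WHERE year = 2024 AND month = 5   -- partition pruning happens here
      GROUP BY event_type
  """)   # assumes the httpfs extension is available for s3:// paths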

I thought it was a great way to give analysts direct access to the data for ad hoc queries while they were getting familiar with it.


We use it for simple use-cases and experimentation, especially because it works so well with various data formats and polars. Personally, I prefer to maintain proper SQL queries against duckdb including re-usable views, and then use polars for the remaining fiddling/exploration.
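
A sketch of what that looks like (table and file names are made up; .pl() hands the result to polars, which needs to be installed):

  import duckdb

  con = duckdb.connect("analytics.duckdb")
  con.sql("""
      CREATE OR REPLACE VIEW daily_orders AS
      SELECT order_date, sum(amount) AS revenue
      FROM read_parquet('data/orders/*.parquet')
      GROUP BY order_date
  """)

  # reusable view in SQL, remaining fiddling in polars
  df = con.sql("SELECT * FROM daily_orders WHERE revenue > 1000").pl()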

In combination with sqlalchemy, it is trivial to lift an app to other OLAP systems.
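
A sketch of that, assuming the third-party duckdb_engine dialect for SQLAlchemy; the view name reuses the example above, and moving to another OLAP backend then mostly means swapping the connection URL:

  from sqlalchemy import create_engine, text

  engine = create_engine("duckdb:///analytics.duckdb")   # provided by the duckdb_engine package
  with engine.connect() as conn:
      rows = conn.execute(text("SELECT * FROM daily_orders LIMIT 5")).fetchall()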


Wouldn't that be pretty slow? :) Why not store it on the EC2 instance directly? Or locally on your computer?

1 TB is... not a lot of data. I rent a server with 15 TB for ~$50 per month, and a new 2 TB disk is less than $100.


DuckDB supports partial reading of Parquet files (also via HTTPS and S3) [1], so it can limit the scans to the required columns in the Parquet file. It can also perform filter pushdown, so querying data in S3 can be quite efficient.

Disclaimer: I work at DuckDB Labs.

[1] https://duckdb.org/docs/data/parquet/overview#partial-readin...
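
A small illustration of what that enables (sketch; the URL and column names are placeholders):

  import duckdb

  # only the referenced columns are downloaded, and row groups whose min/max
  # statistics exclude status = 'failed' are skipped entirely
  duckdb.sql("""
      SELECT status, avg(amount) AS avg_amount
      FROM read_parquet('https://example.com/orders.parquet')
      WHERE status = 'failed'
      GROUP BY status
  """)   # https:// range reads go through the httpfs extension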


Does DuckDB support partial reading of .duckdb files hosted externally?


oooh, cool


> Why not store it on the EC2 instance directly?

Since the data in S3 receives updates every few hours, querying it directly ensures queries are on up-to-date data, whereas creating a copy of the dataset on the EC2 would necessitate periodically checking for updates and moving the deltas across to the EC2 (not insurmountable; but complexity worth avoiding if possible).


Ah, I see. I thought you decided to put it there (also, what the sibling said makes it a better idea).


You should give us at MotherDuck a try for this use case.

(Co-founder)


I suggest you try using StarRocks to query the data lake directly. I know many large companies, including Tencent and Pinterest, are doing this. StarRocks has a truly mature vectorized execution engine and a robust distributed architecture. It can provide you with impressive query performance.


One cool feature of duckdb is that you can directly run SQL against a pandas dataframe/arrow table.[1] The seamless integration is amazing.

[1]: https://duckdb.org/docs/api/python/overview.html#dataframes
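
Minimal sketch of that:

  import duckdb
  import pandas as pd

  df = pd.DataFrame({"city": ["NYC", "SF", "NYC"], "sales": [10, 20, 30]})

  # the local variable `df` is picked up by name via a replacement scan
  duckdb.sql("SELECT city, sum(sales) AS total FROM df GROUP BY city").show()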


Congrats to the team! I feel like I see lots of posts here on HN and go "wow, I didn't know DuckDB could do that". It seems like a very powerful tool, which I haven't had the pleasure of using yet.

Due to policies at work it's unlikely we would use this in production, but as I understand it, it's still pretty useful for exploring and poking around local data. Is that right? Does anyone have examples of problems they've used it on to digest local files or logs or something?
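
(As a sketch of the kind of thing I mean, with made-up file names: point DuckDB straight at local files and let it infer the schema.)

  import duckdb

  duckdb.sql("SELECT * FROM 'exports/*.csv' LIMIT 10")
  duckdb.sql("""
      SELECT status, count(*) AS n
      FROM read_json_auto('logs/access-*.json')
      GROUP BY status
      ORDER BY n DESC
  """)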


Have they fixed the incredibly slow queries on indexed columns?

https://www.lukas-barth.net/blog/sqlite-duckdb-benchmark/


Howdy! Thanks for your benchmarking!

Your blog does a great job contrasting the two use cases. I don't think too much has changed for your main use case, but here are a few ideas to test out!

DuckDB can read SQLite files now! So if you like DuckDB syntax or query optimization, but want to use the SQLite format / indexes, that may work well.

Since DuckDB is columnar (and compressed), it frequently needs to read a big chunk of rows (~100K) just to get 1 row out and decompressed. Mind trying to store your data uncompressed? Might help in your case! (PRAGMA force_compression='uncompressed')

Links: https://duckdb.org/docs/extensions/sqlite
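
Rough sketch of both ideas (file and table names are placeholders):

  import duckdb

  con = duckdb.connect("bench.duckdb")
  con.sql("INSTALL sqlite;")
  con.sql("LOAD sqlite;")
  con.sql("ATTACH 'mydata.sqlite' AS legacy (TYPE sqlite)")   # reuse the SQLite file and its indexes
  con.sql("PRAGMA force_compression = 'uncompressed'")        # store new DuckDB tables uncompressed
  con.sql("CREATE TABLE measurements AS SELECT * FROM legacy.measurements")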


> The differences between DuckDB and SQLite are just always so large that plotting them together completely hides all other detail.

(From your blog post). When values span orders of magnitude, that’s when log plots are useful.
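
e.g. with matplotlib (the numbers are made up, just to show that values orders of magnitude apart stay visible):

  import matplotlib.pyplot as plt

  fig, ax = plt.subplots()
  ax.bar(["SQLite (indexed lookup)", "DuckDB (scan)"], [0.0004, 0.9])
  ax.set_yscale("log")               # log axis keeps both bars readable
  ax.set_ylabel("query time (s)")
  plt.show()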


They say they have lots of benchmarks running for it, so it might be a good idea to add a similar benchmark there directly, so it can be tracked?

By the way, I like your blog's style! Even the html of an article is clean and readable.


I've been wanting to explore using DuckDB for in-process aggregation and windowing in stream processing with Golang, as I think it would be a great solution.

Curious if anyone else is using DuckDB for something similar? Does anyone have an example?


I haven't had a chance to really use it yet, but I know duckdb is in my future: being able to connect it to all the different data sources to run analytical queries, plus the support for parquet.


This seems like a good model for sustaining open source, but raises some questions.

Does anybody know how the DuckDB foundation works? The sponsors are MotherDuck, Voltron, and Posit, which are heavily venture-funded. Do DuckDB Lab employees work on sponsored projects for the foundation?

I am also curious if anyone can shed light on what kind of contract work DuckDB does to align its work with the open source project. This has always seemed like the holy grail, but it is difficult to do in practice.


May I ask, what is the value of using DuckDB vs. loading data from parquet directly into pandas? Apart from the fact that with duckdb you can load part of the data rather than the entire file into memory?


The main value proposition is being able to query the data in SQL. Personally I find that more ergonomic than Pandas for a lot of tasks.

It also works seamlessly with pandas in Jupyter notebooks. You can query dataframes directly and store query results back into dataframes for plotting, etc.
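
For instance (sketch; the file and column names are made up):

  import duckdb
  import pandas as pd

  trips = pd.read_parquet("trips.parquet")
  daily = duckdb.sql("""
      SELECT date_trunc('day', pickup_at) AS day, count(*) AS n
      FROM trips GROUP BY 1 ORDER BY 1
  """).df()                  # result comes back as a pandas DataFrame
  daily.plot(x="day", y="n")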


Amazing. So glad this project came along!


Does DuckDB have any graph capabilities?


I've used it along with jupyter and it worked pretty well.


not yet



