> DuckDB Labs, the company that employs DuckDB’s core contributors, has not had any outside investments, and as a result, the company is fully owned by the team. Labs’ business model is to provide consulting and support services for DuckDB, and we’re happy to report that this is going well. With the revenue from contracts, we fund long-term and strategic DuckDB development with a team of almost 20 people. At the same time, the intellectual property in the project is guarded by the independent DuckDB Foundation. This non-profit foundation ensures that DuckDB will be around long-term under the MIT license.
This seems like an excellent structure for long-term protection of the open source project. What other projects have taken this approach?
I thought the exact same thing after reading the post. I can't imagine a more practical structure:
DuckDB Labs: The core contributors. Instead of developing features that will be behind a paywall, they provide support and consulting. Awesome.
DuckDB Foundation: A non-profit that ensures DuckDB remains MIT licensed. Perfect.
We actually just published a post on how to replace your warehouse with DuckDB. It's certainly not a good move for every company using something like Snowflake, but it was the right move for us.
How do you ensure shares don't get diluted as people leave and new ones get hired?
I remember this being an issue at an ESOP I worked for. They had to pressure former employees into selling their shares back. In the US, if you have too many shareholders your corporate status changes.
Granted, this is a nonprofit, so maybe the rules are different, but the fundamental problem remains.
A 10% option pool, set aside at the beginning, should help resolve this, but employees also need to be educated that dilution is just part of life in a startup.
Do any data scientists here use duckdb daily? Keen to hear your experiences and comparisons to other tools you used before it.
I love tools that make life simpler. I've been toying with the idea of storing 1TB of data in S3 and querying it using duckdb on an EC2. That's really old/boring infrastructure, but is hugely appealing to me, since it's so much simpler than what I currently use.
Would love to hear of others' experiences with duckdb.
Storing the data as Parquet and using hive path partitioning, you can get passable batch performance, assuming your queries are not mostly scans and aggregates across large portions of the data. On the order of ten seconds for regex matching on columns, for example.
I thought it was a great way to give analysts direct access to data to do adhoc queries while they were getting familiar with the data.
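A minimal sketch of that kind of setup, assuming a hypothetical hive-partitioned layout like s3://my-bucket/events/year=2024/month=06/part-0.parquet (the bucket, paths, and column names are all made up):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # only needed if the files live on S3

# Partition columns (year, month) come from the directory names; filtering on
# them prunes whole directories before any Parquet file is opened.
result = con.execute("""
    SELECT user_id, count(*) AS n
    FROM read_parquet('s3://my-bucket/events/*/*/*.parquet', hive_partitioning = true)
    WHERE year = '2024' AND month = '06'
      AND regexp_matches(url, 'checkout|cart')   -- column regex filter, as mentioned above
    GROUP BY user_id
""").df()
print(result)
```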
We use it for simple use-cases and experimentation, especially because it works so well with various data formats and polars. Personally, I prefer to maintain proper SQL queries against duckdb including re-usable views, and then use polars for the remaining fiddling/exploration.
In combination with sqlalchemy, it is trivial to lift an app to other OLAP systems.
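A small sketch of that workflow, with hypothetical file and view names; the `.pl()` conversion assumes a reasonably recent duckdb Python package and polars installed:

```python
import duckdb
import polars as pl

con = duckdb.connect("analytics.duckdb")  # hypothetical database file

# A reusable view kept in the DuckDB database, defined once as proper SQL
con.execute("""
    CREATE OR REPLACE VIEW daily_orders AS
    SELECT order_date, sum(amount) AS revenue
    FROM read_parquet('orders/*.parquet')
    GROUP BY order_date
""")

# Pull the result into polars for the remaining fiddling/exploration
df: pl.DataFrame = con.sql("SELECT * FROM daily_orders ORDER BY order_date").pl()
print(df.head())
```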
DuckDB supports partial reading of Parquet files (also via HTTPS and S3) [1], so it can limit the scans to the required columns in the Parquet file. It can also perform filter pushdown, so querying data in S3 can be quite efficient.
Since the data in S3 receives updates every few hours, querying it directly ensures queries run on up-to-date data, whereas creating a copy of the dataset on the EC2 would necessitate periodically checking for updates and moving the deltas across to the EC2 (not insurmountable, but complexity worth avoiding if possible).
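A rough sketch of querying Parquet on S3 directly, with made-up bucket, region, and column names; only the referenced columns are fetched, and row-group statistics in the Parquet footer let DuckDB skip chunks the WHERE clause rules out:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'eu-west-1';")
# credentials, e.g.: SET s3_access_key_id = '...'; SET s3_secret_access_key = '...';

rel = con.sql("""
    SELECT device_id, avg(temperature) AS avg_temp
    FROM read_parquet('s3://my-bucket/sensors/2024-*.parquet')
    WHERE temperature > 30          -- filter pushdown via Parquet row-group stats
    GROUP BY device_id
""")
print(rel.df())
```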
I suggest you try using StarRocks to query the data lake directly. I know many large companies, including Tencent and Pinterest, are doing this. StarRocks has a truly mature vectorized execution engine and a robust distributed architecture. It can provide you with impressive query performance.
Congrats to the team! I feel like I see lots of posts here on HN and go "wow, I didn't know DuckDB could do that". It seems like a very powerful tool, which I haven't had the pleasure of using yet.
Due to policies at work it's unlikely we would use this in production, but as I understand it, it's still pretty useful for exploring and poking around local data. Is that right? Does anyone have examples of problems they've used it on to digest local files or logs or something?
Your blog does a great job contrasting the two use cases. I don't think too much has changed on your main use case; however, here are a few ideas to test out!
DuckDB can read SQLite files now! So if you like DuckDB syntax or query optimization, but want to use the SQLite format / indexes, that may work well.
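A quick sketch of that, assuming a hypothetical app.db SQLite file with an events table; the SQLite reader ships as a DuckDB extension:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite; LOAD sqlite;")

# Query an existing SQLite database in place, using DuckDB's engine and syntax
rows = con.execute(
    "SELECT * FROM sqlite_scan('app.db', 'events') LIMIT 10"
).fetchall()
print(rows)
```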
Since DuckDB is columnar (and compressed), it frequently needs to read a big chunk of rows (~100K) just to get 1 row out and decompressed. Mind trying to store your data uncompressed? Might help in your case! (PRAGMA force_compression='uncompressed')
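A rough sketch of trying that out; the file and table names are placeholders, and whether it actually helps depends on the access pattern:

```python
import duckdb

con = duckdb.connect("uncompressed_test.duckdb")  # placeholder database file

# Ask DuckDB to store newly written data uncompressed, then rewrite the table
con.execute("PRAGMA force_compression='uncompressed'")
con.execute("CREATE OR REPLACE TABLE lookup_copy AS SELECT * FROM lookup")  # 'lookup' is a placeholder
con.execute("CHECKPOINT")  # flush so the new storage layout is written to disk
```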
I've been wanting to explore using DuckDB for in-process aggregation and windowing in stream processing with Golang, as I think it would be a great solution.
Curious if anyone else is using DuckDB for something similar? Does anyone have an example?
I haven't had a chance to really use it yet, but I know duckdb is in my future. Being able to connect it to all the different data sources to run analytical queries, plus the support for Parquet, is a big draw.
This seems like a good model for sustaining open source, but raises some questions.
Does anybody know how the DuckDB foundation works? The sponsors are MotherDuck, Voltron, and Posit, which are heavily venture-funded. Do DuckDB Labs employees work on sponsored projects for the foundation?
I am also curious if anyone can shed light on what kind of contract work DuckDB Labs does to align its work with the open source project. This has always seemed like the holy grail, but it is difficult to do in practice.
May I ask what the value is of using DuckDB vs. loading data from Parquet directly into pandas? Apart from the fact that with DuckDB you can load part of the data rather than the entire file into memory?
The main value proposition is being able to query the data in SQL. Personally I find that more ergonomic than Pandas for a lot of tasks.
It also works seamlessly with pandas in Jupyter notebooks. You can query dataframes directly and store query results back into dataframes for plotting, etc.
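A tiny sketch of that round trip (the data is made up): DuckDB can see local dataframes by variable name, so you can run SQL over them and get a dataframe back.

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"city": ["Amsterdam", "Utrecht", "Delft"],
                   "population": [882_000, 362_000, 104_000]})

# SQL over the in-memory dataframe, result back as a dataframe for plotting etc.
big_cities = duckdb.sql("SELECT city FROM df WHERE population > 200000").df()
print(big_cities)
```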