DuckDB looks very interesting and I'm quite excited to examine it more closely in the next few days!

I just wanted to add to the discussion that an unchanging file format, or at least a backwards-compatible one, is a key feature of SQLite. See for example Richard Hipp's comments here [1] (I think he also mentions earlier in the talk that the file format has now become a limiting factor for some of the refactoring they would like to do). The file format therefore seems likely to be a major factor in the long-term success of this project, and I am glad to see that you are taking your time before settling on an architecture here.

Given that you are targeting the data science and analytics space, what are your plans for integration with Arrow and the Feather file format? From a purely user/developer perspective, Arrow's aim of shared in-memory data structures across different analytics tools seems like a great goal. I know Wes McKinney and Ursa Labs have also spent quite some time on the file storage part of this; see for example the Feather V2 announcement [2].

What are your thoughts on the tradeoffs they considered and how do you see the requirements of DuckDB in relation to theirs?

From the Carnegie Mellon DuckDB talk [3], I saw that you already have a zero-copy reader for the pandas in-memory data structures, so the vision I have is that DuckDB could become the universal SQL interface to Arrow datasets, which could then also be shared with more complex ML models. Is that something we can hope for, or are there obstacles to this?
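
For reference, the pandas side already looks roughly like this from Python (a minimal sketch on my part; the DataFrame contents are made up for illustration):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"x": range(1000), "y": ["a", "b"] * 500})

    # DuckDB scans the DataFrame in place via a replacement scan:
    # the SQL references the Python variable `df` directly, rather
    # than importing a copy of the data first.
    print(duckdb.query("SELECT y, sum(x) FROM df GROUP BY y").fetchall())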

[1] https://youtu.be/Jib2AmRb_rk?t=3150

[2] https://ursalabs.org/blog/2020-feather-v2/

[3] https://www.youtube.com/watch?v=PFUZlNQIndo




Thanks for your detailed reply! As you mentioned, the lessons learned by SQLite here are indeed crucial. We are very carefully trying to craft a storage format before fixing it in place, specifically to avoid these problems. Backwards compatibility is a must, but backwards compatibility with a sane format is massively preferable :) No doubt we will still end up making some mistakes in hindsight, though.

We already support reading from the Arrow in-memory format [1], so it is already possible to use DuckDB to perform SQL-on-Arrow. More support there is definitely possible, and I think this is a very promising direction.

[1] https://github.com/cwida/duckdb/pull/866
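
For anyone reading along, a minimal sketch of what SQL-on-Arrow looks like from the Python API (the table contents here are invented for illustration):

    import duckdb
    import pyarrow as pa

    tbl = pa.table({"id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})

    # The same replacement-scan mechanism covers Arrow tables:
    # the SQL can reference the Python variable `tbl` directly.
    print(duckdb.query("SELECT id FROM tbl WHERE score > 0.4").fetchall())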


Thanks. That's great, and I see you even support the new Arrow streaming format!
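
Something like the following seems to be the idea, if I understand correctly (a sketch on my part, assuming pyarrow's RecordBatchReader and the current DuckDB Python API):

    import duckdb
    import pyarrow as pa

    batch = pa.RecordBatch.from_pydict({"v": [1, 2, 3, 4]})
    reader = pa.RecordBatchReader.from_batches(batch.schema, [batch, batch])

    # The reader is a one-shot stream; DuckDB pulls record batches
    # from it incrementally as the query executes, instead of
    # materializing the whole table first.
    print(duckdb.query("SELECT sum(v) FROM reader").fetchall())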


It's also possible to return a result set as an Arrow table, so round-trip SQL-on-Arrow queries are possible (Arrow to DuckDB to Arrow)! It's not 100% zero-copy for strings, but it should work pretty well!
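
Roughly, the round trip looks like this (a sketch; exact method names may differ across versions, and the data is invented for illustration):

    import duckdb
    import pyarrow as pa

    tbl = pa.table({"name": ["a", "bb", "ccc"], "n": [1, 2, 3]})

    con = duckdb.connect()
    # Arrow in (replacement scan on `tbl`), Arrow out (fetch_arrow_table):
    out = con.execute("SELECT name, n * 2 AS n2 FROM tbl").fetch_arrow_table()
    print(type(out))  # <class 'pyarrow.lib.Table'>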



