DuckDB – An embeddable SQL database like SQLite, but supports Postgres features (duckdb.org)
801 points by goranmoomin on Sept 20, 2020 | 160 comments



I spent a while looking at this today. It's really interesting.

It's not based on SQLite at all (except for borrowing the SQLite shell implementation) but it looks very much like SQLite, in particular:

- It's designed to work as an embedded library, eliminating the network overhead you usually get when talking to a database

- Each database is a single file on disk

- It ships as an "amalgamation" build - a single giant C++ file (SQLite is a single giant C file)

Impressively, if you run "pip install duckdb" it Just Works - you can then "import duckdb" and start using it, with an interface that looks very similar to the sqlite3 module that ships with Python.
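For example (a minimal sketch; the file name is just a placeholder):

    import duckdb

    # Open (or create) a database file; use ":memory:" for an in-memory database.
    con = duckdb.connect("example.duckdb")

    con.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, name VARCHAR)")
    con.execute("INSERT INTO events VALUES (1, 'signup'), (2, 'login')")

    # Results come back much like they do from the sqlite3 module.
    print(con.execute("SELECT count(*) FROM events").fetchall())

    con.close()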

The key reason this exists is that it's a column store, with vectorized operations across columns - making it ideal for analytical workloads. This blog entry has some benchmarks that illustrate how well it works in that regard: https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html

It's also backed up by some strong computer science. It's by the academic researchers behind MonetDB and includes implementations of a bunch of interesting papers: https://duckdb.org/docs/why_duckdb#standing-on-the-shoulders...

It's a really interesting piece of software, and unlike many other "new databases" it feels like it fills a very genuine gap in my toolbox.


And MonetDB took some inspiration back from DuckDB to develop an embedded version also:

https://github.com/MonetDBSolutions/MonetDBe-Python/

Disclaimer: I'm working on this.


Could you elaborate on some of the similarities and differences between DuckDB and MonetDB?


Disclaimer: I am working on MonetDB, and I established the CWI database architectures group.

DuckDB is designed as an experimental system after heavy exposure to the techniques deployed in MonetDB (open-source), Hyper, and Vectorwise.

The properties of the embedded version of MonetDB can be found here https://monetdbe.readthedocs.io/en/latest/introduction.html#

Some differences between MonetDB and DuckDB can be found here: https://monetdbe.readthedocs.io/en/latest/migrations.html

and the blogpost mentioned above is covered in https://twitter.com/MonetDB/status/1282412295235280901?s=20


If everything you said is accurate, somebody should make another pitch for WebSQL ( https://en.wikipedia.org/wiki/Web_SQL_Database ) with a custom version of DuckDB!

Every browser maker was interested in implementing it, but the W3C couldn't go ahead with it because everyone chose to implement it using SQLite, whereas the W3C required more than one DB back-end implementation to move forward.


Having sqlite included in all browsers with standard bindings would have been great.


It would! But it would also be a security nightmare. Exploit SQLite and you could use it in Safari, Chrome and Firefox - everybody could be affected.


How would that be different from someone editing something in window.localStorage?


If that's all you need, it wouldn't make sense to include SQLite in the browser. If you're going to include a relational database you might as well use it.


I think they meant: how would that be more of a security nightmare than some malicious site editing your browser's whatever.localstorage?


Ironically, all the IndexedDB implementations were just thin layers on top of SQLite last I checked.


The original slow IndexedDB implementations were done that way, but my understanding is that anything fast and "modern" probably isn't in 2020, as most of the browsers moved to their own IndexedDB storage implementations that are not SQLite-based. IndexedDB has very different expected performance characteristics than a pure relational DB (as it is a key/value store with indexes, closer to most "NoSQL" DBs than to SQLite), and it did not (and does not) make long-term sense to use SQLite under the hood.


This could work if there is a subset of SQL where DuckDB and SQLite behave almost identically.


I often wondered why someone didn't do it with BerkeleyDB too. I know Oracle owns it, but it is open source, and someone like Mozilla could have implemented it in their browser just to get the W3C rolling on WebSQL.


The problem with WebSQL would remain the same, it would be necessary to specify a subset of SQL as a web standard.


> - Each database is a single file on disk

Are these architecture-independent? I.e., if I create the database on x86_64 and move it to ARM64, would it work seamlessly?


One of the authors here - portability of the storage is indeed one of our goals. We test that the same file can be loaded everywhere on different systems, including between ARM and x86.

I should mention the storage is still very much a work-in-progress. We are actively working on redesigning the storage to add compression and other extra features, meaning the storage format can change and be incompatible between different DuckDB versions. We plan to stabilize the storage format with V1.0, after which every subsequent DuckDB version should be able to read database files created by previous versions.


DuckDB looks very interesting and I'm quite excited to examine it more closely in the next few days!

I just wanted to add to the discussion that an unchanging file format, or at least a backwards compatible one, is a key feature of sqlite. See for example Richard Hipp's comments here [1] (I think he also mentioned earlier in the talk that the file format has become a limiting factor now in terms of some of the refactoring that they can do). The file format therefore seems likely to be a major factor in the long term success of this project and I am glad to see that you are taking your time before settling on any architecture here.

Given that you are targeting the data science and analytics space, what are your plans for integration with Arrow and the Feather file format? From a purely user/developer perspective, Arrow's aim of shared memory data structures across different analytics tools seems like a great goal. I know Wes McKinney and Ursa Labs have also spent quite some time on the file storage part of this; see for example the Feather V2 announcement [2].

What are your thoughts on the tradeoffs they considered and how do you see the requirements of DuckDB in relation to theirs?

From the Carnegie Mellon DuckDB talk [3], I saw that you already have a zero-copy reader to the pandas memory data structures, so the vision I have is that DuckDB could be the universal SQL interface to arrow datasets which can then also be shared with more complex ML models. Is that something that we can hope for or are there obstacles to this?

[1] https://youtu.be/Jib2AmRb_rk?t=3150

[2] https://ursalabs.org/blog/2020-feather-v2/

[3] https://www.youtube.com/watch?v=PFUZlNQIndo


Thanks for your detailed reply! As you mentioned - lessons learned by SQLite here are indeed crucial. We are very carefully trying to craft a storage format before fixing it in-place, specifically to try to avoid these problems. Backwards compatibility is a must, but backwards compatibility to a sane format is massively preferable :) No doubt we will end up making some mistakes in hindsight, though.

We already support reading from the arrow in-memory format [1], therefore it is already possible to use DuckDB to perform SQL-on-Arrow. More support there is definitely possible, and I think this is a very promising direction.

[1] https://github.com/cwida/duckdb/pull/866


Thanks. That's great and I see you even support the new Arrow streaming model!


It's also possible to return a result set as an Arrow table, so round trip SQL on Arrow queries is possible (Arrow to DuckDB to Arrow)! It's not 100% zero-copy for strings, but it should work pretty well!


Thanks for the confirmation on portability. I didn't see Go support on the home page; is official Go driver support planned as well?


Neither Hannes nor I am very experienced with Go, so we are not the best suited to do that. We do very much welcome external contributions, and someone did make a WIP Go driver [1] in the past.

We also have an sqlite3-compatible C API that can be used to talk to DuckDB. This is actually how we provide the sqlite3 shell: we take the SQLite shell directly but bind it to the DuckDB library using this API. It might be possible to use the Go SQLite driver in a similar manner to talk to DuckDB.

[1] https://github.com/marcboeker/go-duckdb


And Dart bindings? That could be really interesting for Flutter developers.


Are you considering (or do you maybe already have) the advanced columnar features? Run-length encoding, dictionary encoding, etc. It would be great to see how those perform under these workloads.


Definitely. We have quite detailed plans for compression :) more to follow when we get to it.


Excellent! Thanks for the update.


Couldn't you have used an already existing format for storage, e.g. Apache ORC?


Those formats have a different use case from a database system - they are mostly designed for write-once, read-many-times workloads. As far as I'm aware you cannot update them in an ACID-compliant manner, for example, which makes it difficult to use them as a backend for an ACID database system.


Another DuckDB developer here, we support reading table data from Parquet files directly.


I am a big fan of those formats, but decoupling the actual storage features from the ecosystem is not a trivial task. I haven't looked at the C++ version of ORC for a while, but it used to be incomplete. Other than that, the solutions ORC uses to compress data are pretty amazing.


Yes, their testing folder is 26 MB while the source code of the database is only 5.6 MB. It looks like they took the same approach to testing as SQLite.


How might one go about using this in a language that’s not supported in the documentation such as golang?


Generally you would need to write bindings for the system. Experimental DuckDB bindings for Go do exist [1]; however, they are rather old and might not work anymore.

We also have JDBC support, which might help for languages that know how to use that protocol. ODBC support is not implemented yet, but also planned.

We also have an SQLite-compatible C API [2] that can potentially be used with an existing SQLite driver that targets the SQLite C API.

[1] https://github.com/marcboeker/go-duckdb

[2] https://github.com/cwida/duckdb/blob/master/tools/sqlite3_ap...


Hi, author of the Go bindings here. I've just tested them and they are still working with the latest DuckDB version from master.


Excellent, thanks again for writing them :)


[flagged]


A giant plant, as in someone from their team planning this out and then planting this as a comment?

simonw wrote Datasette [1], which makes extensive use of SQLite and acts as a viewer, allowing you to create interactive websites and APIs from arbitrary databases. He'd be a very long-term plant, and it instead seems far more likely he's interested in the possibilities of DuckDB compatibility within the context of Datasette and other similar projects.

[1]: https://github.com/simonw/datasette


I actually started looking at it today because someone was asking about it on the Datasette discussions board: https://github.com/simonw/datasette/discussions/968


I don’t think that it’s reasonable to assume that Simon would risk his reputation by coming here to astroturf.


I really wish ALL software projects / frameworks / libraries could follow the lead here. Instead of your marketing page telling me what world-changing, awesome tech you have (which really is a consumer marketing strategy), just do it like DuckDB:

When to use DuckDB; when not to use DuckDB.


I couldn't put my finger on why I felt like I had just read something useful, but that's it: clear, concise and straight-forward information to help you evaluate use.

Only thing wrong I saw was the detection of the platform. It detected my Linux box as "Unix" and proposed I download OS X libs.


We should make a list of technologies that do this, because I know ClickHouse also has a reasonably detailed page on when to use it, when not to use it, and why. Postgres also has a very nice "dos and don'ts" wiki page.



What an excellent point! I suppose the consumer marketing strategy comes from pure imitation: people look at how "professional" products present themselves (with the aim of selling a product and service) and simply copy that.

Algorithms textbooks don't create marketing blurbs for linked lists...


Looks like DuckDB isn't seeking investment, so they can afford to be developer-first. Not everyone can, though.


I wish every product would tell me what's in it for me. Saying that it's world-changing doesn't mean it's useful or impactful for me, as a consumer or as a developer. Ordinary products usually brand themselves as "new" for differentiation, but genuinely new products stay away from this so as not to scare people. You usually notice when something is really "new" (i.e. a paradigm change) when the marketing loudly says, "Don't worry, everything will stay the same as it was, no visible difference for you!"


First of all, really neat project.

There's an issue with the Installation section of the web page, though. Running Chrome on Linux, it says "System detected: Linux", which is right. But under "CLI", it offers me a download of duckdb_cli-osx-amd64.zip.

Just in case it was just the zip file name that's wrong, I downloaded it, but the duckdb binary inside is a "Mach-O 64-bit x86_64 executable" according to the file command.

The correct file is available if I click on Other Installations (https://duckdb.org/docs/installation/).

Also, if I try it running Firefox on Linux, it says "System detected: UNIX" and gives me the OS X download.


Thanks for reporting that! We will have a look at fixing it :)


FWIW I got the same issue on Android, though offering me the "correct" link wouldn't do me much good either :)


I was hoping from the title that it aims for Postgres SQL compatibility, but I can't find it explicitly mentioned in the docs. This makes me think I really want something like sqlite://memory that completely disregards speed or even persistence. Instead you could say, for example, "open an in-memory database that behaves like Postgres 9" and run your tests against it. With typical fixtures of 10 or so rows, you wouldn't even need query plans - just naive filter+copy over whole tables for most operations.

Currently I'm using https://pypi.org/project/testing.postgresql/, which spawns a temporary server per test suite, which has quite a lot of overhead.


> SQL Parser: We use the PostgreSQL parser that was repackaged as a stand-alone library. The translation to our own parse tree is inspired by Peloton.

Right at the bottom of this page:

https://duckdb.org/docs/why_duckdb.html


That's awesome news, thanks for that!


No problem, I was curious as well and hunted it down. This project does seem interesting; I'm a huge fan of database systems like SQLite and H2. There's something wonderful about being able to copy a simple file and share it with other developers, and this one seems rather interesting as well.


That’s what made Excel so popular.


One of the authors of DuckDB here: we use the PostgreSQL parser, and try to be generally compatible with the syntax used by SQLite, Postgres and MySQL. In some cases those are unfortunately mutually exclusive (e.g. null ordering - we have a PRAGMA to change this system-wide). Making the SQL dialect “as compatible as possible” with these systems is very much one of our goals, and if you find any incompatibilities feel free to open an issue :)


I'd note that when moving from pg to $other the thing that really trips me up isn't the syntax changes, it's the lack of ARRAY and ROWTYPE.

I'm not sure whether those are in scope for you but it'd be nice if the docs said "to be implemented" or "out of scope" somewhere ... and my apologies in advance if they do and I somehow missed it.


We already have support for LIST and STRUCT (which I think are equivalent to ARRAY and ROW, respectively). There is still some functionality missing there (notably storage, and several functions) - but the base functionality is there and the remainder definitely will be added.
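A rough sketch from the Python client; struct_pack and list_value are the constructor functions as I recall them, so double-check the docs for your version:

    import duckdb

    con = duckdb.connect()  # in-memory database

    # Build a STRUCT and a LIST directly in SQL.
    print(con.execute("""
        SELECT struct_pack(x := 1, y := 'hello') AS s,
               list_value(1, 2, 3) AS l
    """).fetchall())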


I could only find a pull request on STRUCT, and nothing at all on LIST, because I just got a billion 'select list' articles. Also nothing in the docs.

STRUCT looks 10x uglier but is perfectly easy to generate, mind; the whole thing is just non-trivial to find :/


I would just create a per-test-file database (and delete it the next time the test runs). The overhead is very small (compared to booting Postgres) and it works exactly like production because it is exactly production.

In general, I am not comfortable using an "almost equivalent" database for tests. Every database engine has its quirks, and if you don't know what they are, your program that passes its tests only "almost" works in production, which is annoying.


And if you need to run expensive migrations, use a template database.

https://www.postgresql.org/docs/current/manage-ag-templatedb...


While PostgreSQL templates allow you to scaffold new databases (one per test spec) quite fast, it's even faster to have a warmed-up pool of databases ready: https://github.com/allaboutapps/integresql#background

Disclaimer: author.


Oh that looks great, and very close to what I had in mind for the next upgrade of our in-house testing framework. Definitely going to give that a try soon. Thanks for posting it!


I would also love this, in the past I’ve used the H2 in-memory database for tests with Spring boot applications.

The Hibernate ORM handles configuring the H2 schema but they’re not fully compatible, so it means I have to be careful to not rely on Postgres-specific features. I generally am not testing database logic, just need a working persistence layer to get the tests running, but an in-memory Postgres implementation would be amazing.


Although it's something I'd love to write, supporting it over a long time as pure FOSS would probably be a sad experience. I wonder what model could be used here. Maybe something like patreon or "sponsor a feature" with an open-source project? FOSS core + proprietary Java / .NET bindings because corps can pay?


Yeah, for sure this is the type of niche, high-effort work that would be used extensively by the closed-source world and yet struggle to find adequate support.


IIRC H2 supports a Postgres compatibility mode, though I'm not sure how feature-complete it is.

http://h2database.com/html/features.html


Man, I’ve wanted the “compatible SQL engine with only RAM storage” for testing for YEARS. Closest I got was some shenanigans with MSSQL’s LocalDB.


I use Docker to spin up new PostgreSQL DBs on the fly.


Same here, and I run the same set of migrations that run in production. To be clear, this is only done once per test session, not for individual tests, and the tests are written in such a way that they don't interfere with each other.

The overhead is actually pretty small, less than 10s. I'd say too much for unit tests, but well within the tolerable range for integration/functional tests. Compared with the time I'd spend hacking together some brittle and unrealistic in-memory alternative, I much prefer to use a real database.


I do that too but it’s a noticeable overhead, I guess it wouldn’t be great for mass-testing scenarios.


RAM FS + fsync off?


Those solutions still have a high overhead. There's ACID compliance, serialisation in memory, maintaining indices, and many other layers. Compare that to an ideal testing solution with no initialisation cost, where an insert is literally: parse the query, add a new entry to a list, done.


But you want to be testing against something that is as close as possible to the deployment environment. So if that means ACID, indices, etc., then that's what it is.


You can still do them in a trivial way that works like production. For example: if some column has a unique index, look at all rows and compare the new value. You don't need an actual index for it (and definitely not a fancy concurrent-access B-tree). For transactions/MVCC you can literally make a copy of everything.


Have you tried postgresql with libeatmydata?


I'd build this, but I'm not sure if I'd be able to get anyone to fund it.


You shouldn't rely on any "just like X" systems for tests; you should use X exactly, including the same config and version as production.

What you want is a Postgres server, but optimized for tests. What I've done:

1. One instance per pipeline

2. Parallelize by having separate databases per process

3. Build the schema once

4. Before each test, purge the database (or run tests in transactions so you can just roll back at the end)

5. Run in Docker, use tmpfs volumes for writes (no disk writes)

It runs fairly quickly.
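For point 4, a rough sketch of the run-each-test-in-a-transaction approach with pytest and psycopg2 (connection details and schema are placeholders):

    import psycopg2
    import pytest

    @pytest.fixture
    def db():
        # One connection per test; psycopg2 starts a transaction implicitly,
        # and rolling back afterwards leaves the pre-built schema untouched.
        conn = psycopg2.connect("dbname=test_db user=test host=localhost")
        try:
            yield conn
        finally:
            conn.rollback()
            conn.close()

    def test_insert(db):
        cur = db.cursor()
        cur.execute("INSERT INTO users (name) VALUES (%s)", ("alice",))
        cur.execute("SELECT count(*) FROM users WHERE name = %s", ("alice",))
        assert cur.fetchone()[0] == 1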


Cool project. Really appreciate the "When to use DuckDB" and "When to not use DuckDB" sections.

Anyone using this in production? Also, any benchmarks vs. other DBs?


I work at a Fortune 100 company and we have this in production for our self-service analytics platform as part of our data transformation web service. Each web request can do multiple pandas transformations, or spin up its own DuckDB or SQLite DB and execute arbitrary SQL transformations. It fits our use case like a glove and is super fast to/from Pandas.


3-month old HN thread [1] with good information from the authors and others.

“An embeddable column store for data science” (Pandas and dbplyr) would have been a good description at that time, but the addition of JDBC/Java support expands the use cases.

[1] https://news.ycombinator.com/item?id=23287278


The performance of DuckDB's predecessor, MonetDBLite, was/is stellar. DuckDB, when I tried it 6 months ago, was a bit behind in terms of CSV parsing and aggregation by groups, and didn't seem much faster than regular SQLite, so I didn't really jump at it. Does anyone here know if the performance has improved since then and how it compares to MonetDBLite?


We are currently actively working on performance optimization, and have recently finished implementing parallelization. Expect more performance improvements soon :)


I think the title chosen by the poster misses the mark a bit; the main point seems to be that it's an OLAP DB optimised for analytics.


This is slightly off-topic, but do you have any good guides/conventions for how to save time-series data to SQLite or similar? I'm no DB expert, so I'm struggling with this. I have a dynamic number of measurements that usually share the same timestamp. So one table with timestamp_id, signal_id and value?
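To make the question concrete, this is roughly the "one row per signal per timestamp" layout I had in mind (untested sketch):

    import sqlite3

    con = sqlite3.connect("measurements.db")
    con.executescript("""
        CREATE TABLE IF NOT EXISTS signals (
            signal_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS measurements (
            ts        INTEGER NOT NULL,  -- shared timestamp (e.g. unix epoch)
            signal_id INTEGER NOT NULL REFERENCES signals(signal_id),
            value     REAL NOT NULL,
            PRIMARY KEY (ts, signal_id)
        );
    """)
    con.close()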


What would be the recommended way to migrate data from SQLite to DuckDB? Should I dump SQL from SQLite and load it in DuckDB? Or use Parquet, or CSV export/import? I think those drawn in by the tagline "like SQLite but more features" might be interested in a quick how-to/guide like that.
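For what it's worth, the route I'd probably try first is going through pandas in Python; an untested sketch with placeholder names:

    import sqlite3
    import duckdb
    import pandas as pd

    # Pull a table out of SQLite into a DataFrame...
    sqlite_con = sqlite3.connect("source.db")
    df = pd.read_sql_query("SELECT * FROM my_table", sqlite_con)
    sqlite_con.close()

    # ...then let DuckDB create a table from the registered DataFrame.
    duck_con = duckdb.connect("target.duckdb")
    duck_con.register("df", df)
    duck_con.execute("CREATE TABLE my_table AS SELECT * FROM df")
    duck_con.close()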


This talk they did with a database research group at CMU may be of interest as well. https://youtu.be/PFUZlNQIndo


A suggestion for the web site: It would be good to mention the license more prominently. I had to click through to Github to find out what it was, and I believe this information would matter to more people than just me.


Correction: As CRConrad, whose comment was inexplicably downvoted, pointed out, the MIT License IS mentioned on the front page (And the Wayback Machine proves it was mentioned before the HN post).


Dunno if this was put up in reaction to your comment, or you missed it, but looking at their front page right now it says “Permissive MIT License”.


Very cool project! Seems basically to be a column store and stores everything in a single file.

That reminds me of LMDB, which is similar in some ways, except it's a key/value store. I highly recommend checking it out too, to see which embeddable DB makes sense for your project.

http://www.lmdb.tech/doc/


What is it that makes a database an OLAP database rather than a generic relational database?

Expressed otherwise: why would I choose DuckDB over SQLite, or SQLite over DuckDB?


OLAP databases can be (and usually are) still relational databases. The difference is that they're optimised for different workloads.

SQLite/MySQL/Postgres/MSSQL etc are all OLTP databases whose primary operation is based around operations on single (or few) rows.

OLAP databases like ClickHouse/DuckDB, Monet, Redshift, etc. are optimised for operating on columns and performing operations like bulk aggregations, group-bys, pivots, etc. on large subsets or whole tables.

If I was recording user purchases/transactions: SQLite.

If I was aggregating and analysing a batch of data on a machine: DuckDB.

I read an interesting engineering blog from Spotify(?) where, in their data pipeline, instead of passing around CSVs or rows of JSON they passed around SQLite databases: DuckDB would probably be a good fit there.
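To make the OLTP vs. OLAP distinction concrete (toy schema, just to show the two query shapes):

    import duckdb

    con = duckdb.connect()  # in-memory
    con.execute("CREATE TABLE purchases (purchase_id INT, customer_id INT, amount DOUBLE)")
    con.execute("INSERT INTO purchases VALUES (1, 10, 9.99), (2, 10, 5.00), (3, 11, 20.00)")

    # OLTP-style: touch a single row by key (what SQLite/Postgres are tuned for).
    con.execute("UPDATE purchases SET amount = 6.00 WHERE purchase_id = 2")

    # OLAP-style: scan whole columns and aggregate (what DuckDB/ClickHouse are tuned for).
    print(con.execute("""
        SELECT customer_id, sum(amount) AS total, count(*) AS n
        FROM purchases
        GROUP BY customer_id
        ORDER BY total DESC
    """).fetchall())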


I gather that there may be some differences in functionality offered, but that it’s probably not much. So I presume performance is the biggest area of difference. In that regard: how big are the differences in performance for each category’s representative workloads?


Before the latest optimizations, and using only 1 core, we were seeing 133x the performance of SQLite on a basic group-by or join, and about 4x on a pretty complex query. It was roughly even with Pandas in performance, but it can scale to larger-than-memory data, and now it can use multiple cores! As an example, I could build views from 2 Pandas DataFrames with 2 columns and 1 million rows each, join them, and return the 1-million-row dataset back to Pandas in 2 seconds, vs. 40 seconds with SQLite/SQLAlchemy... Pretty sweet. DuckDB is going to be even faster now, I bet!
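Roughly what that looks like (a sketch from memory, not the exact code we run):

    import duckdb
    import numpy as np
    import pandas as pd

    n = 1_000_000
    left = pd.DataFrame({"id": np.arange(n), "x": np.random.rand(n)})
    right = pd.DataFrame({"id": np.arange(n), "y": np.random.rand(n)})

    con = duckdb.connect()  # in-memory
    con.register("left_df", left)
    con.register("right_df", right)

    # Join the two registered DataFrames in SQL and pull the result back as a DataFrame.
    result = con.execute("""
        SELECT l.id, l.x, r.y
        FROM left_df AS l
        JOIN right_df AS r USING (id)
    """).fetchdf()
    print(len(result))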


Have you compared this to dask from a performance standpoint? That is the larger than memory solution we're currently using for analytics.


How about the other way? When will SQLite perform better than DuckDB?


If your workload is lots of singular INSERTs or UPDATEs.


I routinely see order of magnitude or greater improvements moving analytical workloads to compressed, columnstore engines vs rowstore engines.

The shortest useful description I can give of what constitutes an analytical workload is that it is table-scan-heavy and often concerned with a small subset of columns in a table.


To add to the other comment (albeit with different DBs, because I haven't used DuckDB yet), comparing MSSQL and ClickHouse on similar hardware and the same dataset: CH responds in ~30-60 ms vs. 300-400+ ms for MSSQL on simpler queries, and 90-180 ms vs. several seconds (up to about 30 s) on more complex queries.

I could add more indices to MSSQL and do all sorts of query optimisations, but out of the box the OLAP wins hands down for that workload.


Please think twice before choosing an embeddable database for hostable software. I see so many open-source projects that get hampered down the line because they chose SQLite and there's now a SPoF that would require a significant rewrite to eliminate - especially as SQLite, depending on configuration, either corrupts or becomes extremely slow when stored on distributed/network file systems. I assume this goes for DuckDB as well.

Embeddable databases absolutely have legit use-cases, but I see them being used for projects where it'd have saved so much trouble to use a proper ORM (perhaps with sqlite support!) from the beginning.


Not sure where an ORM would help in this case. If anything, it might get in the way sometimes (in terms of performance), as some queries might not be well optimised.

But yeah, people shouldn't just "follow the hype". They should always ask why they would use this and not that.


I think ORMs might have been brought up because many embedded DBs have some level of language-native bindings that eliminate the need for an ORM. A couple of examples that come to mind would be Mnesia (a KV store rather than SQL, but still an example) or some of the SQLite providers for C#, which give you their own mini ORM (that you will have to refactor if you change out DBs, instead of using SQLite with an ORM from the get-go).

My preference for ORMs is still on the 'give me a SQL DSL and map my objects' side. Let me use SQL as SQL, give me convenience in interfacing with the DB. Complex features like object tracking tend to be nice up front but require extra diligence down the line to avoid pitfalls, and tend to be risky from a standpoint of composability.

Sorry. Almost went on an ORM rant there.


I guess the thought behind using an ORM here was as an abstraction over the database/storage layer, so theoretically you could switch out the database easily.

In practice though, real projects often have to sidestep the ORM and handwrite SQL, sometimes using database-specific features.


What I meant more specifically was a database-independent RDBMS ORM (e.g. EF Core, SQLAlchemy) - for cases when there aren't the resources or the need for separate DB driver implementations.


> especially as sqlite depending on configuration either corrupts or becomes extremely slow when stored on distributed/network file systems

I had Plex running via Docker, and the Plex configuration/metadata stored via NFS on my NAS. For quite a while seeking would be intermittently slow, or plain crash the Plex server requiring a restart of the container. Other weird issues as well.

Finally I had enough and did some debugging and searching, and found that Plex uses sqlite in a way that is not compatible with networked storage. So I moved it to an iSCSI volume, mounted as a regular drive on the Docker host and it's been rock solid since.

But yeah, didn't expect that.


Yup. I was surprised and did the same sleuthing for PhotoStructure when libraries are stored on remote filesystems. I had to add a "local db replica" mode to work around this issue.

When running PhotoStructure under docker, it turned out to be simpler to just assume the DB is stored on a remote filesystem, and enable this mode by default. When running outside a container I can detect if the host volume is remote automatically and set the mode appropriately.

The reason why, other than missing file locking semantics, seems to be from WAL mode's use of an SHM file. Unsurprisingly, memory-mapped files don't work well from remote shares.


You just saved me a lot of time. I'm in the middle of migrating my homelab from docker/swarm to kube and had not thought of this. Ty.


Can multiple separate clients access the same single-file database at the same time?


One of the DuckDB authors here. The answer is yes, but only in read-only mode at the same time.


What about one single client writing and multiple clients in read-only mode? Any problems with storing the file on network storage? Basically, how far can you push it before it is better to just use PostgreSQL?


Strict typing has been the one big feature missing from SQLite, that this presumably brings.


Foreign keys enabled by default or by configuration setting would be another big feature missing from SQLite3, but I couldn't find any mention of foreign keys in DuckDB's documentation.


Foreign keys are not supported yet in DuckDB currently, but we do plan to add support for them in the future. If you have a need for them, feel free to post a comment on the issue. We tend to prioritize implementing features that many people ask for :)


SQLite has foreign keys, but they are disabled by default for backwards-compatibility reasons. https://sqlite.org/foreignkeys.html


And it's not possible to enable them by default - there is no existing configuration setting to override that design choice. They hope to include the feature in SQLite4, as I understand it.


SQLite 4 was killed off in favor of just bringing a few changes to SQLite 3 iteratively.

https://www.sqlite.org/src4/doc/trunk/www/index.wiki


OLAP databases seem to omit foreign keys. Redshift does this, for example. They don't exist in BigQuery (but BQ pushes you towards a single-table design anyway, so moot point I guess).


Looking at the CREATE TABLE syntax, it doesn't look like it provides FKs. Just PKs, UNIQUE, and CHECK on an expression, similar to SQLite.


What? SQLite3 absolutely has FKs.


You just have to write

    PRAGMA foreign_keys = on;
each and every single time you want to do an operation that requires an FK constraint, like `INSERT`.


s/each and every single time/once when you instantiate a new database connection/

ftfy.

You really shouldn't be needing to create new connections often. Really only once for any given process.

There are some 20 PRAGMAs that PhotoStructure sets for library databases, but it's only at process startup, and it takes a couple millis to run them all. It's wonderful to be able to configure stuff so easily.

I also think the design decision for backward compatibility (which meant this default is false) is absolutely defensible.
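In Python's stdlib sqlite3 module, for example, the same idea looks roughly like this (which PRAGMAs you set is up to you; these two are just examples):

    import sqlite3

    def open_db(path):
        conn = sqlite3.connect(path)
        # Run the PRAGMAs once, right after opening; they stay in effect
        # for the lifetime of this connection.
        conn.execute("PRAGMA foreign_keys = ON")
        conn.execute("PRAGMA journal_mode = WAL")
        return conn

    conn = open_db("library.db")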


I'll defer to anyone with expertise.

I'm modelling a program in Python on a work computer without administrative privileges, which means it has to stay in user space. Because I haven't decided on a final implementation language, I don't want to commit to any one language's memory model. Vanilla SQL is my solution there.

My specification limits procedures to SQL as much as possible, using the implementation language only for user interaction with and computations on the database not possible in standard SQL. It minimizes use of the heap, which requires opening connections to the database frequently, but this is practical given the low performance requirements of the application.

SQLite3 satisfies my spec except for the extra diligence required first to implement the foreign keys PRAGMA in order to maintain referential integrity, and second to remove it if an RDBMS closer to the SQL standard were eventually chosen.

In a nutshell, my constraints are user space, limiting operations to standard SQL as much as possible, referential integrity, and minimal implementation language dependencies besides having an available SQL API. Given those constraints, would you recommend a different RDBMS? Or would you agree SQLite3 is my least worst option?


Yes, DuckDB has strict typing.


Sorry for my perhaps stupid question,

has anyone tried this with ESP32/Arduino? I would need an analytical DB for a hiking application (temperature/pressure data, etc.).


Ooh, this sounds like a fun use-case but a power loss sounds super painful.


Would also store on MicroSD.


That’s part of the problem. Power loss during SD card write will, more likely than not, cause corruption - FAT file systems do not have journaling.


Well yes. But then it's just hiking, not emergency services; SD cards also break, so I would need a dual slot, etc.


Simple, embedded databases deserve more attention, especially those with native language support, e.g. map/filter/reduce methods on wrapped data that behave similarly to JavaScript arrays, and support for nested data structures (like auto join and normalize).

(I made one using the Proxy API, called ref-db. The con is it requires manual indexing to speed up lookups by non-primary keys.)


Say I want to fit a logistic regression. Can I do that in the database? Or do I have to extract it out as a dataframe? Can I stream the data?


If you want to do linear regression aggregations with any DB, one thing you can do is store the coefficients and then aggregate them on request. You sacrifice some accuracy for speed.
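A related trick: for plain least squares you can compute the coefficients directly with SQL aggregates. A sketch of the single-variable case (toy data; slope of y = a*x + b):

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE points (x DOUBLE, y DOUBLE)")
    con.execute("INSERT INTO points VALUES (1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)")

    # a = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2)
    print(con.execute("""
        SELECT (count(*) * sum(x*y) - sum(x) * sum(y))
             / (count(*) * sum(x*x) - sum(x) * sum(x)) AS slope
        FROM points
    """).fetchone())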


Model fitting. How to do?


Stochastic gradient descent.


I love/hate it when I see pieces of software come out that would solve problems I had 10 or 15 years ago and had imperfect solutions for. This would have been great for a software project for call centre analytics I was employed to work on back in the mid-2000s. I believe I used HSQL (embedded Java DB) but it was really not the right tool. At the time I looked at MonetDB but its architecture wasn't right for what I was doing. This would have fit the niche perfectly.

This is really neat. Wish I had a problem I could use it for now.


I used embedded Firebird for a couple of projects in the early/mid 2000s myself. Wish it were a bit more popular.


So I tried a simple query in the live demo and it seems that the WHERE clause isn't filtering anything. Am I missing something?

  SELECT * FROM part WHERE p_size = 7;


Looks like a bug, likely related to the optimizer, since adding count(*) does produce the correct result. I will have a better look after the weekend. The demo runs a relatively old version so it's possible it has already been fixed in the meantime. Thanks for the report!


Should be fixed now. This was an old bug related to filter pushdown into scans. The demo now runs the latest version of the system. Thanks again for the report!


I was discussing with a coworker what would be the best option, embeddable or a service, for a database running behind a Windows GUI app.

My bet was on embeddable, so SQLite; is that the right choice?

It seems a nightmarish scenario to have 500 Postgres instances running on client machines and accessible from the internet, albeit with security enabled and such...

Am I right?


Countless mobile apps use SQLite.


In this case, though, I'm talking about a very old Windows app with a COBOL backend.

We want to move data from COBOL files to a database; we're making our own migration tools/scripts in Python (mainly in order to rename weird table/column names to saner ones), and we can target any SQL flavour/database we want to insert that data into... The question is what's the better call here...

SQLite seems good enough, at least until we can move our whole app to a more web-friendly environment, at which point something like Postgres might make more sense? But that could take a couple of years... (we're a small team/company in a very specific niche/locale)


SQLite works well for app files. Bentley uses it for all sorts of stuff, and I had way fewer corrupt files in their format than Autodesk's. :)

Do note that SQLite itself doesn't have the kindest handling of date/times; you are at the mercy of your data provider (unless you are going native), especially when timezones are involved.

If it's running on Windows machines you could consider SQL Server LocalDB (it runs in user mode, and I believe you are allowed to distribute it as such in an app). But from what I've read it doesn't sound like that would make sense for your case and tech stack.

SQLite is a good start because even if you are wrong, it should be pretty dang easy to migrate to whatever is right. :)


First of all thank you for all this info.

Actually, SQL Server was one of the other contenders, but I disregarded it due to it not being open source and having some hard caps in the free license (I think it's 10 GB)...

But I didn't know about SQL Server LocalDB, so we will look into that!


How does DuckDB compare to AWS Athena or Spectrum? Both services offer SQL directly on Parquet files. I guess you would say DuckDB is not distributed, but if you create one on demand when the user needs to run the SQL on the Parquet, you'd be in a similar use case?


So could the DuckDB/PostgreSQL combination function in a manner analogous to Firebird, which has both embedded and server modes? I'd find that really exciting.


Can we create a view over multiple parquet files?


Yes, totally possible. You can use UNION ALL to merge them together. We are also working on globbing support for the Parquet reader that should be released next week [1].

[1] https://github.com/cwida/duckdb/issues/773
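For example, roughly (file names are placeholders; parquet_scan is the Parquet reader function):

    import duckdb

    con = duckdb.connect()
    # A view over several Parquet files; queries against it read the files directly.
    con.execute("""
        CREATE VIEW events AS
        SELECT * FROM parquet_scan('events_2020_01.parquet')
        UNION ALL
        SELECT * FROM parquet_scan('events_2020_02.parquet')
    """)
    print(con.execute("SELECT count(*) FROM events").fetchall())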


I will definitely look at using this for some personal projects. The R API is greatly appreciated. Thanks for alerting me to it.


Really want to use this for EDA. I hope it has support for client apps like DBeaver or something


Any GUI Client to explore data?


Web or desktop?


This sounds great. I wonder if there are plans for an ODBC driver at any point?


Amazing work!

It was asking a lot, but still :-( that JSONB and range datatypes are missing.


Good name.


Kudos, much needed. Neat documentation. Would love to try with JS APIs.


Can someone recommend what's the best way to connect to an in-memory DuckDB instance from PHP (php-fpm) and get results back in a usable format?

E.g. it would be used for offline statistical analysis.

Alternatively, is there any way to export CLI SQL results as CSV?


Is this backward compatible with SQLite?


The on-disk format is different, it seems. But there is an implementation of the SQLite C API for interacting with the database. So it depends on at which level / for what purpose you want compatibility.


Just curious about these: do you think embeddable databases are common in car navigation systems?


This sounds like a solution I’d use. Embedded. Simplifies simple sites.



