DuckDB – An embeddable SQL database like SQLite, but supports Postgres features (duckdb.org)
801 points by goranmoomin on Sept 20, 2020 | 160 comments



I spent a while looking at this today. It's really interesting.

It's not based on SQLite at all (except for borrowing the SQLite shell implementation) but it looks very much like SQLite, in particular:

- It's designed to work as an embedded library, eliminating the network overhead you usually get when talking to a database

- Each database is a single file on disk

- It ships as an "amalgamation" build - a single giant C++ file (SQLite is a single giant C file)

Impressively, if you run "pip install duckdb" it Just Works - you can then "import duckdb" and start using it, with an interface that looks very similar to the sqlite3 module that ships with Python.
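For example (a minimal sketch; the file name is just a placeholder):

    import duckdb

    # Open (or create) a database file; use ":memory:" for an in-memory database.
    con = duckdb.connect("example.duckdb")

    con.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, name VARCHAR)")
    con.execute("INSERT INTO events VALUES (1, 'signup'), (2, 'login')")

    # Results come back much like they do from the sqlite3 module.
    print(con.execute("SELECT count(*) FROM events").fetchall())

    con.close()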

The key reason this exists is that it's a column store, with vectorized operations across columns - making it ideal for analytical workloads. This blog entry has some benchmarks that illustrate how well it works in that regard: https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html

It's also backed up by some strong computer science. It's by the academic researchers behind MonetDB and includes implementations of a bunch of interesting papers: https://duckdb.org/docs/why_duckdb#standing-on-the-shoulders...

It's a really interesting piece of software, and unlike many other "new databases" it feels like it fills a very genuine gap in my toolbox.


And MonetDB took some inspiration back from DuckDB to develop an embedded version also:

https://github.com/MonetDBSolutions/MonetDBe-Python/

Disclaimer: I'm working on this.


Could you elaborate on some of the similarities and differences between DuckDB and MonetDB?


Disclaimer: I am working on MonetDB, and I established the CWI database architectures group.

DuckDB is designed as an experimental system after heavy exposure to the techniques deployed in MonetDB (open-source), Hyper, and Vectorwise.

The properties of the embedded version of MonetDB can be found here https://monetdbe.readthedocs.io/en/latest/introduction.html#

Some differences between MonetDB and DuckDB can be found here: https://monetdbe.readthedocs.io/en/latest/migrations.html

and the blogpost mentioned above is covered in https://twitter.com/MonetDB/status/1282412295235280901?s=20


If everything you said is accurate, somebody should make another pitch for WebSQL ( https://en.wikipedia.org/wiki/Web_SQL_Database ) with a custom version of DuckDB!

Every browser maker was interested in implementing it, but the W3C couldn't go ahead with it because everyone chose to implement it using SQLite, whereas the W3C required more than one DB back-end implementation to move forward.


Having sqlite included in all browsers with standard bindings would have been great.


It would! But it would also be a security nightmare. Exploit SQLite and you could use it in Safari, Chrome and Firefox - everybody could be affected.


How would that be different from someone editing something in window.localStorage?


If that's all you need, it wouldn't make sense to include SQLite in the browser. If you're going to include a relational database you might as well use it.


I think they meant: how would that be more of a security nightmare than some malicious site editing your browser's whatever.localstorage?


Ironically, all the IndexedDB implementations were just thin layers on top of SQLite last I checked.


The original slow IndexedDB implementations were done that way, but my understanding is that anything fast and "modern" probably isn't in 2020, as most of the browsers moved to their own IndexedDB storage implementations that are not SQLite-based. IndexedDB has very different expected performance characteristics than a pure relational DB (as it is a key/value store with indexes, closer to most "NoSQL" DBs than to SQLite), and it did not (and does not) make long-term sense to use SQLite under the hood.


This could work if there is a subset of SQL where DuckDB and SQLite behave almost identically.


I often wondered why someone didn't do it with BerkeleyDB too. I know Oracle owns it, but it is open source, and someone like Mozilla could have implemented it in their browser just to get the W3C rolling on WebSQL.


The problem with WebSQL would remain the same, it would be necessary to specify a subset of SQL as a web standard.


> - Each database is a single file on disk

Are these architecture-independent? I.e., if I create the database on x86_64 and move it to ARM64, would it work seamlessly?


One of the authors here - portability of the storage is indeed one of our goals. We test that the same file can be loaded everywhere on different systems, including between ARM and x86.

I should mention the storage is still very much a work-in-progress. We are actively working on redesigning the storage to add compression and other extra features, meaning the storage format can change and be incompatible between different DuckDB versions. We plan to stabilize the storage format with V1.0, after which every subsequent DuckDB version should be able to read database files created by previous versions.


DuckDB looks very interesting and I'm quite excited to examine it more closely in the next few days!

I just wanted to add to the discussion that an unchanging file format, or at least a backwards compatible one, is a key feature of sqlite. See for example Richard Hipp's comments here [1] (I think he also mentioned earlier in the talk that the file format has become a limiting factor now in terms of some of the refactoring that they can do). The file format therefore seems likely to be a major factor in the long term success of this project and I am glad to see that you are taking your time before settling on any architecture here.

Given that you are targeting the data science and analytics space, what are your plans for integration with Arrow and the Feather file format? From a purely user/developer perspective, Arrow's aim of shared memory data structures across different analytics tools seems like a great goal. I know Wes McKinney and Ursa Labs have also spent quite some time on the file storage part of this; see for example the Feather V2 announcement [2].

What are your thoughts on the tradeoffs they considered and how do you see the requirements of DuckDB in relation to theirs?

From the Carnegie Mellon DuckDB talk [3], I saw that you already have a zero-copy reader to the pandas memory data structures, so the vision I have is that DuckDB could be the universal SQL interface to arrow datasets which can then also be shared with more complex ML models. Is that something that we can hope for or are there obstacles to this?

[1] https://youtu.be/Jib2AmRb_rk?t=3150

[2] https://ursalabs.org/blog/2020-feather-v2/

[3] https://www.youtube.com/watch?v=PFUZlNQIndo


Thanks for your detailed reply! As you mentioned - lessons learned by SQLite here are indeed crucial. We are very carefully trying to craft a storage format before fixing it in-place, specifically to try to avoid these problems. Backwards compatibility is a must, but backwards compatibility to a sane format is massively preferable :) No doubt we will end up making some mistakes in hindsight, though.

We already support reading from the arrow in-memory format [1], therefore it is already possible to use DuckDB to perform SQL-on-Arrow. More support there is definitely possible, and I think this is a very promising direction.

[1] https://github.com/cwida/duckdb/pull/866


Thanks. That's great and I see you even support the new Arrow streaming model!


It's also possible to return a result set as an Arrow table, so round trip SQL on Arrow queries is possible (Arrow to DuckDB to Arrow)! It's not 100% zero-copy for strings, but it should work pretty well!


Thanks for the confirmation on portability. I didn't see Go support on the home page; is official Go driver support planned as well?


Neither Hannes nor I am very experienced with Go, so we are not the best suited to do that. We do very much welcome external contributions, and someone did make a WIP Go driver [1] in the past.

We also have an sqlite3-compatible C API that can be used to talk to DuckDB. This is actually how we provide the sqlite3 shell: we take the SQLite shell directly but bind it to the DuckDB library using this API. It might be possible to use the Go SQLite driver in a similar manner to talk to DuckDB.

[1] https://github.com/marcboeker/go-duckdb


And Dart bindings? That could be really interesting for Flutter developers.


Are you considering (or do you maybe already have) the advanced columnar features? Run-length encoding, dictionary encoding, etc. It would be great to see how those perform under these workloads.


Definitely. We have quite detailed plans for compression :) more to follow when we get to it.


Excellent! Thanks for the update.


Couldn't you have used an already existing format for storage, e.g. Apache ORC?


Those formats have a different use case from a database system - they are mostly designed for write-once, read-many-times workloads. As far as I'm aware you cannot update them in an ACID-compliant manner, for example, which makes it difficult to use them as a backend for an ACID database system.


Another DuckDB developer here, we support reading table data from Parquet files directly.


I am a big fan of those formats, but decoupling the actual storage features from the ecosystem is not a trivial task. I haven't looked at the C++ version of ORC for a while, but it used to be incomplete. Other than that, the solutions ORC uses to compress data are pretty amazing.


Yes, their testing folder is 26 MB while the source code of the database is only 5.6 MB. It looks like they took the same approach to testing as SQLite.


How might one go about using this in a language that’s not supported in the documentation such as golang?


Generally you would need to write bindings for the system. Experimental DuckDB bindings for Go do exist [1]; however, they are rather old and might not work anymore.

We also have JDBC support, which might help for languages that know how to use that protocol. ODBC support is not implemented yet, but also planned.

We also have an SQLite-compatible C API [2] that can potentially be used with an existing SQLite driver that targets the SQLite C API.

[1] https://github.com/marcboeker/go-duckdb

[2] https://github.com/cwida/duckdb/blob/master/tools/sqlite3_ap...


Hi, author of the Go bindings here. I've just tested them and they are still working with the latest DuckDB version from master.


Excellent, thanks again for writing them :)


[flagged]


A giant plant, as in someone from their team planning this out and then planting this as a comment?

simonw wrote Datasette [1], which makes extensive use of SQLite and acts as a viewer, allowing you to create interactive websites and APIs from arbitrary databases. He'd be a very long-term plant, and it instead seems far more likely he's interested in the possibilities of DuckDB compatibility within the context of Datasette and other similar projects.

[1]: https://github.com/simonw/datasette


I actually started looking at it today because someone was asking about it on the Datasette discussions board: https://github.com/simonw/datasette/discussions/968


I don’t think that it’s reasonable to assume that Simon would risk his reputation by coming here to astroturf.


I really wish ALL software projects / frameworks / libraries could follow the lead here. Instead of your marketing page telling me what world-changing, awesome tech you have (which really is a consumer marketing strategy), just do it like DuckDB:

When to use DuckDB; when not to use DuckDB.


I couldn't put my finger on why I felt like I had just read something useful, but that's it: clear, concise and straight-forward information to help you evaluate use.

Only thing wrong I saw was the detection of the platform. It detected my Linux box as "Unix" and proposed I download OS X libs.


We should make a list of technologies that do this, because I know ClickHouse also has a reasonably detailed page on when to use it, when not to use it, and why. Postgres also has a very nice "dos and don'ts" wiki page.



What an excellent point! I suppose the consumer marketing strategy comes from pure imitation: people look at how "professional" products present themselves (with the aim of selling a product and service) and simply copy that.

Algorithms textbooks don't create marketing blurbs for linked lists...


Looks like DuckDB isn't seeking investment, so they can afford to be developer-first. Not everyone can, though.


I wish every product would tell me what's in it for me. Saying that it's world-changing doesn't mean it's useful or impactful for me, as a consumer or as a developer. Ordinary products usually brand themselves as "new" for differentiation, but genuinely new products stay away from this so as not to scare people. You usually notice when something is really "new" (i.e. a paradigm change) when the marketing loudly says, "Don't worry, everything will stay the same as it was, no visible difference for you!"


First of all, really neat project.

There's an issue with the Installation section of the web page, though. Running Chrome on Linux, it says "System detected: Linux", which is right. But under "CLI", it offers me a download of duckdb_cli-osx-amd64.zip.

Just in case it was just the zip file name that's wrong, I downloaded it, but the duckdb binary inside is a "Mach-O 64-bit x86_64 executable" according to the file command.

The correct file is available if I click on Other Installations (https://duckdb.org/docs/installation/).

Also, if I try it running Firefox on Linux, it says "System detected: UNIX" and gives me the OS X download.


Thanks for reporting that! We will have a look at fixing it :)


FWIW I got the same issue on Android, though offering me the "correct" link wouldn't do me much good either :)


I was hoping from the title that it aims for Postgres SQL compatibility, but I can't find it explicitly mentioned in the docs. This makes me think I really want something like sqlite://memory that completely disregards speed or even persistence. Instead you could say, for example, "open an in-memory database that behaves like Postgres 9" and run your tests against it. With typical fixtures of 10 or so rows, you wouldn't even need query plans - just naive filter+copy over whole tables for most operations.

Currently I'm using https://pypi.org/project/testing.postgresql/, which spawns a temporary server per test suite, which has quite a lot of overhead.


> SQL Parser: We use the PostgreSQL parser that was repackaged as a stand-alone library. The translation to our own parse tree is inspired by Peloton.

Right at the bottom of this page:

https://duckdb.org/docs/why_duckdb.html


That's awesome news, thanks for that!


No problem, I was curious as well and hunted it down. This project does seem interesting; I'm a huge fan of database systems like SQLite and H2. There's something wonderful about being able to copy a simple file and share it with other developers, and this one seems rather interesting as well.


That’s what made Excel so popular.


One of the authors of DuckDB here: we use the PostgreSQL parser, and try to be generally compatible with the syntax used by SQLite, Postgres and MySQL. In some cases those are unfortunately mutually exclusive (e.g. null ordering - we have a PRAGMA to change this system-wide). Making the SQL dialect “as compatible as possible” with these systems is very much one of our goals, and if you find any incompatibilities feel free to open an issue :)


I'd note that when moving from pg to $other the thing that really trips me up isn't the syntax changes, it's the lack of ARRAY and ROWTYPE.

I'm not sure whether those are in scope for you but it'd be nice if the docs said "to be implemented" or "out of scope" somewhere ... and my apologies in advance if they do and I somehow missed it.


We already have support for LIST and STRUCT (which I think are equivalent to ARRAY and ROW, respectively). There is still some functionality missing there (notably storage, and several functions) - but the base functionality is there and the remainder definitely will be added.
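A rough sketch from the Python client; struct_pack and list_value are the constructor functions as I recall them, so double-check the docs for your version:

    import duckdb

    con = duckdb.connect()  # in-memory database

    # Build a STRUCT and a LIST directly in SQL.
    print(con.execute("""
        SELECT struct_pack(x := 1, y := 'hello') AS s,
               list_value(1, 2, 3) AS l
    """).fetchall())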


I could only find a pull request on STRUCT, and nothing at all on LIST, because I just got a billion 'select list' articles. Also nothing in the docs.

STRUCT looks 10x uglier but is perfectly easy to generate, mind; the whole thing is just non-trivial to find :/


I would just create a per-test-file database (and delete it the next time the test runs). The overhead is very small (compared to booting Postgres) and it works exactly like production because it is exactly production.

In general, I am not comfortable using an "almost equivalent" database for tests. Every database engine has its quirks, and if you don't know what they are, your program that passes its tests only "almost" works in production, which is annoying.


And if you need to run expensive migrations, use a template database.

https://www.postgresql.org/docs/current/manage-ag-templatedb...


While PostgreSQL templates allow you to scaffold new databases (one per test spec) quite fast, it's even faster to have a warmed-up pool of databases ready: https://github.com/allaboutapps/integresql#background

Disclaimer: author.


Oh that looks great, and very close to what I had in mind for the next upgrade of our in-house testing framework. Definitely going to give that a try soon. Thanks for posting it!


I would also love this, in the past I’ve used the H2 in-memory database for tests with Spring boot applications.

The Hibernate ORM handles configuring the H2 schema but they’re not fully compatible, so it means I have to be careful to not rely on Postgres-specific features. I generally am not testing database logic, just need a working persistence layer to get the tests running, but an in-memory Postgres implementation would be amazing.


Although it's something I'd love to write, supporting it over a long time as pure FOSS would probably be a sad experience. I wonder what model could be used here. Maybe something like patreon or "sponsor a feature" with an open-source project? FOSS core + proprietary Java / .NET bindings because corps can pay?


Yeah, for sure this is the type of niche, high-effort work that would be used extensively by the closed-source world and yet struggle to find adequate support.


IIRC H2 supports a Postgres compatibility mode, though I'm not sure how feature-complete it is.

http://h2database.com/html/features.html


Man, I’ve wanted the “compatible SQL engine with only RAM storage” for testing for YEARS. Closest I got was some shenanigans with MSSQL’s LocalDB.


I use Docker to spin up new PostgreSQL DBs on the fly.


Same here, and I run the same set of migrations that run in production. To be clear, this is only done once per test session, not for individual tests, and the tests are written in such a way that they don't interfere with each other.

The overhead is actually pretty small, less than 10s. I'd say too much for unit tests, but well within the tolerable range for integration/functional tests. Compared with the time I'd spend hacking together some brittle and unrealistic in-memory alternative, I much prefer to use a real database.


I do that too but it’s a noticeable overhead, I guess it wouldn’t be great for mass-testing scenarios.


RAM FS + fsync off?


Those solutions still have a high overhead. There's ACID compliance, serialisation in memory, maintaining indices, and many other layers. Compare that to an ideal testing solution with no initialisation cost, where an insert is literally: parse the query, add a new entry to a list, done.


But you want to be testing against something that is as close as possible to the deployment environment. So if that means ACID, indices, etc., then that's what it is.


You can still do them in a trivial way that works like production. For example: if some column has a unique index, look at all rows and compare the new value. You don't need an actual index for it (and definitely not a fancy concurrent-access B-tree). For transactions/MVCC you can literally make a copy of everything.


Have you tried postgresql with libeatmydata?


I'd build this, but I'm not sure if I'd be able to get anyone to fund it.


You shouldn't rely on any "just like X" systems for tests; you should use X exactly, including the same config and version as production.

What you want is a Postgres server, but optimized for tests. What I've done:

1. One instance per pipeline

2. Parallelize by having separate databases per process

3. Build the schema once

4. Before each test, purge the database (or run tests in transactions so you can just roll back at the end)

5. Run in Docker, use tmpfs volumes for writes (no disk writes)

It runs fairly quickly.
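For point 4, a rough sketch of the run-each-test-in-a-transaction approach with pytest and psycopg2 (connection details and schema are placeholders):

    import psycopg2
    import pytest

    @pytest.fixture
    def db():
        # One connection per test; psycopg2 starts a transaction implicitly,
        # and rolling back afterwards leaves the pre-built schema untouched.
        conn = psycopg2.connect("dbname=test_db user=test host=localhost")
        try:
            yield conn
        finally:
            conn.rollback()
            conn.close()

    def test_insert(db):
        cur = db.cursor()
        cur.execute("INSERT INTO users (name) VALUES (%s)", ("alice",))
        cur.execute("SELECT count(*) FROM users WHERE name = %s", ("alice",))
        assert cur.fetchone()[0] == 1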


Cool project. Really appreciate the "When to use DuckDB" and "When to not use DuckDB" sections.

Anyone using this in production? Also, any benchmarks vs. other DBs?


I work at a Fortune 100 company and we have this in production for our self-service analytics platform as part of our data transformation web service. Each web request can do multiple pandas transformations, or spin up its own DuckDB or SQLite DB and execute arbitrary SQL transformations. It fits our use case like a glove and is super fast to/from Pandas.


3-month old HN thread [1] with good information from the authors and others.

“An embeddable column store for data science” (Pandas and dbplyr) would have been a good description at that time, but the addition of JDBC/Java support expands the use cases.

[1] https://news.ycombinator.com/item?id=23287278


The performance of DuckDB's predecessor, MonetDBLite, was/is stellar. DuckDB, when I tried it 6 months ago, was a bit behind in terms of CSV parsing and aggregation by groups, and didn't seem much faster than regular SQLite, so I didn't really jump at it. Does anyone here know if the performance has improved since then and how it compares to MonetDBLite?


We are currently actively working on performance optimization, and have recently finished implementing parallelization. Expect more performance improvements soon :)


I think the title chosen by the poster misses the mark a bit; the main point seems to be that it's an OLAP DB optimised for analytics.


This is slightly off-topic, but do you have any good guides/conventions for how to save time-series data to SQLite or similar? I'm no DB expert, so I'm struggling with this. I have a dynamic number of measurements that usually share the same timestamp. So one table with timestamp_id, signal_id and value?
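To make the question concrete, this is roughly the "one row per signal per timestamp" layout I had in mind (untested sketch):

    import sqlite3

    con = sqlite3.connect("measurements.db")
    con.executescript("""
        CREATE TABLE IF NOT EXISTS signals (
            signal_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS measurements (
            ts        INTEGER NOT NULL,  -- shared timestamp (e.g. unix epoch)
            signal_id INTEGER NOT NULL REFERENCES signals(signal_id),
            value     REAL NOT NULL,
            PRIMARY KEY (ts, signal_id)
        );
    """)
    con.close()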


What would be the recommended way to migrate data from SQLite to DuckDB? Should I dump SQL from SQLite and load it in DuckDB? Or use Parquet, or CSV export/import? I think those drawn in by the tagline "like SQLite but more features" might be interested in a quick how-to/guide like that.
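For what it's worth, the route I'd probably try first is going through pandas in Python; an untested sketch with placeholder names:

    import sqlite3
    import duckdb
    import pandas as pd

    # Pull a table out of SQLite into a DataFrame...
    sqlite_con = sqlite3.connect("source.db")
    df = pd.read_sql_query("SELECT * FROM my_table", sqlite_con)
    sqlite_con.close()

    # ...then let DuckDB create a table from the registered DataFrame.
    duck_con = duckdb.connect("target.duckdb")
    duck_con.register("df", df)
    duck_con.execute("CREATE TABLE my_table AS SELECT * FROM df")
    duck_con.close()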


This talk they did with a database research group at CMU may be of interest as well. https://youtu.be/PFUZlNQIndo


A suggestion for the web site: It would be good to mention the license more prominently. I had to click through to Github to find out what it was, and I believe this information would matter to more people than just me.


Correction: As CRConrad, whose comment was inexplicably downvoted, pointed out, the MIT License IS mentioned on the front page (And the Wayback Machine proves it was mentioned before the HN post).


Dunno if this was put up in reaction to your comment, or you missed it, but looking at their front page right now it says “Permissive MIT License”.


Very cool project! Seems basically to be a column store and stores everything in a single file.

That reminds me of LMDB, which is similar in some ways, except it's a key/value store. I highly recommend checking it out too, to see which embeddable DB makes sense for your project.

http://www.lmdb.tech/doc/


What is it that makes a database an OLAP database rather than a generic relational database?

Expressed otherwise: why would I choose DuckDB over SQLite, or SQLite over DuckDB?


OLAP databases can be (and usually are) still relational databases. The difference is that they're optimised for different workloads.

SQLite/MySQL/Postgres/MSSQL etc are all OLTP databases whose primary operation is based around operations on single (or few) rows.

OLAP databases like ClickHouse/DuckDB, Monet, Redshift, etc. are optimised for operating on columns and performing operations like bulk aggregations, group-bys, pivots, etc. on large subsets or whole tables.

If I was recording user purchases/transactions: SQLite.

If I was aggregating and analysing a batch of data on a machine: DuckDB.

I read an interesting engineering blog from Spotify(?) where, in their data pipeline, instead of passing around CSVs or rows of JSON they passed around SQLite databases: DuckDB would probably be a good fit there.
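To make the OLTP vs. OLAP distinction concrete (toy schema, just to show the two query shapes):

    import duckdb

    con = duckdb.connect()  # in-memory
    con.execute("CREATE TABLE purchases (purchase_id INT, customer_id INT, amount DOUBLE)")
    con.execute("INSERT INTO purchases VALUES (1, 10, 9.99), (2, 10, 5.00), (3, 11, 20.00)")

    # OLTP-style: touch a single row by key (what SQLite/Postgres are tuned for).
    con.execute("UPDATE purchases SET amount = 6.00 WHERE purchase_id = 2")

    # OLAP-style: scan whole columns and aggregate (what DuckDB/ClickHouse are tuned for).
    print(con.execute("""
        SELECT customer_id, sum(amount) AS total, count(*) AS n
        FROM purchases
        GROUP BY customer_id
        ORDER BY total DESC
    """).fetchall())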


I gather that there may be some differences in functionality offered, but that it’s probably not much. So I presume performance is the biggest area of difference. In that regard: how big are the differences in performance for each category’s representative workloads?


Before the latest optimizations, and using only 1 core, we were seeing 133x the performance of SQLite on a basic group-by or join, and about 4x on a pretty complex query. It was roughly even with Pandas in performance, but it can scale to larger-than-memory data, and now it can use multiple cores! As an example, I could build views from 2 Pandas DataFrames with 2 columns and 1 million rows each, join them, and return the 1-million-row dataset back to Pandas in 2 seconds, vs. 40 seconds with SQLite/SQLAlchemy... Pretty sweet. DuckDB is going to be even faster now, I bet!
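Roughly what that looks like (a sketch from memory, not the exact code we run):

    import duckdb
    import numpy as np
    import pandas as pd

    n = 1_000_000
    left = pd.DataFrame({"id": np.arange(n), "x": np.random.rand(n)})
    right = pd.DataFrame({"id": np.arange(n), "y": np.random.rand(n)})

    con = duckdb.connect()  # in-memory
    con.register("left_df", left)
    con.register("right_df", right)

    # Join the two registered DataFrames in SQL and pull the result back as a DataFrame.
    result = con.execute("""
        SELECT l.id, l.x, r.y
        FROM left_df AS l
        JOIN right_df AS r USING (id)
    """).fetchdf()
    print(len(result))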


Have you compared this to dask from a performance standpoint? That is the larger than memory solution we're currently using for analytics.


How about the other way? When will SQLite perform better than DuckDB?


If your workload is lots of singular INSERTs or UPDATEs.


I routinely see order of magnitude or greater improvements moving analytical workloads to compressed, columnstore engines vs rowstore engines.

The shortest useful description I can give of what constitutes an analytical workload is that it is table-scan-heavy and often concerned with a small subset of columns in a table.


To add to the other comment (albeit with different DBs, because I haven't used DuckDB yet), comparing MSSQL and ClickHouse on similar hardware and the same dataset: CH responds in ~30-60 ms vs. 300-400+ ms for MSSQL on simpler queries, and 90-180 ms vs. several seconds (up to about 30 s) on more complex queries.

I could add more indices to MSSQL and do all sorts of query optimisations, but out of the box the OLAP wins hands down for that workload.


Please think twice before choosing an embeddable database for hostable software. I see so many open-source projects that get hampered down the line because they chose SQLite and there's now a SPoF that would require a significant rewrite to eliminate - especially as SQLite, depending on configuration, either corrupts or becomes extremely slow when stored on distributed/network file systems. I assume this goes for DuckDB as well.

Embeddable databases absolutely have legit use-cases, but I see them being used for projects where it'd have saved so much trouble to use a proper ORM (perhaps with sqlite support!) from the beginning.


Not sure where an ORM would help in this case. If anything, it might get in the way sometimes (in terms of performance), as some queries might not be well optimised.

But yeah, people shouldn't just "follow the hype". They should always ask why they would use this and not that.


I think ORMs might have been brought up because many embedded DBs have some level of language-native bindings that eliminate the need for an ORM. A couple of examples that come to mind would be Mnesia (a KV store rather than SQL, but still an example) or some of the SQLite providers for C#, which give you their own mini ORM (that you will have to refactor if you change out DBs, instead of using SQLite with an ORM from the get-go).

My preference for ORMs is still on the 'give me a SQL DSL and map my objects' side. Let me use SQL as SQL, give me convenience in interfacing with the DB. Complex features like object tracking tend to be nice up front but require extra diligence down the line to avoid pitfalls, and tend to be risky from a standpoint of composability.

Sorry. Almost went on an ORM rant there.


I guess the thought behind using an ORM here was as an abstraction over the database/storage layer, so theoretically you could switch out the database easily.

In practice though, real projects often have to sidestep the ORM and handwrite SQL, sometimes using database-specific features.


What I meant more specifically was a database-independent RDBMS ORM (e.g. EF Core, SQLAlchemy) - for cases when there aren't the resources or the need for separate DB driver implementations.


> especially as sqlite depending on configuration either corrupts or becomes extremely slow when stored on distributed/network file systems

I had Plex running via Docker, and the Plex configuration/metadata stored via NFS on my NAS. For quite a while seeking would be intermittently slow, or plain crash the Plex server requiring a restart of the container. Other weird issues as well.

Finally I had enough and did some debugging and searching, and found that Plex uses sqlite in a way that is not compatible with networked storage. So I moved it to an iSCSI volume, mounted as a regular drive on the Docker host and it's been rock solid since.

But yeah, didn't expect that.


Yup. I was surprised and did the same sleuthing for PhotoStructure when libraries are stored on remote filesystems. I had to add a "local db replica" mode to work around this issue.

When running PhotoStructure under docker, it turned out to be simpler to just assume the DB is stored on a remote filesystem, and enable this mode by default. When running outside a container I can detect if the host volume is remote automatically and set the mode appropriately.

The reason why, other than missing file locking semantics, seems to be from WAL mode's use of an SHM file. Unsurprisingly, memory-mapped files don't work well from remote shares.


You just saved me a lot of time. I'm in the middle of migrating my homelab from docker/swarm to kube and had not thought of this. Ty.


Can multiple separate clients access the same single-file database at the same time?


One of the DuckDB authors here. The answer is yes, but only in read-only mode at the same time.


What about one single client writing and multiple clients in read-only mode? Any problems with storing the file on network storage? Basically, how far can you push it before it is better to just use PostgreSQL?


Strict typing has been the one big feature missing from SQLite, that this presumably brings.


Foreign keys enabled by default or by configuration setting would be another big feature missing from SQLite3, but I couldn't find any mention of foreign keys in DuckDB's documentation.


Foreign keys are not supported yet in DuckDB currently, but we do plan to add support for them in the future. If you have a need for them, feel free to post a comment on the issue. We tend to prioritize implementing features that many people ask for :)


SQLite has foreign keys, but they are disabled by default for backwards-compatibility reasons. https://sqlite.org/foreignkeys.html


And it's not possible to enable them by default - there is no existing configuration setting to override that design choice. They hope to include the feature in SQLite4, as I understand it.


SQLite 4 was killed off in favor of just bringing a few changes to SQLite 3 iteratively.

https://www.sqlite.org/src4/doc/trunk/www/index.wiki


OLAP databases seem to omit foreign keys. Redshift does this, for example. They don't exist in BigQuery (but BQ pushes you towards a single-table design anyway, so moot point I guess).


Looking at the CREATE TABLE syntax, it doesn't look like it provides FKs. Just PKs, UNIQUE, and CHECK on an expression, similar to SQLite.


What? SQLite3 absolutely has FKs.


You just have to write

    PRAGMA foreign_keys = on;
each and every single time you want to do an operation that requires an FK constraint, like `INSERT`.


s/each and every single time/once when you instantiate a new database connection/

ftfy.

You really shouldn't be needing to create new connections often. Really only once for any given process.

There are some 20 PRAGMAs that PhotoStructure sets for library databases, but it's only at process startup, and it takes a couple millis to run them all. It's wonderful to be able to configure stuff so easily.

I also think the design decision for backward compatibility (which meant this default is false) is absolutely defensible.
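In Python's stdlib sqlite3 module, for example, the same idea looks roughly like this (which PRAGMAs you set is up to you; these two are just examples):

    import sqlite3

    def open_db(path):
        conn = sqlite3.connect(path)
        # Run the PRAGMAs once, right after opening; they stay in effect
        # for the lifetime of this connection.
        conn.execute("PRAGMA foreign_keys = ON")
        conn.execute("PRAGMA journal_mode = WAL")
        return conn

    conn = open_db("library.db")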


I'll defer to anyone with expertise.

I'm modelling a program in Python on a work computer without administrative privileges, which means it has to stay in user space. Because I haven't decided on a final implementation language, I don't want to commit to any one language's memory model. Vanilla SQL is my solution there.

My specification limits procedures to SQL as much as possible, using the implementation language only for user interaction with and computations on the database not possible in standard SQL. It minimizes use of the heap, which requires opening connections to the database frequently, but this is practical given the low performance requirements of the application.

SQLite3 satisfies my spec except for the extra diligence required first to implement the foreign keys PRAGMA in order to maintain referential integrity, and second to remove it if an RDBMS closer to the SQL standard were eventually chosen.

In a nutshell, my constraints are user space, limiting operations to standard SQL as much as possible, referential integrity, and minimal implementation language dependencies besides having an available SQL API. Given those constraints, would you recommend a different RDBMS? Or would you agree SQLite3 is my least worst option?


Yes, DuckDB has strict typing.


Sorry for my perhaps stupid question,

has anyone tried this with ESP32/Arduino? I would need an analytical DB for a hiking application (temperature/pressure data, etc.).


Ooh, this sounds like a fun use-case but a power loss sounds super painful.


Would also store on MicroSD.


That’s part of the problem. Power loss during SD card write will, more likely than not, cause corruption - FAT file systems do not have journaling.


Well yes. But then it's just hiking, not emergency services; SD cards also break, so I would need a dual slot, etc.


Simple, embedded databases deserve more attention, especially those with native language support, e.g. map/filter/reduce methods on wrapped data that behave similarly to JavaScript arrays, and support for nested data structures (like auto join and normalize).

(I made one using the Proxy API, called ref-db. The con is it requires manual indexing to speed up lookups by non-primary keys.)


Say I want to fit a logistic regression. Can I do that in the database? Or do I have to extract it out as a dataframe? Can I stream the data?


If you want to do linear regression aggregations with any DB, one thing you can do is store the coefficients and then aggregate them on request. You sacrifice some accuracy for speed.
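A related trick: for plain least squares you can compute the coefficients directly with SQL aggregates. A sketch of the single-variable case (toy data; slope of y = a*x + b):

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE points (x DOUBLE, y DOUBLE)")
    con.execute("INSERT INTO points VALUES (1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)")

    # a = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2)
    print(con.execute("""
        SELECT (count(*) * sum(x*y) - sum(x) * sum(y))
             / (count(*) * sum(x*x) - sum(x) * sum(x)) AS slope
        FROM points
    """).fetchone())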


Model fitting. How to do?


Stochastic gradient descent.


I love/hate it when I see pieces of software come out that would solve problems I had 10 or 15 years ago and had imperfect solutions for. This would have been great for a software project for call centre analytics I was employed to work on back in the mid-2000s. I believe I used HSQL (embedded Java DB) but it was really not the right tool. At the time I looked at MonetDB but its architecture wasn't right for what I was doing. This would have fit the niche perfectly.

This is really neat. Wish I had a problem I could use it for now.


I used embedded Firebird for a couple of projects in the early/mid 2000s myself. Wish it were a bit more popular.


So I tried a simple query in the live demo and it seems that the WHERE clause isn't filtering anything. Am I missing something?

  SELECT * FROM part WHERE p_size = 7;


Looks like a bug, likely related to the optimizer, since adding count(*) does produce the correct result. I will have a better look after the weekend. The demo runs a relatively old version so it's possible it has already been fixed in the meantime. Thanks for the report!


Should be fixed now. This was an old bug related to filter pushdown into scans. The demo now runs the latest version of the system. Thanks again for the report!


I was discussing with a coworker what would be the best option, embeddable or a service, for a database running behind a Windows GUI app.

My bet was on embeddable, so SQLite; is that the right choice?

It seems a nightmarish scenario to have 500 Postgres instances running on client machines and accessible from the internet, albeit with security enabled and such...

Am I right?


Countless mobile apps use SQLite.


In this case, though, I'm talking about a very old Windows app with a COBOL backend.

We want to move data from COBOL files to a database; we're making our own migration tools/scripts in Python (mainly in order to rename weird table/column names to saner ones), and we can target any SQL flavour/database we want to insert that data into... The question is what's the better call here...

SQLite seems good enough, at least until we can move our whole app to a more web-friendly environment, at which point something like Postgres might make more sense? But that could take a couple of years... (we're a small team/company in a very specific niche/locale)


SQLite works well for app files. Bentley uses it for all sorts of stuff, and I had way fewer corrupt files in their format than Autodesk's. :)

Do note that SQLite itself doesn't have the kindest handling of date/times; you are at the mercy of your data provider (unless you are going native), especially when timezones are involved.

If it's running on Windows machines you could consider SQL Server LocalDB (it runs in user mode, and I believe you are allowed to distribute it as such in an app). But from what I've read it doesn't sound like that would make sense for your case and tech stack.

SQLite is a good start because even if you are wrong, it should be pretty dang easy to migrate to whatever is right. :)


First of all thank you for all this info.

Actually, SQL Server was one of the other contenders, but I disregarded it due to it not being open source and having some hard caps in the free license (I think it's 10 GB)...

But I didn't know about SQL Server LocalDB, so we will look into that!


How does DuckDB compare to AWS Athena or Spectrum? Both services offer SQL directly on Parquet files. I guess you would say DuckDB is not distributed, but if you create one on demand when the user needs to run the SQL on the Parquet, you'd be in a similar use case?


So could the DuckDB/PostgreSQL combination function in a manner analogous to Firebird, which has both embedded and server modes? I'd find that really exciting.


Can we create a view over multiple parquet files?


Yes, totally possible. You can use UNION ALL to merge them together. We are also working on globbing support for the Parquet reader that should be released next week [1].

[1] https://github.com/cwida/duckdb/issues/773
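For example, roughly (file names are placeholders; parquet_scan is the Parquet reader function):

    import duckdb

    con = duckdb.connect()
    # A view over several Parquet files; queries against it read the files directly.
    con.execute("""
        CREATE VIEW events AS
        SELECT * FROM parquet_scan('events_2020_01.parquet')
        UNION ALL
        SELECT * FROM parquet_scan('events_2020_02.parquet')
    """)
    print(con.execute("SELECT count(*) FROM events").fetchall())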


I will definitely look at using this for some personal projects. The R API is greatly appreciated. Thanks for alerting me to it.


Really want to use this for EDA. I hope it has support for client apps like DBeaver or something


Any GUI Client to explore data?


Web or desktop?


This sounds great. I wonder if there are plans for an ODBC driver at any point?


Amazing work!

It was asking a lot, but still :-( that JSONB and range datatypes are missing.


Good name.


Kudos, much needed. Neat documentation. Would love to try with JS APIs.


Can someone recommend what's the best way to connect to an in-memory DuckDB instance from PHP (php-fpm) and get results back in a usable format?

E.g. it would be used for offline statistical analysis.

Alternatively, is there any way to export CLI SQL results as CSV?


Is this backward compatible with SQLite?


The on-disk format is different, it seems. But there is an implementation of the SQLite C API for interacting with the database. So it depends on at which level / for what purpose you want compatibility.


Just curious about these: do you think embeddable databases are common in car navigation systems?


This sounds like a solution I’d use. Embedded. Simplifies simple sites.



