SQLite is also part of Python's standard library. No need to add dependencies; things just work. If you don't plan to run heavy analytical workloads in your research, SQLite is probably just fine.
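For instance, here's a minimal sketch of how little ceremony that involves (the file and table names are just placeholders):

    # sqlite3 ships with CPython -- nothing to install.
    import sqlite3

    con = sqlite3.connect("results.db")  # or ":memory:" for throwaway data
    con.execute("CREATE TABLE IF NOT EXISTS runs (id INTEGER PRIMARY KEY, score REAL)")
    con.execute("INSERT INTO runs (score) VALUES (?)", (0.93,))
    con.commit()
    print(con.execute("SELECT avg(score) FROM runs").fetchone())
    con.close()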
There was an article just yesterday, I think, showing DuckDB failing the billion-row taxi test; the author had to go through several reboots because of the database's 0.x novelty.
Why can't the scheduler just do its job and let applications enjoy the highest performance they can achieve? I assume DuckDB is making good use of those cores, not just spinning wildly ;).
That's not how most apps work. They tend to allocate memory for specific purposes and free it when done. Only heavier apps, like databases, retain large buffer pools, and those are usually tunable.
And apps themselves don't usually page to disk; the OS does that. Before we had MMUs, apps like Photoshop used to manage the page file manually. But again, an app doing that is doing it for a specific purpose, not just to be greedy.
Is there some specific software you are referring to?
No, it's a general rant at analysis software that seeks to maximise all resources.
I agree that database software likely does not behave so aggressively, but the terms I saw in the DuckDB post led me to believe it was trending in that direction.
I'm still resistant to using SQLite for projects it would be perfect for, like small Node.js services with relatively small batches of non-permanent data... and I'm not sure why. Deprecation from web browsers put me off, and I think it's unfortunate, but it isn't such a big deal for my server-side purposes... but still... I just don't want to harness a 10-year deployment to something a little iffy when I'm completely sure that including a full MySQL DB will be a safe bet. For the same reasons, but more so, I'm not really going to mess around with DuckDB.

It would really be nice if there were an interoperable web/server light relational DB standard, and it's a shame that SQLite somehow got so close to that... but we seem to reach these peaks and then go backwards for a decade. (See: every animated interactive 2D canvas scene graph of the last decade trying to reproduce some portion of Flash.)
SQLite is not 'a little iffy'. It's a) one of the most resilient pieces of software ever made and b) one of the most widely adopted pieces of software ever made.
One data point that really highlights this is that it is used in the flight software of Airbus planes. That alone is a strong indicator that it is extremely robust, and that it will be around for a very long time. There's a whole list of additional industry use cases [1] that supports this claim even further.
A good resource to better understand how SQLite is tested is https://www.sqlite.org/testing.html. I found it to be a fascinating read, and quite impressive. I came away with a lot of confidence in SQLite's robustness.
> Deprecation from web browsers put me off, and I think it's unfortunate
See, as I recall, WebSQL was deprecated because a proper standard needed more than one implementation, but nobody thought they could improve on SQLite.
So the draft standard ended up saying user agents must implement the SQL dialect supported by SQLite 3.6.19, which is... not normal for a standard.
But fundamentally, that's an argument in favour of SQLite.
I tried it in my toy project to replace spreadsheets, but I don't see it as an SQLite replacement; as a matter of fact, it can use SQLite as a DB backend: https://duckdb.org/docs/extensions/sqlite
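From Python it looks roughly like this (a sketch; the file and table names are made up, and the extension API may differ between 0.x releases):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL sqlite")  # fetches the extension; needs network access
    con.execute("LOAD sqlite")
    # sqlite_scan(path, table) exposes an existing SQLite table to DuckDB queries.
    print(con.execute(
        "SELECT count(*) FROM sqlite_scan('my_data.db', 'measurements')"
    ).fetchall())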
Did clickhouse-local reduce its install size? I vaguely remember that it always included all symbols and other stuff, was over 500 MB, and was not trivial to install. Definitely a hurdle for some people. IIRC they were working on addressing this last time I checked.
In my field many people use MongoDB to store data - if they feel like it... I personally use SQLite for keeping logs sometimes, though that is quite annoying when all your files live on NFS. BTW, how does DuckDB deal with NFS, while we're on the topic?
Because if it doesn't, you can just let your experiments log to CSV - which is likely still more than what some people do, and a perfect input for DuckDB...
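Something along these lines, assuming a made-up log file with run_id and score columns:

    import duckdb

    con = duckdb.connect()
    # read_csv_auto infers column names and types from the file itself.
    rows = con.execute(
        "SELECT run_id, avg(score) FROM read_csv_auto('experiment_log.csv') GROUP BY run_id"
    ).fetchall()
    print(rows)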
I've never heard of DuckDB. I haven't used SQLite a lot, but one thing that I noticed it lacks, and this is pretty important, is that you can't add a foreign key constraint after a table is created, which is bad if you change your schema later. I don't even understand why this limitation exists.
One typical way to work around this limitation is to create a new table with the constraint, copy the data over, drop the old table, and rename the new one into its place. Copying from table to table is generally very fast.
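Roughly like this in SQLite (a sketch with made-up table names; a real migration would also recreate any indexes and triggers):

    import sqlite3

    con = sqlite3.connect("app.db")
    con.executescript("""
        PRAGMA foreign_keys = OFF;  -- don't enforce FKs mid-migration
        BEGIN;
        CREATE TABLE orders_new (
            id INTEGER PRIMARY KEY,
            user_id INTEGER REFERENCES users(id)  -- the constraint we wanted to add
        );
        INSERT INTO orders_new SELECT id, user_id FROM orders;
        DROP TABLE orders;
        ALTER TABLE orders_new RENAME TO orders;
        COMMIT;
        PRAGMA foreign_keys = ON;
    """)
    con.close()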
Primary keys and foreign keys are rarely used in data science workloads. In our performance guide, we recommend that users avoid primary keys unless they are absolutely necessary [1]. In workloads running in infrastructure components, keys and constraints matter more, and DuckDB supports them (though currently some of them can only be added when the table is created).
DuckDB is a columnar database, so computing the sum of one column should be really fast. Yet it requires more than a minute for 280k rows? That seems way too slow, so it's probably just a limitation of the shared file system.
Especially since SQLite would give a near-instant reply if there were an index on the table. The entire example is wrongly reasoned from start to finish.
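To make that concrete (a sketch with hypothetical table and column names):

    import sqlite3

    con = sqlite3.connect("readings.db")
    # With an index on the summed column, SQLite can typically answer from
    # the index b-tree alone instead of scanning the full (wider) table rows.
    con.execute("CREATE INDEX IF NOT EXISTS idx_readings_value ON readings(value)")
    print(con.execute("SELECT sum(value) FROM readings").fetchone())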
Thanks for finding that… but it does imply to me that maybe the author should have led with that? If researchers are already using SQLite, then the fact that they can more or less trivially transition to DuckDB, with an escape hatch back if needed, seems a decent selling point.
DuckDB runs fine on laptops, including x86 and ARM architectures.
It runs on macOS, Windows and Linux.
It's not yet available in the default conda channels, but it can be installed from conda-forge (or from PyPI via pip). Additionally, it also works in R (via CRAN) and has bindings for many other languages.
Honestly, DuckDB looks fine to me (I haven't played with it yet, though articles like this would put me off it and make me question how long it's going to be around); it's the tone of the article that bothers me: it sets up what appears to me to be a strawman and then plows on.
Also, my naive and quick look at your docs gives me the impression this isn't actually a storage engine (which didn't come across in the article), but instead an in-memory query engine? The FAQ doesn't really clarify this (e.g. is there an actual storage format, how does this compare to using one of the more traditional DBs, how stable is it (cf. SQLite being recommended by the Library of Congress), does this assume all the data can be loaded into memory, or can I stream the data in?).
Researchers, please work with what you know and what works well for you.