[flagged] Researchers, please replace SQLite with DuckDB now (dirk-petersen.medium.com)
32 points by tosh 9 months ago | 50 comments



DuckDB is at v0.10.0; SQLite has been battle-tested for years.

Researchers, please work with what you know and what works well for you.


SQLite is also part of Python's standard library. No need to add dependencies; things just work. If you don't plan to run heavy analytical workloads in your research, SQLite is probably just fine.
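For example, a minimal sketch of what works out of the box (the file and table names here are just illustrative):

    import sqlite3  # ships with CPython; no extra dependency

    con = sqlite3.connect("experiment.db")  # hypothetical file name
    con.execute("CREATE TABLE IF NOT EXISTS runs (id INTEGER PRIMARY KEY, score REAL)")
    con.execute("INSERT INTO runs (score) VALUES (?)", (0.93,))
    con.commit()
    print(con.execute("SELECT count(*) FROM runs").fetchone()[0])
    con.close()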


I tried to use DuckDB with Node on Alpine, and I hit problems that took time to work around.

No problems with SQLite.

I ended up moving the code to Debian because of other problems I found, but by then the decision was already made.


There was an article, I think just yesterday, showing DuckDB failing the billion-row taxi test; the author had to jump through several hoops because of the 0.x novelty of the database.


The SQLite table has no indexes; an index alone would ensure a near-instant response to count(*). The entire article follows some pretty skewed reasoning.

Don’t researchers know how to create database indexes? That would explain a lot of what I see in day to day pseudo-benchmarking like this.

(And yes, DuckDB is pretty cool. But trying to compare things without understanding how they should work is… sub-optimal)


Bloggers, please replace Medium with anything now.


Yes, please. Substack seems more user-friendly.


Substack has problems too.

For hosted foss services, write.as (https://write.as/) and bearblog (https://bearblog.dev/) are good. If self-hosting, the choices are infinite.


> DuckDB is designed from day one to use all the CPU cores in your machine

Hi. No thank you? I'm tired of greedy software that I have to literally pin down by shackling it to a cgroup via systemd or docker.

One core is fine. We're a group of 20 with limited resources, and there are no fires.


You can limit the number of threads used by DuckDB via `SET threads = x`, see https://duckdb.org/docs/configuration/pragmas#threads
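For example, from the Python client (a minimal sketch):

    import duckdb

    con = duckdb.connect()  # in-memory database
    con.execute("SET threads = 4")  # cap DuckDB at 4 threads
    print(con.execute("SELECT current_setting('threads')").fetchone())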

(Disclaimer: I work at DuckDB Labs)


Why can't the scheduler just do its job and let applications enjoy the highest performance they can achieve? I assume DuckDB is making good use of those cores, not just spinning wildly ;).


True, the scheduler usually does a fine job CPU-wise. In terms of RAM, though, if an app sees 120 GB it will expand to 120 GB before considering paging.

I'm just not a fan of software that immediately tries to use all possible resources for a 2% gain over its peers in a performance benchmark.


That's not how most apps work. They tend to allocate memory for specific purposes and free it when done. Only heavier apps, like databases, retain large buffer pools, and those are usually tunable. And apps themselves don't page to disk; the OS does that. Before we had MMUs, apps like Photoshop managed the page file manually. But again, an app doing that does it for a specific purpose, not just to be greedy.

Is there some specific software you are referring to?


No, it's a general rant at analysis software that seeks to maximise all resources.

I agree that database software likely does not behave so aggressively, but the wording I saw in the DuckDB post led me to believe they were trending in that direction.


I think having an option to utilize multiple cores is better than not having one.

Would be nice to have an option to limit the number of cores used by DuckDB though, if it doesn't have one already. (`duckdb -j4`?)


Yeah, this is kind of a nuisance on high-core count CPUs, especially SMP.


I'm still resistant to using SQLite for projects it would be perfect for, like small Node.js services with relatively small batches of non-permanent data... and I'm not sure why. Deprecation from web browsers put me off, and I think it's unfortunate, but it isn't such a big deal for my server-side purposes. Still, I just don't want to tie a 10-year deployment to something a little iffy when I'm completely sure that including a full MySQL DB will be a safe bet. For the same reasons, but more so, I'm not really going to mess around with DuckDB.

It would really be nice if there were an interoperable web/server light relational DB standard, and it's a shame that SQLite somehow got so close to that... but we seem to reach these peaks and then go backwards for a decade. (See: every animated interactive 2D canvas scene graph of the last decade trying to reproduce some portion of Flash.)


SQLite is not 'a little iffy'. It's a) one of the most resilient pieces of software ever made and b) one of the most widely adopted pieces of software ever made.


One data point that really highlights this is that it is used in the flight software in Airbus planes. This alone is a very strong indicator that it is extremely robust, and also that it will be around for a very long time. There's a whole list of additional industry use-cases [1] that support this claim even more.

[1] https://www.sqlite.org/famous.html


Insert Boeing/MongoDB joke here.


> something a little iffy

A good resource to better understand how SQLite is tested is https://www.sqlite.org/testing.html. I found it to be a fascinating read, and quite impressive. I came away with a lot of confidence in SQLite's robustness.


> Deprecation from web browsers put me off, and I think it's unfortunate

See, as I recall, WebSQL was deprecated because a proper standard needed more than one implementation, but nobody thought they could improve on SQLite.

So the draft standard ended up saying user agents must implement the SQL dialect supported by SQLite 3.6.19, which is... not normal for a standard.

But fundamentally, it's an argument in favour of SQLite.


Uh, SQLite is the only database recommended by the Library of Congress; it's not "a little iffy": https://sqlite.org/locrsf.html


> Researchers, please replace SQLite with DuckDB now

I really don't like the tone of this article. Author, please change the tone to something less aggressive.


That's far from the worst, though. At least this one says please.

The ones that irritate me are the ones with titles in the form of a direct order, such as "Stop using (technology)!".

That one makes me feel like starting to use (technology) just to demonstrate that I don't take orders from this person. :-)


There are researchers who use SQLite over CSV/Excel? Is this article based in reality?


I didn't know this existed, even after searching for SQLite alternatives among embeddable relational databases.

I use SQLite for storing configs with data rather than dumping everything into one big YAML file.

I might give DuckDB a try in our next project.


I tried it in a toy project to replace spreadsheets, but I don't see it as an SQLite replacement; as a matter of fact, it can use SQLite as a DB backend: https://duckdb.org/docs/extensions/sqlite
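For the curious, attaching an existing SQLite file from Python looks roughly like this (the file and table names are made up):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL sqlite")  # one-time extension install
    con.execute("LOAD sqlite")
    con.execute("ATTACH 'experiment.db' AS sq (TYPE sqlite)")  # hypothetical SQLite file
    print(con.execute("SELECT count(*) FROM sq.runs").fetchone())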


Or use clickhouse-local/chdb and get something even faster that you can also scale if you need a server later.


Did clickhouse-local reduce its install size? I vaguely remember that it always included all symbols and other stuff, weighed over 500 MB, and was not trivial to install. Definitely a hurdle for some people. IIRC they were working on addressing this last time I checked.


It's a 497 MB file that compresses down to 83 MB with zstd. Is 83 MB so huge to transfer?

> not trivial to install

It's a single binary, there is nothing to install. If this isn't trivial to install I can't imagine installing anything else.


Single binary is great! Maybe I misremember, or maybe it was different when I last checked (or I was compiling it myself?).

497 MB is big, though, and 100 MB is as well, but I agree that for desktop use in research it's more than OK.


It would help to specify which researchers.

In my field many people use MongoDB to store data, if they feel like it... I personally use SQLite for keeping logs sometimes, though that is quite annoying when all your files live on NFS. By the way, how does DuckDB deal with NFS, while we're at it?

Because if it doesn't, you can just let your experiments log to CSV, which is likely still more than some people do, and a perfect input for DuckDB...


I've never heard of DuckDB. I haven't used SQLite a lot, but one thing that I noticed it lacks, and this is pretty important, is that you can't add a foreign key constraint after a table is created, which is bad if you change your schema later. I don't even understand why this limitation exists.


DuckDB also does not (yet) support adding foreign keys to a table, see: https://duckdb.org/docs/sql/statements/alter_table#add--drop...

One typical way to work around this limitation is to create a new table, copy the data there, and drop the old table. Copying from table to table is in general very fast.
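A rough sketch of that workaround (the schema here is made up for illustration):

    import duckdb

    con = duckdb.connect()  # in-memory; a file-backed database works the same way
    # toy setup: an existing table that we wish had a foreign key
    con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
    con.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount DOUBLE)")

    # rebuild with the constraint, copy the data over, drop the old table
    con.execute("""
        CREATE TABLE orders_with_fk (
            id INTEGER,
            customer_id INTEGER,
            amount DOUBLE,
            FOREIGN KEY (customer_id) REFERENCES customers (id)
        )
    """)
    con.execute("INSERT INTO orders_with_fk SELECT * FROM orders")
    con.execute("DROP TABLE orders")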

(Disclaimer: I work at DuckDB Labs)


That's weird. I thought you had to have FKs to be an RDBMS; otherwise there isn't much R-MS in it.


Primary keys and foreign keys are rarely used in data science workloads. In our performance guide, we recommend that users avoid primary keys unless they are absolutely necessary [1]. In workloads running in infrastructure components, keys and constraints are more important, and they are supported by DuckDB (but currently some of them can only be added when the table is created).

[1] https://duckdb.org/docs/guides/performance/schema#microbench...


DuckDB is a columnar database, so computing the sum of one column should be really fast. Yet it takes more than a minute for 280k rows? That seems way too slow, so it's probably just a limitation of the shared file system.


It's metadata, so comparing how fast a count(*) runs is pretty pointless.


Especially since SQLite would give a near-instant reply if there were an index on the table. The entire example is wrongly reasoned from start to finish.


I agree that the article is not very good.


Query (if you’ll pardon the pun): can DuckDB read and write .sqlite files?



Thanks for finding that… but it does imply to me that maybe the author should have led with that? If researchers are already using SQLite, then the fact that they can more or less trivially transition to DuckDB, with an escape hatch back if needed, seems a decent selling point.


Why is Pandas needed in a comparison which tries to be head-to-head?


Incredibly bad advice.


And if you test this on a laptop, which is more likely what they have (rather than a server)?

Does this work on Windows (which is more likely what they're using)?

Does it come with conda (which pandas does)?

I'm not sure the author is familiar with researchers, nor whether this is just an ad...


DuckDB runs fine on laptops, including x86 and ARM architectures.

It runs on macOS, Windows and Linux.

It's not yet available in the default conda channels, but it can be installed from conda-forge (or from PyPI via pip). It also works in R (via CRAN) and has bindings for many other languages.

(Disclaimer: I work at DuckDB Labs)


Honestly, DuckDB looks fine to me (I haven't played with it yet, though articles like this would put me off it and make me question how long it's going to be around). It's the tone of the article, which sets up what appears to me to be a strawman and then plows on.

Also, my naive and quick look at your docs gives me the impression that this isn't actually a storage engine (which didn't come across in the article), but rather an in-memory query engine? The FAQ doesn't really clarify this (e.g. is there an actual storage format, how does this compare to the more traditional DBs, how stable is it (cf. SQLite being recommended by the Library of Congress), does it assume all the data can be loaded into memory, or can I stream data in?).


What's up with flagging DuckDB articles? This seems on topic for Hacker News.



