SQLite guys, please add FDW support ala Postgres and easy foreign function support for Python and R, and you’ll corner most of analytics and data science.
Another underappreciated feature (while we're at it) would be WAL instead of undo-journaling (https://www.sqlite.org/wal.html), which enables concurrent reading and writing of SQLite databases. Has been available for some ten years or so, but is off by default.
Functions are available through the Python bindings (which are not maintained by the SQLite project), virtual tables I think, aren't. Alternate bindings (https://rogerbinns.github.io/apsw/) claim to achieve "Everything you can do from the SQLite C API" interop, which would include virtual tables.
Having written a fairly significant amount of SQLite virtual table module code, I think it isn't quite sophisticated enough to be useful. I regret building my application atop that virtual table functionality. Many parts of queries that are vital for efficient execution are not pushed down to the virtual table provider; for instance there is no way for "SELECT COUNT(*) FROM table" to do anything but retrieve every field of every row and return them to the SQLite engine, because the interface does not push the "COUNT" down to the virtual table provider. Even if your underlying data store can trivially answer the question directly, you have to pull all the data and let SQLite compute the answer. So on and so forth.
Out of curiosity, why would you ever run a SELECT COUNT(*)? What use case demands that versus, say, SELECT COUNT(col) where col is the name of a column in the table?
That was just a simplified example. "SELECT COUNT(col) FROM ..." has the same problem: from the virtual table provider's point of view, the query you submitted is "SELECT col FROM ...". There's no way for it to avoid retrieving the data even if it has a faster way just to count it. This is just one small example -- the virtual table interface is just too simplistic to support a lot of real-life scenarios.
Another good example is joins -- they aren't pushed down, and SQLite can only do joins as nested loops. All you can do is stream raw data into the SQLite engine and allow it to perform the join, even if you can do it more efficiently yourself.
In my case I was passing queries to another database engine, so every single thing that could be pushed down to the virtual table provider, I would have been able to take advantage of. But not enough stuff is.
No. That’s the way one may see the first time they’re learning what a SELECT statement is, but it’s most certainly not what you’d ever do in anything near production code.
The parent made this statement:
> ... for instance there is no way for `SELECT COUNT(star) FROM table` to do anything but retrieve every field of every row and return them to the SQLite engine.”
The only response here is, “Of course! That’s what you told it to do! Why would it do anything else—and why would you ever tell it to do that?!”
Edit: To be clear and not sound as though I’m just being dismissive, you can of course use `COUNT(star)`, but the parent made it sound like this was something non-trivial. There are better ways to write such a query when you care about the exact count of rows and, assuming you can’t depend on grabbing the max primary key due to deletions, you’re complaining about the database wasting cycles pulling back irrelevant info. Get to know your database, as it often has features to query such metadata in a much smarter and more performant way.
PS: I replaced the asterisk with the word “star” because it’s making the formatting of this comment atrocious. Apologies for all the edits. I give up now.
I'm sorry, you've misunderstood. I'm fully aware that you can query it more efficiently, and in fact that's the point. If you're writing a virtual table provider, SQLite does not provide enough information about the actual user's query for you to know that you can do it more efficiently. There's no way to know that the user wants a count at all; the count function is not pushed down into the virtual table provider interface. SQLite only tells you that it needs you to retrieve every field in every row. You can't know that, in actuality, SQLite only wants to count the rows.
Not only that, the GNOME Data Access (GDA) uses this and has virtual tables for accessing disparate remote databases (including LDAP), so we know this works. What u/usgroup may be asking for is for a bunch of virtual tables to be included in SQLite3 proper.
As far as I know you already can extend SQLite with custom scalar and aggregate functions and virtual tables (FDW) at least in Python (with apsw). Am I missing something?
Many analytics use cases are single user. I’ve often thought you could do worse than SQLite as a first pass at a dataframe implementation. And SQLite is a great fit for a range of analyses on a single-user computer where you’re looking to sample from or calculate aggregations from data that fits on harddisk but not in RAM.
Now, where SQLite starts to fall down in analytics workloads is that it’s row-oriented rather than column oriented. Performance could be better. Still, even for analytic workloads SQLite can be good enough for medium-sized data!
I dunno... In my neck of the woods, datasets larger than anything a hard drive will store are commonplace, making SQLite a non-starter. I can imagine SQLite being useful for prototyping or for toy apps or for one-offs with small datasets, but not much else...
(Edit to clarify: I think it's awesome for embedded apps. But it never struck me as a good choice for data science.)
Cool, but in between data that fits in RAM and data that requires exotic storage and computing surely you can at least imagine data of this mid-sized variety.