Besides that, the comprehensive testing and evaluation of SQLite never ceases to amaze me. I'm usually hesitant to call software development "engineering", but SQLite is definitely well-engineered.
Ditto! For anyone who isn't familiar with SQLite's testing procedures, read this[1] fascinating page.
The SQLite project has a mind-boggling 711 times as much test code as SQLite itself. Put another way, only about 0.1% of the project's code is SQLite itself. The other 99.9% consists of tests for that 0.1%.
In a similar vein, I've rarely (if ever) seen a library that handles dynamic memory allocation more robustly than SQLite. This page is a glory to behold: https://www.sqlite.org/malloc.html
That page is inspiring. For a commodity, open-source, long-lived software project, everything is very clearly defined and the language is very definitive, which I hope is indicative of the actual library quality (I've never used SQLite).
It was more a typo than a math error, though (the 99.9% figure was correct). I'll chalk it up to posting when I should have been sleeping (4:30am local time).
This looks great, but I couldn't get through the first question on aggregate functions. Are there any SQL books/tutorials that go over things like this?
A lot of the material I've seen has been like the classic "how to draw an owl" image: first draw two circles, then draw the rest of the owl. It tells you the super-basic stuff, then assumes you know everything else.
Having an understanding of relational algebra (not the symbols, but the concept; math is always about the concept) generally helps a lot as well; it's the reason why compsci database lectures often start with relational algebra.
> We would like to find the total weight of cats grouped by age. But only return those groups with a total weight larger than 12.
The total weight of cats grouped by age.
SELECT sum(weight), age
FROM cats
GROUP BY age
But only return those groups with a total weight larger than 12.
SELECT sum(weight), age
FROM cats
GROUP BY age
HAVING sum(weight) > 12
Ordered by age.
SELECT sum(weight), age
FROM cats
GROUP BY age
HAVING sum(weight) > 12
ORDER BY age
The total weight column should be called total_weight.
SELECT sum(weight) AS total_weight, age
FROM cats
GROUP BY age
HAVING sum(weight) > 12
ORDER BY age
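Since the submission is about window functions, it's worth contrasting: GROUP BY collapses each age group to a single row, while a window function keeps every row and annotates it with the group aggregate. A minimal sketch (the cats data here is made up; window functions need SQLite 3.25+):

import sqlite3  # window functions require SQLite >= 3.25

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cats(name TEXT, age INTEGER, weight REAL)")
con.executemany("INSERT INTO cats VALUES (?, ?, ?)",
                [("Ash", 1, 3.5), ("Bob", 1, 9.0), ("Cleo", 2, 6.5), ("Dot", 2, 7.0)])

# Every cat keeps its own row, annotated with its age group's total weight.
for row in con.execute("""
    SELECT name, age, weight,
           sum(weight) OVER (PARTITION BY age) AS total_weight
    FROM cats
    ORDER BY age
"""):
    print(row)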
(self plug) try pgexercises.com. It's in a very similar vein to windowfunctions - I understand it was one of the original inspirations - but it covers a much wider range of material, with a gradual growth from very basic stuff up to slightly advanced.
I'll vouch for this plug! I've spent quite a bit of time learning SQL (and Postgres specific stuff), and this was one of my favorite resources.
I can also recommend Learning SQL by Alan Beaulieu, which was recommended in some HN comment. After looking into various books, some of them Postgres specific, I found this one to be the best for learning from scratch.
The discussion at https://news.ycombinator.com/item?id=17231349 has many more links and descriptions of how to use window functions in practice. (Including my comment at the top about how in practice I find that you usually want to combine window functions with nested queries.)
For Python folks interested in using these features, you might be interested in this post [0] which describes how to compile the latest SQLite and the python sqlite3 driver. I've got a fork of the standard lib sqlite3 driver that includes support for user-defined window functions in Python as well which may interest you.
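For the curious, here's a hedged sketch of what a user-defined window function looks like from Python. It uses Connection.create_window_function, which later landed in the standard library in Python 3.11 (and needs SQLite 3.25+); the class and table names are made up:

import sqlite3  # create_window_function: Python 3.11+, SQLite 3.25+

class WindowSumInt:
    """sum() over a window frame, maintained incrementally."""
    def __init__(self):
        self.total = 0
    def step(self, value):     # a row enters the current frame
        self.total += value
    def inverse(self, value):  # a row leaves the current frame
        self.total -= value
    def value(self):           # current result for this row
        return self.total
    def finalize(self):        # result when the window closes
        return self.total

con = sqlite3.connect(":memory:")
con.create_window_function("sumint", 1, WindowSumInt)
con.execute("CREATE TABLE t(x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 6)])
for row in con.execute("SELECT x, sumint(x) OVER (ORDER BY x) FROM t"):
    print(row)  # running sums: (1, 1), (2, 3), (3, 6), (4, 10), (5, 15)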
The query optimizer improvements are pretty cool too.
That said, I don't really understand this one: "The IN-early-out optimization: When doing a look-up on a multi-column index and an IN operator is used on a column other than the left-most column, then if no rows match against the first IN value, check to make sure there exist rows that match the columns to the right before continuing with the next IN value." I would think there's no need to check the right column(s) if the leftmost one has no match...
SELECT * FROM tab WHERE key1=1 AND key2 IN (2,3,4,5);
SQLite starts by doing a single b-tree lookup on the index for (1,2) - the key composed from the key1 field and the first possibility of the key2 field. If that works, it proceeds to look up (1,3), (1,4), and (1,5). But if the (1,2) lookup fails, it backs off and probes just (1,*) to see if anything matches key1 at all. If (1,*) finds any record, the search proceeds with (1,3), (1,4), etc. But if (1,*) fails, the search stops immediately.
The insight here is that a multi-column key value can be resolved using a single binary search. It is not a sequence thing where we first look for the key1=1 and then do a separate lookup in a subtree for key2. Both key1 and key2 are resolved in the same binary search.
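In pseudocode, the loop looks something like this (my own hedged sketch, not SQLite's actual implementation; seek and seek_prefix are hypothetical stand-ins for the b-tree probes):

def in_early_out(index, key1, in_values):
    """Sketch of: SELECT * FROM tab WHERE key1=? AND key2 IN (...)"""
    for key2 in in_values:
        rows = index.seek((key1, key2))  # one binary search over both columns
        if rows:
            yield from rows
        elif not index.seek_prefix((key1,)):
            # No row matches key1 at all, so no later IN value can
            # match either: stop rather than probing the rest.
            return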
This is what I expected the optimization to be, except I'm still not sure that I understand the wording of "that match the columns to the right". I'd expect that to be "that match the columns to the left"; after all, you're checking the existence of (1,*), not of (*,2) or (*,3).
I am interested in this topic as well, although SQLite explicitly says that it is not meant for client-server configurations.
I still want to see whether it is feasible and whether anyone is using it in production.
What they mean by that comment is that it doesn't have users in the sense that MySQL, SQL Server, etc. have.
https://www.sqlite.org/whentouse.html
The documentation says that it works well for websites. For client-server, you just need to make an API that handles the interactions with the database.
I got thrown into a legacy web project that used sqlite as the database. It was a small internal-only app, I guess the original developer(s) figured it was so small that sqlite would be plenty and it would reduce the environment complexity.
Unsurprisingly, they were wrong. It was small, but SQLite couldn't handle multiple users. I believe this was before SQLite had WAL support, so reading would lock the DB. The 'solution' was to split the SQLite database into many smaller DBs, which allowed users to use the site at the same time as long as they were in different areas. This greatly added to the complexity of the application. Some reports needed to access multiple databases to get what they needed, so they would still lock people out. The complexity was much higher than having all the data in a single PostgreSQL/MySQL database. All the users hated the system and often ran into DB lock issues.
> sqlite explicitly says that it is not meant for client-server configuration
Most applications don't need the level of concurrency that Pg or MySQL provide and they will be just fine with something simple to operate, which is why I wrote https://redisql.com/
Shameless plug since I am the author, but you could be interested in https://redisql.com/
Basically a Redis module that embeds SQLite. I offer replication in the PRO version; if you want to try it out, you can download the trial version for free.
If the trial version is not enough, send me an email and we could work something out ;)
Almost all my main apps have a sync history. Before, I was using batch processing and a version field:
customer:
name, ...., version = 1
update customer:
name, ...., version = 2
and sync manually. Now I use a log/event-sourcing-like setup, where every write to the DB is also stored in a log table, and I sync from there. Much better!
Say you record an event like "name of customer X has been changed to FooCo". Instead, record "name of customer X has been changed to FooCo from Foo & Sons Co". If you want to undo it, just swap from/to (=invert it).
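A hedged sketch of that from/to idea as a SQLite trigger (table and column names are made up):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer(id INTEGER PRIMARY KEY, name TEXT);
    -- Store both old and new values: swapping the columns inverts the event.
    CREATE TABLE customer_log(
        customer_id INTEGER,
        old_name TEXT,
        new_name TEXT,
        changed_at TEXT DEFAULT (datetime('now'))
    );
    CREATE TRIGGER customer_name_audit AFTER UPDATE OF name ON customer
    BEGIN
        INSERT INTO customer_log(customer_id, old_name, new_name)
        VALUES (old.id, old.name, new.name);
    END;
""")
con.execute("INSERT INTO customer(name) VALUES ('Foo & Sons Co')")
con.execute("UPDATE customer SET name = 'FooCo' WHERE id = 1")
print(con.execute("SELECT old_name, new_name FROM customer_log").fetchall())
# [('Foo & Sons Co', 'FooCo')] -- swap the two columns to undo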
In my experience the usefulness of audit tables diminishes as the table becomes more complex (more columns; composite and document columns like arrays, JSON, XML), because it becomes difficult to find what exactly changed. Another problem is related rows: audit tables aren't very good at capturing the context of changes. You can work around this with more complex triggers, but it quickly becomes a lot easier to do this from the application rather than in the database, capturing semantics and not just data changes.
The issue with any application-based approach is of course that it can't provide total access mediation.
Bedrock [1] by Expensify is supposed to be solid. Master/slave async replication (i.e. all slaves are read only), pluggable, provides full SQLite access via the MySQL protocol.
A bit off topic, but would be great to use SQLite in the browser instead of IndexedDB.
I love relational databases, but you're almost forced into a NoSQL approach when developing a SPA since the client (browser) only supports simple key -> value storage. It would be a dream to use LINQ-to-SQL, or similar type safe query DSLs like Slick or Quill (Scala), or Esqueleto (Haskell) in the browser.
Combine that with a single language driving the backend and frontend and voila, no duplication of model, validation, etc. layers on server and client.
One can dream I guess, but the reality is NoSQL fits the modern web app like a glove, for better or worse.
> A bit off topic, but would be great to use SQLite in the browser instead of IndexedDB
That almost happened. There was a thing called WebSQL [1] that the W3C was working on to add SQL to the browser. Everyone who implemented it used SQLite.
Apparently, that disqualified it from standardization. To move ahead, they wanted to see independent implementations of the standard.
No browser makers stepped up to reduce the quality of their implementation by replacing some of the best designed, best written, best tested code on the planet with some other SQL back end to satisfy the committee, and so Mozilla was able to push IndexedDB as the standard browser DB interface.
They had to, since what was standardized was specifically the SQL dialect of SQLite v3.6.19.
> No browser makers stepped up to reduce the quality of their implementation by replacing some of the best designed, best written, best tested code on the planet with some other SQL back end to satisfy the committee
There were only two implementations at all: WebKit and Opera. Mozilla and Microsoft weren't going to implement it without a spec decoupled from a particular backend.
This almost happened. Ten years ago "WebSQL" was being considered as a standard - it would have introduced SQLite-backed databases as a built-in browser feature: https://en.wikipedia.org/wiki/Web_SQL_Database
The W3C dropped it in November 2010 because they were uncomfortable supporting a web standard which was entirely tied to a single implementation of SQL (SQLite). I adore SQLite but I think the W3C made the right call there - web standards need more than one compatible implementation.
> I adore SQLite but I think the W3C made the right call there - web standards need more than one compatible implementation.
could you elaborate on that? I get the impression that now, eight years later, we're much worse off. I don't really see the upside of not going for what is, AFAIK, a relatively vanilla implementation of SQL.
A web standard needs to have more than one implementation. If you allow one single implementation to define the standard, you get something like Flash.
I don't have a particularly strong practical argument here: for me this is more of a fundamental principle for how the web should work.
Flash was a closed source plugin that was added to the browser as a binary blob. It could be updated by the user, and because it frequently had security bugs users were frequently urged to update. Those updates also often introduced new features, which content providers started taking advantage of once a significant number of people updated, putting pressure on the rest to update too.
SQLite is open source that would have been compiled into the browser by the browser vendor. A typical user would not have the means to update it separately from the rest of the browser.
So why couldn't W3C pick a particular release of SQLite, list a specific subset of its features, and declare that this, as described in the documentation for that specific version, is the standard?
I really prefer that syntax and I’m always a little sad when I have to copy-paste windows across average, total, count, standard deviation, max, min… I fully admit that it’s syntactic sugar but it’s the elegant kind.
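For anyone who hasn't seen it: the named WINDOW clause (SQLite 3.25+, also in Postgres) lets you define the window once and reuse it across all those functions. A minimal sketch with a made-up schema:

import sqlite3  # WINDOW clause requires SQLite >= 3.25

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cats(name TEXT, age INTEGER, weight REAL)")
con.executemany("INSERT INTO cats VALUES (?, ?, ?)",
                [("Ash", 1, 3.5), ("Bob", 1, 9.0), ("Cleo", 2, 6.5)])
# Define the window once, reuse it for every aggregate.
for row in con.execute("""
    SELECT name,
           avg(weight)  OVER w,
           sum(weight)  OVER w,
           count(*)     OVER w,
           max(weight)  OVER w,
           min(weight)  OVER w
    FROM cats
    WINDOW w AS (PARTITION BY age)
"""):
    print(row)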
There is nothing stopping you from doing it yourself. The Sqlite FAQ[1] even has an entry on it.
Two things are holding this back: 1. the code being inside or outside SQLite would not make much difference; 2. the amount of additional test code would be humongous for an operation of that level of complexity.
A lot of table alterations can be done without rebuilding the table, though perhaps a table scan might be needed to validate constraints (though this should optionally be a separate statement). To do this safely requires the RDBMS to support the functionality.
Well if you don't mind rewriting tables, just create a table in the new structure, insert data from the old table to the new one, drop the old table, and rename the new table.
Lock db exclusively, disable foreign key checking, rebuild and replace table, reconfigure foreign keys (they are gone after delete and rename), and everything should be fine.
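A hedged sketch of that recipe in SQLite (hypothetical schema and file name; the official checklist at https://www.sqlite.org/lang_altertable.html has more steps, e.g. carrying over indexes, triggers, and views):

import sqlite3

# Assumes an existing app.db with a cats table we want to rebuild.
con = sqlite3.connect("app.db")
con.execute("PRAGMA foreign_keys=OFF")   # disable FK enforcement for the swap
con.executescript("""
    BEGIN;
    CREATE TABLE cats_new(name TEXT, age INTEGER, weight REAL);
    INSERT INTO cats_new SELECT name, age, weight FROM cats;
    DROP TABLE cats;
    ALTER TABLE cats_new RENAME TO cats;
    COMMIT;
""")
con.execute("PRAGMA foreign_key_check")  # verify nothing dangles
con.execute("PRAGMA foreign_keys=ON")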
Use the same database for test and production. Nowadays, it is really easy to install PostgreSQL or MySQL locally or in your test environment. SQLite is great, but not for replacing PostgreSQL or MySQL during tests.
Create new empty table, set up trigger on old table to copy across any inserted/updated rows, backfill the remaining rows incrementally, and finally use ALTER TABLE to atomically replace the old table with the new one.
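Roughly like this, as a hedged sketch with made-up names (a real migration would also need UPDATE/DELETE triggers, not just INSERT):

import sqlite3

con = sqlite3.connect("app.db")
con.executescript("""
    CREATE TABLE IF NOT EXISTS users(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE IF NOT EXISTS users_new(id INTEGER PRIMARY KEY, name TEXT);
    -- Mirror new writes into the new table while the backfill runs.
    CREATE TRIGGER IF NOT EXISTS users_mirror AFTER INSERT ON users
    BEGIN
        INSERT OR REPLACE INTO users_new(id, name) VALUES (new.id, new.name);
    END;
""")

# Backfill in small batches; writers are blocked only for each short transaction.
last_id = 0
while True:
    with con:
        rows = con.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT 1000",
            (last_id,)).fetchall()
        if not rows:
            break
        con.executemany(
            "INSERT OR REPLACE INTO users_new(id, name) VALUES (?, ?)", rows)
        last_id = rows[-1][0]

# Finally, swap the tables inside one transaction.
con.executescript("""
    BEGIN;
    DROP TRIGGER users_mirror;
    ALTER TABLE users RENAME TO users_old;
    ALTER TABLE users_new RENAME TO users;
    COMMIT;
""")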
When you wrote "backfill the remaining rows incrementally", did you mean having some background process or thread that reads a batch of rows from the old table (for example a few thousands of row), then inserts them in the new table, then commits, and keep doing this until all rows are copied? This way, other writers will be blocked only for the duration of each incremental transaction and will have an opportunity to lock the database for themselves between two batches?
Do you have experience doing this on production servers?
Unrelated to window functions, but I finally took the time to start digging into SQLite's internals. People always sing its praises, so it was time to see what all the fuss was about.
Someone else already mentioned that the vast majority of SQLite's codebase is tests. Well, on top of that, of the real "working" codebase I'd say the majority is comments. It's incredible. The source is more book than code. If you have any when, why, or how question about SQLite, I guarantee it's answered in the code comments (or at least in one of the hundreds of superb documents on their website).
Another surprise I discovered: SQLite has a virtual machine and its own bytecode. All queries you execute against a SQLite database are compiled into SQLite's own little bytecode and then executed on a VM designed for working with SQLite's database. Go ahead, start `sqlite3 yourdb.sqlite` and then run `explain select * from yourtable;`. It'll dump the bytecode for that statement; or any statement you put after `explain`. So cool!
In hindsight, it makes a lot of sense, and a well built VM can be nearly as efficient as any other alternative.
Fun bit of history. The VM used to be stack based, but now it's register based. I guess they learned the same lessons the rest of the industry learned over that time period :P
(N.B. the bytecode is for internal use only; it's not a public facing API. You should never, ever use bytecode directly yourself.)
There are some painful parts of the codebase, though. These aren't "cons" per se. More like necessary evils.
1) It is filled to the brim with backwards compatibility hacks that make the code more complex than it strictly needs to be. (Most of these are the result of various users of the library misusing the API. The SQLite devs are generous enough to grandfather in the bugs that made those applications work. That's excellent, but it definitely makes the code more "crusty".)
2) One of SQLite's big features is its flexible memory subsystem. It handles OOM, and provides an API for completely customizing the memory subsystem. But given that this is C, and memory allocation is pervasive, the code ends up littered with allocation calls and OOM-handling clauses. Handling OOM is no small task, and how to handle an OOM often differs from place to place. So you can imagine the complexity that adds to the codebase.
Again, those are necessary evils, so it's not something I'm "complaining" about. But I thought they were worth mentioning for fellow adventurers like me who decide to dive in (which I highly recommend).
So, thanks to how well designed SQLite is overall, and their great documentation, I was able to write a parser in Rust for the SQLite file format in a handful of hours (https://sqlite.org/fileformat2.html). The file format is surprisingly simple. I'm now writing a Cursor to walk the tables, which is a fun exercise of classic B-Tree algorithms.
> It is filled to the brim with backwards compatibility hacks that make the code more complex than it strictly needs to be
They finally ended sqlite4 as an architectural misadventure, but perhaps they should try again with the current codebase and features, with all of the backwards cruft jettisoned.
Or if you're feeling lazy (like I was), there's a fork of his library at https://github.com/karlb/pysqlite3 which compiles the 3.25.0 amalgamation by default. This worked for me:
$ python3 -mvirtualenv venv
$ source venv/bin/activate
$ pip install git+git://github.com/karlb/pysqlite3
Collecting git+git://github.com/karlb/pysqlite3
...
Installing collected packages: pysqlite3
Successfully installed pysqlite3-0.2.0
$ python
Python 3.6.5 (default, Mar 30 2018, 06:41:53)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pysqlite3
>>> pysqlite3.connect(":memory:").execute("select sqlite_version()").fetchall()
[('3.25.0',)]
I just checked a stripped copy of postgresql-10 built with all options on and it is 7MB. That could be reduced a little by leaving out language support and ssl support etc.
While binary size could certainly be a factor in a good deal of embedded environments, we also need to look at the resource requirements of the binary in question as well.
SQLite doesn't need much more memory than what its binary needs, whereas with PostgreSQL you need all sorts of bells and whistles just to get the database system to boot.
Well, sure, SQLite is smaller than Postgres, but database system performance is largely dictated by the size of the buffer cache. For something like a configuration database (a great use of SQLite) this does not matter. But for more than a very modest amount of data, more memory for buffers will benefit both PostgreSQL and SQLite.
Anyway, I'm not trying to make the case that PostgreSQL is small compared to SQLite (it obviously isn't), just to point out that it's not _that_ big either.
If you look at the build instructions you will see that in the ./configure step you can enable and disable different features. Try ./configure --help to get a list.
It is dynamically linked to openssl, but you can configure out the internal code that uses it.
"strip" however meant the unix strip command to remove debug symbols.
To me, what makes it 'lite' is that it doesn't have a client-server architecture that requires a running daemon. It's just a file format and a library designed to interact with it.
I just had a look at the executable of my latest - extremely simple - webapp. A 64 bit elf, statically linked with Musl libc and SQLite, written in Nim.
1.4MB all told, with no optimizations whatsoever. In the year 2018 AD, we count that as light.
I could swear I've seen you say this. Maybe I'm mis-remembering? ISTR it was that for each table source in a query SQLite3 uses just one index (or the table itself) for indexing or scanning to find relevant rows in that table source.
SQLite can use multiple indexes if there are OR terms in the WHERE clause.
SQLite tries to use indexes only in situations where they help the query run faster. SQLite is not limited in its use of indexes; it is just that using multiple indexes for a single FROM-clause term is rarely helpful.
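You can see the OR case with EXPLAIN QUERY PLAN; a hedged sketch (the exact plan text varies between SQLite versions):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t(a INTEGER, b INTEGER);
    CREATE INDEX t_a ON t(a);
    CREATE INDEX t_b ON t(b);
""")
# OR terms can use one index per term ("MULTI-INDEX OR" in the plan).
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE a=1 OR b=2"):
    print(row)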
There's a whole category of analysis queries that are really easy to write with window functions and very annoying to write without them. They help eliminate nested subqueries and self-joins. This should make SQLite a better choice for data analysis and reporting.
SQLite guys, please add FDW support a la Postgres and easy foreign function support for Python and R, and you'll corner most of analytics and data science.
Another underappreciated feature (while we're at it) would be WAL instead of undo-journaling (https://www.sqlite.org/wal.html), which enables concurrent reading and writing of SQLite databases. Has been available for some ten years or so, but is off by default.
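Turning it on is a one-liner, and the setting persists in the database file across connections (a minimal sketch; the file name is made up):

import sqlite3

con = sqlite3.connect("app.db")
# journal_mode=WAL is stored in the database file itself and stays in
# effect for all future connections until changed back.
print(con.execute("PRAGMA journal_mode=WAL").fetchone())  # ('wal',)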
Functions are available through the Python bindings (which are not maintained by the SQLite project); virtual tables, I think, aren't. Alternate bindings (https://rogerbinns.github.io/apsw/) claim "everything you can do from the SQLite C API" interop, which would include virtual tables.
Having written a fairly significant amount of SQLite virtual table module code, I think it isn't quite sophisticated enough to be useful. I regret building my application atop that virtual table functionality. Many parts of queries that are vital for efficient execution are not pushed down to the virtual table provider; for instance there is no way for "SELECT COUNT(*) FROM table" to do anything but retrieve every field of every row and return them to the SQLite engine, because the interface does not push the "COUNT" down to the virtual table provider. Even if your underlying data store can trivially answer the question directly, you have to pull all the data and let SQLite compute the answer. So on and so forth.
Out of curiosity, why would you ever run a SELECT COUNT(*)? What use case demands that versus, say, SELECT COUNT(col) where col is the name of a column in the table?
That was just a simplified example. "SELECT COUNT(col) FROM ..." has the same problem: from the virtual table provider's point of view, the query you submitted is "SELECT col FROM ...". There's no way for it to avoid retrieving the data even if it has a faster way just to count it. This is just one small example -- the virtual table interface is just too simplistic to support a lot of real-life scenarios.
Another good example is joins -- they aren't pushed down, and SQLite can only do joins as nested loops. All you can do is stream raw data into the SQLite engine and allow it to perform the join, even if you can do it more efficiently yourself.
In my case I was passing queries to another database engine, so every single thing that could be pushed down to the virtual table provider, I would have been able to take advantage of. But not enough stuff is.
No. That's the way one may write a query the first time they're learning what a SELECT statement is, but it's most certainly not what you'd ever do in anything near production code.
The parent made this statement:
> ... for instance there is no way for `SELECT COUNT(star) FROM table` to do anything but retrieve every field of every row and return them to the SQLite engine.
The only response here is, “Of course! That’s what you told it to do! Why would it do anything else—and why would you ever tell it to do that?!”
Edit: To be clear and not sound as though I’m just being dismissive, you can of course use `COUNT(star)`, but the parent made it sound like this was something non-trivial. There are better ways to write such a query when you care about the exact count of rows and, assuming you can’t depend on grabbing the max primary key due to deletions, you’re complaining about the database wasting cycles pulling back irrelevant info. Get to know your database, as it often has features to query such metadata in a much smarter and more performant way.
PS: I replaced the asterisk with the word “star” because it’s making the formatting of this comment atrocious. Apologies for all the edits. I give up now.
I'm sorry, you've misunderstood. I'm fully aware that you can query it more efficiently, and in fact that's the point. If you're writing a virtual table provider, SQLite does not provide enough information about the actual user's query for you to know that you can do it more efficiently. There's no way to know that the user wants a count at all; the count function is not pushed down into the virtual table provider interface. SQLite only tells you that it needs you to retrieve every field in every row. You can't know that, in actuality, SQLite only wants to count the rows.
Not only that, GNOME Data Access (GDA) uses this and has virtual tables for accessing disparate remote databases (including LDAP), so we know this works. What u/usgroup may be asking for is for a bunch of virtual tables to be included in SQLite3 proper.
As far as I know you already can extend SQLite with custom scalar and aggregate functions and virtual tables (FDW) at least in Python (with apsw). Am I missing something?
Many analytics use cases are single user. I’ve often thought you could do worse than SQLite as a first pass at a dataframe implementation. And SQLite is a great fit for a range of analyses on a single-user computer where you’re looking to sample from or calculate aggregations from data that fits on harddisk but not in RAM.
Now, where SQLite starts to fall down in analytics workloads is that it's row-oriented rather than column-oriented. Performance could be better. Still, even for analytic workloads SQLite can be good enough for medium-sized data!
I dunno... In my neck of the woods, datasets larger than anything a hard drive will store are commonplace, making SQLite a non-starter. I can imagine SQLite being useful for prototyping or for toy apps or for one-offs with small datasets, but not much else...
(Edit to clarify: I think it's awesome for embedded apps. But it never struck me as a good choice for data science.)
Cool, but in between data that fits in RAM and data that requires exotic storage and computing surely you can at least imagine data of this mid-sized variety.