Hacker News new | past | comments | ask | show | jobs | submit login

I've just been exploring serving large SQLite databases in chunks and querying them with HTTP range requests to avoid downloading the entire database. It's pretty awesome!

I found a really interesting library called sql.js-httpvfs[0] that does pretty much all the work. I chunked my 350MB SQLite db into 43 x 8MB pieces with the included script and uploaded them with my static files to GitHub, which gets deployed via GitHub Pages.[1]

It's in the very rough early stages but you can check it out here.

https://transcript.fish

I recommend going into the console and network tab to see it in action. It's impressively quick and I haven't even fine-tuned it at all yet. SQLite rules.

[0] https://github.com/phiresky/sql.js-httpvfs

[1] https://github.com/noman-land/transcript.fish




The timing of this is incredible. I was recently given a 120GB SQLite db to work with and I was going to investigate ways to work with it in the morning.


I’m not sure what they offer on this front, but turso is a new edge-based SQLite platform. It might be worth checking out:

https://turso.tech/


The current pricing is free for 8GB (10GB or 1B rows read /month) or $3k/month unlimited? So not really possible to start relying on it for anything serious yet (except really serious).

The middle tier isn't even fully described yet, but does say 20GB limit (and from $29/month) - I suppose that means that's a base from which you'll be able to pay for more storage as needed.


Thanks! I'll have a look at this as well.


FYI, it's using the libsql fork of SQLite rather than upstream SQLite from sqlite.org.


Like this? Serving Wikipedia as a static SQLite file, using range request: https://news.ycombinator.com/item?id=27016630


Fantastic, thank you! I came here to find this link. :)


Yes, this is the post I learned about this from and that's the person who wrote the sql.js-httpvfs library I'm using. Extremely cool.


I often use (my[0]) sqlite http vfs to query databases stored in s3. You can simply make a presigned url and then give that to the vfs and then it all just works.

[0]: https://github.com/psanford/sqlite3vfshttp


>> querying them with http range requests to prevent downloading the entire database. It's pretty awesome!

Querying isn't a problem; you can query as much as you want. But where you'll hit the SQLite limitation is in scaling for multi-user write scenarios. Yes, even SQLite can handle a few concurrent write requests, but once your users start scaling into the millions, you'll eventually need a proper client/server RDBMS like MySQL or Postgres.
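For modest write concurrency, WAL mode is the usual first step before reaching for a client/server RDBMS: still one writer at a time, but readers keep going against the last committed snapshot. A small sketch (table and file names are throwaway):

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "app.db")

w = sqlite3.connect(db)
w.execute("PRAGMA journal_mode=WAL")  # readers no longer block the writer
w.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
w.execute("INSERT INTO kv VALUES ('a', '1')")
w.commit()

r = sqlite3.connect(db)       # a concurrent reader connection
w.execute("BEGIN IMMEDIATE")  # writer takes the write lock
w.execute("UPDATE kv SET v = '2' WHERE k = 'a'")

# While the write is uncommitted, the reader still sees the old snapshot.
print(r.execute("SELECT v FROM kv").fetchall())  # [('1',)]
w.commit()
```

After the commit, a fresh read on `r` sees the new value; the point is that reads never wait on the writer, which is often enough for low-write workloads.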


I've been thinking about this a lot. In my particular use-case I can get away with having a totally read-only app.

However if I want people to be able to submit corrections to the transcriptions, I need a way to do that. I was thinking of setting up some sort of queue system to make writes non-realtime since, for this app, it's not a big deal.

It will be an interesting challenge, for sure.


sql.js-httpvfs is a read-only solution. The point is to be able to put up pre-baked datasets on any random web server w/range support, or a blob store.


I've been using sql.js-httpvfs to get full-text search on a static GitHub Pages site [1]. (Just make sure you have FTS5 enabled on your database.) It works astonishingly well.

[1] https://matrixlogs.bakkot.com/WHATWG/search?q=database


This is awesome. Maybe you can help me figure out the best way to search. The transcriptions are one row per word since each word is timestamped. I'm currently just doing client side search in JS for the titles and descriptions but I'll have to do proper SQL search for the body of the transcriptions. What's the best way to search for strings of text if the db is one row per word (~4 million rows)?


In theory you can add an index on the token text column and then implement full text searching yourself using the index to find the set of documents containing each search term, counting the occurrences per document, and combining the results from all the terms. But FTS5 already does that for you, and it's more efficient than the above.
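A hedged sketch of that do-it-yourself approach, assuming a hypothetical one-row-per-word table `words(episode, pos, word)`: an index on the word column plays the role of the inverted index, and a GROUP BY/HAVING query finds episodes containing all terms, ranked by total hits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (episode INTEGER, pos INTEGER, word TEXT)")
conn.execute("CREATE INDEX idx_word ON words(word)")  # the 'inverted index'
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    (1, 0, "hello"), (1, 1, "dark"), (1, 2, "world"),
    (2, 0, "hello"), (2, 1, "bright"), (2, 2, "moon"),
])

# Episodes containing *all* search terms, ranked by total matches.
terms = ["hello", "world"]
sql = """
  SELECT episode, COUNT(*) AS hits
  FROM words
  WHERE word IN ({})
  GROUP BY episode
  HAVING COUNT(DISTINCT word) = ?
  ORDER BY hits DESC
""".format(",".join("?" * len(terms)))
print(conn.execute(sql, [*terms, len(terms)]).fetchall())  # [(1, 2)]
```

As the parent says, FTS5 does all of this (plus tokenization and ranking) more efficiently.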

The traditional FTS5 technique as documented [0] should work, same as if this were a local SQLite database. You'll have to duplicate the transcriptions into TEXT form (one row per document) and then use FTS5 to create a virtual table for full text searching. The TEXT form won't actually be accessed for searching, it's just used to build the inverted index.

[0] https://www.sqlite.org/fts5.html
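A minimal FTS5 sketch of that duplicated-TEXT approach (table and column names are made up; this assumes an SQLite build with FTS5 enabled, which the default Python build normally has):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per document (episode), duplicated from the per-word table.
conn.execute("CREATE VIRTUAL TABLE transcripts USING fts5(episode, body)")
conn.execute("INSERT INTO transcripts VALUES ('ep1', 'hello dark world')")
conn.execute("INSERT INTO transcripts VALUES ('ep2', 'hello bright moon')")

# Space-separated terms are an implicit AND in FTS5 queries.
hits = conn.execute(
    "SELECT episode FROM transcripts WHERE transcripts MATCH ? ORDER BY rank",
    ("hello world",),
).fetchall()
print(hits)  # [('ep1',)]
```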


I don't know enough about how FTS works to help, I'm afraid. I don't know if it's possible to search across multiple rows.

Were I in your position I'd just create a copy of the DB with the full transcript in each row and run the search against that. If you only have 4 million words, creating an extra copy of the database shouldn't be prohibitively large.
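A sketch of building that collapsed copy, again assuming a hypothetical `words(episode, pos, word)` table; once each episode is one row of text, even a plain substring search works across word boundaries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (episode INTEGER, pos INTEGER, word TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    (1, 0, "hello"), (1, 1, "world"), (2, 0, "good"), (2, 1, "night"),
])

# Collapse one-row-per-word into one-row-per-episode, preserving word order.
docs: dict[int, list[str]] = {}
for episode, word in conn.execute(
        "SELECT episode, word FROM words ORDER BY episode, pos"):
    docs.setdefault(episode, []).append(word)

conn.execute("CREATE TABLE transcripts (episode INTEGER, body TEXT)")
conn.executemany("INSERT INTO transcripts VALUES (?, ?)",
                 [(ep, " ".join(ws)) for ep, ws in docs.items()])

# Substring search against the collapsed copy.
print(conn.execute("SELECT episode FROM transcripts WHERE body LIKE ?",
                   ("%hello world%",)).fetchall())  # [(1,)]
```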


I recommend checking a similar project - storing ClickHouse databases on GitHub pages for serverless analytical queries: https://github.com/ClickHouse/web-tables-demo


I don't know much about SQLite internals, but on the face of it, this sounds hacky as hell (pardon me if I'm wrong).

Wouldn't it be better to make a proper client-server API similar to traditional SQL databases, but on top of SQLite?


The point is to use dumb cheap blob storage instead of CPU-consuming server someone has to manage.


There are many use cases where scaling a traditional centralized SQL database is problematic, that can be addressed by something like this.

We have one: we run thousands of VMs at a time, all accessing the same "database". Since we already have a good amount of horizontally-scaled compute, having to maintain a separate database cluster, or large vertically-scaled database instance, to match our peak load requirements is problematic in terms of one or more of cost, complexity, and performance. In particular, horizontally-scaled distributed databases tend not to scale up and down efficiently, because of the complexity and time involved in joining the cluster, so the cost benefits of horizontal scaling of compute are lost.

An approach like this can fit well in cases like these.


I can see how this could work for a read-only database, but how would you resolve collisions on writes without making them super slow in the process?


Right, it certainly wouldn’t work for a big multi-tenant db. But the right data organization can make it work.

In our case, the data is organized such that there’s usually only one client at a time that would need to write to a given db file. Think for example of a folder per user, each containing a db file and various other files for that user. What we’re doing is actually even more granular than that - we have a folder for each object that a user could be working on at a given time.


Databases are really just data structures and algorithms along with some notion of durability.

Client/Server databases are just remote data structures. (E.g. Redis is short for "Remote Dictionary Server")

Sometimes you want your data structures and algorithms to run locally. Could be performance, privacy, cost, or any number of reasons.

Local, in-memory data structures hit a few bottlenecks. First, they may not fit in memory. A mechanism for keeping the dataset in larger storage (e.g. disk) and paging in the necessary bits as needed extends the range of datasets one can comfortably work with locally by quite a bit. That's standard SQLite.

A second potential bottleneck to local data structures is distribution. We carry computers in our pockets, on our watches, in our cars. Delivering large datasets to each of those locations may be impractical. Cloud based VFS allows the benefits of local data structures on the subset they need without requiring them to fetch the entire dataset. That can be a huge win if there's a specific subset they need.
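The "fetch only the subset you need" idea boils down to byte-range arithmetic: a cloud-backed VFS translates SQLite page reads into HTTP Range headers. A toy sketch of the mapping (the default 4096-byte page size is an assumption; real databases can use other sizes):

```python
def page_range_header(page_no: int, page_size: int = 4096) -> str:
    """HTTP Range header covering SQLite page `page_no` (pages start at 1)."""
    start = (page_no - 1) * page_size
    return f"bytes={start}-{start + page_size - 1}"
```

So a B-tree lookup that touches a handful of pages turns into a handful of small ranged GETs instead of downloading the whole file.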

It always depends on the use case, but when the case fits there are a lot of big wins here.


One obvious problem that I see with this approach is that it will break if there is any change in the storage format.

With client-server architecture, the server code owns the data format, while in this storage-level remote access case, you have to ensure that all of your clients are updated simultaneously. Depending on your architecture it might or might not be feasible.


For the typical uses for this, you'd tend to serve up a version of SQLite compiled to WASM or JS to the frontend, so you can be sure it's the same one. SQLite's storage format is also unusually stable:

https://www.sqlite.org/formatchng.html


It's better to ensure 100% compatibility with the data format so the new server versions can read the old format without conversions. For example, in ClickHouse, you can install the latest version over a version from 7 years ago, and it will work just fine with the old MergeTree tables.


The VFS layer in SQLite is begging to be used like this. For a read-only implementation (where you generate the SQLite databases out-of-band and upload them), the code nearly writes itself. It's maybe hacky in the sense of "I had to write a SQLite module for this" but it's a very nice fit for SQLite VFS.


Yes this code was very straightforward to write and this approach lends itself very well to read-only databases of large datasets with no server infrastructure overhead. It's also "free" to host for the time being.


Depends on your needs. If it's just one or even multiple clients reading from the db, then range requests of blocks are a great option. Adding a server layer is a huge leap in complexity, as you now have code that has to be deployed, managed, secured, etc. vs. just simple blob storage for chunks (S3, etc.).


It's not hacky, though. SQLite explicitly supports VFS implementations, and they're in wide use throughout industry. All this does is use Emscripten + Asyncify to compile SQLite to JavaScript and then implement a VFS that calls into regular JavaScript functions. SQLite is notoriously well tested, so if it says the VFS works a certain way, it probably does.


How so? It’s not like the CPU or the network care.


It's difficult to implement correctly if there's more than one client. There is no single source-of-truth to manage any conflicts. Might be ok if the clients have only read-only access.


Totally minor, but this website does not show anything in Lockdown Mode on the iPhone.


Interesting, thanks for the tip! I've never heard of lockdown mode. Unfortunately I don't own an iPhone. Any chance you could post an issue with steps to repro?

https://github.com/noman-land/transcript.fish/issues

Thank you!


Wow, that's so cool!


It badly needs {cursor: pointer} for the things one can click on.


Yes, indeed! This has been bugging me and I've fixed it. Thank you.



