Hacker News new | past | comments | ask | show | jobs | submit login

I've just been exploring serving large SQLite databases in chunks and querying them with HTTP range requests to avoid downloading the entire database. It's pretty awesome!

I found a really interesting library called sql.js-httpvfs[0] that does pretty much all the work. I chunked my 350MB SQLite db into 43 x 8MB pieces with the included script and uploaded them with my static files to GitHub, which gets deployed via GitHub Pages.[1]

It's in the very rough early stages but you can check it out here.

https://transcript.fish

I recommend going into the console and network tab to see it in action. It's impressively quick and I haven't even fine-tuned it at all yet. SQLite rules.

[0] https://github.com/phiresky/sql.js-httpvfs

[1] https://github.com/noman-land/transcript.fish




The timing of this is incredible. I was recently given a 120GB SQLite db to work with and I was going to investigate ways to work with it in the morning.


I’m not sure what they offer on this front, but turso is a new edge-based SQLite platform. It might be worth checking out:

https://turso.tech/


The current pricing is free for 8GB (10GB or 1B rows read /month) or $3k/month unlimited? So not really possible to start relying on it for anything serious yet (except really serious).

The middle tier isn't even fully described yet, but does say 20GB limit (and from $29/month) - I suppose that means that's a base from which you'll be able to pay for more storage as needed.


Thanks! I'll have a look at this as well.


FYI, it's using the libsql fork of SQLite rather than upstream SQLite from sqlite.org.


Like this? Serving Wikipedia as a static SQLite file, using range request: https://news.ycombinator.com/item?id=27016630


Fantastic, thank you! I came here to find this link. :)


Yes, this is the post I learned about this from and that's the person who wrote the sql.js-httpvfs library I'm using. Extremely cool.


I often use (my[0]) sqlite http vfs to query databases stored in s3. You can simply make a presigned url and then give that to the vfs and then it all just works.

[0]: https://github.com/psanford/sqlite3vfshttp


>> querying them with http range requests to prevent downloading the entire database. It's pretty awesome!

Querying isn't a problem; you can query as much as you want. But where you'll hit the SQLite limitation is in scaling for multi-user write scenarios. Yes, even SQLite can handle a few concurrent write requests, but once your users start scaling into the millions, you'll eventually need a proper client/server RDBMS like MySQL or Postgres.
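For modest write concurrency, WAL mode is the usual first step before reaching for a client/server RDBMS: still one writer at a time, but readers keep going against the last committed snapshot. A small sketch (table and file names are throwaway):

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "app.db")

w = sqlite3.connect(db)
w.execute("PRAGMA journal_mode=WAL")  # readers no longer block the writer
w.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
w.execute("INSERT INTO kv VALUES ('a', '1')")
w.commit()

r = sqlite3.connect(db)       # a concurrent reader connection
w.execute("BEGIN IMMEDIATE")  # writer takes the write lock
w.execute("UPDATE kv SET v = '2' WHERE k = 'a'")

# While the write is uncommitted, the reader still sees the old snapshot.
print(r.execute("SELECT v FROM kv").fetchall())  # [('1',)]
w.commit()
```

After the commit, a fresh read on `r` sees the new value; the point is that reads never wait on the writer, which is often enough for low-write workloads.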


I've been thinking about this a lot. In my particular use-case I can get away with having a totally read-only app.

However if I want people to be able to submit corrections to the transcriptions, I need a way to do that. I was thinking of setting up some sort of queue system to make writes non-realtime since, for this app, it's not a big deal.

It will be an interesting challenge, for sure.


sql.js-httpvfs is a read-only solution. The point is to be able to put up pre-baked datasets on any random web server w/range support, or a blob store.


I've been using sql.js-httpvfs to get full-text search on a static GitHub Pages site [1]. (Just make sure you have FTS5 enabled on your database.) It works astonishingly well.

[1] https://matrixlogs.bakkot.com/WHATWG/search?q=database


This is awesome. Maybe you can help me figure out the best way to search. The transcriptions are one row per word since each word is timestamped. I'm currently just doing client side search in JS for the titles and descriptions but I'll have to do proper SQL search for the body of the transcriptions. What's the best way to search for strings of text if the db is one row per word (~4 million rows)?


In theory you can add an index on the token text column and then implement full text searching yourself using the index to find the set of documents containing each search term, counting the occurrences per document, and combining the results from all the terms. But FTS5 already does that for you, and it's more efficient than the above.
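A hedged sketch of that do-it-yourself approach, assuming a hypothetical one-row-per-word table `words(episode, pos, word)`: an index on the word column plays the role of the inverted index, and a GROUP BY/HAVING query finds episodes containing all terms, ranked by total hits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (episode INTEGER, pos INTEGER, word TEXT)")
conn.execute("CREATE INDEX idx_word ON words(word)")  # the 'inverted index'
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    (1, 0, "hello"), (1, 1, "dark"), (1, 2, "world"),
    (2, 0, "hello"), (2, 1, "bright"), (2, 2, "moon"),
])

# Episodes containing *all* search terms, ranked by total matches.
terms = ["hello", "world"]
sql = """
  SELECT episode, COUNT(*) AS hits
  FROM words
  WHERE word IN ({})
  GROUP BY episode
  HAVING COUNT(DISTINCT word) = ?
  ORDER BY hits DESC
""".format(",".join("?" * len(terms)))
print(conn.execute(sql, [*terms, len(terms)]).fetchall())  # [(1, 2)]
```

As the parent says, FTS5 does all of this (plus tokenization and ranking) more efficiently.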

The traditional FTS5 technique as documented [0] should work, same as if this were a local SQLite database. You'll have to duplicate the transcriptions into TEXT form (one row per document) and then use FTS5 to create a virtual table for full text searching. The TEXT form won't actually be accessed for searching, it's just used to build the inverted index.

[0] https://www.sqlite.org/fts5.html
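A minimal FTS5 sketch of that duplicated-TEXT approach (table and column names are made up; this assumes an SQLite build with FTS5 enabled, which the default Python build normally has):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per document (episode), duplicated from the per-word table.
conn.execute("CREATE VIRTUAL TABLE transcripts USING fts5(episode, body)")
conn.execute("INSERT INTO transcripts VALUES ('ep1', 'hello dark world')")
conn.execute("INSERT INTO transcripts VALUES ('ep2', 'hello bright moon')")

# Space-separated terms are an implicit AND in FTS5 queries.
hits = conn.execute(
    "SELECT episode FROM transcripts WHERE transcripts MATCH ? ORDER BY rank",
    ("hello world",),
).fetchall()
print(hits)  # [('ep1',)]
```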


I don't know enough about how FTS works to help, I'm afraid. I don't know if it's possible to search across multiple rows.

Were I in your position I'd just create a copy of the DB with the full transcript in each row and run the search against that. If you only have 4 million words, creating an extra copy of the database shouldn't be prohibitively large.
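A sketch of building that collapsed copy, again assuming a hypothetical `words(episode, pos, word)` table; once each episode is one row of text, even a plain substring search works across word boundaries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (episode INTEGER, pos INTEGER, word TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    (1, 0, "hello"), (1, 1, "world"), (2, 0, "good"), (2, 1, "night"),
])

# Collapse one-row-per-word into one-row-per-episode, preserving word order.
docs: dict[int, list[str]] = {}
for episode, word in conn.execute(
        "SELECT episode, word FROM words ORDER BY episode, pos"):
    docs.setdefault(episode, []).append(word)

conn.execute("CREATE TABLE transcripts (episode INTEGER, body TEXT)")
conn.executemany("INSERT INTO transcripts VALUES (?, ?)",
                 [(ep, " ".join(ws)) for ep, ws in docs.items()])

# Substring search against the collapsed copy.
print(conn.execute("SELECT episode FROM transcripts WHERE body LIKE ?",
                   ("%hello world%",)).fetchall())  # [(1,)]
```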


I recommend checking a similar project - storing ClickHouse databases on GitHub pages for serverless analytical queries: https://github.com/ClickHouse/web-tables-demo


I don't know much about SQLite internals, but on the face of it, this sounds hacky as hell (pardon me if I'm wrong).

Wouldn't it be better to make a proper client-server API similar to traditional SQL databases, but on top of SQLite?


The point is to use dumb cheap blob storage instead of CPU-consuming server someone has to manage.


There are many use cases where scaling a traditional centralized SQL database is problematic, that can be addressed by something like this.

We have one: we run thousands of VMs at a time, all accessing the same "database". Since we already have a good amount of horizontally-scaled compute, having to maintain a separate database cluster, or large vertically-scaled database instance, to match our peak load requirements is problematic in terms of one or more of cost, complexity, and performance. In particular, horizontally-scaled distributed databases tend not to scale up and down efficiently, because of the complexity and time involved in joining the cluster, so the cost benefits of horizontal scaling of compute are lost.

An approach like this can fit well in cases like these.


I can see how this could work for a read-only database, but how would you resolve collisions on writes without making them super slow in the process?


Right, it certainly wouldn’t work for a big multi-tenant db. But the right data organization can make it work.

In our case, the data is organized such that there’s usually only one client at a time that would need to write to a given db file. Think for example of a folder per user, each containing a db file and various other files for that user. What we’re doing is actually even more granular than that - we have a folder for each object that a user could be working on at a given time.


Databases are really just data structures and algorithms along with some notion of durability.

Client/Server databases are just remote data structures. (E.g. Redis is short for "Remote Dictionary Server")

Sometimes you want your data structures and algorithms to run locally. Could be performance, privacy, cost, or any number of reasons.

Local, in-memory data structures hit a few bottlenecks. First, they may not fit in memory. A mechanism for keeping the dataset in larger storage (e.g. disk) and paging in the necessary bits as needed extends the range of datasets one can comfortably work with locally by quite a bit. That's standard SQLite.

A second potential bottleneck to local data structures is distribution. We carry computers in our pockets, on our watches, in our cars. Delivering large datasets to each of those locations may be impractical. Cloud based VFS allows the benefits of local data structures on the subset they need without requiring them to fetch the entire dataset. That can be a huge win if there's a specific subset they need.
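The "fetch only the subset you need" idea boils down to byte-range arithmetic: a cloud-backed VFS translates SQLite page reads into HTTP Range headers. A toy sketch of the mapping (the default 4096-byte page size is an assumption; real databases can use other sizes):

```python
def page_range_header(page_no: int, page_size: int = 4096) -> str:
    """HTTP Range header covering SQLite page `page_no` (pages start at 1)."""
    start = (page_no - 1) * page_size
    return f"bytes={start}-{start + page_size - 1}"
```

So a B-tree lookup that touches a handful of pages turns into a handful of small ranged GETs instead of downloading the whole file.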

It always depends on the use case, but when the case fits there are a lot of big wins here.


One obvious problem that I see with this approach is that it will break if there is any change in the storage format.

With client-server architecture, the server code owns the data format, while in this storage-level remote access case, you have to ensure that all of your clients are updated simultaneously. Depending on your architecture it might or might not be feasible.


For the typical uses for this, you'd tend to serve up a version of SQLite compiled to WASM or JS to the frontend, so you can be sure it's the same one. SQLite's storage format is also unusually stable:

https://www.sqlite.org/formatchng.html


It's better to ensure 100% compatibility with the data format so the new server versions can read the old format without conversions. For example, in ClickHouse, you can install the latest version over a version from 7 years ago, and it will work just fine with the old MergeTree tables.


The VFS layer in SQLite is begging to be used like this. For a read-only implementation (where you generate the SQLite databases out-of-band and upload them), the code nearly writes itself. It's maybe hacky in the sense of "I had to write a SQLite module for this" but it's a very nice fit for SQLite VFS.


Yes this code was very straightforward to write and this approach lends itself very well to read-only databases of large datasets with no server infrastructure overhead. It's also "free" to host for the time being.


Depends on your needs. If it's just one or even multiple clients reading from the db, then range requests of blocks are a great option. Adding a server layer is a huge leap in complexity, as you now have code that has to be deployed, managed, secured, etc. vs. just simple blob storage for chunks (S3, etc.).


It's not hacky, though. SQLite explicitly supports VFS implementations, and they're in wide use throughout industry. All this does is use Emscripten + Asyncify to compile SQLite to JavaScript and then implement a VFS that calls into regular JavaScript functions. SQLite is notoriously well tested, so if it says the VFS works a certain way, it probably does.


How so? It’s not like the CPU or the network care.


It's difficult to implement correctly if there's more than one client. There is no single source-of-truth to manage any conflicts. Might be ok if the clients have only read-only access.


Totally minor, but this website does not show anything in Lockdown Mode on the iPhone.


Interesting, thanks for the tip! I've never heard of lockdown mode. Unfortunately I don't own an iPhone. Any chance you could post an issue with steps to repro?

https://github.com/noman-land/transcript.fish/issues

Thank you!


Wow, that's so cool!


It badly needs {cursor: pointer} for the things one can click on.


Yes, indeed! This has been bugging me and I've fixed it. Thank you.



