Hosting SQLite databases on GitHub Pages or any static file hoster

simonw · on May 2, 2021

Pop open the network pane in the browser tools and try running this SQL query for a demo of how clever this is:

    select country_code, long_name from wdi_country
    order by rowid desc limit 100

It fetches just 54.2KB of new data (across 49 small HTTP requests) to return 100 results - from a statically hosted database file that's 668.8MB!

I have an animated GIF demo of this here: https://twitter.com/simonw/status/1388933800445452290

phiresky · on May 2, 2021

Huh, that's actually kind of a worst case I didn't think about: Since you're doing a reverse table scan my "sequential access" detection doesn't kick in. If you do the same query but with a forward scan it should fetch roughly the same amount of data but only do like 5 HTTP requests since the request size doubles for every sequential access.

e.g.:

select country_code, long_name from wdi_country where rowid >= 164 order by rowid asc limit 100;

vladf · on May 2, 2021

I solved a similar problem recently: given a stream of data, how should you choose packet size in an online way to minimize regret (a linear combination of spare capacity of last packet and total packets used).

Turns out doubling isn’t the best strategy. The optimal solution is actually to add a constant increment to packet size. How much depends on relative cost of the terms in the regret function.

brandmeyer · on May 2, 2021

> I’ve set the page size to 1 KiB for this database.

> That’s because I implemented a pre-fetching system that tries to detect access patterns through three separate virtual read heads and exponentially increases the request size for sequential reads.

> Since you're doing a reverse table scan my "sequential access" detection doesn't kick in.

You know, starting off with the default 4kB page size naturally adds some resistance to these kinds of failure cases. If the VFS isn't issuing many requests in parallel, I would think that setting up a page size near target_bandwidth * round_trip_time would be a better initial guess. 1kB would be appropriate for a pretty low latency-bandwidth product.

phiresky · on May 2, 2021

That's true, but it also means that random access will always use at least that amount of data even if it only has to fetch a tiny amount. I did a few (non-scientific) benchmarks on a few queries and 1kB seemed like an OK compromise.

And note that the request chunk size is bound to the SQLite page size, and to change that page size you have to rewrite the whole DB. So it can't be set on the fly unless you have multiple copies of the database.

pjc50 · on May 2, 2021

1kb fits in most IP MTU sizes, so that seems reasonable.

kelnos · on May 3, 2021

Do most HTTP responses have less than ~500 bytes of headers? I guess specifically here, GH pages' responses.

It looks like one of the requests made to the DB included a little over 700 bytes of response status line and headers, so that would probably end up spilling into more than one response packet, unfortunately.

londons_explore · on May 5, 2021

With http/3 header compression, I assume the answer is yes.

simonw · on May 2, 2021

Hah, I thought "in reverse order by ID" might be a stress test but I was still very impressed by how it performed!

bbkane · on May 2, 2021

Would it be possible to use a datasette frontend with this as a "backend" and statically host the whole thing?

simonw · on May 2, 2021

Not easily - Datasette is written in Python, so you could try running it in WebAssembly (like Mozilla did for Jupyter with https://hacks.mozilla.org/2019/03/iodide-an-experimental-too... ) but it would be enough work that it might be easier to reimplement a subset of Datasette directly in JavaScript.

rewq4321 · on May 7, 2021

I've got no experience with Datasette, but if it's all in pure python, then you can use Pyodide[0] to run it within a wasm module. I was surprised how easy it was - took about 15 minutes and I had a Python package working perfectly in the browser. The Python package is about 3x faster than the JS/wasm version, but I'm guessing that performance gap will narrow when we get wasm-gc.

[0] https://github.com/pyodide/pyodide

bbkane · on May 2, 2021

Thank you!

mohanmca · on May 6, 2021

Amazing use of SQLite! Thanks!

f430 · on May 2, 2021

mind blown. how is this possible???

Salgat · on May 3, 2021

Use Http Range (normally used for pausing and continuing large file downloads) to request specific byte ranges from the file. From there you can pull only what you need. With sql indexes it'll be very tiny since the lookup is optimized. Of course if you select *, you're still going to pull the entire database locally.

havernator · on May 3, 2021

TL;DR http, properly implemented, supports a ton more stuff than even many “web developers” are aware of, like… range requests, which are exactly what you’d think they’d be.

teruakohatu · on May 3, 2021

Also SQLite store data in pages and the page size can be tweaked. Combined with range requests any part of the database can be requested.

Cullinet · on May 3, 2021

the most recent update to the W3C's own research webserver, written in Java, called Jigsaw, seems to be dated in 2007. I used it for a lot of purposes until 2002 but I don't know why I stopped working with Jigsaw only that by the time F# emerged in 2004 I was absorbed into a new direction :

https://jigsaw.w3.org/

iirc Jigsaw was used to develop and validate the WebDAV protocols and XQUERY which at the time I remember thinking XQUERY was sure to be the future as implementations of advanced data management and manipulation and query and distribution and declaration looked to be what the whole point of webservers were for. The incredible distractions caused by "rich media" as opposed to multimedia as it was understood then, are really worth thinking about. Saying that, however, the BBC is doing excellent work on restoring the balance of necessary powers to the network standards engineers https://www.bbc.co.uk/sounds/help/questions/about-bbc-sounds...

https://www.bbc.co.uk/rd/blog/2014-03-media-source-extension...

https://www.bbc.co.uk/rd/projects/nearly-live-production

tuukkah · on May 3, 2021

The last link was a very interesting read - I wonder if BBC or anyone else has open-sourced a video streaming and editing system like that.

Connecting this to HTTP range requests, the edited video feed can consist of a list of the exact byte ranges that the clients need to download and play. Found this description of how Akamai uses range requests to serve low-latency streams: https://blogs.akamai.com/2020/11/using-ll-hls-with-byte-rang...

f430 · on May 3, 2021

gpt-2 ?

Cullinet · on May 3, 2021

BTW thank you havernator, because I have just realised what I can do with the setup I'm almost ready to pull the trigger on that'll give me a surfeit of online capacity (at least a baseload can be maintained while the rest is used for work instead of cloud time) : I am definitely going to investigate the possibility of providing a high level of standards specifications for simple web serving. If the W3C Jigsaw project had been maintained, I'd simply put it up and invite interested users to persuade me to send them a shell login. OK obviously that's far too naive today, but I would love to run a especially standards compliant host for negligible nominal or even no charge so people could maybe get a view of better ways to present the WWW.

frankly I think that unless we do things like this, the Internet is simply going to become a closed shop to anyone not wielding enterprise budgets and legal department capabilities.

batch12 · on May 3, 2021

Not trying to offend, but this comment was hard to follow in a weird way. Along with your profile this makes me wonder-- are you gpt-2?

wildrhythms · on May 3, 2021

Their profile contains a full essay of near incomprehensible mumbo jumbo:

https://news.ycombinator.com/user?id=Cullinet

amjd · on May 3, 2021

So I'm not alone. I felt that too!

WJW · on May 2, 2021

So assuming no country has a name longer than 98 characters and that all country codes are 2 characters, that is over 500% overhead? Are you missing a /s in your post?

phiresky · on May 2, 2021

Since random accesses across the internet are really slow, for this kind of fairly small table (where SQLite stores the row data inline within the B-Tree of the table) it basically fetches the whole content for each row - so even if you query only the long_name and country_code column, it will in fact fetch the data of all 29 columns in that table.

If you want it to fetch less data for querying a subset of columns, you could create create an index on those columns - then SQLite will do an COVERING INDEX scan and thus read only the necessary data (with the B-Tree itself and the start / end page alignment being the only overhead).

cookguyruffles · on May 2, 2021

Nothing to add to the conversation, just wanted to say I absolutely adore this. Years ago I used to think it'd be nice if search engines simply exposed all their shards to clients and let clients do all the results merging and suchlike. Of course that's probably a terrible idea in practice for performance, but this library is definitely implementing the spirit in a very practical way!

It also reminded me of a vague inverse of this hack. In old versions of Qemu (possibly it is still implemented, but I have vague memories it got ripped out), you could point Qemu at a directory on disk and it'd produce an emulated floopy disk drive with a virtual FAT12 image containing the directory contents. AFAIK it didn't keep the actual data in memory, I guess all it needed was file sizes to know how to build a virtual memory mapping that contained the filesystem metadata + proxied reads from the underlying files for data sectors. I look forward to seeing your implementation of this concept in a virtualized SQLite file <-> GraphQL proxy ;)

edit: insane, it still exists and apparently supports write mode?! https://en.wikibooks.org/wiki/QEMU/Devices/Storage#Virtual_F...

watermelon0 · on May 2, 2021

Have you actually read the article? SQLite is unmodified, and thinks it runs on a virtual file system, which fetches file chunks via HTTP range headers.

It's REALLY impressive that you only need to read 54 KB out of 700 MB, to fetch the records.

no-s · on May 4, 2021

> It's REALLY impressive that you only need to read 54 KB out of 700 MB, to fetch the records.

the harsh reality is that doing sensible queries that only reference and return the data actually needed always makes things faster. Even with server DBMS. Oh, how many times have I lamented the naive "select *" for forcing all the row contents even when there was index coverage for the actually needed data.

throwawayboise · on May 2, 2021

Do most static site hosters support range requests?

dspillett · on May 2, 2021

Most web servers do out of the box, so I would assume most do. Basically all unless they have some reason to turn range processing off or are running a custom/experimental/both server that have implemented the feature (yet).

Not supporting range requests would be a disadvantage for any service hosting large files. Resuming failed long downloads wouldn't work so users might not be happy and there would be more load on your bandwidth and other resources as the AU falls back to performing a full download.

nbevans · on May 3, 2021

Generally yes. Because not having range support means you can't resume file downloads. Which is a pretty essential feature for a static file host.

derefr · on May 3, 2021

More interestingly, do reverse-proxies like Varnish / CDNs like Cloudflare support range requests? If so, do they fetch the whole content on the back, and then allow arbitrary range requests within the cached content on the front?

matsur · on May 3, 2021

Yes, Cloudflare behaves as you describe.

NoInkling · on May 3, 2021

I was wondering that too. Support was spotty in general ~20 years ago but I assume things have improved since then.

OOPMan · on May 2, 2021

It's impressive on one hand.

On the other it's still a lot of overhead.

pfundstein · on May 2, 2021

I would say it's less overhead than downloading the entire db to query it locally...? What is your suggestion for accessing a static database with less overhead?

OskarS · on May 3, 2021

I would bet that if you compare it to a traditional server-client database (which functionally does essentially the same thing: you send it a query over the network, and get a result back), the overhead is probably massive. This is a very clever way to cram that kind of functionality into a static hosting site, and you can imagine some uses for it, but it's clearly not a particularly efficient compared to doing it the "right" way.

Salgat · on May 3, 2021

The idea is that you're weighing the pros cons vs an actual live database. This is basically only a good idea if you're having someone else paying the hosting fees.

OOPMan · on May 4, 2021

Since it's a static database and the queries against it are most likely going to be static, just pre-run the queries and store the results statically in a more space-efficient format. When you're on a dogshit internet connection in a 3rd world country 50kb can actually be pretty unpleasant. Try rate limiting your internet to EDGE rated to see what I mean.

I'm not saying the whole thing isn't impressive, just that the concept itself is one of those "because I can" rather "because I should" things, which kinda devalues it a whole lot.

ineedasername · on May 3, 2021

For a casual or personal use case though, the alternative of running a client-server database on something like a VPS is probably more overhead than this. It's unlikely to be a very scalable option, but for use cases as described by the author it seems like a good fit.

littlecranky67 · on May 3, 2021

I know of the drawbacks of the approach and wouldn't chose it for a lot of my projects, but I would say it is very scalable in those cases where I would. Put the DB on GitHub Pages for free along with your HTML/JS code and you can scale to whatever GitHub is willing and capable of delivering. Yes, your users might transfer way more data than needed but you pay nothing for it and do not have to maintain servers.

In the standard scenario for personal projects (not enterprise) I would have a small VPS/Dedicated server with a REST service running - that would be hugged to death immediately if a link would make it to some site like HN. And also, I completely share the experience of the Author that after a couple of years you have moved on, the VPS is dead etc and you don't want to invest time.

Again, before considering using solution, be sure to understand how it works and the resulting limitations or you will likely chose wrong.

ajfjrbfbf · on May 2, 2021

> sql.js only allows you to create and read from databases that are fully in memory though - so I implemented a virtual file system that fetches chunks of the database with HTTP Range requests when SQLite tries to read from the filesystem: sql.js-httpvfs. From SQLite’s perspective, it just looks like it’s living on a normal computer with an empty filesystem except for a file called /wdi.sqlite3 that it can read from.

From this paragraph it should be pretty clear that it's actually a great result. The database will obviously need to read more data than it presents, so more is fetched.

nl · on May 2, 2021

This might be true.

But this approach lets you actually work out what the optimal size is:

  select sum(length(country_code) + length(long_name)) from wdi_country;

gives: 6307

Or on average:

  select sum(length(country_code) + length(long_name))/count(*) from wdi_country;

gives: 23

(Note that it doesn't seem possible to use aggregation functions with a limit clause)

simonw · on May 2, 2021

The clever part here is only fetching 50KB out of 680MB.

graderjs · on May 3, 2021

Man you are a frickin genius, seriously. like how you put all this together all the depth of knowledge of different topics this would require the low level and the high level and the way you explain it simply confidently and with impact. Your work is really an inspiration. You computer scienced the sheet out of this thing.

this achievement, and this blog post, to me is on par with blog posts that you would see from a major company where they solve some significant business critical technical challenge in-house. for example: the GitHub blog post about how they created their spinning globe of commits on their homepage, or a Netflix blog post of how they optimized their Network infrastructure to serve so many customers.

your work is truly incredible. You're next level of next level.

phiresky · on May 3, 2021

Thank you, I really appreciate it. It's pretty fun to do this kind of thing for yourself, but it's really rewarding to be able to share it with other people.

f430 · on May 3, 2021

have you thought of getting this to work with network replicated p2p sqlite js library? forgot what it was called.

also does this mean that static sites now can scale without relying on db???

what other implications and application do you see?

wonderful work!!!! please let me know how i can donate

tlb · on May 2, 2021

The question I had is answered by this line of code:

  xhr.setRequestHeader("Range", "bytes=" + from + "-" + to);

I am a little surprised you can just do that. In https://github.com/phiresky/sql.js-httpvfs/blob/master/src/l...

dathinab · on May 2, 2021

Range headers are a pretty standard tools to e.g. continue interrupted downloads and similar.

Any well designed system, especially if it has static sources and is server cached should support it.

Surprisingly many web-frameworks don't support it out of the box, or don't support it well.

Either way gh-pages are static content and probably with some server side regional caches, so I'm not surprised it works.

hinkley · on May 7, 2021

Range headers are also how DownloadThemAll works.

If you're pulling a single TCP stream across a crowded network, you get maybe 1/40th of the available bandwidth. If you do four range requests instead, you might see >3/40th of the bandwidth.

This place we were contracting at, the managers were widely rumored to stream sports games at their desks, and our release cycle happened to fall on a game day. My poor coworker was downloading our installer every week, and the ETA was over 40 minutes. "Are you using DTA?" "No, what's that?" <fiddle fiddle> ETA: 12 minutes.

12 minute pauses in the middle of a manual process are a lot easier to stomach than 40 minutes. Especially if something goes wrong and you have to do it twice.

yread · on May 2, 2021

I've learnt about this by using https://www.biodalliance.org

It's an embedded genome viewer, you can just point it at a multigigabyte reference files and .seg files and it loads super quick

punnerud · on May 2, 2021

Here is direct link to GitHub with the usage: https://github.com/dasmoth/dalliance/search?q=bytes%3D%27&ty...

fanf2 · on May 2, 2021

One of the heaviest users of range requests is (or was) the Adobe Acrobat PDF plugin.

scottlamb · on May 3, 2021

Also .mp4 files. The format is designed for seekability, and browsers take advantage of this.

sp332 · on May 3, 2021

You can even point VLC at a .iso file on a web server, and seek around in it.

fy20 · on May 3, 2021

Progressive JPEGs work well for this too, so you could have the same file used for a tiny thumbnail and large preview and full sized photo by sending different range requests. However you need to know how many bytes to request.

I'm surprised this isn't used on mobile browsers to lower data usage. I'm sure with a little research you could figure out what a good mapping from pixel size to byte size should be to give good enough results.

duskwuff · on May 3, 2021

A browser doesn't have enough information to use this optimization. At the point where it's about to request an image, it doesn't know how large the resource will be, whether it'll be a progressive JPEG, or even whether it'll be a JPEG at all. Making range requests blindly would probably be a net loss -- for every progressive JPEG that the browser managed to save some time on, it'd have to make follow-up requests for many more non-progressive JPEGs, non-JPEG images, and progressive JPEGs which it didn't get enough data from on the first try.

_flux · on May 3, 2021

You are correct.

However, one could use this approach: download as usual, and in a streaming fashion process the data and if it's a progressive JPEG, you can close the connection before you have received everything; and then you can cache the prefix and later download the rest if needed.

Fast clients will just swallow the whole file, while slow clients would be able to benefit from it.

It wouldn't work for pipelined HTTP connections though without cancelling the whole pipeline, so maybe not a very practical solution given the performance benefit that already gives. And HTTP/2 maybe doesn't support cancelling a transfer either, so.. ?

Maybe a direct "Accept" or "Prefer" header to indicate that it's enough to send just something useful for an icon would be a more ideal solution, but it would require server-side support.

0x0 · on May 3, 2021

I remember putting together a sloppy http range implementation that initially only supported single ranges, it had quite the explosive effect on adobe reader when it didn't get the expected response to its multi-range requests :)

iudqnolq · on May 3, 2021

I'm surprised that works, iirc pdf isn't defined in order and can't be parsed streaming

tonyedgecombe · on May 3, 2021

Linearised PDF files are defined in order and can be read from the start.

https://blog.idrsolutions.com/2010/02/linearized-pdf-files/

kindall · on May 3, 2021

I think that's what the "optimize for Web" checkbox does

IshKebab · on May 3, 2021

Wow so it actually does something! I wish programs would use such vague descriptions. (Or more of them had helpful instant tooltips.)

maskros · on May 3, 2021

Linearized (a.k.a. web optimized) PDF files only help for displaying the first page quickly. The rest of the file is still in pretty much random access order.

e12e · on May 3, 2021

For what it's worth, there's a format for content like pdfs that's optimized for viewing on screens (as opposed to printing), djvu: https://en.m.wikipedia.org/wiki/DjVu

Oh, wow initial release 1998,now I'm feeling a bit old...

crazygringo · on May 2, 2021

Not all webservers support/enable it, so YMMV.

But as long as you're dealing with a known server that does, then gravy!

stabbles · on May 2, 2021

nginx enables it by default. Another ingenious use of range request is zsync, it allows you to diff compressed binaries on a remote with local ones, so that you only have to download what has changed on an update. AppImage uses this

altfredd · on May 3, 2021

> Not all webservers support/enable it

Could you provide an example of server that does not?

AFAIK, Range is supported by all major CDNs, so not supporting it in web server would be a death knell for it's real-world adoption.

crazygringo · on May 3, 2021

I can't think of a specific one, but a decent proportion (maybe a quarter?) of HTTP downloads I attempt don't support resuming a partial download, therefore don't support Range. (I.e. resuming always starts from the beginning.)

I would assume this is often because the site in question isn't using Apache etc. to serve a file directly, but is either essentially proxying it to some custom-built file serving service, or a script that processes/authenticates the file in some way, and they just never bothered to implement Range.

enneff · on May 3, 2021

Depends on what’s being served. Any decent static file server should support it, but if the content is at all dynamically produced then the authors would have to think to implement it and rarely do.

e12e · on May 3, 2021

I guess "all" that's needed for write would be a webdav server with support for PATCH x-update-range? :)

https://tools.ietf.org/html/rfc5789

https://sabre.io/dav/http-patch/

Unfortunately, solid, stand-alone webdav servers are harder to come by than decent http2/1.1 servers.

ianlevesque · on May 2, 2021

This is easily the most clever web programming hack I’ve seen this year. Bravo. I had seen this used for video or audio of course but it never occurred to me you could use it for databases. There are probably a ton of other formats this is good for too.

Valgrim · on May 3, 2021

I wonder if this could be used to serve dynamic maps.

blacha · on May 3, 2021

This is pretty much exactly what we do to serve aerial/satellite imagery maps.

We convert the imagery into Cloud optimised geo tiffs and store them in S3 https://www.cogeo.org/ then the browser can request the tiles directly from S3.

Even the big imagery providers are now storing their imagery as COGs, eg https://registry.opendata.aws/sentinel-2-l2a-cogs/

durkie · on May 3, 2021

I believe this is protomaps approach: re-encode the mbtiles (sqlite-based ) format in to something that can be requested with a http range request and thus served from a single dumb webserver that doesn't need to understand sqlite or mbtiles parsing

bdon · on May 3, 2021

This is the approach I took with http://github.com/protomaps/pmtiles , though it's optimized for the very specific use case of going from Z/X/Y integer coordinates to binary blobs, and takes shortcuts to accomplish that (fixed-width keys and root index page)

config_yml · on May 3, 2021

Funny enough I was looking into mbtiles serverless solutions before I went to bed, now I start my day browsing HN and I find this clever solution. Love HN for this type of stuff :)

sroussey · on May 2, 2021

The innovation here is getting sql.js to use http and range requests for file access rather than all being in memory.

I wonder when people using next.js will start using this for faster builds for larger static sites?

misterdata · on May 2, 2021

See also https://github.com/bittorrent/sqltorrent, same trick but using BitTorrent

phiresky · on May 2, 2021

Yeah, that was one of the inspirations for this. That one does not work in the browser though, would be a good project to do that same thing but with sqlite in wasm and integrated with WebTorrent instead of a native torrent program.

I actually did also implement a similar thing fetching data on demand from WebTorrent (and in turn helping to host the data yourself by being on the website): https://phiresky.github.io/tv-show-ratings/ That uses a protobufs split into a hashmap instead of SQLite though.

westurner · on May 2, 2021

This looks pretty efficient. Some chains can be interacted with without e.g. web3.js? LevelDB indexes aren't SQLite.

Datasette is one application for views of read-only SQLite dbs with out-of-band replication. https://github.com/simonw/datasette

There are a bunch of *-to-sqlite utilities in corresponding dogsheep project.

Arrow JS for 'paged' browser client access to DuckDB might be possible and faster but without full SQLite SQL compatibility and the SQLite test suite. https://arrow.apache.org/docs/js/

https://duckdb.org/ :

> Direct Parquet & CSV querying

In-browser notebooks like Pyodide and Jyve have local filesystem access with the new "Filesystem Access API", but downloading/copying all data to the browser for every run of a browser-hosted notebook may not be necessary. https://web.dev/file-system-access/

hfmuehleisen · on May 3, 2021

DuckDB can directly & selectively query Parquet files over HTTP/S3 as well. See here for examples: https://github.com/duckdb/duckdb/blob/6c7c9805fdf1604039ebed...

tyingq · on May 2, 2021

Would also be great to add (efficient) search to a static blog.

edoceo · on May 2, 2021

yea, sqlite FTS5 has been pretty amazing for quick search solutions (but I use english only)

nanna · on May 3, 2021

Definitely. Just need to add a layer to the static site generator for it to populate the SQLite DB, right?

intergalplan · on May 3, 2021

I'd also version the DB in the URL, else you could end up changing the file out from under someone who's already got the page loaded, with who-knows-what results depending on how different the file is. You could just prefix a head to every range request to check for changes, but that adds overhead and doesn't actually completely close the gap, so it'd still be possible to read a file different from the one you intended. Cost is more disk usage, depending on how many copies you keep around, but at least keeping the most recent "old" version seems reasonable unless you're skating really close to the quota on whatever system you're using.

M2Ys4U · on May 3, 2021

Requests could use the If-Unmodified-Since header to guard against DB changes without the overhead of a separate HEAD request.

If the underlying DB has been changed then the server should respond with 412 Precondition Failed.

jhgb · on May 2, 2021

Microsoft Access Cloud Edition, basically?

nbevans · on May 3, 2021

Sort of. Access had a "Forms" feature that let you create basic GUIs on top of your database. Also, the OP's project is (currently) only providing a read-only view of the SQLite database. Adding write support is possible but will be far less impressive to the HN crowd because SQLITE_BUSY will rear its ugly head ;-)

jhgb · on May 3, 2021

I was mostly referring to the shared file access approach.

mvanaltvorst · on May 2, 2021

I'm curious, in what manner could this method speed up Next.js builds? That's all done locally, which negates the effect of HTTP range requests, right?

city41 · on May 2, 2021

I'm guessing they mean rather than build a static Next site that generates 10k+ pages (or whatever large means in the given context), it instead creates one page that just queries the data from the client.

I have one Next static site that has about 20k pages and takes about 20 minutes to build and deploy. I think that's an acceptable build time. But I do know of other people around the net who have mentioned having sites with 20k-ish pages taking an hour+ to build. For them I could see the desire to try this sqlite trick.

nindalf · on May 2, 2021

You should write a post about this if you implement it. My humble suggestion for blog post title - 'Speed up your Next.js builds with this author's one weird trick! Vercel hates him!'

lsmurray · on May 3, 2021

I generate my entire blog on nextjs from a sqlite data base already. https://lsmurray.com/blog/database-first-development

It’s got a ton of rough edges but the boilerplate is there to get a proof of concept pretty quickly

altano · on May 3, 2021

For really large sites Next.js already has Incremental Static Regeneration which is usually the right solution to fast [re]builds: https://www.smashingmagazine.com/2021/04/incremental-static-...

throw123123123 · on May 3, 2021

Its not the same because you have to rebuild all the pages if you change your data source. In this implementation you can upload a new data set and it will work.

Its just a different stack.

gitgud · on May 2, 2021

Generating 20k pages in 20mins is impressive, 16 pages a second on average.

In my experience, it can take a couple of minutes just to deploy 20 pages, but that could just be the overhead of Typescript and SASS compilation too...

nicoburns · on May 3, 2021

Hugo claims <1ms a page. Which would mean 20k pages in under 20 seconds. 20k pages in 20 mins is not fast!

gawin · on May 3, 2021

Confirmed: Build a 30k pages site with Hugo in 53 seconds, including grabbing JSON content from Contentful and deploy on AWS S3.

Took some effort to get everything under 60 seconds, speed was the main reason to use Hugo.

city41 · on May 3, 2021

Oops, my memory was off. It's 10,925 pages and the last build took 18 minutes.

yoz · on May 3, 2021

As everyone else has been saying, this is amazing work. It sounds like the biggest issue with loading the page is the initial sql.js download - it's about 1.2MB, is that right?

Might it be feasible to easily strip down SQLite so that it only compiles the parts for read-only use? The browser version is obviously somewhat read-only but that's because of the sandbox. I'm talking about excluding the code for CREATE, UPDATE, INSERT and everything else which is just for writing. The aim here would be to produce a significantly smaller WASM binary.

I'm guessing that the answer is no, there's no easy way of doing this without significant rewrites of SQLite's core, but... I can't be the only one to think of this, surely?

phiresky · on May 3, 2021

The actual transferred data for the sqlite code should only be 550kB (gzip compression).

Stripping out the write parts is a good idea. SQLite actually has a set of compile time flags to omit features [1]. I just tried enabling as many of those as possible, but it didn't seem to reduce wasm size much, though I might be doing something wrong. There's also no easy flags to disable CREATE / UPDATE / INSERT .

[1] https://www.sqlite.org/compile.html#omitfeatures

hinkley · on May 7, 2021

I smell a series of pull requests...

duskwuff · on May 3, 2021

I'd be curious whether there's any changes which could be made in the file format to optimize for read-only usage. The SQLite format probably has some features which aren't needed in this context -- information about free pages and autoincrement counters isn't relevant in a read-only file, for instance.

phiresky · on May 3, 2021

I think it wouldn't change much - SQLite is already pretty optimized towards reads, for example a write always replaces a whole page and locks the whole DB. The free pages can easily be removed by doing VACUUM beforehand which should be done anyways to balance the b-trees.

The storage of SQLite is already really efficient, for example integers are always stored as varints so small ones only take a byte. The only thing I think could maybe be improved for this use case is changing the structure of the b-tree to be more columnar - since right now all the data from different columns is intermingled with the btree structure itself, querying a subset of columns has a high overhead.

singularity2001 · on May 3, 2021

Sounds feasible to me. Either by replacing all those functions on the C side with empty shells or maybe even with wasm-opt ( but probably the OP has already used it removed all garbage collectible paths.)

BMorearty · on May 2, 2021

This is fantastically creative. And the author does a great job of starting out by describing why this is useful.

And then using SQLite to insert and update DOM elements? Holy cow, icing on the cake. Unlike the first part, there’s no explanation of why you’d want to do that. But by that point I was so drawn in that I didn’t care and was just enjoying the ride.

phiresky · on May 2, 2021

Yeaah I felt like at that point the article was already long enough so I didn't bother describing the DOM part too much - even though I spent more time implementing that than I did implementing the rest ;)

Basically SQLite has a virtual table mechanism [1] where you have to define a few functions that figure out how to scan your "fake" table / which indices to use and then how to read / write the actual data. I hook into this mechanism and redirect the request to DOM functions like querySelector() etc. Then there's the issue about SQLite being fully synchronous, but I have to run it in a WebWorker - and the WebWorker can't actually access the DOM and it can only communicate asynchronously with the main thread... So I have to do some weird stuff with SharedArrayBuffer and Atomics.wait to make that work [2].

[1] https://www.sqlite.org/vtab.html [2] https://github.com/phiresky/sql.js-httpvfs/blob/master/src/v...

nbevans · on May 3, 2021

"But by that point I was so drawn in that I didn’t care and was just enjoying the ride."

:-) There a high amount of SQLite content/articles/blogs on the web that can provide this effect. SQLite is to programmers like the stars are to astronomers. A wonder.

arafalov · on May 2, 2021

Amazing, especially - for me - that the FTS5 full-text search just works. Longer term, I am if it were possible to split the DB code into read and write parts and cross-compile only read part for delivery to the browser.

rzzzt · on May 2, 2021

If you are interested in full-text search on the client, Lunr is also an option: https://lunrjs.com/docs/index.html

tehbeard · on May 2, 2021

Lunr needs the full index on client though, right?

Being able to use fts-5 without the penalty of having to pull down the whole index make it work much better at larger scales, even with the penalty of additional network requests.

formerly_proven · on May 2, 2021

This could be genuinely interesting for tools like e.g. sphinx-doc, which currently has a client-side search that does indeed ship the entire index to the client.

tehbeard · on May 2, 2021

There's probably a dataset size tradeoff with small enough number of documents, but even then this could have the option of if db is < x MB total, fetch all in async task and then use that.

joshuanapoli · on May 2, 2021

Time to create the index for lunr limits the size of the data-set that can be used. If you have a lot of tiny documents, then it is more practical to scan the documents directly, rather than using the lunr index.

Arnavion · on May 3, 2021

I wrote a similar thing in Rust for a Factorio mod manager. Mods are hosted on the remote HTTP server as ZIP files, and the mod manager needs a single `info.json` file from the ZIP for the mod metadata. So the mod manager avoids downloading the whole mod and then unpacking it by building a file abstraction that uses HTTP range queries to download just the chunks it needs. For ZIP files the directory is stored at the end at an unknown offset, so the read pattern is to gradually seek backwards from the end until you find the start of the directory, then find the file entry, then seek and read the file.

I didn't fiddle with the window sizes like the submitted article (the chunk is fixed to 8KiB), but I did optimize it so that reading chunk N+1 of the file reused the response reader of chunk N rather than make a new request. Furthermore I keep an LRU cache of only the last three chunks in memory, because the ZIP files are each only read once.

[1]: https://github.com/Arnavion/fac-rs/blob/2d2622a1c9934719ce65...

[2]: https://github.com/Arnavion/fac-rs/blob/2d2622a1c9934719ce65...

bob1029 · on May 2, 2021

Modeling the DOM in SQL... Further evidence that anything we can imagine has some stable representation in third normal form. Is it fast? Maybe not. But is it correct? Provably.

cma · on May 3, 2021

There was a group that rewrote the Chromium DOM itself in a data oriented design (which learns from database design and sort of applies to cache utilization and locality) and got a 6X speedup in some places:

https://meetingcpp.com/mcpp/slides/2018/Data-oriented%20desi...

bob1029 · on May 3, 2021

This paper is basically what we do, except we have a big container type, aptly named "Domain.cs". Inside, you will find a public List<T> of every type. We decided to emulate SQL throughout the vertical (i.e. each List is a table) in order to make mapping to SQL a trivial affair. None of our Domain types contains any complex type as a property. Everything can go 1:1 to SQL. We use LINQ (or SQL) to produce projections as appropriate.

There isn't actually just 1 big domain instance either. It's more like one per user session, and then a global instance.

The impact this had on reducing complexity and bugs is incredible. I haven't seen a null ref exception in a long time. Also, being able to dump your entire universe to disk by serializing a single object is really nice.

banana_giraffe · on May 2, 2021

In a similar vein, I've mentioned this before, but if you're doing Python stuff, you can use the apsw package (not the one in PyPi, though) to write a VFS layer that SQLite will use to read the database.

I've used this for the same basic idea as this article, only letting me store SQLite databases in AWS's S3 that I can access with AWS APIs so they don't need to be public. It works well, though it's absolutely not for every use case, the overhead is considerable.

I even used it once to read SQLite database files in a zip file stored in S3 without having any local storage to use. Not one of my prouder moments, but hey, I coded my way out of the corner someone else designed for me.

killingtime74 · on May 3, 2021

This one? https://rogerbinns.github.io/apsw/

banana_giraffe · on May 3, 2021

Yep, exactly that one. There's a simple example of a VFS implementation on the examples page that's a reasonable starting point:

https://rogerbinns.github.io/apsw/example.html

Once you wrap your head around how you need to pass parameters to the helper, it's really straightforward, you just need to implement the xOpen and xRead calls.

killingtime74 · on May 3, 2021

Thank you!

laurencerowe · on May 2, 2021

In the genomics world, Tabix indices enables similar use cases. An ordered TSV file is compressed in chunks (bgzip) and a Tabix index created to allow range based access by mapping from the index -> chunk. This allows a browser based genome browser zoomed into a section of the genome to fetch information from a multi gigabyte file.

http://www.htslib.org/doc/tabix.html

Ultimatt · on May 2, 2021

now if only tabix and most hokey bioinformatics formats would die and just be replaced with a formal schema spec in SQLite...

laurencerowe · on May 2, 2021

Honestly I think Tabix's bgzipped TSV is one of the less hokey bioinformatics formats, at least compared to the various custom binary formats floating around.

For web browser based genome browsers I suspect this (very cool!) sqlite hack would require many more http requests.

crazygringo · on May 2, 2021

This is hilariously clever.

Using the "Range" HTTP header to read chunks of the database file absolutely works!

But to be clear, there's no write equivalent, is there? You can't use "Range" with a PUT request.

anonydsfsfs · on May 3, 2021

The write equivalent would be the PATCH method using a "message/byteranges" body: https://tools.ietf.org/id/draft-wright-http-partial-upload-0...

crazygringo · on May 3, 2021

Wow that's fascinating, thanks. That would actually turn HTTP into a kind of random-access filesystem, if adopted.

It's amazing but also slightly terrifying in the knowledge that then someone's going to write an SMB-over-HTTP connector.

TimTheTinker · on May 2, 2021

TA was clear - there's no way to write, since static file hosting doesn't support dynamic write to begin with.

However, I imagine a service to support your scenario could be written in a standard back-end server language like Go or JS. The challenges involved would be much greater, however -- how to handle concurrency in particular. I suspect one would do better to just run PostgreSQL behind a web API.

csomar · on May 2, 2021

That's basically re-inventing the Database but on the client side. We have gone a long way but we are closer to having the server side just as a data store.

NoInkling · on May 3, 2021

Even if there was, I can't imagine your everyday static host ever supporting it.

sztanko · on May 3, 2021

Apart from the genius of the author (conceiving a sequence of relatively understandable steps that unlock a potentially huge area), this highlights how efficient a SQLite can be in terms of slow resource usage.

dreix · on May 2, 2021

Over the last few months I tried to think of a clever way to set up a legacy site for a dictionary that I serve on a VM just because I also need to run sqlite. Since I want to make sure it'll run for longer than me paying for the VM this is the best possible solution. At some point no more updates will happen and it's going to be a static website. So bundling it like this is incredible. I can run multiple backups on different hosts with no additional costs.

scottlamb · on May 3, 2021

If you just do an occasional key/value lookup, you don't need 1.2 MiB of WebAssembly. [1] That might already exceed your total database size.

I'd solve it via sharding: divide the database into N pieces via range- or hash-sharding. [1] Choose an N that's large enough for each piece to be reasonably small. When you look up a key, fetch the shard of interest.

You can put each piece into separate files (a little simpler to code, and most static servers will use pre-gzipped files for "Content-Encoding: gzip requests" easily, but you waste more disk space due to internal fragmentation) or one file (with range serving and an index of the byte range offset for each piece).

The format for each piece can be anything, eg json (simple) or an sstable-like format (more efficient). [3]

[1] Content-Length of https://phiresky.github.io/youtube-sponsorship-stats/sql-was...

[2] hash-sharding means: piece[i] has all the keys where hash(key) % N = i.

[3] https://github.com/google/leveldb/blob/master/doc/table_form... although they just say "formatted according to the code in block_builder.cc" instead of describing the most relevant part.

spamalot159 · on May 2, 2021

This is really cool! I wonder what the restrictions are and if we would ever be able to write to a SQLite db like this in the future. This could push more to the front end without needing to write apis.

phiresky · on May 2, 2021

The main restriction is that the DB really needs well fitting indexes, otherwise querying is really slow and fetches a lot of data.

Regarding writing:

You could of course implement a writing API with POST requests for changing pages of the database - but then you would lose most of the benefits of this (not requiring any special kind of server).

I also thought about implementing a kind of overlay filesystem, where chunks that are written to the file are stored in a local storage so the modified data is available locally while still reading everything else from the remote database.

Interestingly in SQLite that's already exactly what the WAL mode does: It's a second file next to the database that's just a set of pages that are overlaid over the main file when read queries happen - which allows concurrent readers and writers since the database itself isn't in an undefined state even when write transactions are happening.

So you could enable WAL mode and disable WAL auto checkpointing, then you get a downloadable WAL file that can be read by normal SQLite and written back to the main file. It would be neat, but I'm not sure what the actual use case would be ;)

cxr · on May 2, 2021

> I also thought about implementing a kind of overlay filesystem, where chunks that are written to the file are stored in a local storage so the modified data is available locally while still reading everything else from the remote database.

If ever the intent were to involve eventually persisting those changes, then it would be worthwhile looking at remoteStorage, which works like this.

graderjs · on May 3, 2021

I was thinking you could implement a write API using the GitHub API, every write can be its own commit.

stagas · on May 2, 2021

> I also thought about implementing a kind of overlay filesystem, where chunks that are written to the file are stored in a local storage so the modified data is available locally while still reading everything else from the remote database.

Perhaps adding IPFS to the mix for persisting data would be interesting, I'm sure there are use cases in peer to peer applications. Anyway, amazing innovation thank you for writing this :)

ItsMonkk · on May 3, 2021

Seems like it might be possible to add WebRTC(or some other peer-based live system, I don't know the performance characteristics of WebRTC) to get realtime writes. Those edits would then be added to the WAL, and every say, 10 minutes anyone who has been active for 10 minutes could checkpoint and push the changes to github.

It's slightly more centralized than perfect, but man do you get a lot for a little.

WrtCdEvrydy · on May 2, 2021

Since you can do static hosting from a git repo, I wonder if you could directly push your changes to your git repo and have your CI/CD solution just deploy it instead?

There has to be a git.js implementation out there and you could move the DB to it's own repo and create an https access token (for Github)... the issue there is that someone could use that token to commit whatever to your database repo.

wh33zle · on May 2, 2021

Maybe forcing the user to login with GitHub would be an option? And the changes to the database could be modelled as pull requests?

iveqy · on May 3, 2021

This is both a clever hack and a brilliant solution. But it also worries me a bit. Mostly because I've seen a lot of Visual Basic + MS Access solutions. They work fine on a single computer, but they you put a database on a network share to be able to share it between a few computers and the performance is often horrendous. If you're doing a lot of data processing it's often best to do it as close to the data as possible.

But as always, it's seldom the tools, but the right tool used for the wrong usecase that is the problem.

isoprophlex · on May 3, 2021

I have the feeling I'm watching some semi-major new web tech being born... out of the blue, in a blog post.

Fucking amazing, mad props. Beautiful work!

wheybags · on May 2, 2021

This is an really awesome idea! I have a plan for a static-ish site (updated daily) that I was going to use sqlite for anyway, but server side. I will definitely look into this approach instead!

singularity2001 · on May 3, 2021

   >>> From SQLite’s perspective, it just looks like it’s living on a normal computer with an empty filesystem except for a file called /wdi.sqlite3 that it can read from.

Beyond static hosting : Now imagine also implementing a virtual file system that SENDS chunks of the database with HTTP Range requests when SQLite tries to write from the filesystem

Or more generally: I predict a WASI implementation which will treat ANY server resource as a virtual file, replacing REST.

hutrdvnj · on May 3, 2021

First of all, this is a very cool web hack, I like it very much.

I have a question. It's a 668.8MB database file. What does actually happen if the query has to scan 300 mb before finding the right answer? Wouldn't it be better to do the work up front and deliver the answers as static json files? Sure you loose the flexibility of dynamic queries, but do you really have that flexibility in non trivial cases (e.g. 300 mb search)?

willvarfar · on May 3, 2021

If your statistic or whatever can be precomputed, you can precompute it and put it in db table rather than compute it each time by reading the 300MB.

simonw · on May 2, 2021

Using HTTP range requests like this is just SO clever.

Procrastes · on May 3, 2021

I'm also impressed with the implementation for the blog itself. I love this sort of clean, version-tracked sort of implementation.

tlarkworthy · on May 3, 2021

works on ObservableHQ

https://observablehq.com/@tomlarkworthy/phiresky-sqlite-quer...

drtournier · on May 2, 2021

Solution works really well to databases which will not be updated frequently, like a standalone site.

Although one should be aware of one very important git behavior - git does not diff binary files (like SQLite dbs). That means 2 things:

1. Each db update will generate a new file in git, maintaining the whole old file in history, instead of the diff in bytes. This will accumulate a lot of clutter in the repo

2. As git does not diff binaries, there is a very small risk of corruption (especially if you work in multiple OSs, because of CRLF)

Ref - https://robinwinslow.uk/dont-ever-commit-binary-files-to-git

phiresky · on May 2, 2021

Git does actually diff binaries and stores them very efficiently :) If you do a single small UPDATE in your db file git will only store the changed information. It's just kinda slow, and for most binary files the effort git spends to try and compute deltas for binary files is useless - which is why it has a bad reputation for binary files.

Note that the diffs that git shows you are completely unrelated to the deltas it uses to compress it's database - which are always "binary deltas" and not line-based diffs.

Also I'm not sure why you mean that db corruption possibility has something to do with whether or not it stores diffs?

xucheng · on May 2, 2021

You can just store the database in text format (e.g. csv) in the git and turn it to SQLite db when building the website.

laszlokorte · on May 2, 2021

Minor nitpick: Git does not store diffs for any file format but always the full file for each version. So it does not really matter that its binary (except for not being able to VIEW the diff, but I guess you could even implement a plugin for that) but just that it's a big file. Even a huge text file would be fully stored per version.

/edit: The sibling comments mentions that git can infact delta compress older commits for storage efficency. But my point was that git commits are not deltas but full snapshots.

iudqnolq · on May 3, 2021

This is mostly correct. Git's core object model is snapshots, which can then optionally be compressed. That should be transparent though.

tehbeard · on May 2, 2021

You are only gonna encounter corruption if u either a) messed up the gitconfig for line endings or b) named the database mydbfile.txt

Gehinnn · on May 2, 2021

TL;DR: Compile SQLite to JS with emscripten, implement a virtual fs in JS to stream chunks of a statically hosted (readonly) SQL database.

If queries make use of indices, only a fraction of the database needs to be downloaded.

Also, you can use SQLite to query the DOM.

pudmaidai · on May 2, 2021

I can't figure out exactly how it knows which chunk to download. Does it always download the whole index first? Or does it include it in the built JS file itself?

cookguyruffles · on May 2, 2021

Both the index and table data are btrees. These are trees - the root node sits in some known location (offset) in the file, referenced by the file header and metadata. As SQLite traverses the tree, it encounters new descendents it would like to visit, presumably identified by their byte offset in the file, which is all needed for this VFS magic to issue a suitable range request.

- SQlite opens the file and reads 4kb worth of header -> range request for byte 0-4096

- headers/metadata refers to index table with root node at 8192kb

- user issues SELECT * from index WHERE name = 'foo'

- SQLite reads root node from the file (range request for 8192kb..)

- Root node indicates left branch covers 'foo'. Left branch node at address 12345kb

- Fetch left branch (range request for 12345kb)

- New node contains an index entry for 'foo', row 55 of data page at 919191kb

- SQLite reads data page (range request for 91919191kb..)

etc etc etc

jaza · on May 3, 2021

Thanks, I too was struggling to understand how it's able to do such efficient targeted range requests, you explained it nicely.

formerly_proven · on May 2, 2021

SQLite has runtime-pluggable VFS support, i.e. you give it a struct with functions for opening a file, reading some bytes, writing some bytes, synchronizing file contents, closing a file. This project provides such a VFS module, that, because it actually runs in the browser, performs HTTP requests to read data. Emscripten provides a way to run a mix of C/C++ code in the same environment as some JavaScript code inside the browser. The reason SQLite has this pluggable VFS support is to properly support embedded systems, different locking APIs, and things like database encryption.

https://www.sqlite.org/vfs.html

mvanaltvorst · on May 2, 2021

That's part of SQLite. It has been optimised to reduce disk reads, because those can be slow on spinning hard drives. Coincidentally, this translates well into an optimised algorithm that minimises the amount of HTTP range requests to make.

fulafel · on May 2, 2021

B-Tree indexes are designed to work like this, to require a low number of IO operations. The index contains pointers to other places in the index.

pudmaidai · on May 2, 2021

So the answer is “yes, it has to download the index first”? None of these comments answer my question.

phiresky · on May 2, 2021

The B-Tree is a tree that in this case is perfectly balanced. So if you do a query with an index in a database it will fetch an logarithmic amount of data from the index and then a constant amount of data from the table.

For the example the wdi_data table is 300MB and an index on it is 100MB in size. This index has a tree depth of 4 - which means SQLite has to read exactly 4 pages (4KiB) to get to the bottom of it and find the exact position of the actual row data.

you can check the depth of the b-trees with `sqlite3_analyzer`.

rakoo · on May 2, 2021

Everything in SQLite is stored in B-Trees. Data or indexes. So you don't need to download the whole index first; you only need to download the necessary pages of the trees to access data, whether it's part of an index or actual data

detaro · on May 2, 2021

At least the information describing what tables and indices there are and where to find them - and then it gets what it needs once a query is run. Just like sqlite would if running from a local file on disk.

tehbeard · on May 2, 2021

It just needs to download the pages (x KB chunks) to traverse from the header to the particular index and table schema, a fraction of the whole data.

fulafel · on May 3, 2021

It has to download some parts ("pages") of the index as the query execution proceeds, and some header/schema description pieces of it first before execution starts.

rjeli · on May 2, 2021

Nice. Reminds me of Bellard’s https://vfsync.org/ . It’s file system pages served over http, to back his jslinux browser VMs

irae · on May 3, 2021

The Hacker News demographic, mostly US and EU based, might be under appreciating how impactful this is because of CDNs. For anyone not in the continent of your central database, this means absurdly faster interactions for data visualization.

I moved from US to Brazil 3 years ago, and I still notice the latency when a site runs their backend only in one location. This cleaver solution makes interacting with the graph supper snappy even compared to enterprises that do have database servers in Brazil. Very impressive!

Seattle3503 · on May 3, 2021

How close are we to hosting searchable torrent indices on IPFS?

slver · on May 3, 2021

I wonder sometimes, what if we focus on producing native JS VM hardware, would it perform same as or better than WASM? Let's run everything on JS.

villasv · on May 2, 2021

Wicket stuff, loved it. Not sure if this will ever be “production ready” or merged by sql.JS-like projects but cool proof of concept.

th0ma5 · on May 2, 2021

Searching the Common Crawl does similar byte level access to S3 based on a smaller index of indexes. Really great, actually.

shkkmo · on May 3, 2021

Hmm... Is it possible to use this with two seperate databases, one a remote read only and one a local writable DB that could be re-exported to a static file host? Having just glanced at the code it looks like you would have to load two seperate libraries, one from this project and one from the original sql.js

bilater · on May 2, 2021

This is awesome. Got my wheels turning :)

modeless · on May 2, 2021

Super cool. I wonder if an approach like this could ever work for a writable database. I guess the issue is that you couldn't have fine grained access control without a whole lot of server side validation, at which point you might as well just run a regular database server.

diedyesterday · on May 5, 2021

This hack reminds me of Godel's hacking of an axiomatic mathematical system, to prove and plant his incompleteness theorem;

This is really a great hack: using a system in a unexpected and creative way for which it wan not originally deigned, but which is also very useful and nice.

aflag · on May 3, 2021

Very clever. I wonder if there are databases optimised for this use case. I can imagine something that always requires indexes to do the queries and stores data in disk in ways to make it easy to fetch only the bits you need.

aasasd · on May 3, 2021

That would be equivalent to always putting your data in b-trees or other structures, according to the request patterns, without keeping the tables themselves. Sort of how you need to do that in Redis for any kind of sane request strategy other than key-value lookups.

aflag · on May 3, 2021

Hm, yeah, I think a key-value store would be easier to implement. I haven't looked at redis for some time now, but last time I did, persistence was done through snapshotting and everything would really be loaded into memory at start time. So that wouldn't work for this use case, where all you can do is serve a static file.

But my question revolves around databases assuming that the disk they access is local or at least a fast network storage. I wonder if there are any databases optimized to access slow storage over low bandwidth, where you're really trying to optimize the amount of data read more than anything else.

aasasd · on May 3, 2021

Well, every database already optimizes disk access (at least until the recent years with the ‘just add SSDs’ attitude). However, they tend to assume that indexes should be loaded into memory. For this use-case, you'd want a database that can use indexes themselves from disk and treat them like partitioned storage: e.g. when reading data for years 2019 to 2021, only request parts corresponding to that, and not previous years. Dunno whether SQLite can have indexes partially in memory—with its niche of low-end devices and apps, it's quite possible that it can.

Actually, this sort of partial access (i.e. partitioning) is rather easy to implement by addressing separate data files by name, instead of using numeric ranges into a database. Basically just put the data into files named by the years (in my example); or bucket the data into chunks of arbitrary size and use the chunks as files. Elementary to extend this to multiple fields in the index. In short, partitioning based on actual field values can be much easier in the static-http approach than using opaque ranges. Probably also more effective if something like http2 allows requesting several files in one request—since you can avoid requesting too little or too much.

niutech · on May 9, 2021

For similar results, you can use SQL with IndexedDB running fully client-side: https://github.com/agershun/alasql

dawe35 · on May 3, 2021

Hi, I'm working on SkySQL, maybe it's useful for you: https://github.com/upshot-tech/SkySQL

whoomp12342 · on May 3, 2021

stahp, you gonna get gh pages shut down

tedk-42 · on May 2, 2021

Very clever! Keep in mind this is a read-only dataset with the really clever bit being the ability to scan through the dataset and support sqlite query syntax.

nsonha · on May 3, 2021

This is perfect for my need, been looking for a way to add search to my static site completely free of server. Now I can use sqlite as index.

Benjamin_Dobell · on May 3, 2021

This certainly does look like an interesting solution, I'd be keen to try it myself. However, just in case you didn't already know about Lunr (https://lunrjs.com/), it is fairly commonly used to implement search on static websites.

e.g. https://squidfunk.github.io/mkdocs-material/

There are of course other similar libraries too.

EDIT: Whoops, just saw a few comments below Lunr is already mentioned.

nsonha · on May 3, 2021

yeah I have come across Lunr and maybe a couple of other things in my research and I think for blogging it'll work well. What I'm interested in finding out is what works for a larger static site, that won't require you to load the index up-front. I'm also curious about how this sqlite thing picks the correct range to load (haven't looked at the code) and what the worst case might be.

masklinn · on May 3, 2021

> I'm also curious about how this sqlite thing picks the correct range to load

Indexes are usually btree, it finds what and where the relevant index is (probably from the schema queries), then goes trawling through the b-tree.

That’s how db engines normally work, they don’t liberally go through the entire db contents.

If there’s no index then it finds the start of the table and scans it sequentially (which is exactly what it sounds like).

pmarreck · on May 3, 2021

This is super clever. If you add handling POST somehow as random access writes, you could have a read-write DB hosted pretty much anywhere

winrid · on May 3, 2021

Doesn't the webserver have to seek from the beginning of the ~600mb file to the range you want, unless the file is in memory?

philshem · on May 3, 2021

No, see here, for example

https://news.ycombinator.com/item?id=27018194

winrid · on May 4, 2021

That doesn't answer the question.

Yes the web server supports range requests. Yes it only returns 50kb.

But what mechanism is letting it scan to just those ranges in the binary file. Doesn't the file system make you seek to the starting block and then read from there?

The point is, while it looks very efficient, there might be a crap ton of IO going on for your little 50kb response.

EDIT: probably the webserver is doing an fseek(), and performance will vary based on file system impl. This is something I will need to benchmark.

ric2b · on May 9, 2021

For most filesystems it should be no issue to seek directly to a specific offset on a file, it's just like accessing an array, if you know where it starts and the offset you just need to add them and tell the disk to read that address.