This is really clever. I've been wanting to try the WASM version of SQLite for something - this is a really smart usage of it.
My https://datasette.io/ project is built around a similar idea to this: the original inspiration for it was Zeit Now (now Vercel) and my realization that SQLite read-only workloads are an amazing fit for serverless providers, since you don't need to pay for an additional database server - you can literally bundle your data with the rest of your application code (the SQLite database is just a binary file).
If you want to serve a SQLite API in an edge-closest-to-the-user manner for much larger database files (100MB+) I've had some success running Datasette on https://fly.io/ - which runs Docker containers in a geo-load-balanced manner. I have a plugin for Datasette that can publish database files directly to Fly: https://docs.datasette.io/en/stable/publish.html#publishing-...
Thanks for making Datasette! I've been using it to build a project to make US ranked-choice election results more accessible. It's very intuitive and easy to use. The main difficulty for me is how to get around the limitation that the SQLite files have to be colocated on the same server. These datasets can get pretty large: I can't host them on GitHub, and since I can't put the Datasette DB on an S3 bucket, I've been exploring mounting AWS Elastic File System into the Docker container. Is there a better way?
Google Cloud Run works well with database files up to a couple of GBs. Beyond that the best solution at the moment is to run a VPS - I've had good results with DigitalOcean droplets.
FYI, I'm using Google Cloud Run for this at the moment. Scales to zero, issues the SSL cert and provides the HTTPS endpoint. That's what I'd be benchmarking your service against. Let me know if you actually do something in this space.
Hey, thanks. That sounds interesting, but this is a volunteer project - I'm donating my time, so I don't think I can justify a service like that right now. The datasets are around 200MB each, just over GitHub's 100MB limit.
I’m using the WASM SQLite (in the browser) for a BI product [1] I’m working on.
It’s been working extremely well and it runs fully sandboxed in a separate worker thread with zero network traffic after the source data has been transferred across. With computational notebooks running alongside it, you can do some pretty powerful analytics with very low latency and it’s great to be able to do inserts and complex queries entirely client-side.
SQLite is one of the greatest software projects I know.
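For the curious, the worker-thread pattern is simple. Here's a hedged sketch, assuming sql.js is bundled so that its sql-wasm.js build (and the .wasm file beside it) is loadable from a classic Web Worker - file names and message shapes are my assumptions, not the product's actual code:

    // worker.js - all SQLite work stays inside this worker thread
    importScripts("sql-wasm.js"); // assumed bundle path; defines initSqlJs globally

    let db = null;

    onmessage = async (e) => {
      if (e.data.type === "load") {
        // one-time transfer of the source data into the sandbox
        const SQL = await initSqlJs();
        db = new SQL.Database(new Uint8Array(e.data.bytes));
        postMessage({ type: "ready" });
      } else if (e.data.type === "query") {
        // after the initial load, queries run entirely client-side -
        // zero network traffic from here on
        postMessage({ type: "result", rows: db.exec(e.data.sql) });
      }
    };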
classic serverless --> the term "embedded database" already exists and is widely used
neo-serverless --> I suspect that this is a marketing term used to attract cool people to new cloud offerings (like "jamstack" for services such as netlify). There is no good replacement here because the term was tied to this new kind of infrastructure really early. But anytime I hear someone repeat "of course serverless does not mean there is no server!" I die a little more.
Naming things is hard but at least we ought to try to come up with terms that are not blatantly misleading.
Look, the industry has settled on the term "serverless" to mean: "You just upload some application code and the system automatically provisions servers to run it."
No one has the power to change this. It's like saying "I think the word 'spoon' is stupid, let's all call it 'scooper' instead." This is our language now, it is what it is. Vendors offering serverless solutions have to call it "serverless" whether they like it or not because that's the word customers understand. You can keep complaining about it, but it won't change. You might as well accept it and move on.
"Cloud" is a stupid term too, your servers aren't really floating in the sky. But here we are.
If only somebody would come up with a Common Interface for application Gateways that would let the webserver dynamically select based on URL routing which executable to provision temporarily to generate that page.
CGI is in no way more understandable than serverless as a concept, especially when it shares its acronym with the very popular computer generated imagery.
The English language has a bazillion "wrong" constructions. What is this, kindergarten? Serverless clearly has gained a specific meaning in the industry: "without manually spinning up servers".
And this is how the English language has become what it is. If someone politely points out that the use of a word is incorrect, instead of thanking and correcting themselves, people just shrug and say: "it is what it is" and continue to spread the incorrect usage. I even wonder why bother with grammar in a language like this.
> The English language has a bazillion "wrong" constructions.
Sometimes I complain about some of those too.
Personally, I'm looking forward to the next loop around, when the concept is rediscovered again. It will need a new name. Unfortunately, that's probably 20 or 30 years away.
I think the idea is that if enough of us push back, then people will stop using that term. After all, "the industry" is made up of people whose minds we can attempt to change.
The "industry" is comprised of much more than techies -- there are millions of marketing people out there who get hyped up at the shallowest sounding buzzword and we absolutely never will out-shout them. Nor do customers who love buzzwords.
Buzzwords can be fickle things. “Wireless” used to mean radio. Who calls the internet “the information superhighway” anymore? “Computer” used to be a job title. How many job ads capitalise “PC/MAC” even though it’s a contraction of Macintosh and not an initialism like Personal Computer is? The first “tablet” I bought was a Wacom — an accessory, not a computer in its own right. When did we stop calling pocket computers “PDAs”?
"Digital" used to mean something that related to fingers or toes. Then it meant "electronic". Now it means "distributed via network instead of physical medium", even though it's the first "D" in "DVD".
It's not just marketers. The vast majority of engineers are just fine with this term. The set of people complaining about it are a tiny minority.
I'm the lead engineer on Cloudflare Workers. We didn't originally call it "serverless", but after talking to lots of engineers who said "Oh so it's serverless?" we decided to go with that term. The decision was not made by marketers.
There's nothing "ouch" about it. People understand a term to mean a thing, we are doing that thing, we adopt the same term, now everyone understands what we're doing. Communication is good.
<something>“less” has been in use for a long time and typically means a paradigm shift in how the <something> is performed. Horseless carriages still had means of locomotion; driverless cars will still have a means of effecting driver functions; eggless recipes typically have some other substance to perform the job of eggs. In my mind <something>“less” is not as objectionable as <something>“-free”, like “sugar-free”, which isn’t to say that something isn’t sweet, but that the sweetening comes from something else.
I really thought we'd have grid computing, aka utility computing. Running our self-contained software agents, persisting data and state in ubiquitous tuplespaces.
That's my excuse for being so slow to grasp "cloud". When AWS made EC2 public, I really didn't understand why anyone would want to use it.
Cloud era "serverless" just pisses me off. The applicable cliche is "Every sufficiently complex C project recreates LISP, poorly."
It seems like we're doomed to repeatedly reinvent the past, championed by youngsters who never read the book.
Maybe this explains "worse is better".
I know, I know. I've become the old crank. All you noisy kids get off my lawn.
That's Greenspun's tenth rule, the full version of which is: "Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."
Robert Morris' addendum is, "Including Common Lisp."
If you don't believe Robert Morris, tell me what LOOP does. :-)
(Note, this is the Robert Morris who wrote the first major internet worm, cofounded Viaweb with Paul Graham, and cofounded Y Combinator, again with Paul Graham.)
I wouldn't be surprised if the term originated as a marketing tactic (e.g., at AWS) to seed in people's heads the idea of not having to care about servers.
You hit the nail on the head. It is all about marketing to the new front-end developers who don't want to learn server-side work: manipulating databases, files, etc.
Personally, I think the whole IndexedDB/localStorage thing really breaks the web page as a stateless model. Why do you need to save so much local data just to maintain a session? Stop putting everything in the web browser.
I can see the value of serverless in general (not this hack necessarily) in small startups that don't have sysadmins on payroll but still want to rapidly deploy products with confidence that it will just work - all without worrying about infrastructure, scaling, etc.
What I'm curious about is (1) whether serverless is cheaper than hiring competent sysadmins who can maintain the infrastructure instead, and (2) whether those savings are worth being locked into a chaotic architecture and boring proprietary tooling that is forced on you for the profit of the Google, Microsoft, and Amazon monopolies.
I've recently entered the job market and personally find no joy in working in serverless environments because of the latter.
Personally, I suspect that the chaotic tooling will waste more dev time than the sysadmin time would cost you, making it a net loss even before you consider lock-in and fees.
While a good argument in favor of vendor-locked serverless services can be made at the start of a project, it looks like a net loss during any other phase of the project.
It comes down to whether you want generic, reusable sysadmins vs. cloud serverless specialists.
> I think I will always struggle to understand the popularity of the term serverless.
AWS marketing is very good.
Serverless was a creation of the AWS Lambda team / AWS marketing department. "I helped start the serverless movement..." - from the LinkedIn of Tim Wagner (https://www.linkedin.com/in/timawagner/)
They probably call them serverless because you are mainly billed per request instead of per hour of runtime (although for DynamoDB there is a cost per table and per read/write).
Almost like how the English language has been literally hijacked and destroyed.
Language does change and evolve. The misuse of "literally," and the additional meaning it subsequently acquired as a synonym for "figuratively," is a good example of how this isn't necessarily always a good thing.
If a change diminishes the clarity, efficiency or even the beauty of the language for no good reason, then it's a bad change, at least for the purposes of communication.
> One of the definitions of literally that we provide is "in effect, virtually—used in an exaggerated way to emphasize a statement or description that is not literally true or possible." Some find this objectionable on the grounds that it is not the primary meaning of the word, "with the meaning of each individual word given exactly." However, this extended definition of literally is commonly used and is not quite the same meaning as figuratively ("with a meaning that is metaphorical rather than literal").
That seems to confirm my impression that "literally" is used to indicate exaggeration, rather than to mark something as figurative.
I've also experimented with this; unfortunately the 50ms CPU-time limit isn't enough for datasets larger than 1.5MB. And the WASM init adds at least 100ms to each request, even when "hot" in cache.
Also, from a cost perspective, a cheap VM with an SSD running SQLite will serve far more requests than a CF Worker, for much less money.
Adding writes to this is also very limited, due to the cap of one write per second per KV key.
I've also tried full-text search in a Worker by pre-indexing the content; it works very fast even with a JS engine - less than 5ms to search 5MB of text.
It runs out of CPU time at 6MB of index, though.
Someone also made a WASM full-text search module; it's definitely faster and can handle quite a lot more text.
I was thinking of making it possible for SQLite to be used with static pages.
My idea is to modify SQLite to use Ajax requests with the HTTP Range header to fetch B-tree pages from the server as they are needed. SQLite already has a VFS (virtual file system) layer, so this shouldn't be too hard.
I am not sure how fast it would be and it would waste a lot of bandwidth. That's why I haven't made it yet.
This would really only be useful for hosting on GitHub Pages.
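To make it concrete: a read-only VFS mostly needs a read(offset, length) primitive, which maps directly onto a Range request. A hedged sketch of the read path - the URL and page size are placeholders:

    // fetch a single SQLite page via an HTTP Range request
    // (SQLite numbers pages from 1; page N starts at byte (N - 1) * pageSize)
    async function fetchPage(url, pageNumber, pageSize = 4096) {
      const start = (pageNumber - 1) * pageSize;
      const res = await fetch(url, {
        headers: { Range: `bytes=${start}-${start + pageSize - 1}` },
      });
      if (res.status !== 206) throw new Error("server ignored the Range header");
      return new Uint8Array(await res.arrayBuffer());
    }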
Hm, interesting. I made this, if you want to take a look: https://github.com/TomasHubelbauer/sqlite-javascript. You can test it using the Demo link in the readme, then use the prefilled value in the Load from URL prompt. Based on this I find your idea very doable. I didn't do it myself, but while working on this I had async fetching of pages using Range requests on my mind. Shame I didn't get to it while I was working on this project; it could be easy to put together a demo. Similarly, I made https://github.com/TomasHubelbauer/fatcow-icons, which parses a big ZIP file piecewise on demand. As you scroll, new parts of the archive are fetched and icons extracted, so you don't need to download the whole thing.
I have SQLite running 'on top' of a Chrome cache filesystem. My use case is to torrent the database, as I also have support for files. The caveat is that I primarily use it as a key-value store (the B-tree storage part of SQLite).
They are used to share structured and unstructured data in a p2p way, where a torrent can be a database (structured) or a fileset (unstructured).
But all of this is part of a bigger project that is basically a new sort of p2p "app browser" with a local-first emphasis, where every app can share its functionality through RPC (locally or remotely), with distribution going over DHTs and torrents.
Yes. Maybe you could increase the size of the B-tree pages?
The only reason databases use B+ trees rather than red-black or AVL trees is the overhead of reading data from disk.
This would be an interesting hack.
You can increase the page size [1]. That will increase the size of the B-tree nodes [2] (and other pages too, but that's probably what you want):
> The upper bound on [the number of keys on an interior b-tree page] is as many keys as will fit on the page.
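For what it's worth, changing the page size on an existing database requires a VACUUM to rewrite the file. With sql.js-style bindings that might look roughly like this - a sketch under my own assumptions, not from the thread:

    // rewrite a database with 32 KB pages so each fetched page carries
    // more B-tree keys per network round trip
    const initSqlJs = require("sql.js");

    async function rebuildWithLargePages(existingBytes) {
      const SQL = await initSqlJs();
      const db = new SQL.Database(existingBytes);
      db.run("PRAGMA page_size = 32768;"); // takes effect on the next VACUUM
      db.run("VACUUM;");                   // rewrites the whole file
      return db.export();                  // Uint8Array, now with 32 KB pages
    }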
I think a tricky part of this idea would be the locking. Usually SQLite relies on the locking of the underlying filesystem. You could add your own mechanism that assigns a lock to a single client connection, but what if it never unlocks? (On a single machine you can tell whether the client process has crashed.) You could add a timeout, but what if the client process then comes back and responds after all?
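One standard answer to that last problem is a lease with fencing tokens: each lock grant carries a monotonically increasing number, and writes presenting a stale token are rejected. A toy sketch - all names made up, not anyone's actual design:

    // toy lease lock with fencing tokens
    let lease = null;   // { holder, token, expiresAt }
    let nextToken = 1;

    function acquire(holder, ttlMs) {
      const now = Date.now();
      if (lease && lease.expiresAt > now) return null; // still held by someone
      lease = { holder, token: nextToken++, expiresAt: now + ttlMs };
      return lease.token;
    }

    function write(token, applyFn) {
      // a client that comes back after its lease expired presents a stale
      // token here and is rejected instead of corrupting the database
      if (!lease || lease.token !== token || lease.expiresAt < Date.now()) {
        throw new Error("stale or expired lock token");
      }
      applyFn();
    }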
Even with an artificially slow connection it seems to be responsive enough. Actually, fetching pages on demand can only be better for bandwidth; the real issue is going to be latency.
I haven't looked at the code but this sounds cool.
There is also something called Kademlia, a distributed hash table used to keep track of peers for torrents.
https://en.wikipedia.org/wiki/Kademlia
So we start adding SQLite.
Then comes some sort of small framework for templating.
Then we add session storage.
Then we add an API endpoint so we can serve a SPA.
I think after that it's really time to create a VM in SQLite so we can emulate Linux and run something else in the 'serverless' endpoint.
Lol, I did this (SQLite via REST, custom URLs, static file hosting) plus authentication/authorization and SMTP with email templates. Everything needed for SPAs; no one was interested.
Using SQLite compiled to Wasm in order to push computation closer to the user is a powerful idea. I'm partial to the method of serving up the SQLite files directly and building applications around the SQL.js library [1], which includes math extensions and the ability to embed JavaScript UDFs. I wrote a data visualization using SQLite as the data store [2] and can attest that it's refreshing to use SQL inside a static website.
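For anyone who hasn't tried SQL.js, here's a minimal sketch of the pattern, assuming the .sqlite file is served as a static asset - the path, table, and UDF are made up:

    // initSqlJs is provided globally by the sql.js script tag
    const initSqlJs = window.initSqlJs;

    async function main() {
      const SQL = await initSqlJs();
      const bytes = await (await fetch("/data/stats.sqlite")).arrayBuffer();
      const db = new SQL.Database(new Uint8Array(bytes));

      // register a JavaScript UDF, then call it from SQL
      db.create_function("log2", (x) => Math.log2(x));
      const rows = db.exec("SELECT name, log2(population) FROM cities");
      console.log(rows);
    }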
I did exactly that for a minor app I built for a relative (a vocabulary app): a single HTML page, loading a remote SQLite file, allowing for indexed user searches with no perceivable network latency. The downside is having to transfer 4MB to the client upfront, of course, but gzip shrinks that down to 1.5MB on the wire, which, while heavy, is acceptable nowadays.
I would agree and here's why: when compared to PHP scripts of old, the concept is the same. You have some bit of executable code that gets called when a specific URL is hit.
"Don't care where it runs, don't want to manage it, I want to pay only for when I actually use it, thanks for managing it, here's some extra cash for the effort"
But you can only apply that metric if you remove the end users from actually touching the machines (virtual or real).
This is the "less" in serverless.
Turns out that this is also something people want: not having to be on the hook for actually administering machines, installing security patches etc etc.
So managed (platform as a) service + fine grained billing
I would have preferred something like daemonless, processless, or something that more closely describes what's being offered. My suggestions also are flawed but less flawed than serverless, imho.
> Recently, folks have begun to use the word "serverless" to mean something subtly different from its intended meaning in this document. Here are two possible definitions of "serverless":
Classic Serverless: [...]
Neo-Serverless: [...]
SQLite was already "Classic Serverless". Now it's "Neo-Serverless" as well.
Thanks for sharing your writeup and the potential use cases therein.
We've been building an experimental shared file system that specifically targets the FaaS setting. It supports SQLite and gets big performance gains over NFS/EFS, especially on read-mostly workloads, due to improved local caching and lock elision. See: https://arxiv.org/abs/2009.09845
Did I read it right that you get 1-second responses? Why even bother with lambdas? A t3.nano reserved instance costs 2 bucks a month or something! Surely most folks don't anticipate 1000x scaling in minutes? Coupled with Elastic Beanstalk you can get reasonable scaling as well!
Might be useful for services that aren't continually used - for example, month-end or year-end processes. Not convinced I would use SQLite for a service like that though, seeing as AWS has serverless RDS.
This was just a proof-of-concept using the smallest lambdas available. I would definitely opt for an EC2 instance, though Lambda would be enough for something like a configuration service.
Is a lambda easier to configure and set up? With Elastic Beanstalk and Flask (and the sample code they provide) I can get an API running in a few minutes.
Both are trivial and take only minutes if you're already familiar with them, but Lambda doesn't involve managing instances, their OSes, or making sure they scale up / down (at the instance level) the way you want.
I was thinking a bit about an "on-edge DB" recently, perhaps something as simple as a way to persist and sync IndexedDB data between user sessions. Not usable for everything, but probably great for a small CMS or a blog.
For running a write-enabled DB on a CF Worker, a major problem is that the only storage option I could find has really low write limits [0] - 1,000 writes a day on the free plan. "Durable Objects" is a beta API; perhaps its transactional storage API [1] has better limits?
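Based on the beta docs, each Durable Object gets its own strongly consistent, transactional storage that isn't subject to KV's per-key write cap. A hedged sketch of what usage looks like (the class and key are made up):

    // a Durable Object counter using the beta transactional storage API
    export class Counter {
      constructor(state, env) {
        this.state = state; // state.storage is transactional per object
      }

      async fetch(request) {
        // unlike Workers KV, no 1-write-per-second-per-key limit here
        const n = ((await this.state.storage.get("count")) || 0) + 1;
        await this.state.storage.put("count", n);
        return new Response(String(n));
      }
    }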
edge-sql [1] allows arbitrary SQLite queries to be executed over an immutable external data set. The demo uses a Forex data set stored in Workers KV.
Letting clients issue arbitrary queries is one of the use cases for GraphQL, and publishing immutable data sets on the web is the main use case for Simon Willison's Datasette [2].
In-memory SQLite compiled to WASM works in the browser and Node.js too. In the future, we can expect proper ACID operations on any WASM runtime that supports fsync in WASI [3], a POSIX-like API.
I have a Google Cloud Function that uses SQLite to store a user's previous day's tweets and do some minor sorting and filtering of data. Works very well and saves me the cost of a "real" SQL instance.
CDNs allow serving files from the location closest to the user, cutting out hundreds of milliseconds of latency.
But they are limited to read-only files.
Cloudflare Workers allow running simple programs at the CDN node closest to the user.
SQLite is embeddable C code that works like a normal SQL server, just without the network parts - synchronous functions embedded in your process.
This wraps SQLite with the networking and CDN abilities of Cloudflare, allowing you to HTTP GET a subset of the data dynamically using SQL (which would not be possible with a simple "download 100% of file X").
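Put together, the shape of the thing is roughly this hedged sketch - not the project's actual code; DATA is an assumed KV binding holding the database image, and openSqliteWasm stands in for the WASM build:

    addEventListener("fetch", (event) => {
      event.respondWith(handle(event.request));
    });

    async function handle(request) {
      const sql = new URL(request.url).searchParams.get("sql") || "SELECT 1";
      // DATA is a Workers KV namespace binding with the SQLite file image
      const bytes = await DATA.get("database", "arrayBuffer");
      // openSqliteWasm is a hypothetical helper around the WASM SQLite build
      const db = await openSqliteWasm(new Uint8Array(bytes));
      return new Response(JSON.stringify(db.exec(sql)), {
        headers: { "content-type": "application/json" },
      });
    }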
In AWS Lambda you can run arbitrary processes in their serverless service - e.g. my company runs ffmpeg in Lambda. Why does this have to be WASM in a JavaScript engine? Why can't they just run SQLite by forking a native process from JavaScript (or by using a SQLite binding for JavaScript)? Is this a POC for WASM?
Very cool. I had a similar idea and I'm glad to see it implemented and working well.
Workers KV imposes a 25MB limit per key, and the Worker memory limit is 128MB. Concatenating several values from the store, or using SQLite's ATTACH DATABASE, should make it possible to query databases of roughly 100MB, would be my guess.
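A hedged sketch of the concatenation part - the key naming and chunk count are made up:

    // reassemble a database split across several KV values (25 MB max each)
    async function loadChunked(kv, prefix, partCount) {
      const parts = await Promise.all(
        Array.from({ length: partCount }, (_, i) =>
          kv.get(`${prefix}:${i}`, "arrayBuffer")
        )
      );
      const total = parts.reduce((n, p) => n + p.byteLength, 0);
      const out = new Uint8Array(total);
      let offset = 0;
      for (const p of parts) {
        out.set(new Uint8Array(p), offset);
        offset += p.byteLength;
      }
      return out; // feed this into the WASM SQLite build
    }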
This is great for any time you have a mostly read-only database (which is true for most publishing use cases - anything you might consider a static site generator for) that you want to provide API access to with lightning-fast speed, because replies will be served from a CDN edge.
The data served from the edge can be modified without touching code, just produce a new version of your SQLite database.
A small DB for searching the list of storefronts that typical retail websites have, or a database of flags or promotion codes that you don't want to send to the browser, are some examples I came up with.
Personally I would write a code generator to embed the data in the code but to each their own.
With this, suppose you want faster responses than disk: every serverless function can have its own replica of the data and power responses off of that. It's like a cache accessed through a SQL client. But wait, can't we already do that by running in-memory SQLite? Why is it a big deal now?