LiteFS (fly.io)
648 points by danielskogly on Sept 21, 2022 | 154 comments



This is distributed SQLite 3, running (I assume at least partially managed?) LiteFS[5] for you. Which is pretty cool!

What I'd like to have seen is how this compares to things like rqlite[1] or Cloudflare's D1[2], addressed directly in the article.

That said, I think this is pretty good for things like read replicas. I know the sales pitch here is as a full database, and I don't disagree with it; if I were starting from scratch today and could use this, I would totally give it a try and benchmark/test accordingly. However, I can't speak to that use case directly.

What I find, however, and what I can speak to, is that most workloads already have a database of some kind set up, and it's typically not SQLite as the main database (MySQL or PostgreSQL seem most common). This is a great way to make very - insanely, really - fast read replicas of your data across regions. You can use an independent Raft[3][4] implementation to do this on write. If your database supports it, you can even trigger replication directly from a write to the database itself (I think Aurora has this ability, and I think - don't quote me! - PostgreSQL can do this natively via an extension that kicks off a background job).

To that point, in my experience one thing SQLite is actually really good at is storing JSON blobs. I have successfully used it in the past to replicate JSON representations of read-only data, cutting read times significantly for APIs because the data is "pre-baked". The lightweight nature of SQLite lets you - if you wanted to naively do this - just spawn a new database for each customer and transform their data accordingly ahead of time. It's like AOT compilation for your data.
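Roughly what that looks like, as a minimal sketch: one SQLite file per customer holding pre-baked JSON, queried with json_extract. The file name, table, and JSON shape here are hypothetical; it assumes the mattn/go-sqlite3 driver and a SQLite build with the JSON1 functions available.

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        // Hypothetical: one database file per customer, built ahead of time.
        db, err := sql.Open("sqlite3", "customer_42.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS documents (
            id   TEXT PRIMARY KEY,
            body TEXT NOT NULL -- pre-baked JSON blob
        )`); err != nil {
            log.Fatal(err)
        }

        if _, err := db.Exec(`INSERT OR REPLACE INTO documents (id, body) VALUES (?, ?)`,
            "order-123", `{"status":"shipped","items":[{"sku":"a1","qty":2}]}`); err != nil {
            log.Fatal(err)
        }

        // Reads are local and can reach inside the JSON with json_extract.
        var status string
        if err := db.QueryRow(`SELECT json_extract(body, '$.status') FROM documents WHERE id = ?`,
            "order-123").Scan(&status); err != nil {
            log.Fatal(err)
        }
        fmt.Println(status) // "shipped"
    }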

If you want to avoid some complexity with sharding (you can't always avoid it outright, but this can help cap its complexity), this approach helps enormously in my experience. Do try before you buy!

EDIT: Looks like it's running LiteFS[5], not Litestream[0]. This was my error of understanding.

[0]: https://litestream.io/

[1]: https://github.com/rqlite/rqlite

[2]: https://blog.cloudflare.com/introducing-d1/

[3]: https://raft.github.io/

[4]: https://raft.github.io/#implementations

[5]: https://github.com/superfly/litefs


LiteFS/Litestream author here. You bring up a lot of great points that I'll try to address.

> What I'd like to have seen is how this compares to things like rqlite or Cloudflare's D1 addressed directly in the article

I think a post comparing the different options is a great idea. I'll try to summarize a bit here though. LiteFS aims to be an analogue to Postgres replication but with built-in failover. Postgres uses log shipping to copy database state from a primary to its replicas, as does LiteFS. LiteReplica is probably the closest thing to LiteFS although it uses a dual GPL/commercial license model. LiteFS uses Apache 2.

There are Raft-based tools like rqlite or dqlite. These have higher consistency guarantees, however, they tend to be more complex to set up -- especially in an ephemeral environment like Kubernetes. LiteFS has a more relaxed membership model for this reason. As for Cloudflare's D1, they haven't released many details and I assume it's closed source. It also requires using a custom JavaScript client instead of using a native SQLite client.

There are also several eventually consistent stores built on SQLite such as Mycelial[1]. These work great for applications with loose consistency needs. LiteFS still maintains serializable isolation within a transaction, although it has looser guarantees across nodes than something like rqlite.

> Most workloads already have a database of some kind set up, typically not SQLite as their main database (MySQL or PostgreSQL seem most common)

Yes, that's absolutely true. I don't expect anyone to port their application from Postgres/MySQL to SQLite so they can use LiteFS. Databases and database tools have a long road to becoming mainstream and it follows the traditional adoption curve. I wrote a Go database library called BoltDB about 10 years ago and it had the same adoption concerns early on. Typically, folks try it out with toy applications and play around with it. Once they get more comfortable, then they create new applications on top of it. As more people use it, then late adopters get more comfortable and it further builds trust.

We're committed to LiteFS for the long term so we'll be making updates, fixing bugs, and we'll keep trying to build trust with the community.

[1]: https://mycelial.com/


Note: I apologize if this is overstepping, it's hard to tell!

I think a strong - extremely strong - selling point is the point I made about "prebaked" data for your APIs. The entire strength of these SQLite-based systems resides in their fast read capacity (as mentioned elsewhere and in this article, it's very fast for read-heavy applications, which is most of them), so you could take an angle around that to get people "in the door" by showing a pathway for how this fits inside your existing data warehouse / storage model.

We found we liked SQLite's durability for this. It was a bit smarter than a plain cache (such as Redis), with better durability and (for our needs) comparable enough performance (in absolute terms, I think a tuned Redis instance will always be faster, but up to a certain point speed isn't everything, especially when factoring in cost).

We found it was cheaper - by a good margin - to do this than to cache everything AOT in a Redis cluster, and we could therefore much more cheaply go multi-region and have DBs sitting next to our customers that acted as a nearline cache.

The complexity - which is an area where this might help in the future, and why I'm mentioning it - is shipping changes back. What we ended up doing was setting up a write Redis cluster that clients write to; we take those writes and trigger a propagation job back to the database. This allowed us to run a much slimmer Redis cluster and made us feel more comfortable doing cache eviction, since we could verify writes pretty easily. You could do this with memcached or whatever, too.

Sounds convoluted, but it worked really well. It allowed us to keep our centralized database intact without having to spin up expensive instances to go multi-region or commit to ever-growing Redis cluster(s). The SQLite flat-file model plus its history of durability made it the perfect tradeoff for this use case. Of course, YMMV, but it was a novel solution that used "enterprise grade" parts all the way down, which made it an easy selling point.

You might find it worth exploring this more.

As far as the comparisons go, I think it'd be cool to see a deep dive and run a test suite against each of the major SQLite-as-a-distributed-database options. For that, I don't think everything has to be open source to do a reasonable comparison?


> I think a strong - extremely strong - selling point is the point I made about "prebaked" data for your APIs.

I think caches are an excellent use case for LiteFS early on. Sorry I didn't make that point in my previous reply. It's a good way to get benefits out of LiteFS without committing to it as your source of truth. Also related, Segment built a custom SQLite-based solution[1] for distributing out cached data that worked well for them.

[1]: https://segment.com/blog/separating-our-data-and-control-pla...

> We found it was cheaper - by a good margin - to do this over caching everything AOT in a redis cluster

Do you remember specifics of the cost difference? I can imagine it'd be pretty significant since you don't need to spin up servers with a bunch of RAM.

> Note: I apologize if this is overstepping, its hard to tell!

Not overstepping at all! It's great to hear folks' feedback.


I'm a little fuzzy on the numbers, but it was 2-3x (or thereabouts), since we didn't have to run multiple gigabyte-sized clusters anymore. Even with auto-scaling based on demand, the clusters were more expensive simply because, if you wanted users to have a good experience, you had to make sure a reasonable amount of their data was AOT cached, which was itself a lot of complexity to manage.

With SQLite, we could just scale instances up during peak demand (so you could read from different instances of the same data if there was a bottleneck) and scale back down again, without (usually!) losing the data.

It was a really complex - but fun - project all told, but its underpinnings were really simple.

How Segment approached the problem isn't dissimilar to how we did it, honestly.

The only thing that we (may) have done differently is that we had failover. If SQLite didn't respond, our API layer could talk directly to a database service to get the data. That was surprisingly complex to do.

It's entirely possible that setups even more robust than ours would yield higher cost savings. We did this before we hit our next scale of customers, just to add a little more context.


> I think caches are an excellent use case for LiteFS early on.

I mean, folks can do stuff like this on Fly with Redis backed by disk, too: https://fly.io/blog/last-mile-redis/


Cloudflare D1 is in closed beta so everything is presumably tentative, but atm they don't support transactions (!) and it doesn't appear to be on their near-term product roadmap. That one really surprised me.


> This is distributed SQLite 3, running (I assume at least partially managed?) Litestream for you.

Oh, I forgot to touch on this point. LiteFS shares some concepts with Litestream (e.g. log shipping); however, it doesn't use Litestream internally. It has much stricter requirements in terms of ensuring consistency since it's distributed, so it performs an incremental checksum of the database on every transaction. It also gains additional benefits from its internal storage format, called LTX. These storage files can be compacted together, which will allow point-in-time restores that are nearly instant.


Thanks.

> LiteFS still maintains serializable isolation within a transaction, although, it has looser guarantees across nodes than something like rqlite.

Picking up a term from the consistency map here [0], what guarantees does LiteFS make across nodes?

[0] https://jepsen.io/consistency


Good question. Right now, LiteFS operates with async replication, so it's possible for a transaction written to the primary to get lost if the primary fails before it's replicated out. For that short window, you could read a transaction that no longer exists. From that standpoint, I believe it would technically be Read Uncommitted.

However, during normal operation it'll function more like Snapshot Isolation. LiteFS does provide a transaction ID so requests could wait for a replica to catch up before issuing a transaction to get something closer to Serializable.


I think you're being too harsh on yourself; isolation levels don't typically take replication lag and network partitioning into account. For example, on MySQL you can have async replication and serializable isolation. Replicas in this mode might never receive updates.


Isolation levels in ACID are typically used for single instances. However, if you're talking about a distributed system, then the consistency diagram on Jepsen is more applicable[1]. Raft, for example, can ensure read consistency across a cluster by running reads through the consensus mechanism.

LiteFS aims to be the analogue to Postgres/MySQL replication and those generally work great for most applications.

[1]: https://jepsen.io/consistency


Are you adopting the TH3 approach for your codebase? One of the distinguishing things about SQLite is the DO-178B TH3 test harness. I'm not sure any other open source database has that. http://www3.sqlite.org/th3.html


The TH3 test is proprietary and, IIRC, it might only be available to SQLite Consortium members and that's $125k/year. We do pay for SQLite support at Fly.io but not quite to that level.

We do have plans to run against the Tcl test suite[1] although most of that test suite is not applicable to LiteFS since it tests higher level constructs.

Since LiteFS acts on the raw pages, it really just functions similarly to a VFS. Out of the 130K source lines of code in SQLite, only 4.8K are for the Unix VFS. As such, the testing coverage from the SQLite test suite mostly exercises non-VFS code.

[1]: https://github.com/superfly/litefs/issues/17


> This is distributed SQLite 3, running (I assume at least partially managed?) Litestream[0] for you.

Litestream and LiteFS are by the same author and serve different purposes.

https://litestream.io/alternatives/ goes into this. It includes LiteFS and rqlite.

[Edited for clarification]


I did see this; it's a little light on details in the way I was thinking about this, however I appreciate the callout, especially for others!

It is missing the D1 comparison though.


I'm happy to add D1 onto the alternatives list once it's released. I believe it's still in closed beta though.


This is really cool! Unfortunately I'm primarily interested in offline databases, so perhaps I'm just not the target audience. However, I have to ask, on that note: does this have any application in the offline space?

I.e., I wonder if there's a way to write your applications such that they have less/minimal contention, and then allow the databases to merge when back online? Of course, what happens when there inevitably _is_ contention? etc.

Not sure that idea would have a benefit over many SQLite DBs with userland schemas mirroring CRDT principles though. But a boy can dream.

Regardless, very cool work being done here.


Thanks! It depends on your particular use case. If you need all nodes to allow writes then, unfortunately, no it doesn't support that. LiteFS uses physical replication so there's no way to merge pages from separate writers. You're probably better off looking at eventually consistent SQLite tools like Mycelial[1].

However, if you have a single writer and just need replicas to get updates when they periodically connect then yes, LiteFS could fit that use case.

[1]: https://mycelial.com/


Shameless plug but would love for you to try out Ditto[1]! CRDT-based replicated database that works peer-to-peer including automatically with WiFi and Bluetooth.

[1] https://www.ditto.live


> Developing against a relational database requires devs to watch out for "N+1" query patterns, where a query leads to a loop that leads to more queries. N+1 queries against Postgres and MySQL can be lethal to performance. Not so much for SQLite.

This is misleading AFAICT. The article is actually comparing a remote RDBMS to a local RDBMS, not Postgres to SQLite.

Postgres can also be served over a UNIX socket, removing the individual query overhead due to TCP roundtrip.

SQLite is a great technology, but keep in mind that you can also deploy Postgres right next to your app as well. If your app is something like a company backend that could evolve a lot and benefit from Postgres's advanced features, this may be the right choice.
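For reference, here's a minimal sketch of the UNIX-socket point above, assuming the lib/pq driver and the common /var/run/postgresql socket directory (the path and dbname vary by install and are placeholders):

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq"
    )

    func main() {
        // A "host" that points at a directory makes lib/pq connect over a
        // UNIX domain socket instead of TCP.
        db, err := sql.Open("postgres",
            "host=/var/run/postgresql dbname=app sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        var one int
        if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
            log.Fatal(err)
        }
        log.Println("connected over a UNIX socket:", one)
    }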


You could run Postgres over UNIX sockets, although you will still get higher latency than SQLite's in-process model. Also, running Postgres on every app instance on the edge probably isn't practical. Postgres has some great advanced features if you need them, but it's also much more heavyweight.

With LiteFS, we're aiming to easily run on low resource cloud hardware such as nodes with 256MB or less of RAM. I haven't tried it on a Raspberry Pi yet but I suspect it would run fine there as well.


SQLite almost certainly is the better edge RDBMS than Postgres, if only because it has fewer features taking up space.

However, "local SQLite vs. remote Postgres/MySQL" remains a false dichotomy when talking about network latency.


It's an application design choice. It's perfectly reasonable to consider those two options when designing a system. The constraints of the system determine which is better for the application.

I'd pick a centralized network-reachable database with a strong relational schema for write-heavy applications, and a lighter in-process system for something that is mostly reads and where latency matters. It's not a false dichotomy — but more like a continuum that certainly includes both extremes on it.


There's still plenty of overhead in serializing data over a unix domain socket to a different process, waiting for that process to be scheduled, waiting for it to serialize a response, waiting for the client process to be scheduled. SQLite avoids all of that.


Isn't this all talking to S3 anyway, not to mention the network trips intrinsic to the system before it gets to S3? I mean, I'm sure there's some performance win here, but I'm surprised it's so significant.

It's not like the address-space separation is without benefits... heck, if it weren't, you could simply have embedded the whole application inside Postgres and achieved the same effect.


You may be confusing Litestream and LiteFS.

Litestream writes everything to S3 (or similar storage).

LiteFS lets different nodes copy replicated data directly to each other over a network, without involving S3.

In either case, the actual SQLite writes and reads all happen directly against local disk, without any network traffic. Replication happens after that.


I think I was definitely confusing it with Litestream as the blog post made reference to it (and I did find that confusing).

That said, unless I've misunderstood the LiteFS use case, you're still going over the network to reach a node, and that node is still going through a FUSE filesystem. That would seem to create overhead comparable to (potentially more significant than) talking to a Postgres database hosted on a remote node.

It just doesn't seem that obvious that there's a big performance win here. I'd be curious to see the profiling data behind this.


It's an "eventual consistency" transaction scheme for SQLite: reads happen locally and immediately against the nearest edge copy available; only writes need to be forwarded over the network to a (potentially far away) primary DB somewhere. Transactions then take some amount of network time to fan back out to the edges, so eventually the edges will read the latest transactions, but during that fan-out they will read the previous transaction state.


Local for the application code on the node you're talking to (just like the read would be local for a Postgres process running on the node), but there's still a network trip to get to the node, no?

Even if the reads are happening locally, if they're going through FUSE (even a "thin" pass-through that does nothing), that means they're getting routed from kernel space to a FUSE daemon, which means you're still doing IPC to another process through a kernel...


The performance boost that matters most here is when your application reads from the database. Your application code is reading directly from disk there, through a very thin FUSE layer that does nothing at all with reads (it only monitors writes).

So your read queries should mostly be measured in microseconds.


> So your read queries should mostly be measured in microseconds.

You should check out the read latency for read-only requests over unix domain sockets with PostgreSQL. You tend to measure it in microseconds, and depending on circumstances it can be single-digit microseconds.

Regardless of whether your FUSE logic does nothing at all, it sure seems like there's intrinsic overhead to the FUSE model that is very similar to the intrinsic overhead of talking to another userspace database process... because you're talking to another userspace process (through the VFS layer).

When the application reads, those requests go from userspace to the FUSE driver & /dev/fuse, with the thread being put into a wait state; then the FUSE daemon needs to pull the request from /dev/fuse to service it; then your FUSE code does whatever minimal work it needs to do to process the read and passes it back through /dev/fuse and the FUSE driver, and from there back to your application. That gets you pretty much the same "block and context switch" overhead of an IPC call to Postgres (arguably more). FUSE uses splicing to minimize data copying (of course, unix domain sockets also minimize data copying), though looking at the LiteFS Go daemon, I'm not entirely sure there isn't a copy going on anyway. Memory copying issues aside, from a latency perspective, you're jumping through very similar hoops to talking to another user-space process... because that's how FUSE works.

There's a potential performance win if the data you're reading is already in the VFS cache, since that would bypass having to go through the FUSE filesystem (and the /dev/fuse-to-userspace jump) entirely. The catch is, at that point you're bypassing SQLite transaction engine semantics entirely, giving you a dirty read that's really just a cached result from a previous read. That's not really a new trick, and you can get even better performance with client-side caching that avoids a trip to kernel space.

I'm sure there's a win here somewhere, but I'm struggling to understand where.


> There's a potential performance win if the data you're reading is already in the VFS cache, since that would bypass having to go through the FUSE filesystem.

That's a typical case for a lot of databases. Most of the "hot" data is in a small subset of the pages and many of those can live in the in-process page cache.

> The catch is, at that point you're bypassing SQLite transaction engine semantics entirely, giving you a dirty read that's really just a cached result from a previous read.

It's not bypassing the transaction engine semantics. For WAL mode, SQLite can determine when pages are updated by checking the SHM file and then reading the updated pages from the WAL file. Pages in the cache don't need to be flushed on every access, or even between transactions, to be valid.

> I'm sure there's a win here somewhere, but I'm struggling to understand where.

The main goal of LiteFS is to make it easy to globally replicate applications. Many apps run in a single region of the US (e.g. us-east-1) and that's fast for Americans but it's a 100ms round trip to Europe and a 250ms round trip to Asia. Sure, you can spin up a multi-region Postgres but it's not that easy and you'll likely deploy as separate database and application servers because Postgres is not very lightweight.

LiteFS aims to have a minimal footprint so it makes it possible to deploy many small instances since SQLite is built to run on low resource hardware.

As far as comparisons with Postgres over UNIX sockets, I agree that the performance of a single instance is probably comparable with a FUSE layer.


> That's a typical case for a lot of databases. Most of the "hot" data is in a small subset of the pages and many of those can live in the in-process page cache.

Yes, though most database engines end up managing their own cache and using direct IO, rather than the VFS cache.

> It's not bypassing the transaction engine semantics. For WAL mode, SQLite can determine when pages are updated by checking the SHM file and then reading updated pages from the WAL file. Pages in the cache don't need to flushed on every access or even between transactions to be valid.

That sounds a lot like at least the SHM & WAL checks wouldn't be cached by the VFS, but as I've been looking at the design more carefully, I'm starting to think I understand the idea here. Basically, the SHM & WAL get updated separately, so you might read a stale version, but since you aren't elected to be a writer, that just means you're looking at stale data, not creating an integrity problem.

> The main goal of LiteFS is to make it easy to globally replicate applications. Many apps run in a single region of the US (e.g. us-east-1) and that's fast for Americans but it's a 100ms round trip to Europe and a 250ms round trip to Asia. Sure, you can spin up a multi-region Postgres but it's not that easy and you'll likely deploy as separate database and application servers because Postgres is not very lightweight.

So, I get that multi-region Postgres can be tricky to set up, if you're doing multi-leader, but this seems about as complicated as a "single leader, many followers" set up, and given that what you're trying to do is shave off the hundreds of milliseconds from partially circumnavigating the earth at the speed of light, I'm not sure the perceived performance differences are significant (or really even measurable) compared to fluctuations in network latency of requests to the region-local node.

> LiteFS aims to have a minimal footprint so it makes it possible to deploy many small instances since SQLite is built to run on low resource hardware.

This part I'm getting, and the objective makes a lot of sense to me (and certainly running local Postgres instances on every node wouldn't be an obvious approach to me). I hadn't thought of FUSE + SQLite as a way to get there, so this is an interesting and surprising approach. I'm looking forward to seeing how this plays out.

> As far as comparisons with Postgres over UNIX sockets, I agree that the performance of a single instance is probably comparable with a FUSE layer.

Interesting. I was thinking I was missing something. Thanks for all the insight.


Yes, FUSE adds overhead. However,

1. kernel page cache fully removes the read overhead for in-memory pages

2. There's a FUSE_PASSTHROUGH mode that removes that overhead for all reads & writes. I haven't studied exactly what writes LiteFS needs to observe (just journal vs all data writes), but at least for reads it seems to pass them straight through. We could well submit a FUSE_PASSTHROUGH_READ patch to the kernel, and use that to remove all read overhead. The patch should be trivial, since the full FUSE_PASSTHROUGH mode is there already.

Disclaimer: I wrote the FUSE framework LiteFS uses, https://bazil.org/fuse

https://source.android.com/docs/core/storage/fuse-passthroug... https://lwn.net/Articles/674286/


I had totally forgotten about FUSE_PASSTHROUGH. That's a great insight. Thanks.


Wow, thank you so much for explaining the overhead around FUSE usage, that’s really eye opening and makes having some performance benchmarks down the line even more interesting.


That's a really detailed, interesting reply. I am now really interested to understand more about the comparative overhead here.


TL;DR: "Postgres doesn't fit our edge-services business model well," which, sure, fine, but the article is indeed biased/misleading by completely ignoring and not even mentioning the option of running Postgres and the app on a single server.

The gains in latency with sqlite won't matter as soon as throughput starts to dominate.


Anyone want to chime in with the largest app they've deployed where prod Postgres was reachable over a Unix domain socket?


Used to work developing standalone kiosks and things (think along the lines of stuff you'd find at a museum as the basic type of thing I'd work on), so absolutely not the use case you're thinking of.

In a number of cases, we were working with data that would really benefit from actual GIS tooling so PostGIS was kind of the natural thing to reach for. Many of these kiosks had slow, intermittent, or just straight up no internet connection available.

So we just deployed PostGIS directly on the kiosk hardware. Usually little embedded industrial machines with passive cooling working with something like a generation or two out-of-date Celeron, 8GB RAM, and a small SSD.

We'd load up shapefiles covering a bunch of features across over a quarter million square miles while simultaneously running Chromium in kiosk mode.

I know for a fact some of those had around 1k DAU. I mean, never more than one simultaneously but we did have them! I'm sure it would have handled more just fine if it weren't for the damn laws of physics limiting the number of people that can be physically accessing the hardware at the same time.

That said, we had the benefit of knowing that our user counts would never really increase and due to real-world limitations we'd never be working with a remote database because there was literally no internet available. In general I'd still say colocating your database is not an excuse to allow N+1 query patterns to slip in. They'll be fine for now and just come back to bite you in the ass when your app _does_ outgrow colocating the app/database.


I'm surprised no one is chiming in from the LAMP era. Maybe I was just in a bubble, but I feel like before modern cloud providers, folks deploying on VPSs tried to keep operational complexity lower by running their database on the same box until they couldn't anymore.

In my experience, I just cached the living hell out of the apps to avoid the I/O from the database. Providers in those days had huge issues with noisy neighbors, so I/O performance was quite poor.

[0]: https://en.wikipedia.org/wiki/LAMP_(software_bundle)


Back in my consulting days one of my clients had an application with nearly all of the business logic being done in-database. The app was essentially a frontend for stored procedures galore and it sat on an absolute monster of a system for its time (~256GB of RAM, 24 or 32 Cores I think). It handled something like 100k DAU with no issue. That was for just the transactional system. They replicated data to other (similar style) systems for less mission-critical things like reporting, billing, etc.

I want to be clear though, I would never recommend anyone do this. For its time it was impressive, but I suspect by now it's been re-engineered away from that model, if for no other reason than it's a lot of eggs in one basket.


It's a great way to deploy if you have the RAM and disk.

Bump up the shared buffers, have a multithreaded or pooled driver for your app server so connections are not being opened and closed all the time.

It really 'flies' under such a scenario.

Largest app was a big analytics and drug tracking application for a small pharmaceutical company. Not sure of the size but it was very reporting heavy.


I would wager that the median IT shop/dept is running off a single node RDBMS, the majority of which also have the spare resources to support running the primary application (which has 1-100 DAU).

No practical way to test this, of course.


Had an instance with almost 2TB on disk in 2015, it's probably a lot more by now, if it's still running as-is. Was for analytics/data-warehousing. Though the big table was partitioned so queries were always nice and snappy. The machine had 128gb ram back then, but probably never needed that much. The applications working with the data ran on that same server, so processing latency was usually <10ms even for larger queries.


Is it cheating when pgbouncer is being accessed via socket? lol


Yes, that is cheating.


I don't have any context around this: why is that cheating?


Because the point of the Unix domain socket (in this argument) is eliminating network round trips.


Oh gotcha, I assumed they meant all 3 were on one host. I'm not too familiar with the pg ecosystem.


Not me personally, but my understanding is that some Cisco products have Postgres "embedded." If they're not using unix domain sockets, they're at least using loopback.

...come to think of it, they may not be the only network device maker that does this.


Are there significant limits to the size of an app that could be deployed alongside Postgres versus the size of an app that could be deployed alongside SQLite?


No, the big difference is write speed. SQLite is limited in the number of writes it can do at the same time (typically 1); this is generally called "concurrent writes". Last I checked, SQLite can't really get past one write at a time, but it can emulate concurrent writes in WAL mode such that it isn't normally a problem for most applications. Postgres has no such limit (though there can be limits to updating a particular row/column of data concurrently, obviously).

Otherwise both are generally limited by the physical resources of the machine (memory, disk, etc). Generally speaking, you can scale boxes much, much further than your data size for most applications.


> it can emulate concurrent writes in WAL mode

Seems a bit unfair to call WAL mode an emulation of "true" concurrent writes, as I'm pretty sure a write-ahead log (WAL) is exactly how other databases implement multiple concurrent writes. It's just always on rather than being opt-in.


Well, but that's not really what it's doing. WAL mode in SQLite is different than in your typical RDBMS: it just gets the write path out of the read path, but there is still only one write allowed at any given time. The "fake" concurrency is done by letting write transactions queue up and wait for the lock.

See [here](https://sqlite.org/wal.html) under "2.2. Concurrency" where it says:

> "However, since there is only one WAL file, there can only be one writer at a time."

SQLite is awesome, I'm a huge fan, but if you need to do lots and lots of writes, then SQLite is not your friend. Luckily that's a very rare application. There is a reason you never see SQLite being the end point for logs and other write-heavy applications.
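For what it's worth, here's a minimal sketch of how that single-writer queuing is usually configured from application code; it assumes the mattn/go-sqlite3 driver, whose DSN options are used here so the settings apply to every pooled connection:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        // _journal_mode=WAL moves writes out of the read path; readers don't
        // block the single writer. _busy_timeout makes a second writer wait
        // up to 5s for the write lock instead of failing with SQLITE_BUSY.
        // There is still only one writer at a time.
        db, err := sql.Open("sqlite3", "file:app.db?_journal_mode=WAL&_busy_timeout=5000")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS events (msg TEXT)`); err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(`INSERT INTO events (msg) VALUES (?)`, "hello"); err != nil {
            log.Fatal(err)
        }
    }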


We use managed postgres on GCP, and it is served over unix socket.


Kind of by definition, GCP's managed postgres is served up over the network. I'm guessing you mean "unix socket" in the sense of the socket API, not unix domain sockets.


A better comparison might be Postgres to BedrockDB.

BedrockDB is a MySQL-compatible interface wrapped around SQLite and, as such, is intended for remote RDBMS use.

Expensify has some interesting scaling performance data for their use of it.

https://blog.expensify.com/2018/01/08/scaling-sqlite-to-4m-q...


Isn't "n + 1" typically a degenerate case of ORM rather than a relational thing? The whole point of rdbms is you can do a join instead of iterated lookups.


You see it in direct code too, from new or naive coders (the good old "it was fast enough on my test data…") or when proof-of-concept code "accidentally" escapes into production.

Even within the database you'll sometimes find examples: stored procedures looping over a cursor and running queries or other statements in each iteration. This doesn't have the same latency implications as N+1 requests over a network, but still does add significant time per step compared to a set based equivalent.

If you think about it, the database will sometimes do this itself when your database is not optimally structured & indexed for your queries (in SQL Server, look at "number of executions" in query plans).


> You see it in direct code too, from new or naive coders

Many moons ago, in one of my first coding projects, I not only used a separate query for each record, but a separate database connection too. In addition, I didn't use a loop to generate these queries; I copied and pasted the connection setup and query code for each record, leading to a 22k code file just for this single web page.

There were probably on the order of 1000 records (they represented Premier League football players), and amazingly this code was fast enough (it took a couple of seconds to load, which wasn't so unusual back in 2007) and worked well enough that it actually had ~100 real users (not bad for a 14 year old's side project).


It is a result of representing application state as complex entity structures. I am not sure I would call it degenerate. n+1 would be quite natural if relational databases were theoretically pure. It only falls down in the real world because we accept deviation from the theory to allow implementation.

Joins are one possible workaround to the n+1 problem. However, they push the problem into the application to unpack, which detracts from having a nice high level abstraction in the database for fetching data. The many queries of n+1 would be ideal if you had an ideal database. You are querying on behalf of many different entities so many different queries is conceptually sound. But since we don't have ideal databases...

That's not to say that joins don't have their place. In the relational model, joins are essential. But the n+1 problem happens when you are not working in the relational model.


Yeah most N+1 issues can be mitigated by optimizing the queries.

You definitely don't need an ORM to get to N+1 hell, a for-loop in vanilla PHP will do.


Yeah, but you have to go out of your way to do that. If you're in touch with the SQL you're executing at all, the join is the obvious easy way to do it, it's not some arcane optimization.
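Concretely, with hypothetical users/orders tables, the difference looks something like this Go sketch (driver-agnostic, takes any *sql.DB): the first function is the N+1 shape, the second is the single join.

    package main

    import "database/sql"

    // Hypothetical schema: users(id), orders(user_id, total).

    // nPlusOne issues one query for the users, then one more query per user.
    func nPlusOne(db *sql.DB) error {
        rows, err := db.Query(`SELECT id FROM users`)
        if err != nil {
            return err
        }
        defer rows.Close()
        for rows.Next() {
            var id, total int
            if err := rows.Scan(&id); err != nil {
                return err
            }
            // One extra round trip per user: cheap in-process with SQLite,
            // expensive against a remote Postgres/MySQL.
            if err := db.QueryRow(`SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?`, id).Scan(&total); err != nil {
                return err
            }
        }
        return rows.Err()
    }

    // joined fetches the same data in a single statement.
    func joined(db *sql.DB) error {
        rows, err := db.Query(`SELECT u.id, COALESCE(SUM(o.total), 0)
            FROM users u LEFT JOIN orders o ON o.user_id = u.id
            GROUP BY u.id`)
        if err != nil {
            return err
        }
        defer rows.Close()
        for rows.Next() {
            var id, total int
            if err := rows.Scan(&id, &total); err != nil {
                return err
            }
        }
        return rows.Err()
    }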


You do see it often in things like canned ecommerce software. They have this ecosystem that encourages you to add widgets to category listings, product listings, etc, that all create queries keying off the page context.

Where widgets might be "similar items", "people that bought x, bought y", "recent purchases of this widget", "best sellers in this category" and so on.


That’s a really big “if”. In my experience, most application engineers aren’t very good at SQL because they don’t have to be; most apps just don't ever reach a scale where it matters, and something like Rails or Django with N+1 queries all over the place performs just fine.


Hence my 2 decade losing battle against ORM at every place I've worked :)

I'm, like, 1 for 10 maybe.


A for-loop in vanilla anything will do. Nothing special about PHP. Ruby, Python, Java, etc...


Of course. PHP is the one you will mainly encounter in the wilderness, though.


Most common place I saw N+1 is ActiveRecord (Ruby on Rails).

Because by default, no relationships are loaded, and the way AR "loads" stuff makes it extremely easy to create N+M queries in 5 lines of Ruby. Then those people will tell you that you run out of database before you run out of Ruby...


I get the impression that there's still quite a bit of overhead involved in talking to PostgreSQL that you don't get with SQLite. A SQLite query is pretty much a C function call that works directly with data structures and local disk.


Yep. The poison is still in the cup, it's just a smaller dose. Keep drinking if you think that fact is enough to save you...

I see bad judgement calls coming from small numbers, often due to failing to do the cost x frequency math properly in your head. If you're looking at an operation that takes 3ms or 3μs that is called a million times per minute, or twenty thousand times per request, you don't have enough significant figures there and people make mistakes. 3ms x 12345 looks like 37035ms, but the real total is closer to ~40s. Better to use a higher-resolution clock and find out it's actually 3.259ms, leading to a total of ~40.23s.

Point is, when we are doing billions of operations per second, lots of small problems that hide in clock jitter can become really big problems, and a 5-10% error repeated for half a dozen concerns can lead to serious miscalculations in capacity planning and revenue.


Select N+1 is a fundamental anti-pattern of relational database usage. Reducing the latency per round trip doesn't change this fact.


I disagree. I think it makes a big difference.

https://sqlite.org/np1queryprob.html#the_need_for_over_200_s... talks about this in the context of Fossil, which uses hundreds of queries per page and loads extremely fast.

It turns out the N+1 thing really is only an anti-pattern if you're dealing with significant overhead per query. If you don't need to worry about that you can write code that's much easier to write and maintain just by putting a few SQL queries in a loop!

Related: the N+1 problem is notorious in GraphQL world as one of the reasons building a high performance GraphQL API is really difficult.

With SQLite you don't have to worry about that! I built https://datasette.io/plugins/datasette-graphql on SQLite and was delighted at how well it can handle deeply nested queries.


Perhaps the overhead of chatty queries is diminished in this special use case.

But it doesn't change the fact that standalone database server processes are designed to support specific queries at lower frequencies. One of the main points of the SQL language is to load precisely the data that is needed in a single statement.

Relying on this as a design pattern would only scale in specific use cases and would hit hard walls as scenarios change.


This approach is very appealing to me :) curious about how people handle schema migrations when using this approach.

I segment SQLite files (databases) that have the same schema into the same folder. I haven't really had a case where migrations were really a concern, but I could see it happening soon.

Seems like in my deployment, I'm going to need an approach to loop over the DBs to apply this change... I currently have a step of app deployment that attempts to apply migrations, but it's more simplistic because the primary RDBMS (PostgreSQL) just appears to the application as a single entity, which is the normative use case for db-migrate runners.
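One naive way to do that loop, as a sketch; the folder layout, file glob, and migration statement are all hypothetical, and it assumes the mattn/go-sqlite3 driver:

    package main

    import (
        "database/sql"
        "log"
        "path/filepath"

        _ "github.com/mattn/go-sqlite3"
    )

    // Hypothetical: every *.db file in this folder shares the same schema.
    const migration = `ALTER TABLE documents ADD COLUMN updated_at TEXT`

    func main() {
        paths, err := filepath.Glob("tenants/*.db")
        if err != nil {
            log.Fatal(err)
        }
        for _, p := range paths {
            db, err := sql.Open("sqlite3", p)
            if err != nil {
                log.Fatal(err)
            }
            if _, err := db.Exec(migration); err != nil {
                // e.g. the column already exists on a re-run; decide per-app
                // whether this should abort or just be logged.
                log.Printf("%s: %v", p, err)
            }
            db.Close()
        }
    }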


Yeah schema migrations are going to be interesting. Presumably those end up making a HUGE change to the WAL (since they could affect every stored row in a large table), which means LiteFS then has to send a truly giant chunk of data out to the replicas.

I wonder how this would present itself in real-world usage? Would the replicas go out-of-date for 10-30s but continue serving read-only traffic, or could there be some element of cluster downtime caused by this?


> Presumably those end up making a HUGE change to the WAL (since they could affect every stored row in a large table), which means LiteFS then has to send a truly giant chunk of data out to the replicas.

Yes, a large migration could end up sending a lot of data out to replicas. One way to mitigate this is to shard your data into separate SQLite databases so you're only migrating a small subset at a time. Of course, that doesn't work for every application.

> Would the replicas go out-of-date for 10-30s but continue serving read-only traffic, or could there be some element of cluster downtime caused by this?

Once WAL support is in LiteFS, it will be able to replicate this out without causing any read down time. SQLite is still a single-writer database though so writes would be blocked during a migration.


> To improve latency, we're aiming at a scale-out model that works similarly to Fly Postgres. That's to say: writes get forwarded to the primary and all read requests get served from their local copies.

How can you ensure that a client that just performed a forwarded write will be able to read that back on their local replica on subsequent reads?


That's a fun one.

A couple years ago someone posted a solution to that here. I'm not sure if it works for SQLite, but it worked for Postgres. The basics of it were that each replica was aware of the latest transaction ID it had seen. On a normal read you'd deal with the usual set of eventually consistent issues. But on a read-after-write, you would select a replica that was ahead of the write transaction.

Ultimately that's a very odd flavor of sharding. What I don't recall is how they propagated that shard information efficiently, since with sharding and consistent hashing the whole idea is that the lookup algorithm is deterministic, and therefore can be run anywhere and get the same result. WAL lag information is dynamic, so... Raft?


LiteFS provides a transaction ID that applications can use to determine replication lag. If the replica is behind the TXID, it can either wait or it can forward to the primary to ensure consistency.


Is that transaction ID made available to SQLite clients? As a custom SQL function or a value in a readable table somewhere?


It’s available as a file handle. If you have a database file named “db” then you can read “db-pos” to read the replication position. The position is a tuple of the TXID and the database checksum.
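A minimal sketch of reading it from application code, assuming the database is mounted at a hypothetical /litefs/db and that the position file is a single line in the form TXID/checksum (check the LiteFS docs for the exact format):

    package main

    import (
        "fmt"
        "log"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("/litefs/db-pos") // hypothetical mount path for a database named "db"
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Read the whole position in a single read() so we don't observe a torn update.
        buf := make([]byte, 256)
        n, err := f.Read(buf)
        if err != nil {
            log.Fatal(err)
        }

        parts := strings.SplitN(strings.TrimSpace(string(buf[:n])), "/", 2)
        if len(parts) != 2 {
            log.Fatalf("unexpected position format: %q", buf[:n])
        }
        txid, checksum := parts[0], parts[1]
        fmt.Println("txid:", txid, "checksum:", checksum)
    }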


Can the app read it concurrently while it is updated by LiteFS?


Looks like LiteFS invalidates the cached file contents on every update of position. This means there is never stale data to be read.

(The client needs to make sure to consume that in a single pread/read syscall, or it could observe a sheared state.)

Disclaimer: I wrote the FUSE framework LiteFS uses, https://bazil.org/fuse

https://github.com/superfly/litefs/blob/52e269d4b04070690ce2... https://github.com/superfly/litefs/blob/52e269d4b04070690ce2... https://github.com/superfly/litefs/blob/a5cf33d1a3a91873d4ad...


Very clear. Thanks for the great explanation.


Yes, the app can read it whenever. There's a brief lock obtained on the current database position internally to read it out. The position only gets updated at the end of the transaction.


Another solution that I've used successfully in the past for web apps is to set a 10s cookie every time a user performs a write, and then route their read requests to the lead server until that cookie expires. That way they're sure to see the impact of the write they just made.
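A rough sketch of that in net/http terms; the cookie name, the 10s expiry, and the primary's address are placeholders, and the actual routing mechanism will depend on your stack:

    package main

    import (
        "net/http"
        "net/http/httputil"
        "net/url"
        "time"
    )

    // markWrite is called after a successful write so subsequent reads stick
    // to the primary until the cookie expires.
    func markWrite(w http.ResponseWriter) {
        http.SetCookie(w, &http.Cookie{
            Name:    "recent_write",
            Value:   "1",
            Expires: time.Now().Add(10 * time.Second),
            Path:    "/",
        })
    }

    // routeReads sends recent writers to the primary and everyone else to the
    // local replica handler.
    func routeReads(primary *url.URL, local http.Handler) http.Handler {
        proxy := httputil.NewSingleHostReverseProxy(primary)
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if _, err := r.Cookie("recent_write"); err == nil {
                proxy.ServeHTTP(w, r) // recent writer: read from the primary
                return
            }
            local.ServeHTTP(w, r) // everyone else: read from the local replica
        })
    }

    func main() {
        primary, _ := url.Parse("http://primary.internal:8080") // hypothetical address
        local := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("served from the local replica"))
        })
        http.Handle("/", routeReads(primary, local))
        http.ListenAndServe(":8080", nil)
    }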


If write traffic is low enough I could see that working, but 10s is an awful long time. Anything less than a 1000:1 read:write ratio would start running into capacity problems wouldn't it?


That's never been a problem on any system I've worked on. If you assume a web application with logged in users who are occasionally updating their data, the percentage of users who will have performed a write within the last 10s is always going to be absolutely tiny.


The absolute rate is the problem. You can't shed that load to other machines so the percentage of users or percentage of traffic doesn't matter. This is basically a variant of Little's Law, where server count has a ceiling of 1.

100 edits a minute from distinct sessions is 1000 sessions pinned at any moment. If they read anything in that interval it comes from the primary. The only question is what's the frequency and interval of reads after a write.


Yeah a transaction ID solution would work pretty well. Every request would return the highest transaction ID seen by the replica; writes forwarded to the primary would also return the transaction ID after that write; any request with a transaction ID higher than the replica would be forwarded to the primary.


I think that may have been the case. Try the local replica, if it is too old, hit the primary, rather than shopping around for the next best replica.


They don't have ACID guarantees with this setup.


ACID is typically used to describe single database instances and from that perspective, LiteFS doesn't change the ACID semantics that SQLite has. For distributed consistency models[1], it gets more complicated as LiteFS currently just supports async replication. We'll introduce stronger guarantees in the future although we currently provide some information such as transaction IDs so you can manage replication lag if you need stronger consistency.

[1]: https://jepsen.io/consistency


10 years ago fly.io is the company I wanted to build. Something with massive technical depth that becomes a developer product. They're doing an incredible job and part of that comes down to how they evangelise the product outside of all the technical hackery. This requires so much continued effort. AND THEN to actually run a business on top of all that. Kudos to you guys. I struggled so much with this. Wish you nothing but continued success.


5 years ago, fly.io was basically poor man's deno.com / oven.sh [0]. In my (incorrect) opinion, tptacek changed fly.io's trajectory single-handedly.

[0] https://ghostarchive.org/varchive/r-1hXDvOoHA


What is @tptacek’s relation with fly.io, and how did he change it?


tptacek is a security researcher and principal at fly.io [0]. And as an outsider looking in, he seems to have been the eng behind some of their defining features [1]:

- udp support

- container2vm overhaul

- private networks aka 6pn

- some key flyctl (cli) commands like flyctl ssh, flyctl proxy

- metrics

- litefs

- perhaps, the imminent overhaul of the orchestration layer (?)

- the upcoming authz layer

[0] https://archive.is/Zwzlh

[1] https://community.fly.io/u/thomas/activity/topics | https://fly.io/blog/author/thomas/


> seems to have been the eng behind some of their defining features

> - litefs

That glory goes to https://news.ycombinator.com/user?id=benbjohnson


GAH NO.


No? No. No!


You won't change my mind with 3 No's. A fourth however...


I had literally nothing to do with Fly.io's switch from Javascript to Firecracker. Happened months and months before I joined.


I don't want to come off as downplaying other engs at Fly (esp. those from Compose.io) [0], but I wanted to point out how there's a marked difference between the period from when you joined them and the years preceding that. A catalyst?

Of course, I claim no inside knowledge, so I may very well be mistaken (:

[0] https://news.ycombinator.com/item?id=9937023


I wonder if using FUSE has had any appreciable impact on performance, particularly read performance. I ask because FUSE has historically had a reputation for being slow, e.g. with the old FUSE port of ZFS.


I think FUSE performance has significantly improved since 2010, the year in which ZFS-On-Linux became available I believe.

This is just one of many examples: https://www.phoronix.com/news/MTI1MzM

On the contrary, in some cases FUSE is even faster than doing a regular kernel mount(). There is an experimental research distribution called distri that relies exclusively on FUSE mounting and found that FUSE was faster for doing mounts. https://michael.stapelberg.ch/posts/tags/distri/


You're conflating zfs-fuse and zfs-on-linux (ZoL).

The first one is, as its name suggests, a user space implementation of ZFS.

ZoL (now unified with OpenZFS) is implemented as a kernel module and as such does _not_ run in user space. It performs significantly better as a result.

FUSE is still slow, which is why there's ongoing effort to replace things like NTFS-3G (the default NTFS implementation in most linux distros) with an in-kernel implementation: https://news.ycombinator.com/item?id=28418674


How do I conflate zfs-fuse with ZoL? I was trying to highlight that if someone's memory of "poor fuse performance" is because of zfs-fuse, then that memory must be quite old, as ZoL has been available for quite some time, and since then FUSE itself has also received quite some performance gains.

Edit: Also, I don't want to imply that FUSE is near in-kernel filesystems, but it is certainly performing much better than it did 12 years ago.


Often, the SQLite database would fit in RAM, and reads would be served straight from the page cache, with no overhead.

Disclaimer: I wrote the FUSE framework LiteFS uses, https://bazil.org/fuse -- and I also have some pending performance-related work to finish, there...


The pain is that this approach is only suitable on VPS/IaaS where disk volumes are supported. As a solo dev, I only use PaaS-style infra, and there are just a few PaaS providers I'm aware of that support attachable disks. Fly, Render, ... nothing else?


Where is the data actually being stored in this setup? A copy on each machine running the application? If so, is there another copy somewhere else (e.g. S3) in case all nodes go down?

Also, what happens if the Consul instance goes down?

If my application nodes can't be ephemeral then this seems like it would be harder to operate than Postgres or MySQL in practice. If it completely abstracts that away somehow then I suppose that'd be pretty cool.

Currently finding it hard to get on board with the idea that adding a distributed system here actually makes things simpler.


> Where is the data actually being stored in this setup? A copy on each machine running the application?

Yes, each node has a full copy of the database locally.

> If so, is there another copy somewhere else (e.g. S3) in case all nodes go down?

S3 replication support is coming[1]. Probably in the next month or so. Until then, it's recommended that you run a persistent volume with your nodes.

> What happens if the Consul instance goes down?

If Consul goes down then you'll lose write availability but your nodes will still be able to perform read queries.

> If my application nodes can't be ephemeral then this seems like it would be harder to operate than Postgres or MySQL in practice.

Support for pure ephemeral nodes is coming. If you're comparing LiteFS to a single node Postgres/MySQL then you're right, it's harder to operate. However, distributing out Postgres/MySQL to regions around the world and handling automatic failovers is likely more difficult to operate than LiteFS.


> If Consul goes down then you'll lose write availability but your nodes will still be able to perform read queries.

That's a tricky choice of words, since it looks like you lose Consistency, while retaining Availability and Partition Tolerance. If Consul is down everyone reads stale data, but no writes. Right?

Of course, it's harder for Consul to go down than it is for your database to go down, so the Venn Diagram of "Consul unhappy, Database Happy" is fairly heavily populated with "user error". Which is why 'when in doubt, use Consul' is not terrible advice.


Will this replicate the filesystem/volume between instances? Can I, for example, use text files or a different kind of file-based database? Or does it require me to use SQLite?


This is only for SQLite. But you can put text files (or even binary files - see https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sq...) in a SQLite database and you may find you get better performance than reading files from disk: https://www.sqlite.org/fasterthanfs.html
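As a minimal sketch of the files-in-SQLite idea (table name and the file being stored are hypothetical; it assumes the mattn/go-sqlite3 driver):

    package main

    import (
        "database/sql"
        "log"
        "os"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "files.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS files (
            path TEXT PRIMARY KEY,
            data BLOB NOT NULL
        )`); err != nil {
            log.Fatal(err)
        }

        // Write a file's bytes into the database...
        body, err := os.ReadFile("notes.txt")
        if err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(`INSERT OR REPLACE INTO files (path, data) VALUES (?, ?)`,
            "notes.txt", body); err != nil {
            log.Fatal(err)
        }

        // ...and read it back with an ordinary query.
        var data []byte
        if err := db.QueryRow(`SELECT data FROM files WHERE path = ?`, "notes.txt").Scan(&data); err != nil {
            log.Fatal(err)
        }
        log.Printf("read %d bytes back", len(data))
    }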


> Second, your application can only serve requests from that one server. If you fired up your server in Dallas then that'll be snappy for Texans. But your users in Chennai will be cursing your sluggish response times since there's a 250ms ping time between Texas & India.

> To improve availability, it uses leases to determine the primary node in your cluster. By default, it uses Hashicorp's Consul.

Having a satellite office become leader of a cluster is one of the classic blunders in distributed computing.

There are variants of Raft where you can have quorum members that won't nominate themselves for election, but out of the box this is a bad plan.

If you have a Dallas, Chennai, Chicago, and Cleveland office and Dallas goes dark (ie, the tunnel gets fucked up for the fifth time this year), you want Chicago to become the leader, Cleveland if you're desperate. But if Chennai gets elected then everyone has a bad time, including Dallas when it comes back online.


Good catch. LiteFS has an option to mark nodes as candidates or not. For example, you can run a cluster with a couple candidate nodes in one region (e.g Chicago) and every other region may just run a single non-electable replica.

We don’t support any kind of tiering for candidates in different regions. That’s not a bad idea though.


I can imagine a database which tries to solve both of these domains.

A centralized database handles consistency, and vends data closures to distributed applications for in-process querying (and those closures reconcile via something like CRDT back to the core db).

Does this exist?


(except for the "reconcile back" to primary) Yes? https://martin.kleppmann.com/2015/11/05/database-inside-out-...

See also: materialize.com, readyset.io, aws elastic-views (forever in preview).


I actually gave fly.io a whirl over the weekend. Was not fun. I spent a lot of time on the forums and it's pretty clear it has some way to go before it can give AWS or Linode a run for their money.

For instance, we run Kubernetes on multiple VPS providers, including public clouds, with serverless onramps/offramps deployed in edge locations. Anything under 15 minutes is processed by serverless. Anything longer is offloaded to one of the VPS containers available in every part of the world.

I have some more feedback ready if you are interested. It's a neat idea, but not exactly as seamless and easy as proposed, since public clouds already offer a way to do this.


Currently, I am running multiple application servers (and lambda functions) using AWS Fargate that access sqlite files (databases) on an EFS share. So far so good, although my use cases are fairly simple.


Well that was fast... [1]

Are the readable replicas supposed to be long-lived (as in, I don't know, hours)? Or does consul happily converge even with ephemeral instances coming and going every few minutes (thinking of something like Cloud Run and the like, not sure if Fly works the same way)? And do they need to make a copy of the entire DB when they "boot" or do they stream pages in on demand?

[1] https://news.ycombinator.com/item?id=32240230


> Are the readable replicas supposed to be long-lived

Right now, yes, they should be relatively long-lived (e.g. hours). Each node keeps a full copy of the database so it's recommended to use persistent volumes with it so it doesn't need to re-snapshot on boot.

We do have plans to make this work with very short-lived requests for platforms like Lambda or Vercel. That'll use transactionally-aware paging on-demand. You could even run it in a browser, although I'm not sure why you would want to. :)


Sounds like a drop-in solution to add high availability to WunderBase (https://github.com/wundergraph/wunderbase). Can we combine LiteFS with Litestream for backups, or how would you do HA + backups together?


Yes, it should fit with WunderBase to provide HA. I'm adding S3 support which will work the same (in principle) as Litestream. I'm hoping to have that in about a month or so.


Can you comment on how you added SFTP support to litestream so that it would work with rsync.net[1][2] ?

How does that compare/contrast with what my grandparent is alluding to ?

[1] https://github.com/benbjohnson/litestream/issues/140

[2] https://www.rsync.net/resources/notes/2021-q3-rsync.net_tech...


Litestream does log shipping, so it just takes a chunk of the write-ahead log (WAL) and copies it out to a destination. That can be S3, GCP, SFTP, etc.

LiteFS will do the same although I'm only targeting S3 initially. That seemed to be the bulk of what people used.
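
Conceptually, log shipping is simple to sketch. This is not how Litestream or LiteFS actually do it (they also have to handle WAL checkpoints, generations, and crash recovery), but it shows the basic loop: watch the -wal file and copy any newly appended bytes off to a destination. The paths and the upload target below are hypothetical stand-ins.

  import os

  WAL_PATH = "app.db-wal"   # hypothetical SQLite WAL file
  DEST_DIR = "shipped"      # local stand-in for S3/SFTP/another host
  os.makedirs(DEST_DIR, exist_ok=True)

  def upload(chunk: bytes, offset: int) -> None:
      # Placeholder destination: write each chunk to a local directory.
      with open(os.path.join(DEST_DIR, f"{offset:016d}.walchunk"), "wb") as out:
          out.write(chunk)

  def ship_new_frames(last_offset: int) -> int:
      # Copy any WAL bytes appended since last_offset; return the new offset.
      size = os.path.getsize(WAL_PATH)
      if size > last_offset:
          with open(WAL_PATH, "rb") as wal:
              wal.seek(last_offset)
              upload(wal.read(size - last_offset), last_offset)
          return size
      return last_offset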


OK, thank you.

As with the litestream component, it would be wonderful to have an SFTP transport for LiteFS.

When you say 'S3' I assume you mean "S3 compatible API" so that would be fairly open and portable but SFTP would be even more so.

Thanks again.


Seems neat, until you try to do schema migrations. Unless they can guarantee that all containers’ SQLite instances have the same schema without locking, I'm not sure how this doesn't run into the same issues as many NoSQL databases.

CouchDB had this same issue with its database per user model and eventually consistent writes.


In my experience, distributed database migrations tend to be two-phase anyway, so the schema can stay compatible during the migration: a first migration adds something, the apps are released to use it, then a second migration cleans up (sketched below).
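
A minimal sketch of that two-phase (expand/contract) pattern with plain sqlite3 in Python; it assumes an existing users table with a username column, and all names are illustrative:

  import sqlite3

  con = sqlite3.connect("app.db")

  # Phase 1 (expand): add the new column. Old code ignores it, new code can use it.
  con.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
  con.execute("UPDATE users SET display_name = username WHERE display_name IS NULL")
  con.commit()

  # ... roll out application code that reads and writes display_name everywhere ...

  # Phase 2 (contract): remove the old column once nothing references it.
  # (DROP COLUMN needs SQLite 3.35+; older versions require a table rebuild.)
  con.execute("ALTER TABLE users DROP COLUMN username")
  con.commit()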


I don't see how this is different to a world where you use Postgres over the network, with lots of DB clients. You "just" do the schema migrations like everyone who can't shut down the business does: incrementally and in a backwards-compatible manner. After the migration, deploy new code that takes advantage of the new schema. Repeat until done.


Unrelated, but why does the map on their homepage show a region in Cuba? That must be wrong.


I think it's in Miami and the icon is confusingly centered over it. The icons look like pins but I think they're supposed to be hot-air balloons. This explains why some of them appear to point to spots in the ocean.

https://fly.io/docs/reference/regions/


Ahh. Great catch. Their balloon definitely looks like a pin.


ELI5?


LiteFS makes it so that you can have a SQLite database that is transparently replicated to a cluster of machines. The use case is if you have your app running on a server running in Chicago, then users in Europe will have 100ms latency to your server and users in Asia will have a 250ms latency. That's on top of the time it takes for your server to process the request. Many people target sub-100ms latency to make their applications feel snappy but that's impossible if users are waiting on their request to be sent halfway around the world.

Traditionally, it's complicated to replicate your database to different regions of the world using something like Postgres. However, LiteFS aims to make it as simple as just spinning up more server instances around the world. It connects them automatically and ships changes in a transactionally safe way between them.
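
From the application's point of view it stays plain SQLite: you open the database file under the LiteFS mount and the replication happens underneath, at the filesystem layer. A rough sketch (the mount path is whatever you configure; writes need to happen on the primary node):

  import sqlite3

  # Open the database through the LiteFS mount point (path is illustrative).
  con = sqlite3.connect("/litefs/app.db")
  con.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")

  # Writes go to the primary; the changed pages are shipped to the replicas.
  con.execute("INSERT INTO notes (body) VALUES (?)", ("hello from the primary",))
  con.commit()

  # Reads can be served from any node's local copy.
  print(con.execute("SELECT count(*) FROM notes").fetchone()[0])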


> Traditionally, it's complicated to replicate your database to different regions of the world using something like Postgres

What makes it complicated?


It’s a tool that lets you keep your database on the app server and have it backed up on S3. This means you don’t need a separate database server. Having the database on the same machine as the app is very fast, but it is usually risky. The backups made by LiteFS lessen the risk.


That's Litestream, not LiteFS.


Does anyone else bump into the issue that the fly.io website does not load when requested via IPv6 on a Mac? I tried Safari, Chrome, and curl, and none of them work:

  $ curl -v https://fly.io/blog/introducing-litefs/
  *   Trying 2a09:8280:1::a:791:443...
  * Connected to fly.io (2a09:8280:1::a:791) port 443 (#0)
  * ALPN, offering h2
  * ALPN, offering http/1.1
  * successfully set certificate verify locations:
  *  CAfile: /etc/ssl/cert.pem
  *  CApath: none
  * (304) (OUT), TLS handshake, Client hello (1):
  curl: (35) error:02FFF036:system library:func(4095):Connection reset by peer

Requesting via IPv4 works:

  $ curl -4v https://fly.io/blog/introducing-litefs/
  *   Trying 37.16.18.81:443...
  * Connected to fly.io (37.16.18.81) port 443 (#0)
  * ALPN, offering h2
  * ALPN, offering http/1.1
  * successfully set certificate verify locations:
  *  CAfile: /etc/ssl/cert.pem
  *  CApath: none
  * (304) (OUT), TLS handshake, Client hello (1):
  * (304) (IN), TLS handshake, Server hello (2):
  * (304) (IN), TLS handshake, Unknown (8):
  * (304) (IN), TLS handshake, Certificate (11):
  * (304) (IN), TLS handshake, CERT verify (15):
  * (304) (IN), TLS handshake, Finished (20):
  * (304) (OUT), TLS handshake, Finished (20):
  * SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256
  * ALPN, server accepted to use h2
  * Server certificate:
  *  subject: CN=fly.io
  *  start date: Jul 25 11:20:01 2022 GMT
  *  expire date: Oct 23 11:20:00 2022 GMT
  *  subjectAltName: host "fly.io" matched cert's "fly.io"
  *  issuer: C=US; O=Let's Encrypt; CN=R3
  *  SSL certificate verify ok.
  * Using HTTP2, server supports multiplexing
  * Connection state changed (HTTP/2 confirmed)
  * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
  * Using Stream ID: 1 (easy handle 0x135011c00)
  > GET /blog/introducing-litefs/ HTTP/2
  > Host: fly.io
  > user-agent: curl/7.79.1
  > accept: */*
  >
  * Connection state changed (MAX_CONCURRENT_STREAMS == 32)!
  < HTTP/2 200
  < accept-ranges: bytes
  < cache-control: max-age=0, private, must-revalidate
  < content-type: text/html
  < date: Wed, 21 Sep 2022 16:50:16 GMT
  < etag: "632b20f0-1bdc1"
  < fly-request-id: 01GDGFA3RPZPRDV9M3AQ3159ZK-fra
  < last-modified: Wed, 21 Sep 2022 14:34:24 GMT
  < server: Fly/51ee4ef9 (2022-09-20)
  < via: 1.1 fly.io, 2 fly.io
  <
  <!doctype html> ...


Could easily be a path MTU issue? I don't currently have working IPv6, but on TCP/IPv4, the server->client handshake packets I get back from fly.io are full-length packets. If your TCP-over-IPv6 MSS is set incorrectly, and some other not-so-great things are happening in the path, then you might not be able to receive large packets.


> Could easily be a path MTU issue?

Perhaps: https://community.fly.io/t/ipv6-pmtud-issue/5081


How could I find out if this is the case? And, what can I do about it? Disabling IPv6 obviously fixes it, but that it not a solution...


I'm not aware of a simple-to-use test site for this on IPv6 (I've got one for IPv4, but my server host doesn't do IPv6 either, and using a tunnel would limit the ranges I can test, etc.)... so you'll need to test ye olde fashioned way.

I don't have a mac, but if the man page[1] is right, something like this should work to see how big of a packet you can successfully send and receive:

      ping -6 -D -G 1500 -g 1400 fly.io
(you may need to run as root to set packet sizes). You should get back a list of replies with say 1408 bytes, then 1409, etc. The last number you get back is effectively the largest IPv6 payload you can receive (on this path, it could be different for other paths), and if you add the IPv6 header length of 40, that's your effective path MTU.

Use tcpdump to see what TCP MSS is being sent on your outgoing SYN packets. For IPv4, the MSS is MTU - 40 (20 for the IPv4 header, 20 for the TCP header); for IPv6, the MSS should be MTU - 60 (40 for the IPv6 header, 20 for the TCP header). If your TCP MSS advertises more than the observed path MTU to fly.io allows, that's likely the immediate cause of your problem.
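
The arithmetic in one place, as a tiny Python helper (header sizes are the standard ones, without TCP options):

  def expected_mss(mtu: int, ipv6: bool) -> int:
      ip_header = 40 if ipv6 else 20   # IPv6 vs. IPv4 header
      tcp_header = 20                  # TCP header without options
      return mtu - ip_header - tcp_header

  print(expected_mss(1500, ipv6=False))  # 1460
  print(expected_mss(1492, ipv6=True))   # 1432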

If you're using a router, make sure it knows the proper MTU for the IPv6 connection, and enable MSS clamping on IPv6, if possible --- or make sure the router advertisement daemon shares the correct MTU.

Hope this gets you started.

[1] https://ss64.com/osx/ping.html


Thanks, that was of great help!

Just for reference, the ping command is a little different

  sudo ping6 -D -G 1500,1400 fly.io
I set an MSS of 1492, which pfSense (my router) translates to an MSS clamp of 1492-60 for IPv6 and 1492-40 for IPv4. This is a German Deutsche Telekom Fiber connection. Now everything works fine: I can reach fly.io (and I also discovered that https://ipv6-test.com was not working before and now does with the MSS clamping).

Does MSS clamping have any disadvantages? Are there any alternatives in my case?


Excellent, happy to help, glad you figured out the right command! A 1492 MTU is consistent with PPPoE, or a network where they didn't want to run a separate MTU for PPPoE and straight Ethernet.

The only downside to MSS clamping is the computational expense of inspecting and modifying the packets. On a residential connection, where you're running pfSense already, it's probably not even noticeable; but your ISP wouldn't be able to do clamping for you, because large-scale routers don't have the processing budget to inspect packets at that level. I've seen some MSS clamping implementations that only clamp packets going out to the internet and not the return packets; that can lead to problems sending large packets (which isn't always very noticeable, actually; a lot of basic browsing doesn't send packets large enough to hit this, unless you go to a site that sets huge cookies or you do some real uploading).

The alternative would be to run a 1492 MTU on your LAN, but that has the marginal negative of reducing your maximum packet size for LAN to LAN transfers.


Just tried it on my Linux box:

  *   Trying 2a09:8280:1::a:791:443...
  * Connected to fly.io (2a09:8280:1::a:791) port 443 (#0)
  * ALPN, offering h2
  * ALPN, offering http/1.1
  *  CAfile: /etc/ssl/certs/ca-certificates.crt
  *  CApath: /etc/ssl/certs
  * TLSv1.0 (OUT), TLS header, Certificate Status (22):
  * TLSv1.3 (OUT), TLS handshake, Client hello (1):
  * TLSv1.2 (IN), TLS header, Certificate Status (22):
  * TLSv1.3 (IN), TLS handshake, Server hello (2):
  * TLSv1.2 (IN), TLS header, Finished (20):
  * TLSv1.2 (IN), TLS header, Supplemental data (23):
  * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
  * TLSv1.2 (IN), TLS header, Supplemental data (23):
  * TLSv1.3 (IN), TLS handshake, Certificate (11):
  * TLSv1.2 (IN), TLS header, Supplemental data (23):
  * TLSv1.3 (IN), TLS handshake, CERT verify (15):
  * TLSv1.2 (IN), TLS header, Supplemental data (23):
  * TLSv1.3 (IN), TLS handshake, Finished (20):
  * TLSv1.2 (OUT), TLS header, Finished (20):
  * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
  * TLSv1.2 (OUT), TLS header, Supplemental data (23):
  * TLSv1.3 (OUT), TLS handshake, Finished (20):
  * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
  * ALPN, server accepted to use h2
  * Server certificate:
...

Same on my macbook


Works fine over IPv6 from this Mac.


Still wondering why projects like https://www.reactivated.io/ are using fly.io.


My dream would be if this supported geo-partitioning. In my field, people are pretty sensitive about GDPR, so I would love to box EU PII into EU servers.


Since the scope is a SQLite database, that means you just run an EU database, with no replicas outside of the EU.

This level of database replication isn't going to put individual rows in different geographies.


This reads like a professor who's so steeped in research that he's forgotten how to communicate to his students!

What exactly are we talking about here? A WebSQL that's actually synced to a proper RDBMS? Synced across devices? I'm not clear on an end-to-end use case.

Edit: Honestly, this line from the LiteFS docs[0] needs to be added to the top of the article:

> LiteFS is a distributed file system that transparently replicates SQLite databases. This lets you run your application like it's running against a local on-disk SQLite database but behind the scenes the database is replicated to all the nodes in your cluster. This lets you run your database right next to your application on the edge.

I had no idea what was being talked about otherwise.

[0]: https://fly.io/docs/litefs/



