
Inability to scale with sustained usage (1+ person*year of data) is the fatal problem with this category in existing approaches. The root of it is the "partial sync problem": when the dataset outgrows the memory and compute resources available on the client device (which is unreliable, not under your control to make reliable, and resource constrained - and not everybody has the latest giga device), you have to somehow decide which working set to replicate. If the app is structured as a graph and involves relational queries, this is undecidable without actually running the queries! If the app is constrained to topic documents, you still have to choose which topics to cache, and it still falls over at document sizes easily reached in prolonged solo use, let alone at an enterprise, which is the market you need to justify VC investment. All this in an increasingly connected world where the cloud is 10ms away from city/enterprise users (the $$$) and offline periods (subway, etc.) resolve quickly, so a degraded offline mode (letting you edit whatever you were last looking at, without global relational queries) is often acceptable. Oh, and does your app need AI?

Change my view!




Author here! A few thoughts:

1. That "in existing approaches" qualifier is important — local-first is still very much a nascent paradigm, and there are still a lot of common features that we don't really know how to implement yet (such as permissioning). You might be correct for the moment, but watch this space!

2. I think most apps that would benefit from a local-first architecture do not have the monotonically growing dataset you're describing here. Think word processors, image editors, etc.

3. That said, there are some apps that do have that problem, and local-first probably just isn't the right architecture for them! There are plenty of apps for which client-server is a fundamentally better architecture, and that's okay.

4. People love sorting things into binaries, but it doesn't have to be zero-sum. You can build local-first features (or, if you prefer, offline-first features) into a client-server app, and vice versa. For example, the core experience of Twitter is fundamentally client-server, but the bookmarking feature would benefit from being local-first so people can use it even when they're offline.


My claim is that the partial sync problem is intractable unless you can partition your dataset into small topic documents. If this claim is correct, it is not "momentarily correct", it is inevitably correct, i.e., "incapable of being avoided or evaded". If the claim is not correct, I welcome any of the many researchers in this space to correct me!


I understand your claim! Mine is that even if it's correct, it's not "the fatal problem in this category", for the reasons I outlined.


A lot of the newer local-first systems, like Triplit (biased because I work on it), support partial replication, so only the requested/queried data is sent and subscribed to on the client.

The other issue with relying on just the server to build these highly collaborative apps is that you can't wait for a roundtrip to the server on each interaction if you want it to feel fast. Sure, you can for those rare cases where your user is on a stable WiFi network, on a fast connection, and near their data; however, a lot of computing is now on mobile, where pings are much higher than 10ms, and on top of that, when you have two people collaborating from different regions, someone will be too far away to rely on round trips.

Ultimately, you're going to need a client-side caching component (at least optimistic mutations), so you can either lean into that or end up with a complicated mess where all of your logic is duplicated in the frontend and backend (potentially in different programming languages!).

The best approach IMO (again, biased toward what Triplit does) is to have the same query engine on both client and server, so the exact query you use to get data from your central server can also be used to resolve from the client's cache.
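Roughly, the idea looks like this (an illustrative TypeScript sketch, not Triplit's actual API; all names here are hypothetical):

    // One query engine, two storage backends: the client resolves the
    // same query shape against its partial local cache that the server
    // resolves against the full database.
    interface Row { id: string; [field: string]: unknown }

    interface Store {
      scan(collection: string): Promise<Row[]>;
    }

    interface Query {
      collection: string;
      where: (r: Row) => boolean;
    }

    // The shared engine: identical code ships to client and server.
    async function execute(store: Store, q: Query): Promise<Row[]> {
      const rows = await store.scan(q.collection);
      return rows.filter(q.where);
    }

    // Client-side store over the partially replicated cache. The server
    // implements the same interface over the real database, so optimistic
    // local results and authoritative server results agree.
    class CacheStore implements Store {
      constructor(private cache: Map<string, Row[]>) {}
      async scan(collection: string) { return this.cache.get(collection) ?? []; }
    }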


Just an idea: perhaps all end devices should have at least some high-reliability storage. This would enable local applications that require high data durability and integrity.

Probably it'd require ECC RAM to prevent in-memory bitrot, plus multiple copies of blocks (or even multiple physical block devices) with strong checksums.

Perhaps this data should somehow "automagically" sync between all locally available devices, again protected with strong checksums at every step.

(This idea requires some refining.)
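For the "strong checksums at every step" part, a minimal sketch (assuming SHA-256 digests and in-memory replicas; a real design would sit on actual block devices):

    import { createHash } from "node:crypto";

    type Block = { data: Buffer; checksum: string };

    const sha256 = (data: Buffer) =>
      createHash("sha256").update(data).digest("hex");

    // Write the block to every replica along with its digest.
    function writeBlock(replicas: Block[][], data: Buffer): void {
      for (const r of replicas) r.push({ data, checksum: sha256(data) });
    }

    // On read, find a replica whose data still matches its checksum,
    // then use it to repair any copies that rotted.
    function readBlock(replicas: Block[][], i: number): Buffer {
      const good = replicas.map((r) => r[i])
        .find((b) => b && sha256(b.data) === b.checksum);
      if (!good) throw new Error("all replicas corrupt");
      for (const r of replicas)
        if (sha256(r[i].data) !== r[i].checksum) r[i] = good;
      return good.data;
    }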


We've thought a fair amount about this. Our approach is to use sqlite on-device. Think of it more as a partially replicated db than as a cache.

Then locally available devices can compare changelogs and sync only the delta.

No need for a checksum, since you can use monotonically increasing version numbers and CRDTs!
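Sketch of what that looks like (illustrative, assuming a last-writer-wins CRDT keyed by version number; not our actual schema):

    // Each row carries a monotonically increasing version; peers
    // exchange only rows newer than what the other side last saw.
    type Row = { id: string; value: string; version: number; peer: string };

    // The delta: everything the remote hasn't seen yet.
    function delta(log: Row[], remoteVersion: number): Row[] {
      return log.filter((r) => r.version > remoteVersion);
    }

    // LWW merge: higher version wins; peer id breaks ties deterministically,
    // so all replicas converge to the same state.
    function merge(local: Map<string, Row>, incoming: Row[]): void {
      for (const r of incoming) {
        const cur = local.get(r.id);
        if (!cur || r.version > cur.version ||
            (r.version === cur.version && r.peer > cur.peer)) {
          local.set(r.id, r);
        }
      }
    }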


> No need for a checksum, since you can use monotonically increasing version numbers and CRDTs!

How does that help against random bit flips?


I wonder if you could combine it with BitTorrent to get the distributed nature


You definitely can! We're using Kademlia for sync and discovery, which works quite well.
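The core of Kademlia is small enough to sketch (illustrative TypeScript; node and key IDs share one ID space, and "closeness" is XOR distance, which is what lets any peer route a lookup toward the nodes responsible for a key):

    function xorDistance(a: Uint8Array, b: Uint8Array): bigint {
      let d = 0n;
      for (let i = 0; i < a.length; i++) d = (d << 8n) | BigInt(a[i] ^ b[i]);
      return d;
    }

    // Pick the k known peers closest to a key; a lookup iteratively
    // queries these for even closer peers until it converges.
    function closest(peers: Uint8Array[], key: Uint8Array, k = 20): Uint8Array[] {
      return [...peers]
        .sort((p, q) => (xorDistance(p, key) < xorDistance(q, key) ? -1 : 1))
        .slice(0, k);
    }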


Extremely curious what kinds of data your app would produce such that it outgrows the memory available on a single client device. I have about 2 person-decades' worth of writing, art, and coding that occupies less than 10 GB (and could probably be made smaller).


I'm not talking about media content, but database records. PKM apps run relational queries in-process: the records must fit in working memory, and the query engine must traverse the working-set indexes to return answers "instantly".


Again, I guess I'm struggling to imagine what kind of database your app would need that doesn't fit in a 20 MB sqlite file. What are all these jillions of records? Are you talking about full-text indexing?


PKM apps are trees of strings! It's fast until it's not. Even if you can sync the global dataset to the device's storage, the query engine needs the data in process memory, and it still has to traverse it with device-level compute, not cloud compute, "instantly", i.e. without making the UI feel sluggish. If you feel otherwise, use Roam or Tana for a year, even in single-player mode. The entire category is bottlenecked on this scale problem. And now add team support, because you want to sell this to teams and make money, right? Designing for casual, personal-sized datasets is a viable architecture in very few apps. Google Maps is one shining counterpoint, because the content has a natural locality to it – you only need to sync content near where you are geographically!


I think you're misunderstanding the overall architecture here. Instead of syncing the whole tree of strings, the way you would generally represent a PKM with Yjs is to make each logical document a Yjs document (especially given the assumption that offline periods are short).

You could still build a server-side search index over those documents, which never needs to be sent to the client.
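Concretely, per-document sync with Yjs looks something like this (a minimal sketch; transport and persistence providers are omitted, and names like "body" are illustrative):

    import * as Y from "yjs";

    const doc = new Y.Doc();            // one PKM note = one Yjs document
    const body = doc.getText("body");   // collaborative note body
    const meta = doc.getMap("meta");    // title, tags, outgoing links...

    body.insert(0, "Ideas for the trip");
    meta.set("tags", ["travel"]);

    // Syncing is just exchanging binary updates, per document:
    const update = Y.encodeStateAsUpdate(doc);  // send to server or peer
    const replica = new Y.Doc();
    Y.applyUpdate(replica, update);             // apply on the other side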


This gives up relational query, i.e. your knowledge graph is no longer a graph. Notion, Roam, and Tana all require relational query. What real-world app category are you attempting to model that matches document structure? If the domain can be modeled as topic documents and the documents are small, like an individual Google Doc, sure, this set of constraints may be useful. But that does not match PKM!


It doesn’t give it up, it just moves that aspect to the server. The server can still build an index over the documents in whatever way it likes, perform expensive queries, and only send the results to the client.

An example that matches that document structure is Figma; each document is individually small enough to be synced with the client, document metadata is indexed on the server, and queries over documents take place on the server.
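A sketch of that split (hypothetical names; the point is that the cross-document index lives only on the server):

    // Clients sync whole (small) documents; the server maintains an
    // index over their metadata and answers relational queries with it.
    type DocMeta = { id: string; title: string; links: string[] };

    const index = new Map<string, DocMeta>();   // server-side only

    // Called whenever a synced document changes on the server.
    function onDocumentUpdate(meta: DocMeta): void {
      index.set(meta.id, meta);
    }

    // The server runs the relational query and returns only the
    // results; the index itself never ships to the client.
    function backlinksTo(target: string): DocMeta[] {
      return [...index.values()].filter((d) => d.links.includes(target));
    }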


This (moving relational ops to the server) does not match the definition of local-first provided in the article, and it gives up most of the value prop the article enumerates in its conclusion. I agree that Figma is a candidate to be implemented primarily as a document CRDT, same as the Google Doc example I already provided.


Well, the problem that an index over multiple documents solves is also not present in the application presented by the article. The plan for an individual trip (or even a lifetime of trips, for most people) is not going to exceed a size that can be handled and indexed on the client.


You're right, the approach totally works for PKM apps with less than, say, 1 person*year of data, which is literally the first sentence I wrote at the top of this thread. (But there are a lot of architectures that work with casual datasets! Like, store everything in a text file. Or fork Bitcoin and run a full node on-device, for that matter. What are we trying to accomplish here?)


Yes, but you described it as a fatal problem. My point is:

- this app didn't need fancy graph querying, so it didn't have to implement it.

- if it did, there’s a natural way to extend this approach to support it.


> And now add team support, because you want to sell this to teams and make money, right?

I mean, I don't, personally. I'm writing a couple small apps to scratch my own itches and I might sell them to anyone else who wants an individual copy for personal use.

Remember when you could just buy a copy of a program and use it on your own computer? And it would never get updated to remove functionality or break because some servers were shut down? That's the experience I'm seeking from local-first software.

I think designing for casual, personal-sized data is extremely easy if you give up the idea that every program needs to be some bloated Enterprise-Ready junkware.


Ok, then this constraint (casual, personal-sized data) should be the headline, as the entire architecture is downstream of it.


Sure, if you are so committed to the quantified self that you are producing hundreds of megabytes of valuable data every day, then maybe it's impractical for you to keep it all on devices with mere terabytes of local storage and only 32 GB of RAM.

You are Google's dream user. :)



