Hacker News new | past | comments | ask | show | jobs | submit login
Faster CRDTs (2021) (josephg.com)
326 points by bpierre 22 days ago | hide | past | favorite | 144 comments



> And why 32 entries? I ran this benchmark with a bunch of different bucket sizes and 32 worked well. I have no idea why that worked out to be the best.

If you were using 2-byte ints, this is likely because cache lines are 64 bytes, so 32 entries would be exactly one cache line, letting each cache line hold an entire bucket, thus reducing those expensive main memory transfers.


I really like the way Knuth benchmarks many of his later programs. He basically puts a counter for how many times something has to be loaded from memory. Would be curious to know if you could approximate how many times you have to clear cache lines, in the same way?


Yeah when benchmarking by batch sizes it's common to see huge jumps associated with the memory hierarchy:

- word size (64bits) - cache alingment fetch size (generally 64bytes as mentioned above) - OS page size (4-16kb) - L1 size (~80kb/core) - L2 (low megabyte number)


With lots of bizarre artifacts if you don’t force alignment.


What are some real world apps using CRDTs that have really good experiences?

IIRC Notion was supposed to be one of them but realistically taking notes with two people in Notion is almost unusable compared to Google Docs.


Thymer[1] uses CRDTs for everything. It's an IDE for tasks and planning. It's a multiplayer app, end-to-end encrypted and offline first, optionally self-hosted, and an entire workspace is a single graph. So CRDTs were the logical choice.

All operations in Thymer get reduced to a handful of CRDT transformations. It doesn't matter whether you are moving or copying text, changing "frontmatter" attributes, dragging cards, uploading files, or adding tags. It's all done with the same handful of CRDT operations. Although this was a lot of extra work up front (we're not using any libraries) the benefits make it totally worth it. When your application state is a single graph you can move text between pages, link between pages (with backlinks), have transclusions, and do all sorts of cool stuff without having to worry about synchronization. CRDTs guarantee that all clients converge to the same state. And because CRDTs are by their nature append-only you get point-in-time versioning for free! We did end up having to make a couple of compromises for performance, though. Version history is not available offline (too much data) and in some cases we resort to last-writer-wins conflict resolution. On balance I think CRDTs are very much worth it, especially if you design an app with CRDTs in mind from day one. I probably wouldn't use CRDTs if I had to retrofit multiplayer in a more conventional AJAX app. Mutations in CRDTs are first applied optimistically, and then when the authoritative sequence of events is determined all clients need to revert their state to the last shared state and then re-apply all events in the correct order (thereby guaranteeing that all clients end up in the same state). Sometimes your app might need to revert and re-apply days worth of changes if you've been offline for a while. This all happens behind the scenes and the user doesn't know how many tree transformations are happening in the background but I guess my point is that CRDTs affect the design of the entire application. Most apps that are popular today were designed back when CRDT transformations were not yet well understood.

[1] https://thymer.com (almost ready for beta)


This looks really cool, signed up for the beta!


Today Notion is a last-write-wins system with limited intention-preserving operations for list data (like block ordering). Text is last-write-wins, each block text or property is a last-write-wins register. We're working on a new CRDT format for block text.


Do you use last-write-wins using the received order of operations on the server or using a logical clock?

I believe that for Notion the collaboration use case is more important than async editing, so a CRDT makes more sense, but for text editing apps that favor asynchronous collaboration maybe explicit conflict resolution is more reasonable.


No clocks on the write side


Most of iCloud's services use CRDTs underneath I believe. That includes Notes, Reminders and possibly Photos as well. FoundationDB is some of the backend as well. I was told this by a drunken former Apple SRE in a bar :)


Indeed, Apple Notes has a reasonably sophisticated CRDT, including support for tables: https://github.com/dunhamsteve/notesutils/blob/master/notes....


The other day I saw a bug in it, when editing a Note that another person was also editing. We were editing separate paragraphs. My cursor kept resetting to the beginning of a word after each letter I typed (and the Swype-style keyboard had the same problem). My point is, even Apple Notes doesn't get it quite right.


one would hope that is what they are using it for, since they bought one of the best foundations for a db to exist: https://news.ycombinator.com/item?id=9418255


Linear: https://linear.app/

See their Local First Conf talk: https://youtu.be/VLgmjzERT08


Thinking about it more, I’ll provide another that people will not immediately appreciate.

Any and all networked video games with some form of rollback or correction. Best effort with a fallback to Rollback might actually be the “best” ie ergonomic experience for CRDTs that are most widely used.

Again, not academically a CRDT because technically game state is not perfectly replicated to every client. Each client only gets partial game state.

Additionally, game clients require low latency syncing, which could academically be considered “coordinating”. Even tho the client actually accepts and renders the input’s results locally probabilistically before any conflict resolution / rollback is returned to the client for correction.

Again people are likely going to be pedantic but with three post now, I’d like to hope, people might see the common theme:

The most popular, highly ergonomic, best implementations of CRDTs actually break the academic rules of CRDTs.

This is a relatively typical trap of an overly academic mental model. Most real world algorithms and data types are actually more creative than their academic “rulesets”. eg Timsort.

Especially if you’re building a product for actual use (as apposed to for review in a paper), then don’t fall into the over engineered/academic trap. Be creative, learn the academic rules, then intentionally break those rules, build what actually adds value and make it ergonomic rather than try to perfectly implement a concept that academics defined so stringently it’s only useful for other academics.


> The most popular, highly ergonomic, best implementations of CRDTs actually break the academic rules of CRDTs.

There's a popular, highly ergonomic implementation called Automerge[0] that would beg to disagree with you.

[0]: https://automerge.org/


FWIW, never heard of it before but I like it.

Meanwhile, a lot of people on these threads would say Automerge is not a CRDT. By definition Automerge.getConflicts means Automerge has conflicts and is not conflict-free.

I on the other hand prefer and applaud Automerge for breaking the rules! I hope it continues to see more success.


I think you are grossly misunderstanding what does and does not make a CRDT. The existence of conflicts is fine - they just need to be solved automatically and in an eventually consistent manner so that all clients come up into the same state once they’re all online. “Conflict-free” refers to there not being any unsolvable conflicts or ambiguity in solving those conflicts. That’s why git isn’t but automerge is:

> The values of properties in a map in automerge can be conflicted if there are concurrent "put" operations to the same key. Automerge chooses one value arbitrarily (but deterministically, any two nodes who have the same set of changes will choose the same value) from the set of conflicting values to present as the value of the key.

> Sometimes you may want to examine these conflicts, in this case you can use getConflicts to get the conflicts for the key.

Basically it’s saying may be you want to look at other “older” values that may have been observed even though the conflict was resolved (eg showing to the user previous copies of the values).


https://automerge.org/docs/documents/conflicts/

"The only case Automerge cannot handle automatically, because there is no well-defined resolution, is when users concurrently update the same property in the same object (or, similarly, the same index in the same list). In this case, Automerge picks one of the concurrently written values as the "winner", and it ensures that this winner is the same on all nodes"

"The object returned by getConflicts contains the conflicting values, both the "winner" and any "losers". You might use the information in the conflicts object to show the conflict in the user interface."

To me, this seems identical to how git works. Specifically git fetch (does automatic resolution storing all of the changes), vs git merge (showing the conflicts in the user interface).


Automerge is "conflict-free" by the definition of "conflict-free" used in CRDTs. i.e. conflicts are automatically resolved in the same way every time by the algorithm. The .getConflicts allows you a neat interface to inspect a concurrent change to the same value. There is no conflict, no failure to merge. It automatically merges as usual. If you choose to act on the information returned by .getConflicts then that is a new change, it does not alter the past.


If you fetch from a repo with a slightly different history for a named branch than what you have locally, git fetch will fail. Thus git fetch operations are not part of a CRDT.

Similarly, the order in which you synchronize (fetch or merge) from remote sources is order dependent and does not eventually converge to the same result (e.g. the commit hashes will diverge for a simple reason that the current time is part of the commit hash).

This is even more true for git merge which does not even attempt automatic conflict resolution if the same file is edited multiple times.

Automatic conflict resolution is a key requirement of CRDTs and another is that when nodes are brought online they are guaranteed to converge to the exact same state. This is demonstrably not true for git & trying to handwave it away doesn't change anything.

Not sure why you're trying to shoehorn git into the CRDT definition when it's its own thing.

> What I'm describing here is very similar to decentralized version control (e.g., git and mercurial). In git, you commit locally and eventually merge and push your changes into a remote repository. Unlike git, CRDTs don't have merge conflicts. Conflict resolution is handled by the CRDT algorithm itself. Although not exactly a CRDT, git will be a useful model to refer back to since it is something already familiar to most developers and incorporates similar concepts.

https://vlcn.io/blog/intro-to-crdts


Do you care that you're wrong, or... ?

Pedantry doesn't enter into the discussion -- the definitions at play here aren't subject to any kind of subjective interpretation.


I rarely care about such trivial things as being wrong.

I do care about being intellectually honest. I haven’t once said that you or anyone else in this thread is wrong. If you really read everything Ive written, you’ll see I’ve always offered multiple perspectives.

In my words, your perspective has always been accurately shared. I care about that.

Meanwhile, our differences are philosophical. I’m not interested in your limited perspective. Just like you’re not interested in my broader perspective.

I’m not asking you to believe or say anything. You don’t like the additional perspective I’ve shared, ok, move on.


> What are some real world apps using CRDTs that have really good experiences?

Google Wave and Google Doc's use CRDTs.

There's a fun blog post or two on how an academic paper completely botched it's benchmarks on the underlying algorithms by putting together a piss-poor implementation of Google Wave's algorithm and proceeding to claim the algorithm itself sucked even though the piss-poor implementation was such a mess it outperformed everything from every angle..



Google docs also uses OT not crdt. OT is easier to develop by an order of magnitude and even offline mode is possible to implement in a good enough way.


Nice blog post, but I really wish the author knew how to use apostrophes. (Or the difference between its and it's, if it's not the typographical character he doesn't like.) It's just so jarring to be reading along and be tripped up all the time with grammatical mistakes.


Google docs was originally ot based as well. I'm not sure about the current state of it.


The code editor Zed uses them for collaboration. It was discussed in a Developer Voices episode - https://youtu.be/fV4aPy1bmY0?si=0O6fCmK8NziFf9Hi


Muse. I love how I can write something on my iPad with the pencil and see the strokes real time on the mac app.

I then switch to my mac to drag images and screenshots and see it synced to the ipad in real time. Then i annotate w the apple pencil.

One of the best “whiteboarding” tools I’ve used.


Figma uses a CRDT-ish approach. I'm working on one at the moment, but it'll be a while before it's publicly available. At the moment, I'm using Loro: https://loro.dev


Not a public high bandwidth example, but a consulting team I was on implemented CRDTs in a private mobile app used for a compliance documentation use case. More of a precaution, the actual number of end-users editing a record at any given time is expected to be 1, but while reviewing their system, we found that the clients system can support multiple and sometimes their employees don't communicate/follow procedures. We first tried to convince them to make changes on their end, but their development team straight up refused.


Saga: https://saga.so is a collaborative workspace with a strong focus on speed/simplicity.


LegendKeeper[1] seems like a pretty slick worldbuilding app that's local-first with a CRDT sync layer. They have a free demo, but I haven't used it enough to fully vouch for it being a really good experience.

[1] https://www.legendkeeper.com/


I'm the creator of MonsterWriter. MonsterWriter uses a very simple CRDT for JSON payload (references index, export config). For the collaboration on the actual text it uses OT. When I started to implement this 7 years ago. CRDT weren't a thing.


It's not a CRDT, it's LWW at the cell level, transmitted pretty much a typing speed so conflicts are low


I think a lot of things. CRDTs predate the term "CRDT", they seem to be a mathematical construct/pattern found in the wild, that smart people codified and formalised.

IME CRDTs just lay out the rules you need to follow for syncing eventually consistent nodes properly, ie without weird surprises or data loss.

As for concrete examples:

- in the famous Amazon Dynamo Whitepaper, they perfectly describe a CRDT in section 4.4. (https://www.allthingsdistributed.com/files/amazon-dynamo-sos...), though I don't think this made it to modern Dynamo DB

- I believe CouchDB follows all the CRDT laws

- TomTom uses them

I think the TL:DR is if you're going to sync two data stores that can be written to without coordination, learn the CRDT laws and make sure you follow them. They're fairly simple mathematical properties at the end of the day.


Doesn’t google docs use crdt?


I believe it uses Operational Transforms


OT vs CRDT: https://stackoverflow.com/questions/26694359/differences-bet...

In my opinion, the problem with CRDTs is they don't really have a great concept of access control and enforcing rules. Since everyone's a peer, you can only do things like "last write wins" etc.

In a video game, for instance, I could disconnect and then spam the peers with tons of updates, claiming they were all done in the past, and they'd have to accept them all.

PouchDB and IPFS lets nodes sort of make decisions about who to replicate from, but it's rudimentary: https://pouchdb.com/guides/replication.html

Hypercore is much better because it is an append-only log where every chunk is signed: https://docs.pears.com/how-tos/replicate-and-persist-with-hy...

However, it's still not enough, because hypercore only supports a single-writer. There has been some work on hypercore multiwriter in the last few years (which I followed with interest) but none of it is Byzantine fault-tolerant, and it was ultimately dropped. In fact, if you leak the key, anyone you can probably corrupt a hypercore rather easily and prevent it from making forward progress.

This is why at the end of the day, you still go back to the ideas of preventing forks. The problem is that blockchains do it via global consensus, which is hugely wasteful and overkill. But you do need something to run essentially "smart contracts", i.e. software that runs "in the cloud" which is infeasible to tamper with. Like Internet Computer's canisters for example... https://internetcomputer.org/docs/current/concepts/canisters...

Ultimately, these architectures will end up being far more reliable and resilient than the Web, because the Web is based around hosting and location ("where" something is) rather than content ("what" something is). IPFS and cids are a step towards that future, but other projects like ICP and Freenet are leapfrogging them... they have smart contracts based around WebAssembly.

Incidentally, I interviewed Ian Clarke about the original freenet, and he recently announced the new freenet, which is based around webassembly, and the swarming is done using "small-world topology" instead of kademlia DHT: https://www.youtube.com/watch?v=yBtyNIqZios

(Here is my interview with him a couple years ago: https://www.youtube.com/watch?v=JWrRqUkJpMQ)


Author here. I had a chat with PHV about this (a principle author of the Local First Software paper).

He sees the access control problem like scribbling on your copy of the US Constitution. If you print out the constitution, you can make whatever marks on the page that you want. But nobody else is obliged to copy your changes.

Its the same as Git in that sense. I can fork a repository and submit pull requests. But I can only push my changes upstream if I have write access to the repository.

If you need a single authoritative "truth" - like a video game - then you can still do that by having one peer act as a hub and it can decide which edits to merge & broadcast. If a video game used a CRDT, you might still want a server - and that server might still want to reject any incoming edits that are older than a few seconds.

But I generally agree - I don't think CRDTs add a lot of value in multiplayer video games. CRDTs make sense when a local user wants to edit some data, and those changes are useful locally even if you never share them with other peers. So, creative work is the use case I think about the most - like video editing, writing, design, programming, and so on.

In the video game industry, I think the best use for CRDTs would be to make Unity / Unreal / etc collaborative spaces.


> In the video game industry, I think the best use for CRDTs would be to make Unity / Unreal / etc collaborative spaces.

Unreal has a collaborative implementation - https://www.youtube.com/watch?v=MPIpOdNmNGE It's demo'ed here for Virtual Production. I believe that it's called the "concert framework" in the engine itself. I'm unsure what it's implemented based on though.


Ah, thank you.


> What are some real world apps using CRDTs that have really good experiences?

Since people loved my other hot take :)

I suppose "good experience" can be subjective, but Blockchains are also in practice a distributed trust-less CRDT that is used by a LOT of people.

- With redundant validation, across more than 2 nodes, used to solve for conflict resolution.

- and proof of work/stake used to solve for distributed trust.

Again, academically, maybe not a CRDT because they will not "always converge". Hard forks can happen. However, pragmatically, Blockchains are a CRDT.


Author here

> Again, academically, maybe not a CRDT because they will not "always converge". Hard forks can happen. However, pragmatically, Blockchains are a CRDT.

The term "CRDT" is an academic term. So the academic definition matters. They're not "pragmatically" a CRDT because that's simply not how we decide what is and isn't a CRDT.

The proper definition is any algorithm, replicated across multiple computers in a network with these 3 properties (at least according to wikipedia):

1. The application can update any replica independently, concurrently and without coordinating with other replicas. 2. An algorithm (itself part of the data type) automatically resolves any inconsistencies that might occur. 3. Although replicas may have different state at any particular point in time, they are guaranteed to eventually converge.

Blockchains meet criteria 2 and 3. But in my opinion, property 1 is too much of a stretch.

In a blockchain, you can't meaningfully update your local replica independently and without coordination. Sure - you can add local transactions. But blockchains care about global consensus, not the local state. A transaction in a blockchain isn't generally considered to have happened until its been included in a block & witnessed throughout the network.

This is in stark contrast to, say, Apple Notes (which uses a CRDT internally). In Notes, I can edit a note on my phone and the note changes. There's no consensus required before my new edits are considered to be legitimate.


Help me understand this more clearly. At what level of abstraction does the CRDT definition apply? At the data structure level, the application level, the end user level?

We can, for instance, use a simple directed acyclic graph of immutable data types as a data structure that can handle asynchronous updates and result in eventual consistency across all computers.

We have a node in this DAG that says there’s a meeting at 1pm. Two people update this meeting time. One updates it to start at 2pm, and the other updates it to start at 3pm, and these changes happen simultaneously.

The data structure now has a tree with the original parent node (1pm), and two new child nodes with new times (2pm and 3pm). All computers are updated cleanly and contain the same updated data structure. All conflicts handled. Zero conflicts exist at the data structure level.

At the application level, the app now shows there’s a meeting “at 2pm or 3pm”. And all apps on all computers will reflect this same, consistent information.

But to the people, there is a conflict in the meaning of this data. Is the meeting at 2pm or 3pm? This is somewhat analogous to a git merge conflict, where the “what to do about it” depends on the meaning to a human.

Like I get the impression that a lot of people don’t consider the “meeting at 2pm or 3pm” to be a CRDT because it “doesn’t converge to a single value”.

But from a physics perspective, there’s some binary data, some changes happen, and the new binary data is now replicated to all end user devices in the same state.


The CRDT definition applies at the data structure level.

Let’s model your meeting example as a set. Any local changes would atomically remove the current meeting time and add a new one.

Multiple concurrent edits could result in two meeting times in the set. In terms of user experience, that’s a conflict — but from the CRDT’s perspective, it’s still a single consistent state upon which all peers eventually agree.

Or, put slightly differently: the CRDT’s job is purely to make sure the peers can always agree on the same state. Any semantic conflicts within that state are an application-level concern.

Slightly shameless plug, but I wrote a post showing how a CRDT fits into a toy application if you want to see how the layers fit together: https://jakelazaroff.com/words/building-a-collaborative-pixe...


> In terms of user experience, that’s a conflict — but from the CRDT’s perspective, it’s still a single consistent state upon which all peers eventually agree.

So then Git uses a CRDT internally? Merge conflicts are user experience "semantic conflicts" within the Git application.

The internal state in the Git commit tree contain all commits consistent across the synced replicas. Without further inputs, all the repos would render the exact same user experience conflict.

"The data type i.e. Git's internal state is conflict free - but there are times the user needs to correct the semantic state."

"The Git application wants to guarantee semantic state, so unless the semantic conflicts are resolved, the Git internal state i.e. 'data type' will continue to be conflict-free using the 'rollback technique' to reconcile conflicts with the local copy's 'data type'."

Roughly: git fetch would be the conflict-free update of the internal data type. While git merge would be the application level, semantic conflict resolution.


> So then Git uses a CRDT internally?

No. Factually, inarguably, no.

If you want to argue against this claim, OK, have fun, but it's not interesting to engage, because there's no interesting conversation to be had. However, if you want to understand why this claim is true, great, ask whatever questions you want to ask, I'm happy to help you understand.


I don’t know enough about Git’s internals to be able to say. My gut says no; the (state-based) CRDT “rules” are that merges must be commutative, associative and idempotent, and I think if you tried to merge conflicting changes on multiple remotes in different orders, you would get different internal states. Not sure though!


Yeah I think you’re correct: Git’s internal data structure is a grow-only set CRDT (well, graph) of commit objects. Grow only sets of hashed contents are one of the simplest CRDTs out there - the network just implements set union.

If git only had commit and fetch, it would be a crdt (though not a very useful one). But git also needs to merge commits / branches together. And it does that via diff-match-patch, which doesn’t pass the crdt rules. (Since it’s not idempotent, associative, automatic and in many cases deterministic).


The content-addressed database for commit objects (but not their names): possibly, yes. It has trivial merge semantics, just keep everything and deduplicate.

But that's equally true of every content-addressed system. And entirely untrue for every named object in git, which are a critical component of git being git, instead of an opaque blobstore with zero semantics.


Yes I agree. I still think its wrong to call Git a CRDT because git is a lot more than its content-addressable system. All that other stuff on top - you know, the parts you use to track and merge branches? That stuff isn't a CRDT.

Maybe its like asking if Wikipedia is a relational database. I assume wikipedia is implemented on top of a relational database. But the resulting wikipedia website? No, not a relational database.


Thank you, I appreciate you taking the time to agree and offer additional nuance. I agree with your nuance. That is the nuance that makes Git not a CRDT.

I can also appreciate academics need a time and place to be academic about their definitions (e.g. papers, and in-depth answers).

> What are some real world apps using CRDTs that have really good experiences?

I also feel, academics could do more to appreciate questions like this one are non-academic. The question above could just as easily be interpreted by many as "I'm building a multi-player application. I want it to be a really awesome experience but I don't know if I'll get the concurrency right. I've heard concurrency is hard. I've heard good things about CRDTs tho, should I be using them?".

To which we should absolutely be mentioning truly pure CRDT based applications/frameworks, but not mentioning similar but nuancedly different applications/frameworks, hurts 1. any less educated readers' ability to instinctively understand what a CRDT is; and 2. how it might help them create the next great multi-player application. It also hurts someone that has spent the last N nights trying desperately to make their pure CRDT application/framework fun and valuable but its just not happening. Allowing them to loosen up and be a bit more creative, helping to empower their application/framework because the universe of "asymptotically approaching a CRDT; but will definitely not have concurrency bugs/issues" (which doesn't have a nice academic name + explicit ruleset yet) is so much larger "than strictly a CRDT".

Meanwhile, I could if this was a paper or lecture have said "... are examples asymptotically approaching a CRDT; So don't really worry just because something is not strictly a CRDT doesn't mean you're gonna have bugs in your concurrency implementation". But "in passing" on a comment thread that should have been 30 seconds of all our time, who wants to specifically write all of that nuance every comment? In fairness I even tried to differentiate academic vs non-academic.

If you're still reading this thread (especially anyone that has been downvoting me) I hope you can appreciate this context.


Yeah the question asking about real world use cases is a non-academic question. But that still doesn't mean git is a crdt.

Maybe its like someone asking "What are some other uses for javascript?" and someone says "Unity games!". There might be a way to use javascript when making unity games, but most unity games aren't made with any javascript. "Oh well C# is sort of close to javascript though?" - Yes, but its still not javascript. I don't think calling C# "mostly javascript" is a useful or helpful idea. To the expert its wrong, and for the novice its confusing.

We don't draw a line from "actually really javascript" to haskell or something and say C# is 80% javascript. "Javascript" as a term just means javascript. Not the space of things similar to javascript.

Its the same with CRDTs. "CRDT" doesn't mean "eventually consistent distributed data store". It just refers to the 'pure' thing. You know what else meets the definition of a CRDT, but is very dissimilar from a database? An integer, sent between peers, and merged using the MAX() function. You could also argue that the unit type is a (degenerate) CRDT. Just like how a straight line is sort of a triangle with a side length of 0.

Thats how I see the claim that "Git is a CRDT - well sort of". To experts, its wrong. And to everyone else its kinda misleading. Git isn't a CRDT in the same way C# isn't javascript. (And a web browser, V8, a javascript program, nodejs - all of these things are also not javascript.)

Now, Git's content-addressable blob store is a CRDT, so it might be fair to say git is "using a CRDT internally". But I hope you can see how that claim is kinda confused too. Git's branches and commit merging don't obey the CRDT merge rules. Because git has that stuff added, it makes it no longer a CRDT. Its the same as if you add a lump to the side of a triangle, its not a triangle any more.


I like the max integer example because it’s clearly a CRDT and also not particularly useful on its own. I get the impression that people think CRDTs are some sort of magical thing that gives you bug-free sync, but it’s pretty easy to explain why naively using a single max integer to implement a distributed counter would be buggy. CRDTs are just tools, and like any tool they can be misused!


> Sure - you can add local transactions. But ...

> A transaction in a blockchain isn't generally considered to have happened until it’s been included in a block & witnessed throughout the network.

For the sake of argument:

- It’s not meaningful for Apple Notes’ local edits to only be local. Sure they will render, but if a tree falls and no one hears is, did it make a sound? So let’s ignore local edits.

- A sentence in Apple Notes isn’t generally considered to have happened until it’s been rendered in every client and users can see it if they look.

Is that any less fair of a statement?? Is this game of nuanced wording selection at all meaningful tho to answer the question “Is Apple Notes a popular application of CRDTs?”

Meanwhile, keep in mind in a blockchain entire sets of nodes could be firewalled from each other for days, each subsection performing consensus and creating blocks. Days later the firewall could collapse and the nodes will need to automatically reconcile. If I abstract away the internal replicas and say the two “offline parts” of the CRDT need to keep “local” state and reconcile once both/all parts come back online. Would this abstraction help meet the strict definition?

I know you probably still strongly disagree - but coming from a non-academic background these word games around CRDTs hurt rather than help people’s understanding and use of them.

Instead we should be asking, how can we improve Git as a CRDT (where perfect "conflict-free" is the enemy of good enough) to better automerge more conflicts? How can we improve the general theory of CRDTs based on what we learned from implementations in “video game A”, “blockchain Z”, and “web application framework J”?


> - It’s not meaningful for Apple Notes’ local edits to only be local. Sure they will render, but if a tree falls and no one hears is, did it make a sound? So let’s ignore local edits.

Why would we ignore local edits? Notes is a note taking application. Not a communications platform. Local edits are the main point of the application.

> - A sentence in Apple Notes isn’t generally considered to have happened until it’s been rendered in every client and users can see it if they look

Of course it has! If I copy my favorite crepe recipe into a note on my phone, then open it up later, I don't care if that note has synced to my laptop. (Or anyone else I might be collaborating with). Local edits are meaningful and important.

The same is true of git. I use git locally on lots of projects that I don't even have any collaborators on, so I can track my work.

From the database standpoint, these are AP systems. They stay available (to reads and writes) in the face of network partitions. The blockchain is .. well, the blockchain is weird. But I think its closest to a CP system. You can't make transactions when you're disconnected from the chain. (If you could, you would risk double-spending).

I consider blockchains closer to traditional databases than eventually-consistent datastores built on CRDTs, because they have coordination and traditional atomic transaction semantics.


I would say blockchains fit this definition. What I get away from that is that this definition is not restrictive enough.

Similarly, a list of operations with Lamport timestamps fits this definition. However, just like the blockchain, it makes no effort to preserve user intent.

To go to an extreme, a merge operation which replaces the current content with the empty string, is a CRDT per this definition.


Blockchains absolutely do not satisfy the definition of a CRDT.


How would you change the definition so that they do not?


I don't need to change the definition. The existing (canonical, well-understood, etc. etc.) definition of a CRDT clearly excludes blockchain data structures.


My understanding is that blockchains agree upon a shared state via consensus (i.e. coordinating with other nodes) rather than by a deterministic algorithm.


It is deterministic: the local updates must be grouped into a block linking to the longest-known chain, and the block is broadcast; upon receiving a block, all nodes must locally revert operations that are no longer part of the deepest chain, and apply operations that are.

It is a bit like last-writer-wins, except it uses the block history count instead of timestamps. Clearly they lose a lot of user intents.

I feel like CRDTs are associated with intent preservation. Perhaps rule 2 ought to include that inconsistency resolution ought to not discard operations. That would also exclude LWW and the “empty string” merge as CRDTs.


IMO semantic “intent preservation” is an application-level concern that can affect which CRDTs are chosen, but it’s irrespective of whether the underlying data structure is a CRDT. The rules are that state merges must be commutative, associative and idempotent; if blockchains fulfill all three criteria, I don’t see why they wouldn’t count.


> The rules are that state merges must be commutative, associative and idempotent;

Yes!

> if blockchains fulfill all three criteria, I don’t see why they wouldn’t count.

Blockchains definitely do not fulfill all three criteria!

> IMO semantic “intent preservation” is an application-level concern that can affect which CRDTs are chosen, but it’s irrespective of whether the underlying data structure is a CRDT.

You're correct!

Whether or not a data structure is a CRDT is entirely a function of whether or not its Merge operation is associative, commutative, and idempotent. Application-level concerns, including (but not limited to) semantic intent preservation, are totally irrelevant.


Okay, back to my original skeptical position lol. I don’t understand why CRDTs seem to attract so many people claiming that their favorite computing concepts count!


I'm not sure what your original skeptical position is, so shrug?


The "blockchain" notion of determinism you're referring to here, is far weaker than the notion of determinism used in the distsys context.

This discrepancy was pretty well documented in the Jepsen analysis of the Radix DLT -- https://jepsen.io/analyses/radix-dlt-1.0-beta.35.1

tl;dr: blockchains are not deterministic in the distsys sense


I don’t know Radix, so perhaps they use a different merge algorithm than the standard one. The article indicates, though, that nondeterminism was a bug that the company fixed. It does sound like it relies on validators, so perhaps they don’t have open membership, thus don’t allow updates from arbitrary nodes without going through those validators?

Either way, the standard blockchain algorithm doesn’t behave like that.

To be fair, I don’t have a stake in this beyond wanting unambiguous definitions.


The cat runs, eats and has fur. The cat is a dog.


"Plato had defined Man as an animal, biped and featherless, and was applauded. Diogenes plucked a fowl and brought it into the lecture room with the words, "Here is Plato's man."" (Diogenes Laertius, Lives of Eminent Philosophers)


I think blockchains are overkill because they're trying to get a global consensus about every transaction in the world, which is very expensive (with electricity, stake etc.). They're trying to solve the additional problem of preventing "forking" of a history. Whereas if you allow forks then CRDTs are more lightweight.

Internet Computer canisters, Merkle DAGs (like IOTA), Hashgraphs and Holochains are probably a better architecture than Blockchains.

Having said that, I think Merkle Trees are the future, for minimizing the amount of sync. They let you see exactly what changed, and so on.

https://joelgustafson.com/posts/2023-05-04/merklizing-the-ke...

https://github.com/canvasxyz/okra-js

There's already great software done by Dat (later called Hypercore, now called Holepunch) that has an append-only log based on the SLEEP protocol. That's essentially all you need, in my opinion. They've built everything from key-value databases (hyperbee) to full-scale file systems (hyperdrive) on top of it. Beaker browser has been built on top of it.

I mean, merging state-based CRDTs is cool, but you may as well just use operation-based CRDTs with something like Hypercore or the older PouchDB.


If anyone is familiar, I would like to ask:

How does Okra compare to something like Quadrable: https://github.com/hoytech/quadrable-solidity


I mean, two people contending over one piece of state is never going to be a good experience. Thankfully this is mostly not useful in the first place.


In practice, Git

Some CRDT purists would say "its not perfectly conflict free so its not a CRDT".

Sure[0], but for the rest of us that are pragmatic about best effort conflict resolution Git is likely the most successful CRDT application.

[0] https://en.wikipedia.org/wiki/No_true_Scotsman


This comes up every time but there’s three criterion for CRDTs and git fails 2 of them. Even ignoring the requirement for automatic conflict resolution (which git can’t meet since automatic resolution fails as a matter of course) and ignoring that the conflict resolution algorithm has to be part of the data type (it’s not), it fails the requirement that separate different copies must eventually converge when brought online but that’s demonstrably false as people may use different conflict resolution mechanisms AND the commit graph of a resolved conflict may itself then be different resulting in different histories from the process of brining git online.

This is because the commit graph itself isn’t CRDT. If I commit A and then B but someone’s clone only has B applied, you can’t synchronize even automatically; different git histories don’t resolve in any way automatically at all and once you pick a solution manually your copy will not have the same history as anyone else that tries to redo your operations.

No true Scotsman doesn’t apply here because there is a very precise definition of what makes a CRDT that is a clear delineating line.


> 1. The application can update any replica independently, concurrently and without coordinating with other replicas.

> 2. An algorithm (itself part of the data type) automatically resolves any inconsistencies that might occur.

> 3. Although replicas may have different state at any particular point in time, they are guaranteed to eventually converge.

[0]

Again, in theory it fails 2 and 3. However, in practice 3 is a normal part of working with git in a team. Barring a hard fork in git - which is equivalent to a deep copy of a CRDT. Like any deep copy of a data type, a CRDT's deep copy can be used in non-conformant manners (forks are VCS specific jargon for a CRDT deep copy; or shallow copy sometimes).

> If I commit A and then B but someone’s clone only has B applied, you can’t synchronize even automatically; different git histories don’t resolve in any way automatically

Maybe I don't understand your point specifically, but this example seems entirely solved by --rebase. In practice --rebase is typical, and best described as "do your best to automatically resolve histories; I'll handle any of the complex conflicts".

All that said, I already agreed: "academically Git is not a CRDT". However, and I'm happy to disagree with you, in practice Git is the most popular CRDT.

[0] https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...


given how easy it is to run into merge conflicts doing normal things with git, I can't say that I'd agree that in practice git is a CRDT either.


CRDT literally means Conflict-free Replicated Data Type. Expecting CRDTs to be conflict-free isn't purism, it's simple validation. Git is, inarguably, not a CRDT.


There are CDTS that have "multiple versions", which look an awful lot like conflicts to me, ie, the Multi-Value Register in this paper:

https://inria.hal.science/inria-00555588/


Multi-value registers are CRDTs for sure. Conflict-free doesn't mean that values can't have concurrent histories (or, as you say, "multiple versions") -- it means that the merge operation always succeeds.


What's the definition of a conflict, then? Equations welcome.


Feel free to read up on CRDTs; I'm confident this will answer your question.

The short answer is, roughly, that a conflict is a discrepancy in state which cannot be mechanically resolved.


I've read a few CRDT papers. Perhaps you could name a specific one.


> CRDT literally means Conflict-free Replicated Data Type.

Git could be "conflict-free" with a simple `rand(ours, theirs)`.

It would be useless, but technically "conflict-free". Is the addition or removal of that rand function really, pragmatically the difference in the answer to "what is a CRDT?"


Adding extra rules on top of git to try to turn it into a CRDT doesn't make git one, even if you succeed (and rand would not succeed). You can do that with a pencil and paper, but that doesn't make paper a CRDT.


This is definitely not true.

Are you open to understanding why this isn't true? If so, I'm happy to help you come to the correct understanding; please ask whatever questions are necessary. Otherwise, then, well, okay!


`rand` wouldn’t work because all peers must reach the same state without coordination.


I really wish someone would make a git-like tool on top of CRDTs. I want conflicts when merging commits like git does, but I also want the crdt features - like better cherry-picking and cleaner merging when merging several branches together. And of course, live pair programming.

CRDTs store a superset of the information git stores - so we should be able to use that information to emit git style conflicts when we want. But I think nobody has implemented that yet. (Pijul probably comes closest.)


I suspect a major reason why CRDTs haven't been a clear dominator in VCSes is that the "conflict free" decision is not necessarily the correct decision. It's merely a consistent one.

When editing is relatively "live", those are small enough that they're probably also correct. But adding your month of changes to a dozen other people's month of changes doesn't mean it's going to build correctly (or even look sane) when you change the same file. Manually seeing the issue and fixing it gives you a chance to correct that, at the time it's relevant, rather than "it all changed, good luck".

---

If you're interested in distributed-VCS systems that have some different semantics than git, https://pijul.org/ might be interesting. And Jujutsu is... complicated, as it abstracts across a few and brings in a few things from multiple VCSes, but it's mostly git-like afaict: https://github.com/martinvonz/jj

No doubt there are others too (if anyone has favorites, please post! I'm curious too)


Fossil (https://fossil-scm.org) is actually a proper grow-only set from what I can tell: merge conflicts upon synchronization are encoded in the DAG as forks of the history at the conflicting points in time. If no action is taken after a fork, all clones will eventually see the exact same view of the history.

The tooling will normally bail with a "would fork" message if a `fossil update` (and, hence, local resolution of those conflicts) wasn't undertaken before pushing, but this is not at all necessary to preserve useful operation.


> the "conflict free" decision is not necessarily the correct decision. It's merely a consistent one.

Yep. But there's no reason you couldn't build a system that supports both approaches - conflict-free when live pair programming, but conflicts are still generated when merging long lived branches. As I say, text CRDTs store all the data needed to do this. Just, nobody (as far as I know) has built that.


CRDTs seem to give the best experience when they correctly model the "intent" of changes.

But a diff between two different states of raw text can't convey the intent of a code change (beyond very simple changes).

This is why I think CRDTs haven't caught on for VCSes and I'm not sure they _could_ without some kind of structured editing.


I remember jj, pijul or another CRDT-git website offering a javascript demo (I can't find it now). I tried that user 'A' removes a line, user 'B' modifies that line, and it converges to that line becoming the modifications. I don't think that automatic conflict resolving is the future.


Yeah, I don't think so either. Conflicts are good - the knowledge that someone else also did something [here] is useful, and external context can completely change what the correct next step is. The info needed will never be fully encoded into the text.

That said, a clearer model for why it conflicts, and how things got there, can definitely help understand what happened and therefore what to do next. Git isn't great there, you're just given the final conflict pair.


I've been researching CRDTs for a reference manager that I'm building (https://getcahier.com), and thought that some hybrid of automatic and user-operated conflict resolution would be ideal, as you described. But current efforts are mostly directed to automatic resolution.


I would want the git-like tool to have semantic diffs, and so have much better conflict resolution as a result; not CRDTs and more obtuse & confused conflicts than it's already capable of.


(hi seph, hope all’s well) — i did exactly this with Redwood and showcased it multiple times during Braid meetings. alas, nobody was very interested in trying to push it forward


Ah sorry - I don’t know why redwood slips my mind! Please correct me if I’m wrong but I thought you were using redwood as a git backend, rather than replacing git’s internals with a text crdt?


CRDTs are powerful, but it's unfortunate that they leave behind a trail of historical operations (or elements), both in their ops- or state-based variants. Even with compression, it's still a downside that makes me concerned about adopting them.

Even so, the discussion surrounding them made me excited by the possibility of implementing conflict-free (or fine-grained conflict resolution) algorithms over file-based storage providers (Dropbox, Syncthing, etc.).


Author here. I've had this conversation with people a lot. And people in the CRDT world also talk about it a lot. But in practice, at least with text editing, the overhead is so tiny that I can't imagine it ever coming up in practice. Diamond types - my post-CRDT project - does (by default) grow without bound over time. But the overhead is usually less than 1 byte for every character ever typed. If I turn on LZ4 compression on the stored text, documents edited with diamond types are often smaller than the resulting document state. Even though we store the entire editing history!

I know a bunch of ways to solve this technically. But I'm just not convinced its a real problem in most systems.

(I have heard of it being a problem for someone using yjs for a 3d modelling tool. While dragging objects, they created persistent edits with each pixel movement of the mouse. But I think its smarter to use ephemeral edits for stuff like that - which aren't supported by most crdt libraries.)

Git also suffers from this problem, by the way. Repositories only grow over time. And they grow way faster than they would if modern CRDT libraries were used instead. But nobody seems bothered by it. (Yes, you can do a shallow clone in git. But almost nobody does. And you could also do that with CRDTs as well if you want!)


I don't think the concern was necessarily about overhead, but about sensitive data. This is a problem in Git too: people make the mistake of reverting a commit with sensitive values and think it's gone, but the history is still out there.

Edit: or maybe that was the concern, but this other concern exists too :)


If thats what you're worried about, use Yjs. Yjs doesn't store deleted characters in documents. Unless you actively make yjs snapshots, once a character in a yjs document is deleted, its gone.


Huh! I wonder how that works if the deletion isn't synced with other clients yet.


It stores the deleted ranges, but not the chars of text that compose those ranges.

So from a privacy perspective, you see the metadata of edit history (the DAG of character positions edited), but are blind to the text.


Ah! That makes sense, thanks - today I learned!

If you’re not building something fully decentralized, you may be able to loosen some of the constraints CRDTs require. As an example, if you can guarantee that all clients have received changes later than X date, you can safely drop any operations from before that date.


Could you also make it fully decentralized but require clients to come online within a deadline (1 day, week) or risk losing their local changes? This would also allowing trimming history but with loss of some functionality to sync.


Yes, you can go that direction as well, although in a decentralized architecture there’s no shared notion of “coming online”.

For example:

1. you have four peers collaborating on a document

2. for some extended period peer A only communicates with peer B and peer C only communicates with peer D (and vice versa)

3. the peers truncate the operations at some point within that period

4. each pair of peers now has a document irreconcilable with the other even though all peers “came online” recently


The full op-log plus deterministic merging is a great fit for immutable block storage, which can have other security, performance, and cost benefits. I'm building Fireproof[1] to take advantage of recent research in this area. An additional benefit to content addressing the immutable data is that each operation resolves to a cryptographically guaranteed proof (or diff), enforcing causal consistency and allowing stable references to be made to snapshots. This means your interactive, offline-capable, losslessly-merging database can run on the edge or in the browser, but still have the integrity you'd have looked to a centralized database or the blockchain for in the past. (Eg you can drop a snapshot CID into a PDF for signature, or a smart contract, and remove all ambiguity about the referenced state.

[1] https://github.com/fireproof-storage/fireproof


how do such immutable systems deal with redaction eg for GDPR delete requests? Do you need to repack the whole history, and break the existing signature chain?


Yes that’s what you’d do. Makes sense to partition eg one chat per database file / proof chain, for lots of reasons including that.


CRDTs are powerful, but it's unfortunate that they leave behind a trail of historical operations (or elements), both in their ops- or state-based variants. Even with compression, it's still a downside that makes me concerned about adopting them.

There's nothing inherent in the concept of a CRDT that requires you to leave behind a trail of historical operations or elements.

You'd be better off directing your criticism and specific implementations than making this blanket statement about what is, at the end of the day, a set of mathematical laws that certain data types / databases follow.


What's the concern if you can delete history?


In a distributed system, which is often a place CRDTs thrive, deleted history means a new client cannot be brought up to the current state unless there is some form of state summarization. Doing this summarization/checkpointing in a consistent manner is difficult to do without some form of online consensus mechanism, which is also difficult to do in a distributed system with clients popping in and out all the time.


It depends on the system. Some approaches (like Greg Little's Shelf or my Eg-walker algorithm for text) make this trivial to implement.


(2021) and Automerge's Rust implementation seems to have landed so it would be interesting to see an updated benchmark.


Author here. Yjs has also been rewritten in rust (yrs) and it’s significantly faster than the JavaScript version.

I’ve also got a new, totally different approach to solving this problem too.

It would definitely be good to update the benchmarks. Everything has gotten faster.


But in the eg-walker paper, you benchmark against Yjs rather than Yrs because Yrs performed worse?

would love to read the totally different approach


Absolutely support this, so do I!


This is one of those rare articles which, although much of the material is over my head, I couldn't stop reading because it's written so well.


Author here. Very kind :)



Thanks! Macroexpanded:

5000x faster CRDTs: An adventure in optimization (2021) - https://news.ycombinator.com/item?id=33903563 - Dec 2022 (22 comments)

Faster CRDTs: An Adventure in Optimization - https://news.ycombinator.com/item?id=28017204 - July 2021 (151 comments)


Quoting the current github Readme [0]: >And since that blog post came out, performance has increased another 10-80x (!).

[0] https://github.com/josephg/diamond-types


Can someone explain to me please why CRDTs are slow?

This article suggests the future to me: https://joelgustafson.com/posts/2023-05-04/merklizing-the-ke...

Take a look at this and compare it to Y.js or automerge: https://github.com/canvasxyz/okra-js


> Can someone explain to me please why CRDTs are slow?

Author here. The main reason was that a lot of CRDT libraries were written by academics, who don't have the time, skill or interest in optimising them.

Since writing this article a few years ago, all the major CRDT libraries have gotten orders of magnitude faster.


I remember stumbling over this post a few years ago. Really entertaining post, one of my favorites in recent years.


IIRC the title was CRDTs go brrr


Author here. CRDTs go brrr was my working title and it’s still in the url. I should probably rename it back to that - so many people latch on to that title anyway. The meme value is strong.


> Why is WASM 4x slower than native execution?

I thought it was because every string operation had to be copied into WASM memory and then back into JS when the result was computed. Am I wrong? Am I misunderstanding the context? Genuinely curious!


Author here. This post was from a few years ago but from memory I controlled for that. So the problem wasn't FFI.

I loaded the whole history into wasm before I started timing, then processed it in an inner loop that was written in rust, running in the wasm context itself. There were only 2 calls to wasm or something. The 4x slowdown wasn't FFI. The algorithmic code itself was actually running 4x slower.

It'd be interesting to rerun the benchmark now. I assume compilers are better at emitting wasm, and wasm runtimes have gotten faster. I'm sure I've still got the benchmarking code around somewhere.


Interesting, thanks for the reply!


That strikes me as the likely plausible culprit.

The one that keeps tripping me up in unrelated domains is that the multithreading story is not easy or fully supported by libraries and tooling. We've run game engines and utility binaries (ffmpeg, zip, etc.) in the browser and they're super slow because of this.


I think the better question is why would someone expect them to be the same? I haven't worked on a WASM interpreter or JIT, but how often is it better to go through multiple layers of translation instead of one? Translating high level code to WASM, or any assembly language, makes you lose a lot of the "intent" embedded in the higher level code. In lower-level code, you often see a series of language-specific idioms for accomplishing the goal, that may or may not have direct equivalents on your actual machine. In the case of modern x86-64, you have a ton of instructions that are far more powerful than what you can do in WASM.

Decompilers exist of course, and maybe a list of macro-op fusions exists where a WASM JIT can do a relatively simple pattern match and get good native code (probably not though, and good luck with cross-platform optimization). And, LLVM is definitely not perfect, there's definitely low-hanging fruit where a post-processing optimizer could make improvements. So it is theoretically possible to make WASM faster than LLVM's native emit by doing something equivalent to an optimization step LLVM should have but doesn't at the moment.

But I highly, highly doubt that you'd get just as good results without some seriously well-laid plans, which I doubt exist, or by creating a set of instructions which is, in effect, a superset of what target ISAs support. From what I've seen, it's a subset, so good luck trying to canonicalize the operations and merge them back together in real time. Not entirely impossible of course, but it would be a serious feat of engineering.

Intuitively, if I wrote a book in English, then it got translated to a very different language that's also limited to only a few thousand words, then translated back to English, it wouldn't be exactly the same text. There would be concepts that have to be explained with a paragraph that, in English, would only take a single word. Getting my English back out would require a 1-to-1 translation of everything, or a list of paragraph-to-1 translations that both translators agree upon.


Seeing the hierarchical structure used I wonder if they tried using nested set instead. No idea if a possible gain in read operation would offset the losses in insertions.


Yeah, new rule: I don't believe anything in a published scientific paper until it has been independently verified for the third time. I don't even want to hear about it, before then, unless I read the journal the original (or second) paper was published in. What I'd really like, and would subscribe to even as a lay person, is the JOURNAL OF STUDIES WHOSE FINDINGS HAVE BEEN SUCCESSFULLY REPRODUCED FOR THE THIRD TIME. I'd pay for a subscription to that.


Me too.


Could be good to have the date put with the title?


Added above.


July 31 2021




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: