CRDT: Fractional Indexing (madebyevan.com)
343 points by mblode on Nov 27, 2022 | hide | past | favorite | 44 comments



To me the other algorithms described in the list are more novel and interesting:

https://madebyevan.com/algos/crdt-tree-based-indexing/ - for when precise order is critical, like paragraphs in a document. This algorithm is almost like storing adjacency information like a linked list, but is more convergent. Very interesting for [my use-case](https://www.notion.so/blog/data-model-behind-notion).

https://madebyevan.com/algos/crdt-mutable-tree-hierarchy/ - for tree-shaped data, like blocks in a Notion page that should have exactly one parent, but allow concurrent re-parenting operations

https://madebyevan.com/algos/log-spaced-snapshots/ - log space snapshots, for choosing what fidelity of historical information to store. For context, many CRDTs for rich text or sequences store unbounded history so that any edit made at any time can be merged into the sequence. For long-lived documents, this could be impractical to sync to all clients or keep in "hot" memory. Instead, we can decide to compact historical data and move it to cold storage, imposing a time boundary on what writes the system can accept on the hot path. The log-spaced snapshots algorithm here could be used to decide what should be kept "hot", and how to tune the cold storage.
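One way to picture "log-spaced" retention (my reading of the idea, not necessarily the exact algorithm in the post): keep roughly one snapshot per power-of-two age bucket, so recent history stays dense while old history gets exponentially sparser before moving to cold storage.

```python
import math

def log_spaced_keep(timestamps, now):
    # Hypothetical sketch: retain one snapshot per power-of-two age
    # bucket (the newest snapshot in each bucket wins), so snapshot
    # density decays logarithmically with age.
    kept = {}
    for t in sorted(timestamps, reverse=True):  # newest first
        age = max(now - t, 1)
        bucket = int(math.log2(age))
        kept.setdefault(bucket, t)
    return sorted(kept.values())

# 101 snapshots taken at times 0..100 collapse to 7 retained ones:
# [36, 68, 84, 92, 96, 98, 100]
print(log_spaced_keep(range(101), now=100))
```

Note how the retained timestamps cluster near `now` and thin out going back in time, which is exactly the fidelity trade-off described above.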


If you’re working on CRDT stuff in production (or possibly in production) do you have thoughts on the CRDT vs OT debate? I would expect Notion to use operational transform given the availability of reliable central servers. But I know quite little! Interested in your thoughts.


I’m not the GP, but OT is pretty annoying to implement. There are so many cases that it’s quite difficult to formally prove an OT correct. On the other hand, a large subset of CRDTs can be implemented in Datalog and if you do that you can’t possibly end up with an invalid CRDT.

From wikipedia:

> Similarly, Joseph Gentle who is a former Google Wave engineer and an author of the Share.JS library wrote, "Unfortunately, implementing OT sucks. There's a million algorithms with different tradeoffs, mostly trapped in academic papers. […] Wave took 2 years to write and if we rewrote it today, it would take almost as long to write a second time." But later he amends his comment with "I no longer believe that wave would take 2 years to implement now - mostly because of advances in web frameworks and web browsers."


CRDTs are mostly studied in p2p settings, while a central server can remove much of the complexity of conflict resolution.


For anyone unsure of what a CRDT is (though I think everyone on HN must know by now), this is the perfect intro: https://www.inkandswitch.com/peritext/

The two most widely used CRDT implementations (combining JSON like general purpose types and rich text editing types) are:

- Automerge https://github.com/automerge/automerge

- Yjs https://github.com/yjs/yjs

Both have JS and Rust implementations, and have bindings to most online rich text editors.

CRDTs are addictive once you get into them.


Even that link was 5 pages in (on my phone) before it deigned to mention:

> It is a Conflict-free Replicated Data Type (CRDT)

What happened to the idea of defining all non-universally-recognised acronyms the first time you use the term? With people making up new terms exponentially faster than ever before, it’s now more important than ever.


To be fair, it's one of those terms where merely expanding the acronym doesn't do much to help you understand what it is. And if you know what a "Conflict-free Replicated Data Type" is, you've probably heard the acronym more than a few times.


The first use is a hyperlink to a whole article defining the term, what are you on about?


TBF, it uses the acronym 8 times across 500 words before giving you the actual term.


Also, outside the page title/headings and the reference to the name of said external paper the first use in the document _is_ where they use the expanded name.


five pages down.


CRDTs are often talked about in the same breath as collaborative editing software, but they're useful for much more than that.

They really are a theoretical model of how distributed, convergent, multi-master systems have to work. I.e., the DT in CRDT could be a whole datastore, not just an individual document.

(Wish I could remember who on HN alerted me to this. I had read the paper but didn't grok the full implications).


You might be thinking of this [1] paper? The core idea is that if your problem space lends itself to monotonicity (see the paper for a more precise definition than I can give), then you can build a globally consistent database (around a CRDT) where the end user doesn't need to concern themselves with inconsistent states while consistency is being reached.

[1] https://arxiv.org/pdf/1901.01930.pdf


I wasn't clear - by "the paper" I meant the original CRDT paper. I read it, thought I understood it on some level, but had not connected the dots between the theory and real-world problems.

Regardless, thanks very much for linking the paper! Right up my alley.


Marc Shapiro and Nuno Preguiça created CRDTs in 2007.

https://pages.lip6.fr/Marc.Shapiro/pubs.html#CRDTs

The original paper would be https://hal.inria.fr/inria-00177693/


We're using a single end-to-end encrypted document tree synced with CRDTs for our collaborative task IDE[1]. All data for a team is a single tree (graph really, if you count transclusions), and it's kind of magical how simple everything gets when you know all state will sync in a deterministic way between all clients. It doesn't matter whether you drag&drop some object, add a new user, or rename something. It all reduces to a handful of CRDT operations.

(We have a central server and the app works offline so the algorithm from the linked article doesn't apply in our case.)

[1] https://thymer.com


Are you using Yjs? How are you thinking about scale-up for teams with 100s - 1000s of users?


We've looked at Yjs but ultimately decided to write our own thing from scratch.

Even scale-ups with thousands of users will have people working on different parts of the document tree in practice. The client doesn't need to receive every keystroke for people editing somewhere far away in the document tree. Those updates can be sent in batches every once in a while.

If 10,000 people decide to edit the exact same location in a document then performance will degrade. The client will start to lag behind if it can't keep up with the stream of updates (CPU or internet bottleneck), but eventually it will process everything and all clients will end up with the same state. We have two channels: one for state updates (soft real-time) and one for UI hints ("hard" real-time) and other user actions that aren't CRDT mutations.


How did you get end-to-end encryption working with CRDTs?


Probably something by Martin Kleppmann?


This stuff is fun to play with. I implemented a Rust version of fractional indexing based on another of Evan’s blog posts.

https://docs.rs/fractional_index/latest/fractional_index/


Ok, let's try changing the URL from https://madebyevan.com/algos/ (a generic list) to https://docs.rs/fractional_index/latest/fractional_index/ (the item you're talking about).

Explanation here: https://news.ycombinator.com/item?id=33764956. Thanks!


Does a Vec<u8> with that calculation listed end up smaller than two integers of (presumably) arbitrary length?

When I read about the problem, I imagined using the mediant to calculate “between”. I think that mediant of $value and 1/1 would work for “after” and mediant of $value and 0/1 would work for “before”.

There must be a requirement I’m missing, though!

mediant: https://en.wikipedia.org/wiki/Mediant_(mathematics)
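The mediant idea is easy to try out with Python's `fractions` module (a sketch of the parent's suggestion, not the article's algorithm):

```python
from fractions import Fraction

def mediant(p, q):
    # The mediant of a/b and c/d is (a+c)/(b+d); for positive
    # denominators it always lies strictly between the two fractions.
    return Fraction(p.numerator + q.numerator, p.denominator + q.denominator)

lo, hi = Fraction(1, 3), Fraction(1, 2)
mid = mediant(lo, hi)                             # 2/5
assert lo < mid < hi

after = mediant(Fraction(1, 2), Fraction(1, 1))   # 2/3: "after" 1/2
before = mediant(Fraction(1, 2), Fraction(0, 1))  # 1/3: "before" 1/2
```

One caveat: `Fraction` normalizes to lowest terms, so the numerators and denominators fed to `mediant` are the reduced ones; the strictly-between property still holds either way.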


That would be an interesting thing to test! My first instinct is that the comparison operation (when denominators differ) would be more expensive.

In general some approaches should be better than others depending on how inserts are distributed — I think my approach is optimal when inserts are made at random (I haven’t proved this), but in practice it may be common for a bunch of things to be inserted in sequence for some applications. In those cases, there are more space-optimal approaches.


CRDTs are incredible. I recommend checking out the original link [1] as well as this “awesome” list [2]

[1] https://madebyevan.com/algos/

[2] https://github.com/alangibson/awesome-crdt


This is basically the idea behind Logoot [Weis_2009] that was improved by LSeq [Nédelec_2013] and later extended to the first block-wise sequence CRDT: LogootSplit [André_2013]. LogootSplit was recently improved as Dotted LogootSplit [1] [Elvinger_2021].

Disclaimer: I'm the author of Dotted LogootSplit.

[Weis_2009] https://hal.inria.fr/inria-00432368

[Nédelec_2013] https://hal.archives-ouvertes.fr/hal-00921633/en

[André_2013] https://hal.archives-ouvertes.fr/hal-01246212

[Elvinger_2021] https://hal.univ-lorraine.fr/tel-03284806

[1] https://github.com/coast-team/dotted-logootsplit


This is kind of interesting but "fractional indexing" doesn't seem to be a computer science topic, and I think it might be clearer to treat these indexes as lists of numbers (or ordinals in ω^ω, if you prefer) rather than fractions. Those are simpler to generate and compare than arbitrary-precision fractions. Or as jitl's post suggests, using trees as indexes (I haven't yet looked at jitl's linked articles). Those would presumably have order type ε₀. It's not clear to me why you'd want that, but it seems doable. In all these schemes you might occasionally want a "stop the world garbage collection" where you reset all the indices to be ordinary integers or maybe pairs of integers. I guess that is also doable without having to pause all the updates, at least if you use pairs.


In a large distributed system stop the world garbage collection is probably not going to work very well.


Reminds me a little of "Lexorank" used in Jira[0], except using decimals instead of base-36 strings, which I guess could be more efficient? The "rebalancing" aspect is interesting, because on a long enough timescale for a document you will definitely want to do it. Would love to read up more on any algorithms for doing this in a p2p manner - with a central server it's probably quite easy using some tombstoning or something.

[0] https://stackoverflow.com/questions/40718900/jiras-lexorank-...


reminds me a lot of Ford Circle / Farey Diagram / Stern Brocot tree

Basically a tree of fractions where you take two rational points on a number line, a/b and c/d, and the next point in the tree is their mediant, (a+c)/(b+d). It turns out that every point you create this way is unique - it never duplicates an earlier one - and together they form a tree-like structure.

https://en.wikipedia.org/wiki/Ford_circle

Not sure if this would be useful, but basically it could be a fractional index with a built-in tree structure, since any positive fraction is a node of a Stern-Brocot tree.


The algorithm described in the article might be better served by a Stern-Brocot tree than by average + jitter.


Is this the same as LSeq [1] except rather than using bytes one is basically using digits in a floating point representation (given this is JS where most things are floats)?

[1] https://bartoszsypytkowski.com/operation-based-crdts-arrays-...


Doesn't this end up being effectively a binary tree, with a maximum depth of around 52 (the double-precision mantissa, since JS numbers are doubles)? I imagine there must be a rebalancing operation required every so often, possibly more frequently for pathological insertion orders.


> Fractional positions should be represented using arbitrary-precision decimals so that they don't run out of precision. Floating-point numbers are insufficient.
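A digit-string midpoint makes "arbitrary precision" concrete. This is a simplified sketch based on the article's description (without the random jitter it adds): keys are decimal digit strings read as 0.key, compared lexicographically, and never ending in '0'.

```python
def midpoint(a, b):
    """Return a key strictly between digit strings a and b, where a < b
    lexicographically and neither ends in '0'. The empty string acts
    as the lower bound (0) on the left and the upper bound (1) on the right."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1                          # strip the common prefix
    prefix, a, b = a[:i], a[i:], b[i:]
    da = int(a[0]) if a else 0
    db = int(b[0]) if b else 10
    if db - da > 1:                     # room for a digit in between
        return prefix + str((da + db) // 2)
    if db - da == 1:                    # consecutive digits: recurse above a's tail
        return prefix + str(da) + midpoint(a[1:], "")
    # da == db only when a == "" and b starts with '0': recurse below b's tail
    return prefix + str(da) + midpoint("", b[1:])

assert midpoint("1", "2") == "15"
assert midpoint("19", "2") == "195"
assert "1" < midpoint("1", "10001") < "10001"
```

Because keys are strings of digits rather than floats, precision grows only as far as a crowded insertion point forces it to, and comparison stays plain string comparison.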


For those unaware: Evan is the cofounder, former CTO, and technical wizard behind Figma; creator of esbuild, etc.


Is there a good resource on designing collaborative apps with CRDTS, that is which type of CRDT and conflict resolution to pick for a given application? Something like https://mattweidner.com/2022/02/10/collaborative-data-design... but generalizes.


Thanks for sharing this. We implemented OT in our real-time editor, but CRDTs' potential for building a collaborative experience is evident - in particular fractional indexing and mutable-tree reparenting for our use cases.

We should end the CRDT vs OT debate once and for all.


Reminds me of TokuDB engine for MySQL. Good times.


Maybe I’m missing the point. Random offsets feels like an inelegant solution in the space of elegant solutions (read: CRDTs).


this reminds me of libketama sharding


This looks like great stuff if you follow the pointers, but lists don't make good submissions to HN (which itself is already a list). They tend not to lead to deep discussion because comments are about lowest common denominator of the items on the list, and this is usually pretty generic. What's better is to pick the most interesting item on the list and submit that instead. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...

Edit: let's do that in this case. I've changed the URL from https://madebyevan.com/algos/ based on https://news.ycombinator.com/item?id=33764875.


Thank you dang for updating the title and URL!


The funny thing is that the thread so far has still mostly been generic, yet I swear the comments are higher-quality than they would have been without the change. (Impossible to say for sure, of course)


Conflict-free replicated data type, or CRDT, is often referred to as simply “CRDT” in trade jargon without any definition. Apparently.



