> How do you know that? It's not inferrable from the DAG. Is sequencing also provided "a layer up"?

I jumped the gun there and assumed that the value of a node reflects the LWW ordering :) Ok, so without that assumption, say the numbers in the DAG

  1 --> 2 --> 3 --> 4 
         `--> 5 --> 6
...are the values of the operations that the DAG represents, i.e. the values of a key, so to order them we need to look at the Lamport clocks (or Merkle clocks, when the operations are hashed as a Merkle DAG) of each operation, represented here as ((ts, id), key, value):

  ((0, x), a, 1) --> ((1, x), a, 2) --> ((2, x), a, 3) --> ((3, x), a, 4)
                                   `--> ((2, y), a, 5) --> ((3, y), a, 6)

Which one is the latest value for key a? Which updates, semantically, were lost? In a non-CRDT system, which value (4/x or 6/y) is, or should be, displayed and considered the latest?
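
To make that concrete, here is a tiny sketch in TypeScript (the types are illustrative, not OrbitDB's API): the two heads of the DAG carry the same Lamport time from different writers, so they are necessarily concurrent, and the DAG alone cannot say which value of a is the latest.

  // Illustrative types, not OrbitDB's API: each operation carries a
  // Lamport clock (time, id), a key and a value.
  type Clock = { time: number; id: string };
  type Op = { clock: Clock; key: string; value: number };

  // The two heads of the DAG above:
  const headX: Op = { clock: { time: 3, id: "x" }, key: "a", value: 4 };
  const headY: Op = { clock: { time: 3, id: "y" }, key: "a", value: 6 };

  // If a happened-before b, then a's Lamport time is strictly smaller than
  // b's, so equal times from different writers mean the operations are
  // concurrent: causality alone cannot pick a winner.
  const concurrent =
    headX.clock.time === headY.clock.time && headX.clock.id !== headY.clock.id;
  console.log(concurrent); // true: neither 4 nor 6 is "the latest" yet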

> This is a problem! If you claim to be a CRDT and offline-first or whatever, then as a user, I expect that the operations I make while I'm disconnected aren't just going to be destroyed when I reconnect, because someone else happened to be using a computer with a lexicographically superior hostname (or however you derive your vector clocks).

You're conflating the data(base) model with the log, and we can't generalize that all cases of data models or merge conflicts are cases of "I expect all my operations to be the latest and visible to me". They are semantically different. If the writes are on the same key, one of them has to come first if the notion of a "latest single value" is required. If the writes are not on the same key, or not key-based at all, multiple values appear where they need to.

What we can generalize is that by giving a deterministic sorting function, the "latest value" is the same for all participants (readers) in the system. From a data structure perspective this is correct: given the same set of operations, you always get the same result. For many use cases LWW works perfectly fine, and if your data model requires a different interpretation of the latest values, you can pass in your custom merge logic (= a sorting function) in orbitdb. The cool thing is that by giving a deterministic sorting function for a log, you can turn almost any data structure into a CRDT. How they translate to an end-user data model will depend (e.g. I wouldn't model, say, "comments on a blog post" as a key-value store).
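
As a rough sketch of that idea in TypeScript (illustrative code, not OrbitDB's actual API): a deterministic LWW comparison, where a later Lamport time wins and ties fall back to the writer id, gives every replica the same total order over the same set of operations, and therefore the same "latest" value per key, no matter what order the operations arrived in.

  // Illustrative sketch, not OrbitDB's API: a deterministic comparison over
  // (Lamport time, writer id) turns the partial DAG order into the same
  // total order on every replica.
  type Clock = { time: number; id: string };
  type Op = { clock: Clock; key: string; value: number };

  // LWW tie-break: later Lamport time wins; equal times are broken by the
  // writer id. Any total, deterministic comparison would do just as well.
  function compare(a: Op, b: Op): number {
    if (a.clock.time !== b.clock.time) return a.clock.time - b.clock.time;
    return a.clock.id < b.clock.id ? -1 : a.clock.id > b.clock.id ? 1 : 0;
  }

  // Reduce the log to a key-value view: the last write in the total order wins.
  function latestValues(ops: Op[]): Map<string, number> {
    const view = new Map<string, number>();
    for (const op of [...ops].sort(compare)) view.set(op.key, op.value);
    return view;
  }

  // The six operations from the example DAG:
  const ops: Op[] = [
    { clock: { time: 0, id: "x" }, key: "a", value: 1 },
    { clock: { time: 1, id: "x" }, key: "a", value: 2 },
    { clock: { time: 2, id: "x" }, key: "a", value: 3 },
    { clock: { time: 3, id: "x" }, key: "a", value: 4 },
    { clock: { time: 2, id: "y" }, key: "a", value: 5 },
    { clock: { time: 3, id: "y" }, key: "a", value: 6 },
  ];

  // Same result regardless of the order a replica received the operations in:
  console.log(latestValues(ops).get("a"));                // 6
  console.log(latestValues([...ops].reverse()).get("a")); // 6

If 6/y is the "wrong" answer for your data model, you swap in a different comparison; the convergence argument only needs it to be deterministic and total.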

If you're curious to understand more, I think the model is best described in the paper "OpSets: Sequential Specifications for Replicated Datatypes" [1]. Two other papers from the same author that may also help are "Online Event Processing" [2] and "Moving Elements in List CRDTs" [3], which show how, by breaking the data model down into something more granular than "all or nothing", composing different CRDTs gives rise to new CRDTs, which I find beautiful. Anything, really, that M. Kleppmann has written on the topic is worth a read :)

[1] https://arxiv.org/pdf/1805.04263.pdf
[2] https://martin.kleppmann.com/papers/olep-cacm.pdf
[3] https://martin.kleppmann.com/papers/list-move-papoc20.pdf

OK, I understand now. I guess my points then translate to:

1. Modeling an append-only log as a CRDT is trivial

2. Building a database on top of a "CRDT" append-only log doesn't make the database a CRDT


No, on both. See above.
