Hacker News | haadcode's comments

Can you elaborate why you think Rust is a better choice?


I would have said that for any statically typed language.

It simply leads to more robust software because the compiler rejects a large set of incorrect programs that would otherwise end up in production. With a dynamic language, your test coverage has to be much larger just to verify that your program is sane at the most basic level.

Rust's compiler is even stricter than most.

Not to mention resilience to change. JS is simply a blunder here.


Great comments! Indeed, "sign your log appends" gives everything needed for authority. Re. transactions, from OrbitDB's perspective this would be application specific, so you could hook into traditional, centralized consensus or use a blockchain or other types of decentralized consensus mechanisms. Or, given the core data structure is a log, build a "custom database" on OrbitDB that models and provides an interface for a consensus algorithm, eg. "append 1: head is X" <- "append 2: ack head is X" <- "append 3: ack from me too that head is X" etc.


One great piece of writing for thinking about the use cases, and the kinds of systems and applications that can be built following the concepts applied in OrbitDB, is "Local-first software": https://www.inkandswitch.com/local-first.html (there's probably a thread somewhere here on that too).


> Last-Writer-Wins is a conflict resolution strategy that can be used by any kind of data type that needs conflicts resolved, CRDTs included. Unfortunately it's not a very good one: even if you use vector clocks instead of wall clocks, it doesn't give you much stronger guarantees than determinism. That is, given two concurrent writes, the winner is essentially arbitrary. LWW is a merge strategy of last resort; if that's the only thing your CRDT system offers, I'm not sure it's really fair to call it a CRDT system.

Can't reply to the comment below, so replying here.

I believe what markhenderson was trying to say is that in OrbitDB, the default merge strategy for concurrent operations is LWW.

The comment above is conflating a lot of things here. 1) Determinism is exactly the guarantee one needs for CRDTs, and I'd argue it's generally a good thing in distributed systems. 2) Neither vector clocks (OrbitDB uses Lamport clocks, or Merkle Clocks [1], by default) nor wall clocks have anything to do with determinism, and in fact there's a good reason not to use vector clocks by default: they grow unbounded in a system where the users (=IDs) are not known in advance. In my experience, LWW is a good baseline merge strategy.

I don't think it's at all correct to say that "the winner is essentially arbitrary", because it's not. The "last" in LWW can be determined based on any number of facts. For example: "in case of concurrent operations, always take the one written by the ID of the user's mobile device", or "in case of concurrent operations, always take the one that <your preferred time/ordering service> says should come first". It'd be more correct to say "the winner is based on the logical time ordering function, which may not be chronological, real-world time order".
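For illustration, here's a minimal JavaScript sketch of what a deterministic "last" function can look like. The entry shapes and names here are illustrative assumptions (a Lamport clock as `{ time, id }`), not OrbitDB's actual API:

```javascript
// Deterministic "last-writer" comparator (illustrative, not OrbitDB's API).
// Entries are assumed to carry a Lamport clock of the shape { time, id }.
function compareClocks(a, b) {
  // Primary: logical time. Tie-break: writer id, so every replica
  // picks the same winner for concurrent writes.
  if (a.clock.time !== b.clock.time) return a.clock.time - b.clock.time;
  return a.clock.id < b.clock.id ? -1 : a.clock.id > b.clock.id ? 1 : 0;
}

// "Always take the write from my mobile device" as a custom comparator:
function preferWriter(deviceId) {
  return (a, b) => {
    if (a.clock.id === deviceId && b.clock.id !== deviceId) return 1;
    if (b.clock.id === deviceId && a.clock.id !== deviceId) return -1;
    return compareClocks(a, b);
  };
}

// Two concurrent writes: same logical time, different writers.
const writes = [
  { clock: { time: 3, id: 'x' }, value: 4 },
  { clock: { time: 3, id: 'y' }, value: 6 },
];
const last = [...writes].sort(compareClocks).pop(); // deterministic winner
```

Because every replica runs the same comparator over the same set of entries, they all pick the same winner; the "last" write is whatever the ordering function says it is.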

As for the last comment, I'm pretty sure it's a CRDT system :) Want to elaborate your reasoning why you think it's not a CRDT?

[1] "Merkle-CRDTs: Merkle-DAGs meet CRDTs" - https://arxiv.org/abs/2004.00107


OK, I've read the paper; can you help me reason through a scenario?

As I understand it, the Merkle-CRDT represents a Merkle tree as a grow-only set of 3-tuples. When you add a new event to the thing (as a tuple) you have to reference all of the current concurrent root nodes of the data structure, in effect becoming the new single root node; and your event data, which must be a CRDT, gets merged with the CRDTs of those root nodes. Do I have it right so far?

Assuming yes, let's say you have causality chain like so:

    1 --> 2 --> 3 --> 4 
           `--> 5 --> 6
Two root nodes, 4 and 6. Two concurrent histories, 3-4 and 5-6. It's time to write a new value, so I create a new tuple with references to 4 and 6, and merge their CRDT values. Last Writer Wins, right? So either 4 or 6 dominates the other. Whoever was in the other causal history just... lost their writes?


almost! :) let me elaborate on a few points.

> you have to reference all of the current concurrent root nodes of the data structure, in effect becoming the new single root node

correct, and more precisely the union of the heads is the current "single root node". in practice, and this is where the merge strategy comes in, the "latest value" is the value of the event that is "last" (as per the LWW sorting).

> and your event data, which must be a CRDT, gets merged with the CRDTs of those root nodes.

the event data itself doesn't have to be a CRDT; it can be any data structure. the "root nodes" (meaning the heads of the log) don't get merged with the "event data" (assuming you mean the database/model layer on top of the log); the merge strategy of the log picks the "last/latest event data" to be the latest value of your data structure.

> It's time to write a new value, so I create a new tuple with references to 4 and 6, and merge their CRDT values.

when a new value is written, it's correct that the references to 4 and 6 are stored, but the new value doesn't merge the values of the previous events; rather, it's a new value of its own. it may replace the value from one or both of the previous events, but that depends on the data model (a layer up from the log).

  1 --> 2 --> 3 --> 4 
         `--> 5 --> 6
> Last Writer Wins, right? So either 4 or 6 dominates the other. Whoever was in the other causal history just... lost their writes?

no writes are lost. the result in your example depends on what 4 and 6 refer to. in a log database, the ordered log would be eg. 1<-2<-3<-5<-4<-6, so all values are preserved. in the case of a key-value store, it could be that 4 is a set operation on key a and 6 is a set operation on key b, in which case the writes don't affect each other. if 4 and 6 are both set operations on key a, then key a would have the value from 6, and the next write to key a would overwrite it. makes sense?
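to make the key-value case concrete, here's a small JavaScript sketch (entry shapes and ids x/y are illustrative, not OrbitDB's API) that linearizes the DAG above with a deterministic sort and folds the set operations into a key-value view:

```javascript
// The six operations from the DAG, as ((time, id), key, value) entries.
// Writer x produced 1..4, writer y branched off after 2 and produced 5..6.
const log = [
  { clock: { time: 0, id: 'x' }, op: 'set', key: 'a', value: 1 },
  { clock: { time: 1, id: 'x' }, op: 'set', key: 'a', value: 2 },
  { clock: { time: 2, id: 'x' }, op: 'set', key: 'a', value: 3 },
  { clock: { time: 3, id: 'x' }, op: 'set', key: 'a', value: 4 },
  { clock: { time: 2, id: 'y' }, op: 'set', key: 'a', value: 5 },
  { clock: { time: 3, id: 'y' }, op: 'set', key: 'a', value: 6 },
];

// Total order: logical time first, writer id as a deterministic tie-break.
// This yields the 1<-2<-3<-5<-4<-6 ordering from the comment above.
const ordered = [...log].sort((a, b) =>
  a.clock.time - b.clock.time ||
  (a.clock.id < b.clock.id ? -1 : a.clock.id > b.clock.id ? 1 : 0));

// The key-value view is just a fold over the ordered log.
const kv = {};
for (const e of ordered) if (e.op === 'set') kv[e.key] = e.value;
```

note that every entry survives in `ordered`; only the key-value *view* collapses them down to a single latest value per key.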


> in a log database, the ordered log would be eg. 1<-2<-3<-5<-4<-6

How do you know that? It's not inferable from the DAG. Is sequencing also provided "a layer up"?

> if 4 and 6 are both a set operation on key a, it would mean that key a would have the value from 6 and the next write to key a would overwrite the value in a.

Yes, I mean for all of my events to be reads and writes of the same key. And you've proven my point, I think: if the resolution of this causal tree is Last-Writer-Wins, 6 dominates 4, and "key a [gets] the value from 6", then whichever poor user was operating on the 3-4 causal branch has lost their writes.

This is a problem! If you claim to be a CRDT and offline-first or whatever, then as a user, I expect that the operations I make while I'm disconnected aren't just going to be destroyed when I reconnect, because someone else happened to be using a computer with a lexicographically superior hostname (or however you derive your vector clocks).

And if you want to say something like, well, when the unlucky user reconnects and sees that their work has been overwritten, they can just look in the causal history, extract their edits, and re-apply them to the new root -- there's no reason for any of this complex machinery! You don't need CRDTs to just replicate a log of all operations. Of course, you also can't do any meaningful work with such a data structure as a foundation, because it immediately becomes unusably large.


It's inferable from the fact that the writer saw both chains and produced a new node that merges these heads. So that writer "resolves" the conflict - according to whatever strategy it is programmed to use. (It might be as simple as just storing a JSON that says {"conflict": true, "values": [4, 6]} , and the user will have to pick.)

If it's possible to model operations in a commutative way (eg. instead of assigning values to keys, one just stores differences), then the conflict resolution is mathematically proven: just apply all operations in whatever order, they're commutative, great. Of course it doesn't help with "real world data", but that's where we can use an oracle (the user, or whatever linearizer service we choose).
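A toy JavaScript example of the commutative case: if replicas exchange deltas instead of absolute values, any delivery order converges to the same state, with no conflict resolution needed:

```javascript
// Toy commutative-op example: store differences, not assignments.
// Addition is commutative and associative, so delivery order is irrelevant.
const ops = [+2, -1, +5, +3];

const applyAll = (deltas) => deltas.reduce((n, d) => n + d, 0);

const replicaA = applyAll(ops);                 // one delivery order
const replicaB = applyAll([...ops].reverse());  // a different order
// replicaA === replicaB: both replicas converge to the same value.
```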


> How do you know that? It's not inferrable from the DAG. Is sequencing also provided "a layer up"?

I jumped the gun there and assumed that the values of the nodes already follow the LWW ordering :) Ok, so without that assumption, the numbers in the DAG

  1 --> 2 --> 3 --> 4 
         `--> 5 --> 6
...are the values of the operations that the DAG represents, ie. values of a key, so we need to look at the Lamport clocks (or Merkle Clocks, when the operations are hashed as a Merkle DAG) of each operation, represented here as ((ts, id), key, value):

  ((0, x), a, 1) --> ((1, x), a, 2) --> ((2, x), a, 3) --> ((3, x), a, 4)
                                   `--> ((2, y), a, 5) --> ((3, y), a, 6)

which one is the latest value for key a? Which updates, semantically, were lost? In a non-CRDT system, which value (4/x or 6/y) is or should be displayed and considered the latest?

> This is a problem! If you claim to be a CRDT and offline-first or whatever, then as a user, I expect that the operations I make while I'm disconnected aren't just going to be destroyed when I reconnect, because someone else happened to be using a computer with a lexicographically superior hostname (or however you derive your vector clocks).

You're conflating the data(base) model with the log, and we can't generalize that all cases of data models or merge conflicts are cases of "I expect all my operations to be the latest and visible to me". They are semantically different. If the writes are on the same key, one of them has to come first if the notion of a "latest single value" is required. If the writes are not on the same key, or not key-based, multiple values appear where they need to.

What we can generalize is that by giving a deterministic sorting function, the "latest value" is the same for all participants (readers) in the system. From a data structure perspective this is correct: given the same set of operations, you always get the same result. For many use cases, LWW works perfectly fine, and if your data model requires a "different interpretation" of the latest values, you can pass in your custom merge logic (=sorting function) in orbitdb. The cool thing is that by giving a deterministic sorting function for a log, you can turn almost any data structure into a CRDT. How that translates to the end-user data model will depend on the case (eg. I wouldn't model, say, "comments on a blog post" as a key-value store).

If you're curious to understand more, I think the model is best described in the paper "OpSets: Sequential Specifications for Replicated Datatypes" [1]. Another two papers from the same author that may also help are "Online Event Processing" [2] and "Moving Elements in List CRDTs" [3], which show how, by breaking the data model down to be more granular than "all or nothing", composing different CRDTs gives rise to new CRDTs, which I find beautiful. Anything, really, that M. Kleppmann has written about the topic is worth a read :)

[1] https://arxiv.org/pdf/1805.04263.pdf [2] https://martin.kleppmann.com/papers/olep-cacm.pdf [3] https://martin.kleppmann.com/papers/list-move-papoc20.pdf


OK, I understand now. I guess my points then translate to:

1. Modeling an append-only log as a CRDT is trivial

2. Building a database on top of a "CRDT" append-only log doesn't make the database a CRDT


No, on both. See above.


There's also a Go implementation at https://github.com/berty/go-orbit-db


To clarify here, OrbitDB's core is an append-only, immutable log CRDT. While a mouthful, what it gives you is a distributed (decentralized) database operations log. Almost any type of database, or data structure, can then be built on top of that log. The log in OrbitDB is a Merkle-DAG [1], so, a graph.

Key-Value databases, feeds, and other data model types that OrbitDB supports by default, are all built on that log. You can also create your custom database types, ie. custom data models.
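A tiny JavaScript illustration of that layering, with made-up entry shapes (not OrbitDB's actual API): the ordered operations log sits below, and different data models are just different folds over the same entries:

```javascript
// The shared substrate: an ordered log of operations.
const oplog = [
  { op: 'set', key: 'name', value: 'alice' },
  { op: 'add', value: 'hello' },
  { op: 'set', key: 'name', value: 'bob' },
  { op: 'add', value: 'world' },
];

// A key-value store is one interpretation of the log...
const kvStore = oplog.reduce(
  (kv, e) => (e.op === 'set' ? { ...kv, [e.key]: e.value } : kv), {});

// ...and a feed is another interpretation of the very same entries.
const feed = oplog.filter(e => e.op === 'add').map(e => e.value);
```

A custom database type is then just a custom fold (and merge ordering) over the same log.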

[1] https://discuss.ipfs.io/t/what-is-a-merkle-dag/386/4


The log described above is this https://github.com/orbitdb/ipfs-log


What makes it a CRDT?


Posted this in another reply above, but give this a read: "Merkle CRDTs" (https://arxiv.org/abs/2004.00107).


The use case is shared, mutable data structures that don't rely on central coordination or control.


Does that even make sense? Why wouldn't you want centralized coordination of your data structures? We could go back to people emailing Excel files to each other if that's what you want.


> I guess performance and robustness were not a priority.

You'd be surprised how well JS does on both fronts, in addition to being able to run across platforms :)


Performance-wise, JS is not too bad when you have lots of IO going on, as in this case... but robustness? You gotta be kidding! JS is the poster child of a language that does NOT have robustness as one of its attributes. Just about every single one of its design decisions makes robustness difficult.


You should check your definition.

Define robust:

1. Strong and healthy; vigorous

- (of a process, system, organization, etc.) able to withstand or overcome adverse conditions.

JS is the most robust language we have according to the main definition, however you measure it. What other language is as alive as JS right now? What other language has as much programmer attention?

JS is also easily robust according to the noted sub-definition. What other language would survive the web? The web, an environment that could easily be described as "very adverse"... What other language is ready to run in a browser where it will be dynamically and continuously mixed with modules from many different sources, without breaking? JS was built for this.

What other popular language is so easily runnable cross-platform? What other popular language lets you program with both OOP and functional paradigms while also being as popular as JS? What other popular language lets you monkey patch things to fit a piece of code into any situation?

Robust means all of these things to me. What does robust mean to you and what is your example of the most robust programming language? You didn't say...

None of these fit the bill: Python, Ruby, C, C++, Golang, Rust, C#.


You are arguing for the sake of arguing. OP said JS is not a good choice when performance and robustness are important... any programmer knows exactly what they mean, but I will spell it out for you as you clearly do not get it: if you're going to write a DB, you want it to be performant (run fast) and robust (does not fail or lose data, or get into a corrupt state easily). JS is absolutely not the first choice when performance is very important, but I concede it can run fast enough for some kinds of applications... but would a DB written in JS be robust? The answer by anyone who knows anything about programming languages has to be a big NO!

Any language that has a dynamic & weak type system, and where monkey patching is not only allowed but used widely, cannot make a claim to leading to robust software.

I would argue that languages focusing on correctness are the ones that have a good claim at producing more robust software. As you've asked, I would say that Rust, Ada, and Haskell fit the bill... but even languages that focus on keeping simplicity and boring uniformity at scale, like Go and Java, could still make such a claim (and lots of software written in them can be described as robust) far more than JS, which offers basically zero features focused on correctness beyond the bare minimum.


Maybe JavaScript isn't the best choice for high-end/large databases. But it certainly is a good fit for web3 applications and lightweight Node.js software. There is also a Go edition of orbit-db.

> (does not fail or lose data, or get into a corrupt state easily)

The data is stored primarily on IPFS (pick any of the different IPFS implementations: Rust, Go, JS). Oplog-specific data is stored in a datastore-level instance, which uses the C++ leveldown bindings. That code is well tested and has never been shown to lose data merely because it's written in JS. Aside from that, the programming language does not determine whether a developer creates crappy code; the developer does. Horrible, inefficient, bug-filled code can exist in any number of languages. It's really a matter of preference.

Side note: I used to absolutely hate JS until I began programming in it for many months, having originally programmed in Java.


> You are arguing for the sake of arguing.

Nope. I actually just disagreed with you. I think JS is a robust language and I said why very clearly and politely.

You, however - well I stopped reading after your first sentence.

Good luck to you!


You start with "You should check your definition." and you think that is "actually just disagreed with you. I think JS is a robust language and I said why very clearly and politely."?

You didn't say why JS is robust, you twisted the word's definition to fit your point. Don't do that.

If your way of debating is to leave the debate when someone says something you don't like, then good luck working in IT.


> * How did this project get started? What problem is it trying to solve?

OrbitDB got started because we wanted to build serverless applications, especially for the web (ie. applications that run in the browser.) Serverless meaning "no server" and no central authority, ie. something that can't be shut down.

OrbitDB gives you tools to build systems and applications where the user owns their data, that is, data that is not controlled by a service. As a simple example, imagine a Twitter that doesn't have one massive database for all tweets, but where instead each user has their own database.


Absolutely! I don't know the exact history and why it hasn't been applied much, and I'd be very curious to learn, but it's definitely a gem.

