
https://equilibrium.co is hiring REMOTE Rust programmers to help build the distributed web. We're looking for one senior and one mid-level engineer:

1. https://www.notion.so/Hiring-Rust-Engineer-882281f5248e45579...
2. https://www.notion.so/Hiring-Senior-Rust-Engineer-e6c94ccc26...


Hey cryptoquick, I remember you from a while back. I'm https://twitter.com/aphelionz, one of the maintainers. Nice to see you :)

You're not wrong about the memory usage. We decided early on to trade higher RAM usage for avoiding the performance hit of I/O (either to and from IPFS, or to and from the filesystem in the case of SnapDB).

That being said, I have been experimenting with removing indexes like the ones you pointed out in favor of generators that emit values as they are traversed. There are two main efforts here: one is feasible, the other is more difficult.

1. The `entryIndex` inside of ipfs-log can probably go, and the ipfs-log#traverse function can be made into an async generator that passes the oplog values up to the store (a rough sketch follows below).
2. The indexes you linked to that hold the calculated STATE are harder to get rid of - they could be persisted to IPFS or the file system as well.
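
Roughly what I have in mind for #1, as a minimal sketch - `getEntry` and `entry.next` are assumed names here, not necessarily ipfs-log's exact internals:

  // Sketch: replace entryIndex with an async generator that yields oplog
  // entries as they are traversed, instead of indexing everything up front.
  // `getEntry` (fetch one entry from IPFS) and `entry.next` (array of
  // parent hashes) are assumptions for illustration.
  async function* traverse (ipfs, heads) {
    const stack = [...heads]
    const seen = new Set()
    while (stack.length > 0) {
      const hash = stack.pop()
      if (seen.has(hash)) continue
      seen.add(hash)
      const entry = await getEntry(ipfs, hash) // I/O happens lazily, per entry
      stack.push(...entry.next)                // queue the entry's parents
      yield entry                              // emit upward, no index kept
    }
  }

  // usage: for await (const entry of traverse(ipfs, log.heads)) { ... }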

Open to ideas on #2. SnapDB might work for all I know, as I haven't attempted it yet.

I can't speak to Electron, React, Redux, Iced, etc, but my guess is there are optimizations one can do there as well.


Also, speaking of Rust - we'd love more contributors over at https://github.com/ipfs-rust/rust-ipfs/.


Well, I imagine you'll be happy to know that some of us have begun IPFS implementation in Rust: https://github.com/ipfs-rust/rust-ipfs


Definitely a much better choice


Can you elaborate why you think Rust is a better choice?


I would have said that for any statically typed language.

It simply leads to more robust software, because the compiler rejects a large set of incorrect programs that would otherwise end up in production. In a dynamically typed language, your test coverage has to be much larger just to verify that your program is sane at the most basic level.

Rust's compiler is even stricter than most.

Not to mention resilience to change; JS is simply a blunder there.


Yeah, sorry about that :( There's this though: https://github.com/haadcode/orbit-db-control-center

(which also seems to be putting https://ipfs.io under strain. Sorry, ipfs.io!)


Decentralized with a control center?


OrbitDB works with a CRDT stored in IPFS. In order to calculate the state of the database, it needs to reduce the CRDT oplog, which requires fetching all the entries. This was indeed very time-consuming, particularly for remote requests, since we loaded entries by following each entry's "nexts" list of addresses, one hop at a time.

HOWEVER! Our latest release, 0.23.0, mitigated this by using a power-of-2 skiplist to load things in parallel, which gave us a nice 4-5x boost there.
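
In sketch form, the idea looks something like this - field names like `next` and `refs` are from memory, so treat it as an illustration rather than the exact implementation:

  // Sketch: entries carry back-references at power-of-2 distances (refs),
  // so the loader can fetch whole batches concurrently instead of walking
  // the `next` pointers one hop at a time.
  async function loadWithSkiplist (fetchEntry, headHashes) {
    const entries = new Map()
    let frontier = [...headHashes]
    while (frontier.length > 0) {
      const batch = await Promise.all(frontier.map(fetchEntry))
      frontier = []
      for (const entry of batch) {
        entries.set(entry.hash, entry)
        for (const hash of [...entry.next, ...entry.refs]) {
          if (!entries.has(hash) && !frontier.includes(hash)) frontier.push(hash)
        }
      }
    }
    return entries
  }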


Does this mean that in order to initialize the database, the entire version history must be synced and reduced, to get the current values?

(Is there a design doc for OrbitDB anywhere?)


Hey, thanks for taking a look at this. We support identity providers (the one you linked) and access controllers.[1]

Identity providers work by cross-signing an external keypair with the generated OrbitDB keypair, and access controllers generally work by exposing a `canAppend` function that facilitates any kind of auth you want to perform. There's support for Metamask, for example. OrbitDB is also used _in_ Metamask under the hood, by our good friends at 3Box.[2]
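
A minimal sketch of what an access controller looks like - `canAppend` is the real extension point, while the class name and allowlist logic here are just illustrative:

  // Sketch: an allowlist-based access controller. OrbitDB consults
  // canAppend for every incoming oplog entry before accepting it.
  class AllowListAccessController {
    constructor (allowedKeys) {
      this.allowedKeys = allowedKeys
    }

    async canAppend (entry, identityProvider) {
      const key = entry.identity.publicKey
      if (!this.allowedKeys.includes(key)) return false
      // also verify that the identity actually signed the entry
      return identityProvider.verifyIdentity(entry.identity)
    }
  }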

TallyLab, my application project, uses these in its IAM system and you can see the repos here.[3]

1. https://github.com/orbitdb/orbit-db-access-controllers
2. https://3box.io/
3. https://github.com/tallylab


Hey folks! I'm https://twitter.com/aphelionz, one of the maintainers of OrbitDB. Happy to answer any questions you might have; I'll be in the thread replying to folks as well.


I want to build a Reddit-like community using OrbitDB, but OrbitDB can't freely add and remove user permissions. When will this feature be implemented?


This is an open problem, and it might be surprising to find out just how difficult it is.

CRDTs usually work as last-write-wins, meaning that if you have a key-value store, the last update to a key "wins" the value, via the way oplog reduction works.

If you reverse that to a FIRST-write-wins log, you can grant permissions and ownership on a first-come, first-served basis. Revocation then becomes the issue: what do you do with the records revoked users already have? Questions like that are plentiful.

The approach most people take is to find workarounds or "good enough" solutions here, either by using encryption and allowing the encrypted data to be public, or by using some other OrbitDB store as their ACL and management layer, and only giving select keys write access to said ACL store in the first place.
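
To make that last workaround concrete, a hypothetical sketch - a separate key-value store acts as the ACL, and only entries from keys listed in it are accepted:

  // Sketch: an access controller backed by another OrbitDB store.
  // Only a few admin keys can write to aclStore itself, so it acts
  // as the (mutable) permission list for this database.
  class StoreBackedAccessController {
    constructor (aclStore) {
      this.aclStore = aclStore
    }

    async canAppend (entry, identityProvider) {
      const role = this.aclStore.get(entry.identity.publicKey)
      if (role !== 'write' && role !== 'admin') return false
      return identityProvider.verifyIdentity(entry.identity)
    }
  }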

Adding encryption into the mix though, particularly multi-writer, becomes exponentially harder.


> CRDTs usually work as last-write-wins

Um.


Ah, sorry, yes I stepped on a rake and whacked myself in the face here.

What I meant, since LWW is nomenclature for an alternative to CRDTs, is that in a CRDT the last writer by _logical clock_, not by _wall clock_ time, will "win" the key.


> LWW is nomenclature for an alternative to CRDTs

I think you're still stepping on rakes...


Can you elaborate?


> Last-Writer-Wins is a conflict resolution strategy that can be used by any kind of data type that needs conflicts resolved, CRDTs included. Unfortunately it's not a very good one: even if you use vector clocks instead of wall clocks, it doesn't give you much stronger guarantees than determinism. That is, given two concurrent writes, the winner is essentially arbitrary. LWW is a merge strategy of last resort; if that's the only thing your CRDT system offers, I'm not sure it's really fair to call it a CRDT system.

Can't reply to the comment below, so replying here.

I believe what markhenderson was trying to say is that in OrbitDB, the default merge strategy for concurrent operations is LWW.

The comment above is conflating a few things. 1) Determinism is exactly the guarantee one needs for CRDTs, and I'd argue it's generally a good thing in distributed systems. 2) Neither vector clocks (OrbitDB uses Lamport clocks, or Merkle clocks [1], by default) nor wall clocks have anything to do with determinism, and in fact there's a good reason not to use vector clocks by default: they grow unbounded in a system where the set of users (=IDs) is not known up front. In my experience, LWW is a good baseline merge strategy.

I don't think it's at all correct to say that "the winner is essentially arbitrary", because it's not. The "last" in LWW can be determined based on any number of factors. For example, "in case of concurrent operations, always take the one written by the ID of the user's mobile device", or "in case of concurrent operations, always take the one that <your preferred time/ordering service> says should come first". It'd be more correct to say "the winner is based on the logical time ordering function, which may not be chronological, real-world time order".
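
As a sketch, such an ordering function might look like this (entry field names are assumptions based on ipfs-log's entry shape):

  // Sketch: a deterministic last-write-wins comparator over Lamport clocks.
  // A higher logical time sorts later ("wins"); concurrent entries with
  // equal times are tie-broken deterministically by writer id.
  function lastWriteWins (a, b) {
    if (a.clock.time !== b.clock.time) return a.clock.time - b.clock.time
    return a.clock.id < b.clock.id ? -1 : 1
  }

  // usage: entries.sort(lastWriteWins) - every replica gets the same order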

As for the last comment, I'm pretty sure it is a CRDT system :) Want to elaborate on why you think it's not a CRDT?

[1] "Merkle-CRDTs: Merkle-DAGs meet CRDTs" - https://arxiv.org/abs/2004.00107


OK, I've read the paper; can you help me reason through a scenario?

As I understand it, the Merkle-CRDT represents a Merkle tree as a grow-only set of 3-tuples. When you add a new event to the thing (as a tuple) you have to reference all of the current concurrent root nodes of the data structure, in effect becoming the new single root node; and your event data, which must be a CRDT, gets merged with the CRDTs of those root nodes. Do I have it right so far?

Assuming yes, let's say you have a causality chain like so:

    1 --> 2 --> 3 --> 4 
           `--> 5 --> 6
Two root nodes, 4 and 6. Two concurrent histories, 3-4 and 5-6. It's time to write a new value, so I create a new tuple with references to 4 and 6, and merge their CRDT values. Last Writer Wins, right? So either 4 or 6 dominates the other. Whoever was in the other causal history just... lost their writes?


almost! :) let me elaborate on a few points.

> you have to reference all of the current concurrent root nodes of the data structure, in effect becoming the new single root node

correct, and more precisely the union of the heads is the current "single root node". in practice, and this is where the merge strategy comes in, the "latest value" is the value of the event that sorts "last" (as per LWW sorting).

> and your event data, which must be a CRDT, gets merged with the CRDTs of those root nodes.

the event data itself doesn't have to be a CRDT, it can be any data structure. the "root nodes" (meaning the heads of the log) don't get merged with the "event data" (assuming you mean the database/model layer on top of the log); the merge strategy of the log picks the "last/latest" event data to be the latest value of your data structure.

> It's time to write a new value, so I create a new tuple with references to 4 and 6, and merge their CRDT values.

when a new value is written, correct: the references to 4 and 6 are stored, but the new value doesn't merge the values of the previous events - rather, it's a new value of its own. it may replace the value from one or both of the previous events, but that depends on the data model (a layer up from the log).

  1 --> 2 --> 3 --> 4 
         `--> 5 --> 6
> Last Writer Wins, right? So either 4 or 6 dominates the other. Whoever was in the other causal history just... lost their writes?

no writes are lost. the result in your example depends on what 4 and 6 refer to. in a log database, the ordered log would be eg. 1<-2<-3<-5<-4<-6, so all values are preserved. in the case of a key-value store, it could be that 4 is a set operation to key a and 6 is a set operation to key b, thus the writes don't affect each other. if 4 and 6 are both a set operation on key a, it would mean that key a would have the value from 6 and the next write to key a would overwrite the value in a. makes sense?
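
in sketch form, the key-value reduction looks something like this (payload field names are illustrative, not necessarily the exact ones orbit-db uses):

  // sketch: fold the deterministically ordered oplog into key-value state.
  // later entries overwrite earlier ones on the same key, but every entry
  // stays in the log - nothing is removed from history.
  function reduce (sortedEntries) {
    const state = {}
    for (const e of sortedEntries) {
      if (e.payload.op === 'PUT') state[e.payload.key] = e.payload.value
      if (e.payload.op === 'DEL') delete state[e.payload.key]
    }
    return state
  }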


> in a log database, the ordered log would be eg. 1<-2<-3<-5<-4<-6

How do you know that? It's not inferable from the DAG. Is sequencing also provided "a layer up"?

> if 4 and 6 are both a set operation on key a, it would mean that key a would have the value from 6 and the next write to key a would overwrite the value in a.

Yes, I mean for all of my events to be reads and writes of the same key. And you've proven my point, I think: if the resolution of this causal tree is Last-Writer-Wins, 6 dominates 4, and "key a [gets] the value from 6", then whichever poor user was operating on the 3-4 causal branch has lost their writes.

This is a problem! If you claim to be a CRDT and offline-first or whatever, then as a user, I expect that the operations I make while I'm disconnected aren't just going to be destroyed when I reconnect, because someone else happened to be using a computer with a lexicographically superior hostname (or however you derive your vector clocks).

And if you want to say something like, well, when the unlucky user reconnects and sees that their work has been overwritten, they can just look in the causal history, extract their edits, and re-apply them to the new root -- there's no reason for any of this complex machinery! You don't need CRDTs to just replicate a log of all operations. Of course, you also can't do any meaningful work with such a data structure as a foundation, because it immediately becomes unusably large.


It's inferable from the fact that the writer saw both chains and produced a new node that merges those heads. So that writer "resolves" the conflict, according to whatever strategy it is programmed to use. (It might be as simple as storing a JSON object that says {"conflict": true, "values": [4, 6]}, and the user will have to pick.)

If it's possible to model operations in a commutative way (e.g. instead of assigning values to keys, one just stores differences), then conflict resolution is mathematically guaranteed: just apply all operations in whatever order, they're commutative, great. Of course that doesn't help with all "real world data", but that's where we can use an oracle (the user, or whatever linearizer service we choose).
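
A toy example of that commutative style - a counter that stores diffs instead of absolute values (entirely illustrative):

  // Sketch: storing differences makes the operations commutative, so
  // replicas converge no matter what order they apply them in.
  function applyAll (ops) {
    return ops.reduce((total, op) => total + op.diff, 0)
  }

  // same result in any order:
  // applyAll([{ diff: 2 }, { diff: -1 }]) === applyAll([{ diff: -1 }, { diff: 2 }])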


> How do you know that? It's not inferable from the DAG. Is sequencing also provided "a layer up"?

I jumped the gun there and assumed that the value of a node reflects the LWW ordering :) OK, so without that assumption, the DAG

  1 --> 2 --> 3 --> 4 
         `--> 5 --> 6
...shows the values of the operations that the DAG represents, i.e. values written to a key. To order them we need to look at the Lamport clocks (or Merkle clocks, when the operations are hashed as a Merkle DAG) of each operation, represented here as ((ts, id), key, value):

  ((0, x), a, 1) --> ((1, x), a, 2) --> ((2, x), a, 3) --> ((3, x), a, 4)
                                   `--> ((2, y), a, 5) --> ((3, y), a, 6)

Which one is the latest value for key a? Which updates, semantically, were lost? In a non-CRDT system, which value (4/x or 6/y) is, or should be, displayed and considered the latest?

> This is a problem! If you claim to be a CRDT and offline-first or whatever, then as a user, I expect that the operations I make while I'm disconnected aren't just going to be destroyed when I reconnect, because someone else happened to be using a computer with a lexicographically superior hostname (or however you derive your vector clocks).

You're conflating the data(base) model with the log, and we can't generalize that all cases of data models or merge conflicts are cases of "I expect all my operations to be the latest and visible to me" - they are semantically different. If the writes are on the same key, one of them has to come first if the notion of a "latest single value" is required. If the writes are not on the same key, or not key-based, multiple values appear where they need to. What we can generalize is that by giving a deterministic sorting function, the "latest value" is the same for all participants (readers) in the system. From a data structure perspective this is correct: given the same set of operations, you always get the same result. For many use cases, LWW works perfectly fine, and if your data model requires a "different interpretation" of the latest values, you can pass your custom merge logic (=sorting function) into OrbitDB. The cool thing is that by giving a deterministic sorting function for a log, you can turn almost any data structure into a CRDT. How that translates to an end-user data model will depend on the use case (e.g. I wouldn't model, say, "comments on a blog post" as a key-value store).

If you're curious to understand more, I think the model is best described in the paper "OpSets: Sequential Specifications for Replicated Datatypes" [1]. Another two papers from the same author that may also help are "Online Event Processing" [2] and "Moving Elements in List CRDTs" [3], which show how, by breaking the data model down to be more granular than "all or nothing", composing different CRDTs gives rise to new CRDTs - which I find beautiful. Anything, really, that M. Kleppmann has written about the topic is worth a read :)

[1] "OpSets: Sequential Specifications for Replicated Datatypes" - https://arxiv.org/pdf/1805.04263.pdf
[2] "Online Event Processing" - https://martin.kleppmann.com/papers/olep-cacm.pdf
[3] "Moving Elements in List CRDTs" - https://martin.kleppmann.com/papers/list-move-papoc20.pdf


OK, I understand now. I guess my points then translate to:

1. Modeling an append-only log as a CRDT is trivial

2. Building a database on top of a "CRDT" append-only log doesn't make the database a CRDT


No, on both. See above.


Last-Writer-Wins is a conflict resolution strategy that can be used by any kind of data type that needs conflicts resolved, CRDTs included. Unfortunately it's not a very good one: even if you use vector clocks instead of wall clocks, it doesn't give you much stronger guarantees than determinism. That is, given two concurrent writes, the winner is essentially arbitrary. LWW is a merge strategy of last resort; if that's the only thing your CRDT system offers, I'm not sure it's really fair to call it a CRDT system.


You can build this using 3Box, which has extended OrbitDB with a DID-based access control system and user permissions. Check it out here: https://docs.3box.io/build/web-apps/messaging

3Box also has support for members-only OrbitDB threads, which restrict posting to members, and for encrypted OrbitDB threads, which make posts private to the group. To Mark's point above.


That's interesting! In the case of the persistent threads mentioned in the link, can the set of moderators be mutable and still have eventual consistency?


Yes, the list of moderators is mutable. It's addition-only out of the box, but to create a system where removing moderators is possible, you can create a new thread with the new set of mods (minus the one you removed), and the first entry in that new thread can reference the old thread. This model gives the new set of mods control going forward, but the old content will still have the old set of mods.

This also works for encryption in members-only threads: to remove a member, you create a new thread without that member and link to the original thread. This gives forward secrecy, since new encryption keys are generated for the new thread.
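
In pseudocode, the rotation pattern looks roughly like this - `createThread` and `post` are stand-in names for illustration, not the actual 3Box API:

  // Hypothetical sketch of the "rotate the thread" pattern described above.
  async function removeModerator (oldThread, mods, removedMod) {
    const newMods = mods.filter((m) => m !== removedMod)
    const newThread = await createThread({ moderators: newMods })
    // the first entry links back, so readers can still traverse old history
    await newThread.post({ type: 'continuation', previous: oldThread.address })
    return newThread
  }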

We're working on improving this system over time, but it requires some more advanced cryptography, such as proxy re-encryption (like NuCypher).


There ya go!


What happens if someone posts illegal content?


Also, what happens if someone posts content that is legal in one country, but illegal in another? Can end users filter the content they host to content within their own country?

Examples: content about Tiananmen Square 1989.


OrbitDB databases are p2p database instances on top of IPFS, and front-ends or users can always filter the content they display.


Authors can always remove/delete their own posts from a thread, and threads also have a set of moderators, which needs to be set in the thread configuration.

More advanced moderation tools can be built on top, too.


Hello! Thanks for maintaining an open source project. I am excited to see IPFS take off.

Some basic questions, as I am still struggling to see the use case of this project.

* How did this project get started? What problem is it trying to solve?

* Are there any real world examples of where this project is used?


> * How did this project get started? What problem is it trying to solve?

OrbitDB got started because we wanted to build serverless applications, especially for the web (i.e. applications that run in the browser). Serverless meaning "no server" and no central authority - i.e. something that can't be shut down.

OrbitDB gives you tools to build systems and applications where the user owns their data - that is, data that is not controlled by a service. As a simple example, imagine a Twitter that doesn't have one massive database for all tweets, but rather one database for each user.


One great piece of writing for thinking about the use cases, and about what kinds of systems and applications can be built following the concepts applied in OrbitDB, is "Local-first software": https://www.inkandswitch.com/local-first.html (there's probably a thread on that somewhere here too).


tl;dr: You use this any time you want to have mutable data shared across a peer-to-peer network.

I wasn't there at the beginning, but I believe the project came out of trying to achieve said mutable state within IPFS (which, for other readers, is content-addressed and therefore append-only).

http://orbitdb.org lists all of our current users; the biggest is Metamask, by way of https://3box.io. https://tallylab.com is building with it for remote encrypted backup and shared tallies, and https://github.com/dappkit/aviondb is a MongoDB-like interface for it.
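
To make the tl;dr concrete, here's a minimal usage sketch (API roughly as of the 0.2x releases - check the docs for your version):

  // Sketch: open a replicated key-value database and mutate it. Peers
  // that open the same address converge on the same state.
  const IPFS = require('ipfs')
  const OrbitDB = require('orbit-db')

  async function main () {
    const ipfs = await IPFS.create()
    const orbitdb = await OrbitDB.createInstance(ipfs)
    const db = await orbitdb.keyvalue('example.settings')
    await db.put('theme', 'dark')        // a mutation, appended to the oplog
    console.log(db.get('theme'))         // 'dark'
    console.log(db.address.toString())   // share this address with other peers
  }

  main().catch(console.error)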


It's still so interesting to me that this never caught on, particularly in the "distributed age." This feels like a hidden gem in the research that has been covered up by the sands of time.


Absolutely! I don't know the exact history and why it hasn't been applied much, and I'd be very curious to learn, but it's definitely a gem.


It would be great if this achieved API parity with ES. Being able to swap out parts of the ELK stack would make tools like kibana even more powerful.


I hear you, and I know why you'd say that, but wow, the API surface area of ES is ginormous. Maybe the 80-20 rule goes a long way here, but I wouldn't expect API parity to be a simple matter of exposing the same REST endpoints -- it's the payloads that'll be the headache.

I actually strongly considered that just with Solr, which has the extreme benefit of using the same query language under the hood, but the more I scratched, the more I found it would be a horrific amount of work.


Plus the Elasticsearch API isn't especially nice to use. I haven't tried their new SQL, since it requires an enterprise licence or something.


Do you mean encoding the lucene queries as JSON objects into the ES endpoints, or do you mean the actual lucene syntax (as would be surfaced by kibana et al)?


I mean the Elasticsearch API. Kinda what you were referring to in the first part of your sentence, but I don't know why you'd say it like that, especially since the Elasticsearch API covers other things, such as mapping indexes and other cluster administration.


Right now a colleague and I are working on a project called TallyLab. It's a "data diary" based on IPFS and OrbitDB. By utilizing those technologies, we hope to create a distributed, decentralized, end-to-end encrypted, peer-to-peer system that gives people full control over their data. We believe it will be the first of many "GDPR-first" or "HIPAA-first" apps.

Sign up for our beta program at https://tallylab.com :)

