Osm-p2p: a peer-to-peer distributed OpenStreetMap database

pfraze · on June 16, 2016

Cool, this is made by substack (the author of browserify) using internal protocol-modules of the dat project. Glad to see it launch.

https://twitter.com/substack

https://github.com/mafintosh/hyperlog

http://dat-data.com/

Doctor_Fegg · on June 16, 2016

This is certainly worthwhile and interesting in a lot of ways, but I'm uncomfortable with calling it OpenStreetMap. OSM's raison d'etre is to be collaborative and purely factual, whereas this is providing "your own, private OpenStreetMap" (http://www.digital-democracy.org/blog/openstreetmap-without-...). Naming it something like CommunityMap, while nodding to the fact it's based on parts of the OSM stack, would have been clearer and kinder.

(Also trademark issues, but let's not get into those here.)

chippy · on June 16, 2016

The project seeks to both accept OSM.org data and to contribute back:

>Here’s what we would like to have soon, to better interoperate with the rest of the Open Street Map ecosystem:

   > import public osm data from a region into osm-p2p
   > export osm-p2p edits back to public open street map

Bedon292 · on June 16, 2016

So this is focused on offline editing and sharing? Very cool, though I initially thought it was going to be a p2p torrent style of keeping OSM data synced across the internet.

substack · on June 16, 2016

The focus is offline, but the underlying techniques work just as well across the public internet. For example, you could use https://www.npmjs.com/package/webrtc-swarm to sync the hyperlogs:

    var wswarm = require('webrtc-swarm')
    var signalhub = require('signalhub')
    var swarm = wswarm(signalhub('p2p-map', ['https://signalhub.mafintosh.com']))

    var osm = require('osm-p2p')()
    swarm.on('peer', function (peer, id) {
      peer.pipe(osm.log.replicate()).pipe(peer)
    })

gmaclennan · on June 16, 2016

Yes, right now this is for setting up your own "OpenStreetMap" that will work offline. The author of this work, substack, is also working/thinking about p2p infrastructure for OSM. See https://peermaps.github.io/

mynewtb · on June 15, 2016

How are conflicts handled?

gmaclennan · on June 15, 2016

If two or more peers edit the same record, it doesn't create a conflict but instead that record simply has two versions in the database - like a fork in a git repo. These can be merged at any time in the future, but prior to that two versions can continue to exist and replicate. For more about why we designed it like this see: https://github.com/digidem/osm-p2p-db/blob/master/doc/archit...
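To make the "fork instead of conflict" idea concrete, here is a minimal in-memory sketch. It is not the osm-p2p-db API: `ForkStore`, `put`, `get`, and the `peer@seq` version ids are hypothetical names chosen for illustration. The only point is that every edit is appended, a record may have several current versions ("heads"), and a later edit that links to all heads merges them.

```javascript
// Hypothetical in-memory model of a forking record store (not the osm-p2p-db API).
function ForkStore () {
  this.nodes = {} // version id -> { key, value, links }; never deleted
  this.heads = {} // key -> version ids that no later version links to
  this.seq = 0
}

// Append a new version of `key`. `links` lists the versions it supersedes.
ForkStore.prototype.put = function (peer, key, value, links) {
  links = links || []
  var id = peer + '@' + (++this.seq)
  this.nodes[id] = { key: key, value: value, links: links }
  // Heads that this version links to stop being heads; the rest stay.
  var heads = (this.heads[key] || []).filter(function (h) {
    return links.indexOf(h) === -1
  })
  heads.push(id)
  this.heads[key] = heads
  return id
}

// Return every current version of `key`; more than one means a fork.
ForkStore.prototype.get = function (key) {
  var nodes = this.nodes
  return (this.heads[key] || []).map(function (id) {
    return { id: id, value: nodes[id].value }
  })
}

var store = new ForkStore()
var v1 = store.put('a', 'node/1', { lat: 1, lon: 2 })
// Two peers edit the same record offline, both starting from v1.
var v2a = store.put('a', 'node/1', { lat: 1.1, lon: 2 }, [v1])
var v2b = store.put('b', 'node/1', { lat: 1, lon: 2.2 }, [v1])
var forked = store.get('node/1') // two heads, nothing was rejected
// A later edit that links to both heads merges the fork.
var v3 = store.put('a', 'node/1', { lat: 1.1, lon: 2.2 }, [v2a, v2b])
var merged = store.get('node/1') // a single head again
```

Note that `store.nodes` still holds all four versions after the merge, which is the "full history" property described above.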

pfraze · on June 16, 2016

How similar is this to how CouchDB handles conflicts?

gmaclennan · on June 16, 2016

Substack can probably give a better answer, but my understanding is that CouchDB only holds a version history since the last replication between clients, and then conflicts need to be resolved before replication can continue, after which version history is lost. With OSM-p2p no data is ever deleted; it all stays in the underlying hyperlog. It's more like git than CouchDB: each record has a complete history, can be forked and merged, and will continue to sync/replicate.

rakoo · on June 16, 2016

As someone who's been playing with CouchDB and has become kind of biased towards it, I feel like I need to make a small correction here. Long story short: CouchDB can behave exactly like hyperlog if you want it to, or it can behave as something that gives the user the best information it can to resolve conflicts.

You can think of CouchDB not as a key->value store, not even a key->document store, but rather a key->tree of documents, where each node is the same document at a different revision at some point in time. The root of the tree is the initial revision, and the leaves are the latest revisions: if there is only one leaf (because the other ones have been marked as deleted (which is not a real deletion, it's just marking this leaf node as "not a possible current value")) then it's the correct value of the document, but if there are multiple, then there is no correct value, only multiple choices. When you want to write a new revision, the node that is the parent (it can be a leaf or it can be an internal node, it can even be a deleted leaf!) becomes an internal node, with a new child. Just like git. Except CouchDB doesn't give you any guarantee that you'll be able to retrieve the internal nodes' content; the only nodes you're 100% sure to have access to are the leaf nodes. However, you will have full access to how you got where you are (even though you usually don't need it, because the really important thing is the current (possible) values).
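The key->tree model can be sketched in a few lines. This is a toy model, not CouchDB's storage format or API: `RevTree`, `add`, `leaves`, `candidates`, and `lineage` are hypothetical names, and the revision ids are made up. It only shows how "multiple non-deleted leaves" corresponds to "multiple possible current values", and how marking a leaf deleted resolves that.

```javascript
// Toy model of one document's revision tree (not CouchDB's actual format).
function RevTree () {
  this.revs = {} // rev id -> { parent, doc, deleted }
}

// Record a revision as a child of `parent` (null for the root).
RevTree.prototype.add = function (rev, parent, doc, deleted) {
  this.revs[rev] = { parent: parent || null, doc: doc, deleted: !!deleted }
}

// Leaves are revisions that no other revision names as its parent.
RevTree.prototype.leaves = function () {
  var revs = this.revs
  var parents = {}
  Object.keys(revs).forEach(function (r) {
    if (revs[r].parent) parents[revs[r].parent] = true
  })
  return Object.keys(revs).filter(function (r) { return !parents[r] })
}

// Possible current values: leaves that are not marked deleted.
RevTree.prototype.candidates = function () {
  var revs = this.revs
  return this.leaves().filter(function (r) { return !revs[r].deleted })
}

// Walk from a revision back to the root to recover its lineage.
RevTree.prototype.lineage = function (rev) {
  var path = []
  while (rev) {
    path.push(rev)
    rev = this.revs[rev].parent
  }
  return path
}

var tree = new RevTree()
tree.add('1-a', null, { name: 'bob' })
tree.add('2-a', '1-a', { name: 'bob', age: 30 })
tree.add('2-b', '1-a', { name: 'bob', age: 31 })
var before = tree.candidates() // two leaves: no single correct value
// Resolve by adding a deleted child under the branch we reject.
tree.add('3-b', '2-b', null, true)
var after = tree.candidates() // one live leaf remains
var history = tree.lineage('3-b')
```

The rejected branch is still in the tree after resolution; it is just no longer a candidate for the current value.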

Because you still need to work and you still need to have a revision to work on, CouchDB gives you a way to automatically select one of those conflicting revisions and pretend it's the correct one; but flip one bit in the query (just add the parameter "conflicts=true") and it will give you all the conflicting revisions so you, the user, can make a choice. You don't have to; it will continue to work without it, but at some point you'd better clean the db and clearly state what the truth is.

The other way to use it is a fairly common pattern: create a unique id out of each and every version of your object, and store them all in CouchDB. You'll have the assurance that old versions won't be removed, but you'll have to tweak CouchDB a bit to "group" relevant ids together (by using a view, typically). In this usage each key will have a single node, which will be both a root and a leaf node.
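A rough sketch of that second approach, under stated assumptions: the `objectId` field and the `node-1-v1` style ids are hypothetical, and the `emit` function and grouping loop below are local stand-ins so the snippet runs without a server. In CouchDB itself `emit` is provided by the view engine and the map function would live in a design document.

```javascript
// Local stand-in for the view engine's emit(); collects rows in memory.
var rows = []
function emit (key, value) {
  rows.push({ key: key, value: value })
}

// Map function in the shape a CouchDB view uses: one row per stored version,
// keyed by the (hypothetical) objectId field so versions group together.
function map (doc) {
  if (doc.objectId) emit(doc.objectId, doc._id)
}

// Every version of an object is its own document with its own unique _id.
var docs = [
  { _id: 'node-1-v1', objectId: 'node-1', lat: 1, lon: 2 },
  { _id: 'node-1-v2', objectId: 'node-1', lat: 1.1, lon: 2 },
  { _id: 'node-2-v1', objectId: 'node-2', lat: 5, lon: 6 }
]
docs.forEach(map)

// Group rows by key, as querying the view by key would let you do.
var versions = {}
rows.forEach(function (row) {
  if (!versions[row.key]) versions[row.key] = []
  versions[row.key].push(row.value)
})
```

Because no document is ever updated in place, nothing here depends on old revisions surviving compaction.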

Whichever way you want to use it, CouchDB's replication will make sure all nodes end up on every peer, whether they are leaf nodes or not, and whether the tree on a given peer is complete or not. The replication happens in the background, in parallel; it doesn't block the user working on their stuff, and isn't blocked by the user either.

CouchDB has unfortunately suffered from the misnamed "revision" term, and it probably was a bit too early in the game, but it pains me that such a great DB is not more widely considered a viable alternative, not because it is lacking on the technical side, but probably because there isn't enough information or blog posts about it.

substack · on June 16, 2016

You can sort of do these operations with couchdb, but you've got to fight against some bad assumptions for this use case of offline edits that could span weeks or months. From https://wiki.apache.org/couchdb/Replication_and_conflicts:

    With CouchDB, you can sometimes get hold of old revisions of a document.
    For example, if you fetch /db/bob?rev=v2b&revs_info=true you'll get a list
    of the previous revision ids which ended up with revision v2b. Doing the
    same for v2a you can find their common ancestor revision. However if the
    database has been compacted, the content of that document revision will
    have been lost. revs_info will still show that v1 was an ancestor, but
    report it as "missing".

That is a very dangerous feature for us. Also, the conflict avoidance algorithm means that users can't work on branches in parallel like in git because couchdb obstructs that workflow with 409 responses. This is a very bad feature when you're collecting data in remote areas and replicating with other databases and don't have the time, battery life, or the expert skills to resolve a "conflict". A database should never have states where it rejects new information.

rakoo · on June 16, 2016

Note that, as stated, the content of the historical revisions may be missing, but the full lineage will still be there and will be replicated. Actually what happens is that the "_id" field and the "_rev" field (the revision number) will always be present, while the other fields may be removed.

You are also never blocked by a 409. A 409 happens when CouchDB wants to help you reduce the chance that a conflict happens, but you can bypass it and create a conflict if you don't want to be bothered. In fact, that's how replication works and the only way conflicts happen: when two dbs replicate to each other, they may have a different history for the same document, but the two histories are sent to the other side so that each side can have all the timelines. You have to use another, non-obvious endpoint for that, see http://docs.couchdb.org/en/1.6.1/replication/conflicts.html?...:

    So this gives you a way to introduce conflicts within a single
    database instance. If you choose to do this instead of PUT, it means
    you don’t have to write any code for the possibility of
    getting a 409 response, because you will never get one. Rather,
    you have to deal with conflicts appearing later in the database,
    which is what you’d have to do in a multi-master application anyway.

mbrock · on June 16, 2016

I was looking at this stuff the other day after following links from the "hyperlog" npm package. Kudos for this really great inspiring work.