Operational Transformation (algorithm behind Wave, Google Docs)

jerf · on May 17, 2010

I worked out the logic for a system like this for one of those "Someday I'd like to release this" projects. The way I looked at it is that basically it's just patch theory. It's the same problem you get with a source-control system, it's just happening faster. You can use the exact same patch theory.

In fact I think patch theory may even be stronger than the algorithms described here. This link says "we need to impose a restriction on the client: we can never send more than one operation at a time to the server.", but patch theory can handle that case just fine and still "recover" later. The only thing you need is some sort of merge policy, and my plan was just to arbitrarily (but consistently) pick a side, because unlike a code merge where you could literally have kilobytes of conflict and discarding either side could be bad, in the collaborative case you're looking at a few small commands. (Even a large copy/paste is just one operation, from the human perspective. No matter how collaborative your editor, two people editing the same character is always going to be a problem; there's really no point trying to get too clever.)

If you are interested in this topic, you might want to read http://en.wikibooks.org/wiki/Understanding_darcs/Patch_theor... as well. I think patch theory may be stronger than what was described because it can use patch inverses to do a few things I'm not sure operational transformation" can, and coming up with inverses for the collaborative editing case is quite trivial. Note that darcs had a well-known exponential case, I think in the collaborative editor case it could never come up as everybody is always within a few seconds of "HEAD".

(Incidentally, the patch approach would also potentially allow offline interaction as well, though then you can't ignore the merge issue quite so thoroughly and you could conceivably have exponential merges again, never got far enough to tell. Sounds like this approach would not allow that. Does Wave allow offline interaction which it then propagates up? If so then it may be using something more like patch theory.)

Rauchg · on May 17, 2010

OT is not patch theory. Google already had a patch-based approach (currently used in Bespin): http://code.google.com/p/google-mobwrite/, which was replaced with OT. No diffing algorithms are used, even for producing the operations (which is probably why they're not using `contenteditable`)

Wave does allow offline interaction, and their OT algorithm is optimized for speed (for example, for the execution of possibly hundreds of buffered operations). The tradeoff is the requirement for an acknowledgement from the server after sending each operation or batch of operations.

I'm currently working on an OT implementation that doesn't require ACKs from the server in my spare time, and relies on a vector clock as opposed to a single timestamp. It has higher CPU processing power needs, which is why Wave's algorithm is also more suited to mobile devices[1]

[1] http://portal.acm.org/citation.cfm?id=1718949 I believe one of the people involved in this paper is also involved in Wave's OT implementation paper.

jerf · on May 17, 2010

I wasn't using diffs, my editor directly represented everything as patches, to the extent it exists at all. Much easier to do in some GUI environments than from a web browser.

Of course one of the issues is that a lot of stuff comes down to details. Working in a browser is a hell of a constraint, I wouldn't dream of writing a collaborative editor in a browser. (Not because it's impossible, but because that's an amount of stress I don't need in my life. Obviously it's possible.) It would be interesting to talk to the people involved about the issues, but I suppose that's almost always true.

(BTW, this is me interested in someone who has actually done the stuff I've only designed and curious about the exact details to a fairly deep level about the actual issues they encountered and stuff, as partially indicated in my "to the extent it exists at all" above. I am not trying to tell anyone what they should have done, I'm just chewing the fat as an interested party who stands to learn some interesting-to-me-things. I'm not in a position to tell implementors anything. As usual, tone can be hard to convey in text.)

gritzko · on May 17, 2010

Patch-based stuff is simply based on obsolete 30 yrs old assumptions. Why should you need a combinatorial algorithm to detect changes if you can track every user's action? etc etc

OT is a dead end. The problems they struggle with could be trivially bruteforced by employing unique symbol identifiers. http://bouillon.math.usu.ru/articles/ctre.pdf

Rauchg · on May 17, 2010

"According to an insider [10], in 6 months since the projects’s launch the possibility of working with partial histories is not implemented yet; pages are bootstrapped with their complete histories."

What I will concede is that their OT implementation is probably not perfect yet. If you read the comments on some transformation functions on their Java code, they shown concern and do extra logging in hopes to track unexpected issues they haven't solved yet.

Rauchg · on May 17, 2010

Didn't read the entire paper, but the criticism on Google Wave is poor, baseless and subjective.

"The authors even had to enrich XML with additional entities named “annotations” to make it suitable for their needs"

This is what Etherpad did for adding support for lists, bold, italics, etc. Did anyone complain? Does it not work?

Granted, Wave's approach is way more ambitious. They support any sort of XML groves, and not just simple 1-level deep tags like Etherpad.

gritzko · on May 17, 2010

Some unexpected issues are not solved for past 20 years.

I do not attack annotations or XML per se. The general approach "neither thing is precisely OK, so let's use both" seems questionable to me. I believe, XML is complex enough, so why do we need XXML? Finally, by using annotations they recognize that XML nesting is a pain in the ass, but they did not solve the problem, just sort of relieved it.

Actually, I doubt a lot, whether XML is actually carrying some useful load in the Wave. But the last one is just a feeling.

Rauchg · on May 17, 2010

Have you taken a look at admissibility-based transformations (ABT) [1]?

It seems to solve the issues that concern you (validating the correctness of OT).

[1] http://portal.acm.org/citation.cfm?id=1713726

saikat · on May 17, 2010

How is your approach different from http://neil.fraser.name/writing/sync/ ? You said you aren't doing any diffs - so are you just creating a patch per user action (like typing the character "a" is actually represented as a patch for insertion of a at position 0?). I looked at using differential synchronization at some point, but it has even higher latency/way more memory usage than OT approaches (the server needs to have a full copy of the document per client). You can't constantly be diffing and sending patches using the synch-diff approach - rather you have synchronization cycles that happen every few seconds.

jerf · on May 17, 2010

Yes, I represented everything as patches. The native file format was going to just be a series of patches, so that in the single-user case I could implement the "never have to hit save" functionality just by appending to the patch stream. (An active save might have reset the patch stream to just be a straight dump of the current document.)

In my more evil moments I considered specifying the save file as an XML document that deliberately left off the final close tag. Mu-hu-ha-ha-ha!

Pretty much the same as that doc, yes, though I never saw it. (I did see and was inspired by the Darcs patch theory doc I linked.) My plan was for people to "share" documents, so a server that had a full copy of the document was not a problem, it did by definition. For that matter, in the Google Wave model, I'd rather run my own Wave server so having a full copy of the document there isn't a problem either. I, personally, actively don't want my cloud to have the canonical version of the document.

"You can't constantly be diffing and sending patches using the synch-diff approach - rather you have synchronization cycles that happen every few seconds."

Why not? Assuming just a handful of users (not hundreds), and given that patches have a certain flexibility about when they get applied, why not just send things as they occur? (Within the topologies shown, that is, not just everyone sending to anybody as they feel like it. You could probably get by with full P2P without too much work, but having the central server simplifies things a lot and I don't think you lose much. I was never planning on 100 people "in" the same document.) Clients should probably chunk things up a bit during periods of rapid typing. (I was planning on going to raw sockets if I needed to; most higher level protocols would have indeed choked this from what I saw.)

Anyway, it's pretty theoretical now. Wave is probably good enough and this wasn't even the focus of my project, just an incidental sideline to what I really wanted.

saikat · on June 11, 2010

Neil Fraser made a comment about his approach and why it introduces complications here - http://www.web2media.net/laktek/2010/05/25/real-time-collabo... . I'm glad he clarified =).

saikat · on May 17, 2010

The thing about not constantly being able to send patches - I'll need to think about it again, but I thought I came up with a reason for why it didn't work in a general case.

And I didn't just mean saving a copy of the document on the server - you need to do this for most collab. algos. The differential synchronization algorithm requires saving a copy of the document _per client_ on the server.

Otherwise - sounds cool - you should definitely try to implement this and see how it goes.

mvaerle · on May 17, 2010

I was also wondering about the difference between patch theory and this. I wonder if one of the reasons for operational transformation is that it is more feasible in semi-realtime environments, for example because transforming the operations is (computationally) easier than doing a merge on the data itself.

jerf · on May 17, 2010

I can't speak for Google's system, but with the system I had envisioned "merge" time would have been proportional to the amount of patches diverged, which would always have been tiny. The common case is that you go to apply another user's patch and it applies cleanly. You have to be ready for conflicts but they will generally be rare, even if your latency is fairly large. Remember you actually have to be stomping directly on top of another user's work for the patch to have a problem; possible, certainly, would occur in practice, but not normal.

(Careful design of the patch commutation algorithms could further reduce the problems you face, this and "pasting in a huge swathe of text" being the two basic "huge" operations available to a user in realtime. Imagine one user highlights a huge swathe of text and deletes it while another is trying to add a character or two. Normally this could be a conflict, but depending on how you represent the delete, you might be able to make it so the delete simply eats all further edits to the deleted text. In fact in my system this was actually a major problem due to some other issues (in my system you could limit your visible scope down to, say, a specific "chapter" and then I have a UI problem if that entire chapter gets deleted), but this wouldn't affect Google Wave anywhere near as badly, unless it also has narrowing available.)

gritzko · on May 17, 2010

The present OT is very much an epicycle theory. They always have that one last issue to fix. In fact, several, if not the most of, classic OT papers were later found to be inconsistent, e.g. some combination of concurrent edits makes copies diverge (source: P. Molli). And that problems are extremely hard to find/understand. Apparently, the original authors of OT had a local network in mind: universal connectivity, instant message propagation, etc. But the internet is full of "relativistic" effects: different clients have different views of the state, messages propagate with finite/varying speed, etc etc. After reading the article, try to estimate what happens if e.g. one client syncs hourly/daily/monthly (like in the git), or suppose we don't have a central server (like in git), or if the stream of edits is too dense... like we really have a hundred of monkeys... Maybe Google engineers managed to get the probability of OT screw-ups lower than the probability of all other screw-ups... but that does not make the theory any better.

(...and the way they dealt with XML is a very special song. OT is pretty much a complexity explosion itself, but they also multiplied it by the complexity of XML; got 15 kinds of mutation operations and all the stuff. The fact they invented "annotations" to skip XML nesting speaks for itself.)

Conclusion. OT is a big messy mess.

adrenalin · on May 17, 2010

Hey thanks, have a school-project in progress (a google wave like project) that need OT, this article will help.

saikat · on May 17, 2010

For people interested in the approaches to real-time collab, I made a summary writeup of what I researched a while back - http://techblog.gomockingbird.com/using-real-time-collaborat... . Still hoping to actually dive into all the algorithms later, but there is definitely a lot out there, just mostly in the form of academic papers.

mvaerle · on May 17, 2010

Make sure you watch this guy: http://www.youtube.com/watch?v=84zqbXUQIHc (presentation at google).

They developed a lot of systems using OT, even going so far as doing collaborative autocad. There are papers on that if you search for CoMaya and CoAutoCAD. The complexity, especially of the last one, is mind boggling though.

billjings · on May 17, 2010

"...in light of the fact that Google’s documentation on the subject is sparing at best, I thought I would take some time to clarify this critical piece of the puzzle."

Is Google's documentation crap?

hammerdr · on May 17, 2010

Just very broad: a series of white papers and a sample Java implementation which doesn't have any persistent storage.

There is certainly space for someone to document the specifics of the Wave Protocol (and, along with that, operational transforms) and this blog post goes a long way toward that.

Edit: I should also note that the OT white paper, like most white papers, builds on prior knowledge and expects some background in the area of research. You have to read a few white papers in order to really grok it.

Jd · on May 18, 2010

There was a long blog series on OT not long after the Google Wave that is linked from the Wave docs:

http://blog.wavenz.com/2009/07/introduction-to-operational.h... http://blog.wavenz.com/2009/07/operational-transformation-al... http://blog.wavenz.com/2009/07/operational-transformation-an...

(full disclosure: I helped edit these)