As somebody who has worked full-time, over-time, and essentially in my sleep with Word files, PowerPoints, and highly sensitive bid documentation...yeah, I have to agree working in the cloud for this kind of stuff strikes me as career suicide. Maybe not in your turf, re: software development, but with respect to management, operations and marketing people, there should only be one person with the key to the kingdom. I'm not kidding about this, even if just talking internal development.
Also, this is why everything went out as a locked down PDF, unless explicitly mandated otherwise by RFP/etc specifications...and even then, Track Changes > Accept All Changes is gospel. Anybody in my line of work saw what the .GOV did with converting PDFs and simply redacting with a graphic over the text...yeah, that's why I'm a first-class proposal developer, because I've seen carnage yo.
Even if you are working on something that is not security/negotiation sensitive, this is still scary, and the layman can approximate it with Google docs document history and the like.
Document history enables thoughtcrime detection, especially as it becomes more and more atomic towards actual keystrokes. I wonder how long it will be until typed and deleted text is used as evidence in a court of law.
Or you could "Keep Calm and File > Make a copy..." of the Google Doc and share that instead of the original draft. The copy doesn't copy the version history.
This is personal experience, so I'm not trying to be intentionally glib. Maybe for one person. For distributing to two to five people, this is a terrible idea, because here's how Mr. Murphy rolls: As soon as you get edits from A, B, and C of the team back, and update the document and are ready to distribute another version, the last person Y will submit their content. Person Y's content edits something that Person B already edited. This kind of thing snowballs...hard. It's not about them seeing the past, it's about appropriate control for the future (aka the deadline).
I mean, I know how to really whip the donkey spit out of "Compare" and the available tools, but until you've seen just how sideways this stuff can go, there's always room for a professional pest...an expert Proposal Writer...there's so much learned in the trenches that we all have a certain cynicism built up, and that's the fuel that gets the junk in by deadline time.
Honestly the only way I've found to make this stuff work is to use something like LaTeX, where the source is a text-based format, and then put it in a real version control system like git, which has proper support for merging and resolving conflicts. Then when the document is correct you compile it to a PDF, and that's what you actually send out (bonus: much smaller file size than something like Word).
(Of course, the big downside is that all your collaborators have to know LaTeX, and not use funny macros that the others don't understand)
I don't understand why most of the commenters here are focusing on the privacy implications rather than the technical aspects.
Is this really a privacy breach? It's been obvious that Google stores revision history since it launched—you've always been able to access a thorough revision history in the UI itself...
I think the issue for some (including myself) is that this revelation shows that the vulnerability of a compromised document is greater than the apparent contents of the document. It includes all keystrokes, which could expose other ideas the writer might have had (but discarded), etc. This fact makes me even less inclined to use Google docs.
The parent comment was clearly aware of the versioning feature, and appears to be more surprised by the keypress-level granularity of the versioning that may not have been apparent in the official UI.
The creation story for this is really neat. This could be an amazing tool in the school setting, especially for people that teach in university writing centers. This isn't just asking an author for a peek behind the curtain, asking a few questions about what they were thinking at the time of writing, this is Breaking the Magician's Code level stuff!
I am curious to see who is (brave enough?) to show their writing process in all its glory.
I'd be happy to upload scans of one of my more recent short stories, "Pink Paint Rain," simply to show how much true effort goes into massaging lines and language a la code.
I have about 7-10 print outs of the story with mark ups. Before switching to English & Creative Writing I was in Computer Science and pretty skilled with C++ at the time, so I can come up with a decent correlative:
Every version I printed was like running it through a compiler - this is to evidence that even if a piece of code or a line of text is functional, is it efficient, and, if at all possible, an excellent construction? These are subjective concepts that are innate in language. There may be several writers who can put words to paper directly from their head with no revision process - hell, it's what I do when I'm "practicing" on my IBM Selectric III to commit to writing in permanence (think before I type) - but for the greats, it has always been an iterative process.
To keep this from being all about me, allow me to provide a link to something that may be to your liking:
Hi, I'm a computational linguist and I would find it really great if some people could share their traces of their typing/editing process.
In programming, we have editors that have strong support for writing because we know exactly what the semantics of code are, and what good operations for editing/refactoring are. With writing prose, the best we currently have is edit histories from Wikipedia articles, which are on a much larger timescale and full of things that should not be part of the editing process (vandalism, NPOV wars, etc.)
I broke out into a cold sweat watching this as I remembered all the times I've inadvertently pasted sensitive stuff into a document. It's still very cool though, I'll just need to remember to be careful when sharing documents.
I wonder how much sensitive information is inadvertently pasted into a browser location bar or autofill text box that's silently captured by web apps like Google Docs?
I know I've accidentally done the "paste password" into those places at times.
I believe (can't recall the source at the moment) that on Google computers, they actually watch all of your input for password input, and if you enter your password somewhere other than the official Google single-sign on interface, will make you rotate your password. They're pretty serious about not letting you type your password anywhere other than where you're supposed to.
That sounds like a pretty big security hole. Just "type" random letters until you get a warning saying not to enter your password outside of password fields.
There's no such warning displayed, because that would be a security hole.
This password security measure is a Chrome extension that's required by company policy to be installed on all corporate machines. It watches all input (to browser forms) and if it detects your password being typed anywhere other than an actual sign-on page, then the next time you sign on successfully you're required to change your password. I believe there's also an e-mail notification, but it's delayed.
This is actually a pretty good password security technique, specifically because people often inadvertently type their password into the wrong forms due to focus errors, lack of caffeine, etc.
Because I can't think of an efficient way to do that that doesn't involve having the extension have access to the password.
I mean, you could store the password hash + length, but then you're securely hashing every single overlapping substring of what you enter, which is not exactly fast. Especially as KDFs are designed to be slow.
And if you store the password hash then you enable an offline attack.
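To make the cost concrete, here is a minimal sketch of the "hash + length" idea, assuming the watcher keeps a sliding window of the last N characters typed and runs a slow hash on every keystroke. The class name, the iteration count, and the choice of PBKDF2 are all my own assumptions for illustration; nothing here is known to match what any real extension does.

```python
import hashlib
import hmac
import os
from collections import deque

ITERATIONS = 10_000  # hypothetical cost setting; real KDFs are usually tuned slower


def derive(candidate: str, salt: bytes) -> bytes:
    """Slow hash (PBKDF2-HMAC-SHA256) of one candidate string."""
    return hashlib.pbkdf2_hmac("sha256", candidate.encode("utf-8"), salt, ITERATIONS)


class PasswordWatcher:
    """Stores only salt, hash and length, never the password itself."""

    def __init__(self, password: str) -> None:
        self.salt = os.urandom(16)
        self.length = len(password)
        self.digest = derive(password, self.salt)
        self.window = deque(maxlen=self.length)  # last `length` characters typed
        self.kdf_calls = 0

    def feed(self, char: str) -> bool:
        """Process one keystroke; return True if the window equals the password."""
        self.window.append(char)
        if len(self.window) < self.length:
            return False
        self.kdf_calls += 1  # one slow hash per keystroke once the window is full
        candidate = "".join(self.window)
        return hmac.compare_digest(derive(candidate, self.salt), self.digest)


def scan(watcher: PasswordWatcher, text: str) -> bool:
    """Feed every character of `text`; report whether the password appeared."""
    return any([watcher.feed(c) for c in text])
```

The `kdf_calls` counter shows the problem raised above: once the window is full, every single keystroke costs one full KDF run. And the stored salt, hash and length are exactly what an offline attacker would want to get hold of.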
So then you're sending every keystroke people make to a central server?
Even assuming that the connection is secure (never a good assumption), that still means that there is a single point of failure. And one with drastic consequences.
True, but anyone that can gain privileged access to the computer is already king of the castle. Why attack it offline when you can just keylog it? I think it goes back to being one part of an overall security posture. Encrypt your workstations and people can't just pick them up and own them.
This seems really unlikely. Do they really calculate hashes for all words in a google document? What if your password is a sentence? Do they calculate hashes for all the fields in a spreadsheet too?
My browser URL bar has received a few passwords, usually just a local computer password but occasionally something more sensitive. Makes me wonder about the permissions my collection of browser addons has.
I've also chucked a few passwords into IRC in times past. Fortunately non-essential stuff but really motivated me to sort out some better solutions (SSH keypairs, etc).
Plenty of times I've come back to my computer, typed my OS password expecting that it was locked, waited for my display's turn-on lag, and found that it wasn't locked (grace period). I type the password blind, but reserve the enter until I have visual feedback.
At least typical IRC clients don't transmit until hitting enter. Browser omnibars and Javascript can send away every keystroke as it happens. Now I want to search all my Google Docs for my passwords -- let alone other stupidity I don't care to share.
A long time ago in a lecture hall far far away a head of school was giving us a pre-exam talk of some kind. It was to all health science students. As he talked he logged into the system. With the projector showing what he was doing he missed the tab key and typed his username and password into the username field. I had a look round the room and no one else seemed to have noticed. On his desktop sat a folder titled "Exam papers" or something similar.
> I wonder how much sensitive information is inadvertently pasted into a browser location bar or autofill text box that's silently captured by web apps like Google Docs?
Well, if you paste something in the location bar, presumably it'll already be on its way to Google (or whichever service handles autocompletion/suggestion)...
I had a bad habit of using it to remove style from pasted text. Shift+CMD+V (or something like that, muscle memory) does the trick, but I didn't know that a few years ago.
Well, I found that I would use search/URL fields as a "temporary scratchpad" for passwords, when I had to copy-paste something else, and didn't want to lose the password in my clipboard. The history means I don't have to worry about that anymore.
Not quite as bad, but Google Docs is of course used in interviews at Google (and elsewhere). It's a bit disconcerting when you realise that all the backspacing/revisions you made are there forever.
(makes me want to go back and see whether I can see other people's comments .... hmmm)
I find it fascinating to see how much deleting and rewriting the author did on the first two sentences of his Atlantic article. You can see the idea getting rewritten in many ways.
Is this a typical way to write a magazine article? I wouldn't have expected so much time revising the opening sentences before getting the rest of the article in place. (But there's probably a lot of variation between writers.)
I've written a fair number of magazine articles, and also a lot of white papers and other documents.
For me, I usually spend a day or two just thinking about it in my head. Going over what would be a logical thought flow and things like that. When I sit down to actually write, I tend to have very few revisions. My first draft is much closer to the typical person's 5th draft (I think), but that's because I've been revising and editing in my head first.
You remind me of a Japanese artist that envisioned the movements he was going to make to make his painting for a long time and then when he finally started moving the painting was done extremely fast. I forgot his name but I thought it was fascinating. I agonize over the first line of whatever I have to write for a long time but once that first line is done the rest goes easy.
I found that when I first started writing regularly I would spend a lot of time doing constant editing similar to the example shown in the linked article. This would become distracting and time consuming, then I'd forget other things I had wanted to say. So I found it better, for me, to just kind of write things in my head first and then sit down and write, which is almost more transcribing than "writing".
Thanks, that's really helpful for me. I am just starting to learn to write blog articles for my business and I certainly do a LOT of editing. Glad to see someone progress to where I want to be at... one day
I agree, it's really cool to see. I always write like this for the opening paragraph when I'm trying to get the hook right. It sets the tone for the rest of the article, which tends to flow a lot more freely.
Ahh I'm now hyper aware of it and realise I've chopped and changed that ^^ first paragraph a ton of times.
It's either that way, or the old trick / saying "Wait until you've finished your piece before writing the introduction paragraph, because how else are you going to know what the whole thing is about?" Chicken or egg, YMMV.
This is a very good reason to never use software in the so-called "cloud". I also remember years ago when someone showed me "Track Changes" in MS Word and other programs and how you could go back and look at, say, a bid offer and see if everything was on the up and up. You could see, esp. if the document was a form letter or canned response, to which other companies were offered different terms, you name it.
I dislike revision-able software for a number of reasons. Privacy is the foremost reason. Yes, yes, "if you've nothing to hide, you've nothing to fear..." That old chestnut gets trotted out every time someone worries about security or privacy.
Since about 2000, I keep my documents in plain text only on an encrypted drive backed up several times over -- none of the backups are online, but I'm still good if my house burns down, my machines get stolen, you name it.
You can also take advantage of all the great bonuses of revisions and then before sending the document just copy/paste the content into a new document that doesn't include the revisions. That seems sensible too.
Does sound like a very fragile workflow (there's no reasonable way to tell a full-history doc from a "publish grade" doc by glancing at the file on the filesystem).
Keeping everything in proper version control (possibly unzipped, to give usable diffs even for office document formats -- or in something like markdown) -- would at least raise the bar a bit -- there'd be a different process for sending a single version of a file, and sending all versions of (all) [a] file(s).
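For the "possibly unzipped" part, a rough sketch: .docx, .xlsx and .pptx files are ZIP containers, so the standard library can unpack them into a directory that a version control system can diff file by file. The function name and the paths in the usage note are made up for illustration.

```python
import zipfile
from pathlib import Path


def unpack_office_file(source: Path, target_dir: Path) -> list[str]:
    """Extract every member of a .docx/.xlsx/.pptx (all ZIP containers) into target_dir.

    Returns the sorted member names so the caller can see what will be tracked.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(source) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())
```

Something like `unpack_office_file(Path("proposal.docx"), Path("proposal_unzipped"))` before each commit would do it. The XML inside may still be written as very long lines, so pretty-printing it first could be needed before the diffs become readable.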
I suppose if you're already running an internal mail server, you could just do filtering there, making sure no version/history-rich documents pass out that way...
Yes, I know about this, and thank you for reminding me, but in the end, I just don't trust software.
I really do like old school and I stick with it. I write most of my docs as text in either vi or nano. I neither want nor need formatting beyond the basics. My CV is the only document I own that has formatting, and I used LibreOffice for it. It's a single page.
I try my best to break people of that awful habit. I also find that I'm usually happy enough dealing only with people who can manage to open a pdf (or any other non-proprietary format).
Yes. I hate receiving documents in .docx or any other proprietary format, especially if I have not agreed to that format prior to receiving them. I like plain text if at all possible. I'll happily receive PDFs because I can view them under Linux/BSD, but others piss me off. People THINK MS Word is the de facto business standard. It isn't. Text and PDF are the standards. Never had issues with either.
I had a friend tell me my CV should be in PDF and locked down to prevent recruiters and others from changing it to something other than the original. I know recruiters are fond of changing things up without informing you.
You can "print" the text file to a pdf for your CV and get the best of both worlds. My department only allows pdf uploads for CVs, so that's my approach.
The author mentions that his system doesn't handle rich text, which is fine, but I'd just like to comment on how difficult of a problem handling rich text is. If anyone is interested in having a personal text-only replay editor, check out http://sharejs.org/ by an ex-Google Wave engineer.
As far as handling rich text, I've talked to the original co-founders of Writely (which became Google Docs), and I've also spent a good 8+ months on it as well. There are lots of tradeoffs involved that diff-match-patch (as mentioned in the article) won't work on. Docs ultimately expresses styles as applied ranges, rather than actual markup.
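To illustrate what "styles as applied ranges" can look like, here is a toy sketch: the text stays plain, and formatting lives in a separate list of (start, end, style) ranges that have to be adjusted on every edit. The `StyleRange` type and the stretch/shift rules are my own simplification, not the actual Docs model.

```python
from dataclasses import dataclass


@dataclass
class StyleRange:
    start: int   # inclusive index into the plain text
    end: int     # exclusive index
    style: str   # e.g. "bold"; style names here are illustrative


def insert_text(text: str, ranges: list[StyleRange], pos: int, new: str):
    """Insert `new` at `pos`, shifting or stretching style ranges to match."""
    n = len(new)
    updated = []
    for r in ranges:
        if pos <= r.start:      # insertion before the range: shift it right
            updated.append(StyleRange(r.start + n, r.end + n, r.style))
        elif pos < r.end:       # insertion inside the range: stretch it
            updated.append(StyleRange(r.start, r.end + n, r.style))
        else:                   # insertion at or after the end: unchanged
            updated.append(r)
    return text[:pos] + new + text[pos:], updated
```

Even this toy version has to pick a policy for edits at range boundaries (does text typed at the edge of a bold word become bold?). That is one of the tradeoffs a plain text diff never has to face.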
Point being, Google keeping every keystroke you've made is absolutely necessary for realtime collaborative writing.
This is one of the reasons why it's great that the source code of web pages/apps is (relatively, compared to binaries) easy to reverse-engineer - because of their environment inside a browser, web apps have such a low barrier to "phoning home" and making requests that privacy-sensitive information being leaked may otherwise be difficult to notice. Imagine if they were all encrypted/obfuscated binaries...
I don't use Google Docs (and probably never will), but if I did, all those requests - "these /save calls every time I typed something" - would be enough for me to investigate why it's generating so much traffic. I'm using an OS that still has a useful network activity indicator icon, so I easily know when there's data being transmitted/received when there shouldn't be.
There's a line of thought that says those sorts of indicators are unnecessary and a distraction, and that may be valid justification for removing them, but I can't help feeling like their removal is making users more unaware of what their machines are doing - and thus easier for companies to do things like this to them.
When you say that "privacy-sensitive information" is "being leaked", you make it sound much worse than it is. The information being sent seems completely normal for an online word processor with a revision history, and it's not being "leaked" to anyone besides the company providing the word processor.
> The information being sent seems completely normal for an online word processor with a revision history
When most people hear "revision history", they think of the versions of the document that exist between explicit saves or periodic autosaves, and not extremely fine-grained per-keystroke activity logging.
I remember reading an article about how it's a shame that authors don't use pen/paper anymore, since we can't see their crossouts and things for rough drafts. I'd argue that this would be infinitely superior if authors would give us access to some revision history.
If the entire computing session, instead of just the word processor, is similarly cloud-hosted with similarly granular revision history, you've got your margins.
I've tried using the example URL on his blog with one of my documents, just to see exactly how the information is stored, and I could never get it to send me an actual response with any of my documents. Has anyone had any luck?
Didn't Google Docs used to have this "playback" feature built in? I clearly remember there being a slider at the top of the page that you could scrub back and forth through a document's revision history.
Interesting. Is this a side-effect of the old Google Wave design, which had collaborative documents where you could watch your collaborator type in realtime?
Operational transformations are used by both Google Wave (now Apache Wave) and Google Docs [1]. The basic idea is to avoid latency by making sure all edit operations commute, so patches can be applied out of order to get the same result.
This is not so different from source control except that merge conflicts are handled differently.
Differential synchronization [2] might be easier to implement, though.
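As a very small sketch of the operational transformation idea, assuming insert-only operations and two clients: each operation is transformed against the concurrent one so that both application orders converge on the same text. The `Insert` type and the `site` tie-breaker are illustrative choices on my part, not how Wave or Docs actually represent operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: int  # id of the client that produced the op; used only to break ties


def apply_op(doc: str, op: Insert) -> str:
    """Apply a single insert to the document."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `other` has already been applied."""
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op
```

With `a = Insert(1, "X", 1)` and `b = Insert(1, "Y", 2)` on `"abc"`, applying `a` then `transform(b, a)` yields the same string as applying `b` then `transform(a, b)`. Deletes and more than two clients are where it gets hard, and this sketch does not attempt either.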
Storing the operations forever is not necessary - they could apply an upper bound on what latency is reasonable; the server then could coalesce changes into groups (sorted by position in the document, transformed into a position at the end of the group of changes) for history. For undo, they could use grouped changes after a certain number of operations of history.
They would then be able to implement sharing without allowing access to the history.
Alternatively, and more easily, they could record the state when someone is given access to the document, and not allow access to operations received by the server before then unless they are granted a separate permission.
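A rough sketch of the coalescing step, under a much simpler rule than the one described: it only merges consecutive inserts where each one starts exactly where the previous one ended (i.e. ordinary forward typing). Operations are modeled as hypothetical `(position, text)` tuples; deletes and the position sorting mentioned above are left out.

```python
def coalesce(ops: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Merge runs of inserts where each op continues right where the last one ended."""
    grouped: list[tuple[int, str]] = []
    for pos, text in ops:
        if grouped:
            last_pos, last_text = grouped[-1]
            if pos == last_pos + len(last_text):
                # Same typing run: extend the previous group instead of adding an entry.
                grouped[-1] = (last_pos, last_text + text)
                continue
        grouped.append((pos, text))
    return grouped
```

Replaying the grouped list gives the same final text as replaying the originals, but the per-keystroke order inside each run is no longer recoverable from the stored history.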
Google's OT has some of its roots in EtherPad. YC-funded AppJet made EtherPad as a demo app, Google acquired it, and they incorporated the tech into Wave.
Interesting perspective by the author and most of the commenters here. My first thought when I read the headline and the article was "privacy breach!". This is certainly interesting data. But it can also be dangerous if the owner of the document isn't aware of the implications of the storage format.
SageMathCloud's IDE (which is CodeMirror-based, so similar to Adobe Brackets, but online) does this, with a nice slider like in "pirate pad". Just create a file and start editing and it will record diffs at about a 1-second interval while you are actively editing. Click on the blue "History" button and you get a slider across past revisions. Jon Lee implemented this functionality last summer. https://cloud.sagemath.com I frequently use this fine history functionality when coding. I'll remember that I had my code in some useful state 15 minutes ago (say in the middle between git commits), and I can easily drag the slider back to that point in time and look at the code. It adds a whole new dimension to coding that dramatically improves things. Unlike with Google Docs, the SageMathCloud revision history for a file foo is simply stored in the file .foo.sage-history, with one diff per line (in JSON format). You can delete .foo.sage-history or archive it or whatever.
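To show why a "one JSON diff per line" history file is easy to work with, here is a sketch of replaying such a file up to a chosen revision, which is roughly what a history slider needs. The diff schema used here (`pos`, `del`, `ins`) is invented for the example; the real .sage-history format is not described above and may look quite different.

```python
import json
from typing import Optional


def replay(history_lines: list[str], upto: Optional[int] = None) -> str:
    """Rebuild the text by applying the first `upto` diffs (all of them if None).

    Each line is assumed to be a JSON object with hypothetical keys:
    "pos" (offset), "del" (number of characters removed), "ins" (text inserted).
    """
    doc = ""
    for line in history_lines[:upto]:
        diff = json.loads(line)
        pos, delete, insert = diff["pos"], diff["del"], diff["ins"]
        doc = doc[:pos] + insert + doc[pos + delete:]
    return doc
```

Dragging the slider then amounts to calling `replay(lines, upto=n)` for different `n`, and deleting the history is just deleting the file.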
That's what I was wondering, considering we have something like that for a live document in most IDEs (remembering changes character by character and the like).
If a document needed to be played back each time to see what it was then it would suck, you would need a format that both represents the final product and also has all the history in one file.
Well you can get a general copy-paste detector on a code base. CPD from PMD is one package. Language support isn't that great though, you mostly only get C/C++ & Java.
Very nice, not only because it shows you a playback of Google Docs documents but also because the author takes the time to note his inspiration, first drafts of the project and eventual evolution of it.
Interesting -- I would not have guessed they were storing all our keystrokes. It'd be fascinating to mine that data, for example, to find patterns of typos.
Anyone that has used the multi-user collaborative editing should not be surprised that at the very least your keystrokes are regularly being transmitted to Google for playback on other people's computers. It's not so much of a leap to assume that they are also stored given Google's lust for data.
Very cool. Just spent the day writing some integration for Google Drive, and was a bit amused when writing a single sentence in Google Doc increased the "Changes.list()" id with ~40.
But I'm curious: Can one delete the kind of revisions displayed here? Those visible in the GDocs UI are only a few major revisions (which may be troublesome in itself for people not knowing about it and sharing a document).
Reminds me of the submission https://news.ycombinator.com/item?id=557191 where we could see pg's thought process in writing the Founders Visa essay. (the link no longer works)
Fine, there's the worry you typed in something sensitive in the document. But it's far from being the major worry here, if I understood everything correctly...
Basically, if you have all keystrokes with timing info, you've got all the keystroke dynamics required to establish an individual's biometric keystroke fingerprint. And that seems freaking scary to me, for a few reasons.
1) Impersonation
Anyone can grab that fingerprint from any shared file on Google Docs and then feed it to a program so that you can impersonate the author for various purposes... Be it typing a blog comment (harmless), or something more insidious like logging into a secure system using that type of auth system.
Another trivial example could be impersonating someone on a Coursera course, where they use such fingerprints for identification on paid / "signature track" courses, which allow you to get a verified certificate from a known university. They use a photo, but that can also be fed by a tweaked webcam driver. So there you have it, you can hire someone to take lessons and pass exams for you. Or fail you.
2) Anonymity
Anyone with access to a shared google doc can now get your fingerprint, and if they implement a similar record on another website, they can identify who you are. Maybe such fingerprints are not 100% unique, but they surely can be accurate enough to pick you from a crowd of anonymous commenters on a website, for instance.
You could also imagine that a software you already have installed could identify you using a similar approach.
In that case, surfing using a Tails/Tor VM and in the incognito mode of your browser won't help you that much.
I'm sure in a perfect world this could sound awesome: no logins required anywhere, just type in stuff and get automatically ID-ed and credited for what you type and say. In our world, that could be bad for some people.
Plus I can only imagine how bad that would be if companies started to include in their web and desktop app EULAs that you agree to share keystroke dynamics with them and that you authorize them to redistribute it to partners. BAM, a global commercial database of uniquely identified users, no matter what account or throwaway email they use. Forget cookies and stuff, they won't need that anymore.
Bit far-fetched of course, as that would require some effort. But it's not that much effort that it wouldn't be interesting enough for someone to do it...
Well, you know what, I actually realized that there's no timing info in the recorded data. So, no problem. I jumped the gun quite a bit.
Still, a bit worrisome, because it could easily be modified to track it. And for all we know, some sites could be doing that. Facebook was (at least at some point) listening to what you were typing in timeline posts even if you didn't actually decide to send them, so it wouldn't be surprising if some sites did that sort of stuff.
Hmm, true, it's mentioned in first paragraph, but I couldn't find it in the rest of the article's body when I came back to it. Well, this is rather bad then...
Very cool read! But I don't seem to understand how the algorithm behind the whole phrase/paragraph tracking is supposed to work. Can anyone enlighten me?
Funny and interesting. Did you know that firepad.io is a rich-text OT text editor with timestamps? I think it's just what you need. You get a real-time collaborative text editor with a fully featured toolbar, and the exact history you need to replay.
Actually, Firepad does replay the history to display the current version on load (though it also has some snapshotting system to restart faster, but snapshots do not erase the history, they are kept in another location).