The full English prose text, minus headers or footers, would still provide almos...

ohgodplsno · on Nov 15, 2020

What if there are ten variants, each with slightly modified wording, so that they know immediately who leaked it?

bobthepanda · on Nov 15, 2020

You don't even need that. Documents have been identified before because some versions replace characters with nearly identical-looking but different Unicode characters (say, the many kinds of space characters, or the Greek question mark in place of the semicolon).

https://en.wikipedia.org/wiki/Whitespace_character

https://en.wikipedia.org/wiki/Question_mark#Greek_question_m...

piaste · on Nov 15, 2020

Yes, I've seen that episode of Game of Thrones too :)

First, consider the requirements to set such a trap. The authors of the document need to be actively concerned about a leaker, and to be OK with the document itself being leaked as long as they catch the culprit - at the same time, they need the document to be juicy enough that it will be leaked. They need to share the document in such a way that no two of the suspects will be able to compare notes, otherwise the jig is up. So no putting the file on a common internal resource (unless the server can stealthily serve different versions based on the user's login data); no attachments, else a reply all / forward would reveal the trap; no collaboration; no physical office where two suspects may see each other's copy.

Is that still possible? Yes. But a _lot_ of times it won't be possible, and the would-be leaker will know it's not possible. It's much more likely, and makes much more sense, for critical documents to be shared in such a way that the users _know_ they are fingerprinted, and won't leak them. IIRC, major Hollywood studios do that with their film scripts.

Second, what if the _key phrases_ are slightly altered in each version? Or hell, if your bosses want to finger you so bad, what if they changed a small factual detail in each version? Then even the journalist's quotes would reveal the leaker.

bobthepanda · on Nov 15, 2020

The not-so-great news is that common characters like spaces and semicolons have various similar-looking characters defined in Unicode, which would not be very noticeable to a human but would be noticeable to a machine.

So you just need to do random substitutions that uniquely identify each copy of the document and you'll have a fingerprint. It wouldn't be very challenging to do, and the record of which variant went to whom wouldn't be hard to maintain.
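A rough sketch of what that could look like, purely to illustrate the idea. Everything here is an assumption: the function names are made up, NO-BREAK SPACE is just one possible lookalike, and the scheme assumes the original text contains no such characters already. Each of the first few spaces carries one bit of a recipient number.

```python
SPACE = "\u0020"      # ordinary space
LOOKALIKE = "\u00a0"  # NO-BREAK SPACE; one arbitrary choice of lookalike


def embed_id(text: str, recipient_id: int, bits: int = 8) -> str:
    """Hide recipient_id in the first `bits` spaces of text (hypothetical scheme)."""
    if not 0 <= recipient_id < 2 ** bits:
        raise ValueError("recipient_id does not fit in the given number of bits")
    if text.count(SPACE) < bits:
        raise ValueError("text has too few spaces to carry the id")
    out = []
    position = 0  # index of the next bit to embed
    for ch in text:
        if ch == SPACE and position < bits:
            bit = (recipient_id >> position) & 1
            out.append(LOOKALIKE if bit else SPACE)
            position += 1
        else:
            out.append(ch)
    return "".join(out)


def extract_id(text: str, bits: int = 8) -> int:
    """Read the id back from a leaked copy by inspecting its first `bits` spaces."""
    recipient_id = 0
    position = 0
    for ch in text:
        if ch in (SPACE, LOOKALIKE) and position < bits:
            if ch == LOOKALIKE:
                recipient_id |= 1 << position
            position += 1
    return recipient_id


memo = "The quarterly numbers will be restated before the board meets on Friday."
marked = embed_id(memo, 37)
print(extract_id(marked))
```

The "record to maintain" is then just a table from recipient number to name. A real scheme would presumably spread the marks out and add redundancy so that a partial quote still identifies the copy.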

You also don't need to uniquely identify it to a person; you just need to narrow the search space and then apply other techniques that would narrow it down. If it's a version of a document that leaked through an email chain then you've just limited the search space to the recipients, which is still plenty useful.

darkwater · on Nov 15, 2020

Then inevitably somebody would complain that the original document wording might have been altered.

piaste · on Nov 15, 2020

As opposed to a PDF scan which can definitely not be forged at all? ;)

Nothing less than a digital signature can prove the integrity of a digital document, and even that is worthless unless the corresponding public key has been publicly made available via a separate and trusted channel, which is unlikely.

bobthepanda · on Nov 15, 2020

Anything that can be used to prove a document's integrity can generally also be used to identify where it came from and how it was produced, which is why we generally don't see any effort to do this at all.

In fact, plenty of things that can't prove a document's integrity can also be used to identify its source, which is why this isn't done; you can't be sure that you've sanitized the document enough to protect the leaker.
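To make the sanitizing problem concrete, here is a minimal sketch of the easy part: folding lookalike characters back to their plain forms with Python's standard `unicodedata` module. The helper names are made up, and this only catches character-level marks. It does nothing about altered wording or facts, and NFKC also rewrites legitimate characters such as ligatures and superscripts.

```python
import unicodedata


def sanitize(text: str) -> str:
    """Fold compatibility lookalikes with NFKC, then drop invisible format characters."""
    folded = unicodedata.normalize("NFKC", text)
    # Category "Cf" covers format characters such as ZERO WIDTH SPACE.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


def suspicious_characters(text: str) -> list[tuple[int, str]]:
    """List (index, Unicode name) for every character sanitize() would change or drop."""
    return [
        (i, unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(text)
        if sanitize(ch) != ch
    ]


leaked = "Ship it\u00a0now\u037e no\u200b delays"
for index, name in suspicious_characters(leaked):
    print(index, name)
print(sanitize(leaked))
```

A clean result from a check like this proves very little, which is the point: it rules out one family of marks and says nothing about the rest.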