Redacting is a long accepted practice when revealing information but preserving ...

nitrogen · on April 2, 2020

Supposedly so is the practice of using subtle variations in spelling, word choice, word order, spacing, typography, etc. to identify recipients of documents.

oh_sigh · on April 2, 2020

This is exactly what I worked on ~10 years ago at Amazon: embedding steganographic information into a certain internal app that reported confidential sales numbers. Ended up catching the person who leaked this: https://techcrunch.com/2011/10/04/leaked-sales-data-puts-kin...

mavsman · on April 2, 2020

Curious how you feel about that now. Any guilt about building that? Pride? Ambivalence?

oh_sigh · on April 2, 2020

No guilt at all - mostly ambivalence. It was actually my idea to put it into the specific product, but it's not like I invented the technique or anything. It was only one small thing I worked on, 98% of my time was on something else.

I think the ability to leak information about the wrongdoing of corporations or governments is extremely important, but most of the leaks I see coming out of the tech industry seem designed just to score points in some internal political war or push the company in the direction that the leaker wants it to go. Or just for some weird form of self-aggrandizement.

throwanem · on April 2, 2020

Having done the same kind of work - yeah, that. For every Edward Snowden, there's at least ten thousand Frank Underwoods and Michael Scotts.

choward · on April 3, 2020

This is the difference between leaking and whistleblowing. Leaking is for one's own personal benefit. Whistleblowing is to expose something you feel is wrong for no personal gain.

I wouldn't call this a leak unless the news agency paid him or something else that benefited him.

JorgeGT · on April 2, 2020

Out of curiosity, can you share a ballpark of how many different variations you can generate per, say, paragraph of text?

oh_sigh · on April 2, 2020

What I worked on was more like a spreadsheet, so I didn't use any of the text-oriented steganographic techniques like replacing words with synonyms, etc.

I was able to develop enough variations to vastly outnumber our users, though, so even with just a portion of a screenshot, you could fairly easily figure out where it came from.

Just look at the possible CSS rules and you can see where the variations come into play: cell width, border width and styles, font color (e.g. the specific green or red that represents gain/loss), kerning, column placement, etc.

On top of that, I only fudged with display elements - the numbers were never changed. However, the numbers were updated on a near-continuous basis by ingesting various logs, so any column that was live (year/month-to-date, etc.) would have only a very small time range where that number could have been displayed to the user.
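
For anyone wondering how display-only variations can outnumber users, here is a rough Python sketch of one way to do it. It is not the actual implementation: the property names, the option values, and the function names are all made up for illustration. Each recipient gets an index, and the index is spread across the style choices like digits in a mixed-radix number.

```python
# Hypothetical style options; every name and value here is illustrative only.
STYLE_OPTIONS = {
    "border-width": ["1px", "1.25px", "1.5px"],
    "letter-spacing": ["0", "0.01em", "0.02em"],
    "gain-color": ["#008000", "#008100", "#007f00", "#018000"],
    "padding": ["4px", "5px"],
}


def capacity(options=STYLE_OPTIONS):
    """Number of distinct style combinations (product of option counts)."""
    total = 1
    for values in options.values():
        total *= len(values)
    return total


def style_for(recipient_index, options=STYLE_OPTIONS):
    """Map a recipient index to one unique combination of style values."""
    if not 0 <= recipient_index < capacity(options):
        raise ValueError("recipient index exceeds available combinations")
    style = {}
    remaining = recipient_index
    for prop, values in options.items():
        # Peel off one mixed-radix digit per property.
        remaining, digit = divmod(remaining, len(values))
        style[prop] = values[digit]
    return style


def recipient_for(style, options=STYLE_OPTIONS):
    """Recover the recipient index from an observed style combination."""
    index = 0
    multiplier = 1
    for prop, values in options.items():
        index += values.index(style[prop]) * multiplier
        multiplier *= len(values)
    return index
```

With only these four made-up properties there are 3 * 3 * 4 * 2 = 72 combinations; each extra property multiplies the total, which is why the count grows so quickly.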

throwanem · on April 2, 2020

If you choose N words to alternate with one synonym each, you can make 2^N unique versions.
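
A minimal Python sketch of that synonym alternation, assuming lowercase words and that every pair occurs at least once in the text. The synonym pairs and function names are invented for the example; each pair carries one bit of the recipient ID.

```python
import re

# Hypothetical synonym pairs; each pair encodes one bit of the recipient ID.
SYNONYM_PAIRS = [("begin", "start"), ("large", "big"), ("quickly", "rapidly")]


def mark(text, recipient_id, pairs=SYNONYM_PAIRS):
    """Return a copy of text whose word choices encode recipient_id."""
    if not 0 <= recipient_id < 2 ** len(pairs):
        raise ValueError("recipient_id needs more bits than there are pairs")
    for bit, (first, second) in enumerate(pairs):
        # Bit set -> use the second word of the pair, otherwise the first.
        chosen = second if (recipient_id >> bit) & 1 else first
        pattern = rf"\b(?:{re.escape(first)}|{re.escape(second)})\b"
        text = re.sub(pattern, chosen, text)
    return text


def identify(text, pairs=SYNONYM_PAIRS):
    """Recover the recipient ID from the word choices in a leaked copy."""
    recipient_id = 0
    for bit, (_first, second) in enumerate(pairs):
        if re.search(rf"\b{re.escape(second)}\b", text):
            recipient_id |= 1 << bit
    return recipient_id
```

Three pairs give 2^3 = 8 distinguishable copies; for example, `identify(mark("We begin the large migration quickly.", 5))` recovers 5.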

JorgeGT · on April 2, 2020

Oh, I was thinking of more subtle things such as spacing, punctuation, sizing, kerning, etc.

throwanem · on April 2, 2020

Ideally you don't want to count on a screenshot being published.

londons_explore · on April 3, 2020

For numbers like this, you can add a small amount of random variation to each number, and then save whatever variations you used to a database whenever someone views the stats.

Now when a leak happens of a specific number, you just check the logs to see who saw those exact numbers.
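
Roughly, in Python, and only as a sketch: the in-memory dictionary stands in for the database, and the class name, method names, and `max_offset` parameter are assumptions for illustration. Random offsets can collide, so a lookup returns a set of candidate viewers rather than a single name.

```python
import random
from collections import defaultdict


class ViewLog:
    """Toy stand-in for a database of which user saw which perturbed value."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._viewers = defaultdict(set)

    def show(self, user, metric, true_value, max_offset=5):
        """Return a slightly perturbed value and record who was shown it."""
        shown = true_value + self._rng.randint(-max_offset, max_offset)
        self._viewers[(metric, shown)].add(user)
        return shown

    def who_saw(self, metric, leaked_value):
        """Return every user who was shown exactly this value."""
        return set(self._viewers.get((metric, leaked_value), set()))
```

Usage would look like `shown = log.show("alice", "units_sold", 10000)` when rendering the page, then `log.who_saw("units_sold", leaked_number)` once a number turns up in the press.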

throwanem · on April 2, 2020

Word choice works best, assuming the source is textual; simple alternation of synonyms gives 2^n unique versions in the number of replacement candidates, and it's not hard to automate. You ideally want to take measures to reduce the likelihood of a given recipient seeing anyone else's copy and thus having a chance to spot the variance, but there are ways to do that and in most cases it's not all that likely in the first place.

On the other hand, news outlets that receive leaks are typically well aware of these techniques and will act to frustrate them. When you see a leak reported on but not directly published, that's why. If you want to evaluate veracity, a good method is to look at any response made by the source. In this case, it's legit; if it weren't, the Amazon GC would say so. He's not going to lie in a way that discovery will make immediately obvious in any case that comes of this, so he made the world's worst excuse instead. The surprise is that he let himself be reached for comment at all - between that and the "yeah, I sure did goof it, huh?" style of what he said when he was, I wouldn't be too astonished to see a golden handshake eventuate in the fullness of time.

daenz · on April 2, 2020

Are you making an argument for a news agency to never reveal the contents of any leaks, ever? There's always some risk involved, and that's the price of leaking the truth and expecting people to believe you. How can we expect people to "do their research" and be critical of information, when the news agencies themselves won't reveal it, and instead are paraphrasing and interpreting it for us? That's nonsense.

greedo · on April 2, 2020

Canary Trap...

Pharaoh2 · on April 2, 2020

It's fairly common to embed a canary trap in a message to find out who the leaker is. Not saying this memo had one, but it's generally no longer safe to just show redacted messages without compromising the source.