NLP on Mueller Report

spdustin · on April 21, 2019

Oh, that’s my thread. It’s not written for the HN crowd; my followers on Twitter are a very different demographic. It’s written to help non-technical folks start to consider the possibilities of the technology that exists in their lives now.

They’re just getting a carefully curated peek behind the curtain. The HN crowd knows what’s behind the curtain already.

A couple things: there’s a lot more interesting stuff going on than I’m explaining in my thread. As syllogism noticed, I’m using spaCy for the NLP part of the analysis (and displaCy was the best visual way to explain that to non-technical folks!). The approach I’m using to dereference relative dates could be useful for its own spaCy post on its NER, for example. Same with dereferencing last names to the full names they’re referring to. Those two things alone are immensely useful for journalists and data viz designers. Plus, I love the hell out of spaCy. I’ve done a ton of interesting stuff with it that I should write about, like discovering sock puppet accounts by comparing simple grammar signatures (relative ratios of parts of speech used in their text, along with other features like where adverbs and prepositions are used).
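To make the "grammar signature" idea concrete, here is a minimal sketch. It is not the code behind the thread: the function names, the hand-tagged sample tokens, and the use of cosine similarity as the comparison are all assumptions for illustration. In practice the part-of-speech tags might come from something like spaCy's `token.pos_`.

```python
from collections import Counter
from math import sqrt


def pos_signature(tagged_tokens):
    """Return the relative frequency of each part-of-speech tag.

    tagged_tokens is a list of (word, pos_tag) pairs. Here they are
    hand-tagged; a real pipeline would get them from a tagger.
    """
    counts = Counter(pos for _, pos in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: n / total for pos, n in counts.items()}


def cosine_similarity(sig_a, sig_b):
    """Cosine similarity between two signatures (1.0 means same ratios)."""
    tags = set(sig_a) | set(sig_b)
    dot = sum(sig_a.get(t, 0.0) * sig_b.get(t, 0.0) for t in tags)
    norm_a = sqrt(sum(v * v for v in sig_a.values()))
    norm_b = sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, hand-tagged samples standing in for three accounts.
account_a = [("I", "PRON"), ("really", "ADV"), ("love", "VERB"),
             ("this", "DET"), ("tool", "NOUN")]
account_b = [("We", "PRON"), ("truly", "ADV"), ("enjoy", "VERB"),
             ("that", "DET"), ("library", "NOUN")]
account_c = [("Report", "NOUN"), ("released", "VERB"), ("Thursday", "PROPN"),
             ("by", "ADP"), ("DOJ", "PROPN")]

sig_a = pos_signature(account_a)
sig_b = pos_signature(account_b)
sig_c = pos_signature(account_c)

sim_ab = cosine_similarity(sig_a, sig_b)
sim_ac = cosine_similarity(sig_a, sig_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

A real comparison would need far more text per account and the positional features mentioned above (where adverbs and prepositions appear); the tag ratios alone are only the simplest part of the signature.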

Two more things: the flight data is the sleeping dragon of that thread. Really.

And I’ve abandoned the idea of “unredacting” with predictions. For some technical reasons, sure (the redactions, for example, actually displace or edit other text, so it’s impossible to know the correct box size)…but mostly for ethical concerns.

It’s both cool and weird to see my thread posted here! I hope my actual technical post gets a shot when I’m finished with it.

rahimnathwani · on April 21, 2019

"discovering sock puppet accounts by comparing simple grammar signatures (relative ratios of parts of speech used in their text, along with other features like where adverbs and prepositions are used)."

Please write about that. I remember someone on HN claimed to be able to group accounts belonging to a single individual. When challenged, they offered to email the commenter their own alternate username, and the commenter later replied confirming they were correct. I can't find the thread now. It might have been minimaxir?

rahimnathwani · on April 23, 2019

It was lettergram, not minimaxir:

https://news.ycombinator.com/item?id=17944484

meowface · on April 21, 2019

Could you share a little more about the sockpuppet detection? That sounds very fascinating.

Do you specifically mean IRA sockpuppets, or Internet sockpuppeting in general? Were you looking at indications of sockpuppeting from native Russian speakers writing in English? Or identifying what appears to be automatically generated text? Or using text from known sockpuppet accounts to predict if other accounts may be sockpuppets? Some combination? How much stylometry was involved?

phowon · on April 21, 2019

And here is a tweet thread on why using NLP models to "fill in" the redacted portions is a horrendously terrible idea.

https://twitter.com/emilymbender/status/1119081131234611201

Certhas · on April 21, 2019

Of course that's not the main aim of the person in the thread. The aim is a timeline cross referencing different data sources.

spdustin · on April 21, 2019

Bingo!

intuitionist · on April 21, 2019

I certainly agree that this wouldn’t tell us anything real about the Mueller report, although it might be a useful exercise to learn about the language models. What really rubs me the wrong way is academics or anyone else with power trying to tell me what is or isn’t “funny” or “interesting” or “fun.” Like, that’s for me to decide, not you!

georgespencer · on April 21, 2019

Before you read a 25-tweet thread: he has not yet done any of the things he is talking about. This post is somewhat premature.

spdustin · on April 21, 2019

I have, actually. That thread was truncated and a new one started where I’ve shared some new things.

The whole thing is like a weekend project that’s being narrated for my Twitter followers. They’re (in general) a very different demographic than HN.

My technical post later on will be of greater interest to HN.

inflatableDodo · on April 21, 2019

He'd better hurry, the unredacted version might leak by tomorrow. I'd be amazed if it doesn't appear pretty soon.

soVeryTired · on April 21, 2019

The work is more about running named entity recognition on the text and correlating the names with other sources of data than it is about deducing the redacted words (which is probably impossible for the most interesting words). For example, if flight XX1234 is mentioned in the text, he might be able to deduce that the plane is owned by some Russian oligarch.

spdustin · on April 21, 2019

I already have a self-curated list of tail numbers of the private aircraft owned by oligarchs (and other international parties of interest), and a whole bunch of ADS-B data. Combine that with a timeline of events (and, for some, locations), and there are some…interesting correlations there.
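A rough sketch of what that kind of correlation could look like, assuming the flight records and the event timeline have already been reduced to simple dicts. Every value below (tail numbers, places, dates, labels) is made up, and the field names and the one-day window are assumptions, not details from the thread.

```python
from datetime import date


def correlate(flights, events, window_days=1):
    """Pair each event with flights that landed at the same place
    within window_days of the event date."""
    matches = []
    for event in events:
        for flight in flights:
            same_place = flight["destination"] == event["location"]
            days_apart = abs((flight["date"] - event["date"]).days)
            if same_place and days_apart <= window_days:
                matches.append((event["label"], flight["tail"]))
    return matches


# Hypothetical data: two watched aircraft and two timeline events.
flights = [
    {"tail": "X-TEST1", "date": date(2019, 1, 30), "destination": "City A"},
    {"tail": "X-TEST2", "date": date(2019, 3, 15), "destination": "City A"},
]
events = [
    {"label": "meeting A", "date": date(2019, 1, 31), "location": "City A"},
    {"label": "meeting B", "date": date(2019, 2, 20), "location": "City B"},
]

matches = correlate(flights, events)
print(matches)
```

A match like this only says a plane was nearby at about the right time; it is a lead to check against other sources, not a finding.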

Certhas · on April 21, 2019

People seem to have filled in what OP is doing based on the headline/first three tweets only...

spdustin · on April 21, 2019

Yeah, that’s bumming me out a little. It’s the NER (which is getting a lot of my time on this one because spaCy is an amazing tool to extend) and the correlation with other data that’s the interesting part.

Especially those flights.

I wish I could delete the tweet about “unredacting” (or at least edit it to point to my decision later on) without breaking the thread. It was written in a moment of nerd glee, but the ethical considerations are more important to me.

inflatableDodo · on April 22, 2019

Sorry, I picked up the wrong end of the stick and ran with it. I really shouldn't use the internet when tired.

raverbashing · on April 21, 2019

Why would you have to hurry? Just save the unredacted version and use it when you're ready

lake99 · on April 21, 2019

> Section 508 requires your PDF to be accessible to users of assistive technology—like screen readers or Braille displays.

Have people filed official complaints about this? I think once the complaint is official (instead of tweets) the department is obligated to respond to it.

rahimnathwani · on April 21, 2019

The site where the document is published (https://www.justice.gov/sco) shows the following message below the link to the PDF:

"The Department recognizes that these documents may not yet be in an accessible format. If you have a disability and the format of any material on the site interferes with your ability to access some information, please email the Department of Justice webmaster. To enable us to respond in a manner that will be of most help to you, please indicate the nature of the accessibility problem, your preferred format (electronic format (ASCII, etc.), standard print, large print, etc.), the web address of the requested material, and your full contact information, so we can reach you if questions arise while fulfilling your request."

I'm not sure what to make of the message. They don't say whether they'll send an ASCII version if you ask for it.

spdustin · on April 21, 2019

I have yet to get anything back from them.

Honestly, there is no good reason that born-digital content couldn’t get posted in digital/textual form rather than scanned pages.

The secure redactions created by other software would’ve been preserved, too. It’s inexcusable, IMO, to just scan the damn thing.

diminoten · on April 21, 2019

To be honest I bet they were scared the redacted sections would leak if they did it entirely digitally, due to ignorance of their own software. Not an excuse, but maybe an explanation.

jfoutz · on April 21, 2019

is that how this works? throw up a disclaimer and i'm good?

i mean, it's fine they are in a tough spot and can't put up an accessible version right now. but that should maybe be answered in court in a year. and people will go to jail if it's a cover up, or not if it was an honest mistake.

The whole notion of 'i might stab you because i'm not good at knives' isn't really plausible to me. however, i'm totally willing to accept student drivers on the road, because everyone needs to learn _somehow_. Hopefully it's an honest mistake, or incompetence, and not anything sinister.

bumby · on April 21, 2019

Unfortunately, this is a fairly common workaround for govt sites. Same goes for designing a site for multiple browsers: just throw up a caveat that "site is best viewed in IE circa 2001" to avoid actually fixing the problem.

yay_cloud2 · on April 21, 2019

This is the type of thing that I imagine the "data suckers" (Facebook, Google, Apple, Amazon) are able to do regularly with the complex data that they have and the advanced tools that they have at their disposal.

I can't see yet to what end that ability would be trained, but I consider that ability to be akin to the state's ability to see into my backyard from space: I know they can do it and it doesn't harm me immediately that they can, but there's something net negative about the power/ability imbalance that makes me feel uncomfortable. Of course nobody is going to use that awesome power on someone as inconsequential as me, but...

beautifulfreak · on April 21, 2019

I wonder if the "data suckers" have any intention of creating a history of the world, in which no author imposes an interpretation of events. Besides remembering all the data, these companies have tools to parse the data and render it understandable, without imposing a meaning. Call it objective history. That would be valuable.

fit2rule · on April 21, 2019

Angelastic ran her haiku detector on the Mueller report and found some amusing results:

https://angelastic.com/2019/04/20/unintentional-haiku-in-the...

armantor · on April 21, 2019

Meanwhile check this if you are interested: https://www.axios.com/explore-a-detailed-version-of-the-muel...

syllogism · on April 21, 2019

Surely the report isn't long enough to do this sort of thing? I mean...You can just read it, right?

(And I say this as the developer of the tools he'd likely be using...At least, he has a screenshot of our NER visualiser there.)

spdustin · on April 21, 2019

The unredacting part was originally born out of my experiments with a word-level LSTM approach trained on everything the SCO had released. More relevantly, that part was quickly abandoned. It’s all about extracting date-referenced narrative text, and the combination of the NER and the dependency parser has been amazing. Together, they’ve let me begin an extension that dereferences relative dates and last names as though they were pronouns.
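As a toy illustration of treating last names and relative dates "as though they were pronouns" (a sketch under heavy assumptions, not the actual extension): suppose the PERSON mentions have already been extracted in document order, and relative date phrases come from a small fixed table. The names, phrases, and dates below are placeholders.

```python
from datetime import date, timedelta


def resolve_last_names(mentions):
    """Replace a bare last name with the most recent full name that
    ended in it. Unknown single-word mentions are left unchanged."""
    last_seen = {}  # last name -> most recent full name
    resolved = []
    for mention in mentions:
        parts = mention.split()
        if len(parts) > 1:
            last_seen[parts[-1]] = mention
            resolved.append(mention)
        else:
            resolved.append(last_seen.get(mention, mention))
    return resolved


# Hypothetical phrase table; a real one would be far larger or rule-based.
RELATIVE_OFFSETS = {
    "the same day": 0,
    "the next day": 1,
    "the following day": 1,
    "the day before": -1,
    "a week later": 7,
}


def resolve_relative_date(phrase, anchor):
    """Turn a relative phrase into a date, given the most recent
    absolute date (anchor). Returns None for unknown phrases."""
    offset = RELATIVE_OFFSETS.get(phrase.lower().strip())
    if offset is None:
        return None
    return anchor + timedelta(days=offset)


mentions = ["Jane Doe", "Doe", "John Smith", "Doe", "Smith", "Roe"]
resolved = resolve_last_names(mentions)
print(resolved)

next_day = resolve_relative_date("The next day", date(2019, 1, 31))
print(next_day)
```

The hard part this skips is ambiguity: two people sharing a last name, or a relative date whose anchor is several sentences back, which is presumably where the dependency parse earns its keep.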

displaCy will make an appearance in the final “public” post I’m writing, as well as the tech post for the HN crowd. Thank you so much for your work on spaCy/displaCy!

sytelus · on April 21, 2019

All this NLP is great, but the important question is: what insights did you gain from it that you didn’t have before?

soVeryTired · on April 21, 2019

Honestly just a list of names, people, places and dates mentioned in the document could be a boon to investigative journalists.

spdustin · on April 21, 2019

That was my prediction and hope. Judging by the DMs I’ve gotten, it was an accurate prediction. :)

mgamache · on April 21, 2019

I think he points out the ability to correlate non-report data (like airline flights) or to make it more obvious what is in the redacted sections. I am not sure how effective that will be, but overall NER counts and the relationships between people and a coherent timeline would be helpful.

spdustin · on April 21, 2019

The unredacting stuff was abandoned. It’s problematic even with training on other material from the SCO, and the ethics are fraught.

The correlation is where it’s at, no doubt.

zachguo · on April 21, 2019

It's a bit boring TBH. Basically OCR+NER. I guess some journalists have already done similar stuff using Google's NLP API.

rayrrr · on May 3, 2019

Here's some NLP text summarization of the Mueller Report in action (by yours truly): https://news.ycombinator.com/item?id=19815506

1wd · on April 21, 2019

Is there a central public place (wiki / github / ...) where people collaborate on annotating the Mueller Report?

dang · on April 21, 2019

Url changed from https://threadreaderapp.com/thread/1119118085443559425.html, which points to this.

nwrk · on April 21, 2019

That looks pretty cool!

daRealDodo · on April 21, 2019

Why not a Medium post?

spdustin · on April 21, 2019

The informal thread was really written for my followers on Twitter, who (mostly) comprise a very different demographic than HN. I use a different voice when speaking to mostly non-technical people. It’s the teacher/public speaker in me, I guess. :)

There will be a technical post later.