Oh, that’s my thread. It’s not written for the HN crowd; my followers on Twitter are a very different demographic. It’s written to help non-technical folks start to consider the possibilities of the technology that exists in their lives now.
They’re just getting a carefully curated peek behind the curtain. The HN crowd knows what’s behind the curtain already.
A couple things: there’s a lot more interesting stuff going on than I’m explaining in my thread. As syllogism noticed, I’m using spaCy for the NLP part of the analysis (and displaCy was the best visual way to explain that to non-technical folks!). The approach I’m using to dereference relative dates could be useful for its own spaCy post on its NER, for example. Same with dereferencing last names to the full names they’re referring to. Those two things alone are immensely useful for journalists and data viz designers. Plus, I love the hell out of spaCy. I’ve done a ton of interesting stuff with it that I should write about, like discovering sock puppet accounts by comparing simple grammar signatures (relative ratios of parts of speech used in their text, along with other features like where adverbs and prepositions are used).
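For readers wondering what a "grammar signature" comparison could look like, here is a minimal sketch. It is not the author's actual method: the tag set, the cosine-similarity comparison, and the sample tag sequences are all assumptions made for illustration. It covers only the part-of-speech ratios, not the positional features (where adverbs and prepositions appear). In practice the tag sequences might come from spaCy's `token.pos_`; here they are hard-coded so the example runs without a model.

```python
from collections import Counter
from math import sqrt

# Coarse part-of-speech tags to track; this particular set is an assumption.
POS_TAGS = ("NOUN", "VERB", "ADJ", "ADV", "ADP", "PRON", "DET")


def grammar_signature(pos_tags):
    """Return the relative frequency of each tracked tag in a tag sequence."""
    counts = Counter(pos_tags)
    total = len(pos_tags) or 1  # avoid dividing by zero on empty input
    return [counts[tag] / total for tag in POS_TAGS]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical tag sequences for three accounts (made-up data).
account_a = ["PRON", "VERB", "DET", "NOUN", "ADV", "ADP", "NOUN"]
account_b = ["PRON", "VERB", "DET", "NOUN", "ADV", "ADP", "DET", "NOUN"]
account_c = ["NOUN", "NOUN", "ADJ", "NOUN", "VERB", "NOUN", "ADJ"]

sig_a, sig_b, sig_c = (
    grammar_signature(tags) for tags in (account_a, account_b, account_c)
)
sim_ab = cosine_similarity(sig_a, sig_b)
sim_ac = cosine_similarity(sig_a, sig_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

A higher similarity only says two accounts use parts of speech in similar proportions. On its own that is weak evidence, so a real analysis would need far more text per account and additional features.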
Two more things: the flight data is the sleeping dragon of that thread. Really.
And I’ve abandoned the idea of “unredacting” with predictions. For some technical reasons, sure (the redactions, for example, actually displace or edit other text, so it’s impossible to know the correct box size)…but mostly for ethical concerns.
It’s both cool and weird to see my thread posted here! I hope my actual technical post gets a shot when I’m finished with it.
"discovering sock puppet accounts by comparing simple grammar signatures (relative ratios of parts of speech used in their text, along with other features like where adverbs and prepositions are used)."
Please write about that. I remember someone on HN claimed to be able to group accounts belonging to a single individual. When challenged, they offered to email the commenter their own alternate username, and the commenter later replied confirming they were correct. I can't find the thread now. It might have been minimaxir?
Could you share a little more about the sockpuppet detection? That sounds very fascinating.
Do you specifically mean IRA sockpuppets, or Internet sockpuppeting in general? Were you looking at indications of sockpuppeting from native Russian speakers writing in English? Or identifying what appears to be automatically generated text? Or using text from known sockpuppet accounts to predict if other accounts may be sockpuppets? Some combination? How much stylometry was involved?
I certainly agree that this wouldn’t tell us anything real about the Mueller report, although it might be a useful exercise to learn about the language models. What really rubs me the wrong way is academics or anyone else with power trying to tell me what is or isn’t “funny” or “interesting” or “fun.” Like, that’s for me to decide, not you!
The work is more about running named entity recognition on the text and correlating the names with other sources of data than it is about deducing the redacted words (which is probably impossible for the most interesting words). For example if flight XX1234 is mentioned in the text, he might be able to deduce that the plane is owned by some Russian oligarch.
I already have a self-curated list of tail numbers of the private aircraft owned by oligarchs (and other international parties of interest), and a whole bunch of ADS-B data. Combine that with a timeline of events (and, for some, locations), and there are some…interesting correlations there.
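As a rough illustration of that kind of correlation, here is a sketch that joins a tail-number watchlist against sightings and a timeline of events. Everything in it is hypothetical: the tail numbers, owners, places, dates, the record layout, and the two-day window are assumptions for the example, not the data or method described above.

```python
from datetime import datetime, timedelta

# All data below is made up for illustration.
WATCHLIST = {"N123AB": "Party A", "M-XYZW": "Party B"}

# (tail number, time seen, place) rows, as they might be derived from ADS-B data
SIGHTINGS = [
    ("N123AB", datetime(2020, 6, 8, 14, 30), "City X"),
    ("N123AB", datetime(2020, 9, 1, 9, 0), "City Y"),
    ("N999ZZ", datetime(2020, 6, 9, 8, 0), "City X"),
]

# (event date, place, description) rows from a timeline of events
EVENTS = [
    (datetime(2020, 6, 9), "City X", "meeting described in the text"),
]


def correlate(sightings, events, watchlist, window=timedelta(days=2)):
    """Pair watchlisted sightings with events at the same place within `window`."""
    matches = []
    for tail, seen_at, place in sightings:
        if tail not in watchlist:
            continue  # ignore aircraft that are not on the watchlist
        for event_date, event_place, description in events:
            if place == event_place and abs(seen_at - event_date) <= window:
                matches.append((watchlist[tail], tail, seen_at, description))
    return matches


matches = correlate(SIGHTINGS, EVENTS, WATCHLIST)
for owner, tail, seen_at, description in matches:
    print(f"{tail} ({owner}) seen {seen_at:%Y-%m-%d} near: {description}")
```

A match here is only a coincidence in time and place. It says nothing about who was on board or why, so it is a lead to check, not a conclusion.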
Yeah, that’s bumming me out a little. It’s the NER (which is getting a lot of my time on this one because spaCy is an amazing tool to extend) and the correlation with other data that’s the interesting part.
Especially those flights.
I wish I could delete the tweet about “unredacting” (or at least edit it to point to my decision later on) without breaking the thread. It was written in a moment of nerd glee, but the ethical considerations are more important to me.
> Section 508 requires your PDF to be accessible to users of assistive technology—like screen readers or Braille displays.
Have people filed official complaints about this? I think once the complaint is official (instead of tweets) the department is obligated to respond to it.
The site where the document is published (https://www.justice.gov/sco) shows the following message below the link to the PDF:
"The Department recognizes that these documents may not yet be in an accessible format. If you have a disability and the format of any material on the site interferes with your ability to access some information, please email the Department of Justice webmaster. To enable us to respond in a manner that will be of most help to you, please indicate the nature of the accessibility problem, your preferred format (electronic format (ASCII, etc.), standard print, large print, etc.), the web address of the requested material, and your full contact information, so we can reach you if questions arise while fulfilling your request."
I'm not sure what to make of the message. They don't say whether they'll send an ASCII version if you ask for it.
To be honest I bet they were scared the redacted sections would leak if they did it entirely digitally, due to ignorance of their own software. Not an excuse, but maybe an explanation.
is that how this works? throw up a disclaimer and i'm good?
i mean, it's fine they are in a tough spot and can't put up an accessible version right now. but that should maybe be answered in court in a year. and people will go to jail if it's a cover up, or not if it was an honest mistake.
The whole notion of 'i might stab you because i'm not good at knives' isn't really plausible to me. however, i'm totally willing to accept student drivers on the road, because everyone needs to learn _somehow_. Hopefully it's an honest mistake, or incompetence, and not anything sinister.
Unfortunately, this is a fairly common workaround for govt sites. Same goes for designing a site for multiple browsers: just throw up a caveat that "site is best viewed in IE circa 2001" to avoid actually fixing the problem.
This is the type of thing that I imagine the "data suckers" (Facebook, Google, Apple, Amazon) are able to do regularly with the complex data that they have and the advanced tools that they have at their disposal.
I can't see yet to what end that ability would be trained, but I consider that ability to be akin to the state's ability to see into my backyard from space: I know they can do it and it doesn't harm me immediately that they can, but there's something net negative about the power/ability imbalance that makes me feel uncomfortable. Of course nobody is going to use that awesome power on someone as inconsequential as me, but...
I wonder if the "data suckers" have any intention of creating a history of the world, in which no author imposes an interpretation of events. Besides remembering all the data, these companies have tools to parse the data and render it understandable, without imposing a meaning. Call it objective history. That would be valuable.
The unredacting part was originally born out of my experiments with a word-level LSTM approach trained on everything the SCO had released. More relevantly, that part was quickly abandoned. It’s all about extracting date-referenced narrative text, and the combination of the NER and the dependency parser has been amazing. Together, they’ve let me begin an extension that dereferences relative dates and last names as though they were pronouns.
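To make the "dereference them like pronouns" idea concrete, here is a toy sketch in plain Python. It is not the extension described above and does not use spaCy: the handled phrases, the "most recent full name wins" rule, and the function names `resolve_relative_date` and `resolve_last_names` are all assumptions for illustration. A real version would presumably work from NER spans and the dependency parse instead of string patterns.

```python
import re
from datetime import date, timedelta

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
RELATIVE = re.compile(r"^(\w+) (day|week)s? (later|earlier)$")


def resolve_relative_date(phrase, anchor):
    """Resolve phrases like 'the next day' or 'two weeks later' against `anchor`.

    Returns None when the phrase is not one of the handled patterns.
    """
    phrase = phrase.lower().strip()
    if phrase == "the next day":
        return anchor + timedelta(days=1)
    if phrase == "the previous day":
        return anchor - timedelta(days=1)
    match = RELATIVE.match(phrase)
    if match and match.group(1) in NUMBER_WORDS:
        days = NUMBER_WORDS[match.group(1)]
        if match.group(2) == "week":
            days *= 7
        sign = 1 if match.group(3) == "later" else -1
        return anchor + timedelta(days=sign * days)
    return None


def resolve_last_names(mentions):
    """Replace a bare last name with the most recent full name that ends with it."""
    last_seen = {}  # last name -> most recent full name
    resolved = []
    for mention in mentions:
        parts = mention.split()
        if len(parts) > 1:
            last_seen[parts[-1]] = mention
            resolved.append(mention)
        else:
            # Unknown last names are left as they are.
            resolved.append(last_seen.get(mention, mention))
    return resolved


anchor = date(2020, 3, 1)  # the last explicit date seen in the text (made up)
print(resolve_relative_date("two weeks later", anchor))
print(resolve_last_names(["Jane Doe", "Doe", "Smith"]))
```

The anchor date plays the role of a pronoun's antecedent: each relative phrase is resolved against the most recent explicit date, just as a bare last name is resolved against the most recent matching full name.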
displaCy will make an appearance in the final “public” post I’m writing, as well as the tech post for the HN crowd. Thank you so much for your work on spaCy/displaCy!
I think he points out the ability to correlate non-report data (like airline flights) or to make it more obvious what is in the redacted sections. I am not sure how effective that will be, but overall NER counts and the relationships between people and a coherent timeline would be helpful.
The informal thread was really written for my followers on Twitter, which (mostly) comprise a very different demographic than HN. I use a different voice when speaking to mostly non-technical people. It’s the teacher/public speaker in me, I guess. :)