TSA redaction fail: hidden text easily readable via copy & paste (wanderingaramean.com)
108 points by anigbrowl on Dec 7, 2009 | 54 comments



It's helpful to understand a little bit of how a PDF file works.

A PDF file stores a sequence of instructions for what to draw on a page. The instructions are commands, such as "write this text here" or "draw a circle there." When you draw black redaction boxes, you are just appending an instruction to the end of the list ("draw a box here"), but most of the time the previous instructions are all still there.

Really, the only way to redact a PDF with certainty, short of manually reading the file itself (yes, PDF files are text-readable; you just have to decompress them), is to rasterize the file to a large image. Otherwise, the instructions for writing the text are all still in the file, and there are any number of ways to extract them (for example, simply by deleting the instruction to draw the redaction box).
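
To make that concrete, here is a minimal Python sketch. It assumes a simplified, hand-written page content stream (not a complete PDF file) with a made-up text string; real files often use hex strings, TJ arrays, and custom font encodings, so a regex like this would not be enough for them. The operators themselves (Tj, re, f, rg) are standard PDF operators, and the helper names are my own.

```python
import re
import zlib

# A simplified page content stream (an assumption, not a full PDF).
# BT/ET bracket a text object, Tf picks the font, Td moves the text
# position, Tj shows a string. rg sets the fill color, re adds a
# rectangle to the path, f fills it.
content = (
    b"BT /F1 12 Tf 72 700 Td (EXAMPLE SENSITIVE TEXT) Tj ET\n"
    b"0 0 0 rg 70 696 260 14 re f\n"  # the "redaction": a black box drawn afterwards
)

# Streams are commonly stored Flate-compressed, which is plain zlib.
stored = zlib.compress(content)


def extract_shown_text(compressed_stream: bytes) -> list[str]:
    """Return the literal strings passed to the Tj operator."""
    raw = zlib.decompress(compressed_stream)
    return [m.decode("latin-1") for m in re.findall(rb"\(([^()]*)\)\s*Tj", raw)]


def strip_filled_rectangles(compressed_stream: bytes) -> bytes:
    """Drop every instruction line that draws and fills a rectangle."""
    raw = zlib.decompress(compressed_stream)
    kept = [line for line in raw.split(b"\n") if not line.rstrip().endswith(b" re f")]
    return b"\n".join(kept)


hidden_text = extract_shown_text(stored)
without_boxes = strip_filled_rectangles(stored)
print(hidden_text)
```

The box never touches the text instruction; it is only one more line in the stream, so either reading the Tj strings or deleting the box line gets the text back.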


A government document can't (shouldn't) be rasterized because it would lose all the tagging information, making it inaccessible in the Section 508 sense.

It is however trivial for even a non-technically inclined person to remove the text content through Acrobat's content editing pane.

Just as you suggest, things like this seem to be an example of someone in a large organization not quite understanding how the "magic" behind the tools they use works and winding up with results they didn't expect.


I understand that but why would they not just replace the area with text that says [ redacted ] instead of the black box?

Edit: you can pad the redacted box with spaces to keep the formatting consistent.


From what I have seen in working with the government and the web I think that anigbrowl's comment is spot on. Many people use tools that are designed for the web or the desktop in the same way that they used to use tools designed for print. In this case just slapping a black line over it works in print, and it does not occur to many people that don't fully understand the technology they use that it might not be directly analogous.

A significant portion of my day job involves working with people in government who have a mental model of content distribution that was constructed during the era of print. It's a fun challenge to help people adjust those models to make things more efficient / more informative / less dangerous.


Perhaps they want to keep the formatting, and don't want pages to reflow after they've already been laid out? The Unicode character "█" works well for this purpose.
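
A minimal Python sketch of that idea, assuming the redaction is done on the source text before the document is generated. The function name and the sample string are made up for illustration.

```python
BLOCK = "\u2588"  # the full block character


def redact(text: str, start: int, end: int) -> str:
    """Replace text[start:end] with the same number of block characters."""
    return text[:start] + BLOCK * (end - start) + text[end:]


original = "Name: Alice Smith"
redacted = redact(original, 6, 17)
print(redacted)
```

The original characters are gone rather than covered, and the character count is unchanged. Note that the count only translates into an unchanged layout if the block glyph is as wide as the characters it replaces, and the length of the redacted span is still visible.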


I remember a similar story from a few years ago where redacted names were discovered from the length of the redacted characters. Turned out it wasn't too difficult to brute force names and see which ones matched the exact size of the redaction.


They should have used a fixed width font.
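
A rough Python sketch of why that helps. The glyph widths below are invented for illustration and do not come from any real font, and the names are hypothetical candidates. The idea is just to compare each candidate's rendered width against the measured width of the box.

```python
# Hypothetical advance widths in 1/1000 em for a proportional font.
# These numbers are made up, not taken from a real font file.
WIDTHS = {"i": 278, "j": 278, "l": 278, "t": 333, "r": 389,
          "m": 833, "w": 722, " ": 250}
DEFAULT_WIDTH = 500


def text_width(text, font_size=12.0, fixed_width=None):
    """Width of text in points; fixed_width forces a monospaced font."""
    units = 0
    for ch in text:
        if fixed_width is not None:
            units += fixed_width
        else:
            units += WIDTHS.get(ch.lower(), DEFAULT_WIDTH)
    return units * font_size / 1000.0


def candidates_matching(box_width, names, tolerance=0.5, fixed_width=None):
    """Names whose rendered width is within tolerance of the box width."""
    return [n for n in names
            if abs(text_width(n, fixed_width=fixed_width) - box_width) <= tolerance]


names = ["Jill Witt", "Mary Mamm", "Alan Dore", "Bob Smith"]  # all 9 characters

# Proportional font: the box width singles out one candidate.
proportional = candidates_matching(text_width("Alan Dore"), names)

# Fixed-width font: every 9-character name fits the same box.
monospaced = candidates_matching(
    text_width("Alan Dore", fixed_width=600), names, fixed_width=600)
```

With a fixed-width font the box still leaks the number of characters, so it narrows the search less but does not eliminate it.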


Occam's Razor: because "they" did not understand PDF and thought that they were making the text unreadable simply by drawing boxes over it.


Black boxes look much more secret agent?


Mrs Browl works at an e-discovery company where documents are often redacted depending on which bits are admissible in court; I gather they use TIFF files for exactly the reasons you describe.

I suppose the (extremely limited) redaction features in Acrobat stem from the belief that the document is ultimately destined for printing rather than computer reading.


I also work at an e-discovery company and can confirm the use of TIFF files as the court-admissible format for producing any sort of document (especially if it has redactions).

Here is an example of the process to apply redactions to a PDF file (or any sort of electronic document). The document is opened in a third-party native viewer with redaction capabilities (and not in the original application like Acrobat or Office). The end-user draws a black box over the appropriate text using the third-party native viewer. This redaction is saved as a separate layer that can be rendered on top of the native document when viewed in the software, as well as printed out. Finally, when documents need to be produced for court, they are printed out to a TIFF-printer driver, effectively converting them to TIFF and "burning in" the redaction layer. The TIFFs are delivered to the courts and opposing counsel.
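
The "burning in" step can be sketched in Python with a toy model. This is only an illustration of the idea, not how any particular viewer or TIFF driver works: the page is assumed to be a small grid of grayscale pixels, and the redaction layer a list of rectangles kept separately until production.

```python
def burn_in(page, boxes):
    """Return a new raster with every box filled black (0).

    page  : list of rows, each a list of grayscale values 0..255
    boxes : list of (top, left, bottom, right), bottom/right exclusive
    """
    flattened = [row[:] for row in page]  # copy, leave the original alone
    for top, left, bottom, right in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                flattened[y][x] = 0
    return flattened


# A tiny 4x6 "scan"; the inner values stand in for rendered text.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  40, 200,  90, 130, 255],
    [255,  60, 180,  70, 150, 255],
    [255, 255, 255, 255, 255, 255],
]
redaction_layer = [(1, 1, 3, 5)]  # stored separately during review

produced = burn_in(page, redaction_layer)

# A page with different content under the box yields the same output,
# so nothing about the covered pixels survives in what is produced.
other_page = [row[:] for row in page]
other_page[1][2] = 10
same_output = burn_in(other_page, redaction_layer) == produced
```

Unlike a box added to a PDF, the flattened raster has only one value per pixel, so there is no earlier instruction left to recover.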


I have no personal experience, and at best a vague recollection of having read or heard something, but are document images also used (or were they in the past) to make discovery as inefficient as permitted for the opposition? Images require a person to analyze them, or pre-processing before electronic search and analysis can be performed against them.


OCR is very good nowadays, so that's not so big a problem - and for big lawsuits (eg patent fights) there are literally hundreds of lawyers hired to comb through the vast array of documents deciding whether they're responsive or not.

The pre-processing is largely automated but there's certainly a portion (maybe 5%?) of documents that need to be hand-classified in the database before they get to the lawyers. It's an interesting field - lots of money to be made, intense competition for it, relatively simple technology requirements but a legal industry which has been resistant to technology for quite a long time. Autonomy seems to be the leading software company in this space.


I can't speak for their other products but in my experience their enterprise search product is a useless piece of crap.


When you're billing the client $300 an hour to do a job they can't easily take elsewhere, you don't have a lot of incentive to improve efficiency.


I heard once that NSA guidelines are to print out PDFs, redact portions with a marker and then rescan them into new files.


If you open the file in Adobe Acrobat Pro, there's a tool called "TouchUp Object" that lets you literally just click and delete the boxes.


Want to know which twelve passports will instantly get you shunted over for secondary screening, simply by showing them to the ID-checking agent?

Cuba, Iran, North Korea, Libya, Syria, Sudan, Afghanistan, Lebanon, Somalia, Iraq, Yemen, or Algeria


That's such an obvious list I'm not sure if it's from the document itself, or if you just wrote down what it would probably be.


The absence of Saudi Arabia is pretty interesting.


Right, didn't half the 9/11 hijackers have Saudi passports? They wouldn't even get secondary.


None of the 9/11 hijackers' nationalities are represented (Saudi Arabia, Egypt, UAE), excepting Lebanon.

http://en.wikipedia.org/wiki/Organizers_of_the_September_11,...


Yes, but the Saudis are our rich friends so we wouldn't want to bother them with anything like pursuing real justice.


I was a little surprised about Cuba. They're not our favorite government, but I don't see them as a security risk.


Cuba, along with Syria, Sudan, and Iran, makes up the State Department's list of State Sponsors of Terrorism, which I would assume is why it is on that list.


The countries on the list are there more for "Screw you" reasons than actual security. That is why Cuba is on the list, and possibly first.


Are you serious!!!!

America has been living under the shadow of imminent Cuban aggression for 50 years. There have been Cuban attempts to invade America and numerous assassination attempts on the US president by Cuban agents.

And their chemical weapons kill 1000s of Americans after being secretly delivered from Canada.


Apparently your sense of humor is too subtle for some other HNers!


Or perhaps a bit too blunt.


It's from the doc.


Sounds like George Carlin's seven words you can't say on TV.


This isn't the first time this has happened. Several years back, another government agency made the same mistake. On a slow computer, the text would render first, and could be seen for several seconds before the black boxes would appear.

Googling around, it seems to be a fairly common mistake, going back to at least 2000.

http://www.securityfocus.com/news/7272

http://cryptome.org/iran-cia/cia-iran-pdf.htm

http://blogs.zdnet.com/BTL/?p=12907


In the same vein, if not the same format: Distributing a Word document from which you've not removed change tracking (and/or other metadata). (Or whatever Word calls it; it's been a while.) I had to correct my teammates on this one, a while back (and yes, the documents were going to external clients). Nothing novel; the problem's been in the news repeatedly. Nonetheless, people -- even "technical" -- still don't get it right.
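
For the newer .docx format, a quick check can be sketched in Python. It assumes tracked changes show up as w:ins and w:del elements in word/document.xml with the conventional "w:" prefix; the older binary .doc format is not covered, and the sample document built here is a stripped-down stand-in rather than a complete Word file. Function names are my own.

```python
import io
import re
import zipfile


def count_tracked_changes(docx_bytes):
    """Count <w:ins> and <w:del> elements in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    return {
        "insertions": len(re.findall(r"<w:ins[ >]", xml)),
        "deletions": len(re.findall(r"<w:del[ >]", xml)),  # not <w:delText>
    }


def make_sample_docx():
    """Build a minimal stand-in archive with one deletion and one insertion."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body><w:p>'
        '<w:r><w:t>Price: </w:t></w:r>'
        '<w:del w:id="1" w:author="A"><w:r><w:delText>$900</w:delText></w:r></w:del>'
        '<w:ins w:id="2" w:author="A"><w:r><w:t>$1200</w:t></w:r></w:ins>'
        '</w:p></w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


report = count_tracked_changes(make_sample_docx())
print(report)
```

A non-zero count means the deleted text is still sitting in the file for the recipient to read. This only looks at the main document part, so comments, headers, and document properties would need their own checks.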

(I encouraged them to go further and switch to PDF format for the distribution, but they wouldn't.)


This should be required reading on redaction in the government. http://www.fas.org/sgp/othergov/dod/nsa-redact.pdf


Nice one. I had to giggle at the ubiquity of the MS Office assistant down in the corner... It looks like you're trying to conceal some information from prying eyes. Would you like help with that?


Here's a version with the redactions lifted and the areas highlighted. http://cryptome.org/tsa-screening.zip


It would be really interesting if the government attempts to go after you for hosting that document.


That's actually hosted by cryptome: http://en.wikipedia.org/wiki/Cryptome



It's a lengthy document. Luckily, they've highlighted the areas of interest.


They aren't all that remarkable.


A responsible citizen would not google the following:

"SENSITIVE SECURITY INFORMATION" site:.gov filetype:pdf

Less than 2000 such PDFs. A very determined and very irresponsible citizen could have already mirrored all of them.


Well, not from their home IP they wouldn't anyway.


I like watching Security Theater as much as the next guy, but this season's plot is so thin, I'm having trouble suspending disbelief.


I downloaded it earlier but it seems to have been taken down now. Very interesting. I bet that will lose some people their jobs - but I think it definitely highlights an underlying issue within the government bureaucracy that shows why people getting paid 30k with degrees from the University of Phoenix shouldn't be entrusted with national security... I wonder how many security people that passed through...


"bet that will lose some people their jobs"

I doubt it. The more likely outcome is they mandate all their employees attend document redaction training developed and administered at great taxpayer cost by a government contracting company. It's also quite likely that they use the incident to justify spending millions to have a government contractor write document redaction software.


Most interesting thing I learned from the document: members of Congress get special ID cards. It makes sense, I just never thought about it before.


Using the link in the article, it appears that it's already been taken down (link leads to a 404 page). However, from the comments in the article, it's been mirrored at http://cryptome.org (at the top of the list, in fact, filename is tsa-screening.zip, in case that makes it easier to find it).


Looking at the dates on the document, it was last revised in May 2008, so it seems like the horse left the stable some 18 months ago as far as actual bad guys and spies (who presumably seek out this information more actively than you or I) are concerned.


Pretty ridiculous. Why not just make one version for internal distribution, and a completely separate document for the public?


Because that would cut into your Solitaire time.


Searchable too. Are we feeling safer yet?


My sister is an attorney and asked me about software that can catch things like this, remove metadata, etc. Does anyone have experience with something along these lines?


Schneier is going to have a field day with this one.



