
This is broken in multiple ways, some obvious, some not.

1. Obviously most people jumped directly to OCR, and that works, of course. So, contrary to the OP, you can trivially render the first page to a high-resolution PNG and then OCR that, with what will probably be 100% accurate results (see the first sketch after this list). Sample image: https://i.imgur.com/hyJOSjY.jpg

2. This is just messing with glyphs in fonts, so one trivial way of undoing the changes losslessly (not even requiring OCR) is to create a mapping between each font glyph and the original character. For example, I was able to extract the font used for the text "CONTENTS" near the beginning of the sample document. It is named "SecureFont-1845559949-FranklinGothic-Demi", as extracted by mutool. In the PDF, "CONTENTS" is made up of eight unrelated Unicode characters, which nevertheless render as "CONTENTS" in that font (second sketch below).

3. Even if the first two methods somehow failed, the same glyph in a given font is reused every time the same English character appears. That makes the approach similar to a substitution cipher [1], which is trivially broken with frequency analysis. You could literally just copy/paste the fake "text" out of the PDF and derive the original text with an analysis tool (third sketch below). This isn't really significant since the PDF can be read by sight anyway, but it's worth pointing out.

[1] https://en.wikipedia.org/wiki/Substitution_cipher
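A minimal sketch of the point-1 route in Python, assuming the pdf2image and pytesseract packages (backed by poppler and tesseract) are installed; "document.pdf" is a hypothetical filename:

    from pdf2image import convert_from_path
    import pytesseract

    # Render page 1 at high resolution, then OCR the bitmap; the
    # glyph remapping is invisible at the pixel level.
    pages = convert_from_path("document.pdf", dpi=300, first_page=1, last_page=1)
    print(pytesseract.image_to_string(pages[0]))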
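For point 2, a sketch of how the glyph-to-character map could be recovered, assuming fontTools is installed and the font extracted by mutool is a TrueType/OpenType file; the filename is hypothetical, and the trick only works if glyph names survived subsetting:

    from fontTools.ttLib import TTFont

    font = TTFont("SecureFont-FranklinGothic-Demi.ttf")
    cmap = font["cmap"].getBestCmap()  # scrambled codepoint -> glyph name

    for codepoint, glyph_name in sorted(cmap.items()):
        # If glyph names survived (e.g. "C", "O", "N"), they give away
        # which real character each scrambled codepoint draws.
        print(f"U+{codepoint:04X} -> {glyph_name}")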
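And for point 3, a sketch of the frequency-analysis step, where `fake_text` stands in for whatever garbled string gets pasted out; a unigram alignment like this is only a starting guess for breaking a substitution cipher:

    from collections import Counter

    # English letters ordered from most to least frequent.
    ENGLISH_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"

    fake_text = "..."  # paste the garbled "text" copied from the PDF here

    counts = Counter(ch for ch in fake_text if not ch.isspace())

    # Pair the most common cipher symbols with the most common English
    # letters; refine by hand or with bigram/dictionary checks.
    guess = {sym: eng for (sym, _), eng in zip(counts.most_common(), ENGLISH_BY_FREQ)}
    print("".join(guess.get(ch, ch) for ch in fake_text))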




To point 1, it was trivial for me to open the PDF in Preview on macOS, save it out as a multi-page PNG, open that PNG in Preview, and let macOS's built-in Live Text copy out the text without error (in my limited test).


Yep. Even taking the most obvious approach, PDF -> PNG, PNG -> PDF, and running the PDF through Acrobat's "Optimize Scanned Pages" feature results in a PDF that is almost the exact same size as the original and has copyable text: https://0x0.st/oQvE.pdf

Even most of the columns scan correctly!
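For reference, a sketch of the same round trip without Acrobat, assuming pdf2image and pytesseract; Tesseract can emit a searchable PDF directly, and the filenames are hypothetical:

    from pdf2image import convert_from_path
    import pytesseract

    # Rasterize every page, then let Tesseract produce a searchable
    # PDF per page; merge the per-page files with any PDF tool after.
    pages = convert_from_path("obfuscated.pdf", dpi=300)
    for i, page in enumerate(pages):
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(page, extension="pdf")
        with open(f"page-{i}.pdf", "wb") as f:
            f.write(pdf_bytes)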


I took a screenshot of the address and pasted it into OneNote. Its 'Copy Text from Picture' feature extracted the text with no issues.


I believe this can be taken as an axiom: there's nothing a human eye can see that a machine cannot.


Isn't the opposite of your assertion the whole reason reCaptcha (and its dumbass hcaptcha competitor) exists?

Maybe you were asserting that only about written text, and not machine vision in the general case -- but even then I'd bet it only applies to very regular text, not handwritten items.


Are you asserting that reCaptcha is not solvable by OCR?

If so, it's a two-parter. First, the main reCaptcha checks happen via JS; that's what determines your likelihood of being a bot and the difficulty of the check. That's a cat-and-mouse game, but ultimately the client wins.

Second, the image challenges (road signs, red lights, etc.) can be solved via OCR-style machine vision, and there are services that do so. But reCaptcha keeps a decent number of the opportunistic actors out, so it's good enough for the industry.

But I will grant you that human captcha-solving services currently have a better solve rate, albeit at a higher price. The gap has been closing, though, and it's becoming more economical to use an automated solving service.


But there are many things a machine can read that are completely unintelligible to (most) humans.


except for beauty.


Sure thing. Machines can see, but they can't appreciate, evaluate, or contemplate the way humans can.


For 1 - this is spot on. There are tools to dump PDF text, but they are quite flawed because they don't know how the text is laid out. In many cases the text comes out jumbled, for example when there are columns. Therefore, for OCR it's better to convert the PDF to an image and let the OCR engine handle layout. Google's OCR, for example, will understand and output columns properly.
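For contrast with the rasterize-and-OCR sketch under point 1, a minimal sketch of the raw-dump route, assuming pdfminer.six is installed and with a hypothetical filename:

    from pdfminer.high_level import extract_text

    # A raw dump has no notion of columns, so multi-column pages
    # often come out interleaved or jumbled.
    print(extract_text("document.pdf"))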


So one thing I do to counter point 3 is have multiple non-English characters map to the same English character and pick a random one each time (sketched below). Depending on the input, there can be ~10 or so characters mapping to any single English letter. If you're advanced enough to know about a substitution cipher, you'll figure out how to convert to an image-based PDF and then use OCR on that. The reason I have the multiple mappings is so that if a layperson were trying to find all instances of "Billy", they could copy those characters and then search for "Ҽҙӈㅰベ", but the other instances of "Billy" might have the codepoints "Ҽтぃぃヴ".

Again, it's resistant to built-in PDF reader OCR, not bulletproof. I'm trying to thwart a crawler, a script kiddie, or a 50-year-old divorce attorney, not the denizens of HN.
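A minimal sketch of the homophonic mapping described above; the decoy characters in the table are illustrative stand-ins, not the tool's real mapping:

    import random

    # Illustrative fragment: the real table has ~10 decoys per letter
    # and covers the whole alphabet.
    HOMOPHONES = {
        "B": ["Ҽ", "Ᏼ", "Ꞵ"],
        "i": ["ҙ", "т", "ɩ"],
        "l": ["ӈ", "ぃ", "ŀ"],
        "y": ["ベ", "ヴ", "ƴ"],
    }

    def obfuscate(text: str) -> str:
        # Pick a random decoy each time, so two occurrences of the
        # same word encode differently and naive search misses them.
        return "".join(random.choice(HOMOPHONES.get(ch, [ch])) for ch in text)

    print(obfuscate("Billy"))  # e.g. Ҽҙӈぃベ
    print(obfuscate("Billy"))  # most likely a different string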


That's a fair point. I was trying to point out an interesting relationship between your approach and various forms of cryptography, but obviously this would not be the first-line solution to the problem. I think resistance to crawlers will come down to whether they implement OCR or not. I suspect some of the more sophisticated ones do. (A naive one might also be fooled by the fact that real glyphs are used and not attempt to OCR the text.)

BTW, you probably shouldn't say it's resistant to PDF reader OCR, because most PDF readers don't have OCR, AFAIK. They just pull the text from the document; that's not OCR. Software that does have OCR, like Adobe Acrobat, will not be fooled by your obfuscation if you render the document to a bitmap or textless vector first. If OCR doesn't work on the document as is, it's only because the presence of text glyphs fools it into thinking there's nothing there to perform optical character recognition on.


I don't think you're right on that second point. I'm pretty sure they do OCR, but they're only looking for image data to mine the text out of. The way they're coded now, they think the document is already all text, so they can't find any images to convert. Again, this is not an impervious approach. There are ways around it; it's just that crawlers don't go down that rabbit hole (right now).


In an earlier comment you said: "PDFs are a clusterfuck of glyphs floating in space". On that point you are spot on. A textual PDF is in essence simply a series of instructions to position glyphs in space (where "space" is the 2D "sheet of virtual paper" that the glyphs render onto).

But you are incorrect in asserting that PDF readers do OCR. Most do not, and even Acrobat did not have OCR capability built in for a good long time in its early history.

However, because the PDF file is simply instructions to position glyphs, if the PDF reader maintains a table of where on the 2D sheet it placed each glyph, select-and-copy can be performed by using the coordinates of the selection box to look up which glyphs were positioned in that region. Then, using either the reverse glyph map table (in PDF terms the optional, but recommended, ToUnicode CMap) or, if that is missing, simply outputting the code point values that chose those glyphs, you get the "text back out" without doing any OCR.
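A minimal sketch of that select-and-copy mechanism; all names are illustrative, and this models the viewer's bookkeeping, not any particular reader's internals:

    from dataclasses import dataclass

    @dataclass
    class PlacedGlyph:
        x: float
        y: float
        char: str  # from the reverse map, or the raw code point value

    def copy_selection(glyphs: list[PlacedGlyph],
                       x0: float, y0: float, x1: float, y1: float) -> str:
        # "Copy" is a rectangle lookup over the placement table,
        # ordered top-to-bottom then left-to-right. No OCR involved.
        hits = [g for g in glyphs if x0 <= g.x <= x1 and y0 <= g.y <= y1]
        hits.sort(key=lambda g: (-g.y, g.x))
        return "".join(g.char for g in hits)

    page = [PlacedGlyph(10, 700, "C"), PlacedGlyph(18, 700, "O"),
            PlacedGlyph(26, 700, "N")]
    print(copy_selection(page, 0, 690, 100, 710))  # -> "CON"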


Maybe (2) can be resolved by using a different mapping in each document, making them not scrapeable by bots?



