
This is broken in multiple ways, some obvious, some not.

1. Obviously most people jumped directly to OCR, and that works, of course. So, contrary to the OP, you can trivially render the first page to a high-resolution PNG and then OCR that, with what will probably be 100% accurate results (see the first sketch after this list). Sample image: https://i.imgur.com/hyJOSjY.jpg

2. This is just messing with glyphs in fonts, so one trivial way of undoing the changes losslessly (not even requiring OCR) is to create a mapping between each font glyph and the original character. For example, I was able to extract the font used for the text "CONTENTS" near the beginning of the sample document. It is named "SecureFont-1845559949-FranklinGothic-Demi", as extracted by mutool. In the PDF, "CONTENTS" is made up of eight unrelated Unicode characters, which nevertheless render as "CONTENTS" in that font (second sketch below).

3. Even if the first two methods somehow failed, the same glyph in a given font is reused every time the same English character appears. That makes the approach similar to a substitution cipher [1], which is trivially broken with frequency analysis. You could literally just copy/paste the fake "text" out of the PDF and derive the original text with an analysis tool (third sketch below). This isn't really significant since the PDF can be read by sight anyway, but it's worth pointing out.

[1] https://en.wikipedia.org/wiki/Substitution_cipher
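A minimal sketch of the point-1 route in Python, assuming the pdf2image and pytesseract packages (backed by poppler and tesseract) are installed; "document.pdf" is a hypothetical filename:

    from pdf2image import convert_from_path
    import pytesseract

    # Render page 1 at high resolution, then OCR the bitmap; the
    # glyph remapping is invisible at the pixel level.
    pages = convert_from_path("document.pdf", dpi=300, first_page=1, last_page=1)
    print(pytesseract.image_to_string(pages[0]))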
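For point 2, a sketch of how the glyph-to-character map could be recovered, assuming fontTools is installed and the font extracted by mutool is a TrueType/OpenType file; the filename is hypothetical, and the trick only works if glyph names survived subsetting:

    from fontTools.ttLib import TTFont

    font = TTFont("SecureFont-FranklinGothic-Demi.ttf")
    cmap = font["cmap"].getBestCmap()  # scrambled codepoint -> glyph name

    for codepoint, glyph_name in sorted(cmap.items()):
        # If glyph names survived (e.g. "C", "O", "N"), they give away
        # which real character each scrambled codepoint draws.
        print(f"U+{codepoint:04X} -> {glyph_name}")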
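And for point 3, a sketch of the frequency-analysis step, where `fake_text` stands in for whatever garbled string gets pasted out; a unigram alignment like this is only a starting guess for breaking a substitution cipher:

    from collections import Counter

    # English letters ordered from most to least frequent.
    ENGLISH_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"

    fake_text = "..."  # paste the garbled "text" copied from the PDF here

    counts = Counter(ch for ch in fake_text if not ch.isspace())

    # Pair the most common cipher symbols with the most common English
    # letters; refine by hand or with bigram/dictionary checks.
    guess = {sym: eng for (sym, _), eng in zip(counts.most_common(), ENGLISH_BY_FREQ)}
    print("".join(guess.get(ch, ch) for ch in fake_text))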




To point 1, it was trivial for me to open the PDF in Preview on macOS, save it out as a multi-page PNG, open that PNG in Preview, and let macOS's built-in Live Text copy out the text without error (in my limited test).


Yep. Even taking the most obvious approach, PDF -> PNG, PNG -> PDF, and running the PDF through Acrobat's "Optimize Scanned Pages" feature results in a PDF that is almost the exact same size as the original and has copyable text: https://0x0.st/oQvE.pdf

Even most of the columns scan correctly!
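For reference, a sketch of the same round trip without Acrobat, assuming pdf2image and pytesseract; Tesseract can emit a searchable PDF directly, and the filenames are hypothetical:

    from pdf2image import convert_from_path
    import pytesseract

    # Rasterize every page, then let Tesseract produce a searchable
    # PDF per page; merge the per-page files with any PDF tool after.
    pages = convert_from_path("obfuscated.pdf", dpi=300)
    for i, page in enumerate(pages):
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(page, extension="pdf")
        with open(f"page-{i}.pdf", "wb") as f:
            f.write(pdf_bytes)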


I took a screenshot of the address and pasted it into OneNote. Its 'Copy Text from Picture' feature extracted the text with no issues.


I believe this can be taken as an axiom: there's nothing a human eye can see that a machine cannot.


Isn't the opposite of your assertion the whole reason reCaptcha (and its dumbass hcaptcha competitor) exists?

Maybe you were asserting that only about written text, and not machine vision in the general case -- but even then I'd bet it only applies to very regular text, not handwritten items.


Are you asserting that reCaptcha is not solvable by OCR?

If so, it's a two-parter. First, the main reCaptcha checks happen via JS; that's what determines your likelihood of being a bot and the difficulty of the check. That's a cat-and-mouse game, but ultimately the client wins.

Second, the image challenges (road signs, red lights, etc.) can be solved via OCR-style machine vision, and there are services that do so. But reCaptcha keeps a decent number of the opportunistic actors out, so it's good enough for the industry.

But I will grant you that human captcha-solving services currently have a better solve rate, albeit at a higher price. The gap has been closing, though, and it's becoming more economical to use an automated solving service.


But there are many things a machine can read that are completely unintelligible to (most) humans.


except for beauty.


Sure thing. Machines can see, but they can't appreciate, evaluate, or contemplate the way humans can.


For 1 - this is spot on. There are tools to dump PDF text, but they are quite flawed because they don't know how the text is laid out. In many cases the text comes out jumbled, for example when there are columns. Therefore, for OCR it's better to convert the PDF to an image and let the OCR engine handle layout. Google's OCR, for example, will understand and output columns properly.
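For contrast with the rasterize-and-OCR sketch under point 1, a minimal sketch of the raw-dump route, assuming pdfminer.six is installed and with a hypothetical filename:

    from pdfminer.high_level import extract_text

    # A raw dump has no notion of columns, so multi-column pages
    # often come out interleaved or jumbled.
    print(extract_text("document.pdf"))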


So one thing I do to counter point 3 is have multiple non-English characters map to the same English character and pick a random one each time (sketched below). Depending on the input, there can be ~10 or so characters mapping to any single English letter. If you're advanced enough to know about a substitution cipher, you'll figure out how to convert to an image-based PDF and then use OCR on that. The reason I have the multiple mappings is so that if a layperson were trying to find all instances of "Billy", they could copy those characters and then search for "Ҽҙӈㅰベ", but the other instances of "Billy" might have the codepoints "Ҽтぃぃヴ".

Again, it's resistant to built-in PDF reader OCR, not bulletproof. I'm trying to thwart a crawler, a script kiddie, or a 50-year-old divorce attorney, not the denizens of HN.
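A minimal sketch of the homophonic mapping described above; the decoy characters in the table are illustrative stand-ins, not the tool's real mapping:

    import random

    # Illustrative fragment: the real table has ~10 decoys per letter
    # and covers the whole alphabet.
    HOMOPHONES = {
        "B": ["Ҽ", "Ᏼ", "Ꞵ"],
        "i": ["ҙ", "т", "ɩ"],
        "l": ["ӈ", "ぃ", "ŀ"],
        "y": ["ベ", "ヴ", "ƴ"],
    }

    def obfuscate(text: str) -> str:
        # Pick a random decoy each time, so two occurrences of the
        # same word encode differently and naive search misses them.
        return "".join(random.choice(HOMOPHONES.get(ch, [ch])) for ch in text)

    print(obfuscate("Billy"))  # e.g. Ҽҙӈぃベ
    print(obfuscate("Billy"))  # most likely a different string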


That's a fair point. I was trying to point out an interesting relationship between your approach and various forms of cryptography, but obviously this would not be the first-line solution to the problem. I think resistance to crawlers will come down to whether they implement OCR or not. I suspect some of the more sophisticated ones do. (A naive one might also be fooled by the fact that real glyphs are used and not attempt to OCR the text.)

BTW, you probably shouldn't say it's resistant to PDF reader OCR, because most PDF readers don't have OCR, AFAIK. They just pull the text from the document; that's not OCR. Software that does have OCR, like Adobe Acrobat, will not be fooled by your obfuscation if you render the document to a bitmap or textless vector first. If OCR doesn't work on the document as is, it's only because the presence of text glyphs fools it into thinking there's nothing there to perform optical character recognition on.


I don't think you're right on that second point. I'm pretty sure they do OCR, but they're only looking for image data to mine the text out of. The way they're coded now, they think the document is already all text, so they can't find any images to convert. Again, this is not an impervious approach. There are ways around it; it's just that crawlers don't go down that rabbit hole (right now).


In an earlier comment you said: "PDFs are a clusterfuck of glyphs floating in space". On that point you are spot on. A textual PDF is in essence simply a series of instructions to position glyphs in space (where "space" is the 2D "sheet of virtual paper" that the glyphs render onto).

But you are incorrect in asserting that PDF readers do OCR. Most do not, and even Acrobat did not have OCR capability built in for a good long time in its early history.

However, because the PDF file is simply instructions to position glyphs, if the PDF reader maintains a table of where on the 2D sheet it placed each glyph, select-and-copy can be performed by using the coordinates of the selection box to look up which glyphs were positioned in that region. Then, using either the reverse glyph map table (in PDF terms the optional, but recommended, ToUnicode CMap) or, if that is missing, simply outputting the code point values that chose those glyphs, you get the "text back out" without doing any OCR.
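A minimal sketch of that select-and-copy mechanism; all names are illustrative, and this models the viewer's bookkeeping, not any particular reader's internals:

    from dataclasses import dataclass

    @dataclass
    class PlacedGlyph:
        x: float
        y: float
        char: str  # from the reverse map, or the raw code point value

    def copy_selection(glyphs: list[PlacedGlyph],
                       x0: float, y0: float, x1: float, y1: float) -> str:
        # "Copy" is a rectangle lookup over the placement table,
        # ordered top-to-bottom then left-to-right. No OCR involved.
        hits = [g for g in glyphs if x0 <= g.x <= x1 and y0 <= g.y <= y1]
        hits.sort(key=lambda g: (-g.y, g.x))
        return "".join(g.char for g in hits)

    page = [PlacedGlyph(10, 700, "C"), PlacedGlyph(18, 700, "O"),
            PlacedGlyph(26, 700, "N")]
    print(copy_selection(page, 0, 690, 100, 710))  # -> "CON"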


Maybe (2) can be resolved by using a different mapping in each document, making them not scrapeable by bots?



