
I literally used an OCR tool to grab the text directly out of the first box. I think this is meant to be guarding against copy/pasting—not OCR.



Yeah, that was not an accurate choice of terminology... as it says in the "more info" box,

> Resistant to Optical Character Recognition (OCR), most laypeople will need to print+rescan to OCR

If print+rescan (or equivalently, screen-grab+OCR) works, which it will, then it's hardly OCR-resistant!

The only thing this "blocks" is text extraction from the PDF with things like copy/paste or pdftotext/html/whatever conversion tools, which will "see" the codepoints used rather than the glyph images.
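A toy illustration of that codepoint-versus-glyph distinction, assuming (nobody in the thread has confirmed this) that the tool stores scrambled codepoints and ships a font whose glyphs undo the scramble. ROT13 is only a stand-in here; the real mapping is unknown.

```shell
#!/bin/sh
# Toy model only: ROT13 stands in for whatever codepoint-to-glyph
# mapping the real tool uses (unknown; this is an assumption).

visible='sworn testimony'

# What would be stored in the PDF text layer: scrambled codepoints.
stored=$(printf '%s' "$visible" | tr 'A-Za-z' 'N-ZA-Mn-za-m')

# What copy/paste or pdftotext hands back: the stored codepoints as-is.
extracted=$stored

# What the custom font would draw: glyphs that undo the scramble.
# A screenshot, or any OCR pass over the rendered page, sees this.
rendered=$(printf '%s' "$stored" | tr 'A-Za-z' 'N-ZA-Mn-za-m')

printf 'extracted: %s\n' "$extracted"
printf 'rendered:  %s\n' "$rendered"
```

The extracted string is gibberish while the rendered string is the original, which is why anything that works from pixels is unaffected.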


So this is interesting. I guess I didn't realize that there are (common?) tools to OCR screenshots. And to that end, there probably isn't a whole lot I can do to stop it. But when you're looking at a huge tax return, or sworn testimony, or just a dump of 3000 emails, you're not gonna screenshot each one. You're going to want to automate the OCR, which most PDF readers (at least the commercial ones) will let you do. That is the type of OCR my app is resistant to. They look for image data within the PDF and OCR that. They bypass my text because, to the PDF reader, it is already in a text format.

I'm 1000% sure there are gurus who could whip up a script to overcome this. But it's kind of one of those things where you don't have to outrun the bear, you have to outrun your friend running next to you. It makes your sensitive documents just that much less likely to be scanned/found.


Nothing that you said is wrong but it doesn't make the situation better.

1) As many people pointed out, this doesn't prevent OCR, it just prevents copying strings (e.g. with crawlers).

2) The majority of OCR doesn't deal with PDFs produced from a text source, but with either a) JPG scans of documents or b) PDFs produced from those JPG scans.

3) The first thing I tried was OCR with my iPhone, and it obviously worked. As someone else said, there are solutions that let you batch process many documents.

Don't get me wrong, your stuff works for what you designed it to do. However, it provides a false sense of security by falsely claiming that it prevents OCR, which in turn can lead to more harm[1].

[1] - e.g., it may convince people to share stuff that they wouldn't otherwise.


> I didn't realize that there are (common?) tools to OCR screenshots.

Retrieving text from images is literally the definition of OCR.


I think the main issue is that you have no idea what OCR is: https://en.wikipedia.org/wiki/Optical_character_recognition


> I didn't realize that there are (common?) tools to OCR screenshots.

This seems like quite the oversight to me...


But if you have a huge tax document, you're likely not going to screenshot page by page. Yes, there are ways to automate this. But if you're a 50-year-old divorce attorney, you're going to click on the "OCR" button in your PDF reader and it will not work.


You don’t have to screenshot every page… convert the PDF to a PNG/TIFF image for every page, and OCR those. This is very easy to automate. If this is working with Unicode code points, you’re not blocking OCR, you’re obfuscating text. Anything that renders the PDF to a raster format will produce an OCR-able document.
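For a sense of scale, here is a rough sketch of that render-then-OCR route applied to a whole directory, assuming Ghostscript and Tesseract are installed. `ocr_dir` is a made-up function name, and the `-r300` resolution is an assumption added here on the grounds that OCR engines generally do better around 300 dpi.

```shell
#!/bin/sh
# Sketch: OCR every PDF in a directory by rasterizing it first.
# ocr_dir is an illustrative name, not part of any tool.
ocr_dir() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # glob matched nothing
        base=${pdf%.pdf}
        # Render all pages into one multi-page TIFF. The PDF text layer,
        # and whatever codepoints it holds, is discarded at this step.
        gs -sDEVICE=tiffscaled24 -r300 -dNOPAUSE -dBATCH -dSAFER \
            -sOutputFile="$base.tiff" "$pdf" &&
        # tesseract appends .txt to the output base on its own.
        tesseract "$base.tiff" "$base"
    done
}
```

Calling `ocr_dir ./some_folder` would leave a `.txt` next to each PDF; a 3000-document dump is the same loop, just slower.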

If you’re a divorce attorney who used this to convert documents in response to a discovery request, and the opposing side had a valid reason for needing the unobfuscated text, then you’re probably going to end up having a nice conversation with the judge about acceptable formats.

Sending compressed TIFFs would probably be just as good. The files would be a bit larger, but it would be just as effective at stopping automated scraping of text. Also, less likely to piss off a judge. Any opposing firm sophisticated enough to automate scraping the text from a normal PDF would be able to OCR these files just as easily.

Or maybe you have a second site that sells the decoder, so you get to sell to both sides. Not a bad business model, if you can work it.


I don't know why you think divorce attorneys are stupid. Some are probably very well versed in tech; those who aren't know others who are. They won't simply sit there and think "oh, for some reason I can't copy-paste from that PDF, better give up the case then".

... And most attorneys simply print documents. Once the PDF is on paper, OCR-ing it back into text is just one scanner away.


I understand that. And I understand your personal use case was valid. But I think your "Human Eyes Only" name and domain are a little deceptive.


I’m not sure what reader you are talking about, but that button is most certainly not doing any kind of OCR if your technique stops it.


It's actually much easier than you think.

You don't need any scripts, just Acrobat itself (or any comparable PDF viewer) can do this. Export the PDF to images, make a new PDF out of the images, scan the text, done.

Example (took your example and did just that with it, now everything can be copied & pasted as normal text): https://filebin.net/qse2e0oaqkl1hjof/ocred.pdf

In general, if it LOOKS like text, SOMETHING can OCR it. That's the whole point of OCR. If you want to try to block OCR, you need something like CAPTCHAs, and that's getting less and less effective every day. In fact many are already more easily solved by computers than humans.


Of course. The OP doesn't understand what "OCR" actually means.


>It is against that type of OCR that my app is resistant to.

There is no form of OCR this is resistant to. Simply change the description to be accurate and remove the references to being OCR resistant, as they are false.


>I'm 1000% sure there are gurus who could whip up a script to overcome this. But it's kind of one of those things where you don't have to outrun the bear, you have to outrun your friend running next to you. It makes your sensitive documents just that much less likely to be scanned/found.

Security through obscurity is stupid:

    gs -sDEVICE=tiffscaled24 -dNOPAUSE -dBATCH -dSAFER \
       -sOutputFile=filename.tiff \
       filename.pdf
    # tesseract appends .txt to the output base, so this writes filename.txt
    tesseract filename.tiff filename

All you need is ghostscript and tesseract. Both are an apt-get away.


If it's on the dark web then they probably know how to use 'ocrmypdf' as well (which uses tesseract under the hood).

https://github.com/ocrmypdf/OCRmyPDF
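A minimal sketch of what that might look like, assuming OCRmyPDF is installed. As far as I know it declines by default to touch pages that already carry a text layer, so the sketch passes `--force-ocr` to rasterize and re-OCR anyway. `force_ocr` is a made-up wrapper name.

```shell
#!/bin/sh
# Sketch: have OCRmyPDF ignore the existing (obfuscated) text layer.
# force_ocr is an illustrative wrapper name.
force_ocr() {
    in=$1
    out=$2
    # --force-ocr rasterizes each page and OCRs it even though the page
    # already contains text; --sidecar also writes the recognized text
    # to a plain .txt file next to the output PDF.
    ocrmypdf --force-ocr --sidecar "${out%.pdf}.txt" "$in" "$out"
}
```

Something like `force_ocr obfuscated.pdf readable.pdf` would then produce a PDF whose text layer comes from the rendered glyphs rather than the stored codepoints.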


I'm fairly sure you can just open up any image (not sure if there's a limit on size or complexity) on a Mac and use the select all shortcut to grab all the text to use for whatever you'd like.


macOS does it by default now. I've found it to be very useful at times.

Any image you open in Safari, Preview, etc. (official Apple programs) will be OCR'd automatically, allowing you to extract the text with copy+paste. I think it works with PDFs, but I haven't tested it.

https://support.apple.com/guide/preview/interact-with-text-i...


And tools like pdftotext...it effectively breaks that.



