This is done with OCR, especially when digitizing things. The Internet Archive do...

the-dude · on Aug 17, 2019

Thanks for the heads-up. Of course I do know what OCR is. Just hadn't seen it combined in the wild like this.

rayiner · on Aug 17, 2019

Acrobat has had this as a built in feature forever.

the-dude · on Aug 17, 2019

Didn't know that either. Makes me wonder if my browser (PDF.js) is doing the OCR? Does anybody know?

philipkglass · on Aug 17, 2019

He means that Acrobat Pro includes an OCR system that you can use to add a searchable text layer to scanned documents. Readers like Acrobat Reader and PDF.js do not perform OCR. You won't be able to use them to search scanned documents if the document creator did not run OCR.

Google runs its own OCR pass on scanned PDF documents in order to index them better. It can be annoying when you get a 50-page scanned document as a search result and then find out that it doesn't include a text layer, so you need to run your own OCR or skim the whole thing to find the relevant parts.
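
For anyone who wants to check up front whether a scanned PDF carries a text layer, here is a minimal sketch. It assumes the third-party pypdf package is installed, and the function names and the `min_chars` threshold are made up for illustration. A page whose extracted text is nearly empty most likely never went through OCR, though this is only a heuristic.

```python
from typing import List


def pages_missing_text_layer(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is (nearly) empty.

    min_chars is an arbitrary, hypothetical threshold; tune it for your documents.
    """
    missing = []
    for number, text in enumerate(page_texts, start=1):
        # Drop all whitespace so a page holding only spaces or newlines counts as empty.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            missing.append(number)
    return missing


def extract_page_texts(path: str) -> List[str]:
    """Read the embedded text of each page with pypdf (assumed to be installed)."""
    from pypdf import PdfReader  # imported lazily so the check above works without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

Something like `pages_missing_text_layer(extract_page_texts("scan.pdf"))` would then list the pages that still need OCR, where `scan.pdf` is a placeholder file name.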

visarga · on Aug 18, 2019

What PDF.js is showing is an invisible text layer overlaid on top of the original image. It does not do OCR, which can take 1-2 seconds per page; that would be too slow and would require a large-ish neural net if you care about accuracy.
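
To make the "invisible text layer" concrete: in the PDF format, text drawn with text rendering mode 3 is neither filled nor stroked, so it is selectable and searchable but not visible. Below is a rough sketch of how OCR output could be turned into such content-stream operators. The `Word` tuple, the function names, and the `/F1` font resource name are assumptions for illustration, not what any particular OCR tool emits, and real tools also stretch the glyphs to match each word's box.

```python
from typing import List, Tuple

# Hypothetical OCR result: (text, x, y, font_size), all in PDF points.
Word = Tuple[str, float, float, float]


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words: List[Word], font: str = "/F1") -> str:
    """Build content-stream operators that place each word as invisible text.

    The font name must exist in the page's resource dictionary (assumed here).
    """
    lines = ["BT", "3 Tr"]  # begin text object; rendering mode 3 = invisible
    for text, x, y, size in words:
        lines.append(f"{font} {size:g} Tf")      # select font and size
        lines.append(f"1 0 0 1 {x:g} {y:g} Tm")  # move to the word's position
        lines.append(f"({escape_pdf_string(text)}) Tj")  # show the text
    lines.append("ET")  # end text object
    return "\n".join(lines)
```

Appending the resulting operators to a page that already draws the scanned image gives the behaviour described above: the viewer shows the picture, while search and copy operate on the hidden text. Non-ASCII text would need proper font encoding, which this sketch ignores.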

coldtea · on Aug 18, 2019

No, the OCR has already happened at the time of the scanning (or shortly thereafter), and the result is embedded into the PDF document.

agumonkey · on Aug 17, 2019

True, but like the parent, I only realized that very recently.

alttab · on Aug 17, 2019

Guy's login is "the-dude," he's been around for a while.