
This is done with OCR, especially when digitizing things.

The Internet Archive does this a lot. There was an old 90s book that I needed, but no electronic edition was ever made. So the Internet Archive scanned it and ran character recognition on it to make a selectable PDF like this one, as well as an EPUB that can be read on your phone like any other ebook.

Now that book is available for anyone to borrow and read. (It's still copyrighted, so you gotta access it via their DRM-controlled app/website, but that can be easily broken, and it's better than not having access to that book at all.)




Thanks for the heads-up. Of course I do know what OCR is. Just hadn't seen it combined in the wild like this.


Acrobat has had this as a built-in feature forever.


Didn't know that either. Makes me wonder if my browser (PDF.js) is doing the OCR? Does anybody know?


He means that Acrobat Pro includes an OCR system that you can use to add a searchable text layer to scanned documents. Readers like Acrobat Reader and PDF.js do not perform OCR. You won't be able to use them to search scanned documents if the document creator did not run OCR.

Google runs its own OCR pass on scanned PDF documents in order to index them better. It can be annoying when you get a 50-page scanned document as a search result and then find out that it doesn't include a text layer, so you need to run your own OCR or skim the whole thing to find the relevant parts.


What PDF.js is showing is an invisible text layer overlaid on top of the original image. It does not do OCR itself. OCR can take up to 1-2 seconds per page, so doing it in the viewer would be too slow, and it would require a large-ish neural net if you care about accuracy.
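
To make the "invisible text layer" idea concrete, here is a rough toy model in Python. It is only a sketch, not how PDF.js or the PDF format actually stores things: the `Word`, `ScannedPage`, and `find` names, the file names, and the coordinates are all made up for illustration. The point is just that search and selection work on words an earlier OCR run stored next to the image, so the viewer never has to look at pixels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # One OCR'd word plus the box where it sits on the page image.
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class ScannedPage:
    image_file: str         # the scan the viewer actually paints
    text_layer: List[Word]  # invisible words saved by an earlier OCR run


def find(page: ScannedPage, query: str) -> List[Word]:
    """Return the words containing the query (case-insensitive).

    A viewer could highlight the returned boxes on top of the image.
    No OCR happens here: only the stored text layer is consulted.
    """
    q = query.lower()
    return [w for w in page.text_layer if q in w.text.lower()]


# Hypothetical page that went through OCR when it was digitized.
page = ScannedPage(
    image_file="page-001.png",
    text_layer=[
        Word("Optical", 72, 700, 48, 12),
        Word("Character", 124, 700, 62, 12),
        Word("Recognition", 190, 700, 78, 12),
    ],
)

# Hypothetical page where nobody ran OCR: image only, no text layer.
no_ocr = ScannedPage(image_file="page-002.png", text_layer=[])

for w in find(page, "recog"):
    print(w.text, (w.x, w.y, w.width, w.height))

print(find(no_ocr, "recog"))  # nothing to search, so the list is empty
```

The second page is the annoying case from the comment above: the scan looks identical on screen, but with no stored text layer there is nothing for search to match.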


No, the OCR has already happened at the time of the scanning (or shortly thereafter), and the result is embedded into the PDF document.


True, but like the parent, I realized that only very recently.


Guy's login is "the-dude," he's been around for a while.



