Hacker News new | past | comments | ask | show | jobs | submit login

it dosnt make sense to use ocr for this. libraries such as aspose will do much better



Apologies for not being clear in my OP. The PDF is a digital-native image produced from a text document, but without embedded or searchable text. Looking at the PDF in full resolution, there are not artifacts, blurry characters, or any alignment or uneven scale issues that are troublesome when attempt to OCR a scan or photograph. It looks exactly like a Word document, but without selectable or editable text.

So, I do have to use OCR, right?


Maybe not. The PDF probably has an embedded text (so it doesn't blur when zomming in) but it could be either cinverted into vector curves or protected from copying (see properties). The easiest way is to change the PDF export settings in Word/Ghostscript/Distiller.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: