Hacker News

For 1 - this is spot on. There are tools to dump the text from a PDF, but they are quite flawed because the dump doesn't tell you how the text was laid out on the page. In many cases the text comes out jumbled, for example when the page has columns of text. So it's better to convert the PDF to an image and let OCR handle it. Google's OCR, for example, will recognize columns and output them properly.
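To make the column problem concrete, here is a toy sketch in Python. It is not how any particular PDF tool or OCR engine works; the fragments, their coordinates, and the `split_x` column boundary are all made up for illustration. It only shows how reading positioned text row by row interleaves two columns, while a reader that knows where the columns are keeps each one together.

```python
# Toy model: each fragment is (x, y, text), with y growing down the page.
# The coordinates and the column boundary are hypothetical.
fragments = [
    (0, 0, "Left column, line 1."),
    (300, 0, "Right column, line 1."),
    (0, 20, "Left column, line 2."),
    (300, 20, "Right column, line 2."),
]


def naive_order(frags):
    """Read top to bottom, then left to right, ignoring columns."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]


def column_order(frags, split_x):
    """Read the whole left column first, then the whole right column."""
    left = sorted((f for f in frags if f[0] < split_x), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= split_x), key=lambda f: f[1])
    return [text for _, _, text in left + right]


jumbled = naive_order(fragments)
readable = column_order(fragments, split_x=150)

print(" ".join(jumbled))   # lines from the two columns alternate
print(" ".join(readable))  # each column is read in full
```

A real page doesn't hand you the column boundary, which is why an OCR engine that does its own layout analysis tends to do better here.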


