
Why do you use OCR and not PDF to text conversion?



Probably because the PDF is just one big image, if I understand correctly. Otherwise it should be a simple copy and paste from the PDF.


Right. It's an image PDF generated from a text file, so no digital-to-analog-to-digital errors are introduced. These files should be perfect OCR candidates, but the output of every tool I've found is full of errors: missing portions of sentences, rearranged fragments, etc.


> The PDF files are not scans. They are PDF files created from a Word file.

I am unsure as to why he can't just copy / paste.


Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.

I don't even care about perfect formatting, that's easy to fix. I do care about perfect OCR. That's crucial.
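
Since the PDFs come from a text file, one way to judge how close a tool gets to "perfect" is to compare its output against that original text where it is still available. Below is a minimal sketch using only Python's standard `difflib`; the function names `ocr_similarity` and `ocr_diff` are made up for illustration, and it assumes you already have both the reference text and the OCR output as strings.

```python
import difflib


def ocr_similarity(reference: str, ocr_output: str) -> float:
    """Return a 0..1 similarity ratio between reference text and OCR text."""
    # Collapse whitespace so line-wrapping differences are not counted as errors.
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    return difflib.SequenceMatcher(None, ref, out, autojunk=False).ratio()


def ocr_diff(reference: str, ocr_output: str) -> list[tuple[str, str, str]]:
    """List (operation, reference_words, ocr_words) for every word-level mismatch."""
    ref_words = reference.split()
    out_words = ocr_output.split()
    matcher = difflib.SequenceMatcher(None, ref_words, out_words, autojunk=False)
    return [
        (op, " ".join(ref_words[i1:i2]), " ".join(out_words[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    reference = "I do care about perfect OCR. That's crucial."
    ocr_text = "I do care about perfect 0CR. crucial."
    print(f"similarity: {ocr_similarity(reference, ocr_text):.3f}")
    for op, ref_part, ocr_part in ocr_diff(reference, ocr_text):
        print(op, repr(ref_part), "->", repr(ocr_part))
```

The word-level diff is what surfaces the dropped words and substituted characters described above. Fragments that were rearranged will generally show up as a delete in one place and an insert in another rather than as a single "moved" operation.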



