Why do you use OCR and not PDF to text conversion?

angry-hacker · on Oct 12, 2016

Probably because the PDF is just a big image file, if I understand correctly. Otherwise it should just be a copy and paste from the PDF.

iplaw · on Oct 13, 2016

Right. It's an image PDF generated from a text file, so there are no digital-to-analog-to-digital errors introduced. These files should be perfect OCR candidates, but the output from everything that I've found is full of errors: missing portions of sentences, rearranged fragments, etc.

pitaj · on Oct 13, 2016

> The PDF files are not scans. They are PDF files created from a Word file.

I am unsure as to why he can't just copy / paste.

iplaw · on Oct 13, 2016

Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.

I don't even care about perfect formatting, that's easy to fix. I do care about perfect OCR. That's crucial.
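
For anyone wanting to check which case a given file falls into before choosing between copy/paste and OCR, here is a rough sketch in Python. It is only a heuristic under stated assumptions: it scans the raw bytes for a font resource (`/Font`) and an image XObject (`/Subtype /Image`), and the function name `looks_image_only` is made up for illustration. PDFs that keep their dictionaries in compressed object streams, or that draw text as vector outlines, will likely fool it, so a proper PDF library would be more reliable.

```python
import re

# Hypothetical helper, not from any library: guesses whether a PDF is
# image-only (needs OCR) by looking for uncompressed dictionary keys.
def looks_image_only(pdf_bytes: bytes) -> bool:
    # A text layer normally needs at least one font resource.
    has_font = b"/Font" in pdf_bytes
    # Page images are stored as XObjects with /Subtype /Image.
    has_image = re.search(rb"/Subtype\s*/Image", pdf_bytes) is not None
    # Images but no fonts: probably nothing to copy/paste, so OCR it.
    return has_image and not has_font


def needs_ocr(path: str) -> bool:
    # Assumption: the file is small enough to read into memory at once.
    with open(path, "rb") as f:
        return looks_image_only(f.read())
```

If `needs_ocr` returns `False`, ordinary text extraction is worth trying first; if it returns `True`, the pages would have to be rendered and run through an OCR engine.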