> Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and...

bredren · on July 19, 2021

I have permission to publish an ebook edition of an out-of-print history of Portland, Oregon. I haven’t found the time to work on the project.

One point of friction has been selecting an OCR workflow. Any chance you would share what you’ve been successful with?

petespeed · on July 19, 2021

I built a simple pipeline with Bash and Python. I did it for free as a learning exercise, but it has been deployed and used in a professional setting on a daily basis for almost a year now. (Use case: faxes with headers and tabular data.)

Most of the time was spent on field parsing and validating the OCR output (e.g. is this a valid date?). At one point I realized that playing with the Tesseract config was giving only marginal improvements, and that investing in post-OCR parsing/wrangling was more valuable. For example, in a date column, if the OCR says "b", treat it as "6" and flag the record as low confidence.
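
Roughly what that kind of repair step could look like in Python. This is only a sketch, not the actual pipeline: the function name `parse_date_field`, the `MM/DD/YYYY` format, and every look-alike mapping other than "b" to "6" are assumptions made up for illustration.

```python
from datetime import datetime

# Hypothetical look-alike table. Only b -> 6 comes from the example above;
# the other pairs are assumptions and should come from your own error logs.
DIGIT_LOOKALIKES = str.maketrans(
    {"b": "6", "O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def parse_date_field(raw, fmt="%m/%d/%Y"):
    """Return (date_or_None, low_confidence) for one OCR'd date cell."""
    text = raw.strip()
    # Swap letters that commonly stand in for digits in a numeric column.
    repaired = text.translate(DIGIT_LOOKALIKES)
    try:
        value = datetime.strptime(repaired, fmt).date()
    except ValueError:
        # Still not a valid date after repair: keep the record, flag it.
        return None, True
    # A date that needed repair is usable but should be reviewed.
    return value, repaired != text
```

Calling `parse_date_field("0b/12/2021")` gives back June 12, 2021 with the low-confidence flag set, while a clean `"06/12/2021"` comes back unflagged. Flagging instead of silently fixing keeps a human in the loop for the records most likely to be wrong.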

One new nice-to-have use case a customer asked for was handling pages with varying orientation, which I couldn't hack together quickly.

hkt · on July 19, 2021

I use a £15 arm with a vice grip for my phone from Amazon, copy the files to my laptop and then run a bash for-loop of the tesseract CLI over the resultant files.
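
For anyone who would rather not write the loop in Bash, here is a rough Python equivalent. It is a sketch under assumptions: the folder names, the `*.jpg` pattern, the `eng` language, and the helper names `tesseract_command` and `ocr_folder` are all hypothetical, and it expects the `tesseract` binary to be on your PATH.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, out_dir, lang="eng"):
    """Build the CLI call: tesseract <image> <outputbase> -l <lang>."""
    # Tesseract appends ".txt" to the output base on its own.
    outputbase = Path(out_dir) / Path(image).stem
    return ["tesseract", str(image), str(outputbase), "-l", lang]


def ocr_folder(in_dir, out_dir, pattern="*.jpg", run=subprocess.run):
    """Run Tesseract over every matching image, in sorted order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = [
        tesseract_command(path, out_dir)
        for path in sorted(Path(in_dir).glob(pattern))
    ]
    for cmd in commands:
        # check=True stops the loop on the first page Tesseract fails on.
        run(cmd, check=True)
    return commands
```

Something like `ocr_folder("phone_photos", "ocr_text")` would then leave one `.txt` file per photo in `ocr_text`. The `run` parameter is only there so the loop can be dry-run with a stand-in function.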

I use https://github.com/4lex4/scantailor-advanced to deskew the images and generate the PDF.

It isn't perfect but my purposes are more around research than publication, so, YMMV!

bredren · on July 19, 2021

Thanks for this and the other replies!

bostonsre · on July 19, 2021

My company uses it successfully on documents with typed and handwritten text.