> Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.

I have used Tesseract for OCRing scanned books and it was great. I had no idea it was so old, nor that it had been through so many maintainers. To all of them past and present, thank you.




I have permission to publish an ebook edition of an out of print history of Portland, Oregon. I haven’t found the time to work on the project.

One point of friction has been selecting an OCR workflow. Any chance you would share what you’ve been successful with?


I built a simple pipeline with Bash and Python. I did it for free as a learning exercise, but it has been deployed and used daily in a professional setting for almost a year now. (Use case: faxes with headers and tabular data.)

Most of the time was spent on field parsing and validating the OCR output (e.g. is this a valid date?). At one point I realized that tweaking the Tesseract config was giving only marginal improvements, and that investing in post-OCR parsing and wrangling was more valuable. For example, in a date column, if the OCR says "b", treat it as "6" and flag the record as low confidence.
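A minimal sketch of that kind of post-OCR repair, in Python. Only the b-to-6 substitution comes from the comment above; the other character pairs, the `repair_date` name, and the MM/DD/YYYY format are assumptions for illustration, not the actual pipeline.

```python
from datetime import datetime

# Hypothetical map of characters OCR tends to confuse with digits.
# Only "b" -> "6" is from the comment; the rest are assumed examples.
CONFUSABLE = {"b": "6", "O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}


def repair_date(raw, fmt="%m/%d/%Y"):
    """Return (date_or_None, low_confidence) for one OCR'd date cell.

    The date format is an assumption; adjust it to the documents at hand.
    """
    text = raw.strip()
    # Swap look-alike letters for the digits they probably were.
    fixed = "".join(CONFUSABLE.get(ch, ch) for ch in text)
    # Any substitution means the value was guessed, so flag it for review.
    low_confidence = fixed != text
    try:
        parsed = datetime.strptime(fixed, fmt).date()
    except ValueError:
        # Still not a valid date after repair: no value, flagged.
        return None, True
    return parsed, low_confidence
```

A clean cell comes back unflagged, a repaired cell comes back with the flag set, and anything that still fails to parse returns `None` so it can go to manual review.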

One nice-to-have the customer asked for later was handling pages with varying orientation, which I couldn't hack together quickly.
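One possible starting point for the orientation problem, not something the pipeline above did: Tesseract has an orientation and script detection mode (`--psm 0`) that prints a `Rotate:` line. The sketch below assumes the `tesseract` binary and its `osd` traineddata are installed, and that the `Rotate:` value is the correction angle to apply; the function names are made up for illustration.

```python
import re
import subprocess


def parse_osd(osd_text):
    """Pull the integer after 'Rotate:' out of Tesseract OSD output, or None."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def detect_rotation(image_path):
    """Run Tesseract in OSD-only mode and return the reported rotation.

    Assumes `tesseract` is on PATH with osd.traineddata available.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some versions write the OSD report to stderr, so search both streams.
    return parse_osd(result.stdout + "\n" + result.stderr)
```

Detection can be unreliable on pages with little text, so it is probably worth checking the reported confidence before rotating anything automatically.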


I use a £15 phone arm with a vice grip from Amazon to photograph the pages, copy the files to my laptop, and then run a bash for-loop of the tesseract CLI over the resulting files.

I use https://github.com/4lex4/scantailor-advanced to deskew the images and generate the PDF.

It isn't perfect but my purposes are more around research than publication, so, YMMV!
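For anyone who wants the batch step spelled out, here is a rough equivalent of that loop, sketched in Python rather than bash. The folder layout, the suffix list, the `eng` language code, and the function names are all assumptions; the only Tesseract usage relied on is the basic `tesseract IMAGE OUTPUTBASE -l LANG` form.

```python
import subprocess
from pathlib import Path

# Assumed set of image types; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def build_commands(folder, lang="eng"):
    """Build one `tesseract IMAGE OUTPUTBASE -l LANG` command per image."""
    commands = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            # Tesseract appends .txt to the output base: page1.jpg -> page1.txt
            output_base = image.with_suffix("")
            commands.append(["tesseract", str(image), str(output_base), "-l", lang])
    return commands


def ocr_folder(folder, lang="eng"):
    """Run Tesseract over every image in the folder, stopping on the first failure."""
    for command in build_commands(folder, lang):
        subprocess.run(command, check=True)
```

Calling `ocr_folder("scans")` on a hypothetical `scans` directory would leave a `.txt` file next to each image. Building the commands separately from running them makes it easy to print them first as a dry run.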


Thanks for this and the other replies!


My company uses it successfully on documents with both typed and handwritten text.



