> Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.
I have used Tesseract for OCRing scanned books and it was great. I had no idea it was so old, nor that it had been through so many maintainers. To all of them past and present, thank you.
I built a simple pipeline with bash and Python. I did it for free as a learning exercise, but it has been deployed and used in a professional setting on a daily basis for almost a year now. (Use case: faxes with headers and tabular data.)
Most of the time was spent on field parsing and validating the OCR output (e.g. is this a valid date?). At one point I realized that playing with the Tesseract config was giving only marginal improvements, and that investing in post-OCR parsing/wrangling was more valuable. For example, in a date column, if the OCR says b, consider it a 6 and flag the record as low confidence.
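To make the date-column idea concrete, here is a minimal sketch of that kind of post-OCR fix-up. The lookalike table, the `clean_date_field` name, and the `MM/DD/YYYY` format are all assumptions for illustration, not the actual pipeline:

```python
from datetime import datetime

# Hypothetical table of characters that OCR tends to confuse with digits.
DIGIT_LOOKALIKES = str.maketrans(
    {"b": "6", "O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def clean_date_field(raw, fmt="%m/%d/%Y"):
    """Return (value, low_confidence) for one OCR'd date cell.

    The date format is an assumption; adjust it to the real documents.
    """
    text = raw.strip()
    # Swap lookalike letters for the digits they most likely represent.
    fixed = text.translate(DIGIT_LOOKALIKES)
    # Any substitution means the value was guessed, so flag the record.
    low_confidence = fixed != text
    try:
        datetime.strptime(fixed, fmt)
    except ValueError:
        # Still not a valid date: keep the raw text and flag it for review.
        return text, True
    return fixed, low_confidence


print(clean_date_field("0b/12/2019"))
print(clean_date_field("13/45/2019"))
```

Only cells that needed a substitution or failed validation get flagged, so a human only has to look at the doubtful rows.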
One new nice-to-have use case the customer asked for was handling pages with varying orientation, which I couldn't hack together quickly.
I use a £15 arm with a vice grip for my phone from Amazon, copy the files to my laptop and then run a bash for-loop of the tesseract CLI over the resultant files.