Tesseract OCR (github.com/tesseract-ocr)
223 points by tosh on July 18, 2021 | 64 comments



If you are trying to detect text in document images / photos: Tesseract focuses almost entirely on the OCR step and does very little of the preprocessing itself (1). To get better results there, you can use Christian Wolf's binarization tool (2) as an easy adaptive-thresholding step to remove shadows and uneven lighting, which should improve your OCR results a lot on document photos!

(1): https://towardsdatascience.com/pre-processing-in-ocr-fc231c6... (2): https://github.com/chriswolfvision/local_adaptive_binarizati...
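For illustration, here is what a minimal version of that preprocessing step can look like in Python. This uses OpenCV's built-in Gaussian adaptive threshold as a stand-in for Wolf's binarization (which ships as a separate C++ tool); the file name and the 31/15 parameters are placeholder guesses to tune per image:

    import cv2
    import pytesseract

    img = cv2.imread("document_photo.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Local thresholding: each pixel is compared against a Gaussian-weighted
    # mean of its 31x31 neighbourhood, which evens out shadows and uneven light.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    print(pytesseract.image_to_string(binary))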


Ah, the nostalgia. More than 10 years ago I wrote a local adaptive binarization tool to improve Tesseract's results, and the upstream still hasn't picked up the idea.


It is mentioned in the README that there is a Gimp plugin for Christian Wolf's binarization. This could be handy at times.

Otherwise, the C++ code on Github requires converting images to PGM format.

---

The page is in French, so I will mention that the Python script is here: https://www.vvpix.com/gmp_Telecharger_script.php?sFichier_a_...

To install it, copy the file to: `C:\Program Files\GIMP 2\lib\gimp\2.0\plug-ins`

Then call the script via the `Python-fu > Color > Binarize` tab in Gimp.

The algorithm is quite slow for large images. Aim for 1440p at most.

That being said, the results in my quick experiments look great, so in the end it saves me time compared to other, more manual methods!


Many years ago there was specific OCR software (SpotlightPro / RasterDesk) for vectorizing scanned technical drawings into CAD formats (with dimension lines, text labels, etc.); now there is Scan2CAD[0]. All of them are proprietary software.

Sadly I can't find any open-source vectorizer & OCR for repairing scanned technical drawings, and Tesseract has a lot of issues with the rotated text labels specific to CAD.[1]

[0] https://alternativeto.net/software/scan2cad/

[1] https://groups.google.com/g/tesseract-ocr/c/t-2Ru9h4xSc


I know that Tesseract uses Leptonica, which does have binarization and thresholding capabilities. Interesting that it's not enough.


Leptonica appears to use Sauvola binarization instead of the improved Wolf version.
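For anyone who wants to experiment with Sauvola outside Leptonica, scikit-image ships an implementation (a sketch; the window size is a tunable guess and the file names are placeholders):

    import numpy as np
    from skimage import io
    from skimage.filters import threshold_sauvola

    page = io.imread("page.png", as_gray=True)
    mask = page > threshold_sauvola(page, window_size=25)
    io.imsave("page_bw.png", (mask * 255).astype(np.uint8))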


Image processing strongly depends on the images you are working with. Finding an "auto" approach that works for every image is nearly impossible...

I once wrote a bookscanner app in Java (https://boofcv.org), where everything was done automatically (preprocessing, object detection / book extraction, skin detection / finger removal, deskewing, line-slope correction and so on). It was very difficult to tune the parameters so that at least most of the books looked good.


I finally figured out how to use the Tesseract CLI utility on macOS today (installed from Homebrew). It's really neat - you can use it to turn a PNG into a PDF with embedded text, which you can then copy-and-paste: https://til.simonwillison.net/tesseract/tesseract-cli

I learned about it from this post: https://alexn.org/blog/2020/11/11/organize-index-screenshots...
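The CLI invocation is just `tesseract input.png output pdf`. Roughly the same thing from Python, via pytesseract (assuming pillow and pytesseract are installed; file names are placeholders):

    import pytesseract
    from PIL import Image

    # image_to_pdf_or_hocr returns the PDF (image plus an invisible,
    # selectable text layer) as bytes
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(Image.open("scan.png"),
                                                 extension="pdf")
    with open("scan.pdf", "wb") as f:
        f.write(pdf_bytes)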


That's a nifty use case! I use it to organize screenshots as well.

I read HN on my kindle[1] to assimilate knowledge from the comments, using its highlighting and clipping features.

But commenting is painful on the Kindle's 'forever experimental browser', so I take a screenshot instead. When I connect the Kindle to my computer, a to-comment stack on a networked to-do list is updated with a ready-to-visit HN story URL, looked up from the screenshot text parsed with Tesseract.

[1] https://hntokindle.com/ (Disclaimer: I built this)


I used it on Android through Termux; it was quite handy for converting a large number of image-based files to text in a pinch.


If you're just looking to copy/paste text from images, there are a couple of small utilities for this using the built-in Vision framework.

OwlOCR is the one I use, with shortcuts set up next to the cmd-shift-3/4/5 screenshot keys.

TextSniper is another one, and I think there was a third that I can't remember the name of.


Does it work well on photographs? I'd love to run it on my photo library so I can search for shop names etc.!


I just tried it on a photo of a fish counter at a supermarket with some text labels on some of the fish and it did very well (printed text, in focus) - so yeah this may well be worth trying!


Tesseract is not really good for text in pictures (non-white backgrounds). You can use the free OCR.space API at https://ocr.space instead.

Or, just upload your photographs to Google Photos. Google OCRs all images automatically(!) and you can search them for text in the images, including text on, e.g., posters in the background.


Tesseract is mainly for documents and generally doesn’t work well on photos but you can try EasyOCR for photos.


For Android there is an old project from Mozilla that just works like a firecracker on whatever size screenshot directory you have:

Firefox ScreenshotGo[beta] https://mzl.la/2NMgD30


It uses Google Firebase's on-device ML OCR (which appears to have been rebranded as ML Kit).

https://github.com/mozilla-tw/ScreenshotGo


This is the OCR engine used by Mayan EDMS[1], which I've used since 2018. The reliability has been top-notch.

[1] https://www.mayan-edms.com/


> Does it work well on photographs?

Usefully, the new macOS / iOS releases will do this automatically (although for macOS you'll need to be running Apple Silicon).


MacPorts also has the Tesseract CLI available, with a lot of language packs.


> Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.

I have used Tesseract for OCRing scanned books and it was great. I had no idea it was so old, nor that it had been through so many maintainers. To all of them past and present, thank you.


I have permission to publish an ebook edition of an out of print history of Portland, Oregon. I haven’t found the time to work on the project.

One point of friction has been selecting an OCR workflow. Any chance you would share what you’ve been successful with?


I built a simple pipeline with bash and Python. I did it for free, for the learning, but it has been deployed and used in a professional setting on a daily basis for almost a year now. (Use case: faxes with headers and tabular data.)

Most of the time was spent on field parsing and validating the OCR output (is it a valid date?). At one point I realized that playing with the Tesseract config was giving only marginal improvements, and that investing in post-OCR parsing/wrangling was more valuable: e.g. in a date column, if the OCR says b, consider it a 6 and flag the record as low confidence.

One new nice-to-have use case the customer asked for, which I couldn't hack together quickly, was handling pages with varying orientation.
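A sketch of that post-OCR wrangling idea in Python (the confusion map and the expected date format are illustrative assumptions, not the actual pipeline):

    from datetime import datetime

    CONFUSIONS = {"b": "6", "O": "0", "l": "1", "S": "5"}

    def parse_date_field(raw):
        """Return (date, low_confidence) for an OCR'd MM/DD/YYYY field."""
        fixed = "".join(CONFUSIONS.get(ch, ch) for ch in raw)
        low_confidence = fixed != raw  # flag any record we had to repair
        try:
            return datetime.strptime(fixed, "%m/%d/%Y").date(), low_confidence
        except ValueError:
            return None, True  # unparseable even after repair

    print(parse_date_field("07/1b/2021"))  # -> (datetime.date(2021, 7, 16), True)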


I use a £15 arm with a vice grip for my phone from Amazon, copy the files to my laptop and then run a bash for-loop of the tesseract CLI over the resultant files.

I use https://github.com/4lex4/scantailor-advanced to deskew the images and generate the PDF.

It isn't perfect but my purposes are more around research than publication, so, YMMV!
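For reference, a Python version of that batch loop (assuming the tesseract binary is on PATH and ScanTailor's deskewed output TIFFs live in ./out, which is a guess at the layout):

    import subprocess
    from pathlib import Path

    for tif in sorted(Path("out").glob("*.tif")):
        # tesseract <image> <outputbase> writes <outputbase>.txt by default
        subprocess.run(["tesseract", str(tif), str(tif.with_suffix(""))],
                       check=True)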


Thanks for this and the other replies!


My company successfully uses it on documents with typed and handwritten text.


Two alternatives designed for OCR from photos: https://github.com/PaddlePaddle/PaddleOCR/ and https://github.com/JaidedAI/EasyOCR/. It's worth trying them if Tesseract isn't giving you good accuracy.
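Minimal EasyOCR usage for comparison (assuming easyocr is installed; the first Reader() call downloads the detection/recognition models, and the file name is a placeholder):

    import easyocr

    reader = easyocr.Reader(["en"])
    # readtext returns a list of (bounding_box, text, confidence) tuples
    for box, text, conf in reader.readtext("photo.jpg"):
        print(f"{conf:.2f}  {text}")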


I must've done things spectacularly wrong, because the two times I've tried Tesseract (the second time trying to recognize factory-printed 8-inch-tall letters on a trash can), I got 0% accuracy. 0%. Not even close. So I gave up.


Yeah, Tesseract is more like the bit that would sit in the middle of an OCR solution than a complete solution. But it's all we've got for free at the moment.

You pretty much need black text on a white background at 300-600 dpi. (Not sure of the exact size, but I've had crappy scans do better after scaling the file up.)

I’ve had reasonable success with photos of printed pages run through text cleaner.
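That scaling trick, sketched with Pillow (the 2x factor and the 180 threshold are guesses you would tune per scan; the file name is a placeholder):

    import pytesseract
    from PIL import Image

    img = Image.open("crappy_scan.png").convert("L")  # grayscale
    img = img.resize((img.width * 2, img.height * 2),
                     Image.LANCZOS)                   # upscale toward ~300 dpi
    img = img.point(lambda p: 255 if p > 180 else 0)  # force black-on-white
    print(pytesseract.image_to_string(img))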


> But it’s all we’ve got for free at the moment.

Doesn't OCRopus qualify as well? (It does look unmaintained, or at least less actively maintained than Tesseract.)


You typically need to pre-process the images.

I'd recommend https://scantailor.org/ for this (OSS, but unmaintained)


This has the latest developments, but is also seemingly unmaintained for over a year: https://github.com/4lex4/scantailor-advanced

Scan Tailor forum: https://forum.diybookscanner.org/viewforum.php?f=21


> [ScanTailor Advanced] seemingly unmaintained for over a year

ScanTailor's official repo was also archived, on November 29, 2020.[0]

> This project is no longer maintained, and has not been maintained for a while.

As an alternative to ScanTailor (and its forks) there is gImageReader (a Qt/GTK GUI for Tesseract), but it also seems to be unmaintained since 2019.[1,2]

[0] https://github.com/scantailor/scantailor/commit/e881b30b6ed1...

[1] https://github.com/manisandro/gImageReader

[2] https://github.com/probonopd/gImageReader/releases/tag/conti...


Most people are best served by the big vendors' OCRs. In my experience Amazon's works the best, followed closely by Microsoft's, with Google's a distant third.


When did you do this comparison? A couple years ago I did a comparison and found Google the best and Amazon to be not very good.

Agree it's best to skip Tesseract unless being free is important. We spent a lot of time trying to preprocess and tune Tesseract before realizing that cloud OCR solutions are much better and fairly cheap.


I have been tasked with developing a Textract tool at work, and so far I have been impressed with its accuracy on non-handwritten, non-photocopied documents.

I haven't seen it make any mistakes at all, and responses usually take less than 3 seconds.


I tried it a couple of years ago on some Japanese receipts and it couldn't handle the mix of Japanese and English words/characters. Perhaps it was the way I set things up, but the result was a failure.


It seems the project has made a trade-off about language support.

One approach would be to say language doesn't matter and just train on converting any character from any language's alphabet from image to text. The problem is that higher accuracy can be achieved by isolating each language's characters from the others. I imagine that, particularly for Latin-alphabet languages, accuracy must improve dramatically by splitting out any kanji or hanzi.


Even for Latin-alphabet languages you would typically want to know the language, both to reduce the character set (e.g. the distinctions between a smudgy ã, â, ā, ä, or á are often trivial if you know the language but can be quite hard to tell apart otherwise) and to use a proper language model for disambiguating individual character guesses.
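Tesseract exposes this directly: the -l flag takes one or more installed language packs, which constrains both the character set and the language model. For example, for the mixed Japanese/English receipts mentioned upthread (assuming the jpn and eng traineddata files are installed; the file name is a placeholder):

    import pytesseract
    from PIL import Image

    # equivalent to: tesseract receipt.jpg out -l jpn+eng
    text = pytesseract.image_to_string(Image.open("receipt.jpg"), lang="jpn+eng")
    print(text)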


I also failed at using this. I think it needs training data, and the default set leaves a lot to be desired. There are also hundreds of options, which makes it difficult to wrap your head around. I was reading from screenshots: black English text on a white background in a TrueType font. It never worked.

I hacked something together with "Capture2Text". Basically taking the screenshot, saving it as a JPG, shelling out to the exe and getting the text back. Works pretty well.


I've used Tesseract directly, and there definitely are some footguns when it comes to PDFs and making sure not to re-compress them and lose quality.

If you're looking to add a text layer to a PDF (for search purposes for instance) I can highly recommend OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF/

It uses Tesseract and works quite well for most PDFs. I made a semi-functional script before I discovered it; it would have saved a lot of hassle.
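OCRmyPDF also has a Python API mirroring its CLI; a minimal call might look like this (deskew and skip_text are optional keyword arguments; skip_text leaves pages that already contain text untouched, which helps with the re-compression footgun):

    import ocrmypdf

    # equivalent to: ocrmypdf --deskew --skip-text input.pdf output.pdf
    ocrmypdf.ocr("input.pdf", "output.pdf", deskew=True, skip_text=True)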


I used Tesseract almost 10 years ago to scan letters from a Words With Friends board. I was getting over 90% accuracy, but the score values printed on the tiles corrupted the letter shapes and screwed up the detection. So I created a new "language", which Tesseract supports, that incorporated the score-value corruption as part of the OCR translation. That got me to over 98% accuracy, which was about as good as I could get.

Overall I thought it was great, and I wonder how well it would perform these days, with 10 years of improvements!


This packages Tesseract very nicely:

https://kebekus.gitlab.io/scantools/

Make SURE to select the correct OCR language


Been using this for 4 years to read and index meme text on my meme-aggregator/search web app :)

https://memebay.io


I have found EasyOCR to be more accurate than Tesseract. Curious to learn about HNers' experience with those tools.


I like this thread. I'm hoping to put together a tool to help catalog info from a specific class of vintage automobiles. The significant info is typically contained on a tag like this

https://www.forbbodiesonly.com/moparforum/threads/fender-tag...

I'd like to capture each item that is delimited by whitespace, convert to text, and store its position and line in a database.

The same code may appear more than once with different meaning, so position is important.

The tags are often different colors as well.

Anyone know which technology may be best or simplest to implement?

This is for a historical search function.


I've been looking for something similar with handwriting recognition in mind (and maybe math formulas). I got a few leads (unfortunately I don't have the list at hand right now).

Any open-source solution you'd recommend for handwriting recognition?


Is there a good OCR for weird, non-natural-language data, such as a screenshot of a base64-encoded file? There was a CTF challenge that required recovering a full RSA private key from a partially redacted screenshot of the key (PEM format). I tried several OCR tools to extract the remaining base64 characters, including Tesseract, but the results were quite bad. I eventually used GCP's OCR service, and the result was almost perfect for this task, but I wonder whether there are non-cloud-based tools that are good at it?
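One non-cloud trick worth trying for that case is restricting Tesseract's character set to the base64 alphabet (a sketch; --psm 6 treats the image as a single uniform block of text, and note that whitelist support has varied across 4.x engine versions):

    import string
    import pytesseract
    from PIL import Image

    b64_chars = string.ascii_letters + string.digits + "+/="
    config = f"--psm 6 -c tessedit_char_whitelist={b64_chars}"
    print(pytesseract.image_to_string(Image.open("redacted_key.png"),
                                      config=config))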


I have used Tesseract for OCRing Japanese vertical text. While it works fine most of the time (not well, mind you: it constantly mixes up certain complex kanji, and the lack of advanced context awareness shows) after a bit of preprocessing (removing noise with threshold filtering, making characters darker, removing furigana), sometimes it simply breaks and produces clear garbage. And page segmentation is not the only problem: I wrote a custom algorithmic page segmentation (text from a page gets concatenated into a single "line") and Tesseract still breaks on certain inputs; removing several characters from the beginning of such text usually "fixes" the issue.


Eagerly waiting for version 5, currently in alpha! It's been 2 years since the stable 4.1 release.


Wish there was an up-to-date, complete guide to training fonts. I use it for digits and single digits, and it is not 100% accurate even when the image is of the best quality and I preprocess it well.
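For the digit case, the usual levers are a digit whitelist plus a page-segmentation mode matched to the input (a sketch; --psm 7 assumes a single text line, --psm 10 a single character; the file name is a placeholder):

    import pytesseract
    from PIL import Image

    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    print(pytesseract.image_to_string(Image.open("digits.png"), config=config))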


I used Tesseract for some projects a few years ago (version 2.x). It was already state-of-the-art in the specific task it is designed for, more or less on par with the best proprietary solutions. As others pointed out, Tesseract is only a specific part in a general OCR solution.

I'm wondering how actively it is being developed. I see that the last release is from 2019. I also see, however, that there have been some version 5.x alpha releases published this year. Does anyone know what is happening inside the project?


I’ve seen Tesseract pop up on almost all archive.org files. Never got around to look into it further, thanks for the link!


Tesseract is a nice tool, but it is a bit behind in adopting cutting-edge techniques. See an up-to-date comparison here: https://towardsdatascience.com/ocr-101-all-you-need-to-know-...


I tried to use this library to process nutrition labels for a fitness app. It sometimes took 40 seconds to process an image, which is unacceptable for a phone app. I remember seeing a video where a Google product used a neural net and resolved the same info in about 1 second.


How long ago did you try it?

"Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns."


I used the wasm implementation ( https://tesseract.projectnaptha.com/ ) and scanned 1 cereal box label.


You can also compare the total time between two options: 1. do it on the phone, or 2. upload the image to a powerful server and get the results back faster (with more accuracy controls available too).


How much info are you talking about? A hundred or so words or thousands?


I didn't know that Google had stopped maintaining it. Interesting.


Wonder if their Google Vision OCR offering led to that.


I use this from a small script to OCR the text from screenshots and add it to the metadata, so that I can search for them in Finder.

I believe Apple are adding this feature to the OS soon.
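A sketch of that kind of script (this version writes the OCR'd text into the Finder comment via AppleScript, which Spotlight indexes; the path is a placeholder and the approach is one of several possible):

    import subprocess
    import pytesseract
    from PIL import Image

    path = "/Users/me/Desktop/screenshot.png"
    text = " ".join(pytesseract.image_to_string(Image.open(path)).split())
    text = text.replace('"', "'")  # avoid breaking the AppleScript string
    script = (f'tell application "Finder" to set comment of '
              f'(POSIX file "{path}" as alias) to "{text}"')
    subprocess.run(["osascript", "-e", script], check=True)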


Make sure you have black text on a white background; it drove me crazy after upgrading from 3.0 to 5.0.


I upvote for the comments, not tesseract



