Tesseract OCR (github.com/tesseract-ocr)
223 points by tosh on July 18, 2021 | 64 comments



If you are trying to detect text in document images / photos: Tesseract focuses almost entirely on the OCR step and does very little of the preprocessing itself (1). To get better results there, you can use Christian Wolf's binarization tool (2) as an easy adaptive-thresholding step to remove shadows and uneven lighting, which should improve your OCR results a lot on document photos!

(1): https://towardsdatascience.com/pre-processing-in-ocr-fc231c6... (2): https://github.com/chriswolfvision/local_adaptive_binarizati...
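For illustration, here is what a minimal version of that preprocessing step can look like in Python. This uses OpenCV's built-in Gaussian adaptive threshold as a stand-in for Wolf's binarization (which ships as a separate C++ tool); the file name and the 31/15 parameters are placeholder guesses to tune per image:

    import cv2
    import pytesseract

    img = cv2.imread("document_photo.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Local thresholding: each pixel is compared against a Gaussian-weighted
    # mean of its 31x31 neighbourhood, which evens out shadows and uneven light.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    print(pytesseract.image_to_string(binary))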


Ah, the nostalgia. More than 10 years ago I wrote a local adaptive binarization tool to improve Tesseract's results, and the upstream still hasn't picked up the idea.


It is mentioned in the README that there is a Gimp plugin for Christian Wolf's binarization. This could be handy at times.

Otherwise, the C++ code on Github requires converting images to PGM format.

---

The page is in French, so I will mention that the Python script is here: https://www.vvpix.com/gmp_Telecharger_script.php?sFichier_a_...

To install it, copy the file to: `C:\Program Files\GIMP 2\lib\gimp\2.0\plug-ins`

Then call the script via the `Python-fu > Color > Binarize` tab in Gimp.

The algorithm is quite slow for large images. Aim for 1440p at most.

That being said, the results in my quick experiments look great, so in the end it saves me time compared to other, more manual methods!


Many years ago there was specific OCR software (SpotlightPro / RasterDesk) for vectorizing scanned technical drawings into CAD formats (with dimension lines, text labels, etc.); now there is Scan2CAD[0]. All of them are proprietary software.

Sadly I can't find any open-source vectorizer & OCR for repairing scanned technical drawings, and Tesseract has a lot of issues with the rotated text labels specific to CAD.[1]

[0] https://alternativeto.net/software/scan2cad/

[1] https://groups.google.com/g/tesseract-ocr/c/t-2Ru9h4xSc


I know that Tesseract uses Leptonica, which does have binarization and thresholding capabilities. Interesting that it's not enough.


Leptonica appears to use Sauvola binarization instead of the improved Wolf version.
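For anyone who wants to experiment with Sauvola outside Leptonica, scikit-image ships an implementation (a sketch; the window size is a tunable guess and the file names are placeholders):

    import numpy as np
    from skimage import io
    from skimage.filters import threshold_sauvola

    page = io.imread("page.png", as_gray=True)
    mask = page > threshold_sauvola(page, window_size=25)
    io.imsave("page_bw.png", (mask * 255).astype(np.uint8))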


Image processing strongly depends on the images you are working with. Finding an "auto" approach that works for every image is nearly impossible...

I once wrote a bookscanner app in Java (https://boofcv.org), where everything was done automatically (preprocessing, object detection / book extraction, skin detection / finger removal, deskewing, line-slope correction and so on). It was very difficult to tune the parameters so that at least most of the books looked good.


I finally figured out how to use the Tesseract CLI utility on macOS today (installed from Homebrew). It's really neat - you can use it to turn a PNG into a PDF with embedded text, which you can then copy-and-paste: https://til.simonwillison.net/tesseract/tesseract-cli

I learned about it from this post: https://alexn.org/blog/2020/11/11/organize-index-screenshots...
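The CLI invocation is just `tesseract input.png output pdf`. Roughly the same thing from Python, via pytesseract (assuming pillow and pytesseract are installed; file names are placeholders):

    import pytesseract
    from PIL import Image

    # image_to_pdf_or_hocr returns the PDF (image plus an invisible,
    # selectable text layer) as bytes
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(Image.open("scan.png"),
                                                 extension="pdf")
    with open("scan.pdf", "wb") as f:
        f.write(pdf_bytes)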


That's a nifty use case! I use it to organize screenshots as well.

I read HN on my kindle[1] to assimilate knowledge from the comments, using its highlighting and clipping features.

But commenting is painful on the Kindle's 'forever experimental browser', so I take a screenshot instead. When I connect the Kindle to my computer, a to-comment stack on a networked to-do list is updated with a ready-to-visit HN story URL, looked up from the screenshot text parsed with Tesseract.

[1] https://hntokindle.com/ (Disclaimer: I built this)


I used it on Android through Termux; it was quite handy for converting a large number of image-based files to text in a pinch.


If you're just looking to copy/paste text from images, there are a couple of small utilities for this using the built-in Vision framework.

OwlOCR is the one I use, with shortcuts set up next to the cmd-shift-3/4/5 screenshot keys.

TextSniper is another one, and I think there was a third that I can't remember the name of.


Does it work well on photographs? I'd love to run it on my photo library so I can search for shop names etc.!


I just tried it on a photo of a fish counter at a supermarket with some text labels on some of the fish and it did very well (printed text, in focus) - so yeah this may well be worth trying!


Tesseract is not really good for text in pictures (non-white backgrounds). You can use the free OCR.space API at https://ocr.space instead.

Or, just upload your photographs to Google Photos. Google OCRs all images automatically(!) and you can search them for text in the images, including text on, e.g., posters in the background.


Tesseract is mainly for documents and generally doesn’t work well on photos but you can try EasyOCR for photos.


For Android there is an old project from Mozilla that just works like a firecracker on whatever size screenshot directory you have:

Firefox ScreenshotGo[beta] https://mzl.la/2NMgD30


It uses Google Firebase's on-device ML OCR (which appears to have been rebranded as ML Kit).

https://github.com/mozilla-tw/ScreenshotGo


This is the OCR engine used by Mayan EDMS[1], which I've used since 2018. The reliability has been top-notch.

[1] https://www.mayan-edms.com/


> Does it work well on photographs?

Usefully, the new macOS / iOS releases will do this automatically (although for macOS you'll need to be running Apple Silicon).


MacPorts also has the Tesseract CLI available, with a lot of language packs.


> Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.

I have used Tesseract for OCRing scanned books and it was great. I had no idea it was so old, nor that it had been through so many maintainers. To all of them past and present, thank you.


I have permission to publish an ebook edition of an out of print history of Portland, Oregon. I haven’t found the time to work on the project.

One point of friction has been selecting an OCR workflow. Any chance you would share what you’ve been successful with?


I built a simple pipeline with bash and Python. I did it for free, for the learning, but it has been deployed and used in a professional setting on a daily basis for almost a year now. (Use case: faxes with headers and tabular data.)

Most of the time was spent on field parsing and validating the OCR output (is it a valid date?). At one point I realized that playing with the Tesseract config was giving only marginal improvements, and that investing in post-OCR parsing/wrangling was more valuable: e.g. in a date column, if the OCR says b, consider it a 6 and flag the record as low confidence.

One new nice-to-have use case the customer asked for, which I couldn't hack together quickly, was handling pages with varying orientation.
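A sketch of that post-OCR wrangling idea in Python (the confusion map and the expected date format are illustrative assumptions, not the actual pipeline):

    from datetime import datetime

    CONFUSIONS = {"b": "6", "O": "0", "l": "1", "S": "5"}

    def parse_date_field(raw):
        """Return (date, low_confidence) for an OCR'd MM/DD/YYYY field."""
        fixed = "".join(CONFUSIONS.get(ch, ch) for ch in raw)
        low_confidence = fixed != raw  # flag any record we had to repair
        try:
            return datetime.strptime(fixed, "%m/%d/%Y").date(), low_confidence
        except ValueError:
            return None, True  # unparseable even after repair

    print(parse_date_field("07/1b/2021"))  # -> (datetime.date(2021, 7, 16), True)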


I use a £15 arm with a vice grip for my phone from Amazon, copy the files to my laptop and then run a bash for-loop of the tesseract CLI over the resultant files.

I use https://github.com/4lex4/scantailor-advanced to deskew the images and generate the PDF.

It isn't perfect but my purposes are more around research than publication, so, YMMV!
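For reference, a Python version of that batch loop (assuming the tesseract binary is on PATH and ScanTailor's deskewed output TIFFs live in ./out, which is a guess at the layout):

    import subprocess
    from pathlib import Path

    for tif in sorted(Path("out").glob("*.tif")):
        # tesseract <image> <outputbase> writes <outputbase>.txt by default
        subprocess.run(["tesseract", str(tif), str(tif.with_suffix(""))],
                       check=True)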


Thanks for this and the other replies!


My company successfully uses it on documents with typed and handwritten text.


Two alternatives designed for OCR from photos: https://github.com/PaddlePaddle/PaddleOCR/ and https://github.com/JaidedAI/EasyOCR/. It's worth trying them if Tesseract isn't giving you good accuracy.
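Minimal EasyOCR usage for comparison (assuming easyocr is installed; the first Reader() call downloads the detection/recognition models, and the file name is a placeholder):

    import easyocr

    reader = easyocr.Reader(["en"])
    # readtext returns a list of (bounding_box, text, confidence) tuples
    for box, text, conf in reader.readtext("photo.jpg"):
        print(f"{conf:.2f}  {text}")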


I must've done things spectacularly wrong, because the two times I've tried Tesseract (the second time trying to recognize factory-printed 8-inch-tall letters on a trash can), I got 0% accuracy. 0%. Not even close. So I gave up.


Yeah, Tesseract is more like the bit that would sit in the middle of an OCR solution than a complete solution. But it's all we've got for free at the moment.

You pretty much need black text on a white background at 300-600 dpi. (Not sure of the exact size, but I've had crappy scans do better after scaling the file up.)

I’ve had reasonable success with photos of printed pages run through text cleaner.
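That scaling trick, sketched with Pillow (the 2x factor and the 180 threshold are guesses you would tune per scan; the file name is a placeholder):

    import pytesseract
    from PIL import Image

    img = Image.open("crappy_scan.png").convert("L")  # grayscale
    img = img.resize((img.width * 2, img.height * 2),
                     Image.LANCZOS)                   # upscale toward ~300 dpi
    img = img.point(lambda p: 255 if p > 180 else 0)  # force black-on-white
    print(pytesseract.image_to_string(img))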


> But it’s all we’ve got for free at the moment.

Doesn't OCRopus qualify as well? (It does look unmaintained, or at least less actively maintained than Tesseract.)


You typically need to pre-process the images.

I'd recommend https://scantailor.org/ for this (OSS, but unmaintained)


This has the latest developments, but is also seemingly unmaintained for over a year: https://github.com/4lex4/scantailor-advanced

Scan Tailor forum: https://forum.diybookscanner.org/viewforum.php?f=21


> [ScanTailor Advanced] seemingly unmaintained for over a year

ScanTailor's official repo was also archived, on November 29, 2020.[0]

> This project is no longer maintained, and has not been maintained for a while.

As an alternative to ScanTailor (and its forks) there is gImageReader (a Qt/GTK GUI for Tesseract), but it also seems to be unmaintained since 2019.[1,2]

[0] https://github.com/scantailor/scantailor/commit/e881b30b6ed1...

[1] https://github.com/manisandro/gImageReader

[2] https://github.com/probonopd/gImageReader/releases/tag/conti...


Most people are best served by the big vendors' OCRs. In my experience Amazon's works the best, followed closely by Microsoft's, with Google's a distant third.


When did you do this comparison? A couple years ago I did a comparison and found Google the best and Amazon to be not very good.

Agree it's best to skip Tesseract unless being free is important. We spent a lot of time trying to preprocess and tune Tesseract before realizing that cloud OCR solutions are much better and fairly cheap.


I have been tasked with developing a Textract tool at work, and so far I have been impressed with its accuracy on non-handwritten, non-photocopied documents.

I haven't seen it make any mistakes at all, and responses usually take less than 3 seconds.


I tried it a couple of years ago on some Japanese receipts and it couldn't handle the mix of Japanese and English words/characters. Perhaps it was the way I set things up, but the result was a failure.


It seems the project has made a trade-off about language support.

One approach would be to say language doesn't matter and just train on converting any character from any language's alphabet from image to text. The problem is that higher accuracy can be achieved by isolating each language's characters from the others. I imagine that, particularly for Latin-alphabet languages, accuracy must improve dramatically by splitting out any kanji or hanzi.


Even for Latin-alphabet languages you would typically want to know the language, both to reduce the character set (e.g. the distinctions between a smudgy ã, â, ā, ä, or á are often trivial if you know the language but can be quite hard to tell apart otherwise) and to use a proper language model for disambiguating individual character guesses.
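Tesseract exposes this directly: the -l flag takes one or more installed language packs, which constrains both the character set and the language model. For example, for the mixed Japanese/English receipts mentioned upthread (assuming the jpn and eng traineddata files are installed; the file name is a placeholder):

    import pytesseract
    from PIL import Image

    # equivalent to: tesseract receipt.jpg out -l jpn+eng
    text = pytesseract.image_to_string(Image.open("receipt.jpg"), lang="jpn+eng")
    print(text)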


I also failed at using this. I think it needs training data, and the default set leaves a lot to be desired. There are also hundreds of options, which makes it difficult to wrap your head around. I was reading from screenshots: black English text on a white background in a TrueType font. It never worked.

I hacked something together with "Capture2Text". Basically taking the screenshot, saving it as a JPG, shelling out to the exe and getting the text back. Works pretty well.


I've used Tesseract directly, and there definitely are some footguns when it comes to PDFs and making sure not to re-compress them and lose quality.

If you're looking to add a text layer to a PDF (for search purposes for instance) I can highly recommend OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF/

It uses Tesseract and works quite well for most PDFs. I made a semi-functional script before I discovered it; it would have saved a lot of hassle.
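OCRmyPDF also has a Python API mirroring its CLI; a minimal call might look like this (deskew and skip_text are optional keyword arguments; skip_text leaves pages that already contain text untouched, which helps with the re-compression footgun):

    import ocrmypdf

    # equivalent to: ocrmypdf --deskew --skip-text input.pdf output.pdf
    ocrmypdf.ocr("input.pdf", "output.pdf", deskew=True, skip_text=True)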


I used Tesseract almost 10 years ago to scan letters from a Words With Friends board. I was getting over 90% accuracy, but the score values printed on the tiles corrupted the letter shapes and screwed up the detection. So I created a new "language", which Tesseract supports, that incorporated the score-value corruption as part of the OCR translation. That got me to over 98% accuracy, which was about as good as I could get.

Overall I thought it was great, and I wonder how well it would perform these days, with 10 years of improvements!


This packages Tesseract very nicely:

https://kebekus.gitlab.io/scantools/

Make SURE to select the correct OCR language


Been using this for 4 years to read and index meme text on my meme-aggregator/search web app :)

https://memebay.io


I have found EasyOCR to be more accurate than Tesseract. Curious to learn about HNers' experience with those tools.


I like this thread. I'm hoping to put together a tool to help catalog info from a specific class of vintage automobiles. The significant info is typically contained on a tag like this

https://www.forbbodiesonly.com/moparforum/threads/fender-tag...

I'd like to capture each item that is delimited by whitespace, convert to text, and store its position and line in a database.

The same code may appear more than once with different meaning, so position is important.

The tags are often different colors as well.

Anyone know which technology may be best or simplest to implement?

This is for a historical search function.


I've been looking for something similar with handwriting recognition in mind (and maybe math formulas). I got a few leads (unfortunately I don't have the list at hand right now).

Any open-source solution you'd recommend for handwriting recognition?


Is there a good OCR for weird, non-natural-language data, such as a screenshot of a base64-encoded file? There was a CTF challenge that required recovering a full RSA private key from a partially redacted screenshot of the key (PEM format). I tried several OCR tools to extract the remaining base64 characters, including Tesseract, but the results were quite bad. I eventually used GCP's OCR service, and the result was almost perfect for this task, but I wonder whether there are non-cloud-based tools that are good at it?
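One non-cloud trick worth trying for that case is restricting Tesseract's character set to the base64 alphabet (a sketch; --psm 6 treats the image as a single uniform block of text, and note that whitelist support has varied across 4.x engine versions):

    import string
    import pytesseract
    from PIL import Image

    b64_chars = string.ascii_letters + string.digits + "+/="
    config = f"--psm 6 -c tessedit_char_whitelist={b64_chars}"
    print(pytesseract.image_to_string(Image.open("redacted_key.png"),
                                      config=config))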


I have used Tesseract for OCRing Japanese vertical text. While it works fine most of the time (not well, mind you: it constantly mixes up certain complex kanji, and the lack of advanced context awareness shows) after a bit of preprocessing (removing noise with threshold filtering, making characters darker, removing furigana), sometimes it simply breaks and produces clear garbage. And page segmentation is not the only problem: I wrote a custom algorithmic page segmentation (text from a page gets concatenated into a single "line") and Tesseract still breaks on certain inputs; removing several characters from the beginning of such text usually "fixes" the issue.


Eagerly waiting for version 5, currently in alpha! It's been 2 years since the stable 4.1 release.


Wish there was an up-to-date, complete guide to training fonts. I use it for digits and single digits, and it is not 100% accurate even when the image is of the best quality and I preprocess it well.
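For the digit case, the usual levers are a digit whitelist plus a page-segmentation mode matched to the input (a sketch; --psm 7 assumes a single text line, --psm 10 a single character; the file name is a placeholder):

    import pytesseract
    from PIL import Image

    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    print(pytesseract.image_to_string(Image.open("digits.png"), config=config))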


I used Tesseract for some projects a few years ago (version 2.x). It was already state-of-the-art in the specific task it is designed for, more or less on par with the best proprietary solutions. As others pointed out, Tesseract is only a specific part in a general OCR solution.

I'm wondering how actively it is being developed. I see that the last release is from 2019. I also see, however, that there have been some version 5.x alpha releases published this year. Does anyone know what is happening inside the project?


I’ve seen Tesseract pop up on almost all archive.org files. Never got around to look into it further, thanks for the link!


Tesseract is a nice tool, but it is a bit behind in adopting cutting-edge techniques. See an up-to-date comparison here: https://towardsdatascience.com/ocr-101-all-you-need-to-know-...


I tried to use this library to process nutrition labels for a fitness app. It sometimes took 40 seconds to process an image, which is unacceptable for a phone app. I remember seeing a video where a Google product used a neural net and resolved the same info in about 1 second.


How long ago did you try it?

"Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns."


I used the wasm implementation ( https://tesseract.projectnaptha.com/ ) and scanned 1 cereal box label.


You can also compare the total time between two options: 1. do it on the phone, or 2. upload the image to a powerful server and get the results back faster (with more accuracy controls available too).


How much info are you talking about? A hundred or so words or thousands?


I didn't know that Google had stopped maintaining it. Interesting.


Wonder if their Google Vision OCR offering led to that.


I use this from a small script to OCR the text from screenshots and add it to the metadata, so that I can search for them in Finder.

I believe Apple are adding this feature to the OS soon.
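A sketch of that kind of script (this version writes the OCR'd text into the Finder comment via AppleScript, which Spotlight indexes; the path is a placeholder and the approach is one of several possible):

    import subprocess
    import pytesseract
    from PIL import Image

    path = "/Users/me/Desktop/screenshot.png"
    text = " ".join(pytesseract.image_to_string(Image.open(path)).split())
    text = text.replace('"', "'")  # avoid breaking the AppleScript string
    script = (f'tell application "Finder" to set comment of '
              f'(POSIX file "{path}" as alias) to "{text}"')
    subprocess.run(["osascript", "-e", script], check=True)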


Make sure you have black text on a white background; it drove me crazy after upgrading from 3.0 to 5.0.


I upvote for the comments, not tesseract



