From what I can tell (without having read the research papers), it looks like this is just an easy-to-use package for sparse scene text extraction. It seems to do okay if the scene has sparse text, but it falls down on dense text detection. The results are going to be pretty bad if you try to do a task like "extract transactions from a picture of a receipt." Here's an example of the input you might get for a production app: https://www.clusin.com/walmart-receipt.jpg
Notice the faded text from the printer running out of ink, and the slanted text. From my limited experience, each of these is a thorny problem, and state-of-the-art CV algorithms won't save you from having to learn how to algorithmically pre-process images and clean them up before feeding them into a CV algorithm. You might be able to use Google's Cloud OCR, which is pretty good, but it charges per image. Even if you use that, you've graduated to the next super difficult problem, which is Natural Language Processing.
Once you have the text you need to determine whether it has meaning to your application. That's basically what NLP is about. For the receipts example, how do you know you're looking at a receipt? What if it's a receipt on top of a pile of other receipts? How do you extract transactions from the receipt? Does a transaction span multiple lines? How can you tell? And so on.
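For anyone curious what that pre-processing looks like in practice, here's a minimal sketch using OpenCV (assuming cv2 and numpy are available; the threshold block size and offset are guesses you'd tune per receipt source, and minAreaRect's angle convention has changed between OpenCV versions, so treat the deskew as a starting point, not a recipe):

import cv2
import numpy as np

def preprocess_receipt(path):
    # Grayscale, then adaptive thresholding to recover faded thermal-printer ink.
    # The block size (31) and offset (15) are guesses you'd tune per source.
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 15)

    # Estimate skew from the minimum-area rectangle around the ink pixels.
    # Sanity-check the angle sign on your OpenCV version.
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle

    # Rotate back to horizontal and restore dark text on a light background,
    # which is what most OCR engines expect.
    h, w = binary.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_CUBIC)
    return cv2.bitwise_not(deskewed)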
I'm just happy to see some advancement in open source OCR for Python. Last time I had a Python project that needed OCR, I found that the open-source options were surprisingly limited, and it required some effort to achieve consistently good results even with relatively clean inputs.
Honestly I was kind of surprised that good basic OCR isn't a totally solved issue with an ecosystem of fully open-source solutions by now.
It doesn't need to be Python. Tesseract is what I ended up using, IIRC. But I was looking for a turnkey package that would work from beginning to end. I wasn't doing anything unusual, and my app wasn't OCR-focused. I just wanted easy drop-in OCR for documents.
Tesseract is more like getting a pretty good motor for free (recognizing text), but it's up to you to build the rest of the car around it (preprocessing images, handling errors, dealing with the output, potentially training it to your task, and various other issues).
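To make that concrete, here's a rough sketch of the scaffolding you end up writing, assuming pytesseract and Pillow; the cleanup steps and the --psm setting are placeholders for whatever your documents actually need:

import pytesseract
from PIL import Image, ImageFilter, ImageOps

def ocr_document(path):
    # Pre-processing: grayscale, upscale small scans, light denoising.
    # These particular steps are placeholders; real documents need their own tuning.
    img = ImageOps.grayscale(Image.open(path))
    if img.width < 1000:
        img = img.resize((img.width * 2, img.height * 2))
    img = img.filter(ImageFilter.MedianFilter(3))

    # Recognition, with a page-segmentation hint (--psm 6: one uniform block of text).
    text = pytesseract.image_to_string(img, config='--psm 6')

    # Post-processing: drop empty lines; real pipelines also handle errors,
    # confidence scores, and whatever structure the application needs.
    return [line.strip() for line in text.splitlines() if line.strip()]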
Wow, you weren't kidding. Went through the docs, and the number of preprocessing steps they demand is outrageous. Is there seriously no solution that takes care of the preprocessing steps??
> Honestly I was kind of surprised that good basic OCR isn't a totally solved issue with an ecosystem of fully open-source solutions by now.
Yes! Can anyone comment on why this is the case, since OCR is proclaimed to be a solved problem?
I've always wondered why Google Lens works "out of the box" and shows great accuracy on extracting text from images taken using a phone camera, but open-source OCR software (Tesseract, Ocropy etc.) needs a lot of tweaking to extract text from standard documents with standard fonts, even after heavily pre-processing the images.
I was building an image search engine[0] a while back and faced the same issues you mentioned with OCR. What I realized is that tesseract[1] (one of the more popular OCR frameworks) works so long as you are able to provide it data similar to what it was trained on.
We were basically trying to transcribe message screenshots, which should have been relatively straightforward given the homogeneity of the font. But this was not the case, as tesseract was not trained on the layout of message screenshots. The accuracy of raw tesseract on our test dataset was somewhere around 0.5-0.6 BLEU.
Once we were able to isolate individual parts of the image and feed them to tesseract, we got around 0.9 BLEU on the same dataset.
TL;DR: Some nifty image processing is required to make tesseract perform as expected.
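A simplified sketch of that kind of region-isolation step, assuming pytesseract and hand-supplied bounding boxes (in a real pipeline, detecting the message bubbles automatically is where the actual work goes):

import pytesseract
from PIL import Image

def ocr_regions(path, boxes):
    # boxes: list of (left, top, right, bottom) pixel tuples, e.g. one per
    # message bubble. Detecting these automatically (contours, layout analysis)
    # is the hard part that isn't shown here.
    img = Image.open(path)
    texts = []
    for box in boxes:
        crop = img.crop(box)
        # --psm 6 tells Tesseract to treat each crop as a single uniform block.
        texts.append(pytesseract.image_to_string(crop, config='--psm 6').strip())
    return texts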
Yeah! And Lens is not the only closed-source OCR solution that works. I've gotten great accuracy using ABBYY and docparser.com in the past. But one needs to pay per page after the free trial ends :(
I've found that none of the open source stuff works well for Japanese language documents. Most of the time, I've just run them through Adobe Acrobat's OCR and dumped the results into a text file. There are still mistakes, but it at least returns a passable result compared to the others.
From my experience the algorithms and implementations seem to be pretty good, but the caveat is that you, the developer, need to be aware of all the different approaches and when it's appropriate to apply them. There just doesn't seem to be a good general-purpose library that stitches them all together and knows which approach to use based on analyzing the image.
I've found that often for tools related to natural language (OCR, text-to-speech, and speech-to-text) it feels like you need a PhD in the subject just to figure out how to get anything done at all. I heartily welcome efforts to package these sorts of things up in ready-to-use ways.
I'm surprised, too. After all, if you can train an AI to recognize a cat, why can't it be trained to recognize a letter?
Mine, for example, works well on clean laser-printed text. It fails on anything written with a typewriter, though. (My definition of "failure" is it's quicker to retype it from scratch than fix the OCR's errors.)
I'd also love to have one that worked on cursive handwriting.
Try reading the post. There’s a lot more there but the gist is that this is optimized for a different set of OCR uses and not the more typical scan a book/receipt cases.
This is a fair point. I think my criticism more generally is that they position it as easy to use, but it's still just another library for a subset of OCR problems: sparse text extraction from a scene. As I said in a sibling post, there doesn't seem to be a library that stitches together OCR approaches for all the different use cases and chooses an approach based on analyzing the image itself. That would be truly easy to use.
About a year ago I surveyed the available OCR packages for receipts. This was for pristine scans (not the crumpled scan you have in your image). In my survey, all of them failed except Google Cloud OCR! If there is another OCR that works I would love to know.
I use TesseractOCR for general screenshot text extraction. Granted they're not receipts but Tesseract works well enough. What packages did you survey? Do you still have the data and code?
Yes, you're right. I tried with some scanned pages from a Vietnamese book but the result was very bad (say <5% accurate). The scans were pretty OK, though. Probably the model was not trained much for the Vietnamese language, but I think it's more likely that it does not do the necessary pre-processing steps.
I had very bad results on Vietnamese using Tesseract and their trained model. French output was mostly fine. I guess less attention is given to some languages, and the huge number of diacritics used in Vietnamese makes it harder to process too.
I've been very impressed with the OCR in an app called Fetch, which you use to scan your grocery receipts and earn points you can redeem for gift cards. Even if I pull a receipt out of my pocket and it's wrinkly, it still seems to read it very well.
Can you get the data from them yourself, or is it purely for them?
I've just tried easyocr on a receipt, and it's pretty bad. I've also just noticed that ASDA have a "mojibake" problem and print ú instead of £ on the entire receipt ...
I haven't looked into it, I believe it's purely for them. It's sort of like a reverse-coupon app. You buy stuff, and get extra points for say, Lipton iced tea. That's supposed to encourage you to buy more of that stuff next time.
Looking at the Chinese example, it’s kinda funny it managed to output Traditional Chinese characters when the image contains Simplified Chinese; the SC and TC versions look pretty different (园 vs 園, 东 vs 東).
No, these are completely different, standalone code points, not variant forms of the same code point.
What's actually happening seems to be that the ch_tra model can recognize simplified too and output the corresponding traditional version if the character isn't in the traditional "alphabet"; it doesn't work so well in the other direction.
Example recognizing a partial screenshot of https://chinese.stackexchange.com/a/38707 (anyone can try this on Google Colab, no hardware required; remember to turn on GPU in Runtime -> Change runtime type):
import easyocr
import requests

# One reader per character set; the models are downloaded on first use.
zhs_reader = easyocr.Reader(['en', 'ch_sim'])
zht_reader = easyocr.Reader(['en', 'ch_tra'])

# Fetch the screenshot as raw bytes; readtext accepts bytes, file paths, or URLs.
image = requests.get('https://i.imgur.com/HtrpZCZ.png').content

# readtext returns (bbox, text, confidence) tuples; keep only the text.
print('ch_sim:', ' '.join(text for _, text, _ in zhs_reader.readtext(image)))
print('ch_tra:', ' '.join(text for _, text, _ in zht_reader.readtext(image)))
Results:
ch_sim: One simplified character may mapping to multiple traditional ones: 皇后->皇后,後夭->后夭 豌鬟->头发,骏财->发财 As reversed, one traditional character may mapping to multiple simplified ones too: 乾燥->干燥, 乾隆->乾隆 嘹望->嘹望,嘹解->了解
ch_tra: One simplified character may mapping to multiple traditional ones: 皇后->皇后,後天->后天 頭髮->頭發,發財->發財 As reversed, one traditional character may mapping to multiple simplified ones too: 乾燥->干燥, 乾隆->乾隆 瞭望->瞭望, 瞭解->了解
Compare to the original text:
One simplified character may mapping to multiple traditional ones:
- 皇后 -> 皇后,後天 -> 后天
- 頭髮 -> 头发,發財 -> 发财
As reversed, one traditional character may mapping to multiple simplified ones too:
- 乾燥 -> 干燥,乾隆 -> 乾隆
- 瞭望 -> 瞭望,瞭解 -> 了解
Of course, automatic character-to-character conversion from simplified to traditional can be wrong due to ambiguities; excellent examples from above: 头发 => 頭發 (should be 頭髮), 了解 => 了解 (should be 瞭解).
This approach seems a bit weird to me. While I appreciate them separating the models of Traditional and Simplified Chinese, I think I might prefer them to be combined (perhaps even including Japanese Kanji), and instead provide a way for the user to specify which language or regional variant is expected so characters matching the expected variant are simply given a higher score.
Without delving into implementation details, I suspect the ch_tra model was simply trained on a dataset including simplified images with traditional labels.
What are people using in mobile development (native iOS/native Android/Cross-platform e.g. React-Native) when you want accurate extraction from a fixed format-source?
E.g. poor-quality images of ID cards or credit cards, where the position of data is known.
This is something that I find really interesting. Open-source OCR is lagging behind commercial applications, and seeing someone try out ideas is always beneficial. Kudos!!
Has anyone made a desktop app with a really simple UI for detecting text in images? I'm thinking something that lives in the taskbar, lets you make a box around the text you want to read, and then returns it as plaintext?
In my job as a support engineer I sometimes get screenshots of complex technical configurations and end up having to type them in one character at a time, so this would be really handy.
Looks like maybe I could just create a wrapper around EasyOCR.
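Something like this would be the core of it: a minimal sketch, assuming pyperclip for the clipboard and that the region you drew is already saved as an image file (the tray icon and region-selection UI are the parts you'd still have to build):

import easyocr
import pyperclip

# Loading the model is slow, so a real tray app would do this once at startup.
reader = easyocr.Reader(['en'])

def screenshot_to_clipboard(path):
    # detail=0 makes readtext return plain strings instead of (bbox, text, conf).
    lines = reader.readtext(path, detail=0)
    text = '\n'.join(lines)
    pyperclip.copy(text)
    return text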
I put it in a custom keyboard shortcut, so I just press it, draw an onscreen rectangle around any non-selectable text, and in a few seconds it goes to the clipboard.
Tesseract can be very accurate (>99%), especially when you train it for your particular data set.
This does involve creating your own labeled data.
I got this 99% accuracy by performing incremental training using the latest Manheim model as a base. I added about 20k lines, which is not really that much. https://github.com/tesseract-ocr/tesseract/wiki
The hard part was crowd sourcing those 20k lines :)
Tesseract might not be best for photos as you said but I did not have major problems.
Of course, for some documents the source is so bad that even a human can't achieve 99%.
Tesseract used to be quite average before they moved to LSTM models a few years ago.
Care to share resources/lessons learned for training tesseract with custom data? I'm using it for a side project and would love to hear about your insights.
I did use another data source from Manheim but can't locate it right now.
Using vanilla Ubuntu 18.04
I looked at the example training files and made a small script to convert my own labeled data to fit the format that tesseract requires.
I did do a bit of pre-processing, adjusting contrast.
All the data munging was done on Python (Pillow for image processing, Flask for collecting data into a simple SQLite DB before converting back to format that Tesseract requires).
Python was not necessary, just something that felt most comfortable to me. I am sure someone could do it using bash scripts or node.js or anything else.
EDIT: To make life easier for my curators I did run Tesseract first to generate prelabeled data for my training set. It was about 90% accurate to start with.
So the process was:
Tesseract OCR on some documents to be trained -> hand curation (2 months) -> train (took about 12 hours) -> 99% (on completely separate test set)
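For reference, the conversion step might look roughly like this, assuming the tesstrain convention of one line image plus a matching .gt.txt transcription; the SQLite table and column names here are invented, so adjust them to your own curation DB:

import sqlite3
from pathlib import Path
from PIL import Image

def export_training_lines(db_path, out_dir):
    # Output follows the tesstrain layout: line_000001.tif next to
    # line_000001.gt.txt containing the ground-truth transcription.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    for line_id, image_path, text in conn.execute(
            'SELECT id, image_path, text FROM curated_lines'):
        img = Image.open(image_path).convert('L')  # grayscale line image
        img.save(out / f'line_{line_id:06d}.tif')
        (out / f'line_{line_id:06d}.gt.txt').write_text(text + '\n', encoding='utf-8')
    conn.close()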
This depends on the model you use, right? As far as I know, Tesseract supports a couple of models, and you could also use a more powerful neural network in there. And if you have trained it well, it should be fine.
AFAIK Tesseract is trained to recognize characters and uses a bunch of steps to prepare the image for recognition, like removing noise, fixing contrast, and resizing.
This means it performs poorly when, for example, an image contains both black text and white text on a green background: that case isn't "normalized" by the image-preparation steps, so it can't detect the white text on green (but you can do that normalization yourself).
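A minimal sketch of doing that normalization yourself with OpenCV, assuming a single global Otsu threshold is good enough (a page mixing dark-on-light and light-on-dark regions would need the same logic applied per region instead):

import cv2

def normalize_for_tesseract(path):
    # Global Otsu threshold picks the split between text and background automatically.
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # If most pixels ended up black, the text is probably light on a dark or
    # colored background, so invert to get the black-on-white Tesseract expects.
    if binary.mean() < 127:
        binary = cv2.bitwise_not(binary)
    return binary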
The CPU fallback takes on the order of tens of seconds on my modest i5-5250U for the few images of street signs I've thrown at it. Good enough for my purposes at least.
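For anyone who wants to reproduce the timing, something like this works, with gpu=False forcing the CPU path and 'street_sign.jpg' standing in for your own image:

import time
import easyocr

# gpu=False forces the CPU path; on a machine without CUDA it's the default anyway.
reader = easyocr.Reader(['en'], gpu=False)

start = time.time()
results = reader.readtext('street_sign.jpg', detail=0)  # stand-in filename
print(f'{time.time() - start:.1f}s on CPU, {len(results)} text regions found')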
I have ABBYY installed locally as well. I don't use its cloud components. And I can set up my own server exposing ABBYY's APIs to roll out my own cloud service instead of theirs, if need be.
I've recently become interested in OCR due to using Kaku on Android for trying to get better at reading Japanese. So thanks Hacker News for showing me a new version. I'd love any comments about other resources that may be good for learning. Especially because for funsies I'd like to try and develop my own.
Thanks! A dataset was one of the things I was dreading searching for/building. (Maybe this was super easily searchable and I'm just a goon; again, I'm in the early stages of a passing interest.)
It doesn't work that badly on a few French examples I had lying around: it's doing quite well on scanned documents, even quite dense ones. Handwriting doesn't work well at all, even for simpler cases. It managed to recognize a few words from a blackboard picture, but that's hardly usable.
However, it looks like my simple example of an old "S note" export (like a lowish resolution phone screenshot) confused it a bit:
Reglementation -> Reglemantation
km -> kn
illimitée -> illiritée
limite -> liite
baptême -> bapteme
etc.
Overall, it works, and it is quite easy to install and use. I'd have to compare it with tesseract, but I think it's a bit better. A lot slower, though (I only have AMD devices, no CUDA). It's underusing my CPU, and maybe leaking memory a bit, though I didn't clean up.
Take that with a grain of salt, that was a quick try, I haven't tried to tune anything.