EasyOCR: Ready-to-use OCR with 40 languages (github.com/jaidedai)
419 points by vortex_ape on July 8, 2020 | 69 comments



From what I can tell (without having read the research papers), it looks like this is just an easy-to-use package for sparse scene-text extraction. It seems to do okay if the scene has sparse text, but it falls down for dense text detection. The results are going to be pretty bad if you try to do a task like "extract transactions from a picture of a receipt." Here's an example of input you might get for a production app: https://www.clusin.com/walmart-receipt.jpg

Notice the faded text from the printer running out of ink, and the slanted text. From my limited experience, each of these is a thorny problem, and state-of-the-art CV algorithms won't save you from having to learn how to algorithmically pre-process images and clean them up prior to feeding them into a CV algorithm. You might be able to use Google's Cloud OCR, which is pretty good, but it charges per image. Even if you use that, you've only graduated to the next super difficult problem, which is natural language processing.
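
For concreteness, here's a minimal sketch of the kind of clean-up I mean, assuming OpenCV (the function name and parameters are illustrative only; real receipts usually also need deskewing and perspective correction):

  # Illustrative pre-processing before handing an image to an OCR engine.
  import cv2

  def clean_for_ocr(path):
      img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
      # Upscale so thin, faded strokes survive binarization.
      img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
      # Remove sensor noise without blurring character edges too much.
      img = cv2.medianBlur(img, 3)
      # Local (adaptive) thresholding copes with uneven ink and lighting
      # far better than a single global cutoff.
      return cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)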

Once you have the text you need to determine if it has meaning to your application. That's basically what NLP is about. For the receipts example, how do you know you're looking at a receipt? What if it's a receipt on top of a pile of other receipts? How do you extract transactions from the receipt? Does a transaction span multiple lines? How can you tell? etc etc etc.


I'm just happy to see some advancement in open source OCR for Python. Last time I had a Python project that needed OCR, I found that the open-source options were surprisingly limited, and it required some effort to achieve consistently good results even with relatively clean inputs.

Honestly I was kind of surprised that good basic OCR isn't a totally solved issue with an ecosystem of fully open-source solutions by now.


Why does it have to be Python based? You can call out to other processes or services. Tesseract[1], for example, is pretty easy to work with.

1: https://github.com/tesseract-ocr/tesseract


It doesn't need to be Python. Tesseract is what I ended up using, IIRC. But I was looking for a turnkey package that would work from beginning to end. I wasn't doing anything unusual, and my app wasn't OCR-focused. I just wanted easy drop-in OCR for documents.

Tesseract is more like getting a pretty good motor for free (recognizing text), but it's up to you to build the rest of the car around it (preprocessing images, handling errors, dealing with the output, potentially training it to your task, and various other issues).
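
A rough sketch of "the rest of the car," assuming pytesseract and Pillow (the function name and clean-up choices here are placeholders, not a recipe):

  import pytesseract
  from PIL import Image, ImageOps

  def ocr_document(path, lang='eng'):
      img = ImageOps.grayscale(Image.open(path))   # the preprocessing you own
      try:
          text = pytesseract.image_to_string(img, lang=lang)
      except pytesseract.TesseractError:           # the error handling you own
          return None
      return ' '.join(text.split())                # the output clean-up you own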


Wow, you weren't kidding. I went through the docs and the number of preprocessing steps they demand is outrageous. Is there seriously no solution that takes care of the preprocessing steps??


> Honestly I was kind of surprised that good basic OCR isn't a totally solved issue with an ecosystem of fully open-source solutions by now.

Yes! Can anyone comment on why this is the case, since OCR is proclaimed to be a solved problem?

I've always wondered why Google Lens works "out of the box" and shows great accuracy on extracting text from images taken using a phone camera, but open-source OCR software (Tesseract, Ocropy etc.) needs a lot of tweaking to extract text from standard documents with standard fonts, even after heavily pre-processing the images.

PS: Has Google released any paper on Google Lens?


I was building an image search engine[0] a while back and faced the same issues you mentioned with OCR. What I realized is that Tesseract[1] (one of the more popular OCR frameworks) works only so long as you can provide it data similar to what it was trained on.

We were basically trying to transcribe message screenshots, which should have been relatively straightforward given the homogeneity of the font. But this was not the case, as Tesseract was not trained on the layout of message screenshots. The accuracy of raw Tesseract on our test dataset was somewhere around 0.5-0.6 BLEU.

Once we were able to isolate individual parts of the image and feed them to Tesseract, we were able to get around 0.9 BLEU on the same dataset.

TL;DR: Some nifty image processing is required to make Tesseract perform as expected.
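
Purely as an illustration (this is not our actual pipeline): the isolation step can be as simple as finding text-ish blobs with OpenCV and OCRing each crop separately.

  import cv2
  import pytesseract

  def ocr_regions(path):
      img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
      # Merge nearby characters into blobs so each message line becomes one contour.
      thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
      dilated = cv2.dilate(thresh, cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3)))
      contours = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
      texts = []
      for c in sorted(contours, key=lambda c: cv2.boundingRect(c)[1]):  # top to bottom
          x, y, w, h = cv2.boundingRect(c)
          if w > 20 and h > 10:  # skip specks
              texts.append(pytesseract.image_to_string(img[y:y + h, x:x + w]))
      return texts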

[0] (https://www.askgoose.com) [1] (https://github.com/tesseract-ocr/tesseract)


I've been wondering this ever since I used Lens. My hobby applications doing OCR always fall way short of Lens's magic.


Yeah! And Lens is not the only closed-source OCR solution that works. I've gotten great accuracy using ABBYY and docparser.com in the past. But one needs to pay per page after the free trial ends :(


I’ve found that none of the open source stuff works well for Japanese-language documents. Most of the time, I’ve just run them through Adobe Acrobat’s OCR and dumped the results into a text file. There are still mistakes, but it at least returns a passable result compared to the others.


From my experience the algorithms and implementations seem to be pretty good, but the caveat is that you, the developer, need to be aware of all the different approaches and when it is appropriate to apply them. There just doesn't seem to be a good general-purpose library that stitches them all together and knows which approach to use based on analyzing the image.


I've found that often for tools related to natural language (OCR, text-to-speech, and speech-to-text) it feels like you need a PhD in the subject just to figure out how to get anything done at all. I heartily welcome efforts to package these sorts of things up in ready-to-use ways.


This is good news if you have one of these PhDs. Your career probably isn't going anywhere any time soon :)


> isn't a totally solved issue

I'm surprised, too. After all, if you can train an AI to recognize a cat, why can't it be trained to recognize a letter?

Mine, for example, works well on clean laser-printed text. It fails on anything written with a typewriter, though. (My definition of "failure" is it's quicker to retype it from scratch than fix the OCR's errors.)

I'd also love to have one that worked on cursive handwriting.


So your point is that this library is not a magic unicorn that solves all problems related to OCR and natural language processing?


Try reading the post. There’s a lot more there but the gist is that this is optimized for a different set of OCR uses and not the more typical scan a book/receipt cases.


This is a fair point. I think my criticism, more generally, is that they position it as easy to use, but it's still just another library for a subset of OCR problems: sparse text extraction from a scene. As I said in a sibling post, there doesn't seem to be a library that stitches together OCR approaches for all the different use cases and chooses an approach based on analyzing the image itself. That would be truly easy to use.


About a year ago I surveyed the available OCR packages for receipts. This was for pristine scans (not the crumpled scan you have in your image). In my survey all OCRs failed except Google Cloud OCR! If there is another OCR that works I would love to know.


I use TesseractOCR for general screenshot text extraction. Granted they're not receipts but Tesseract works well enough. What packages did you survey? Do you still have the data and code?


Yes, you're right. I tried it with some scanned pages from a Vietnamese book, but the result was very bad (say <5% accurate). The scans were pretty OK, though. Probably the model was not trained much for Vietnamese, but I think it's more likely that it does not do the necessary pre-processing steps.


I had very bad results on Vietnamese using Tesseract and their trained model. French output was mostly fine. I guess less attention is given to some languages, and the huge number of diacritics used in Vietnamese makes it harder to process too.


I've been very impressed with the OCR in an app called Fetch, which you use to scan your grocery receipts and earn points you can redeem for gift cards. Even if I pull a receipt out of my pocket and it's wrinkly, it still seems to read it very well.


Can you get the data from them yourself, or is it purely for them?

I've just tried easyocr on a receipt, and it's pretty bad. I've also just noticed that ASDA have a "mojibake" problem and print ú instead of £ on the entire receipt ...


I haven't looked into it, I believe it's purely for them. It's sort of like a reverse-coupon app. You buy stuff, and get extra points for say, Lipton iced tea. That's supposed to encourage you to buy more of that stuff next time.


Looking at the Chinese example, it’s kinda funny it managed to output Traditional Chinese characters when the image contains Simplified Chinese; the SC and TC versions look pretty different (园 vs 園, 东 vs 東).


They're rendering Unicode without any markup for language variant.


No, these are completely different, standalone code points, not variant forms of the same code point.

What's actually happening seems to be that the ch_tra model can recognize simplified too and output the corresponding traditional version if the character isn't in the traditional "alphabet"; it doesn't work so well in the other direction.

Example recognizing a partial screenshot of https://chinese.stackexchange.com/a/38707 (anyone can try this on Google Colab, no hardware required; remember to turn on GPU in Runtime -> Change runtime type):

  import easyocr
  import requests

  zhs_reader = easyocr.Reader(['en', 'ch_sim'])
  zht_reader = easyocr.Reader(['en', 'ch_tra'])
  image = requests.get('https://i.imgur.com/HtrpZCZ.png').content
  print('ch_sim:', ' '.join(text for _, text, _ in zhs_reader.readtext(image)))
  print('ch_tra:', ' '.join(text for _, text, _ in zht_reader.readtext(image)))
Results:

  ch_sim: One simplified character may mapping to multiple traditional ones: 皇后->皇后,後夭->后夭 豌鬟->头发,骏财->发财 As reversed, one traditional character may mapping to multiple simplified ones too: 乾燥->干燥, 乾隆->乾隆 嘹望->嘹望,嘹解->了解
  ch_tra: One simplified character may mapping to multiple traditional ones: 皇后->皇后,後天->后天 頭髮->頭發,發財->發財 As reversed, one traditional character may mapping to multiple simplified ones too: 乾燥->干燥, 乾隆->乾隆 瞭望->瞭望, 瞭解->了解
Compare to the original text:

  One simplified character may mapping to multiple traditional ones:

  - 皇后 -> 皇后,後天 -> 后天
  - 頭髮 -> 头发,發財 -> 发财

  As reversed, one traditional character may mapping to multiple simplified ones too:

  - 乾燥 -> 干燥,乾隆 -> 乾隆
  - 瞭望 -> 瞭望,瞭解 -> 了解
Of course, automatic character-to-character conversion from simplified to traditional can be wrong due to ambiguities; excellent examples from above: 头发 => 頭發 (should be 頭髮), 了解 => 了解 (should be 瞭解).


This approach seems a bit weird to me. While I appreciate them separating the models of Traditional and Simplified Chinese, I think I might prefer them to be combined (perhaps even including Japanese Kanji), and instead provide a way for the user to specify which language or regional variant is expected so characters matching the expected variant are simply given a higher score.


Without delving into implementation details, I suspect the ch_tra model was simply trained on a dataset including simplified images with traditional labels.


Funny thing, 夭 and 天 are not the same at all.

It doesn't seem to have a dictionary to do word level matching, only character level.


> Funny thing, 夭 and 天 are not the same at all.

Yes, the simplified model is not that great at recognizing simplified either, at least in this case.


Yeah, a bit strange considering traditional and simplified make up two different languages in their list of forty.


What are people using in mobile development (native iOS/native Android/Cross-platform e.g. React-Native) when you want accurate extraction from a fixed format-source?

E.g. poor-quality images of ID cards or credit cards, where the position of data is known.


iOS has the Vision framework, can’t say whether it’s accurate enough for your use case.

https://developer.apple.com/documentation/vision/recognizing...


Not on mobile, but as a service: ABBYY is the market leader in OCR for documents.


OpenCV is excellent for this.


This is something that I find really interesting. Open-source OCR is lagging behind commercial applications, and seeing someone trying out ideas is always beneficial. Kudos!!


Has anyone made a desktop app with a really simple UI for detecting text in images? I'm thinking something that lives in the taskbar, lets you make a box around the text you want to read, and then returns it as plaintext?

In my job as a support engineer I sometimes get screenshots of complex technical configurations and end up having to type them in one character at a time, so this would be really handy.

Looks like maybe I could just create a wrapper around EasyOCR.
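
A bare-bones sketch of such a wrapper, assuming Pillow's ImageGrab for the screenshot (a real tool would add a region selector and a global hotkey):

  import numpy as np
  from PIL import ImageGrab
  import easyocr

  reader = easyocr.Reader(['en'], gpu=False)  # load the model once at startup

  def grab_text(bbox):
      # bbox = (left, top, right, bottom) of the screen region to read
      shot = np.array(ImageGrab.grab(bbox=bbox))
      return '\n'.join(text for _, text, _ in reader.readtext(shot))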


I was looking for exactly the same thing, and found this wonderful script:

http://askubuntu.com/a/280713/81372

I put it in a custom keyboard shortcut, so I just press it, draw an onscreen rectangle around any non-selectable text, and in a few seconds it goes to the clipboard.


I have the same problem every day. 20 years ago I used to have some tiny shareware tool for this but I lost it. Hope someone is able to recommend one!


What would be the advantage compared to something like Tesseract?


Tesseract isn't very accurate, especially with text in photos. It works OK for scanned documents, but that's about it.


Tesseract can be very accurate (>99%), especially when you train it for your particular data set.

This does involve creating your own labeled data.

I got this 99% accuracy by performing incremental training using the latest Mannheim model as a base. I added about 20k lines, which is not really that much. https://github.com/tesseract-ocr/tesseract/wiki

The hard part was crowdsourcing those 20k lines :)

Tesseract might not be best for photos as you said but I did not have major problems.

Of course, for some documents the source is so bad that a human can't achieve 99% either.

Tesseract used to be quite average before they moved to LSTM models a few years ago.


Care to share resources/lessons learned for training tesseract with custom data? I'm using it for a side project and would love to hear about your insights.


I followed the resources here: https://github.com/tesseract-ocr/tessdoc/blob/master/Trainin...

Also this: https://github.com/UB-Mannheim/tesseract/wiki

The original data was here: https://github.com/tesseract-ocr/langdata_lstm

I did use another data source from Mannheim but can't locate it right now.

Using vanilla Ubuntu 18.04

I looked at the example training files and made a small script to convert my own labeled data to fit the format that tesseract requires.

I did do a bit of pre-processing, adjusting contrast.

All the data munging was done in Python (Pillow for image processing, Flask for collecting data into a simple SQLite DB before converting back to the format that Tesseract requires).

Python was not necessary, just what felt most comfortable to me. I am sure someone could do it using bash scripts or Node.js or anything else.

EDIT: To make life easier for my curators I did run Tesseract first to generate prelabeled data for my training set. It was about 90% accurate to start with.

So the process was: Tesseract OCR on some documents to be trained -> hand curation (2 months) -> train (took about 12 hours) -> 99% (on a completely separate test set)
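
For anyone curious, the conversion script can be tiny. A hedged sketch, assuming the curated labels sit in a SQLite table called lines with (image_path, transcription) columns (both names are made up here); tesstrain wants matching line-image/.gt.txt pairs:

  import sqlite3
  from pathlib import Path
  from PIL import Image

  def export_training_pairs(db_path, out_dir):
      out = Path(out_dir)
      out.mkdir(parents=True, exist_ok=True)
      conn = sqlite3.connect(db_path)
      for i, (image_path, transcription) in enumerate(
              conn.execute('SELECT image_path, transcription FROM lines')):
          # One image per text line, plus its ground-truth transcription.
          Image.open(image_path).save(out / f'line_{i:06d}.tif')
          (out / f'line_{i:06d}.gt.txt').write_text(transcription, encoding='utf-8')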


If you don't mind disclosing, what was your particular use-case (the labeled dataset you trained on)?


It was for digitizing 19th century books written in a font and language not supported in a vanilla Tesseract.


I didn't see any accuracy comparison in the EasyOCR repo.


This depends on the model you use, right? As far as I know, Tesseract supports a couple of models, and you could also use a more powerful neural network in there. And if you have trained it well, it should be fine.
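
For what it's worth, the engine can be chosen per call. A tiny illustration via pytesseract (the file name is hypothetical):

  import pytesseract
  from PIL import Image

  img = Image.open('page.png')  # hypothetical input
  # --oem 1 selects the LSTM engine, --oem 0 the legacy one;
  # --tessdata-dir can point at a custom-trained model directory.
  print(pytesseract.image_to_string(img, lang='eng', config='--oem 1'))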


AFAIK Tesseract is trained to recognize characters and uses a bunch of steps to prepare the image for recognition: steps like removing noise, fixing contrast, and resizing.

This means it performs poorly when, for example, an image contains both black text and white text on a green background, since that case isn't "normalized" by the image-preparation steps and it cannot detect white text on a green background (but you can handle that yourself, as sketched below).
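
A small sketch of doing that normalization yourself, assuming OpenCV (applied per detected text region):

  import cv2
  import numpy as np

  def normalize_polarity(gray_region):
      # Binarize, then flip if needed so Tesseract always sees dark text on light.
      thresh = cv2.threshold(gray_region, 0, 255,
                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
      if np.mean(thresh) < 127:  # mostly dark background => light-on-dark text
          thresh = cv2.bitwise_not(thresh)
      return thresh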


Are there any OCR benchmarks?


Does this require an Nvidia GPU? Some modules seem to import PyTorch and CUDA libs: https://github.com/JaidedAI/EasyOCR/blob/master/easyocr/dete...


The CPU fallback takes on the order of tens of seconds on my modest i5-5250U for the few images of street signs I've thrown at it. Good enough for my purposes at least.


It defaults to CPU if it does not detect a CUDA device.
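
You can also force CPU explicitly; a minimal example (the image name is just a placeholder):

  import easyocr

  reader = easyocr.Reader(['en'], gpu=False)   # explicit CPU
  results = reader.readtext('street_sign.jpg')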


I'm going to test this on my shopping receipts, if I can get over the Windows packaging hassle, and report back ..


Compared to ABBYY, how does this thing fare? I don't have the time right now to do this test, and if anyone here has done it I'd be thankful if you could share.


It cannot compete with cloud services from ABBYY, Google, OCR.space, and others. But it runs locally and is open source.

It works for sparse text on images, and for that specific use case it is better than Tesseract.


I have ABBYY installed locally as well. I don't use its cloud components. And I can set up my own server exposing ABBYY's APIs to roll out my own cloud service instead of theirs, if need be.


I've recently become interested in OCR due to using Kaku on Android for trying to get better at reading Japanese. So thanks Hacker News for showing me a new version. I'd love any comments about other resources that may be good for learning. Especially because for funsies I'd like to try and develop my own.


For learning, you could try training yourself on datasets for handwritten character recognition: http://etlcdb.db.aist.go.jp/


Thanks! A dataset was one of the things I was dreading searching for/building. (Maybe this was super easily searchable and I'm just a goon; again, I'm in the early stages of a passing interest.)


This is great; I've been waiting for a newer machine-learning-based library as an alternative to Tesseract.

A bit off topic, but does anyone happen to know if there is an open-source, new-school OCR library for music notation?


Anyone know how this (EasyOCR) compares with a service like AWS Textract?


What metrics do you want the comparison on?

Cost: AWS is not free vs. open source

Time: AWS averages under 10 seconds, vs 140 seconds on a standard Dell 7480 and 9 seconds on a GPU Google Colab

Character accuracy: Almost the same on high-quality input. No comparison with AWS on a blurred camera photo like this: https://github.com/ExtractTable/ExtractTable-py/blob/master/...


Has anyone tried it? How good is it?


It doesn't work that badly on a few French examples I had lying around: it does quite well on scanned documents, even quite dense ones. Handwriting doesn't work well at all, even for simpler cases. It managed to recognize a few words from a blackboard picture, but that's hardly usable.

However, it looks like my simple example of an old "S note" export (like a lowish resolution phone screenshot) confused it a bit:

    Reglementation -> Reglemantation
    km -> kn
    illimitée -> illiritée
    limite -> liite
    baptême -> bapteme
    etc.
Overall, it works, and it is quite easy to install and use. I'd have to compare it with Tesseract, but I think it's a bit better. A lot slower, though (I only have AMD devices, no CUDA). It's underusing my CPU, and maybe leaking memory a bit, though I didn't clean up.

Take that with a grain of salt, that was a quick try, I haven't tried to tune anything.


The README presents an unfinished example. How does one work with the result?


From the title I thought it was OCR implemented in 40 programming languages.


How do I train this on my own custom dataset/font?



