Thanks for sharing! For context this is a demo of PaddleOCR V2 [0] which was released yesterday. You can find their original repo here [1]. We built this demo using Gradio [2] and deployed it on HuggingFace's Spaces [3].
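For anyone curious, a demo like this is only a few lines of glue: PaddleOCR wrapped in a Gradio interface. A minimal sketch of the idea, assuming the paddleocr and gradio pip packages; this is not the actual Space's code, and the result format varies a bit across PaddleOCR versions.

    # PaddleOCR behind a Gradio UI -- roughly what a Space like this runs.
    import gradio as gr
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")  # downloads det/rec models on first run

    def run_ocr(image_path):
        result = ocr.ocr(image_path, cls=True)
        # Recent versions return one list per page; each entry is (box, (text, confidence)).
        lines = result[0]
        return "\n".join(text for _box, (text, _conf) in lines)

    gr.Interface(fn=run_ocr, inputs=gr.Image(type="filepath"), outputs="text").launch()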
I went over to Gradio. They have a demo on their front page. I chose "Q&A with Paragraph", since "understanding" the text is a pet peeve of mine.
I entered a simple text "John kissed Mary in 2010. Two years later Adam was born." and a question "When was Adam born?" The answer was... "Two years."
This is one of the reasons I hate the current AI hype. People would do much better being honest and realistic about our progress in the field - it's enormous, but nowhere near what the marketing teams claim it to be.
The problem is that it fails even the simplest tests (Q: "Adam was not born in 2012. When was Adam born?" A: "2012". Q: "Mary says Adam was born in 2012. John says Adam was born in 2013. And in fact the latter date is correct." A: "2012").
I've been using it in a current project. The OCR is pretty good, but I've been learning that OCR isn't as good as many of us think it is - specifically with handwritten text and more "in the wild" type text. But website text? No problems.
For a long time there was this distinction that OCR meant "typefaces, as in a scanned document", whereas if you wanted more "free form" text recognition you'd lean towards some machine-learning-based approach (SVMs, convnets, etc.).
Now, this is an artificial divide: there's nothing in the definition of OCR that means "oh, it only works on scanned documents". It's just what it used to mean back in the day.
Still, if that's your kind of problem, you may look into something more along the lines of a deep convnet to do character detection, then classify each detected bounding box against an alphabet - roughly as in the sketch below.
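A minimal PyTorch sketch of the classify-each-crop half; detection (producing the character crops) is assumed to happen upstream, and the alphabet and layer sizes here are purely illustrative:

    import torch
    import torch.nn as nn

    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

    class CharClassifier(nn.Module):
        """Tiny convnet mapping a cropped character bitmap to an alphabet class."""
        def __init__(self, n_classes=len(ALPHABET)):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.head = nn.Linear(32 * 8 * 8, n_classes)  # assumes 32x32 grayscale crops

        def forward(self, x):  # x: (batch, 1, 32, 32)
            return self.head(self.features(x).flatten(1))

    crops = torch.rand(4, 1, 32, 32)  # stand-in for real detected bounding boxes
    chars = [ALPHABET[i] for i in CharClassifier()(crops).argmax(1)]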
I see. I thought you also wanted to stress the difference between "lab conditions" OCR (recognizing rendered text, e.g. a PDF - not necessarily just websites) and the typical "OCR over scans of old pages with real typography, bleeding ink, stains, etc.".
But I see that (like tesseract) it cannot recognize different styles (italics, bold, monospace...): not only does it seem to just translate into plain (UTF) characters, it also gets confused by terms in alternative styling.
Character style/formatting recognition is a pretty standard feature in commercial OCR offerings such as ABBYY FineReader. Its accuracy depends on the source text, though, and is generally lower than the overall OCR accuracy.
There could be a way, by building a next-in-pipeline product that takes the hOCR output from (e.g.) tesseract (which includes the bounding-box coordinates in the image) and iterates over the (recognized_string, bounding_box, source_image) tuples to decide the styling of each word through some statistical analysis - for instance, take the cropped bitmap of the term and compare it with renderings of the styled variants of the recognized word to see which is more similar, and/or check the thicknesses, spacing, and slanting in the bitmap of the word.
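To make that concrete, a rough sketch of the loop, assuming pytesseract for the hOCR and Pillow for the crops and renderings; the font paths, the quick-and-dirty regex, and the crude similarity metric are all placeholders:

    import re
    import numpy as np
    import pytesseract
    from PIL import Image, ImageDraw, ImageFont

    def hocr_words(image):
        """Yield (text, (x0, y0, x1, y1)) for each word in tesseract's hOCR output."""
        hocr = pytesseract.image_to_pdf_or_hocr(image, extension="hocr").decode("utf-8")
        pattern = r"<span class=.ocrx_word.[^>]*title=.bbox (\d+) (\d+) (\d+) (\d+)[^>]*>([^<]+)</span>"
        for m in re.finditer(pattern, hocr):
            x0, y0, x1, y1, text = m.groups()
            yield text, (int(x0), int(y0), int(x1), int(y1))

    def render(word, font_path, size):
        """Render `word` onto a white bitmap of the crop's size (point size ~ box height)."""
        font = ImageFont.truetype(font_path, size[1])
        img = Image.new("L", size, 255)
        ImageDraw.Draw(img).text((0, 0), word, font=font, fill=0)
        return np.asarray(img, dtype=float)

    image = Image.open("page.png").convert("L")
    FACES = {"regular": "DejaVuSerif.ttf", "italic": "DejaVuSerif-Italic.ttf"}
    for word, box in hocr_words(image):
        crop = np.asarray(image.crop(box), dtype=float)
        size = (box[2] - box[0], box[3] - box[1])
        scores = {name: np.abs(crop - render(word, path, size)).mean()
                  for name, path in FACES.items()}
        print(word, min(scores, key=scores.get))  # face with the smallest raw difference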
Out of curiosity, I also tried that the day I posted: the average pixel-by-pixel difference between the two cropped bitmaps of a word - one from the original document, one from a new rendering - does not differ much between normal and italics.
Better techniques are required than the raw one I used: for example, finding the best overlap of the two bitmaps, maybe with some sort of gradient descent over a few pixels of panning and scaling. That should give near-100% correspondence in the correct case (regular vs. italic vs. bold vs. monospaced vs. the BI, BM, IM, BIM combinations), but only if the font used for the comparison is the same.
By the way, there could be another key for the heuristic in the fact that, while adjusting the overlap, the computed difference should follow the gradient smoothly when the styles correspond (R on R), but may behave randomly in other cases (R on I, R on M, though probably not R on B).
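Here is a brute-force version of that overlap search (panning only; real scaling and a proper gradient method are left out), assuming the crop and the rendering are same-size float arrays as in the sketch above:

    import numpy as np

    def best_overlap_diff(crop, rendering, max_shift=4):
        """Minimum mean absolute difference over a small window of pan offsets.

        np.roll wraps pixels around the edges, which is tolerable for a few
        pixels of shift against a mostly white background.
        """
        best = float("inf")
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                shifted = np.roll(np.roll(rendering, dy, axis=0), dx, axis=1)
                best = min(best, np.abs(crop - shifted).mean())
        return best

Recording how the difference grows as you move away from that minimum would give exactly the smooth-versus-random signal described above.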
Cool stuff. I didn't push myself that far, since I was mostly looking for header/title recognition. In that area, line height is on average (everything else being equal) a less valuable indicator than expected, while word width is more accurate. You still need to rely on some statistics, but the results are more descriptive and reliable for making inferences.
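As an illustration of that word-width heuristic, fed with the (text, bbox) tuples from the hOCR parser sketched earlier; the per-character normalization and the z-score threshold are my own choices, not a validated recipe:

    import numpy as np

    def header_candidates(words, z_threshold=2.0):
        """words: iterable of (text, (x0, y0, x1, y1)); returns unusually wide words."""
        words = list(words)
        widths = np.array([(b[2] - b[0]) / max(len(t), 1) for t, b in words])
        mean, std = widths.mean(), widths.std() or 1.0
        return [w for w, pw in zip(words, widths) if (pw - mean) / std > z_threshold]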
I actually would buy ABBYY OCR, but the pricing for Linux is just insane for private use. And I just saw that the CLI has even been discontinued: https://www.ocr4linux.com/
I've tried to use it multiple times, but I just can't find it effective. I'm sure they have different needs in mind than mine, because I can tell it's a nice project.
They (ABBYY), in my understanding, are a good example of a lazy company, sigh. They find something valuable and extract the most out of the franchise with the minimum effort. A bit like LastPass.
You can't deny they both excel at some stuff in their field but I still get pushed away by the lack of care/generosity.
"The entire ABBYY FineReader product family is getting a new look and feel. Our online recognition tools will be reworked and introduced at a later date to demonstrate the power of ABBYY's OCR technologies."
They launched a cloud version a few years ago and have seen most of their growth there. It's not the best API (it requires polling; no webhooks. Yuck), but it does include their latest engine at entry-level pricing.
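The pattern in question, in generic form: submit a job, then poll a status endpoint until it finishes. The endpoint paths and field names below are made up for illustration; they are not ABBYY's actual API.

    import time
    import requests

    BASE = "https://api.example-ocr.com"  # hypothetical service

    def ocr_with_polling(image_bytes, interval=2.0, timeout=120.0):
        task = requests.post(f"{BASE}/processImage", data=image_bytes).json()
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            status = requests.get(f"{BASE}/tasks/{task['id']}").json()
            if status["state"] == "completed":
                return requests.get(status["resultUrl"]).content
            if status["state"] == "failed":
                raise RuntimeError(status.get("error", "OCR task failed"))
            time.sleep(interval)  # no webhook, so we just keep asking
        raise TimeoutError("OCR task did not finish in time")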
I've tried Paddle; it's great that it's open source, but it doesn't hold a candle to Google's OCR. Pro tip: if you only need it for a few images, Google Photos automatically does OCR if it recognizes that it's a doc.
Android does OCR natively as well. Drag up from the bottom partway (the gesture that opens the view where you scroll left and right through your open apps), then press and hold on some text, and you'll be able to select and copy text displayed by any of your apps.
Oh, that would be interesting. Can you say more? I tried on a Samsung S7 but I can't seem to do it. Maybe it requires a later Android version; the last time the S7 was updated, it was to v8 o_O
After crowdsourcing the training data, I would hope that Goog's OCR is pretty damn accurate. If the crowdsourcing doesn't help, then I feel sorry for those self-driving folks using the latest data being created.
Ligatures aren't rare and archaic... they're a standard part of many fonts today. I actually looked at that and didn't think as favorably as you. Lots of mistakes all over.
To me, good results means something like 99%+ correct, plus the ability to highlight where it's confused.
Sorry, that was meant to be two separate categories: "archaic lexical paradigms like the long S" and "ligatures". I should have put ligatures first to avoid the ambiguity.
This kind of blobby faded printing is still challenging for OCR. The fact that it decided to just skip entire sections is the most troubling part for me (like seriously wtf). But the parts it didn't skip I think are quite good compared to when I use other software on the same kind of material.
I wish these things had a bit more...sanity...for lack of a better word. t769 is just ridiculous. TEcole isn't a word. Beaucoupde is clearly two words that shouldn't be smushed together. etc.
Interestingly, Apple quietly bakes high-quality, free OCR into macOS as a library that developers can invoke in their own software. It works better than this in some ways, yet they don't advertise it to end users or do anything with it at all themselves (they do on iOS, but Preview.app could have had an OCR option since Catalina). It also doesn't recognize things like the long S, though, so it's still annoying for old texts.
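The library in question is the Vision framework's text recognizer, and it's even reachable from Python. A minimal sketch, assuming pyobjc-framework-Vision is installed (macOS only):

    import Vision
    from Foundation import NSURL

    def recognize_text(path):
        request = Vision.VNRecognizeTextRequest.alloc().init()
        request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
        handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(
            NSURL.fileURLWithPath_(path), None
        )
        ok, error = handler.performRequests_error_([request], None)  # NSError** -> tuple
        if not ok:
            raise RuntimeError(error)
        # Each observation carries ranked candidates; take the top string of each.
        return [obs.topCandidates_(1)[0].string() for obs in request.results()]

    print("\n".join(recognize_text("scan.png")))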
Tangentially related, is there a good open-source tool for handwriting OCR? I want to digitize hand-written notes and make them searchable. Google Lens does a really good job, but I'd like to remove the dependency on tools that I can't run myself.
Does anyone know of any OCR (including closed source) that can handle Nastaliq script for Urdu, Farsi, etc.? I think Tesseract can't do this today due to the complex ligatures.
I've used the Google Vision API on a wide variety of Arabic fonts and it has worked pretty well at recognising ligatures, but not diacritics: it either doesn't recognise them or adds non-existent ones.
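For reference, the call in question with the official google-cloud-vision Python client; the language hint is one knob worth trying for Arabic script, though per the above it won't necessarily fix diacritics:

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()  # needs GOOGLE_APPLICATION_CREDENTIALS set
    with open("arabic_sample.png", "rb") as f:
        image = vision.Image(content=f.read())

    response = client.document_text_detection(
        image=image,
        image_context={"language_hints": ["ar"]},
    )
    print(response.full_text_annotation.text)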
It was worse for the corpus of English text I tried on it; it doesn't seem to recognize punctuation at all, and it's marginally worse at I/1/l on sans-serif text (which, to be fair, trips up humans too).
Those were the only two relative deficiencies I noticed.
It does seem to beat tesseract on samples with mixed dark-on-light and light-on-dark text, but that was the only big win I saw in my brief look at it.
It seems to completely ignore punctuation for the corpus of English text I tried on it; punctuation came through either not at all (e.g. "Id" for "I'd") or as letters (e.g. "P" for "?").
[0]: https://arxiv.org/abs/2109.03144
[1]: https://github.com/PaddlePaddle/PaddleOCR
[2]: https://gradio.app/
[3]: https://huggingface.co/spaces