PaddleOCR: Lightweight, 80 Langauge OCR (huggingface.co)
200 points by zuhayeer on Sept 9, 2021 | hide | past | favorite | 58 comments



Thanks for sharing! For context this is a demo of PaddleOCR V2 [0] which was released yesterday. You can find their original repo here [1]. We built this demo using Gradio [2] and deployed it on HuggingFace's Spaces [3].

[0]: https://arxiv.org/abs/2109.03144 [1]: https://github.com/PaddlePaddle/PaddleOCR [2]: https://gradio.app/ [3]: https://huggingface.co/spaces


I went over to Gradio. They have a demo on their front page. I chose "Q&A with Paragraph", since "understanding" the text is a pet peeve of mine.

I entered a simple text "John kissed Mary in 2010. Two years later Adam was born." and the question "When was Adam born?" The answer was... "Two years."

This is one of the reasons I hate the current AI hype. People would do much better being honest and realistic about our progress in the field - it's enormous, but not near what the marketing teams claim it to be.


FWIW this is extractive question answering, so the output of the model can only be a span of the input paragraph.

SQuAD is the prototypical example of a dataset for this task, see https://rajpurkar.github.io/SQuAD-explorer/


The problem is that it fails even the simplest tests (Q: "Adam was not born in 2012. When was Adam born?" A: "2012"; Q: "Mary says Adam was born in 2012. John says Adam was born in 2013. And in fact the latter date is correct." A: "2012").


Simple tasks like digit recognition work much better, but still...:

https://i.imgur.com/q39gxip.png


Demo is cool, but it tells us nothing about this particular OCR.

* Github: https://github.com/PaddlePaddle/PaddleOCR

* PyPi: https://pypi.org/project/paddleocr/


I've been using it in a current project. The OCR is pretty good, but I've been learning that OCR isn't as good as many of us think it is, specifically with handwritten text and more "in the wild" type text. But website text? No problems.


For a long time there was this distinction that OCR meant "typefaces as in a scanned document", whereas if you wanted more "free-form text recognition" you'd lean towards some machine-learning-based approach (SVM, convnets, etc.).

Now, this is an artificial divide: there's nothing in the definition of OCR that means "oh, it only works on scanned documents". It's just what it used to mean back in the day.

Still, if that's your kind of problem, you may look into something more along the lines of a deep convnet to detect characters and then classify each bounding box against an alphabet.
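A toy sketch of that two-step structure (detect character boxes, then recognize each box against an alphabet), with no ML at all: the "detector" is a column-gap splitter and the "recognizer" a nearest-template matcher. The 3-pixel-tall glyphs and the sample line are entirely made up for illustration; a real system would put a convnet in both spots.

```python
def segment_columns(bitmap):
    """Split a text-line bitmap (rows of 0/1) into character boxes at empty columns."""
    width = len(bitmap[0])
    inked = [any(row[x] for row in bitmap) for x in range(width)]
    boxes, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            boxes.append((start, x))
            start = None
    if start is not None:
        boxes.append((start, width))
    return boxes

def crop(bitmap, box):
    """Cut the columns of one detected box out of every row."""
    x0, x1 = box
    return [row[x0:x1] for row in bitmap]

def classify(glyph, templates):
    """Pick the template with the fewest mismatching pixels, penalizing width mismatch."""
    def diff(a, b):
        mism = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
        return mism + abs(len(a[0]) - len(b[0])) * len(a)
    return min(templates, key=lambda ch: diff(glyph, templates[ch]))

# Two toy glyphs: "I" is a 1-pixel-wide bar, "-" a 3-pixel-wide middle stroke.
TEMPLATES = {
    "I": [[1], [1], [1]],
    "-": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
}

# A line containing "I", an empty column, then "-".
LINE = [
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 0],
]

text = "".join(classify(crop(LINE, box), TEMPLATES) for box in segment_columns(LINE))
print(text)  # I-
```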


Are you OCR'ing websites?


I think he meant, "bitmaps of computer rendered text (without "natural noise")".


Actually I mean samples of real people's handwriting. That's "in the wild" text.


I see. I thought you also wanted to stress the difference between "lab conditions" OCR (recognize rendered text, e.g. a PDF - not just necessarily websites) and the typical "OCR over the scans of old pages using real-fonts typography, with bleeding ink, stains etc".


No. That's the problem lol


Very nice.

But I see that (like Tesseract) it cannot recognize different styles (italics, bold, monospace...): not only does it seem to just translate into pure (UTF) characters, it shows confusion on terms in alternative styling.


Would love to find a solution that can handle telling me the text is bold/italic. If you or anyone else knows of one please share!


Character style/formatting is a pretty standard feature in commercial OCR offerings such as ABBYY FineReader. Its accuracy rate depends on the origin text, and is generally less than the overall OCR accuracy, though.


Interesting, Azure and AWS do not support this


There could be a way: build a next-in-pipeline product that takes hOCR output (which includes the bounding-box coordinates in the image) from e.g. Tesseract and iterates over the (recognized_string, bounding_box, source_image) data to decide the styling of each word through some statistical analysis.

For instance, take the cropped bitmap of the term and compare it with renderings of the styled variants of the recognized word to see which is more similar, and/or check the thickness, spacing, and slanting in the bitmap of the word...

I think it would be intriguing to develop it.
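A minimal sketch of that crop-and-compare step, with a stub in place of real font rendering: the bitmaps, the ink-density feature, and the two-variant set are all invented for illustration, and a real pipeline would rasterize the recognized word in each style with the actual font.

```python
def pixel_diff(a, b):
    """Average per-pixel mismatch between two same-sized 0/1 bitmaps."""
    total = sum(len(row) for row in a)
    mism = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return mism / total

def ink_density(bitmap):
    """Fraction of inked pixels; bold renderings have thicker strokes."""
    pixels = [p for row in bitmap for p in row]
    return sum(pixels) / len(pixels)

def guess_style(cropped, variants):
    """Pick the styled rendering closest to the cropped word bitmap."""
    return min(variants, key=lambda s: pixel_diff(cropped, variants[s]))

# Stub "renderings" of the same glyph: bold is the regular with a doubled stroke.
variants = {
    "regular": [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]],
    "bold":    [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]],
}
cropped = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 1, 0, 0]]  # noisy, but closer to bold

print(guess_style(cropped, variants))  # bold
```

The ink-density feature could serve as a sanity check on top of the raw diff, since the bold variant should always be the denser one.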


I agree. Have been experimenting with headers/bold recognition so far, and it's promising.


I also tried it, out of curiosity, the day I posted that: the average pixel-by-pixel difference between the two cropped bitmaps of a word (the one in the original document and a new rendering) does not differ much between the normal and italic cases.

Better techniques are required than the raw one I used: for example, finding the best overlap of the two bitmaps, maybe with some sort of gradient descent over a few pixels' distance in panning and scaling. This should give a near-100% correspondence in the correct case (regular vs. italic vs. bold vs. monospaced vs. BI, BM, IM, BIM), but only if the font used for the comparison is the same.

By the way, there could be another key for the heuristic in the fact that, while adjusting the overlap, the computed difference should follow the gradient when the style corresponds (R on R), but may behave randomly in the other cases (R on I, R on M, though not R on B).
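That overlap search can be sketched with brute-force panning over a few pixels instead of gradient descent (scaling omitted); the toy bitmaps and the search radius are invented for illustration.

```python
def shifted_diff(doc, ref, dx, dy):
    """Mismatch between doc and ref panned by (dx, dy); out-of-range pixels count as blank."""
    h, w = len(doc), len(doc[0])
    mism = 0
    for y in range(h):
        for x in range(w):
            ry, rx = y + dy, x + dx
            rp = ref[ry][rx] if 0 <= ry < h and 0 <= rx < w else 0
            mism += doc[y][x] != rp
    return mism

def best_overlap(doc, ref, radius=2):
    """Search offsets within +-radius pixels; return (best_diff, dx, dy)."""
    return min(
        (shifted_diff(doc, ref, dx, dy), dx, dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
    )

# A vertical bar, and the same bar rendered one pixel to the right.
doc = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
ref = [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0]]

diff, dx, dy = best_overlap(doc, ref)
print(diff, dx, dy)  # 0 1 0
```

A perfect overlap (diff 0) at some small offset would be the "same style" signal; with a wrong style variant no offset should get close to zero.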


Cool stuff. I did not push it that far, since I was mostly looking for header/title recognition. In that area, line height is on average (everything else being equal) a less valuable indication than expected, while word width is more accurate. You still need to rely on some statistics, but the results are more descriptive and reliable for making inferences.
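That word-width heuristic might look something like this: compute a per-character width for each recognized word from its bounding box, then flag words well above the body median as header candidates. The boxes and the 1.3x threshold here are invented for illustration.

```python
from statistics import median

def header_candidates(words, factor=1.3):
    """words: (text, (x0, y0, x1, y1)) pairs. Flag words set in wider type."""
    per_char = {text: (box[2] - box[0]) / len(text) for text, box in words}
    body = median(per_char.values())  # typical body-text width per character
    return [text for text, width in per_char.items() if width > factor * body]

words = [
    ("Introduction", (10, 5, 190, 25)),   # 15 px/char: larger type
    ("lorem",        (10, 40, 60, 52)),   # 10 px/char
    ("ipsum",        (65, 40, 115, 52)),  # 10 px/char
    ("dolor",        (10, 56, 60, 68)),   # 10 px/char
]
print(header_candidates(words))  # ['Introduction']
```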


How well does it perform compared to ABBYY?

This packages Tesseract very nicely: https://kebekus.gitlab.io/scantools/

I actually would buy ABBYY OCR, but the Linux pricing is just insane for private use. I just saw the CLI is even discontinued: https://www.ocr4linux.com/


Another (multi-platform) one, that packages Tesseract very nicely, is gImageReader[1]

[1]: https://github.com/manisandro/gImageReader


I've tried to use it multiple times, but I don't find it effective. I'm sure they just have different needs in mind from the ones I have, because I can tell it's a nice project.


They (ABBYY), in my understanding, are a good example of a lazy company, sigh. They find something valuable and extract the most out of the franchise with the minimum effort. A bit like LastPass. You can't deny they both excel at some stuff in their field, but I still get pushed away by the lack of care/generosity.


ABBYY is really great... but I was sorry to see they stopped offering it as a subscription-based webapp :(


That is really odd: https://pdf.abbyy.com/finereader-online-end-of-life/

What might drive this decision?


>What might drive this decision?

I mean, you provided the link:

"Why we are sunsetting FineReader Online

The entire ABBYY FineReader product family is getting a new look and feel. Our online recognition tools will be reworked and introduced at a later date to demonstrate the power of ABBYY's OCR technologies."


They launched a cloud version a few years ago and have seen most growth there. It’s not the best API (it requires polling; no webhooks. Yuck), but it does include their latest engine at entry level pricing.


Not sure which 80 languages are supported (couldn't find a list anywhere), but I guess Russian wasn't one of them.

I tried to OCR some Russian text from an image and got absolute nonsense.


I've tried Paddle; it's great that it's open source, but it doesn't hold a candle to Google's OCR. Pro tip: if you only need it for a few images, Google Photos automatically does OCR if it recognizes it's a doc.


Android does OCR natively as well. Drag up from the bottom part way (the gesture that opens the view that lets you scroll left and right through all your open apps) and then press and hold on some text and you'll be able to select and copy text being displayed by one of your apps.


Oh, that would be interesting. Can you say more? I tried on a Samsung S7 but I can't seem to do it. Maybe it needs a later Android version; the last time the S7 was updated it was to v8 o_O


After crowdsourcing the training models, I would hope that Goog's OCR is pretty damn accurate. If the crowdsourcing doesn't help, then I feel sorry for those self-driving folks using the latest data being created.


OneNote desktop used to be great as well.

The new version was botched last I tried.


When I submit an image it just starts counting up until it reaches 60.0/4.9s (whatever that means) and then says ERROR. ¯\_(ツ)_/¯

Edit: I finally got it to work. The result looks good! https://i.imgur.com/hoS4oMP.png

Though it looks like yet another OCR program that doesn't understand archaic lexical paradigms like the long S or ligatures.


Sorry, this is because of the traffic right now. What you're seeing is a counter for the current prediction time vs. the average prediction time.


Ligatures aren't rare and archaic... they're a standard part of many fonts today. I actually looked at that and didn't judge it as favorably as you did. Lots of mistakes all over.

To me, good results means 99%+ correct, plus the ability to highlight where it's confused.


Sorry that was meant to be two separate categories "archaic lexical paradigms like the long S" and "ligatures". I should have put ligatures first to avoid the ambiguity.

This kind of blobby faded printing is still challenging for OCR. The fact that it decided to just skip entire sections is the most troubling part for me (like seriously wtf). But the parts it didn't skip I think are quite good compared to when I use other software on the same kind of material.

I wish these things had a bit more...sanity...for lack of a better word. t769 is just ridiculous. TEcole isn't a word. Beaucoupde is clearly two words that shouldn't be smushed together. etc.

Interestingly, Apple quietly bakes high-quality free OCR into macOS as a library that developers can invoke in their own software. It works better than this in some ways, and they just don't advertise it to end users or do anything with it themselves (they do on iOS, but Preview.app could have had an OCR option since Catalina). It also doesn't recognize things like the long s, though, so it's still annoying for old texts.


How can one find out more about this MacOS library?



I wonder how Paddle handles additional languages.

Tesseract versions 4+ (when they started supporting LSTM) are pretty easy to train on obscure fonts and rare languages.

The downside is that you need labeling... a lot of labeling.

We used about 10,000 lines of human-crowdsourced results to supplement an existing model of about 50,000 lines.

Accuracy is over 99%, which is sufficient for most uses.


Tangentially related, is there a good open-source tool for handwriting OCR? I want to digitize hand-written notes and make them searchable. Google Lens does a really good job, but I'd like to remove the dependency on tools that I can't run myself.


Does anyone know of any OCR (including closed source) that can handle Nastaliq script for Urdu, Farsi, etc.? Tesseract can't do this today, due to complex ligatures I think.


I've used Google Vision API for a wide variety of Arabic fonts and it has worked pretty well with recognising ligatures but not diacritics as it either doesn't recognise them or adds non-existent ones.


Uighursoft had an OCR app that handled all kinds of RTL texts. Give that a try.


How does this compare to Tesseract?


Worse for the corpus of english text I tried on it; it doesn't seem to recognize punctuation at all, and it's marginally worse at I/1/l on sans-serif text (which, to be fair, trips up humans too).

Those were the only two relative deficiencies I noticed.

It does seem to beat tesseract on samples with mixed dark-on-light and light-on-dark text, but that was the only big win I saw in my brief look at it.


It seems to completely ignore punctuation for the corpus of English text I tried on it; punctuation came through either not at all (e.g. "Id" for "I'd") or as letters (e.g. "P" for "?").


Welcome to OCR. It's often possible to overlay the raw results with a language model to improve them, but ultimately it's a probabilistic process.
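A crude stand-in for such an overlay: snap each raw OCR token to the closest word in a vocabulary with stdlib difflib. A real system would use a proper language model with character-level confusion probabilities; the vocabulary and cutoff here are invented for illustration (using the garbled French tokens from upthread).

```python
import difflib

VOCAB = ["beaucoup", "de", "l'école", "hommes", "dans"]

def correct(token, vocab=VOCAB, cutoff=0.5):
    """Replace token with its closest vocabulary word, or keep it if nothing is close."""
    hits = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return hits[0] if hits else token

print(correct("beaucovp"))  # beaucoup
print(correct("TEcole"))    # l'école
print(correct("xyz123"))    # no close match, kept as-is: xyz123
```

The "Beaucoupde" run-together case from upthread would additionally need a word-segmentation pass before this lookup.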


I've done OCR before; it seems like they must not have had any punctuation in their ground-truth set for English here though...


I apologize for clicking "Flag" in case it doesn't flag errors.


Yes the flag button is for wrong or unusual outputs. No worries though


I'm wondering: is there any image that results in an accuracy of 1? Or is 0.999 always the best?


Does not recognize the Arabic letter system, so it's no use for Arabic, Urdu, Farsi, etc.


Can I use it offline? Does it require connection to a private server?


The option has 7 languages. So 7 == 80?


typo in the title: Language



