Thanks for sharing! For context this is a demo of PaddleOCR V2 [0] which was released yesterday. You can find their original repo here [1]. We built this demo using Gradio [2] and deployed it on HuggingFace's Spaces [3].
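For anyone curious, a demo like this is only a few lines of glue: PaddleOCR wrapped in a Gradio interface. A minimal sketch of the idea, assuming the paddleocr and gradio pip packages; this is not the actual Space's code, and the result format varies a bit across PaddleOCR versions.

    # PaddleOCR behind a Gradio UI -- roughly what a Space like this runs.
    import gradio as gr
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")  # downloads det/rec models on first run

    def run_ocr(image_path):
        result = ocr.ocr(image_path, cls=True)
        # Recent versions return one list per page; each entry is (box, (text, confidence)).
        lines = result[0]
        return "\n".join(text for _box, (text, _conf) in lines)

    gr.Interface(fn=run_ocr, inputs=gr.Image(type="filepath"), outputs="text").launch()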
I went over to Gradio. They have a demo on their front page. I chose "Q&A with Paragraph", since "understanding" the text is a pet peeve of mine.
I entered a simple text "John kissed Mary in 2010. Two years later Adam was born." and a question "When was Adam born?" The answer was... "Two years."
This is one of the reasons I hate the current AI hype. People would do much better being honest and realistic about our progress in the field - it's enormous, but nowhere near what the marketing teams claim it to be.
The problem is that it fails even the simplest tests (Q: "Adam was not born in 2012. When was Adam born?" A: "2012". Q: "Mary says Adam was born in 2012. John says Adam was born in 2013. And in fact the latter date is correct." A: "2012").
I've been using it in a current project. The OCR is pretty good, but I've been learning that OCR isn't as good as many of us think it is - specifically with handwritten text and more "in the wild" type text. But website text? No problems.
For a long time there was this distinction that OCR meant "typefaces, as in a scanned document", whereas if you wanted more "free form" text recognition you'd lean towards some machine-learning-based approach (SVMs, convnets, etc.).
Now, this is an artificial divide: there's nothing in the definition of OCR that means "oh, it only works on scanned documents". It's just what it used to mean back in the day.
Still, if that's your kind of problem, you may look into something more along the lines of a deep convnet to do character detection, then classify each detected bounding box against an alphabet - roughly as in the sketch below.
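A minimal PyTorch sketch of the classify-each-crop half; detection (producing the character crops) is assumed to happen upstream, and the alphabet and layer sizes here are purely illustrative:

    import torch
    import torch.nn as nn

    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

    class CharClassifier(nn.Module):
        """Tiny convnet mapping a cropped character bitmap to an alphabet class."""
        def __init__(self, n_classes=len(ALPHABET)):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.head = nn.Linear(32 * 8 * 8, n_classes)  # assumes 32x32 grayscale crops

        def forward(self, x):  # x: (batch, 1, 32, 32)
            return self.head(self.features(x).flatten(1))

    crops = torch.rand(4, 1, 32, 32)  # stand-in for real detected bounding boxes
    chars = [ALPHABET[i] for i in CharClassifier()(crops).argmax(1)]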
I see. I thought you also wanted to stress the difference between "lab conditions" OCR (recognizing rendered text, e.g. a PDF - not necessarily just websites) and the typical "OCR over scans of old pages with real typography, bleeding ink, stains, etc.".
But I see that (like tesseract) it cannot recognize different styles (italics, bold, monospace...): not only does it seem to just translate into plain (UTF) characters, it also gets confused by terms in alternative styling.
Character style/formatting recognition is a pretty standard feature in commercial OCR offerings such as ABBYY FineReader. Its accuracy depends on the source text, though, and is generally lower than the overall OCR accuracy.
There could be a way, by building a next-in-pipeline product that takes the hOCR output from (e.g.) tesseract (which includes the bounding-box coordinates in the image) and iterates over the (recognized_string, bounding_box, source_image) tuples to decide the styling of each word through some statistical analysis - for instance, take the cropped bitmap of the term and compare it with renderings of the styled variants of the recognized word to see which is more similar, and/or check the thicknesses, spacing, and slanting in the bitmap of the word.
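To make that concrete, a rough sketch of the loop, assuming pytesseract for the hOCR and Pillow for the crops and renderings; the font paths, the quick-and-dirty regex, and the crude similarity metric are all placeholders:

    import re
    import numpy as np
    import pytesseract
    from PIL import Image, ImageDraw, ImageFont

    def hocr_words(image):
        """Yield (text, (x0, y0, x1, y1)) for each word in tesseract's hOCR output."""
        hocr = pytesseract.image_to_pdf_or_hocr(image, extension="hocr").decode("utf-8")
        pattern = r"<span class=.ocrx_word.[^>]*title=.bbox (\d+) (\d+) (\d+) (\d+)[^>]*>([^<]+)</span>"
        for m in re.finditer(pattern, hocr):
            x0, y0, x1, y1, text = m.groups()
            yield text, (int(x0), int(y0), int(x1), int(y1))

    def render(word, font_path, size):
        """Render `word` onto a white bitmap of the crop's size (point size ~ box height)."""
        font = ImageFont.truetype(font_path, size[1])
        img = Image.new("L", size, 255)
        ImageDraw.Draw(img).text((0, 0), word, font=font, fill=0)
        return np.asarray(img, dtype=float)

    image = Image.open("page.png").convert("L")
    FACES = {"regular": "DejaVuSerif.ttf", "italic": "DejaVuSerif-Italic.ttf"}
    for word, box in hocr_words(image):
        crop = np.asarray(image.crop(box), dtype=float)
        size = (box[2] - box[0], box[3] - box[1])
        scores = {name: np.abs(crop - render(word, path, size)).mean()
                  for name, path in FACES.items()}
        print(word, min(scores, key=scores.get))  # face with the smallest raw difference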
Out of curiosity, I also tried that the day I posted: the average pixel-by-pixel difference between the two cropped bitmaps of a word - one from the original document, one from a new rendering - does not differ much between normal and italics.
Better techniques are required than the raw one I used: for example, finding the best overlap of the two bitmaps, maybe with some sort of gradient descent over a few pixels of panning and scaling. That should give near-100% correspondence in the correct case (regular vs. italic vs. bold vs. monospaced vs. the BI, BM, IM, BIM combinations), but only if the font used for the comparison is the same.
By the way, there could be another key for the heuristic in the fact that, while adjusting the overlap, the computed difference should follow the gradient smoothly when the styles correspond (R on R), but may behave randomly in other cases (R on I, R on M, though probably not R on B).
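Here is a brute-force version of that overlap search (panning only; real scaling and a proper gradient method are left out), assuming the crop and the rendering are same-size float arrays as in the sketch above:

    import numpy as np

    def best_overlap_diff(crop, rendering, max_shift=4):
        """Minimum mean absolute difference over a small window of pan offsets.

        np.roll wraps pixels around the edges, which is tolerable for a few
        pixels of shift against a mostly white background.
        """
        best = float("inf")
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                shifted = np.roll(np.roll(rendering, dy, axis=0), dx, axis=1)
                best = min(best, np.abs(crop - shifted).mean())
        return best

Recording how the difference grows as you move away from that minimum would give exactly the smooth-versus-random signal described above.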
Cool stuff. I didn't push myself that far, since I was mostly looking for header/title recognition. In that area, line height is on average (everything else being equal) a less valuable indicator than expected, while word width is more accurate. You still need to rely on some statistics, but the results are more descriptive and reliable for making inferences.
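As an illustration of that word-width heuristic, fed with the (text, bbox) tuples from the hOCR parser sketched earlier; the per-character normalization and the z-score threshold are my own choices, not a validated recipe:

    import numpy as np

    def header_candidates(words, z_threshold=2.0):
        """words: iterable of (text, (x0, y0, x1, y1)); returns unusually wide words."""
        words = list(words)
        widths = np.array([(b[2] - b[0]) / max(len(t), 1) for t, b in words])
        mean, std = widths.mean(), widths.std() or 1.0
        return [w for w, pw in zip(words, widths) if (pw - mean) / std > z_threshold]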
I actually would buy ABBYY OCR, but the pricing for Linux is just insane for private use. And I just saw that the CLI has even been discontinued: https://www.ocr4linux.com/
I've tried to use it multiple times, but I just can't find it effective. I'm sure they have different needs in mind than mine, because I can tell it's a nice project.
They (ABBYY), in my understanding, are a good example of a lazy company, sigh. They find something valuable and extract the most out of the franchise with the minimum effort. A bit like LastPass.
You can't deny they both excel at some stuff in their field but I still get pushed away by the lack of care/generosity.
"The entire ABBYY FineReader product family is getting a new look and feel. Our online recognition tools will be reworked and introduced at a later date to demonstrate the power of ABBYY's OCR technologies."
They launched a cloud version a few years ago and have seen most of their growth there. It's not the best API (it requires polling; no webhooks. Yuck), but it does include their latest engine at entry-level pricing.
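The pattern in question, in generic form: submit a job, then poll a status endpoint until it finishes. The endpoint paths and field names below are made up for illustration; they are not ABBYY's actual API.

    import time
    import requests

    BASE = "https://api.example-ocr.com"  # hypothetical service

    def ocr_with_polling(image_bytes, interval=2.0, timeout=120.0):
        task = requests.post(f"{BASE}/processImage", data=image_bytes).json()
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            status = requests.get(f"{BASE}/tasks/{task['id']}").json()
            if status["state"] == "completed":
                return requests.get(status["resultUrl"]).content
            if status["state"] == "failed":
                raise RuntimeError(status.get("error", "OCR task failed"))
            time.sleep(interval)  # no webhook, so we just keep asking
        raise TimeoutError("OCR task did not finish in time")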
I've tried Paddle; it's great that it's open source, but it doesn't hold a candle to Google's OCR. Pro tip: if you only need it for a few images, Google Photos automatically does OCR if it recognizes that it's a doc.
Android does OCR natively as well. Drag up from the bottom partway (the gesture that opens the view where you scroll left and right through your open apps), then press and hold on some text, and you'll be able to select and copy text displayed by any of your apps.
Oh, that would be interesting. Can you say more? I tried on a Samsung S7 but I can't seem to do it. Maybe it requires a later Android version; the last time the S7 was updated, it was to v8 o_O
After crowdsourcing the training data, I would hope that Goog's OCR is pretty damn accurate. If the crowdsourcing doesn't help, then I feel sorry for those self-driving folks using the latest data being created.
Ligatures aren't rare and archaic... they're a standard part of many fonts today. I actually looked at that and didn't think as favorably as you. Lots of mistakes all over.
To me, good results means something like 99%+ correct, plus the ability to highlight where it's confused.
Sorry, that was meant to be two separate categories: "archaic lexical paradigms like the long S" and "ligatures". I should have put ligatures first to avoid the ambiguity.
This kind of blobby faded printing is still challenging for OCR. The fact that it decided to just skip entire sections is the most troubling part for me (like seriously wtf). But the parts it didn't skip I think are quite good compared to when I use other software on the same kind of material.
I wish these things had a bit more...sanity...for lack of a better word. t769 is just ridiculous. TEcole isn't a word. Beaucoupde is clearly two words that shouldn't be smushed together. etc.
Interestingly, Apple quietly bakes high-quality, free OCR into macOS as a library that developers can invoke in their own software. It works better than this in some ways, yet they don't advertise it to end users or do anything with it at all themselves (they do on iOS, but Preview.app could have had an OCR option since Catalina). It also doesn't recognize things like the long S, though, so it's still annoying for old texts.
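The library in question is the Vision framework's text recognizer, and it's even reachable from Python. A minimal sketch, assuming pyobjc-framework-Vision is installed (macOS only):

    import Vision
    from Foundation import NSURL

    def recognize_text(path):
        request = Vision.VNRecognizeTextRequest.alloc().init()
        request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
        handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(
            NSURL.fileURLWithPath_(path), None
        )
        ok, error = handler.performRequests_error_([request], None)  # NSError** -> tuple
        if not ok:
            raise RuntimeError(error)
        # Each observation carries ranked candidates; take the top string of each.
        return [obs.topCandidates_(1)[0].string() for obs in request.results()]

    print("\n".join(recognize_text("scan.png")))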
Tangentially related, is there a good open-source tool for handwriting OCR? I want to digitize hand-written notes and make them searchable. Google Lens does a really good job, but I'd like to remove the dependency on tools that I can't run myself.
Does anyone know of any OCR (including closed source) that can handle Nastaliq script for Urdu, Farsi, etc.? I think Tesseract can't do this today due to the complex ligatures.
I've used the Google Vision API on a wide variety of Arabic fonts and it has worked pretty well at recognising ligatures, but not diacritics: it either doesn't recognise them or adds non-existent ones.
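For reference, the call in question with the official google-cloud-vision Python client; the language hint is one knob worth trying for Arabic script, though per the above it won't necessarily fix diacritics:

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()  # needs GOOGLE_APPLICATION_CREDENTIALS set
    with open("arabic_sample.png", "rb") as f:
        image = vision.Image(content=f.read())

    response = client.document_text_detection(
        image=image,
        image_context={"language_hints": ["ar"]},
    )
    print(response.full_text_annotation.text)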
It was worse for the corpus of English text I tried on it; it doesn't seem to recognize punctuation at all, and it's marginally worse at I/1/l on sans-serif text (which, to be fair, trips up humans too).
Those were the only two relative deficiencies I noticed.
It does seem to beat tesseract on samples with mixed dark-on-light and light-on-dark text, but that was the only big win I saw in my brief look at it.
It seems to completely ignore punctuation for the corpus of English text I tried on it; punctuation came through either not at all (e.g. "Id" for "I'd") or as letters (e.g. "P" for "?").
[0]: https://arxiv.org/abs/2109.03144
[1]: https://github.com/PaddlePaddle/PaddleOCR
[2]: https://gradio.app/
[3]: https://huggingface.co/spaces