Demo is cool, but it tells us nothing about this particular OCR. * Github: http...

godelski · on Sept 9, 2021

I've been using it in a current project. The OCR is pretty good, but I've been learning that OCR isn't as good as many of us think it is, specifically with handwritten text and more "in the wild" type text. But website text? No problems.

dr_zoidberg · on Sept 10, 2021

For a long time there was this distinction that OCR meant "typefaces as in a scanned document", whereas if you wanted more "free form text recognition" you'd lean towards some machine-learning-based approach (SVM, convnets, etc).

Now, this is an artificial divide: there's nothing in the definition of what OCR is that means "oh it only works on scanned documents". It's just what it used to mean back in the day.

Still, if that's your kind of problem, you may want to look into something more along the lines of a deep convnet that does character detection, followed by a recognizer that classifies the contents of each bounding box as a character from an alphabet.
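To make the detect-then-recognize split concrete, here's a toy sketch in Python. It is not a convnet: connected-component search stands in for the detection network and nearest-template matching stands in for the per-box recognizer, so it only handles clean binary bitmaps. All names here (`detect_boxes`, `classify`, `TEMPLATES`, `IMAGE`) are made up for illustration.

```python
from collections import deque

def detect_boxes(img):
    """Stage 1 stand-in: bounding boxes (top, left, bottom, right) of ink blobs."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            # Flood-fill one 4-connected blob and track its extent.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    # Read left to right.
    return sorted(boxes, key=lambda b: b[1])

def crop(img, box):
    top, left, bottom, right = box
    return [row[left:right + 1] for row in img[top:bottom + 1]]

def classify(patch, templates):
    """Stage 2 stand-in: pick the template with the fewest differing pixels."""
    def distance(a, b):
        if len(a) != len(b) or len(a[0]) != len(b[0]):
            return float("inf")  # toy rule: shapes must match exactly
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return min(templates, key=lambda label: distance(patch, templates[label]))

def recognize(img, templates):
    return "".join(classify(crop(img, b), templates) for b in detect_boxes(img))

# Hypothetical 3-letter "alphabet" of tiny glyphs.
TEMPLATES = {
    "I": [[1], [1], [1]],
    "L": [[1, 0], [1, 0], [1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# A clean bitmap containing the glyphs L, I, T separated by blank columns.
IMAGE = [
    [1, 0, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 0],
]

print(recognize(IMAGE, TEMPLATES))
```

The point is only the shape of the pipeline: boxes first, then one classification per box. On handwriting, touching or broken strokes defeat the blob detector right away, which is where learned detection and recognition would come in.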

dmos62 · on Sept 10, 2021

Are you OCR'ing websites?

mdp2021 · on Sept 10, 2021

I think he meant "bitmaps of computer-rendered text (without 'natural noise')".

godelski · on Sept 10, 2021

Actually, I mean samples of real people's handwriting. That's "in the wild" text.

mdp2021 · on Sept 10, 2021

I see. I thought you also wanted to stress the difference between "lab conditions" OCR (recognizing rendered text, e.g. a PDF, not necessarily just websites) and the typical "OCR over scans of old pages using real-font typography, with bleeding ink, stains etc".

godelski · on Sept 10, 2021

No. That's the problem lol