Would love to find a solution that can handle telling me the text is bold/italic...

Ansil849 · on Sept 10, 2021

Character style/formatting detection is a pretty standard feature in commercial OCR offerings such as ABBYY FineReader. Its accuracy depends on the source text, though, and is generally lower than the overall OCR accuracy.

cdolan · on Sept 15, 2021

Interesting, Azure and AWS do not support this

mdp2021 · on Sept 10, 2021

One way could be to build a next-in-pipeline tool which takes hOCR output (which includes the bounding box coordinates in the image) from (e.g.) Tesseract and iterates over the data "recognized_string, bounding_box, source_image" to decide the styling of each word through some statistical analysis

- for instance, take the cropped bitmap of the word, compare it with renderings of the styled variants of the recognized word and see which is most similar, and/or check the stroke thickness, spacing and slant in the bitmap of the word...
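
As a rough sketch of the first step only, this is one way to pull the "recognized_string, bounding_box" pairs out of hOCR using just the standard library. It assumes the words are marked up as Tesseract does it, as `span` elements with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The names `HocrWords` and `parse_hocr_words` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box in the title attribute, e.g. "bbox 36 92 96 116; x_wconf 95"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (recognized_string, bounding_box) pairs."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


sample = """<span class='ocr_line' title='bbox 36 90 210 118'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>world</span>
</span>"""

for text, bbox in parse_hocr_words(sample):
    print(text, bbox)
```

Each bounding box is in (left, upper, right, lower) order, which is the same order an image library such as Pillow expects for `Image.crop`, so cropping the word out of the source image would be the natural next step.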

I think it would be intriguing to develop it.

giampaolo44 · on Sept 12, 2021

I agree. Have been experimenting with headers/bold recognition so far, and it's promising.

mdp2021 · on Sept 14, 2021

I also tried, out of curiosity, the day I posted that: the average pixel-by-pixel difference between the cropped bitmap of a word in the original document and a new rendering of it does not differ much between normal and italics.

Better techniques are required than the crude one I used: for example, finding the best overlap of the two bitmaps, maybe with some sort of gradient descent over a few pixels of panning and scaling. This should give close to 100% correspondence in the correct case (regular vs italics vs bold vs monospaced vs BI, BM, IM, BIM), but only if the font used for the comparison is the same.
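
To make the idea concrete, here is a minimal sketch of the alignment step. It is not gradient descent: since the search window is only a few pixels, it simply tries every integer pan offset and keeps the one with the lowest mean absolute difference. Scaling is left out entirely. Bitmaps are assumed to be same-sized lists of rows of pixel values with 0 as background, and the names `shifted_diff` and `best_alignment` are hypothetical.

```python
def shifted_diff(a, b, dx, dy, background=0):
    """Mean absolute pixel difference between a and b panned by (dx, dy).

    Pixels of a that the panned b does not cover are compared against
    the background value.
    """
    height, width = len(a), len(a[0])
    total = 0
    for y in range(height):
        for x in range(width):
            sy, sx = y - dy, x - dx  # where this pixel comes from in b
            if 0 <= sy < len(b) and 0 <= sx < len(b[0]):
                pixel_b = b[sy][sx]
            else:
                pixel_b = background
            total += abs(a[y][x] - pixel_b)
    return total / (height * width)


def best_alignment(a, b, max_shift=3):
    """Try every pan within +/- max_shift pixels; return (diff, dx, dy)."""
    candidates = (
        (shifted_diff(a, b, dx, dy), dx, dy)
        for dy in range(-max_shift, max_shift + 1)
        for dx in range(-max_shift, max_shift + 1)
    )
    return min(candidates)
```

The crop from the document would be compared against one rendering per style (R, I, B, ...), and the style whose best-aligned difference is lowest would be the guess. As said above, this only has a chance of working if the rendering uses the same font as the document.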

By the way, there could be another key for the heuristic in how the computed difference behaves while the overlap is being adjusted: it should increase with the gradient when the style corresponds (R on R), but may be random in other cases (R on I, R on M, though not R on B).

giampaolo44 · on Sept 14, 2021

Cool stuff. I did not push myself that far since I was mostly looking for headers/titles recognition. In that area, line height is on average (meaning everything else being equal) a less valuable indication than expected, while word width is more accurate. You still need to rely on some statistics, but the results are more descriptive and more reliable for making inferences.
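
A minimal sketch of what a width-based statistic could look like, assuming the words arrive as (text, (x0, y0, x1, y1)) pairs. The width-per-character measure, the median baseline and the 1.3 ratio are all assumptions made for illustration, not values from my experiments.

```python
from statistics import median


def width_per_char(word):
    """Bounding-box width divided by the number of recognized characters."""
    text, (x0, _y0, x1, _y1) = word
    return (x1 - x0) / len(text)


def flag_wide_words(words, ratio=1.3):
    """Return the texts of words noticeably wider per character than the page median."""
    words = [w for w in words if w[0]]  # skip empty recognitions
    if not words:
        return []
    typical = median(width_per_char(w) for w in words)
    return [w[0] for w in words if width_per_char(w) > ratio * typical]


page = [
    ("Summary", (0, 0, 140, 30)),
    ("the", (0, 40, 30, 55)),
    ("quick", (40, 40, 90, 55)),
    ("brown", (100, 40, 150, 55)),
    ("fox", (160, 40, 190, 55)),
]
print(flag_wide_words(page))
```

In a proportional font the width per character also depends on which letters the word contains ("ill" is much narrower than "mmm"), so a single word is a noisy signal and it probably makes more sense to aggregate over a whole line before deciding.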