
Would love to find a solution that can tell me whether text is bold or italic. If you or anyone else knows of one, please share!



Character style/formatting detection is a pretty standard feature in commercial OCR offerings such as ABBYY FineReader. Its accuracy depends on the source text, though, and is generally lower than the overall OCR accuracy.


Interesting. Azure and AWS do not support this.


One way could be to build a next-in-pipeline product that takes hOCR output (which includes the bounding-box coordinates in the image) from, e.g., Tesseract and iterates over the "recognized_string, bounding_box, source_image" data to decide the styling of each word through some statistical analysis:

- for instance, take the cropped bitmap of the word, compare it with renderings of the styled variants of the recognized word, and see which is most similar, and/or check the stroke thickness, spacing, and slant in the bitmap of the word...
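A rough sketch of the "check the thicknesses" idea, assuming the cropped word is available as a grayscale bitmap (a list of rows, 0 = black, 255 = white). The function name and the ink threshold of 128 are made up for illustration; the measure used here, the average length of horizontal ink runs, is only one crude proxy for stroke weight.

```python
def mean_stroke_width(bitmap, ink_threshold=128):
    """Average length of horizontal runs of ink pixels.

    bitmap: list of rows of grayscale values (0 = black, 255 = white).
    ink_threshold: assumed cutoff; darker pixels count as ink.
    """
    runs = []
    for row in bitmap:
        run = 0
        for px in row:
            if px < ink_threshold:
                run += 1          # still inside a stroke
            elif run:
                runs.append(run)  # stroke ended, record its width
                run = 0
        if run:                   # stroke touching the right edge
            runs.append(run)
    return sum(runs) / len(runs) if runs else 0.0
```

Comparing this value for one word against the median over the page would be one way to guess "bold", but it would need tuning per scan resolution and font.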

I think it would be intriguing to develop it.
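For the first stage of such a pipeline, here is a minimal sketch of pulling "recognized_string, bounding_box" pairs out of hOCR with only the Python standard library. It assumes the usual hOCR layout where each word is a span with class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1". The sample fragment, the class name HocrWords and the helper extract_words are invented for the example.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Made-up hOCR fragment, for illustration only.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 36 92 170 116'>
 <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
 <span class='ocrx_word' title='bbox 104 92 170 116; x_wconf 93'>world</span>
</span>
"""

class HocrWords(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "ocrx_word" in (a.get("class") or "").split():
            m = BBOX_RE.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None

def extract_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    return parser.words
```

Each bounding box can then be used to crop the word out of the source image for the style analysis.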


I agree. Have been experimenting with headers/bold recognition so far, and it's promising.


I also tried, out of curiosity, the day I posted that: the average pixel-by-pixel difference between the cropped bitmap of a word in the original document and a new rendering of it comes out about the same whether the rendering is regular or italic.

Better techniques are required than the crude one I used: for example, finding the best overlap of the two bitmaps, maybe with some sort of gradient descent over a few pixels of panning and scaling. This should give close to 100% correspondence in the correct case (regular vs italic vs bold vs monospaced vs BI, BM, IM, BIM), but only if the font used for the comparison is the same.
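To make the overlap search concrete, a minimal sketch under simplifying assumptions: bitmaps are lists of rows of grayscale values, only panning is searched (no scaling), and a brute-force scan over a few pixels of shift stands in for gradient descent. The names mean_abs_diff and best_alignment and the default max_shift of 3 are illustrative.

```python
def mean_abs_diff(a, b, dx=0, dy=0, background=255):
    """Mean absolute pixel difference between a and b shifted by (dx, dy).

    Pixels of the shifted b that fall outside its bounds count as background.
    """
    h, w = len(a), len(a[0])
    total = 0
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx   # source pixel in b for this position
            if 0 <= sy < len(b) and 0 <= sx < len(b[0]):
                pb = b[sy][sx]
            else:
                pb = background
            total += abs(a[y][x] - pb)
    return total / (h * w)

def best_alignment(a, b, max_shift=3):
    """Try every shift within +/- max_shift and return (score, dx, dy)."""
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = mean_abs_diff(a, b, dx, dy)
            if best is None or score < best[0]:
                best = (score, dx, dy)
    return best
```

The idea would be to run best_alignment once per candidate rendering (R, I, B, ...) and pick the style with the lowest score, keeping in mind the caveat that this only works when the comparison font matches.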

By the way, there could be another cue for the heuristic here: while adjusting the overlap, the computed difference should increase with the gradient when the style corresponds (R on R), but may be random in other cases (R on I, R on M, though not R on B).


Cool stuff. I did not push that far, since I was mostly looking at header/title recognition. In that area, line height is on average (meaning, everything else being equal) a less valuable indication than expected, while word width is more accurate. You still need to rely on some statistics, but the results are more descriptive and reliable for making inferences.
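One possible reading of the word-width idea, as a hedged sketch: normalize each word's bounding-box width by its character count and flag words that are noticeably wider than the page median. The input is a list of (text, (x0, y0, x1, y1)) pairs such as one might get from hOCR; the function names and the 1.3 ratio are assumptions, not something measured.

```python
from statistics import median

def width_per_char(word, bbox):
    """Bounding-box width divided by the number of characters."""
    x0, _, x1, _ = bbox
    return (x1 - x0) / max(len(word), 1)

def flag_wide_words(words, ratio=1.3):
    """Return the texts whose per-character width exceeds ratio * median.

    words: list of (text, (x0, y0, x1, y1)); ratio is an assumed threshold.
    """
    widths = [width_per_char(t, b) for t, b in words if t]
    if not widths:
        return []
    base = median(widths)
    return [t for t, b in words if t and width_per_char(t, b) > ratio * base]
```

On real pages the median would probably need to be taken per block or per font size rather than over the whole page.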



