I used open source solutions to build it, like LibreOffice, Ghostscript, google'...

beagle3 · 2024-05-13T06:46:46 1715582806

I’m surprised everyone is using Tesseract. It was the only game in town 10 years ago, and it’s OK on cleaned, aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly).

[0] https://github.com/JaidedAI/EasyOCR

harryf · 2024-05-13T08:05:47 1715587547

EasyOCR looks like it's more focused on the mobile use case of extracting text from photos. That's a little bit different from extracting text from scanned documents, where document structure is an important aspect, and Tesseract is the devil we know. In the commercial space ABBYY FineReader still dominates - https://en.wikipedia.org/wiki/ABBYY_FineReader

But perhaps I'm wrong...

ianhawes · 2024-05-13T13:14:13 1715606053

ABBYY does indeed dominate, but Google Document AI is making inroads.

racl101 · 2024-05-13T13:18:06 1715606286

Careful with the Ghostscript AGPL licensing if you plan to make a commercial product that uses it.