Hi HN - I released an open source OCR model yesterday that supports 93 world languages. It builds on a text line detector I created earlier.
In my benchmarks, it's more accurate than Tesseract in every language except one (see the repo for the benchmarking method).
Since it can run on a GPU, speed is about equal to Tesseract when cost-matched (1x Lambda A6000 vs. 28 DigitalOcean CPU cores).
It's built on a modified Donut architecture - I added an MoE layer, GQA for faster decoding, and UTF-16 decoding (it can represent any character, and it's faster than byte-level UTF-8 since adjacent bytes are combined into 16-bit code units).
I theorized that character-level decoding would be an optimal compute allocation, and that a large embedding matrix (relative to UTF-8 decoding) would store language-specific information.
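To make the decoding idea concrete, here's a minimal sketch of UTF-16 code-unit tokenization vs. UTF-8 bytes (illustration only - not the model's actual tokenizer, and the helper names are made up):

    # Tokenize text as 16-bit UTF-16 code units vs. UTF-8 bytes.
    def utf16_code_units(text: str) -> list[int]:
        # Encode as little-endian UTF-16, then combine adjacent byte pairs
        # into 16-bit code units (the IDs the decoder would predict).
        raw = text.encode("utf-16-le")
        return [raw[i] | (raw[i + 1] << 8) for i in range(0, len(raw), 2)]

    def utf8_bytes(text: str) -> list[int]:
        # Byte-level UTF-8 tokens for comparison (256-entry vocab).
        return list(text.encode("utf-8"))

    sample = "नमस्ते"  # Devanagari: 3 bytes per char in UTF-8, 1 code unit in UTF-16
    print(len(utf16_code_units(sample)))  # 6 tokens
    print(len(utf8_bytes(sample)))        # 18 tokens

    # Any character is still representable: code points outside the BMP become
    # two code units (a surrogate pair). The vocab is 2**16 = 65,536 entries vs.
    # 256 for UTF-8 bytes, so the embedding matrix is ~256x larger and the
    # decoder takes fewer steps per character.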
I trained it using 4x A6000s for about 2 weeks.
You can run surya via the Python API, from the CLI, or via an interactive app in the repo.
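If you go the Python route, usage looks roughly like this (a simplified sketch - the module paths and signature shown are assumptions on my part, so check the repo README for the exact, current interface):

    # Rough sketch of Python usage -- import paths and the run_ocr signature
    # here are assumptions; the repo README has the exact interface.
    from PIL import Image
    from surya.ocr import run_ocr
    from surya.model.detection.segformer import load_model as load_det_model, load_processor as load_det_processor
    from surya.model.recognition.model import load_model as load_rec_model
    from surya.model.recognition.processor import load_processor as load_rec_processor

    image = Image.open("page.png")
    langs = ["en", "hi"]  # language hints for this image

    det_model, det_processor = load_det_model(), load_det_processor()
    rec_model, rec_processor = load_rec_model(), load_rec_processor()

    # One list of language hints per input image; returns text lines with
    # bounding boxes and confidence scores.
    predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)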