Show HN: Gogosseract, a Go Lib for CGo-Free Tesseract OCR via Wazero (github.com/danlock)
120 points by dlock17 on Nov 4, 2023 | 24 comments
Tesseract is one of the largest Open Source OCR (Optical Character Recognition) projects. There is already a Go library for using Tesseract from Go with CGo, called Gosseract.

However, if you want OCR from Go without C complicating builds and cross-compilation, there aren't any other options.

Wazero is a Go WASM runtime that doesn't have any CGo dependencies. Tesseract has been compiled to WASM with Emscripten and is run within Wazero.

Gogosseract provides a simple API on top of this. This project has been an interesting delve into the world of WASM.




I wrote a short blog post[1] on this method a while ago. I do think running WASM in embedded runtimes is a pretty good option, but overhead remains high, and WASI remains somewhat fragmented between compilers and runtimes.

I think this method really shines in Go as not having CGo simplifies a lot of things, and as a decently performant JITed runtime exists in the form of wazero.

[1]: https://yklcs.com/blog/universal-libs-with-wasm


To me, this is the real value of Wasm: platform independent libraries with a standard interface that doesn’t require C.


WASM runtimes miss out on a _lot_ of optimizations that a battle-tested C compiler will perform, and sometimes require machine emulation (e.g. Go compiled to WASM results in a virtual machine/emulation layer to run Go code).

It can work, but it's not the fastest thing in the world.

I think languages that make working with C/C++ code much more seamless, e.g. as nice as working with Go code can be, are a better approach. Zig does this well and feels quite natural coming from Go. It can also be used to make CGO cross compilation 'just work' and alleviate many of those pains.


I feel like inefficient but convenient has been the default trade-off in so many places during the last couple of decades. WASM is opening the doors for all kinds of new solutions. I wonder what kind of cultures will develop around it, as regards efficiency.


Yes, Zig is best in class for C-interoperability.

Go’s FFI support is alright, but I find using WASM/WASI more pleasant.


This is awesome and one of the things I’m really excited about with WASM, and specifically Wazero. The Wazero team is top notch. Now someone just needs to do this with zstd and make it fast…


There's a pure-go zstd at https://github.com/klauspost/compress - it's likely faster than running the upstream zstd under Wazero.


Just for reference I did give it a try

https://github.com/wasilibs/go-zstd

Mostly because I hadn't realized that `compress` supports zstd. Wazero performed reasonably well against the cgo library but was indeed much slower than this proper pure Go port.


Another really interesting way to approach this problem would be to adapt wasm2c to emit Go output. It should result in better performance than wazero.


You mean this? https://github.com/WebAssembly/wabt/blob/main/wasm2c/README....

That seems like quite an undertaking. But at that point, it would make sense to cut out WASM entirely like https://datastation.multiprocess.io/blog/2022-05-12-sqlite-i...


Disclosure: I'm working on alternative Cgo-less bindings for SQLite, using wazero.

https://github.com/ncruces/go-sqlite3

One of the problems of the modernc approach (IMO) is that they're not just transpiling CPU/compute stuff, but entirely OS/platform stuff.

Each Go file of theirs is a xxx_os_arch.go that starts with 100s of OS-#defines-as-consts, and goes on to transpile fully #ifdefed code.

It also implements antithetical (in Go) stuff like goroutine local storage, because libc pthreads can't live without it.

And all IO is via direct syscalls that will never play nice with the Go scheduler, because again, this is OS level stuff.

WASM defines a cross platform CPU and an ABI, and using that for compute and the bottom OS layer in Go you get (IMO) a nicer end result.

Given the hard task of generating decent code from WASM at load time (wazero's compiler is pretty naive, a better one is being developed, but it will take seconds to generate good code for anything non trivial like SQLite) I wouldn't mind having a solution that translated to Go, or Go ASM, at build time.


Oh awesome. I was really hoping a native OCR would pop up but this really is the next best thing and a more realistic avenue.


Exactly, I expected to find one but couldn't, so I put together my own. It's not the fastest, but it'll do for my purposes.


Thanks for sharing!

Since OCR is a somewhat slow process, how does the WASM approach compare to running libtesseract in a subprocess and using some IPC layer to talk to Go? It would require a separate C++ compiler, but not CGo.

> one of the largest Open Source OCR

Tangential, but are there others as large as Tesseract? It seems to pop up anywhere I look.


> Tangential, but are there others as large as Tesseract?

The one serious competitor is PaddleOCR, which is faster on GPU, and also works better for Chinese and other non-Western scripts.

There are some newer ML-based projects like DocTR that have been catching up, at least for some use cases.


My intention was a "pure Go" approach, but the subprocess route is probably more performant.

I imagine just calling the Tesseract CLI from Go would be simplest if that's all you wanted.
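
For anyone wanting to try that route, here is a minimal sketch of shelling out to the CLI from Go. It assumes a `tesseract` binary is installed and on the PATH, and relies on the CLI convention that an output base of `stdout` prints the recognized text instead of writing a file. The helper names `tesseractArgs` and `ocrFile` are made up for this example and are not part of Gogosseract or gosseract.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// tesseractArgs builds the arguments for `tesseract <image> stdout -l <lang>`.
// Using "stdout" as the output base makes the CLI print the text directly.
func tesseractArgs(imagePath, lang string) []string {
	return []string{imagePath, "stdout", "-l", lang}
}

// ocrFile runs the tesseract CLI as a subprocess and returns the recognized text.
// Assumption: a `tesseract` executable is available on the PATH.
func ocrFile(ctx context.Context, imagePath, lang string) (string, error) {
	cmd := exec.CommandContext(ctx, "tesseract", tesseractArgs(imagePath, lang)...)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("tesseract failed: %w: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: ocr <image>")
		return
	}
	// OCR can be slow, so bound the subprocess with a timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	text, err := ocrFile(ctx, os.Args[1], "eng")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(text)
}
```

This keeps the Go build free of CGo, at the cost of requiring Tesseract to be installed separately on every machine that runs the program.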


Is Tesseract currently the best open source OCR library? Best in terms of accuracy.

How much difference is there between Tesseract and the best proprietary solutions?


Tesseract is the current best open source OCR library.

When looking at the “best” proprietary solution, there are a few worth mentioning:

- If you are looking for the best OCR to DOCX solution, ABBYY OCR SDK is the front runner. Their OCR engine is not AS accurate as others I’ll mention, but their output engine (i.e. taking data beyond just the character, like bold or underlined or font name) is probably the best in the market.

- Google Document AI/Cloud Vision is probably the best all-around OCR. The 2 flavors determine whether you want to handle scanned PDFs/images (DocAI) or generalized photos (Cloud Vision). I believe they also have some level of training capabilities via Vertex but I haven’t checked it out.

- IRIS OCR.. Meh

- AWS Textract and Azure Vision are worth mentioning as contenders, but just like Google Document AI, they’re cloud based and that may factor into your decision.

- I haven’t tried DocTR or Paddle OCR


Thanks for the detailed answer.


It mentions that this is a rewrite of gosseract; however, it is not a drop-in replacement, so it's more of a separate library in my opinion.


Technically I said reimplementation. But you are right in that it's not supposed to be a drop in replacement at all.

The only feature missing right now is Bounding Box detection, which I plan to add in the future.


Off topic, but in general, how does something like this compare to cloud-hosted OCR solutions?


Tesseract is worse than most commercial solutions, and/or requires more pre- and postprocessing.


this is sick



