Mathpix PDF search is fully visually powered and does not use underlying PDF metadata, even working on handwriting. It’s a great choice for researchers (especially in STEM) who want to build a searchable archive of PDFs.
Amazon Textract does a phenomenal job of extracting text from dodgy scanned PDFs - I've been running it against scanned typewritten text and even handwritten journal text from the 1880s with great results.
Textract really does do a good job of balancing cost, ease of use, and quality, at least for my hobbyist needs.
I was inspired by another recent comment you posted on HN, and after some testing of the Textract console [0] I wrote a simple "local only" command-line version [1] (Python, boto3) that does similar things to your tool.
I used my tool to OCR a few hundred comic strip images I've been meaning to OCR for a while now - the service did beautifully where other tools I've tried in the past struggled with the handwritten text on the comics. Textract is fast enough that running serially was fine for a one-off without involving the more parallelized S3 workflow.
Based on some testing just now, it looks like synchronous mode supports single-page PDFs, but if you try to OCR a multi-page document, it throws:
An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format
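For anyone who wants to reproduce this, here is a minimal sketch of the synchronous call with boto3. The function names `lines_from_response` and `ocr_single_page` are made up for illustration, and it assumes AWS credentials and a region are already configured. It is not the code from either tool mentioned in this thread.

```python
from __future__ import annotations

from typing import Any


def lines_from_response(response: dict[str, Any]) -> list[str]:
    """Collect the text of every LINE block in a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def ocr_single_page(path: str) -> list[str]:
    """OCR one image or single-page PDF with the synchronous API.

    Hypothetical helper: assumes credentials and region are configured.
    """
    import boto3  # imported lazily so the parser above works without boto3

    client = boto3.client("textract")
    with open(path, "rb") as f:
        data = f.read()
    try:
        response = client.detect_document_text(Document={"Bytes": data})
    except client.exceptions.UnsupportedDocumentException as exc:
        # This is the error reported above for multi-page PDFs.
        raise ValueError(
            f"{path}: not accepted by synchronous DetectDocumentText"
        ) from exc
    return lines_from_response(response)
```

Calling `ocr_single_page("page-001.png")` (a placeholder filename) would return the recognized lines in the order Textract reports them. Multi-page PDFs would need the asynchronous S3-based operations instead, which this sketch does not cover.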
Normally, I just use pdfsandwich [0] for PDFs, which has generally been good enough, but I'm tempted to switch to Textract, because it's been much better at rough scans in my tests.
1. PDFSegmenter (in the notebook) - extract the images of the text (yup, it does images too)
2. An OCR Executor [0][1] from Jina Hub [2] to extract the text from the images
3. Actually splice the text chunks together to be what you'd expect - that's the tricky part. Even text that splits across pages can be tricky to reassemble properly. PDFs are a pain in the butt, frankly.
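To give a feel for step 3, here is a rough sketch of one naive way to splice per-page text back together. `splice_pages` is a hypothetical helper, not part of Jina or the notebook, and the rules (rejoin a word hyphenated across a page break, continue an unfinished sentence, otherwise start a new paragraph) are assumptions that will misfire on real hyphenated compounds, headers, and footnotes.

```python
def splice_pages(pages: list[str]) -> str:
    """Join per-page OCR text into one string using simple heuristics."""
    result = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue  # skip blank pages
        if not result:
            result = text
        elif result.endswith("-"):
            # Word hyphenated across the page break: drop the hyphen and join.
            result = result[:-1] + text
        elif result[-1] in ".!?:":
            # Previous page ended a sentence: treat as a paragraph break.
            result += "\n\n" + text
        else:
            # Sentence continues on the next page.
            result += " " + text
    return result


pages = ["The docu-", "ment spans pages and", "continues here.", "New paragraph."]
print(splice_pages(pages))
```

Even this toy version shows why the step is tricky: every rule is a guess about layout that the OCR output no longer carries.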
Thanks for the resource! I'm having a lot of fun exploring your notebook. I'm not entirely sure why I decided Go would be the next programming language I learnt, but I'm glad I did!