Amazon Textract does a phenomenal job of extracting text from dodgy scanned PDFs...

ydant · on July 26, 2022

Textract really does do a good job of balancing cost, ease of use, and quality, at least for my hobbyist needs.

I was inspired by another recent comment you posted on HN, and after some testing of the Textract console [0] I wrote a simple "local only" command-line version [1] (Python, boto3) that does similar things to your tool.

I used my tool to OCR a few hundred comic strip images I've been meaning to process for a while now. The service did beautifully on the handwritten text in the comics, where other tools I've tried in the past struggled. Textract is fast enough that processing the images serially was fine for a one-off, without involving the more parallelized S3 workflow.

[0] https://us-east-1.console.aws.amazon.com/textract/home?regio...

[1] https://github.com/mbafford/textract-cli
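For anyone wondering what a "local only" call roughly looks like, here is a minimal sketch. It is not the actual textract-cli code: it sends raw image bytes to the synchronous DetectDocumentText API and keeps only the LINE blocks. The helper names `lines_from_response` and `ocr_image_bytes` are made up for illustration, and `client` is assumed to be a boto3 Textract client with credentials already configured.

```python
def lines_from_response(response):
    """Return the text of every LINE block in a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def ocr_image_bytes(client, data):
    """Run synchronous text detection on raw image bytes (no S3 involved).

    `client` is assumed to be boto3.client("textract").
    """
    response = client.detect_document_text(Document={"Bytes": data})
    return lines_from_response(response)
```

Usage would be something like `ocr_image_bytes(boto3.client("textract"), open(path, "rb").read())`, called once per image in a plain loop, which matches the serial one-off approach described above.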

simonw · on July 26, 2022

That's really neat: https://github.com/mbafford/textract-cli/blob/master/textrac... - my tool had to involve an S3 bucket too just because Textract won't let you upload a PDF directly to it without storing it in S3 first.
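A rough sketch of the S3-based route, assuming the PDF has already been uploaded: start an asynchronous job with StartDocumentTextDetection, poll GetDocumentTextDetection until it finishes, then follow `NextToken` through the paginated results. The function name `ocr_pdf_in_s3` and the polling interval are illustrative assumptions, not taken from either tool, and `client` is assumed to be a boto3 Textract client.

```python
import time


def ocr_pdf_in_s3(client, bucket, key, poll_seconds=2.0):
    """Start an asynchronous text-detection job and collect its LINE text."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    # This sketch treats anything other than SUCCEEDED as a failure.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Results are paginated; follow NextToken until it runs out.
    lines = []
    while True:
        lines += [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
        token = result.get("NextToken")
        if not token:
            return lines
        result = client.get_document_text_detection(JobId=job_id, NextToken=token)
```

Simple polling keeps the sketch short; a real tool might prefer the SNS completion notification that the async API can send instead of sleeping in a loop.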

ydant · on July 26, 2022

Based on some testing just now, it looks like synchronous mode does support single-page PDFs, but if you try to OCR a multi-page document, it throws:

   An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format
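One way a script could cope with that error, sketched under the assumption that `client` is a boto3 Textract client: try the synchronous call first and treat `UnsupportedDocumentException` as the signal to fall back to another route. The function name `try_sync_ocr` is hypothetical.

```python
def try_sync_ocr(client, data):
    """Return LINE text from a synchronous call, or None if the format is rejected."""
    try:
        response = client.detect_document_text(Document={"Bytes": data})
    except client.exceptions.UnsupportedDocumentException:
        # For example a multi-page PDF; the caller can fall back to the S3 workflow.
        return None
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
```

Returning None rather than re-raising lets the caller decide what to do next, for example uploading the file to S3 or splitting the PDF into single pages.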

Normally I just use pdfsandwich [0] for PDFs, which has generally been good enough, but I'm tempted to switch to Textract, because it's been much better on rough scans in my tests.

[0] http://www.tobias-elze.de/pdfsandwich/