
Amazon Textract does a phenomenal job of extracting text from dodgy scanned PDFs - I've been running it against scanned typewritten text and even handwritten journal text from the 1880s with great results.

I built a tool for running OCR against every PDF in an S3 bucket (which costs about $1.50/thousand pages) here: https://simonwillison.net/2022/Jun/30/s3-ocr/




Textract really does do a good job of balancing cost, ease of use, and quality, at least for my hobbyist needs.

I was inspired by another recent comment you posted on HN, and after some testing of the Textract console [0] I wrote a simple "local only" command-line version [1] (Python, boto3) that does similar things to your tool.

I used my tool to OCR a few hundred comic strip images I've been meaning to OCR for a while now - the service did beautifully where other tools I've tried in the past struggled with the handwritten text on the comics. Textract is fast enough that running serially was fine for a one-off without involving the more parallelized S3 workflow.

[0] https://us-east-1.console.aws.amazon.com/textract/home?regio...

[1] https://github.com/mbafford/textract-cli
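For anyone curious what a "local only" call looks like, here is a minimal sketch of a synchronous Textract request from Python. It is not the code from textract-cli; the function names `extract_lines` and `ocr_local_image` are made up for illustration, and it assumes `client` is a boto3 Textract client (`boto3.client("textract")`) with credentials already configured.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def ocr_local_image(client, path):
    """OCR one local image by sending its raw bytes to Textract.

    `client` is assumed to be a boto3 Textract client; nothing is
    uploaded to S3 because the bytes travel in the request itself.
    """
    with open(path, "rb") as f:
        data = f.read()
    response = client.detect_document_text(Document={"Bytes": data})
    return extract_lines(response)
```

With boto3 installed, something like `ocr_local_image(boto3.client("textract"), "strip.png")` should return the recognized lines in reading order; looping over a few hundred files serially is just a `for` loop around that call.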


That's really neat: https://github.com/mbafford/textract-cli/blob/master/textrac... - my tool had to involve an S3 bucket too just because Textract won't let you upload a PDF directly to it without storing it in S3 first.
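A rough sketch of the S3-backed route, for comparison. This is not the s3-ocr implementation, just an illustration of Textract's asynchronous API under some assumptions: `client` is a boto3 Textract client, the PDF is already in the bucket, and the name `ocr_s3_pdf` and the simple polling loop are my own choices.

```python
import time


def ocr_s3_pdf(client, bucket, key, poll_seconds=5.0, sleep=time.sleep):
    """Start an async Textract job for a PDF in S3 and return its lines."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        sleep(poll_seconds)

    # Anything other than SUCCEEDED (including partial success) is
    # treated as a failure in this sketch.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with {result['JobStatus']}")

    # Results are paginated; follow NextToken until it runs out.
    lines = []
    while True:
        lines.extend(
            block["Text"]
            for block in result.get("Blocks", [])
            if block.get("BlockType") == "LINE"
        )
        token = result.get("NextToken")
        if not token:
            return lines
        result = client.get_document_text_detection(JobId=job_id, NextToken=token)
```

Polling keeps the example short; for a whole bucket you would probably want to start many jobs and collect results later instead of waiting on each one.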


Based on some testing just now, it looks like synchronous mode supports single-page PDFs, but if you try to OCR a multi-page document, it throws:

   An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format

Normally, I just use pdfsandwich [0] for PDFs, which has generally been good enough, but I'm tempted to switch to Textract, because it's been much better at rough scans in my tests.

[0] http://www.tobias-elze.de/pdfsandwich/
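If you wanted to try synchronous mode first and only fall back when the multi-page error above shows up, one way to sketch it is below. The assumptions: `client` is a boto3 Textract client, and the exception is botocore's usual `ClientError`, which carries a `response` dict with the error code. The helper names are hypothetical, and the check is duck-typed so the snippet does not need botocore imported.

```python
def is_unsupported_document(exc):
    """True if exc looks like Textract's UnsupportedDocumentException."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == "UnsupportedDocumentException"


def try_sync_ocr(client, data):
    """Return OCR'd lines, or None if Textract rejects the document format.

    A None result is the cue to fall back to another route, such as the
    S3-based asynchronous workflow.
    """
    try:
        response = client.detect_document_text(Document={"Bytes": data})
    except Exception as exc:
        if is_unsupported_document(exc):
            return None
        raise  # anything else (auth, throttling, ...) is a real error
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
```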



