gapovaj742's comments

gapovaj742 · on Aug 1, 2022

great project

gapovaj742 · on July 27, 2022

When I think about all the notes I have, I'm always worried that importing and exporting might be an issue.

gapovaj742 · on July 27, 2022

Hasn't DALL-E done this already?

alexcg1 · on July 27, 2022

And I'd say Disco creates a more ethereal-feeling art. It's cool to watch noise just de-noise over time and see strange creations coming forth

frumper · on July 27, 2022

Sure, but it's invite only. This could be useful for people that aren't included.

gapovaj742 · on July 25, 2022

Okay, but what if my PDF is non-parseable? Not sure if Python's any good for that.

nicodjimenez · on July 25, 2022

Mathpix PDF search is fully visually powered and does not use underlying PDF metadata, even working on handwriting. It’s a great choice for researchers (especially in STEM) who want to build a searchable archive of PDFs.

simonw · on July 25, 2022

Amazon Textract does a phenomenal job of extracting text from dodgy scanned PDFs - I've been running it against scanned typewritten text and even handwritten journal text from the 1880s with great results.

I built a tool for running OCR against every PDF in an S3 bucket (which costs about $1.50/thousand pages) here: https://simonwillison.net/2022/Jun/30/s3-ocr/

ydant · on July 26, 2022

Textract really does do a good job of balancing cost, ease of use, and quality, at least for my hobbyist needs.

I was inspired by another recent comment you posted on HN, and after some testing of the Textract console [0] I wrote a simple "local only" command-line version [1] (Python, boto3) that does similar things to your tool.

I used my tool to OCR a few hundred comic strip images I've been meaning to OCR for a while now - the service did beautifully where other tools I've tried in the past struggled with the handwritten text on the comics. Textract is fast enough that running serially was fine for a one-off without involving the more parallelized S3 workflow.

[0] https://us-east-1.console.aws.amazon.com/textract/home?regio...

[1] https://github.com/mbafford/textract-cli

simonw · on July 26, 2022

That's really neat: https://github.com/mbafford/textract-cli/blob/master/textrac... - my tool had to involve an S3 bucket too just because Textract won't let you upload a PDF directly to it without storing it in S3 first.

ydant · on July 26, 2022

Based on some testing just now, it looks like synchronous mode supports single-page PDFs, but if you try to OCR a multi-page document, it throws:

   An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format

Normally, I just use pdfsandwich [0] for PDFs, which has generally been good enough, but I'm tempted to switch to Textract, because it's been much better at rough scans in my tests.

[0] http://www.tobias-elze.de/pdfsandwich/
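
For anyone who wants to try the synchronous mode discussed above, here is a minimal sketch in Python with boto3. It is not the code from textract-cli; the function names `extract_lines` and `ocr_file` and the default region are assumptions made for illustration. It sends the raw bytes of one image or single-page PDF to `detect_document_text`, keeps only the `LINE` blocks, and turns the `UnsupportedDocumentException` quoted above into a clearer local error.

```python
from typing import Any, Dict, List


def extract_lines(response: Dict[str, Any]) -> List[str]:
    """Collect the text of every LINE block in a DetectDocumentText response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def ocr_file(path: str, region: str = "us-east-1") -> List[str]:
    """OCR one image or single-page PDF with Textract's synchronous API.

    The region default is an assumption; use whichever region your account uses.
    """
    # Imported here so extract_lines() stays usable without the AWS SDK installed.
    import boto3
    from botocore.exceptions import ClientError

    client = boto3.client("textract", region_name=region)
    with open(path, "rb") as handle:
        payload = handle.read()

    try:
        response = client.detect_document_text(Document={"Bytes": payload})
    except ClientError as err:
        # Multi-page PDFs were rejected with this error code in the test above.
        if err.response["Error"]["Code"] == "UnsupportedDocumentException":
            raise ValueError(
                f"{path}: not accepted in synchronous mode (multi-page PDF?)"
            ) from err
        raise

    return extract_lines(response)
```

Calling `ocr_file("page.png")` needs AWS credentials configured locally. Running serially over a folder of images is just a loop over this function, which matches the one-off workflow described earlier in the thread.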

alexcg1 · on July 25, 2022

In that case I'd use:

1. PDFSegmenter (in the notebook) - extract the images of the text (yup, it does images too)
2. An OCR Executor [0][1] from Jina Hub [2] to extract the text from the images
3. Actually splice the text chunks together to be what you'd expect - that's the tricky part. Even text splitting over pages can be tricky to reassemble properly.

PDFs are a pain in the butt, frankly.

[0] https://hub.jina.ai/executor/78yp7etm

[1] https://hub.jina.ai/executor/w4p7905v

[2] https://hub.jina.ai
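
To give a feel for step 3 above, here is a rough plain-Python sketch of splicing per-page OCR text back together. It does not use any Jina API; `splice_pages` is a hypothetical helper, and the rules it applies (glue a trailing hyphen, start a new paragraph after sentence-ending punctuation, otherwise join with a space) are simple heuristics assumed for illustration, not a robust solution.

```python
from typing import List


def splice_pages(pages: List[str]) -> str:
    """Join per-page OCR text, repairing words and sentences split across pages."""
    result = ""
    for page in pages:
        text = page.strip()
        if not text:
            # Skip blank pages entirely.
            continue
        if not result:
            result = text
        elif result.endswith("-"):
            # Word hyphenated across the page break: glue the two halves.
            result = result[:-1] + text
        elif result[-1] in ".!?:":
            # Previous page ended a sentence: treat the new page as a new paragraph.
            result += "\n\n" + text
        else:
            # Sentence runs over the page break: join with a single space.
            result += " " + text
    return result


if __name__ == "__main__":
    print(splice_pages(["The quick brown", "fox jumps.", "New para-", "graph here."]))
```

The hyphen rule will wrongly merge genuine compounds such as "well-" / "known", and headers, footers, and multi-column layouts are not handled at all, which is a good part of why this step is the tricky one.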

gapovaj742 · on July 6, 2022

this is tough ngl

gapovaj742 · on April 27, 2022

ngl this seems really cool. Does this mean we can access chats a year old?

gapovaj742 · on April 27, 2022

oh wow, this is going to make things more accessible!

gapovaj742 · on April 26, 2022

Thanks for the resource! I'm having a lot of fun exploring your notebook. I'm not entirely sure why I decided Go would be the next programming language I learnt, but I'm glad I did that!