Hacker News | gapovaj742's comments

great project


when I think about all the notes I have I'm always worried importing and exporting might be an issue


Hasn't DALL-E done this already?


And I'd say Disco creates more ethereal-feeling art. It's cool to watch noise just de-noise over time and see strange creations coming forth.


Sure, but it's invite only. This could be useful for people that aren't included.


okay but what if my PDF is non-parseable? Not sure if Python's any good for that


Mathpix PDF search is fully visually powered and does not use underlying PDF metadata, even working on handwriting. It’s a great choice for researchers (especially in STEM) who want to build a searchable archive of PDFs.


Amazon Textract does a phenomenal job of extracting text from dodgy scanned PDFs - I've been running it against scanned typewritten text and even handwritten journal text from the 1880s with great results.

I built a tool for running OCR against every PDF in an S3 bucket (which costs about $1.50/thousand pages) here: https://simonwillison.net/2022/Jun/30/s3-ocr/


Textract really does do a good job of balancing cost, ease of use, and quality, at least for my hobbyist needs.

I was inspired by another recent comment you posted on HN, and after some testing of the Textract console [0] I wrote a simple "local only" command-line version [1] (Python, boto3) that does similar things to your tool.

I used my tool to OCR a few hundred comic strip images I've been meaning to OCR for a while now - the service did beautifully where other tools I've tried in the past struggled with the handwritten text on the comics. Textract is fast enough that running serially was fine for a one-off without involving the more parallelized S3 workflow.

[0] https://us-east-1.console.aws.amazon.com/textract/home?regio...

[1] https://github.com/mbafford/textract-cli


That's really neat: https://github.com/mbafford/textract-cli/blob/master/textrac... - my tool had to involve an S3 bucket too just because Textract won't let you upload a PDF directly to it without storing it in S3 first.


Based on some testing just now, it looks like synchronous mode supports a single-page PDF, but if you try to OCR a multi-page document, it throws:

   An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format

Normally, I just use pdfsandwich [0] for PDFs, which has generally been good enough, but I'm tempted to switch to Textract because it's been much better on rough scans in my tests.

[0] http://www.tobias-elze.de/pdfsandwich/
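
For anyone who wants to try the synchronous call, here is a minimal sketch, assuming boto3 and configured AWS credentials. `extract_lines` and `page.pdf` are illustrative names, not taken from textract-cli or s3-ocr. The client is passed in as an argument, so the parsing logic can be tried against a stub without touching AWS.

```python
def extract_lines(client, document_bytes):
    """Run synchronous Textract OCR on raw bytes and return the detected lines.

    `client` is expected to behave like boto3.client("textract").
    Returns None when Textract rejects the document format, which is what
    the multi-page PDF case above ran into.
    """
    try:
        response = client.detect_document_text(Document={"Bytes": document_bytes})
    except client.exceptions.UnsupportedDocumentException:
        return None
    # Textract returns PAGE, LINE and WORD blocks; keep only whole lines.
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   with open("page.pdf", "rb") as f:   # hypothetical single-page file
#       print(extract_lines(boto3.client("textract"), f.read()))
```

Returning None on the unsupported-format error is just one choice for a sketch; a real tool would probably fall back to the S3-based asynchronous workflow instead.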


In that case I'd use:

1. PDFSegmenter (in the notebook) - extract the images of the text (yup, it does images too)

2. An OCR Executor [0][1] from Jina Hub [2] to extract the text from the images

3. Actually splice the text chunks together to be what you'd expect - that's the tricky part. Even text splitting over pages can be tricky to reassemble properly. PDFs are a pain in the butt, frankly.

[0] https://hub.jina.ai/executor/78yp7etm

[1] https://hub.jina.ai/executor/w4p7905v

[2] https://hub.jina.ai
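
To give a feel for step 3, here is a rough sketch of one naive way to splice per-page text back together. It is plain Python, not part of PDFSegmenter or any Jina Executor, and `splice_pages` is a made-up name. The heuristics (a trailing hyphen means a split word, no closing punctuation means the sentence continues) are assumptions that will misfire on real documents.

```python
def splice_pages(pages):
    """Naively join per-page text chunks into a single string."""
    result = ""
    for page in pages:
        page = page.strip()
        if not page:
            continue  # skip empty pages
        if not result:
            result = page
        elif result.endswith("-") and page[0].islower():
            # Word hyphenated across the page break: drop the hyphen and join.
            result = result[:-1] + page
        elif result[-1] in ".!?":
            # Previous page ended a sentence: treat the new page as a new paragraph.
            result += "\n\n" + page
        else:
            # Sentence runs over the page break: continue on the same line.
            result += " " + page
    return result
```

Genuinely hyphenated compounds, headers, footers and page numbers would all need extra handling, which is a big part of why this step is the tricky one.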


this is tough ngl


ngl this seems really cool. Does this mean we can access chats a year old?


oh wow, this is going to make things more accessible!


Thanks for the resource! I'm having a lot of fun exploring your notebook. I'm not entirely sure why I decided Go would be the next programming language I learnt, but I'm glad I did that!

