rahimnathwani on July 25, 2022 | on: Search PDFs with Transformers and Python Notebook

Under the hood, it uses https://github.com/pdfminer/pdfminer.six, which expects the text in the PDF to be stored as text (not as scanned page images).

alexcg1 on July 25, 2022

You mean the PDFSegmenter Executor in the notebook?

rahimnathwani on July 25, 2022

Yes

alexcg1 on July 25, 2022

PDFSegmenter also extracts images, which can then be OCR'ed in the next step of the pipeline
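
For anyone wanting to see the distinction discussed in this thread, here is a rough sketch of checking whether a PDF has a usable text layer before falling back to OCR. It calls pdfminer.six's `extract_text` directly and is not how PDFSegmenter is implemented. The names `has_text_layer` and `extract_or_flag` and the `min_chars` threshold are made up for illustration, and the OCR step itself is left out.

```python
from typing import Tuple


def has_text_layer(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption, not a pdfminer.six feature): treat the PDF
    as text-based if extraction yields enough non-whitespace characters."""
    non_whitespace = "".join(text.split())
    return len(non_whitespace) >= min_chars


def extract_or_flag(pdf_path: str) -> Tuple[str, bool]:
    """Return (extracted_text, needs_ocr) for the PDF at pdf_path."""
    # Imported lazily so has_text_layer can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    # Scanned, image-only PDFs typically come back empty or nearly empty.
    return text, not has_text_layer(text)
```

If `needs_ocr` comes back true, the page images would have to go through a separate OCR step, which is the "next step of the pipeline" mentioned above.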
