
Under the hood, it uses https://github.com/pdfminer/pdfminer.six, which expects the text to be stored as text (an embedded text layer), not as scanned page images.



You mean the PDFSegmenter Executor in the notebook?


Yes


PDFSegmenter also extracts images, which can then be OCR'ed in the next step of the pipeline.
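
To make the point in this thread concrete, here is a rough sketch of how one might check whether a PDF has a usable text layer before deciding to fall back to OCR. It calls `extract_text` from `pdfminer.high_level` directly, not through PDFSegmenter. The helper name `needs_ocr`, the `extract` parameter, and the `min_chars` threshold are assumptions made up for illustration; they are not part of pdfminer.six or of the Executor discussed above.

```python
from typing import Callable, Optional


def needs_ocr(
    pdf_path: str,
    extract: Optional[Callable[[str], str]] = None,
    min_chars: int = 20,  # hypothetical threshold, tune for your documents
) -> bool:
    """Return True when the PDF yields too little embedded text.

    A scanned PDF usually has no text layer, so a text extractor returns
    little more than whitespace and page-break characters.
    """
    if extract is None:
        # Imported lazily so the helper can be exercised with a stand-in
        # extractor when pdfminer.six is not installed.
        from pdfminer.high_level import extract_text
        extract = extract_text

    text = extract(pdf_path)
    # strip() drops whitespace, including form feeds between pages.
    return len(text.strip()) < min_chars
```

With pdfminer.six installed, `needs_ocr("paper.pdf")` would use the real extractor. When it returns True, the page images are the only source of content, which is where the image extraction and OCR step mentioned above comes in.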



