
When the PDF is very clearly structured, it works just fine. But if the layout has multiple columns and complex formatting, the output gets very imprecise. If the material is scanned, it won't work at all.



That is really hard, as there are no such things as columns in PDFs, only text starting at different (x,y) positions.

Hence most (if not all) programs export the text in the order it appears in the file.

And if it is scanned, there is no text at all (but you could OCR it).
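
To make the (x,y) point concrete, here is a rough sketch of how a tool might guess columns from positions alone. It assumes you already have the text fragments as (x, y, text) tuples from some extractor, that y grows upward as in PDF page coordinates, and that columns are left-aligned and separated by a clear horizontal gap. The name `reading_order` and the `column_gap` threshold are made up for illustration, not taken from any particular library, and indented lines, centered headings, or tables would likely break it.

```python
from typing import List, Tuple

# (x, y, text): start position of a fragment; y grows upward (assumed PDF-style coordinates)
Fragment = Tuple[float, float, str]


def reading_order(fragments: List[Fragment], column_gap: float = 50.0) -> List[str]:
    """Guess columns from x start positions, then read each column top to bottom.

    column_gap is a hypothetical threshold: a jump in x larger than this
    between neighbouring fragments is treated as the start of a new column.
    """
    columns: List[List[Fragment]] = []
    # Walk fragments left to right and split wherever x jumps by more than column_gap.
    for frag in sorted(fragments, key=lambda f: f[0]):
        if columns and frag[0] - columns[-1][-1][0] <= column_gap:
            columns[-1].append(frag)
        else:
            columns.append([frag])

    ordered: List[str] = []
    for column in columns:
        # A higher y is nearer the top of the page, so sort by descending y.
        for _, _, text in sorted(column, key=lambda f: -f[1]):
            ordered.append(text)
    return ordered


# Fragments in "file order": the two columns are interleaved line by line.
sample = [
    (72.0, 700.0, "Left 1"),
    (320.0, 700.0, "Right 1"),
    (72.0, 680.0, "Left 2"),
    (320.0, 680.0, "Right 2"),
]
print(reading_order(sample))
```

Exporting `sample` in file order would interleave the two columns, while the sketch reads the left column first and then the right one. It is only a heuristic: the file itself still says nothing about columns, so any threshold is a guess.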



