
In an earlier comment you said: "PDFs are a clusterfuck of glyphs floating in space". On that point you are spot on. A textual PDF is in essence simply a series of instructions to position glyphs in space (where "space" is the 2D "sheet of virtual paper" that the glyphs render onto).

But you are incorrect in asserting that PDF readers do OCR. Most do not, and even Acrobat did not have OCR capability built in for a good long time in its early history.

However, because a PDF file is simply instructions to position glyphs, a reader that keeps a table of "where in space" (on the 2D sheet) it placed each glyph can implement select and copy by using the coordinates of the selection box to look up which glyphs were positioned inside it. Then, using either the reverse glyph map table (optional in the file, but recommended to be included) or, if that is missing, simply outputting the code point values that chose those glyphs, you get the "text back out" without doing any OCR.
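To make that concrete, here is a rough sketch in Python of the lookup I am describing. It is not how any particular reader is implemented; `PlacedGlyph`, `select_text`, and the `to_unicode` dict are names I made up for illustration, and I am assuming axis-aligned glyph boxes and a y axis that grows upward, as in PDF's default coordinate system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlacedGlyph:
    """One entry in the hypothetical 'where did I put each glyph' table."""
    code: int      # code value that selected the glyph in the page content
    x: float       # lower-left corner of the glyph box on the page
    y: float
    width: float
    height: float


def select_text(placed: list[PlacedGlyph],
                box: tuple[float, float, float, float],
                to_unicode: Optional[dict[int, str]] = None) -> str:
    """Return the text for glyphs lying fully inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    hits = [g for g in placed
            if g.x >= x0 and g.x + g.width <= x1
            and g.y >= y0 and g.y + g.height <= y1]
    # Crude reading order: top line first (y grows upward), then left to right.
    hits.sort(key=lambda g: (-g.y, g.x))
    out = []
    for g in hits:
        if to_unicode is not None and g.code in to_unicode:
            out.append(to_unicode[g.code])   # reverse map table is present
        else:
            out.append(chr(g.code))          # fall back to the raw code value
    return "".join(out)


# Two glyphs on one line, plus one on a lower line outside the selection.
placed = [
    PlacedGlyph(72, 10, 700, 8, 10),
    PlacedGlyph(105, 18, 700, 4, 10),
    PlacedGlyph(33, 10, 680, 4, 10),
]
print(select_text(placed, (0, 695, 100, 715)))
```

The fallback branch is also why copy and paste out of some PDFs gives garbage: when the reverse map is missing and the font uses its own private code values, the raw codes do not correspond to the characters you see on screen.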



