
Here is one of the most complex PDFs we've come across so far, at least in terms of the typography and number of crazy fonts: http://crocodoc.com/KhoD84?embedded=true



That's pretty impressive! However, I normally browse with "Allow web pages to choose their own fonts" disabled (since 99% of websites have terrible font choices), and in that setup a lot of the text (most of page 37, or the block on the right-hand side of page 2) is rendered as gibberish; it looks like it's using code points from the Unicode private-use area. I guess that's probably what the original PDF source document was doing, but it misses the point of rendering things into HTML: even with font-changing enabled, I bet copy/paste doesn't work too well.

Is it possible to ensure all text uses its original code points, even at the expense of exact reproduction, or is this just a fundamental problem with the technology?
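
For anyone who wants to check this on their own documents, here is a rough sketch of how you might measure how much of a chunk of extracted text sits in the private-use area. It only relies on Python's standard `unicodedata` module, where private-use characters have the general category `Co`. The function name `private_use_ratio` is made up for illustration, and nothing here reflects how Crocodoc actually works.

```python
import unicodedata


def private_use_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are private-use.

    Hypothetical helper: a high ratio suggests the text was extracted with
    font-specific glyph codes rather than real Unicode code points, so
    copy/paste and fallback fonts will likely produce gibberish.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # General category "Co" covers the BMP private-use area (U+E000..U+F8FF)
    # as well as the two supplementary private-use planes.
    private = sum(1 for c in chars if unicodedata.category(c) == "Co")
    return private / len(chars)


if __name__ == "__main__":
    print(private_use_ratio("plain readable text"))
    print(private_use_ratio("\ue000\ue001\ue002"))
```

A ratio near zero means the text should survive a font change; a ratio near one means the glyphs only make sense with the embedded font.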


Do you OCR PDFs as well? I deal with a ton of non-OCR'd scanned documents.



