
Link to a sample doc:

http://crocodoc.com/CaDZXS?embedded=true

Drop the "embedded=true" parameter to see it with their interface.

Beautiful stuff!




Here is one of the most complex PDFs we've come across so far, at least in terms of the typography and number of crazy fonts: http://crocodoc.com/KhoD84?embedded=true


That's pretty impressive! However, I normally browse with "Allow web pages to choose their own fonts" disabled (since 99% of websites have terrible font choices), and in this situation a lot of the text (most of page 37, or the block on the right-hand side of page 2) is rendered as gibberish; it looks like it's using code points from the Unicode private-use area. I guess that's probably what the original PDF source document was doing, but it misses the point of rendering things into HTML: even with font-changing enabled, I bet copy/paste doesn't work too well.

Is it possible to ensure all text uses its original code points, even at the expense of exact reproduction, or is this just a fundamental problem with the technology?
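
For anyone who wants to check this on text copied out of a converted page, here is a rough Python sketch. It is not anything Crocodoc does as far as I know; the function name `private_use_ratio` is made up for illustration. It just counts how many non-whitespace characters fall in Unicode general category "Co" (private use), which is what the standard `unicodedata` module reports for the private-use ranges.

```python
import unicodedata


def private_use_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are private-use.

    Hypothetical helper for illustration: a high ratio suggests the text was
    encoded with font-specific glyph codes rather than real code points, so
    copy/paste and fallback fonts will show gibberish.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # General category "Co" covers the BMP private-use area (U+E000..U+F8FF)
    # and the two supplementary private-use planes.
    private = sum(1 for c in chars if unicodedata.category(c) == "Co")
    return private / len(chars)


if __name__ == "__main__":
    sample = "Title \ue001\ue002\ue003"
    print(f"{private_use_ratio(sample):.2f}")
```

Paste a block of copied text in place of `sample`; a ratio near zero means the text is using ordinary code points, while a large ratio matches the gibberish described above.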


Do you OCR PDFs as well? I deal with a ton of non-OCR'd scanned documents.



