Link to a sample doc: http://crocodoc.com/CaDZXS?embedded=true drop the "embedde...

rdamico · on Feb 16, 2011

Here is one of the most complex PDFs we've come across so far, at least in terms of the typography and number of crazy fonts: http://crocodoc.com/KhoD84?embedded=true

thristian · on Feb 16, 2011

That's pretty impressive! However, I normally browse with "Allow web-pages to choose their own fonts" disabled (since 99% of websites have terrible font choices), and with that setting a lot of the text (most of page 37, or the block on the right-hand side of page 2) is rendered as gibberish; it looks like it's using code points from the Unicode Private Use Area. I guess that's probably what the original PDF source document was doing, but it misses the point of rendering things into HTML: even with font-changing enabled, I bet copy/paste doesn't work too well.

Is it possible to ensure all text uses its original code points, even at the expense of exact reproduction, or is this just a fundamental problem with the technology?
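
For anyone who wants to check this on text copied out of a converted page, here is a minimal sketch in Python. It is not how Crocodoc works; the function name `private_use_ratio` is made up for illustration, and the only assumption is that remapped glyphs show up as characters in the Unicode general category `Co` (private use), which the standard `unicodedata` module reports.

```python
import unicodedata


def private_use_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are private-use.

    Hypothetical helper for illustration: a high ratio suggests the text was
    emitted with font-specific code points rather than the original characters,
    so it will look like gibberish without the embedded font.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    # "Co" is the Unicode general category for private-use code points
    # (U+E000..U+F8FF plus the two supplementary private-use planes).
    private = sum(1 for ch in visible if unicodedata.category(ch) == "Co")
    return private / len(visible)


if __name__ == "__main__":
    sample = "Hello \ue001\ue002\ue003"
    print(f"{private_use_ratio(sample):.2f}")
```

Pasting a suspicious block of text into `private_use_ratio` gives a rough signal: ordinary prose should score at or near zero, while text built from remapped glyphs should score close to one.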

bane · on Feb 17, 2011

Do you OCR PDFs as well? I deal with a ton of non-OCR'd scanned documents.