
Funnily enough, this is because the PDF spec literally allows you to map glyphs like that. Even some properly produced PDFs are broken like this, though it's become less common in recent years.

You're supposed to provide mapping tables (ToUnicode CMaps) for text extraction, but they are optional.
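
To make that concrete, here's a toy model in Python. It is not real PDF parsing: the dicts and the names `custom_encoding`, `to_unicode`, and `extract_text` are made up for illustration. The idea is that the page stores character codes, the font's encoding decides which glyph gets drawn, and the optional table is the only thing telling an extractor what the codes mean.

```python
# Toy model: a page stores character codes, not Unicode text.
# The font's encoding picks which glyph to draw for each code.
custom_encoding = {0x01: "glyph_H", 0x02: "glyph_i"}  # code -> glyph name (hypothetical)

# Optional table used only for extraction (plays the role of a ToUnicode CMap).
to_unicode = {0x01: "H", 0x02: "i"}


def extract_text(codes, to_unicode=None):
    """Map codes to Unicode if a table exists, otherwise fall back to a guess."""
    if to_unicode is None:
        # Without a table the extractor can only guess, e.g. read codes as Latin-1.
        return bytes(codes).decode("latin-1")
    # Unknown codes become U+FFFD (the replacement character).
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


page_codes = [0x01, 0x02]  # renders as "Hi" on screen
with_table = extract_text(page_codes, to_unicode)
without_table = extract_text(page_codes)
```

With the table you get the text back; without it you get control characters, even though the page looks identical when rendered.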

This fails pretty badly as a security measure, because you can identify the glyphs themselves in the font tables and build the mapping yourself.
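
A rough sketch of that idea in Python, assuming the embedded glyph outlines are identical to those of a known reference font. The outline data and the names `reference_font`, `obfuscated_font`, and `rebuild_mapping` are hypothetical; real fonts would likely need fuzzy shape matching or OCR rather than exact comparison.

```python
# Toy model: glyph outlines as tuples of points (made-up data).
reference_font = {
    "A": ((0, 0), (5, 10), (10, 0)),
    "B": ((0, 0), (0, 10), (6, 5)),
}

# Obfuscated font: the codes are scrambled, but the outlines are unchanged.
obfuscated_font = {
    0x41: ((0, 0), (0, 10), (6, 5)),   # code 0x41 actually draws "B"
    0x42: ((0, 0), (5, 10), (10, 0)),  # code 0x42 actually draws "A"
}


def rebuild_mapping(obfuscated, reference):
    """Recover code -> character by matching outlines against a known font."""
    outline_to_char = {outline: ch for ch, outline in reference.items()}
    return {
        code: outline_to_char[outline]
        for code, outline in obfuscated.items()
        if outline in outline_to_char  # skip glyphs we cannot identify
    }


mapping = rebuild_mapping(obfuscated_font, reference_font)
# Decode a run of character codes taken from the page.
recovered = "".join(mapping.get(c, "\ufffd") for c in [0x42, 0x41, 0x42])
```

The scrambled codes don't matter once you key on the shapes, which is why remapping glyphs only slows down copy-paste rather than actually protecting the text.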

It’s because PDF was designed before Unicode became viable and was meant to be flexible about character sets, so you can basically define your own encoding.

Coupled with embedded fonts, that’s pretty clever and shows good foresight from Adobe.

