Unicode In Python, Completely Demystified

aston · on March 25, 2008

Punchline: "decode early, unicode everywhere, encode late."
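For anyone who wants that punchline in code, here's a minimal sketch. It assumes Python 3 (where `bytes` and `str` are separate types) and UTF-8 input; the function name `shout` is made up for illustration.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: uppercase some text that arrives as bytes."""
    text = raw.decode(encoding)    # decode early: bytes -> str at the boundary
    text = text.upper()            # unicode everywhere: all work happens on str
    return text.encode(encoding)   # encode late: str -> bytes only on the way out


result = shout("café".encode("utf-8"))
```

The point is that the encoding only shows up at the two edges; everything in between never has to know about it.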

There are actually more caveats to Unicode, especially if you have third-party code doing the decoding and encoding. There are ways to manufacture strings that are broken (not valid Unicode for display, not valid __-encoded ASCII), and if that happens, there's no way to recover from inside your app.
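One way this can happen (my guess at a typical case, not the only one) is a library decoding bytes with the wrong codec before handing you the string. A rough Python 3 sketch, assuming the bytes were really UTF-8 and the hypothetical third-party code assumed Latin-1:

```python
original = "café"
wire_bytes = original.encode("utf-8")      # b'caf\xc3\xa9'

# Hypothetical third-party step: decodes with the wrong codec.
# Latin-1 maps every byte to a character, so no error is raised.
mangled = wire_bytes.decode("latin-1")     # two junk characters replace the 'é'

# Your app then "encodes late" and bakes the damage in.
double_encoded = mangled.encode("utf-8")
```

Nothing here raises an exception, and `mangled` is a perfectly legal string, so the app has no signal that anything went wrong unless it already knows which codec was misapplied.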

I'd love to see a more in-depth treatment of the different decode error modes ('replace', 'ignore', and 'strict') and how they can screw up your data.
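To make the difference concrete, here's a small sketch in Python 3, assuming the input is Latin-1 bytes that get decoded as UTF-8 by mistake. The variable names are just for illustration.

```python
raw = b"caf\xe9s"  # "cafés" encoded as Latin-1; not valid UTF-8

# 'strict' (the default) refuses to guess and raises.
try:
    raw.decode("utf-8", errors="strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

# 'replace' swaps the bad byte for U+FFFD, the replacement character.
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' silently drops the bad byte.
ignored = raw.decode("utf-8", errors="ignore")
```

Only 'strict' tells you something is wrong. With 'replace' and 'ignore' the original byte is gone for good, so re-encoding the result can never give back the data you started with.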