
Łukasz Langa recently gave a PyCon talk [1] on the subject.

[1] https://www.youtube.com/watch?v=7m5JA3XaZ4k




That talk is proof of just how difficult Unicode is in practice:

* @15:32, "UTF-32 uses the same amount of bytes for (almost) all code points" — there is no "almost" about it; UTF-32 always uses 4 octets per code point.

* Code points and characters were conflated in places. A single user-perceived character can span several code points (e.g. a base letter plus a combining accent).

* It was implied that len() always gives you the length in code points in Python 3, whereas it doesn't in Python 2. In Python < 3.3, len() counts code units, just as it does in Python 2. On a narrow build those units are 16 bits wide, so the result is wrong for strings with code points outside the BMP. This particular problem wasn't solved until 3.3, with the implementation of PEP 393.
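
A quick way to check the three points above, as a minimal sketch assuming Python 3.3 or later (the sample string is made up for illustration: "e" plus a combining acute accent, followed by one code point outside the BMP):

```python
import unicodedata

# Hypothetical sample: "e" + U+0301 COMBINING ACUTE ACCENT + U+1F600 (outside the BMP).
s = "e\u0301\U0001F600"

# On Python 3.3+ (PEP 393), len() counts code points on every build.
code_points = len(s)

# UTF-32 is fixed width: 4 octets per code point, no exceptions.
# The "-le" variant is used so that no BOM is prepended.
utf32_bytes = len(s.encode("utf-32-le"))

# UTF-16 code units: this is what len() reported on a narrow build before 3.3.
utf16_units = len(s.encode("utf-16-le")) // 2

# NFC composes "e" + accent into one code point, so even the code point
# count depends on normalization; it is not a count of characters.
nfc_code_points = len(unicodedata.normalize("NFC", s))

print(code_points)      # 3
print(utf32_bytes)      # 12
print(utf16_units)      # 4
print(nfc_code_points)  # 2
```

A reader would most likely see two characters here, yet none of the first three numbers is 2, and the fourth only matches because this particular sequence happens to have a precomposed form.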

The author's main points regarding the difference between text and how you encode it are good.



