Łukasz Langa recently gave a PyCon talk [1] on the subject. [1] https://www.yout...

deathanatos · on May 29, 2017

That talk is proof of just how difficult Unicode is in practice:

* @15:32, "UTF-32 uses the same amount of bytes for (almost) all code points" — there is no "almost" about it; UTF-32 always uses 4 octets per code point.

* There was some amount of conflation between code points and characters.

* It was implied that len() will always give you the length in code points in Python 3, whereas it doesn't in Python 2. In Python < 3.3, len() counts code units (just as it does in Python 2), which on a narrow build are 16-bit, so the result is wrong for strings with code points outside the BMP. This particular problem wasn't solved until 3.3, which implemented PEP 393.
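
To make the three points above concrete, here is a minimal sketch, assuming Python 3.3 or later. The sample strings and variable names are mine, not from the talk. It uses "utf-32-le" and "utf-16-le" so that no byte order mark is added to the encoded output.

```python
import unicodedata

# One code point each: ASCII, Latin-1, BMP, and outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# UTF-32 is fixed width: every code point encodes to exactly 4 bytes.
utf32_sizes = [len(s.encode("utf-32-le")) for s in samples]

# UTF-16 needs a surrogate pair (two 16-bit code units) outside the BMP.
# That code unit count is what len() reported on narrow builds before 3.3.
utf16_units = [len(s.encode("utf-16-le")) // 2 for s in samples]

# On Python 3.3+ (PEP 393), len() counts code points.
code_points = [len(s) for s in samples]

# Code points are not characters: "e" followed by COMBINING ACUTE ACCENT
# displays as one character but is two code points. NFC composes it into one.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(utf32_sizes)                      # [4, 4, 4, 4]
print(utf16_units)                      # [1, 1, 1, 2]
print(code_points)                      # [1, 1, 1, 1]
print(len(decomposed), len(composed))   # 2 1
```

Note that even NFC does not make len() count characters in general, since many combining sequences have no precomposed form.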

The author's main points regarding the difference between text and how you encode it are good.