> Chars should be 32-bit, since 16 bits aren't enough to represent all Unicode codepoints.
That would be a lot better, but it still oversimplifies things and is still going to cause problems. Thinking "chars are 32 bits" ignores the fact that, in Unicode, a "char" (that is, a codepoint) and a single character you see on the screen (that is, a glyph) are not the same thing: some codepoints don't map to glyphs at all, such as bidirectional markers, [1] and some codepoints modify existing glyphs, such as combining characters. [2] And Zalgo waits for people who forget about combining characters. [3] Zalgo hungers, mortal. [4]
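To make the codepoint/glyph gap concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `count_non_combining` is made up for illustration, and it is only a rough approximation: it skips combining marks but does not do real grapheme cluster segmentation, which needs a proper Unicode text segmentation library.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two codepoints, one visible character.
decomposed = "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE: one codepoint, same visible character.
precomposed = "\u00e9"

print(len(decomposed))            # 2
print(len(precomposed))           # 1
print(decomposed == precomposed)  # False, even though they render the same

# Normalizing to NFC composes the pair into the single precomposed codepoint.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# U+200F RIGHT-TO-LEFT MARK is a codepoint with no glyph of its own.
rlm = "\u200f"
print(unicodedata.category(rlm))  # Cf (format character)


def count_non_combining(text):
    """Hypothetical helper: count codepoints that are not combining marks.

    This approximates "characters you see" only for simple cases.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# A tiny Zalgo-style string: one base letter with three combining marks stacked on it.
zalgo = "Z\u0300\u0301\u0302"
print(len(zalgo))                  # 4
print(count_non_combining(zalgo))  # 1
```

Note that making every "char" 32 bits wide changes none of these numbers: `len` already counts whole codepoints here, and the answer still isn't the number of things on the screen.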
The underlying 'problem' is that Unicode is the first text-processing standard that's actually complex enough to be useful for more than one language. Its complexity reflects the complexity of the real world.
[1] http://www.iamcal.com/understanding-bidirectional-text/
[2] http://en.wikipedia.org/wiki/Combining_character
[3] http://creepypasta.wikia.com/wiki/Zalgo and http://eeemo.net/
[4] http://en.wikipedia.org/wiki/Sinistar