It's also easier to get away with Rust's decision to say that no, strings aren't sequences of code points (which they aren't), if you do all the work to support ASCII operations on bytes anyway.
Rust defines things like is_ascii_hexdigit() on both char (a Unicode scalar) and u8 (a byte) and so if you're writing some low-level code which cares only about bytes you aren't expected to either turn them into a string to find out if the byte you're looking at is an ASCII digit or improvise something.
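A minimal illustration of that, using the standard-library predicates just named (the values asserted are what they actually return):

    fn main() {
        // The same ASCII predicate exists on both bytes and chars, so
        // byte-oriented code never has to detour through a String.
        let raw: &[u8] = b"7f";
        assert!(raw[0].is_ascii_hexdigit()); // u8::is_ascii_hexdigit
        assert!('f'.is_ascii_hexdigit());    // char::is_ascii_hexdigit
        assert!(!'g'.is_ascii_hexdigit());
    }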
This sort of thing means the programmer who is moving bytes and is very angry about the notion of character encoding needn't touch Rust's Unicode stuff, while the programmer who is responsible for text rendering isn't given a bunch of undifferentiated bytes and told "Good luck". Somebody needs to figure out what the encoding is, but likely that programmer actually cares about the difference between ISO-8859-1 and Windows code page 1252 or at least is aware that somebody else might care.
> This is because code points have no intrinsic meaning. They are not “characters”.
This is simply false. The true statement is: "Not all code points are characters".
- Many code points are characters. For instance, everything in the ASCII range.
- Codepoints in the ASCII range are often used for delimiters: quotes, commas, various brackets ... they have semantics.
- Generic text manipulating routines don't require code points to have semantics, but they require indexing.
We can make an analogy here to UTF-8. Let's pretend that "character" means "valid multi-byte UTF-8 code" and "code point" means "byte".
Code point (i.e. byte) access to a UTF-8 string is extremely useful.
UTF-8 strings can be processed by code that doesn't understand UTF-8 at all; for instance you can split a UTF-8 string on commas or spaces using some function written in 1980. That function will use pointers or indices or some combination thereof into the string, using subroutines that just blindly copy ranges of bytes without caring what they mean. The UTF-8 won't be torn apart because the delimiters don't occur in the middle of a UTF-8 sequence.
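A sketch of that idea in Rust, in the spirit of the 1980 routine: the split logic only compares raw bytes against b',' and never decodes anything, yet the UTF-8 comes out intact.

    fn main() {
        // A byte-level split that knows nothing about UTF-8.
        let data = "cafés, naïve, 日本語".as_bytes();
        let fields: Vec<&[u8]> = data.split(|&b| b == b',').collect();

        // Every field is still valid UTF-8, because the comma byte can
        // never occur inside a multi-byte sequence.
        for field in &fields {
            println!("{}", std::str::from_utf8(field).unwrap());
        }
    }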
> Many code points are characters. For instance, everything in the ASCII range.
It is definitely unclear in what sense NUL should be "a character", and likewise for most of the ASCII control codes: are DELETE and ESCAPE characters? How about VERTICAL TAB?
> Generic text manipulating routines don't require code points to have semantics, but they require indexing.
While there have been programming languages which posit such routines, it's unclear what they mean. For example it's common to define some sort of sub-string function substr(string, A, B) and say that A and B are indexes into the code points. But, to what end? Typically such functions are dangerous (or even undefined) unless you got the indices A and B by some specialised means, suggesting actually indexing into the string isn't really how it works.
In contrast, Rust offers lots of text manipulation routines which are well defined but involve no such indexing, such as split_once and trim_end_matches.
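For example (a small sketch; the input string is made up):

    fn main() {
        let line = "name=Ada Lovelace   ";
        // No numeric indices anywhere: both routines hand back subslices.
        if let Some((key, value)) = line.split_once('=') {
            assert_eq!(key, "name");
            assert_eq!(value.trim_end_matches(' '), "Ada Lovelace");
        }
    }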
We can see that UTF-8 has code units of bytes, and sure enough indexing the code units makes sense, but insisting these should be considered an analogy to code points misses why we have two different terms.
> For example it's common to define some sort of sub-string function substr(string, A, B) and say that A and B are indexes into the code points. But, to what end?
Right; this is what I'm getting at. substr(str, A, B) is a low level function: a servant. A and B have a meaning somewhere, but not to this routine.
substr does not need to know what the indices mean. The indices could mean "start and end of a file suffix" or "third field in a line of CSV", or "integer part of floating-point token".
Software is layered, and not all the semantics can be known to every layer. Usually when we have semantic leakage/sharing across layer boundaries, it's for some optimization purpose.
For instance, the ethernet driver and hardware don't know that the frames passing through them are IP or whatever else. (Though, for instance, when you have TCP offloading to the hardware, then the hardware groks that semantics info.)
I have a graphical window, and I have a bunch of text (in whatever Unicode encoding you want---it doesn't matter to me) that I need to display in said graphical window such that the text doesn't run off one edge of the window and become unreadable. How do I break the text up into segments that will fit within the horizontal width of the graphical window without changing the meaning of any of the text?
Forty years ago I could assume ASCII and be done with it. These days? With the yearly changes to the "Rules of Unicode"? Screw that, let's build a self-landing rocket, which appears to be easier these days than actually displaying text.
> How do I break the text up into segments that will fit within the horizontal width of the graphical window without changing the meaning of any of the text?
This is language dependent. For instance, the Japanese don't care. I was reading a novel a few months ago, and a chapter ended in a sentence ending in "た。" (ta.) due to some past tense verb. This "た。" did not fit, and so was just punted to the next page, which was otherwise entirely blank. There is no concept of hyphenation.
So if the text is Japanese, you just chop it into the required line length and you're pretty much done; no hyphenation is required, no multi-pass breaking algorithm to create an evenly gray page, nothing.
A character isn't just something that produces a picture, and advances the printing position to the right.
Yes, a byte index will work if we have a multi-byte string such as UTF-8; and a code point index would be inconvenient since displacement calculations on it are meaningless.
If we are talking about a string as an array of code points (like a wchar_t * string on sane platforms where a wchar_t can hold any code point) then we use code point indexing. C has wmemcpy, wcslen, wcscpy, ...
Functions that index into the string-considered-as-a-sequence-of-characters can be meaningful, and sometimes useful, in cases where you know that the string contains only characters from some simple restricted subset.
(You might know that "by construction", or by checking.)
In a language like Rust which doesn't directly support refinement types, it's common to end up using the standard str or String types in cases where you know such a thing.
So I think there are times when that sort of indexing function would be sensible.
A common example is when you're working with a very restricted subset of printable ASCII (things like stock codes, ISBNs, EANs). Or not so long ago I was working with 64-character-long strings containing only U+0020 and U+2654..=U+265F.
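A sketch of the "by checking" route for that last 64-character case (the concrete board contents below are invented):

    fn main() {
        // Exactly 64 scalars, each either U+0020 or a piece in U+2654..=U+265F.
        let board = "♔♕♖♗♘♙".to_string() + &" ".repeat(58);
        let ok = board.chars().count() == 64
            && board.chars().all(|c| c == ' ' || ('\u{2654}'..='\u{265F}').contains(&c));
        assert!(ok);

        // Once that invariant is checked, indexing by code point is well defined.
        assert_eq!(board.chars().nth(3), Some('♗'));
    }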
> in cases where you know that the string contains only characters from some simple restricted subset
But I'm saying, even in cases where you don't!
A string function that works with byte characters, written before Unicode existed, can do useful processing on UTF-8 data which contains characters that didn't exist when that code was written.
That's true, but in most of those cases you don't need to be able to use numeric character-count-based indexes into the string (which is what the article is arguing that you don't need).
You'd typically be happy if the parsing function that you're using to find the location of (say) each comma in the string gives you an opaque token for each such location, with a way to use those tokens to get slices of the string back.
So in practice we can use byte offsets into a UTF-8 string as those tokens, while the programmer doesn't really have to care that that's what they are.
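A small sketch of that pattern in Rust, where the "token" happens to be a byte offset:

    fn main() {
        let s = "naïve,42";
        // find() hands back a byte offset; the caller treats it as an
        // opaque position and only ever feeds it back into slicing.
        if let Some(pos) = s.find(',') {
            assert_eq!(&s[..pos], "naïve");
            assert_eq!(&s[pos + 1..], "42"); // +1 is the known length of ','
        }
    }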
You can't assume that any code point is a character. This is because of combining marks.
If you have a string with an ASCII 'A' codepoint followed by U+0328 codepoint, then it's a two-codepoint 'Ą' character. The first code point wasn't a character itself, but a fragment of a different one.
Unicode is inherently a variable-width encoding, and later codepoints can change meaning of earlier ones.
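A small illustration of that (the grapheme count uses the third-party unicode-segmentation crate that comes up elsewhere in this thread):

    // Third-party crate, assumed in Cargo.toml: unicode-segmentation = "1"
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "A\u{0328}"; // LATIN CAPITAL LETTER A + COMBINING OGONEK
        assert_eq!(s.chars().count(), 2);         // two code points...
        assert_eq!(s.graphemes(true).count(), 1); // ...one user-perceived 'Ą'
    }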
If I specify that the first character of a string is a type code A, B or C, followed by some field of data, then if the type code is A and the data starts with U+0328, it is just the type code A followed by an unrelated U+0328. That's an example of why it's important to be able to say text[0] to just get that A.
It may be that if you send the whole datum to a text rendering device (say, for debugging) you get a funny result starting with the glyph Ą. If it bothers you, then don't be lazy and parse things out.
I'm not sure what you're describing, but it's not Unicode (type code and field of data are not in Unicode jargon, so I don't know if you're just loose with terminology or describing some other system).
Combining marks in Unicode are not some unrelated surprise. They're required to correctly interpret the code point they follow. Unicode requires that you continue to consume them until they end or the input ends. In other words, Unicode is always a variable-length encoding. text[0] alone is never sufficient.
The fundamental problem with grapheme clusters as a basic programming language concept is that they are not uniformly defined: they depend on the Grapheme_Extend property of Unicode characters, which is known only for already-defined codepoints. That means such splitting is hard to do in a forward-compatible manner. It is also costly and in many cases unnecessary, which makes it more suitable to be provided by a library than to be a core language concept.
A secondary issue is that with codepoints (or bytes) as the basic concept, one can compose grapheme clusters in the same way one composes text in general, while if grapheme clusters are the basic concept, one needs a specialized operation for composing them from codepoints.
OTOH, I pretty much agree that O(1) access to code points is not important and that keeping the internal representation in UTF-8 (or in the input encoding) is OK for most purposes.
Completely agreed, and this is a pretty nice overview of the problems with thinking of strings in terms of "characters" or "code points".
Graphemes are almost always the closest to what people mean when they say "character".
But you also shouldn't do logic based on grapheme, unless you're contributing to harfbuzz and know enough to know exactly why this advice is wrong. Don't split or concatenate strings for any reason if you're doing internationalized stuff. E.g. ask your translators to give you a separate string for a drop-cap, and to remove it from the string that follows, do not just pluck out the first grapheme because it could look like nonsense.
---
You should literally never need or want to interact with Unicode directly, unless you're building the foundational layers of other systems (rendering, Unicode normalization, etc). If you find that you have to, you're probably losing encoding information somewhere - that loss is the bug to fix, don't try to patch it somehow, you'll just cause weirder errors elsewhere - or doing something fundamentally irrational, like splitting a string somehow. The never-ending pain you encounter while doing this stuff is a sign you're Doing It Wrong™ and should step back and question the basics, not that it just needs one more fix to work correctly.
If you're doing single-language logging for developers or whatever? Yeah, go wild. Though watch out for irrationally chopped user input, ya gotta make sure your log analyzer won't choke on bad UTF-8.
> Graphemes are almost always the closest to what people mean when they say "character".
But which glyphs[1] are considered graphemes and which are composites varies according to the individual writing system, which is a finer level than Unicode deals with.
In French, é, è, and à are all single graphemes; they are three different vowels and those are their written forms. It's just a coincidence that é and è appear to share an "e" element, and it's also a coincidence that è and à appear to share a diacritic marking. The grave accent has no meaning in isolation, and while the "e" does, its meaning is separate from and unrelated to the meaning of either é or è.
But in Hanyu Pinyin, é is two graphemes, è is two graphemes, and à is two graphemes. (With a total of four graphemes being represented across the set.) É and è share their underlying vowel, while è and à express different vowels but share their underlying tone.
You could demand that strings representing Chinese always use U+0301 COMBINING ACUTE ACCENT alongside U+0065 LATIN SMALL LETTER E, while strings representing French must always use U+00E9 LATIN SMALL LETTER E WITH ACUTE. But that would be insane. Unicode can't generally distinguish graphemes from composite glyphs and doesn't want to.
[1] In this comment, I will define "glyph" to be the unit of output produced by a font. You put some kind of binary data into the font and get image data out; anything that is considered a single indivisible unit of output from the font is a "glyph".
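For what it's worth, Unicode treats the precomposed and combining spellings of é as canonically equivalent regardless of language, and normalization maps between them; a quick check (using the third-party unicode-normalization crate):

    // Third-party crate, assumed in Cargo.toml: unicode-normalization = "0.1"
    use unicode_normalization::UnicodeNormalization;

    fn main() {
        let combining = "e\u{0301}";  // U+0065 + U+0301 COMBINING ACUTE ACCENT
        let precomposed = "\u{00E9}"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        assert_eq!(combining.nfc().collect::<String>(), precomposed);
        assert_eq!(precomposed.nfd().collect::<String>(), combining);
    }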
Language / Unicode / encoding interaction details is one of those things where I pretty much always learn something whenever it comes up. Including from your comment (thanks!).
It's fascinating how incredibly widely varied every possible pattern / "rule" is. There's just no end to the edge cases.
Don't split or concatenate strings! Period! Ask your translators to do it, they actually have some idea of what is linguistically and socially acceptable!
As someone who's written a lot of unicode-aware code in multiple languages, I don't agree at all. The problem I think a lot about is collaborative text editing. I need to express an insert into a text document - eg, "Insert 'a' at position X". The obvious question that comes up is - how should we define "X"? There's a mess of options:
- UTF-8 byte offset (from the start of the string). (Or UCS2 offset)
- Extended grapheme cluster offset
- Code point offset
You can also use a line/column position, but the column position ends up being one of the 3 above.
I want collaborative editing protocols to work across multiple programs, running on multiple systems. (Eg, a user is in a web browser on a phone, talking to a server written in rust.)
This post says that codepoint offsets are Bad, but in my mind they're the only sane answer to this problem.
Using byte offsets has two problems:
1) You have to pick an encoding, and that's problematic for cross-language compatibility. UTF-8 byte offsets are meaningless in javascript - they're slow & expensive to convert. UCS2 offsets are meaningless in rust.
2) They make it possible to express invalid data. Inserting in the middle of a codepoint is an invalid operation. I don't want to worry about different systems handling that case differently. Using codepoint offsets make it impossible to even represent the idea of inserting in the middle of a character.
Using grapheme cluster offsets is problematic because the grapheme clustering rules change all the time. I want editing operations to be readable from any programming language, and across time. Saying "Insert 'a' at position 20" (measured in grapheme clusters) is ambiguous because "position 20" will drift as unicode's grapheme cluster rules change. The result is that old android phones and new iphones can't reliably edit a document together. And old editing traces can't be reliably replayed.
Measuring codepoints is better than the other options because it's stable, cross platform and well defined. If you aren't aware of those benefits, you aren't understanding my use case.
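For a UTF-8 system, applying a code-point-indexed edit is just a linear scan; a minimal sketch (the helper name and document are hypothetical, not any particular protocol's real code):

    // Convert a code point offset from the wire into a local byte offset
    // before applying an insert to a Rust string.
    fn byte_offset(s: &str, codepoint_offset: usize) -> usize {
        s.char_indices()
            .nth(codepoint_offset)
            .map(|(byte_idx, _)| byte_idx)
            .unwrap_or(s.len()) // offset equal to the char count means "at the end"
    }

    fn main() {
        let mut doc = String::from("héllo");
        let at = byte_offset(&doc, 2); // insert before the third scalar value
        doc.insert(at, 'X');
        assert_eq!(doc, "héXllo");
    }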
Yup, code points are the way to go. The LSP is a great example of what happens otherwise [1]. The LSP authors made the controversial decision to sync documents using line/columns and UTF-16 code unit offsets rather than relying on code points. I fear they dug the hole deeper by expanding the set of supported encodings to UTF-8 and 32 rather than standardizing on code points. I can only hope this is treated as a stop gap solution and a proper switch to code points happens in the next major revision.
It's not immediately obvious, but line/columns are not a good synchronization method either. Unicode has their own definition of what constitutes a line terminator [2], but different programming languages and environments might have their own definition.
> 2) They make it possible to express invalid data. Inserting in the middle of a codepoint is an invalid operation.
It’s already possible to express invalid data, because the offset may be out of bounds. Please insert 'a' at offset 100 in the string "hello".
This leaves only one problem for code unit offsets.
But as for scalar value offsets, you’ve ignored the fact that it has precisely the same problem if you use UTF-8 or UTF-16.
If you go all in on UTF-8 code units, then UTF-16 and UTF-32 systems are slower, but UTF-8 systems are fast.
If you go all in on UTF-16 code units, then UTF-8 and UTF-32 systems are slower, but UTF-16 systems are fast.
If you go all in on scalar values, then both UTF-8 and UTF-16 systems are slower, and only UTF-32 systems (which are very few, Python is the only one I can think of offhand) will be fast.
We agree that extended grapheme clusters are unsuitable for persistent indexing, but you have failed to sell the virtues of scalar values. They just ensure things are slower for everyone instead of half the people. (Possibly a shade easier in some systems, but probably not particularly, and certainly slower.)
(I’ve used more precise terminology than the parent comment: “scalar values” instead of “code points” just for clarity where UTF-16 could be involved, and “UTF-8 code units” instead of UTF-8 bytes and “UTF-16 code units” instead of UCS-2 thingies.)
> It’s already possible to express invalid data, because the offset may be out of bounds. Please insert 'a' at offset 100 in the string "hello".
Sure; but having only one way data can be invalid is strictly better than having two ways that data can be invalid. Checking for string length is needed in every case. Checking if a given byte offset is valid is only needed when you use byte offsets.
Speaking of checking byte offsets - that can be a very complex check to implement efficiently! For example, if you're using javascript, how do you check if a given UTF8 byte offset in a string is valid? Can you do that faster than O(n) time? Oh, and it gets worse. During collaborative editing sessions, its often the case that you have local modifications that the remote peer didn't know about when they created their change. How do you validate an offset from a remote peer without even having a local copy of the document at that point in time?
This is a really complex problem.
I agree with your later point though: converting between scalar-value counts and string indexes requires a conversion step in every language I use. In the systems I work with, it helps that I tend to use a custom data structure for strings anyway. This is needed because splicing strings is slow - usually O(n). And that is bad when I need to replay a lot of changes. I use libraries like jumprope[1], which supports offset conversion in O(log n) time. (Fun fact: even with all of its versatility, jumprope is still only 1/3rd the size of rust's unicode-segmentation crate!)
> you have failed to sell the virtues of scalar values
Maybe my strongest justification is simplicity: The actual semantic data of a text document is fundamentally a list of (indivisible) scalar unicode values. We can tell because inserting in the middle of a multi-byte character would be an invalid / illegal operation.
Given that, the most obvious index into a list of scalar values is an integer pointing to an index in that list. If we model a text document like that, we can reuse the same CRDT / OT algorithm for other sorts of lists too. The algorithm I have for merging changes in a list/text document for diamond types works without reference to the actual merged content (!!!). We just need to store the merged positions & lengths of old operations, and that data is weirdly tiny on disk.
I could model the document as a list of UTF8 bytes. But any interleaving problems[2] in the CRDT may result in illegal documents. Using scalar unicode values instead gives the system a beautiful, clean semantic separation between the list's semantics and the (local) resulting string - which I can store in whatever form makes the most sense.
All that said, it is definitely possible to make collaborative text editing system on top of UTF8 byte offsets / UTF16 code units. But having been in the collaborative editing world for over a decade, my professional opinion is that doing so is a trap for young players. The benefit is outweighed by the cost.
> I need to express an insert into a text document - eg, "Insert 'a' at position X". The obvious question that comes up is - how should we define "X"?
Perhaps using a cursor/rider/however-you-call-it into whatever data structure you're using for editable sequences (since there's a bunch of them) is the preferred solution here? I'm hardly an editor expert but I thought that's how this has been being done for decades (and for some possibly useful but really complicated text representations, numerical indices for positions of operations wouldn't really work very well anyway).
I hear what you're saying, but text editing operations need to be sent over the network and saved on disk. We need a stable encoding, not just an in-memory reference.
Do you end up having a system that basically just observes the state of the string (in code points) before and after a user operation occurs? So for example, say a document consists of code points (made up, obviously):
1 2 3 4 5 6
This is represented in whatever platform encoding is used, but you can convert to UCS-4 for (space-wasting) simplicity, or whatever makes the most sense. Then the user makes an edit, and suddenly the document looks like:
1 2 3 8 9 4 5 6
So you can say "ok, the user inserted something into the middle of the string, and we trust that the platform knows correctly that there is some sort of boundary between 3 and 4 such that an insert there is legal, and that they put something in between there, and that thing is '8 9', which might be two graphemes, or one, who knows, doesn't matter."
Is that what you're talking about here?
I guess I still don't completely understand how this works in a cross-platform way, though. If grapheme clustering rules change all the time, and the above was done on Windows 11, and then someone views it on an 8-year-old Android phone, can we have confidence that it will be rendered the same? Is it even possible to make that work properly?
> Do you end up having a system that basically just observes the state of the string (in code points) before and after a user operation occurs?
You're describing diffing, which is one way (probably the worst way) for my code to find out what edits a user made. Good editing libraries will fire editing events instead, for each change made to the document. So in this case, I'd get an event handler firing with an insert event at position 3 & with content "8 9". But yeah, you still sometimes need to reverse engineer edits by looking at the before / after strings and diffing them.
We need to send that insert event over the network to other peers, and save it in a database. The question we're debating is: How should we express the edit position? Should we count the number of UTF8 bytes before the insert position ("1 2 3")? Or the number of unicode scalars? Or the number of grapheme clusters?
Eg, should a Ukrainian flag have a length of 8 (utf8 bytes), a length of 2 (unicode scalars) or a length of 1 (grapheme cluster)?
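Concretely, all three answers are easy to observe (grapheme count again via the third-party unicode-segmentation crate):

    // Third-party crate, assumed in Cargo.toml: unicode-segmentation = "1"
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let flag = "\u{1F1FA}\u{1F1E6}"; // 🇺🇦 = two regional indicator symbols
        assert_eq!(flag.len(), 8);                   // UTF-8 bytes
        assert_eq!(flag.chars().count(), 2);         // Unicode scalar values
        assert_eq!(flag.graphemes(true).count(), 1); // extended grapheme cluster
    }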
> If grapheme clustering rules change all the time, and the above was done on Windows 11, and then someone views it on an 8-year-old Android phone, can we have confidence that it will be rendered the same? Is it even possible to make that work properly?
Nope! New unicode characters (and new grapheme clusters) are being added all the time. For example, the polar bear emoji is made up of 3 unicode scalar values - (Bear) + (Zero-width joiner) + (Snowflake). If you look at this page[1] on an 8 year old android phone, you might just see BearSnowflake instead of the emoji.
The collaborative editing system can't fix the polar bear's rendering. But I want everything else to still work. I still want you to be able to delete those characters if you want, or edit anywhere else in the document correctly. I want my collaborative editing server to work with new emoji that haven't been released yet.
If we count characters using grapheme clusters, imagine a document which starts with a polar bear. An old android phone might count that as having a length of 3 (since it doesn't know to join them all together). All the edits made on that android device will look wonky for everyone else, since the editing positions will appear in the wrong locations. If we count characters using the other methods, we don't have this problem.
You have a point in that codepoints aren't useless. But I'm in agreement with the author that they are not the primary concern, and so they shouldn't be the default way to access a string.
Instead, there should be enough flexibility to use them, bytes, or whatever representation you need. But the default interface should still pretty much be grapheme based.
Yeah I agree that it should be possible to use any of the indexing systems depending on your needs. My criticism of grapheme cluster based indexing being the "default" is that it only really makes sense for UI code. In other situations (eg, passing JSON text over the network) it adds a lot of complexity and code size, with no benefit.
If I'm writing a regex parsing system, or a JSON parser, or an event logger, or a compiler, I'll use a lot of strings. But my code probably doesn't care very much about grapheme cluster boundaries. Handling grapheme cluster boundaries is very complex code. (Rust's unicode-segmentation crate is just shy of 100kb!). When parsing grapheme boundaries isn't needed, all that code is pure overhead.
This is just an FYI. I don't mean to say much to your overall point, although, as someone else who has spent a lot of time doing Unicode-y things, I do tend to agree with you. I had a very similar discussion a bit ago.[1] (A quick glance at that discussion might suggest that I disagree with you, but see the last paragraph of my big comment in that thread.)
Putting that aside, at least with respect to grapheme segmentation, it might be a little simpler than you think. But maybe only a little. The unicode-segmentation crate also does word segmentation, which is quite a bit more complicated than grapheme segmentation. For example, you can write a regex to parse graphemes without too much fuss[2]. (Compare that with the word segmentation regex, much to my chagrin.[3]) Once you build the regex, actually using it is basically as simple as running the regex.[4]
Sadly, not all regex engines will be able to parse that regex due to its use of somewhat obscure Unicode properties. But the Rust regex crate can. :-)
And of course, this somewhat shifts code size to heap size. So there's that too. But bottom line is, if you have a nice regex engine available to you, you can whip up a grapheme segmenter pretty quickly. And some regex engines even have grapheme segmentation built in via \X.
Well, it's for human language, not only UI. If you are indexing it for search, pattern matching, or validating, you will want graphemes.
But yes, computer languages are normally designed so you can parse them at the byte view. Except for a small, but still surprisingly common, set of them that are tokenized at the byte view, but parsed after some human language dependent transformation that happens over graphemes. (And those do usually have problems with internationalization.)
If your parsing code cares about grapheme cluster boundaries, it'll behave differently on the same string when the unicode spec changes. So it may work differently on different operating systems, or with different versions of your unicode libraries.
And that means the output of your validation methods or tokenizers will randomly change!
I'm happy for the rendering of strings to change over time. But in my business logic, compilers and database code: That sounds like an unnecessary source of bugs.
Unpopular opinion: written languages need to modernize. The printing press and typewriter forced the Romantic language from a format optimized for hand writing (cursive) into a format optimized for the modern world (print).
I think other languages would benefit from having a "print" format.
Block-letter Roman-alphabet writing existed long before the printing press: consider e.g. ancient Roman monuments. Arguably, cursive was the anomaly: a form of writing optimized for a device capable of relatively continuous writing, in contrast to stone carving, clay tablets, brush writing, etc.
Cursive didn’t go away with the printing press. Non-romantic languages already “print” just fine too. If ASCII is for some reason needed, most non-romantic languages have a system for romanization. Cursive versions of romantic languages exist in unicode, they’re just neither popular, nor as well designed as languages with demand for script versions.
In my opinion, if any change was going to happen, switching to something similar to the International Phonetic Alphabet would make the most sense. Then any languages would be readable; already easy to encode any language in it, just not the norm.
Hangul covers Korean language, IPA covers all languages. For example, some languages use clicks; though to be fair, IPA only currently covers 5 of the 6 clicks. Again, point is not IPA is the solution, but that having a common notation would make learning languages much easier and learning only one written notation would be needed.
I just wanted to use a concrete example of a featural writing system. It would of course not be Hangul if it had to accommodate all phonemes. The other direction to take would be to create a sophisticated (probably ML) model that could enable certain features to be optimized, like speed of syllable recognition. If we are constructing a modern writing system then that would definitely be a good approach.
It is not clear what this argument gets at, because:
• Other languages also have had a printing press for centuries, and typewriters for decades. (Not just "Romantic" languages, by which I assume you mean the Romance languages.)
• Cursive can be accommodated on the printing press too, so it has nothing to do with this issue. (In fact, some of the earliest 16th-century type founders like Garamond and Granjon also made cursive types.)
(In fact, other writing systems have also already been influenced and "modernized" because of the printing press. Those changes have little to do with Unicode or codepoints.)
Written languages had cursive and print forms from the very beginning. Egyptian had hieroglyphs (carvings, monumental inscriptions, painting on walls and objects) and Hieratic (ink on papyrus) emerge at essentially the same time. Cuneiform had lapidary (carved in stone) and cursive (impressed in clay) forms.
Python3 is a good example of what happens when you represent strings as an indexable sequence of code points. I always found that a bit ironic, in that one of the major justifications for Python3 was to improve Unicode handling. In the end it failed to do that because of this.
Worse still, Python strings aren’t even valid Unicode, because they’re sequences of code points rather than scalar values as they should be; and so even UTF–8‐encoding a str is fallible:
>>> '\udead'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udead' in position 0: surrogates not allowed
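(For contrast with the above: Rust's char is defined as a Unicode scalar value, so the equivalent state can't even be constructed there; a quick check:)

    fn main() {
        // A surrogate is a code point but not a scalar value, so Rust's
        // char (and therefore str/String) can never hold one.
        assert!(char::from_u32(0xD800).is_none());
        assert!(char::from_u32(0x1F308).is_some()); // U+1F308 is fine
    }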
Both you and parent are correct. (It should have been valid UTF-8, in practice it often is treated as such anyway. It should also never have existed in the first place.)
No, it should not have been valid UTF-8, and nothing that validates UTF-8 (which is the considerable majority of things these days, though certainly not all) will accept it.
UTF-16 fundamentally can’t represent surrogates: 0x0000 to 0xd7ff represent U+0000 to U+D7FF, and pairs of 0xd800–0xdbff (1024 code units) and 0xdc00–0xdfff (1024 code units) represent U+10000 to U+10FFFF (1024² code points), and there’s nothing left that could represent U+D800–U+DFFF.
With UTF-16, there were two possible choices: either make UTF-16 an incomplete encoding, unable to represent some Unicode text, or separate surrogates out as reserved for UTF-16, not encodable and not permitted in Unicode text. Neither solution is good (because too much software doesn’t validate Unicode, and too much software treats UTF-16 more like UCS-2), and the correct answer would have been to throw away UCS-2 as a failed experiment and never invent UTF-16, but here we are now. They decided in favour of consistency between Unicode Transformation Formats, which is fairly clearly the better of the first two choices.
> No, it should not have been valid UTF-8, and nothing that validates UTF-8 (which is the considerable majority of things these days, though certainly not all) will accept it.
Yes it should, as those code points should never have been reserved. UTF-8 should only concern itself with encoding arbitrary 24-bit integers. And that those characters are reserved is also useful in practice because lots of things are encoded in UCS2 uhhm I mean "UTF-16" where unpaired surrogates are the reality. Hence: https://simonsapin.github.io/wtf-8/
That UTF-16 can't represent those code points is UTF-16's problem and should have been solved by those clinging to their UCS2 codebases instead of being enshrined in Unicode. It would have certainly been possible to make a 16-bit Unicode encoding that is as much UCS2-compatible as UTF-16 is but can represent all code points.
Unicode Transformation Formats are designed to be equivalent in what they can encode. This is simple fact, and sane. Yes, I hate how UTF-16 ruined Unicode, and the typical absence of validation of UTF-16 is a problem, but given the abomination that is UTF-16, surrogates not being part of valid Unicode strings is certainly a better course of action than UTF-16 being unable to represent certain Unicode strings.
Just about everything should validate UTF-8, because if you don’t, various invariants that you depend on will break. As a simple example, if you operate on unvalidated UTF-8, then all kinds of operations that care about code points will have to check that they’re in-bounds at every single code unit, because to do otherwise would be memory-unsafe, risking crashing or possibly worse. It’s faster and much more peaceful if you validate UTF-8 in all strings at time of ingress, once for all. Given Rust’s preferences for correctness and performance, it is very clearly in the right in validating UTF-8 in all strings.
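That ingress-time validation is a single call in Rust, e.g.:

    fn main() {
        // 0xED 0xA0 0x80 is how a surrogate (U+D800) would be encoded; a
        // UTF-8 validator rejects it at ingress, so downstream code can
        // assume well-formed UTF-8 without per-byte bounds paranoia.
        let bytes: &[u8] = &[b'h', b'i', 0xED, 0xA0, 0x80];
        assert!(std::str::from_utf8(bytes).is_err());
    }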
> surrogates not being part of valid Unicode strings is certainly a better course of action than UTF-16 being unable to represent certain Unicode strings.
I disagree. Better would be to consider UTF-16 a legacy encoding that cannot handle all code points and aggressively replace it with a better one (UTF-8 in pretty much all cases), just like we did with UCS-2 and ASCII before it.
> It’s faster and much more peaceful if you validate UTF-8 in all strings at time of ingress, once for all.
If you don't want bounds checks for every byte then all you need is to add a buffer zone to the end of the string anyway (or parse the last < 4 bytes separately), which is exactly what vectorized UTF-8 parsers do already. But many operations on UTF-8 strings don't need to parse them anyway, and those that do probably need to do much heavier unicode processing such as normalization or case folding anyway.
Being able to preserve all input is much more important than any theoretical concerns about additional bound checking performance.
> surrogates not being part of valid Unicode strings is certainly a better course of action than UTF-16 being unable to represent certain Unicode strings.
I'm okay with people not being able to paste emojis into the Windows port of an application; get yourself a better OS whose wide characters are 32 bits.
The Basic Multilingual Plane is good enough for business purposes in the developed world.
Surrogates are just some nonsense for Windows and some languages that start with J.
> UTF-8 should only concern itself with encoding arbitrary 24-bit integers
I mostly agree, but situations in which there are two or more ways to encode the same integer should use a canonical representation; so that is to say, it is well and good that overlong forms are banned in UTF-8.
So there is the rub; those pesky surrogate pairs create the same problem: a pair of codes is used to produce a single character. When you're converting to UTF-8, a valid surrogate pair which encodes a character should just produce that character, and not the individual codes. The separately encoded codes would be kind of like an overlong form.
If a surrogate occurs by itself, though, and not part of a pair, that means the original string is not valid Unicode. That is the case here with the "\udead" string.
The choices are: encode that U+DEAD code point into UTF-8, thereby punting the problem to whoever/whatever decodes it, or else flag it here.
Overlong forms would break a lot of useful properties of UTF-8 that rely on the encoding being unique, most important of all perhaps being compatibility with ASCII tools.
Surrogates however have only one encoding (except for overlong forms) so this is not the same problem at all. You could say there is an ambiguity with decoding matching surrogate pairs in WTF-16, but this is a WTF-16 issue and would not exist if "UTF-16" had not been built by carving out unicode code points.
> The choices are: encode that U+DEAD code point into UTF-8, thereby punting the problem to whoever/whatever decodes it, or else flag it here.
And encoding it in WTF-8 is the better choice because it retains more information, allowing you to perfectly round trip back to the original WTF-16 string. But either way, handling unpaired surrogates is not really a concern of the UTF-8/WTF-8 encoding (they are just code points as far as it is concerned) but of the UTF-16/WTF-16 decoding.
They didn't get it perfect, but, compared to what it was like to deal with Unicode in Python 2, I'd say they absolutely succeeded at improving the experience of working with Unicode text.
In my uninformed opinion they should have simply added the u"..." and b"..." formats and forbidden mixing them, or using a u value with a b operation and vice versa.
python 3 could have simply changed the default from b to u for untagged strings.
$ python2
>>> "abc" + b"123"
'abc123'
>>> "abc" + u"123"
u'abc123'
$ python3
>>> "abc" + b"123"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str
>>> "abc" + u"123"
'abc123'
>>> type(u"123")
<class 'str'>
P.S. Your use of the word "simply" there is a reminder of how easy it is to underestimate the complexity of handling encoding properly, since it took years (a decade?) to port libraries from Py2 to Py3, since it doesn't just affect string literals, but general IO when reading from files and sockets.
This is how Python 2 did it, only with the difference that you didn't need the b prefix for ascii/byte strings.
The problem was that this made a muddle of all the string handling functions, which needed to be prepared to handle both types. Including ones you write yourself. And, Python being a dynamic language, getting this right was fiddly and error prone. "Forbid mixing the two" was a workable, if ergonomically annoying, approach in statically typed languages that predated and then subsequently had to adopt Unicode. For Python, though, it was an enormous minefield of runtime errors.
The author must have missed the part of the Unicode spec where it ascribes dozens of properties to code points: it assigns them a general category (e.g. control character, lower case letter, punctuation, and more), a case mapping (if applicable), a numeric value (if applicable) and the kind of number it is (ordinal or otherwise), the script they are typically written in, and many more.
I think the confusion with Unicode is that programmers apply their non-technical preconception of what a "character" is rather than understanding the Unicode definition. Unicode defines a character as a representation of something - maybe it represents a line break (control character), a letter, an ideograph, a combining mark, or something else. A "code point" is just a number that a character is assigned for numerical representation in memory. A "grapheme cluster" is what a user perceives as a character - it's how non-programmers see text. What needs to happen is programmers need to hammer in their heads the Unicode definition of "character" just like they relearned to count from zero.
You’ve missed the point of the article, which is that optimising for code point indexing is silly. The author is well aware of all the properties you speak of, but they’re seldom things that end-developers should be touching directly, and none of them benefit from indexing by code point offset.
> UTF-16 is mostly a “worst of both worlds” compromise at this point, and the main programming language I can think of that uses it (and exposes it in this form) is Javascript, and that too in a broken way.
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Like all UCS-2 users, it is using UTF-16 today. Strings are sequences of UTF-16 code units. (If you don't believe me, you can verify this by seeing that emoji work.)
The link uses the abhorrent term "characters" far too frequently to be saying anything meaningful.
I never liked Mathias’s description there. A better description is that JavaScript uses unvalidated UTF-16, with UTF-16 code unit indexing and access in most places. (Notwithstanding this, new functionality in the language tends to use Unicode scalar values where possible, e.g. string iterator decodes valid surrogate pairs.)
> Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation.
It isn't possible to determine where to slice a string at grapheme cluster boundaries without indexing into the code points to find their combining classes.
The author's point here is self-contradictory, because if string slicing requires anything more than slicing at code point boundaries, you're going to need fast indexing to code points.
What the author appears to want to argue is that the trade-offs are worth it. However, it's possible to make that argument without the bad faith claim that there's no value to the other perspective.
I believe I understand his point, I just think he overstated his case. Consider the two forms of the following argument against UTF-32/UCS-4:
A: Although reasoning about code points is more complex when dealing with UTF-8, the complexity can usually be remedied by an inner loop to iterate between code units, which is very simple in UTF-8, as (ch & 0b11000000) != 0b10000000 being true will indicate the first byte of a code point. For this reason, the downside is limited and the upside - not having to convert to/from UTF-8, the most common encoding for Unicode text at rest - is significant.
B: There is no downside to working in UTF-8. If you find it easier to reason about UTF-32/UCS-4, you're just wrong. There's no virtue whatsoever in doing it that way.
Manish seems to be arguing B, and I simply can't agree. A strikes me as true enough without having to exaggerate like that.
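A sketch of the inner loop argument A is describing, over raw bytes (the helper name is mine):

    // Hypothetical helper: step a byte index back to the start of the code
    // point containing it. A byte is a UTF-8 continuation byte iff its top
    // two bits are 10.
    fn align_to_code_point(bytes: &[u8], mut i: usize) -> usize {
        while i > 0 && (bytes[i] & 0b1100_0000) == 0b1000_0000 {
            i -= 1;
        }
        i
    }

    fn main() {
        let s = "héllo";
        // Byte 2 is the second byte of 'é'; the scan lands on byte 1.
        let i = align_to_code_point(s.as_bytes(), 2);
        assert_eq!(i, 1);
        assert!(s.is_char_boundary(i));
    }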
> The main time you want to be able to index by code point is if you’re implementing algorithms defined in the unicode spec that operate on unicode strings (casefolding, segmentation, NFD/NFC).
> But for application logic, dealing with code points doesn’t really make sense.
> if string slicing requires anything more than slicing at code point boundaries, you're going to need fast indexing to code points.
Why? I can safely slice anywhere in UTF-8 code units, and align to the next (or previous) grapheme cluster just as easily. And this is actually useful if I have some storage limitations, unlike codepoints.
Yes, it’s definitely annoying that you can’t get the grapheme clusters of a string without going outside the standard library. That’s a very basic need when you are dealing with user input and you care about the individual “characters” that have been sent in.
You can't get grapheme clusters, but you can get actual codepoints by spreading a string into an array, e.g. [..."abc\uD83C\uDF08"] is ["a", "b", "c", "\u{1F308}"]
If you carefully manipulate the array to produce some different array, and then convert back to string/text, you can preserve the grapheme clusters just fine.
Just like the way string manipulating routines that don't know anything about Unicode can work with UTF-8 without breaking the text, simply by not truncating or separating the data in the middle of a UTF-8 sequence.
When I first read the headline, I thought "code point" referred to a specific location in a program's text. (E.g., file/line#/col#, or memory address.)
It was fun watching my brain try to make sense of that meaning.
Analyzing code in terms of "function points" was something done back in the day. Maybe it still is; in my mind it's part of the same general dustbin as UML et al.
Howso, given that function point analysis comes from requirements and doesn't even need code to exist? (Are you sure you know what function point analysis is? It sounds a lot like "sequence points" or "branching points", but in reality it's more like "story points"...)
> However, you don’t need code point indexing here, byte indexing works fine! UTF8 is designed so that you can check if you’re on a code point boundary even if you just byte-index directly.
Yes, UTF-8 is self-resynchronizing in either direction. I.e., if you pick a random byte index into a UTF-8 string, you can check if that byte's value is the start of a codepoint, and if not then you can scan backwards (or forward) to find the start of the current (or next) codepoint. Do be careful not to overrun the bounds of the string, if you're writing C code anyways.
Do note that start-of-codepoint is not the same thing as start-of-grapheme-cluster or start-of-character.
Also, TFA doesn't touch at all on forms and normalization. And I think TFA is confused about "meaning". Unicode codepoints very much have meaning, and that had better not change. TFA seems to be mostly about indexing into strings, and that content is fine.
> One very common misconception I’ve seen is that code points have cross-language intrinsic meaning.
Well, maybe that depends on what the meaning of "meaning" is.
Unicode codepoints have normative meanings -- that is, meanings assigned by the Unicode Consortium. Those "meanings" are embodied in a) their names and descriptions, b) their normative glyphs. Some codepoints are combining codepoints, and their meaning lies in the changes to the base codepoint's glyph when applied -- this is certainly a kind of "meaning".
But of course people can use codepoints (really, characters) in ways which do not comport to the meanings assigned by the UC. That's fine, of course. The real meanings of the words we write and how we use them, and the glyphs they are composed of, vary over time because human language evolves.
But TFA writes "cross-language intrinsic meaning". That's a more specific claim that is more likely to be true.
It's certainly true that Indic language glyphs have no meaning to me when mixed with Latin scripts, since I know nothing about them, though I can look up their meaning, and I can learn about them. And it's also true that confusable codepoint assignments (e.g., some Greek characters look just like Latin characters, and vice-versa) may not, for the reader, have the UC's intended meaning, since to the reader the only meaning will only be the rendered glyph's rather than the codepoint's! After all, being confusable, the reader isn't likely to notice that some glyph is not Latin but Greek (or whatever).
But if some text does not mix confusable scripts with the intent of creating confusion, and if it is clear to the reader what is intended, then a reader familiar with the scripts (and languages) being used in the text will be able to discern intended meaning, and the UC-assigned meanings of the codepoints used will be... meaningful even though the text is mixed-script text.
So, yes, with caveats, Unicode codepoints do have "cross-language intrinsic meaning".
IMO, it would have been better for TFA not to mix the two things, codepoint meaning and string indexing issues.
Also, string indexing is just not needed. In general you have to parse text from first code unit to last. You might tokenize text, and then indexing into a sequence of tokens might have meaning / be useful, but indexing into UTF-8/16/32 by code unit is not really useful.
32 bits is not sufficient, emoji and combining characters go further. Even with UTF-32 you still can't treat "one 32 bit sequence" as "one character/grapheme/etc". It solves nothing, but it takes more space in the vast majority of cases.