Both you and parent are correct. (It should have been valid UTF-8, and in practice it often is treated as such anyway. It should also never have existed in the first place.)
No, it should not have been valid UTF-8, and nothing that validates UTF-8 (which is the considerable majority of things these days, though certainly not all) will accept it.
UTF-16 fundamentally can’t represent surrogates: code units 0x0000–0xd7ff and 0xe000–0xffff represent U+0000–U+D7FF and U+E000–U+FFFF directly, and pairs of 0xd800–0xdbff (1024 code units) and 0xdc00–0xdfff (1024 code units) represent U+10000 to U+10FFFF (1024² code points), so there’s nothing left that could represent U+D800–U+DFFF.
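Concretely, a rough Rust sketch of that mapping (illustrative only; std’s `char::encode_utf16` is the real thing):

    /// Minimal sketch of UTF-16 encoding for a single code point.
    fn encode_utf16(cp: u32) -> Vec<u16> {
        match cp {
            // BMP code points map directly to one code unit...
            0x0000..=0xD7FF | 0xE000..=0xFFFF => vec![cp as u16],
            // ...U+10000..U+10FFFF consume *all* 1024 × 1024 surrogate pairs...
            0x1_0000..=0x10_FFFF => {
                let v = cp - 0x1_0000;
                vec![
                    0xD800 + (v >> 10) as u16,   // high surrogate: 1024 values
                    0xDC00 + (v & 0x3FF) as u16, // low surrogate: 1024 values
                ]
            }
            // ...which leaves no code unit sequence free for U+D800..U+DFFF themselves.
            0xD800..=0xDFFF => panic!("surrogate code points are unrepresentable in UTF-16"),
            _ => panic!("beyond U+10FFFF"),
        }
    }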
With UTF-16, there were two possible choices: either make UTF-16 an incomplete encoding, unable to represent some Unicode text, or separate surrogates out as reserved for UTF-16, not encodable and not permitted in Unicode text. Neither solution is good (because too much software doesn’t validate Unicode, and too much software treats UTF-16 more like UCS-2), and the correct answer would have been to throw away UCS-2 as a failed experiment and never invent UTF-16, but here we are now. They decided in favour of consistency between Unicode Transformation Formats, which is fairly clearly the better of the first two choices.
> No, it should not have been valid UTF-8, and nothing that validates UTF-8 (which is the considerable majority of things these days, though certainly not all) will accept it.
Yes, it should, as those code points should never have been reserved. UTF-8 should only concern itself with encoding arbitrary 24-bit integers. And being able to encode those reserved code points is also useful in practice, because lots of things are encoded in UCS-2, uhhm, I mean "UTF-16", where unpaired surrogates are the reality. Hence: https://simonsapin.github.io/wtf-8/
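For illustration, a generalized encoder is just the ordinary UTF-8 bit pattern applied to any code point, surrogate or not; strict UTF-8 validators reject exactly the ED A0 80 .. ED BF BF sequences this produces for surrogates. (Rough sketch; the WTF-8 spec linked above is the real reference, and it additionally forbids encoding a *paired* surrogate this way.)

    /// Encode any code point <= U+10FFFF with the plain UTF-8 bit pattern,
    /// *including* surrogates -- which is what WTF-8 permits per code point
    /// and what strict UTF-8 forbids. Sketch only.
    fn encode_wtf8(cp: u32, out: &mut Vec<u8>) {
        debug_assert!(cp <= 0x10_FFFF);
        match cp {
            0x00..=0x7F => out.push(cp as u8),
            0x80..=0x7FF => out.extend([0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8]),
            0x800..=0xFFFF => out.extend([
                0xE0 | (cp >> 12) as u8,
                0x80 | ((cp >> 6) & 0x3F) as u8,
                0x80 | (cp & 0x3F) as u8,
            ]),
            _ => out.extend([
                0xF0 | (cp >> 18) as u8,
                0x80 | ((cp >> 12) & 0x3F) as u8,
                0x80 | ((cp >> 6) & 0x3F) as u8,
                0x80 | (cp & 0x3F) as u8,
            ]),
        }
    }

    // A lone high surrogate encodes as ED A0 80 -- valid WTF-8, invalid UTF-8:
    // let mut buf = Vec::new();
    // encode_wtf8(0xD800, &mut buf);
    // assert_eq!(buf, [0xED, 0xA0, 0x80]);
    // assert!(std::str::from_utf8(&buf).is_err());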
That UTF-16 can't represent those code points is UTF-16's problem and should have been solved by those clinging to their UCS-2 codebases instead of being enshrined in Unicode. It would certainly have been possible to make a 16-bit Unicode encoding that is as UCS-2-compatible as UTF-16 is but can represent all code points.
Unicode Transformation Formats are designed to be equivalent in what they can encode. This is simple fact, and sane. Yes, I hate how UTF-16 ruined Unicode, and the typical absence of validation of UTF-16 is a problem, but given the abomination that is UTF-16, surrogates not being part of valid Unicode strings is certainly a better course of action than UTF-16 being unable to represent certain Unicode strings.
Just about everything should validate UTF-8, because if you don’t, various invariants that you depend on will break. As a simple example, if you operate on unvalidated UTF-8, then all kinds of operations that care about code points will have to check that they’re in-bounds at every single code unit, because to do otherwise would be memory-unsafe, risking crashing or possibly worse. It’s faster and much more peaceful if you validate UTF-8 in all strings at time of ingress, once and for all. Given Rust’s preferences for correctness and performance, it is very clearly in the right in validating UTF-8 in all strings.
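In Rust terms, a minimal sketch of what “validate at ingress” buys you (std’s actual validator is far more optimized, of course):

    use std::str;

    // Validate once, at the boundary where bytes enter the program.
    fn ingest(bytes: &[u8]) -> Result<&str, str::Utf8Error> {
        str::from_utf8(bytes) // single validation pass
    }

    fn main() {
        let good = ingest(b"caf\xC3\xA9").unwrap();
        // Downstream code can iterate, slice at char boundaries, etc.
        // without re-validating or bounds-checking every code unit.
        assert_eq!(good.chars().count(), 4);

        // An unpaired-surrogate encoding (what WTF-8 allows) is rejected here:
        assert!(ingest(&[0xED, 0xA0, 0x80]).is_err());
    }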
> surrogates not being part of valid Unicode strings is certainly a better course of action than UTF-16 being unable to represent certain Unicode strings.
I disagree. Better would be to consider UTF-16 a legacy encoding that cannot handle all code points and aggressively replace it with a better one (UTF-8 in pretty much all cases), just like we did with UCS-2 and ASCII before it.
> It’s faster and much more peaceful if you validate UTF-8 in all strings at time of ingress, once for all.
If you don't want bounds checks for every byte, all you need to do is add a buffer zone at the end of the string (or parse the last < 4 bytes separately), which is exactly what vectorized UTF-8 parsers already do. But many operations on UTF-8 strings don't need to parse them at all, and those that do probably need much heavier Unicode processing such as normalization or case folding anyway.
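Rough sketch of the "parse the last < 4 bytes separately" idea, here on already-validated input (the same tail trick applies inside a validator; real vectorized parsers do this with SIMD blocks):

    /// Decode already-validated UTF-8 into scalar values. While at least
    /// 4 bytes remain, the continuation bytes at i+1..i+3 are always in
    /// bounds, so no per-byte checks are needed; only the final < 4 bytes
    /// take the careful path.
    fn code_points(s: &str) -> Vec<u32> {
        let b = s.as_bytes();
        let mut out = Vec::new();
        let mut i = 0;
        // Hot path: i + 4 <= b.len().
        while i + 4 <= b.len() {
            let (cp, len) = match b[i] {
                0x00..=0x7F => (b[i] as u32, 1),
                0xC0..=0xDF => (((b[i] as u32 & 0x1F) << 6) | (b[i + 1] as u32 & 0x3F), 2),
                0xE0..=0xEF => (
                    ((b[i] as u32 & 0x0F) << 12)
                        | ((b[i + 1] as u32 & 0x3F) << 6)
                        | (b[i + 2] as u32 & 0x3F),
                    3,
                ),
                _ => (
                    ((b[i] as u32 & 0x07) << 18)
                        | ((b[i + 1] as u32 & 0x3F) << 12)
                        | ((b[i + 2] as u32 & 0x3F) << 6)
                        | (b[i + 3] as u32 & 0x3F),
                    4,
                ),
            };
            out.push(cp);
            i += len;
        }
        // Careful tail: fewer than 4 bytes left, let the checked iterator handle it.
        out.extend(s[i..].chars().map(|c| c as u32));
        out
    }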
Being able to preserve all input is much more important than any theoretical concerns about additional bounds-checking performance.
> surrogates not being part of valid Unicode strings is certainly a better course of action than UTF-16 being unable to represent certain Unicode strings.
I'm okay with people not being able to paste emojis into the Windows port of an application; get yourself a better OS whose wide characters are 32 bits.
The Basic Multilingual Plane is good enough for business purposes in the developed world.
Surrogates are just some nonsense for Windows and some languages that start with J.
> UTF-8 should only concern itself with encoding arbitrary 24-bit integers
I mostly agree, but situations in which there are two or more ways to encode the same integer should use a canonical representation; which is to say, it is well and good that overlong forms are banned in UTF-8.
So there is the rub; those pesky surrogate pairs create the same problem: a pair of codes is used to produce a single character. When you're converting to UTF-8, a valid surrogate pair which encodes a character should just produce that character, not the individual codes. The separate characters are kind of like an overlong form.
If a surrogate occurs by itself, though, and not part of a pair, that means the original string is not valid Unicode. That is the case here with the "\udead" string.
The choices are: encode that U+DEAD code point into UTF-8, thereby punting the problem to whoever/whatever decodes it, or else flag it here.
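For concreteness, a sketch of the combining step: a valid pair must collapse to one code point when converting to UTF-8; emitting two 3-byte sequences for the halves instead would be the CESU-8-style, overlong-ish form.

    /// Combine a valid surrogate pair into the scalar value it encodes.
    /// UTF-8 should emit that one code point (4 bytes), not two 3-byte
    /// sequences for the halves.
    fn combine_pair(high: u16, low: u16) -> Option<u32> {
        if (0xD800..=0xDBFF).contains(&high) && (0xDC00..=0xDFFF).contains(&low) {
            Some(0x1_0000 + (((high as u32 - 0xD800) << 10) | (low as u32 - 0xDC00)))
        } else {
            None // not a valid pair: a lone surrogate, i.e. not valid Unicode text
        }
    }

    // e.g. the pair D83D DE00 combines to U+1F600, which UTF-8 encodes as
    // F0 9F 98 80 -- one four-byte sequence, not ED A0 BD ED B8 80.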
Overlong forms would break a lot of useful properties of UTF-8 that rely on the encoding being unique, most important of all perhaps being compatibility with ASCII tools.
Surrogates, however, have only one encoding (overlong forms aside), so this is not the same problem at all. You could say there is an ambiguity when decoding matched surrogate pairs in WTF-16, but that is a WTF-16 issue, and it would not exist if "UTF-16" had not been built by carving reserved code points out of Unicode.
> The choices are: encode that U+DEAD code point into UTF-8, thereby punting the problem to whoever/whatever decodes it, or else flag it here.
And encoding it in WTF-8 is the better choice because it retains more information, allowing you to round-trip perfectly back to the original WTF-16 string. But either way, handling unpaired surrogates is not really a concern of the UTF-8/WTF-8 encoding (they are just code points as far as it is concerned) but of the UTF-16/WTF-16 decoding.
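For what it's worth, Rust's std already exposes the decoding half of this: `char::decode_utf16` hands you the lone surrogate in its error value, so a WTF-8 writer can simply keep it instead of erroring or substituting U+FFFD. Rough sketch:

    /// Decode potentially ill-formed UTF-16 ("WTF-16") and re-encode it as
    /// WTF-8, preserving unpaired surrogates.
    fn wtf16_to_wtf8(units: &[u16]) -> Vec<u8> {
        let mut out = Vec::new();
        for r in char::decode_utf16(units.iter().copied()) {
            match r {
                Ok(c) => {
                    let mut buf = [0u8; 4];
                    out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
                }
                Err(e) => {
                    // Lone surrogate: emit the plain 3-byte pattern (WTF-8 only).
                    let s = e.unpaired_surrogate() as u32;
                    out.extend([
                        0xE0 | (s >> 12) as u8,
                        0x80 | ((s >> 6) & 0x3F) as u8,
                        0x80 | (s & 0x3F) as u8,
                    ]);
                }
            }
        }
        out
    }

    // "a" followed by a lone U+D800 round-trips losslessly, but is not valid UTF-8:
    // assert_eq!(wtf16_to_wtf8(&[0x0061, 0xD800]), [0x61, 0xED, 0xA0, 0x80]);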