
> char's are UTF-8 code points

I'm guessing you mean that char is a UTF-8 code unit, since you keep saying they're only one byte, and a code point is far too large to fit in a byte / octet.

But that still seems very weird, because a UTF-8 code unit is almost but not quite the same as a byte. Users might be astounded when they can't put a byte into a char in this scheme (because it isn't a valid UTF-8 code unit), and yet almost no useful value is accrued by such a rule.




Ah, code unit is correct. I am always getting code unit and code point conflated. Sorry about that.

> a UTF-8 code unit is almost but not quite the same as a byte

D dodges that problem by having `byte` for signed 8 bits, and `ubyte` for unsigned 8 bits. That gets rid of the "is char unsigned or signed" confusion and bugs.
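
For illustration, a minimal sketch of the three 8-bit types (my own example, not from the docs):

    import std.stdio : writeln;

    void main()
    {
        byte  sb = -1;    // signed 8-bit integer
        ubyte ub = 0xFF;  // unsigned 8-bit integer
        char  c  = 'A';   // UTF-8 code unit; always unsigned, .init is 0xFF

        writeln(sb, " ", ub, " ", c);  // prints: -1 255 A
        static assert(byte.min == -128 && ubyte.max == 255 && char.max == 0xFF);
    }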


Aha, so if my D code doesn't care about text at all, I only need to worry about ubyte (or if I want them signed, byte) and char is irrelevant to me. That makes some sense.

Still, it seems weird to me to separate the concerns "This byte isn't a valid UTF-8 code unit" and "This string isn't valid UTF-8". I don't know what flavour of error handling D has, but I can't conceive of many places where I'd want to handle those cases differently, so if they're separate I have to do the same work twice.

Also, one very natural response to invalid UTF-8 is to emit U+FFFD (the replacement character �), but of course that doesn't fit into a single UTF-8 code unit, so an API which takes a ubyte and gives back a char must error.
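
To make that concrete, here's a quick sketch using std.utf.encode showing that U+FFFD occupies three UTF-8 code units (my example, not from the thread):

    import std.utf : encode;

    void main()
    {
        char[4] buf;
        immutable n = encode(buf, '\uFFFD');  // encode the replacement character
        assert(n == 3);                       // U+FFFD needs three UTF-8 code units
        assert(buf[0 .. n] == "\uFFFD");      // the bytes EF BF BD
    }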


I think you'll find that once you get used to the idea that char is for UTF-8 and ubyte is for other representations, it not only feels natural but also makes the code clearer.

As for invalid code points, you get to choose the behavior:

1. throw exception

2. use the replacement char

3. ignore it

All are valid in real life, and so D is accommodating.
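
A rough sketch of the first two options using std.utf (the flag defaults are from memory, so treat it as illustrative rather than authoritative):

    import std.utf : decode, byUTF, UTFException;
    import std.stdio : writeln;

    void main()
    {
        // 0xFF can never appear in well-formed UTF-8.
        char[] s = ['h', 'i', cast(char) 0xFF, '!'];

        // 1. Throw an exception: decode raises UTFException on the bad sequence.
        size_t i = 0;
        try
        {
            while (i < s.length)
                decode(s, i);
        }
        catch (UTFException e)
        {
            writeln("invalid UTF-8: ", e.msg);
        }

        // 2. Use the replacement char: byUTF substitutes U+FFFD for ill-formed
        //    sequences instead of throwing (assumed default behavior).
        foreach (dchar d; s.byUTF!dchar)
            writeln(d);  // h, i, U+FFFD, !

        // 3. Ignoring bad sequences amounts to skipping them yourself while
        //    decoding (not shown).
    }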


What is D's type for a unicode character (code point?)


A `string`, which is an alias for `immutable(char)[]`, i.e. a read-only view of an array of chars.
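
For example (my own sketch; note that .length counts UTF-8 code units, not code points):

    void main()
    {
        string s = "héllo";      // immutable(char)[]
        assert(s.length == 6);   // six code units: the é takes two bytes
        static assert(is(typeof(s[0]) == immutable(char)));
    }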


Note that, because of the above definition of char (any UTF-8 code unit), these "strings" aren't necessarily valid UTF-8, and you might be better off with D's dchar type (which is UTF-32 and can therefore hold an entire Unicode code point).

It's unclear to me whether there's actually enforcement to ensure a char can't become, say, 0xC0 (a byte that can never occur in valid UTF-8: it claims to start a multi-byte sequence, yet anything such a sequence could encode would fit in a single byte, making it a forbidden overlong encoding).

I spent a while reading the D website and came away, if anything, more confused on this topic.


Checking whether a string is valid UTF-8 requires code to execute, which gets in the way of performance, so D doesn't do it implicitly. You can check with functions like:

https://dlang.org/phobos/std_utf.html#validate

when it is convenient in your code to do so.
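
A quick sketch of what that looks like in practice, reusing the 0xC0 byte from upthread (my example):

    import std.utf : validate, UTFException;
    import std.stdio : writeln;

    void main()
    {
        char c = 0xC0;            // compiles: nothing stops a single char from
                                  // holding a value that is never valid UTF-8
        char[] s = ['a', c, 'b'];

        try
        {
            validate(s);          // whole-string check, run whenever you choose
        }
        catch (UTFException e)
        {
            writeln("not valid UTF-8: ", e.msg);
        }
    }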


I don't see much value in a distinct char type which has no real semantics beyond being an unsigned byte.

Calling validate on a whole string makes at least some sense; I don't love it, but it's not crazy. But it's obviously never worth "validating" a single char, so why not just call them bytes (or ubytes, if that's the preferred D nomenclature)?


There are some extra semantics that strings have in D, but you have a point. Why make it special?

It's because strings are such a basic thing in programming that baking them into the language has a way of standardizing them. Instead of everyone having their own string type, there is one string type.

One of the problems with strings not being a builtin type is: what does one do with string literal tokens? (C++ has this problem with its library string type.) An array of bytes in a function signature gives no clue whether it is intended to be a string or an array of other data.

It's just a big advantage to build it in and make it interact smoothly with the other core language features.
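
A small illustration of that point (hypothetical function names):

    // The parameter type documents intent: text vs. raw data.
    size_t countWords(string text) { return 0; }   // clearly operates on text
    void storeBlob(const(ubyte)[] data) {}         // clearly operates on raw bytes

    void main()
    {
        auto n = countWords("hello world");            // a string literal is already a string
        storeBlob(cast(const(ubyte)[]) "hello world"); // raw bytes need an explicit conversion
    }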



