The Greek Α has lowercase form α, whereas the Roman A has lowercase form a.
Another argument would be that you want a distinct encoding in order to be able to sort properly. Suppose we used the same codepoint (U+0050) for everything that looked like P. Then Greek Ρόδος would sort before Greek Δήλος because Roman P is numerically prior to Greek Δ in Unicode, even though Ρ comes later than Δ in the Greek alphabet.
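A quick Python sketch of that point (the spelling Pόδος with a Latin P is made up, just to stand in for the unified-codepoint scenario):

    # With distinct Greek code points, a plain code-point sort already matches
    # Greek alphabetical order, because Δ (U+0394) < Ρ (U+03A1).
    print(sorted(['Ρόδος', 'Δήλος']))   # ['Δήλος', 'Ρόδος']

    # If Ρ were unified with Latin P (U+0050), the Ρ-word would jump to the
    # front, because U+0050 sorts before U+0394.
    print(sorted(['Pόδος', 'Δήλος']))   # ['Pόδος', 'Δήλος']

Real-world sorting should of course go through a collation library rather than raw code points, but the example shows why the raw order at least stays sane.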
Apparently this works very well, except for a single letter, Turkish I. Turkish has two versions of 'i', and the Unicode folks decided to use the Latin 'i' for the lowercase dotted i and the Latin 'I' for the uppercase dotless I (and to add two new code points for the uppercase dotted İ and the lowercase dotless ı).
Now, 'I'.lower() depends on your locale.
This has been a cause of a number of security exploits and lots of pain in regular expression engines.
edit: Well, apparently 'I'.lower() doesn't depend on locale (so it's incorrect for Turkish); in JS you have to do 'I'.toLocaleLowerCase('tr-TR'). Regexps don't support it either.
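To make the non-locale-aware behaviour concrete, here's roughly what Python does (its str.lower()/str.upper() are locale-independent, like the default JS toLowerCase):

    # Locale-independent case mapping gives the English answer, which is wrong
    # for Turkish text:
    print('I'.lower())        # 'i'        (Turkish wants dotless 'ı' here)
    print('i'.upper())        # 'I'        (Turkish wants dotted 'İ' here)

    # The two extra Turkish code points map back onto the plain Latin forms:
    print('\u0131'.upper())   # 'I'        (ı -> I)
    print('\u0130'.lower())   # 'i\u0307'  (İ -> i + combining dot above)

Locale-aware casing needs an external library; with PyICU, something along the lines of UnicodeString('I').toLower(Locale('tr')) should give 'ı' (assuming PyICU is installed; that call is from memory).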
To me, it depends on what you think Unicode’s priorities should be.
Let’s consider the opposite approach, that any letters that render the same should collapse to the same code point. What about Cherokee letter “go” (Ꭺ) versus the Latin A? What if they’re not precisely the same? Should lowercase l and capital I have the same encoding? What about the Roman numeral for 1 versus the letter I? Doesn’t it depend on the font too? How exactly do you draw the line?
If Unicode sets out to say “no two letters that render the same shall ever have different encodings”, all it takes is one counterexample to break software. And I don’t think we’d ever get everyone to agree on whether certain letters should be distinct or not. Look at Han unification (and how poorly it was received) for examples of this.
To me it’s much more sane to say that some written languages have visual overlap in their glyphs, that this is to be expected, and that if you want to prevent two similar-looking strings from being confused with one another, you’re going to have to deploy an algorithm to de-dupe them. (Unicode even publishes an official list of these look-alikes, called “confusables”, devoted to helping you solve exactly this problem.)
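As a hedged sketch of what that de-duping can look like, here's a toy three-entry map standing in for the real confusables.txt data (the skeleton() helper is made up for illustration):

    # Fold a few known look-alikes to one canonical form before comparing.
    # A real implementation would use the full Unicode confusables.txt data
    # (UTS #39) or ICU's spoof checker.
    CONFUSABLES = {
        '\u0391': 'A',  # GREEK CAPITAL LETTER ALPHA
        '\u0410': 'A',  # CYRILLIC CAPITAL LETTER A
        '\u13AA': 'A',  # CHEROKEE LETTER GO
    }

    def skeleton(s: str) -> str:
        return ''.join(CONFUSABLES.get(ch, ch) for ch in s)

    print(skeleton('\u0410pple') == skeleton('Apple'))  # True: a confusable spoof of "Apple"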
They can be drawn the same, but when combining fonts (one Latin, one Greek), they might not be. Or, put differently, you don’t want to require that the Latin and Greek glyphs be designed by the same font designer just so that “A” is consistent with both.
There are more reasons:
– As a basic principle, Unicode uses separate encodings when the lower/upper case mappings differ. (The one exception, as far as I know, being the Turkish “I”.)
– Unicode was designed for round-trip compatibility with legacy encodings (which weren’t legacy yet at the time). To that effect, a given script would often be added as whole, in a contiguous block, to simplify transcoding.
– Unifying characters in that way would cause additional complications when sorting.
In some cases, because they have distinct encodings in a pre-Unicode character set.
Unicode wants to be able to represent any legacy encoding in a lossless manner. ISO8859-7 encodes Α and A to different code-points, and ISO8859-5 has А at yet another code point, so Unicode needs to give them different encodings too.
And, indeed, they are different letters -- as sibling comments point out, if you want to lowercase them then you wind up with α, a, and а, and that's not going to work very well if the capitals have the same encoding.
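The legacy-encoding point is easy to check with Python's stdlib codecs:

    # The look-alike capitals occupy different byte values in the legacy
    # encodings, so Unicode needs distinct code points to convert losslessly.
    print('Α'.encode('iso8859-7'))   # b'\xc1'  GREEK CAPITAL LETTER ALPHA
    print('A'.encode('iso8859-7'))   # b'A'     LATIN CAPITAL LETTER A
    print('А'.encode('iso8859-5'))   # b'\xb0'  CYRILLIC CAPITAL LETTER A
    # Collapsing all three onto U+0041 would make at least one of these
    # conversions lossy.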
Unicode's "Han Unification" https://en.wikipedia.org/wiki/Han_unification aimed to create a unified character set for the characters which are (approximately) identical between Chinese, Japanese, Korean and Vietnamese.
It turns out this is complex and controversial enough that the wikipedia page is pretty gigantic.
The basic answer here is that Unicode exists to encode characters, or really, scripts and their characters. Not typefaces or fonts.
Consider broadcasting text in Morse code. The Morse for the Cyrillic letter В is International Morse W, not B, even though it looks like a Latin B: the code follows the letter's identity, not its shape.
In the early years of Unicode, conversion from disparate encodings to Unicode was an urgent priority. Insofar as possible, they wanted to preserve the collation properties of those encodings, so the characters were in the same order as the original encoding whenever they could be.
But it's more that Unicode encodes scripts, which have characters; it doesn't encode shapes. There are 10,000 caveats to go with that: Unicode is messy and will preserve every mistake until the end of time. But encoding Α and A and А as three different letters, that they did on purpose, because they are three different letters, belonging to three different scripts.
It occurs to me (having mentioned collation order, in a different part of this thread, as one reason to distinguish scripts) that it might be unclear even for collation purposes when scripts are or are not distinct. That's especially true for the Cyrillic, Latin, and Arabic scripts, which are used to write many different languages, each of which has often added its own extensions.
I guess the official answer is "attempt to distinguish everything that any language is known to distinguish, and then use locales to implement different collation orders by language", or something like that?
But it's still not totally obvious how one could make a principled decision about, say, whether the encoding of Persian and Urdu writing (obviously including their extensions) should be unified with the encoding of Arabic writing. One could argue that Nastaliq is like a "font"... or not...
Characters in Unicode can have more than one script property, so the question "is this text entirely Bengali/Devanagari" can be answered even though they share characters. But Unicode encodes scripts, not languages, and not shapes.
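As a rough sketch of querying the script property from Python, assuming the third-party regex module (the stdlib re module doesn't understand script properties):

    import regex  # third-party: pip install regex

    # \p{Devanagari} matches characters whose Script property is Devanagari.
    def looks_devanagari(text: str) -> bool:
        return bool(regex.fullmatch(r'\p{Devanagari}+', text))

    print(looks_devanagari('नमस्ते'))   # True
    print(looks_devanagari('hello'))    # False

Shared punctuation such as the danda (U+0964) has Script=Common with script extensions covering several Indic scripts, so a fuller check would consult Script_Extensions as well.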
Many things we might want to do with strings require a locale property. Unicode once tried allowing an inline representation for this, but it was later deprecated. I'm not convinced that was the correct decision, but it is what it is. If you want to properly handle Turkish casing or Swedish collation, you have to know that the text you're working with is Turkish or Swedish; there's no way around it.
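A sketch of the collation half of that, using the stdlib locale module (assumes the named locales are installed on the system, and the exact output depends on the system's locale data):

    import locale

    # Swedish treats 'ö' as a separate letter sorted after 'z'; German treats
    # it essentially like 'o'. Same code points, different orderings.
    words = ['ordning', 'zebra', 'öl']

    locale.setlocale(locale.LC_COLLATE, 'sv_SE.UTF-8')
    print(sorted(words, key=locale.strxfrm))   # likely ['ordning', 'zebra', 'öl']

    locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
    print(sorted(words, key=locale.strxfrm))   # likely ['öl', 'ordning', 'zebra']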
Because some characters which look the same need to be treated differently depending on context. A 'toLowercase' function would convert Α->α, but A->a. That would be impossible if both variants had the same encoding.
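Concretely, in Python:

    # Each look-alike capital lowercases to a different letter, which only
    # works because they're distinct code points.
    for ch in ('A', '\u0391', '\u0410'):   # Latin, Greek, Cyrillic
        lo = ch.lower()
        print(f'U+{ord(ch):04X} {ch} -> U+{ord(lo):04X} {lo}')
    # U+0041 A -> U+0061 a
    # U+0391 Α -> U+03B1 α
    # U+0410 А -> U+0430 а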
They don’t necessarily look the same. The distinction is typographic, and only indirectly semantic.
Figure dash is defined to have the same width as a digit (for use in tabular output). Minus sign is defined to have the same width and vertical position as the plus sign. All three may differ for typographic reasons.
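For reference, here are a few of the dash-like code points in question, via the stdlib unicodedata (hyphen-minus and en dash included alongside for comparison):

    import unicodedata

    # Distinct code points with distinct typographic roles, even when a font
    # happens to draw some of them identically.
    for ch in ('\u002D', '\u2012', '\u2013', '\u2212'):
        print(f'U+{ord(ch):04X} {ch} {unicodedata.name(ch)}')
    # U+002D - HYPHEN-MINUS
    # U+2012 ‒ FIGURE DASH
    # U+2013 – EN DASH
    # U+2212 − MINUS SIGN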
If they're truly drawn the same (are they?) then why have a distinct encoding?