> Or you can sacrifice the ability to display Chinese correctly for the sake of displaying Japanese correctly, but no international software maker will do that.
I am not a speaker of either, but aren't kanji and written Chinese the same language? It's like French and English: you can use the same keyboard except for certain diacritics. What's so special about Japanese that it can't be displayed by Unicode? Unicode seems to work fine for Korean and Chinese, and Japanese is basically a hybrid of those two.
If the Unicode standard has space for an ever-expanding list of emojis, they can fix their rendering issues with the Japanese language too.
> The problem stems from the fact that Unicode encodes characters rather than "glyphs," which are the visual representations of the characters. There are four basic traditions for East Asian character shapes: traditional Chinese, simplified Chinese, Japanese, and Korean. While the Han root character may be the same for CJK languages, the glyphs in common use for the same characters may not be. For example, the traditional Chinese glyph for "grass" uses four strokes for the "grass" radical [⺿], whereas the simplified Chinese, Japanese, and Korean glyphs [⺾] use three. But there is only one Unicode point for the grass character (U+8349) [草] regardless of writing system. Another example is the ideograph for "one," which is different in Chinese, Japanese, and Korean. Many people think that the three versions should be encoded differently.
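To see the unification concretely, here's a minimal Python sketch (standard library only) showing that the string itself carries no language information:

    import unicodedata

    grass = "\u8349"  # 草
    print(hex(ord(grass)))          # 0x8349: one code point for every CJK locale
    print(unicodedata.name(grass))  # CJK UNIFIED IDEOGRAPH-8349
    # Nothing in the string records whether this is Chinese, Japanese, or Korean
    # text; the three- vs. four-stroke grass radical is purely a font decision.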
> aren't kanji and written Chinese the same language?
No. They're different languages, and the written forms are similar but distinct, akin to e.g. Fraktur: you can read glyphs from the other language, but it's harder and looks odd. (Yes, Unicode doesn't have codepoints for Fraktur either, but no real-life language uses Fraktur, so it's not a significant issue there.)
> What's so special about Japanese that it can't be displayed by Unicode? Unicode seems to work fine for Korean and Chinese, and Japanese is basically a hybrid of those two.
Everyday Korean is written in Hangul; Hanja are only for historical documents, which limits the impact. Taiwanese glyphs (which you don't mention, but for completeness) get their own codepoints because the Unicode Consortium had a certain amount of geopolitical realism. So the only collision between real-life languages is Chinese vs. Japanese, everything gets set up to use Chinese glyphs by default (or, more often, to use Chinese glyphs exclusively), and Japanese ends up the only real-life language whose glyphs don't have proper codepoints in Unicode (fobbed off with mumble mumble font selection mumble ranges vaporware).
> If the Unicode standard has space for an ever-expanding list of emojis, they can fix their rendering issues with the Japanese language too.
Oh, they absolutely could. They don't want to. Migration would also be difficult: you would have to make a hard compatibility break to ensure that people switched to the new standard, otherwise you'd have old documents that look almost but not quite right, and search that doesn't work properly (searching for the new codepoints wouldn't find characters encoded at the old ones, because no one actually follows the standards for how to do text search; they just search for byte substrings and call it good).
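The search problem is real even today, without any migration: naive substring search already misses canonically equivalent text. A small Python illustration:

    import unicodedata

    nfc = "caf\u00e9"    # 'café' precomposed
    nfd = "cafe\u0301"   # 'café' decomposed: 'e' + combining acute accent
    print(nfc == nfd)                 # False: different code points, same text
    print("\u00e9" in nfd)            # False: naive substring search fails
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True: normalize first
    # Re-encoding Japanese at new code points would break search the same way:
    # old byte sequences would no longer match the new ones.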
It doesn't work fine for Korean and Chinese either; we just accept it begrudgingly.
Check out the Noto Sans CJK fonts repo[1]; as of now it has five variants: Japanese, Korean, Simplified Chinese, Traditional Chinese, and Hong Kong. There wouldn't be a need for so many variants if Unicode worked perfectly.
But Unicode is already infinitely better than what existed before, so as I said above, we just kind of accept it begrudgingly.
Japanese adopted the "traditional" Chinese characters and their system evolved from that point. Both Chinese and Japanese characters underwent a simplification process from those original traditional forms, but the simplifications aren't always the same. Japanese also has hiragana and katakana, which are syllabic and phonetic and often used to represent foreign names. So the short answer is that no, you can't simply use the Chinese Unicode block, whether traditional or simplified, to represent Japanese.
Korean uses Hangul, which is an alphabet and entirely unrelated to Chinese or Japanese.
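For what it's worth, the kana aren't part of the problem; they have their own dedicated blocks, so the unification pain is confined to the kanji. A quick Python check:

    # Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) have their own
    # blocks; only the CJK Unified Ideograph blocks are shared across languages.
    for ch in ("\u3042", "\u30a2", "\u8349"):  # あ, ア, 草
        print(f"U+{ord(ch):04X} {ch}")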
I'm not Korean, but my understanding is that they do use the Chinese-derived script (Hanja) occasionally, for emphasis or to resolve ambiguity, among other things. So some users will need access to that script too.
My very limited understanding is that Hanja use is extremely rare today, only ever cropping up in niches of academia. Basically everyone will exclusively use Hangul and will disambiguate with synonyms, idioms, etc. instead of Hanja.
Yes, from what I've heard it would be very few users, definitely not the general public - maybe academia, signage, marketing, historical/legal use, etc. It might be enough to simply plug the gap with occasional images instead of fonts. But CJK fonts often do apparently include Hanja.
It's not the same language, just the script. Someone brought Japan the newfangled concept of language scribbled on objects sometime between the 3rd and 5th centuries AD, and the Japanese nobles adopted that Chinese written language for record keeping and transcriptions.
There was already a Japanese spoken language, and nobody in Japan had personal connections to Chinese people (having a rough sea between the two countries tends to do that), so the Japanese interpretation of the "Chinese language" and its characters is completely its own thing - in China the script is called "Hanzi", pronounced something like "khang-zeh", rather than "kanji" as in Japanese.
With the "Simplified" form created by Chinese communist movement in the mid 20th century, there are currently at least 3 major branches of Chinese scripts: the OG "Traditional" kind, its alternate "Simplified" form, and the Japanese "Kanji" - plus (deprecated)Vietnam Chunom variant, (minor)Hong Kong PRC-traditional variation, Korean Kanji w/o post-war Japanese simplifications, etc.
Each of these has slight variations and significant overlap. Unicode technically supports a lot of those - some by co-mingling, some by rubber-stamped-in duplicates, some by IVS (Ideographic Variation Sequences).
To realistically support all of these languages in one font or app, there need to be distinctions based on language, rather than the hand-wavy "it's all kanji" approach (the word kanji is Japanese, for starters), but the Unicode Consortium is not doing that.
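As a rough illustration of the IVS mechanism mentioned above: a base ideograph followed by a variation selector requests a specific glyph, assuming the font supports the registered sequence (辻, U+8FBB, is a commonly cited example from the Adobe-Japan1 collection; treat the specific sequence here as illustrative). A Python sketch:

    base = "\u8fbb"       # 辻
    vs17 = "\U000e0100"   # VARIATION SELECTOR-17, the first ideographic VS
    seq = base + vs17
    print([f"U+{ord(c):04X}" for c in seq])  # ['U+8FBB', 'U+E0100']
    # Two code points, but rendered as a single glyph variant by a font
    # that supports the Ideographic Variation Database; ignored otherwise.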
Many kanji and Chinese glyphs are distinct. There are also many variants of Chinese, among them Simplified and Traditional. They don't have distinct Unicode codepoints because Han unification tried to cram them all into a 16-bit code space.
You have to know the intended locale of the text to disambiguate and select the correct glyphs.
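In practice that disambiguation travels out-of-band, e.g. as HTML lang attributes that the renderer uses for font selection. A hypothetical sketch (the markup here is illustrative, not from any particular codebase):

    # Same code point twice; the renderer picks locale-appropriate glyphs
    # from the lang tags, because the text itself can't carry the distinction.
    ideograph = "\u8349"  # 草
    html = (
        f'<span lang="zh-Hans">{ideograph}</span>'  # simplified Chinese glyph
        f'<span lang="ja">{ideograph}</span>'       # Japanese glyph
    )
    print(html)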
Weirdly, there _are_ separate codepoints for some simplified vs. traditional Chinese characters, cf. 丟 (traditional) vs. 丢 (simplified) [1]. This contributes to the annoyance of Japanese users; it feels like the Unicode Consortium went out of their way to mangle only their language.
[1] One major reason is that pre-Unicode charsets exist which encode both traditional and simplified characters, and one of the primary goals of Unicode is to support round-trip mapping from any charset into Unicode and back without loss of information.
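You can verify the footnote's pair directly; the two forms really do occupy distinct code points:

    # 丟 (traditional) and 丢 (simplified) were both present in legacy charsets,
    # so round-trip compatibility forced Unicode to encode them separately.
    print(hex(ord("丟")), hex(ord("丢")))  # 0x4e1f 0x4e22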
Well, the world is on UTF-8 now; they could extend the standard and put the Japanese characters there. There's plenty of space, since it's variable-length.
That's actually what they did. I heard from people involved in the CJK unification project that the Japanese representatives insisted this be done as a condition for their participation.
Given that there were thousands of such characters, some characters slipped through and the same code point was used by both Chinese and Japanese.
The Japanese have been bickering about the handful of those cases ever since.
You are half right - just like Unicode. IIRC you can mostly translate kanji and hanzi characters almost 1:1, but you need to know whether to use a Japanese or a Chinese font, and Unicode just assumes you have the right font (and good luck if a span of text has both for some reason).
Like, imagine if coordonner (French), coordinate (English), and coördinate (variant) were all encoded as the same bytes.
IMHO this is as stupid as an Italian complaining about Americans using English fonts to display their language.
As I mentioned in another comment, there already are separate Japanese and Chinese code points for most CJK characters. It's just a handful of cases where, for whatever reason, the same code points were used (maybe some characters weren't separated due to human error), and the Japanese (it's always the Japanese) have been bickering about the situation ever since.
> and Japanese is basically a hybrid of those two.
This is prime /r/BadLinguistics fodder, but you've ironically hit the nail on the head.
The underlying issue is that Unicode was run by people who thought 16 bits was enough[0], ran into the issue of Chinese characters, and imposed a bunch of very specific unification rules to work around their own self-imposed technical limitation so they could retain 16-bit codepoints. The rule is that characters that only differ in appearance are treated as the same character[1].
To explain how dumb this is, I'm going to invent a concept called UniPhoenician. You see, Latin, Greek, Cyrillic, and a few other phonetic scripts derive significantly from Phoenician writing, so we're going to just merge the letters that happen to have a shared pedigree, e.g. Latin a, Greek alpha, and Cyrillic a. Of course, once we do that, software has to be consciously aware of what language text is written in so it can substitute the right set of glyphs to work around UniPho.
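For contrast, here's what Unicode actually does for those scripts; the letters UniPhoenician would have merged each keep their own code point:

    import unicodedata

    for ch in ("a", "\u03b1", "\u0430"):  # Latin a, Greek alpha, Cyrillic a
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0061  LATIN SMALL LETTER A
    # U+03B1  GREEK SMALL LETTER ALPHA
    # U+0430  CYRILLIC SMALL LETTER A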
To make this even dumber, the limitation that motivated UniHan went away with Unicode 2.0, which expanded the code space to 21 bits. Except we didn't fix UniHan, AND we subtly broke old software. You see, 16-bit codepoints were the only encoding for Unicode 1.0. Unicode 2.0 added UTF-8[2] and UTF-16, the latter of which is a set of rules for fitting the larger codepoints into 16-bit text in a way that subtly breaks old software, which would hopefully get updated so people could just pick whatever codepoint length they wanted.
Well, uh... turns out Windows NT and JavaScript were already using 16-bit codepoints, and integrating the new UTF-16 rules into them subtly breaks existing software built on that. So those can never be fixed, and any software built on their text-handling capabilities is subtly broken in the face of emoji, rare characters, and so on. Bonus points: because they can't understand UTF-16's surrogate pairs, naively written conversion functions working with 16-bit-only software will leak invalid UTF-16 sequences into UTF-8, as documented in WTF-8[3].
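Both halves of that breakage are easy to demonstrate in Python, whose str type exposes the issue directly:

    # An astral character costs two UTF-16 code units (a surrogate pair),
    # which is what 16-bit-era software miscounts.
    s = "\U0001f600"  # 😀, outside the BMP
    print(len(s.encode("utf-16-le")) // 2)  # 2 code units

    # A lone surrogate, as leaked by naive 16-bit code, is not valid UTF-8:
    lone = "\ud83d"
    try:
        lone.encode("utf-8")
    except UnicodeEncodeError as e:
        print(e)  # strict UTF-8 refuses it
    print(lone.encode("utf-8", "surrogatepass"))  # WTF-8-style ill-formed bytes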
[0] Competing proposals for a universal character set, including ISO's UCS, used 4-byte characters.
[1] Unless the difference is between Simplified and Traditional Chinese, because Unicode didn't wanna piss off Mainland China but was okay with pissing off Japan and Korea
[2] AKA Filesystem Safe Unicode, which shoves Unicode into 8-bit units in a way that is mostly acceptable and doesn't impose any Latin-centrism that non-Latin script users need to worry about. Bonus points: it was originally designed for 31-bit codepoints (anticipating UCS-4?). If we ever needed those bits back, UTF-8 could handle them, while UTF-16 would require additional layers of hacks that would spill over into WTF-8.
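The headroom point is easy to see: UTF-8's byte count simply grows with the code point, all the way to the top of the current code space (and the original design stretched to 6 bytes for 31-bit values):

    # UTF-8 length by code point; no hacks needed as the code space grows.
    for cp in (0x24, 0xA2, 0x20AC, 0x10348, 0x10FFFF):
        print(f"U+{cp:06X} -> {len(chr(cp).encode('utf-8'))} bytes")
    # 1, 2, 3, 4, 4 bytes respectively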
> The underlying issue is that Unicode was run by people who thought 16 bits was enough[0]
It really depends on how you look at it. I wouldn't fault the original designers of Unicode who thought 16 bits was enough.
In modern Unicode there are apparently 90k+ CJK characters, but honestly "nobody" actually uses more than about 10k of them. The thing is that the codepoints proliferate:
One for Japan, one for Simplified, and one for Traditional - that's potentially 3x. Korea and Vietnam want some too.
Then you have this thing called the Kangxi dictionary, the most comprehensive Chinese character dictionary, which includes every obscure character people could find in the classical literature. Probably half of the characters in Kangxi have this description: "Variant of <more common character>". (I'm pretty sure a significant portion of these variants are just "typos" in old books.)
If they had been more "economical" in the usage of codepoints - i.e., not creating new code points for each country (just letting them use the right fonts), not including all the frivolous characters from Kangxi, etc. - I'm pretty sure it would have been technically possible for CJK to use fewer than 10k codepoints in total (which would have let Unicode fit within 16 bits).
But this is a political nightmare and nobody wants this compromise. And the Unicode Consortium isn't going to prescribe to Asian countries how their languages should be used.
So while the 16-bit limit was a bit tight even assuming optimal conditions, I really wouldn't fault the people who designed it for failing to realize that Asian cultural politics was so complicated. Heck, AFAIK, before CJK unification, ALL East Asian encodings were 16-bit, and that was sufficient for each country respectively.
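For a rough sense of scale, a sketch that counts assigned code points in the three biggest CJK Unified Ideograph blocks (later extensions push the overall total past 90k):

    import unicodedata

    blocks = {
        "URO   (U+4E00..U+9FFF)":   (0x4E00, 0x9FFF),
        "Ext A (U+3400..U+4DBF)":   (0x3400, 0x4DBF),
        "Ext B (U+20000..U+2A6DF)": (0x20000, 0x2A6DF),
    }
    total = 0
    for label, (lo, hi) in blocks.items():
        n = sum(1 for cp in range(lo, hi + 1)
                if unicodedata.name(chr(cp), None) is not None)
        total += n
        print(f"{label}: {n}")
    print("total:", total)  # roughly 70k in these three blocks alone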
Not just Windows and JavaScript, the JVM (Java etc.) and CLR (C# etc.) also use 16-bit encodings natively. It's actually kind of amazing how short-lived 16-bit Unicode was and yet how much of an effect it has on software today.