
> Can you say more about this? Are there still reasons to prefer shiftjis over unicode for japanese characters?

Yes. Unicode uses codepoints that are primarily for Chinese characters to represent Japanese characters that they consider equivalent, even when those Japanese characters have different appearances; as a result, Japanese text in unicode looks bad (readable, but ugly) unless displayed in a specifically Japanese font (in which case you'd have the converse problem of Chinese text looking wrong). The unicode consortium suggests various vaporware approaches to combat this, but the thing that actually works is keeping Japanese text in Japanese encodings and Chinese text in Chinese encodings. (Of course this means that you need to be able to display text from multiple encodings in the same page if you want to display both languages on the same page, but all of the unicode consortium vaporware fixes require you to build something equivalently complex, and you wouldn't even be able to test it by using strings from two western encodings since it would be specific to Japanese and Chinese.)




There's no difference between encoding a page in Shift_JIS and tagging the page as Japanese language (assuming your browser switches fonts correctly based on language tags). Chinese text in Shift_JIS isn't going to look correct either.


I think the suggested alternative though is having UTF-8 take care of this.

UTF-8 already has different codepoints for every other language, like I can type russian (День), greek (Ημέρα), english, even egyptian hieroglyphs (𓀃) all of them in this one text field, and they all render right for both of us if we have a suitable font. If they don't render, it's unambiguous.

It's only chinese and japanese where I can type some characters, and depending on if you have a chinese or japanese font _first_ in your system, it might render wrong, and either way it's ambiguous to the computer without further metadata. The computer doesn't need metadata to know that "α" is "α", so why does it need metadata to know if 直 is the chinese (http://www.hanzi5.com/bishun/76f4.html) or japanese (https://kanjivg.tagaini.net/viewer.html?kanji=%E7%9B%B4) character? Those are two different-looking characters that share the same codepoint.
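
For anyone who wants to poke at this, here is a small Python sketch of the point above. It only shows that 直 is a single codepoint in Unicode, while in legacy encodings the choice of encoding doubles as a language hint. The three codec names are just ones Python ships; treating them as "the" Japanese and Chinese encodings is my assumption.

```python
import unicodedata

ch = "\u76f4"  # 直, the character from both links above

# One codepoint, whether the writer meant Chinese or Japanese.
print(f"U+{ord(ch):04X}", unicodedata.name(ch))
print("utf-8:", ch.encode("utf-8").hex())

# In legacy encodings the bytes only make sense together with the
# encoding name, and that name implies a language (and so a font).
legacy = {name: ch.encode(name) for name in ("shift_jis", "big5", "gb2312")}
for name, raw in legacy.items():
    print(f"{name}: {raw.hex()} -> {raw.decode(name)}")
```

Every legacy form decodes back to the same U+76F4, which is exactly the unification being complained about: once the text is in Unicode, the language hint has to travel some other way.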


Unicode does not have different codepoints for every language. Leaving aside scripts that aren't encoded at all, all languages using the Latin script get a single set of glyphs, and if you want Comic Sans MS or Carolingian minuscule or fine-grained control over whether "a" has a hook at the top or not, you need to specify a font that renders it the way you want, just like with 直.

Because Unicode is about the meaning of characters, not their appearance. If there were a context where the two different shapes of 直 have different meaning, Unicode would add a new codepoint to distinguish them. In fact "ɑ" without hook does have a separate codepoint because it is used in linguistics to mark a vowel different from "a".


If CYRILLIC SMALL LETTER A deserves a code point despite being so similar to LATIN SMALL LETTER A, why doesn't a Japanese character that actually looks different from a Chinese one get a codepoint? And the idea that something can have the "same meaning" across two languages is very wooly.
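
The lookalike letters mentioned in the last two comments can be checked directly. This is a minimal Python sketch using the standard `unicodedata` module; the `LOOKALIKES` list is just my selection of the characters discussed here.

```python
import unicodedata

# Escapes are used so the exact codepoints are unambiguous in source.
LOOKALIKES = ["\u0061", "\u0251", "\u03b1", "\u0430"]

def describe(chars):
    """Map each character to its official Unicode character name."""
    return {ch: unicodedata.name(ch) for ch in chars}

names = describe(LOOKALIKES)
for ch, name in names.items():
    print(f"U+{ord(ch):04X} {ch} {name}")
```

So Latin "a", the linguistic "ɑ", Greek "α" and Cyrillic "а" are four separate codepoints, whereas the two shapes of 直 share one.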


Why don't Japanese characters in seal script or handwriting get different codepoints from mincho font characters? They look different too.

Sometimes looking visually different doesn't matter because if you know how to write them by stroke order, you'll still be able to read it (which is how handwriting works, I think; I'm pretty bad at reading that…)

Hanzi simplification in the Mainland also complicated things, since I doubt they wanted to make all of those into different characters.


Sure, but every application handles encoding (or at least, every application did up until the recent UTF-8-only movement), whereas language tagging is web/html-specific.


If you're really hardcore processing multilingual text you need an equivalent to language tags anyway, because you need a dictionary for word-wrapping, date formatting, quote marks etc along with changing fonts (or glyph selectors in the same font.) TeX has them too.

But usually people only care about their language, so it goes by the system UI language and it works out.


> If you're really hardcore processing multilingual text you need an equivalent to language tags anyway, because you need a dictionary for word-wrapping, date formatting, quote marks etc along with changing fonts (or glyph selectors in the same font.) TeX has them too.

Sure. And once you're doing that you don't gain a lot of benefit from unicode AFAICS, because you have to track these spans of locale-specific text.


Lets you copy characters that are the same (like people's names) between text though.


People's names are the thing you should be most careful to not do that with! Most Japanese people will generally put up with seeing the Chinese rendering of a character in ordinary text, but they really don't want you to do that to their names!


What's wrong with just wrapping UTF-8 text in lang spans? Browsers and Electron will just do the right thing, no?


Even assuming that works, getting it to make its whole way through your whole tech stack is no easier (and more HTML-specific) than having spans with their own encoding.
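
To make "spans with their own encoding" concrete, here is a rough Python sketch. The `(encoding, bytes)` tuple layout and the `decode_spans` helper are hypothetical, purely for illustration; no real format is implied.

```python
from typing import List, Tuple

# Hypothetical representation: each span carries its own encoding label.
Span = Tuple[str, bytes]

def decode_spans(spans: List[Span]) -> str:
    """Decode every span with its own label and join the results."""
    return "".join(raw.decode(encoding) for encoding, raw in spans)

japanese = "\u3053\u3093\u306b\u3061\u306f"  # こんにちは
chinese = "\u4f60\u597d"                      # 你好

spans = [
    ("shift_jis", japanese.encode("shift_jis")),
    ("big5", chinese.encode("big5")),
]
text = decode_spans(spans)

# If a label is lost or wrong, the bytes no longer mean the same thing.
mislabeled = spans[0][1].decode("big5", errors="replace")
print(text)
print(repr(mislabeled))
```

The label has to survive every layer that touches the bytes, which is the same bookkeeping a language tag needs; the difference is that a wrong encoding label corrupts the text rather than just picking a less suitable font.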


Since when can HTML pages have multiple encodings at the same time?


This isn't about multiple encodings, this is about the one encoding, UTF-8, being able to represent multiple languages. Which it mostly can, except for han characters.

In my reply here, I can type in english and russian at once. Привет, мир.

Yet, if I try to type chinese on one line, and japanese on the next, I cannot do it. Hacker news does not let me enter "lang" tags, so I can only type either the chinese or the japanese variant of a kanji.


Parent post says

> spans with their own encoding.

so yes it is about pages with multiple encodings. A span's smaller than a page!

The Unicode answer is "variation selectors" which are used for some historical variant kanji, but not for whole language switching. I suppose they could be used for that too though.
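
For reference, a variation sequence is just the base character followed by a selector codepoint. A minimal Python sketch follows; using 葛 (U+845B) with VARIATION SELECTOR-17 as the example pair is my assumption of a registered sequence, and whether any variant glyph actually shows up depends entirely on the font.

```python
import unicodedata

BASE = "\u845b"        # 葛
VS17 = "\U000E0100"    # first selector of the supplementary block

def is_variation_selector(ch: str) -> bool:
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def strip_variation_selectors(text: str) -> str:
    """Drop selectors so comparisons see only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

variant = BASE + VS17
print(len(variant), unicodedata.name(VS17))
print(strip_variation_selectors(variant) == BASE)
```

The base codepoint is unchanged, so software that ignores selectors still sees the plain character. That also means the selector picks a glyph variant per character rather than marking a run of text as Chinese or Japanese.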


I don't know about HTML specifically; I meant spans as a general concept rather than a literal <span> tag. If the web stack has actually implemented mixing languages on the same page to the point where you can use it in a "normal" application then that's very cool (and if they've done it with their lang tags rather than by allowing mixed encodings, well, fine), but I've yet to see a site that actually has that up and running.


I can't read it, but there will be hundreds of articles on Japanese Wikipedia covering Chinese literature etc that have text in both languages, all in Unicode.


Had a look, they have a macro for it. Cool!


Also,

  <html lang="ja">
    Japanese text ... 
    <span lang="zh">
      Chinese quote
    </span>
    ... 
  </html>
is much easier than mixing encodings. With the above entirely in Unicode, it will be handled reliably by anything that can handle Unicode, and is still reasonably readable even if the Chinese text is shown in a Japanese font. Reading just the fourth line without the third will still show something 'OK'.
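
If the text is stored as language-tagged runs, producing markup like the above is mechanical. A small Python sketch, where `render_runs` and the `(lang, text)` run format are hypothetical names of mine and the sample strings are placeholders:

```python
from html import escape

def render_runs(page_lang, runs):
    """Wrap runs whose language differs from the page in a lang span."""
    parts = []
    for lang, text in runs:
        body = escape(text)
        if lang == page_lang:
            parts.append(body)
        else:
            parts.append(f'<span lang="{escape(lang)}">{body}</span>')
    return f'<p lang="{escape(page_lang)}">' + "".join(parts) + "</p>"

runs = [("ja", "日本語の本文 "), ("zh", "中文引文")]
markup = render_runs("ja", runs)
print(markup)
```

The whole result is one UTF-8 string, so anything that can carry Unicode can carry it; dropping the span costs the font hint but not the text.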

Mixing Shift_JIS and Big5 sounds like a recipe for corruption, but something similar was done in an old Russian and Japanese encoding: https://en.wikipedia.org/wiki/Shift_Out_and_Shift_In_charact...

RFC 2482 was a Unicode adaptation of this, but it was deprecated 12 years ago: https://www.rfc-editor.org/rfc/rfc6082.html
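
For the curious, the RFC 2482 scheme put the language tag in-band using Plane 14 characters: U+E0001 LANGUAGE TAG followed by "tag" copies of the ASCII letters. A Python sketch of how such a prefix was built, shown only to illustrate the mechanism since this use is deprecated (`plane14_tag` is a hypothetical helper name):

```python
import unicodedata

LANGUAGE_TAG = "\U000E0001"

def plane14_tag(lang: str) -> str:
    """Build an in-band language tag from printable ASCII characters."""
    for c in lang:
        if not 0x20 <= ord(c) <= 0x7E:
            raise ValueError("tag characters mirror printable ASCII only")
    # Each tag character sits at 0xE0000 plus the ASCII value.
    return LANGUAGE_TAG + "".join(chr(0xE0000 + ord(c)) for c in lang)

tag = plane14_tag("ja")
for ch in tag:
    print(f"U+{ord(ch):05X} {unicodedata.name(ch)}")
```

The tag characters are invisible and travel with the text, which is the Unicode analogue of a shift-in/shift-out byte.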



