Not sure if the l33tspeak analogy is fully justified.
In the case of the "missing" letter (called khanda-ta in Bengali) for the Bengali equivalent of "suddenly": historically, it has been a derivative of the ta-halant form (ত + ্). As the language evolved, khanda-ta became a grapheme of its own, and Unicode 4.1 did encode it as a distinct grapheme. A nicely written review of the discussions around the addition can be found here: http://www.unicode.org/L2/L2004/04252-khanda-ta-review.pdf
I could write the author's name fine: আদিত্য. A search with the string in the Bengali version of Wikipedia pulls up quite a few results as well, so other people are writing it too. The final "letter" in that string is a compound character, and there's no clear evidence that it needs to be treated as an independent one. Even in primary school, we were taught the final "letter" in the author's name as a conjunct. In contrast, for the khanda-ta case, it could be shown that modern Bengali dictionaries explicitly referred to khanda-ta as an independent character.
For me, many of these problems are more of an input issue than an encoding issue. Non-Latin languages have had to shoehorn their scripts onto keyboard layouts designed for Latin scripts, and that has always been suboptimal. With touch devices we have newer ways to think about this problem, and people are starting to try things out.
[Disclosure: I was involved in the Unicode discussions about khanda-ta (I was not affiliated with a consortium member) and I have been involved with Indic localization projects for the past 15 years]
Well, yes and no. The jophola at the end is not actually given its own codepoint[0]. The best analogy I can give is to a ligature in English[1]. The Bengali fonts that you have installed happen to render it as a jophola, the way some fonts happen to render "ff" as the single ligature glyph "ﬀ" - but that's not the same thing as saying that it actually is a jophola (according to the Unicode standard).
The difference between the jophola and an English ligature, though, is that English ligatures are purely aesthetic. Typing two "f" characters in a row has the same obvious semantic meaning as the ff ligature, whereas the characters that are required to type a jophola have no obvious semantic, phonetic, or orthographic connection to the jophola.
> The Bengali fonts that you have installed happen to render it as a jophola
It's not only the Bengali font - the text rendering framework of my operating system also needs to have a bunch of complex rules to figure out that a jophola needs to be rendered. It also needs to know that the visual ordering of i-kar is before the preceding consonant cluster (দ in আদিত্য).
> the characters that are required to type a jophola have no semantic, phonetic, or orthographic connection to the jophola.
Not so sure about that. The fact that it's called a "jo"-phola points to a relationship. The relationship may have become less apparent as the script has evolved (though there are words such as সহ্য which make the relationship more visible), but the distinction is still not as pronounced as between "ta" and "khanda-ta". For the khanda-ta case, it was explicit from the then-current editions of the dictionaries produced by the language bodies of both Bangladesh and West Bengal that the character had become distinct (স্বতন্ত্র বর্ণ was the phrase that was used). As far as I know, there hasn't been any such claim about jophola from the language bodies. Also, if you look at the collation in Bengali dictionaries, jo-phola is treated as (্+য) for collation.
While one can make the case that ত্য is simply "'to' - 'o' + 'ya' = 'to'"[0][1], it's rather confusing mental acrobatics, and it doesn't reflect either how the writing system is taught, or how native speakers use it and think of it on a day-to-day basis.
If anything, your comment makes a stronger argument for consolidating ই and ি (they are literally the same letter and phoneme, but written differently in different contexts) than for combining the viram and য into the jophola.
[0] To non-Bengali speakers reading this, yes, this is how that construction would work, and yes, I am aware that the arithmetic doesn't appear to add up (which I guess is part of the point).
[1] Also, now that I think about it, the য is a consonant, not a vowel, so using it in place of a vowel is doubly awkward. This is particularly an issue in Bengali, where sounds that might be consonants in English (like "r" and "l") can be either consonants or vowels in Bengali, depending on the word.
> [...] it's rather confusing mental acrobatics, and it doesn't reflect either how the writing system is taught, or how native speakers use it and think of it on a day-to-day basis.
Mental acrobatics are part-and-parcel of the language, either in digital or non-digital form. If I were to spell out your name aloud, I would end with "ত-এ য-ফলা", which doesn't really say anything about how ত্য is pronounced. While writing on paper, we say "ক-এ ইকার", and then we reorder what we just said to write the ইকার in front of the ক. Even more complicated mental acrobatics - we say ক-এ ওকার, and then proceed to write half of the ওকার in front of the ক and then the other half after the ক. We don't necessarily think about these when we carry out these acrobatics in our head, but they exist, and we have made the layer on top of the encoding system (rendering, and to some extent, input) deal with these acrobatics as well. My point in the original comment (and to some extent in the preceding one) was to emphasize that a lot of these issues are at the input method level - we should not have to think about encoding as long as it accurately and unambiguously represents whatever we want it to represent.
Just out of curiosity - I would be interested to know more about your learning experience that you feel is not well aligned with the representation of jophola as it is currently.
> My point in the original comment (and to some extent in the preceding one) was to emphasize that a lot of these issues are at the input method level - we should not have to think about encoding as long as it accurately and unambiguously represents whatever we want it to represent.
I might be sympathetic to this, except that keyboard layouts and input (esp. on mobile devices) are an even bigger mess and even more fragmented than character encoding. Furthermore, while keys are not a 1:1 mapping with Unicode codepoints, they are very strongly influenced by the defined codepoints.
It'd be nice to separate those two components cleanly, but since language is defined by how it's used, this abstraction is always going to be very porous.
> Just out of curiosity - I would be interested to know more about your learning experience that you feel is not well aligned with the representation of jophola as it is currently.
I have literally never once heard the jophola referred to as a viram and a য, except in contexts such as this one. Especially since the jophola creates a vowel sound (instead of the consonant য), and especially since the jophola isn't even pronounced like "y". (I understand why the jophola is pronounced the way it is, but arguing on the basis of phonetic Sanskrit is a poor representation of Bengali today - by that point, we might as well be arguing that the thorn[0] is equivalent to "th" today, or that æ should be a separate letter and not a diphthong[1])
I'm not going to claim that it's completely without pre-digital precedent, but it certainly is not universal, and it's inconsistent in one of the above ways no matter how one slices it, especially when looking at some incredibly obscure and/or antiquated modifiers that are given their own characters, despite being undeniably composed of other characters that are already Unicode codepoints[2].
Not necessarily disagreeing with your broader point, but I just want to point out that the examples are only obscure and antiquated in English. æ is common in modern Danish and unambiguously a separate letter, as is þ in Icelandic.
And Norwegian (for æ). We have æ/Æ and ø/Ø (distinct, in theory, from the symbol for the empty set, btw), while å/Å used to be written aa/AA a long time ago (but is obviously not a result of combining two a's). Swedish uses ö/Ö for essentially ø/Ø, and ä/Ä for æ/Æ. And both of those I can only easily type by combining the dot-dot with o/O, a/A, because my Norwegian keyboard layout has keys for/labelled øæå, not öäå.
For Japanese (and Chinese and a few others) things are even more complicated. It's tricky to fit ~5000 symbols on a keyboard, so typically in Japan one types on either a phonetic layout or a Latin layout, and translates to kanji as needed (eg: "nihongo" or "にほんご" is transformed to "日本語" -- note also that "ご" itself is a compound character, "ko" modified by two dots to become "go" -- which may or may not be entered as a compound, with a modifier key).
As I currently don't have any Japanese input installed under xorg, that last bit I had to cut and paste.
It is entirely valid to view ø as a combination of o and a (short) slash, or å as a combination of "a" and "°" -- but if one does that while typing, it is important that software handles the compound correctly (and distinct from ligatures, as mentioned above). My brother's name, "Ståle" is five letters/symbols long, if reversed it becomes "elåtS", not "el°atS" (six symbols).
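To make the reversal point concrete - a minimal sketch in Java, assuming the name arrives in decomposed (NFD) form, as it would off an HFS+ filename:

    import java.text.Normalizer;

    String s = Normalizer.normalize("Ståle", Normalizer.Form.NFD); // å becomes a + U+030A
    String reversed = new StringBuilder(s).reverse().toString();
    // StringBuilder.reverse keeps surrogate pairs intact, but knows nothing about
    // combining marks: the ring now follows the 'l', so it renders as "el̊atS"
    System.out.println(reversed);

(A naive char-by-char reverse is worse still, since it can also split surrogate pairs.)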
So, yeah, it's complicated. Remember that we've spent many years fighting the hack that was ASCII and extended ASCII (which may be (part of) why e.g. Norwegian gets to have å rather than a+°). You still can't easily use UTF-8 with either C or, as I understand it, C++ (almost, but not quite -- AFAIK one easy workaround is to use Qt's strings if one can have Qt as a dependency -- and it's still a mess on Windows, due to their botched wide char hacks... etc).
All in all, while it's nice to think that one can take some modernized, English-centric ideas evolved from the Gutenberg press, and mash it together with what constitutes a "letter" (How hard is it to reverse a string!? How hard is it to count letters!? How hard is it to count words?!) -- that approach is simply wrong.
There will always be magic, and there'll be very few things that can be said with confidence to be valid across all locales. What is to_upper("日本語"), reverse("Ståle"), character_count("Ståle"), word_count("日本語") etc.
This turned into a bit more of an essay than I intended, sorry about that :-)
To be fair, proper codepoint processing is a pain even in Java, which was created back when Unicode was a 16-bit code. Now that Unicode extends past 16 bits (code points run up to U+10FFFF), proper Unicode string looping looks something like this:
    for (int i = 0; i < string.length();) {
        final int codepoint = string.codePointAt(i);
        // advance by 1 or 2 chars, depending on whether the code point
        // lies outside the BMP and is stored as a surrogate pair
        i += Character.charCount(codepoint);
    }
Actually, that's not correct, and it's the exact same mistake I made when using that API. codePointAt returns the codepoint at index i, where i is measured in 16-bit chars, which means you could index into the middle of a surrogate pair.
The correct version is:
    for (int i = 0; i < string.length(); i = string.offsetByCodePoints(i, 1)) {
        int codepoint = string.codePointAt(i);
    }
Java 8 seems to have acquired a codePoints() method on the CharSequence interface which seems to do the same thing.
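For reference, a minimal sketch of the Java 8 way - codePoints() yields an IntStream of code points, so there's no manual surrogate bookkeeping:

    "Ståle".codePoints()
           .forEach(cp -> System.out.printf("U+%04X%n", cp));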
But this just underscores the fact that proper Unicode string processing is a pain :).
I think you missed the part where `i` is not incremented in the for statement, but inside the loop using `Character.charCount`, which returns the number of `char` necessary to represent the code point. If there's something wrong with this, my unit tests have never brought it up, and I am always sure to test with multi-`char` codepoints.
> except that keyboard layouts and input (esp. on mobile devices) are an even bigger mess and even more fragmented than character encoding.
Encoding was in a similar place 10-15 years ago. Almost every publisher in Bengali had their own encoding, font, and keyboard layout - the bigger ones built their own in-house systems, while the smaller ones used systems that were built or maintained by very small operators. To make things even more complicated, these systems needed a very specific combination of operating system and page layout software to work. Now the situation is much better, with most publishers switching to Unicode, at least for public-facing content.
With input methods, I expect to see at least some consolidation - I don't necessarily think we need standards here, but there will be clear leaders that emerge. Yes, keyboard layouts are influenced by Unicode code-points, but only in a specific context. Usually when people who already have experience with computers start to type in Bengali (or any other Indic language), they use a phonetic keyboard, which is influenced mostly by the QWERTY layout. Then, if they write a significant amount, they find that the phonetic input is not very efficient (typing kha every time to get খ is painful), and they switch to a system where there's a one-to-one mapping between commonly used characters and keys. This does tend to have a relationship between defined codepoints and keys, but that's probably because the defined codepoints cover the basic characters in the script (so in your case, ্য would need to have a separate key, which I think is fine). There will still be awkward gestures, but that's again a part of adjusting to the new medium. No one bats an eyelid when hitting "enter" to get a newline - but when we learn to write on paper, we never encounter the concept of a carriage return.
> I have literally never once heard the jophola referred to as a viram and a য
Interesting - I guess we have somewhat different mental models. For me, I did think of jophola as a "hoshonto + jo", possibly because of the "jo" connection, and this was true even before I started to mess around with computers or Unicode. I always thought about jophola as a "yuktakshar", and if it's a "yuktakshar", I always mentally broke it down to its constituents.
> [...] especially when looking at some incredibly obscure and/or antiquated modifiers that are given their own characters
I think those exist because of backwards compatibility reasons. For Bengali I think Unicode made the right choice to start with the minimum number of code points (based on what ISCII had at that time). As others have pointed out elsewhere in the thread - it is an evolving standard, and additions are possible. Khanda-ta did get accepted, and contrary to what many think, non-consortium members can provide their input (for example, I am acknowledged in the khanda-ta document I linked to earlier, and all I did was participate in the mailing list and provide my suggestions and some evidence).
A better question is: are there any native Bengali speakers creating character set standards in Bangladesh or India? If not, why not? If so, did they omit your character?
I ask, because although you prefer to follow the orthodox pattern of blaming white racism for your grievance du jour, the policy of the Unicode Technical Committee for years has been to use the national standards created by the national standards bodies where these scripts are most used as their most important input.
Twenty years ago, I spent a lot of time in these UTC meetings, and when the question arose as to whether to incorporate X-Script into the standard yet, the answer was never whether these cultural imperialists valued, say, Western science fiction fans over irrelevant foreigners, but it was always, "What is the status of X-Script standardization in X-land?" Someone would then report on it. If there was a solid, national standard in place, well-used by local native speakers in local IT applications, it would be fast-tracked into Unicode with little to no modification after verification with the national authorities that they weren't on the verge of changing it. If, however, there was no official, local standard, or several conflicting standards, or a local standard that local IT people had to patch and work around, or whatever, X-Script would be put on a back burner until the local experts figured out their own needs and committed to them.
The complaint in this silly article about tiny Klingon being included before a complete Bengali is precisely because getting Bengali right was more complex and far more important. Apparently, the Bengali experts have not yet established a national standard that is clear, widely implemented, agreed upon by Bengali speakers and that includes the character the author wants in the form he/she wants it, for which he/she inevitably blames "mostly white men."
(Edited to say "he/she", since I don't know which.)
I mostly agree with your point, but note that the author is male (well, the name is a commonly male one).
It's a bit telling that folks in the software industry[1] seem to assume that techies are male (a priori), but those who write articles of this kind are female.
Not blaming you for it, but it's something you should try to be conscious about and fix.
[1] I've been guilty of this myself, though usually in cases where I use terms like "guys" where I shouldn't be.
I had a female coworker by that name, so your assumption that I just assume that people who write articles like this are female and need to have my consciousness raised to "fix" my unconscious sexism is something you should try to be more conscious of and try to fix.
However, I clearly do need to question my assumption that since this was a female name before, it's a female name now, so I should change it to "he/she".
Oh, sorry about that. Not sure if you're joking about the assumption of assumptions, but asking people to take note of their behavior based on something that they _might_ have assumed is not dangerous. Assuming gender roles is. Apologies for making that assumption, but IMO it's a rather harmless one so I don't see anything to fix about it :P
> The complaint in this silly article about tiny Klingon being included before a complete Bengali is precisely because getting Bengali right was more complex and far more important.
This is factually incorrect. It seems you missed both the factual point about the Klingon script in the article as well as the broader point which that detail was meant to illustrate.
> although you prefer to follow the orthodox pattern of blaming white racism for your grievance du jour, the policy of the Unicode Technical Committee for years has been to use the national standards created by the national standards bodies where these scripts are most used as their most important input.
There's a huge difference between piggybacking off of a decades-old proposed scheme which was never widely adopted even in its country of origin, and which was created under a very different set of constraints than Unicode, and which was created to address a very different set of goals than Unicode, versus making native speakers an active and equal part of the actual decision-making process.
Rather than trying to shoehorn the article into a familiar pattern which doesn't actually fit ("orthodox pattern of blaming white racism for your grievance du jour"), please take note that the argument in the article is more nuanced than you're giving it credit for.
> versus making native speakers an active and equal part of the actual decision-making process.
As I explained, native speakers are the primary decision makers, and not just any native speakers but whoever the native speakers choose as their own top, native experts when they establish their own national standard. For living, natural languages, you don't get characters into Unicode by buying a seat on the committee and voting for them. You do it by getting those characters into a national standard created by the native-speaking authorities.
So, I repeat: What national standard have your native-speaking authorities created that reflects the choices you claim all native speakers would naturally make if only the foreign oppressors would listen to them? If your answer is that the national standards differ from what you want, then you are blaming the Unicode Technical Committee for refusing to override the native speakers' chosen authorities and claiming this constitutes abuse of native Bengali speakers by a bunch of "mostly white men".
> As I explained, native speakers are the primary decision makers
No, the ultimate decision makers of Unicode are the voting members of the Unicode Consortium (and its committees).
> For living, natural languages, you don't get characters into Unicode by buying a seat on the committee and voting for them. You do it by getting those characters into a national standard created by the native-speaking authorities
As referenced elsewhere in the comments, there are plenty of decisions that the Unicode Consortium (and its committees) take themselves. Some of these (though not all) take "native-speaking authorities" as an input, but the final decision is ultimately theirs.
There's a very important difference between being made an adviser (having "input") and being a decision-maker, and however much the decision-makers may value the advisers, we can't pretend that those are the same thing.
You claim that native Bengali speakers on the UTC would have designed the character set your way, the real native speaker way, instead of the bad design produced by these "mostly white men".
But the character set WAS designed by native speakers, by experts chosen not by the UTC but by the native speaking authorities themselves. The UTC merely verified that these native speaking experts were still satisfied with their own standard after using it for a while, and when they said they were, the UTC adopted it.
You go on about how the real issue is the authority of these white men and how the native speakers are restricted to a minor role as mere advisers, and yet the native speakers, as is usually the case, had all the authority they needed to create the exact character set that THEY wanted and get it adopted into Unicode. That's the way the UTC wants to use its authority in almost all cases of living languages.
Unfortunately for your argument, these native speakers didn't need any more authority to get the character set they wanted into Unicode. They got it. You just don't like their choices, but you prefer to blame it on white men with authority.
It seems to me that the high-level issue here is that Unicode is caught between people who want it to be a set of alphabets, and people who want it to be a set of graphemes.
The former group would give each "semantic character" its own codepoint, even when that character is "mappable" to a character in another language that has the same "purpose" and is always represented with the same grapheme (see, for example, Latin "a" vs. Japanese full-width "a", or duplicate ideograph sets among the CJK languages.) In extremis, each language would be its own "namespace", and a codepoint would effectively be described canonically as a {language, offset} pair.
The latter group, meanwhile, would just have Unicode as a bag of graphemes, consolidated so that there's only one "a" that all languages that want an "a" share, and where complex "characters" (ideographs, for example, but what we're talking about here is another) are composed as ligatures from atomic "radical" graphemes.
I'm not sure that either group is right, but trying to do both at once, as Unicode is doing, is definitely wrong. Pick whichever, but you have to pick.
Unicode makes extensive use of combining characters for european languages, for example to produce diacritics: ìǒ or even for flag emoji. A correct rendering system will properly combine those, and if it doesn't then that's a flaw in the implementation, not the standard. It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.
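For instance, flag emoji aren't even single code points: each one is a pair of Regional Indicator Symbols that the renderer combines. A minimal sketch in Java:

    // U+1F1FA U+1F1F8 ("US" as regional indicator letters) renders as one flag
    String flag = new String(Character.toChars(0x1F1FA))
                + new String(Character.toChars(0x1F1F8));
    System.out.println(flag); // shows 🇺🇸 where fonts support it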
> Unicode makes extensive use of combining characters for european languages, for example to produce diacritics: ìǒ or even for flag emoji.
But it doesn't, for example say that a lowercase "b" is simply "a lowercase 'l' followed by an 'o' followed by an invisible joiner", because no native English speaker thinks of the character "b" as even remotely related to "lo" when reading and writing.
> It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.
I'm saying that Unicode only does it in English where it makes semantic sense to a native English speaker. It does it in Bengali even where it makes little or no semantic sense to a native Bengali speaker.
> > It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.
> I'm saying that Unicode only does it in English where it makes semantic sense to a native English speaker.
Well, combining characters almost never come up in English. The best I can think of would be the use of cedillas, diaereses, and grave accents in words like façade, coördinate and renownèd (I've been reading Tolkien's translation of Beowulf, and he used renownèd a lot).
Thinking about the Spanish I learned in high school, ch, ll, ñ, and rr are all considered separate letters (i.e., the Spanish alphabet has 30 letters; ch is between c and d, ll is between l and m, ñ is between n and o, and rr is between r and s; interestingly, accented vowels aren't separate letters). Unicode does not provide code points for ch, ll, or rr; and ñ has a code point more from historical accident than anything (the decision to start with Latin1). Then again, I don't think Spanish keyboards have separate keys for ch, ll, or rr.
Portuguese, on the other hand, doesn't officially include k or y in the alphabet. But it uses far more accents than Spanish. So, a, ã and á are all the same letter. In a perfect world, how would Unicode handle this? Either they accept the Spanish view of the world, or the Portuguese view. Or, perhaps, they make a big deal about not worrying about languages and instead worrying about alphabets ( http://www.unicode.org/faq/basic_q.html#4 ).
They haven't been perfect. And they've certainly changed their approach over time. And I suspect they're including emoji to appear more welcoming to Japanese teenagers than they were in the past. But (1) combining characters aren't second-class citizens, and (2) the standard is still open to revisions ( http://www.unicode.org/alloc/Pipeline.html ).
Spanish speaker here. "ch" and "ll" being separate letters has been discussed for a long time and finally the decision was that they weren't separate letters but a combination of two [1]. Meanwhile, "ñ" stands as a letter of its own.
Accented vowels aren't considered different letters in Spanish because they affect the word they are in rather than the letter, as they serve to indicate which one is the "strong" syllable in a word. From a Spanish point of view, "a" and "á" are exactly the same letter.
I'm coming from a German background and I sympathize with the author.
German has 4 (7 if you consider cases) non-ASCII characters:
ä, ü, ö, ß (and the upper-case umlauts). All of these are unique, well-defined codepoints.
That's not related to composing on a keyboard. In fact, although I'm German, I'm using the US keyboard layout and HAD to compose these characters just now. But I wouldn't need to, and the result is a single codepoint again.
> German has 4 (7 if you consider cases) non-ASCII characters: ä, ü, ö, ß (and the upper-case umlauts). All of these are unique, well-defined codepoints.
German does not consider "ä", "ö" and "ü" letters. Our alphabet has 26 letters, none of which are the ones you mentioned. In fact, if you go back in history it becomes even clearer that those letters used to be ligatures in writing.
They are still collated as the basic letters they represent, even if they sound different. That we usually use the precomposed representation in Unicode is merely a historical artifact of ISO 8859-1 and others, not because it logically makes sense.
When you used an old typewriter you usually did not have those keys either, you composed them.
I'm confused by your use of 'our' and 'we'. It seems you're trying to write from the general point of view of a German, answering .. a German?
Are umlauts letters? Yes. [1] [2] Maybe not the best source, but please provide a better one if you disagree so that I can actually understand where you're coming from.
I understand - I hope? - composition. And I tend to agree that it shouldn't matter much if the input just works. If I press a key labeled ü and that letter shows up on the screen, I shouldn't really care if that is one codepoint or a composition of two (or more).
I do think that the history you mention is an indicator that supports the author's argument. There IS a codepoint for ü (painful to type..). For 'legacy reasons' perhaps. And it feels to me that non-ASCII characters that originate in western Europe/in my home country have better support - for legacy reasons or whatever - than the ones he is complaining about.
(Basically, I searched for old typewriter models; 'Adler Schreibmaschinen' results in lots of hits like that. Note the separate umlaut keys. And these are typewriters from .. the 60s? Maybe?)
I am not entirely sure if Germans count umlauts as distinct characters or modified versions of the base character. And maybe it is not so important; they still do deserve their own code points.
Note BTW that in e.g. the Swedish and German alphabets, there are some overlapping non-ASCII characters (ä, ö) and some that are distinct to each language (å, ü). It is important that the Swedish ä and German ä are encoded as the same code point and the same representation in files; this way I can use a computer localised for Swedish and type German text. Only when I need to type ü do I need to compose it from ¨ and u, while ä and ö are right on the keyboard.
The German alphabetical order supports the idea that umlauts are not so distinct from their bases: it is
AÄBCDEFGHIJKLMNOÖPQRSßTUÜVWXYZ
while the Swedish/Finnish one is
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
This has the obvious impacts on sorting order.
BTW, traditionally Swedish/Finnish did not distinguish between V and W in sorting, thus a correct sorting order would be
Vasa
Westerlund
Vinberg
Vårdö
- the W drops right in the middle, it's just an older way to write V. And Vå... is at the end of section V, while Va... is at the start.
German has valid transcriptions to their base alphabet for those, e.g "Schreoder" is a valid way to write "Schröder".
ß, however, is a separate character that is not listed in the German alphabet, especially because some subgroups don't use it (e.g. Swiss German doesn't have it).
1) To avoid confusing readers that don't know German or are used to umlauts: The correct transcription is base-vowel+e (i.e. ö turns to oe - the example given is therefore wrong. Probably just a typo, but still)
2) These transcriptions are lossy. If you see 'oe' in a word, you cannot (always) pronounce it as umlaut. The second e just might indicate that the o in oe is long.
3) ß is a character in the alphabet, as far as I'm aware and as far as the mighty Wikipedia is concerned, as I pointed out above. If you have better sources that claim something else, please share those (I .. am a native speaker, but no language expert. So I'm genuinely curious why you'd think that this letter isn't part of the alphabet).
Fun fact: I once had to revise all the documentation for a project, because the (huge, state-owned) Swiss customer refused perfectly valid German, stating "We don't have that letter here, we don't use it: Remove it".
1) It's a typo, yes. Thanks!
2) Well, they are lossy in the sense that pronunciation is context-sensitive. The number of cases where you actually turn the word into another word is very small: http://snowball.tartarus.org/algorithms/german2/stemmer.html has a discussion.
3) You are right, I'm wrong. ß, ä, ö, ü are considered part of the alphabet. It's not taught in school, though (at least it wasn't in mine).
Thanks a lot for making the effort and fact-checking better than I did there :).
Yes, that transcription approach is familiar; here the result of German-Swedish-Finnish equivalency of "ä" is sometimes not so good.
For instance, in skiing competitions, the start lists are for some reason made with transcriptions to ASCII. It's quite okay that Schröder becomes Schroeder, but it is less desirable that Söderström becomes Soederstroem and quite infuriating that Hämäläinen becomes Haemaelaeinen. We'd like it to be Hamalainen, just drop the dots.
Well, they have codepoints, but not unique ones (since they can be written both using combining characters or using the compatibility pre-combined form). Software libraries dealing with Unicode strings need to handle both versions, by applying Unicode normalization before doing comparisons.
The reason they have two representations is for backwards compatibility with previous character encoding standards, but the unicode standard is more complex because of this (it needs to specify more equivalences for normalization). I guess for languages which were not previously covered by any standards, the unicode consortium tries to represent things "as uniquely as possible".
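To make the normalization point concrete, a minimal sketch in Java (java.text.Normalizer has shipped with the JDK since Java 6):

    import java.text.Normalizer;

    String precomposed = "\u00E4";   // ä as a single code point
    String decomposed  = "a\u0308";  // a + COMBINING DIAERESIS
    System.out.println(precomposed.equals(decomposed));  // false: raw code points differ
    System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
            .equals(precomposed));                       // true: equivalent after NFC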
> But I wouldn't need to and the result is a single codepoint again..
Doesn't have to be though, it'd be perfectly correct for an IME to generate multiple codepoints. IIRC, that's what you'd get if you typed those in a filename on OSX then asked for the native file path, as HFS+ stores filenames in NFD. Meanwhile Safari does (used to do?) the opposite, text is automatically NFC'd before sending. Things get interesting when you don't expect it and don't do unicode-equivalent comparisons.
.Net seems to do the same thing, Javascript (according to jsfiddle) as well. So maybe this is more widespread than I thought (again - I have never seen that character in the wild)?
Java (as in Try Clojure) seems to do the 'expected' SS thing. Trying the golang playground I get something even worse:
    fmt.Println(strings.ToUpper("ßẞ"))

returns

    ßẞ
(yeah, unchanged?)
So, while I agree that you're technically correct (ẞ exists!) I do stick to my ~7 letters list for now.. It seems that's both realistic and usable.
I think this is more related to the fact that there aren't many sane libraries implementing unicode and locales -- so you'll get either some c lib/c++ lib, system lib, java lib -- or an actual new implementation that's actually been done "seriously" -- as part of being able to say: "Yes, X does actually support unicode strings.".
Python 3 got a lot of flak for the decision to break away from byte sequences to Unicode strings. But I think that was the right choice. I can still understand the people writing software that only cares about network, on-the-wire, pretend-to-be-text type strings.
Then again, based on some other comments here, apparently there are still some dark corners:
    Python 3.2.3 (default, Feb 20 2013, 14:44:27)
    [GCC 4.7.2] on linux2
    >>> s="Åßẞ"
    >>> s == s.upper().lower()
    False
    >>> s.lower()
    'åßß'
Thanks for pointing that out -- I was vaguely aware 3.2 wasn't good (but pypy still isn't up to 3.4?) -- it's what's (still) in Debian stable as python3 though. Jessie (soonish to be released) will have 3.4 though, so at that point python3 should really start to be viable (to the extent that there are differences that actually are important...).
[ed: Also, wrt upper/lower being for display purposes -- I thought it was nice to point out that they are not symmetric, as one might expect them to be (although that expectation is probably wrong in the first place)]
That's good to know. I learned Portuguese in '97-'99, so the information I had was incorrect at the time. We Americans always recited the alphabet with k and y, but our teacher said they weren't official (although he also said that Brazilians would recognize them).
rr not being its own letter has no bearing on whether you can pronounce carro correctly, just like saying church right has no bearing on whether c and h are two letters or ch is a single letter.
I'm afraid that I have to add my voice to the list of people raised in Spanish speaking countries prior to the 90s who was VERY clearly taught that rr was a separate letter.
In your example... I wouldn't really care how it is stored, as long as it looks right on the display, and I don't have to go through contortions to enter it on an input device... for example, I don't care that 'a' maps to \x61... it's a value behind the scenes; what matters is the interface to that value.
As long as the typeface/font used can display the character/combination reasonably, and I can input it reasonably, it doesn't matter so much how it's stored...
Now, having to type in 2-3 characters to get there, that's a different story, and one that should involve better input devices in most cases.
(I can't read Bengali, so I'm not entirely sure what the jophola is, but I'm trying to relate this to Devanagari -- if my analogy is mistaken or if you don't know Devanagari, let me know)
I don't see anything discriminatory about not giving glyphs their own codepoints. Devanagari has tons of glyphs which are logically broken up into consonants, modifying diacritics, and bare vowels. Do we really need separate codepoints for these[1] when they can be made by combinations of these[2]?
I mentioned this as a reply to another comment, but it's only a Unicode problem if:
- There is no way to write the glyph as a combination of code points
- There is a way to write the glyph as a combination of code points, but the same combination could mean something else (not counting any rendering mistakes, the question is about if Unicode defines it to mean something else)
If it's hard to input, that's the fault of the input method. If it doesn't combine right, it's the fault of the font.
> For me, many of these problems are more of an input issue than an encoding issue.
I think you've hit the nail on the head here. I'm a native English speaker, so I may in fact be making bad assumptions here, but I think the biggest issue here is that people conflate text input systems with text encoding systems. Unicode is all about representing written text in a way that computers can understand. But the way that a given piece of text is represented in Unicode bears only a loose relation to the way the text is entered by the user, and the way it's rendered to screen. A failure at the text input level (can't input text the way you expect), or a failure at the text rendering level (text as written doesn't render the way you expect), are both quite distinct from a failure of Unicode to accurately represent text.
They're not unrelated though. You have to have a way to get from your input format to the finished product in a consistent way, and the glyph set you design has a large bearing on that. You can't solve it completely with AI, because then you just have an AI interpretation of human language, not human language. A language like Korean written in Hangul would need to create individual glyphs from smaller ones through the use of ligatures, but a similar approach couldn't be taken to Japanese, since many glyphs have multiple meanings depending on context. How should these be represented in Unicode? Yes, these are likely solved problems, but I'm sure there are other examples of less-prominent languages that have similar problems but nobody's put in the work to solve them because the languages aren't as popular online.
You need to be able to represent the language at all stages of authorship - i.e. Unicode needs to be able to represent a half-written Japanese word somehow (yes, Japanese is a bad example because it has a phonetic alphabet as well as a pictograph alphabet).
Anyway, trying to figure out a single text encoding scheme capable of representing every language on Earth is not an easy task.
It's not an AI issue, just a small matter of having lots of rules. Moreover, this is not just an issue for non-Western languages: the character â (lower case "a" with a circumflex) can be represented either as a single code point U+00E2 or as an "a" combined with a "^". Furthermore, Unicode implementations are required to evaluate these two versions as being equal in string comparisons, so if you search for the combined version in a document, it should find the single code point instances as well.
> Unicode implementations are required to evaluate these two versions as being equal in string comparisons
What do you mean by "required"? There are different forms of string equality. It's plausible to have string equality that compares the actual codepoint sequence, vs string equality that compares NFC or NFD forms, and there's string equality that compares NFKC or NFKD forms. And heck, there's also comparing strings ignoring diacritics.
Any well-behaving software that's operating on user text should indeed do something other than just comparing the codepoints. In the case of searching a document, it's reasonable to do a diacritic-insensitive search, so if you search for "e" you could find "é" and "ê". But that's not true of all cases.
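One way to get that behaviour in Java - a sketch using java.text.Collator, whose PRIMARY strength compares base letters while ignoring accents and case:

    import java.text.Collator;
    import java.util.Locale;

    Collator c = Collator.getInstance(Locale.FRENCH);
    c.setStrength(Collator.PRIMARY);
    System.out.println(c.compare("e", "é") == 0); // true: accent ignored
    System.out.println("e".equals("é"));          // false: raw code points differ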
(OK, so "required" might be overstating it; you are perfectly free to write a program that doesn't conform to the standard. But most people will consider that a bug unless there is a good reason for it)
Unicode defines equivalence relations, yes. But nowhere is a program that uses Unicode required to use an equivalence relation whenever it wishes to compare two strings. It probably should use one, but there are various reasons why it might want strict equality for certain operations.
In some languages those accented characters would be different letters, sometimes appearing far away from each other in collation order. In other cases they are basically the same letter. Whereas in Hungarian 'dzs' is a letter.
Different languages can define different collation rules even when they use the same graphemes. For example, in Swedish z < ö, but in German ö < z. Same graphemes, different collation.
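A minimal sketch of that difference in Java, assuming the JDK's built-in collation tables for Swedish and German:

    import java.text.Collator;
    import java.util.Locale;

    Collator swedish = Collator.getInstance(new Locale("sv", "SE"));
    Collator german  = Collator.getInstance(Locale.GERMAN);
    System.out.println(swedish.compare("ö", "z") > 0); // true: Swedish sorts ö after z
    System.out.println(german.compare("ö", "z") < 0);  // true: German sorts ö with o, before z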
And we may even have more than one set of collation rules within the same language.
E.g. Norwegian had two common ways of collating æ,ø,å and their alternative forms ae, oe and aa. Phone books used to collate "ae" with æ, "oe" with ø and "aa" with å, while in other contexts "ae", "oe" and "aa" would often be collated based on their constituent parts. It's a lot less common these days for the pairs to be collated with æøå, but still not unheard of.
Of course it truly becomes entertaining to try to sort out when mixing in "foreign" characters. E.g I would be inclined to collate ö together with ø if collating predominantly Norwegian strings, since ö used to be fairly commonly used in Norway too, but these days you might also find it collated with "o".
Why does Unicode need to represent a half-written Japanese word? If it's half-written, you're still in the process of writing it, and this is entirely the domain of your text input system.
Which is to say, there is absolutely no need for the text input system to represent all stages of input as Unicode. It is free to represent the input however it chooses to do so, and only produce Unicode when each written unit is "committed", so to speak. To demonstrate why this is true, take the example of a handwriting recognition input system. It's obviously impossible to represent a half-written character in Unicode. It's a drawing! When the text input system is confident it knows what character is being drawn, then it can convert that into text and "commit" it to the document (or the text field, or whatever you're typing in).
But there's nothing special about drawing. You can have fancy text input systems with a keyboard that have intermediate input stages that represent half-written glyphs/words. In fact, that's basically what the OS X text input system does. I'm not a Japanese speaker, so I actually don't know whether all the intermediate forms that text input in the Japanese input systems (there are multiple of them) goes through have Unicode representations, but the text input system certainly has a distinction between text that is being authored and text that has been "committed" to the document (which is to say, glyphs that are in their final form and will not be changed by subsequent typed letters). And I'm pretty sure the software that you're typing in doesn't know about the typed characters until they're "committed".
Edit: In fact, you can even see this input system at work in the US English keyboard. In OS X, with the US English keyboard, if you type Option-E, it draws the ACUTE ACCENT glyph (´) with a yellow background. This is a transitional input form, because it's waiting for the user to type another character. If the user types a vowel, it produces the appropriate combined character (e.g. if the user types "e" it produces "é"). If the user types something else, it produces U+00B4 ACUTE ACCENT followed by the typed character.
>Why does Unicode need to represent a half-written Japanese word? If it's half-written, you're still in the process of writing it, and this is entirely the domain of your text input system.
Drafts or word documents (which are immensely simpler if just stored as unicode). Then there's the fact that people occasionally do funky things with kanji anyway, so you're doing everyone a favour by letting them half-write a word anyway.
Interestingly enough, one early Apple II-era input method for Chinese would generate the characters on the fly (since the hardware at the time couldn't handle storing, searching and rendering massive fonts) meaning it could generate partial Chinese characters or ones that didn't actually exist.
FWIW there is an interesting project called Swarachakra[1] which tries to create a layered keyboard layout for Indic languages (for mobile) that's intuitive to use. I've used it for Marathi and it's been pretty great.
They also support Bengali, and I bet they would be open to suggestions.
Wait, "ত + ্ + [ZWJ] = ৎ" is nothing like "\ + / + \ + / = W".
The Bengali script is (mostly) an abugida. I.e., consonants have an inherent vowel (/ɔ/ in the case of Bengali), which can be overridden with a diacritic representing a different vowel. To write /t/ in Bengali, you combine the character for /tɔ/, "ত", with the "vowel silencing diacritic" to remove the /ɔ/, "্". As it happens, for "ত", the addition of the diacritic changes the shape considerably more than it usually does, but it's perfectly legitimate to suggest the resulting character is still a composition of the other two (for a more typical composition, see "ঢ" + "্" = "ঢ্").
As it happens, the same character ("ৎ") is also used for /tjɔ/ as in the "tya" of "aditya". Which suggests having a dedicated code point for the character could make sense. But Unicode isn't being completely nutso here.
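You can see the distinction by dumping code points - a sketch in Java (U+09A4 is TA, U+09CD is the virama/hasanta, U+09CE is the khanda-ta added in Unicode 4.1):

    // khanda-ta is one code point; ta + virama is a two-code-point sequence
    "ৎ".codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));  // U+09CE
    "ত্".codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp)); // U+09A4, U+09CD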
Native Bengali here. The ligature used for the last letter of "haTaat" (suddenly) is not the same as the last ligature in "aditya" - the latter doesn't have the circle at the top.
More generally, using the vowel silencing diacritic (hasanta) along with a separate ligature for the vowel ending - while theoretically correct - does not work, because no one writes that way! Not using the proper ligatures makes the text essentially unreadable.
I don't understand Bengali at all. I'm trying to understand your second sentence though. When you say "no one writes that way", do you mean nobody hits the keys for letter, followed by vowel-silencing diacritic, followed by another vowel? Or do you mean the glyph that results from that combination of keystrokes doesn't match how a Bengali speaker would write it on paper?
If it's the latter, isn't that an issue for the text input system to deal with? Unicode does not need to represent how users input text, it merely needs to represent the content of the text. For example, in OS X, if I press option-e for U+0301 COMBINING ACUTE ACCENT, and then type "e", the resulting text is not U+0301 U+0065. It's actually U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which can be decomposed into U+0065 U+0301. And in both forms (NFC and NFD), the unicode codepoint sequence does not match the order that I pressed the keys on the keyboard.
So given that, shouldn't this issue be solved for Bengali at the text input level? If it makes sense to have a dedicated keystroke for some ligature, that can be done. Or if it makes sense to have a single keystroke that adds both the vowel-silencing diacritic + vowel ending, that can be done as well.
---
If the previous assumption was wrong and the issue here is that the rendered text doesn't match how the user would have written it on paper, then that's a different issue. But (again, without knowing anything about Bengali so I'm running on a lot of assumptions here) is that still Unicode's fault? Or is it the fault of the font in question for not having a ligature defined that produces the correct glyph for that sequence of codepoints?
On page 3 I see two different pieces of Bengali. One is text, and the other is an image. I assume you're referring to the text? What makes it wrong? And what software are you using to view the PDF? If it's wrong, it's quite possible that the software you're using doesn't render it correctly, rather than the document actually being wrong.
I have viewed it in Acrobat Reader, epdfview, Chrome's PDF Reader, Firefox's pdf reader and my iPhone's pdf reader.
The problems are that joint-letter ligatures are not used, and several vowel signs are placed after the consonant when they should have been placed before it.
I'm a bit confused too. There's a principled argument in Unicode for when a glyph gets its own codepoint vs. when it's considered sufficient to use a combining form. I don't know Bengali at all so can't comment on this case, although given that the character is now in Unicode, I guess the argument changed over time. Somewhere buried in the Unicode Consortium notes is an explicit case for the inclusion/exclusion of this character; it'd be interesting to find it.
I don't understand Bengali at all, but the character ৎ does have its own Unicode codepoint (U+09CE, BENGALI LETTER KHANDA TA). It was introduced in Unicode 4.1 in 2005.
I get the author's point, but assigning blame to the Unicode Consortium is incorrect. The Indian government dropped the ball here. They went their separate way with "ISCII" and other such misguided efforts, instead of cooperating with UC. To me, the UC is just a platform.
The government is the de facto safeguarder of the people's interests; if it drops the ball, it should be taken to task, not the provider of the platform.
There are over 100 languages spoken in India, many of which are not even Hindustani in origin. The Indian government primarily recognizes two languages: Hindi and Urdu. Additionally, Urdu is the official national language of Pakistan.
Are speakers of Bengali, Tamil, Marathi, Punjabi, etc. really supposed to depend on the Indian government to assure that the Unicode Consortium supports their native tongues? What about speakers of Hindustani/Indo-Iranian languages who do not live in India?
I don't think it's right to pin the blame on India and tell Bengali speakers that they're barking up the wrong tree. If the Unicode Consortium purports to set the standard for the encoding of human speech, then it seems to me that the responsibility should fall squarely on them.
The Indian government recognizes 22 languages, not 2. The government has a department dedicated to their support: http://tdil.mit.gov.in Why isn't someone asking that department WTF has it been doing?
I don't purport to know what the Technology Development for Indian Languages Departments is doing with its time, but perhaps its inefficacy is a signal that speakers of such languages should not need to depend on them for techno-linguistic representation.
This is a terrible excuse: the Unicode Consortium should always seek out at least one (if not a group of) native speakers of a language before defining code points for that language. These speakers really should be both native speakers of and experts in the language.
There are countless ways to reach out to Bengali speakers, only one of which is the Indian government - whatever politics governments may play, a technocratic institution should be focused on getting things correct.
That's ridiculous. How can you expect UC to reach out to the 1000s of languages out there? Plus, shouldn't the party who expects to benefit put in the effort?
Did you know that the Indian Government has a department for just such a thing, http://tdil.mit.gov.in/ ? What has it been doing all these years?
"The Unicode Consortium is a non-profit corporation devoted to developing, maintaining, and promoting software internationalization standards and data, particularly the Unicode Standard, which specifies the representation of text in all modern software products and standards."
The fact that it is their goal to set the standard for the textual representation of human speech means that they take on that responsibility.
Wikipedia is not forced unto the world as everyone's only source of knowledge.
Unicode on the other hand is the only way many people have to input and see text in their native language. When one group proposes being the ultimate solution to everyone's problems, and then pushes their standard forward as such, complaints about inadequacies in the solution presented are perfectly fair and valid.
Unicode is not "forced" unto the world either. It's just a good way to do the thing it does, but it doesn't happen without contribution from those who are impacted.
The word is "unicode". "Uni" as in "united". Wikipedia doesn't claim to be the One True Encyclopedia, but with Unicode it's literally right there in the name.
How can you not expect UC to reach out to the 1000s of languages out there, when their entire and sole reason to exist is to allow computers to work with all of those languages?
There's no point in taking on a job and then declaring it to be too difficult.
UC's job is to coordinate. There are plenty of initiatives for all kinds of languages. I haven't contributed, but I followed the mediaeval Unicode initiative a bit. The barriers to contributing seem very low. The Wikipedia analogy elsewhere here describes the situation very well. Complaining that Unicode sucks for language X and the consortium is to blame is like complaining that the language-X Wikipedia sucks and Wikimedia is to blame.
> The Indian government dropped the ball here. They went their separate way with "ISCII" and other such misguided efforts, instead of cooperating with UC
According to the beginning of chapter 12 of the Unicode Standard 7.0:
"The major official scripts of India proper [...] are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts. The first six columns in each script are isomorphic with the ISCII-1988 encoding, except that the last 11 positions, which are unassigned or undefined in ISCII-1988, are used in the Unicode encoding."
So blame the Indian government for any problems with ISCII.
The problems with Unicode support for global languages are indeed to be blamed on the Unicode Consortium. Nothing the Indian government did prevents the UC from availing themselves of the global knowledge needed to create a good global standard.
do you have any evidence that the UC has actively ignored requests from Bengali speakers? Has any Bengali speaker made proposals to the UC for fixing these issues? If yes, and the UC chose to ignore them, then there is some blame to be assigned with the UC. Otherwise, this is a non-issue.
Take, for example, Tibetan. The number of Tibetan speakers is minuscule compared to, say, Bengali. But still Tibetan has good support, because enough people took an interest in getting it supported and worked with the UC to get it done.
I think it needs to be stressed that anyone can submit character proposals to the consortium and work with them to get them included in the next versions of the standard. You don't need to be a member (By paying the hefty fee. Someone needs to pay for the operational costs of the consortium.) to have your suggestions taken into account.
Not true. Anybody with an email address can send feedback. And individuals who want to be more active and become an actual member have a $75 membership fee.
But then consider the implementation path to fixing the problem for a minority linguistic group being deliberately repressed by their government -- it would require blood. If there is some alternate process to work with the UC directly, that could be better, but it puts the UC in the position of judging a linguistic group's claims for legitimacy.
I agree that this refutes claims that the UC was negligent, but we can still say that they failed to be particularly assiduous in this case.
Unicode has excellent Tibetan support, despite that language and its script being notoriously difficult to work with.
Also, Bengali is not just a major language in India, but the national language of an entire country (Bangladesh) with 100m+ people. It really is almost entirely the fault of the Bangladeshi and West Bengali authorities if they can't get their shit together enough to submit a decent proposal to Unicode.
An increasing corpus of children's literature is printed in Cherokee syllabary today to meet the needs of Cherokee students in the Cherokee language immersion schools in Oklahoma and North Carolina. In 2010, a Cherokee keyboard cover was developed by Roy Boney, Jr. and Joseph Erb, facilitating more rapid typing in Cherokee and now used by students in the Cherokee Nation Immersion School, where all coursework is written in syllabary.[8] The syllabary is finding increasingly diverse usage today, from books, newspapers, and websites to the street signs of Tahlequah, Oklahoma and Cherokee, North Carolina.
The world (especially the technological) is changing too fast to rely on nation-state channels that proceed by forming commissions and writing position statements blah blah blah. It would be better for the Unicode consortium to start actively soliciting input and/or technical contributions from people.
I am an Indian and it shocks me that Indians are still blaming the British after 70 years of independence.
Is 70 years of independence not enough to make your language a "first class citizen"? Of course Bengali is a second-class language, because Bengalis didn't invent the standard.
Can we stop blaming white people for everything? Seriously, WTF.
Hebrew (and I'd guess Arabic and other right-to-left languages) works rather badly in Unicode when it comes to bidirectional rendering; however, to the extent that it's a result of Israeli/Egyptian/Saudi/etc. companies and/or governments failing to pay the $18K (the figure from TFA) for consortium membership, I kinda blame them and not the consortium. I mean, it's not a whole lot of money, even for a smallish company, not to mention a biggish state.
Also this bit was just amazing: "It took half a century to replace the English-only ASCII with Unicode, and even that was only made possible with an encoding that explicitly maintains compatibility with ASCII, allowing English speakers to continue ignoring other languages."
Seriously? Nobody forces anyone to use UTF-8, and even if it were the only encoding available - by how much the cost of these extra bytes reduces the standard of living in non-English-speaking countries, exactly?
It is in fact unfortunate that the pile of poo is standardized while characters from languages spoken by hundreds of millions are not, however. A pity that a good and surprising (to some) point was mixed with all that "white men" business.
The pile of poo is a great example. It is supported because the unicode consortium pays attention to the need of non western languages, not because it ignores them and would rather do silly things.
Japan has had a tendency to include all sorts of crap, including this example of actual crap, in its character sets/encodings. Japan also has a history of working with the various standards bodies like the Unicode Consortium. Because Japanese systems had the pile of poo, Unicode has it.
> Hebrew (and I'd guess Arabic and other right-to-left languages) work rather badly in Unicode when it comes to bidirectional rendering; however, to the extent that it's a result of Israeli/Egyptian/Saudi/etc. companies and/or governments failing to pay $18K (the figure from TFA) to pay for the consortium membership
I'm sorry? How does this have anything to do with encoding, versus software toolkits? Unicode and encoding generally has little to nothing to do with input methods or rendering, and if you can explain why Unicode in particular works better in one direction I am genuinely curious.
> It is in fact unfortunate that the pile of poo is standardized while characters from languages spoken by hundreds of millions are not, however
I don't think in 2015 Unicode lacks codepoints for any glyph in regular usage by hundreds of millions. OP has not cited any; the one he explicitly complains of was in fact added in 2005.
Before someone says Chinese, that's a bit more like citing some archaic English word missing from a spellchecker. I am skeptical that there is a Han character missing from Unicode that >> 1 million people recognize much less use on a regular basis.
Unicode defines a "logical order" and a "rendering order". The logical order is supposed to be the order in which you read the text - letters read earlier are closer to the beginning of the buffer. The rendering order is how the stuff appears on screen - where an English word inside Hebrew text, if you count the letters from right to left, will obviously have its last letter assigned a smaller number than its first letter.
This means that any program rendering Unicode text - which is stored in logical order - needs to implement the crazy "bidi rendering" algorithm which includes a guess of the "document language" (is it a mostly-English or a mostly-Hebrew text? that fact changes everything in how the character order affects on-screen positioning) and the arbitrary constant 64 for nesting levels.
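(If you want to poke at the inputs to that algorithm yourself: every code point carries a bidirectional category that bidi consults. A minimal sketch in Python, using only the standard unicodedata module:

    import unicodedata

    # 'L' = left-to-right letter, 'R' = right-to-left letter,
    # 'WS' = whitespace; the bidi algorithm maps logical order
    # to display order based on these categories.
    for ch in "ab \u05D0\u05D1":  # 'ab', a space, then Hebrew alef and bet
        print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch)}")

That's just the per-character input; the hard part is the reordering rules layered on top.)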
In this sense Unicode has a lot to do with rendering. As to input methods...
Could your program ignore it all and just let users edit text however you choose? Well, you'd need to translate the on-screen result to a logical order that can then be rendered by someone else the way Unicode prescribes, and you'd be breaking a near-universal (though terrible) UI standard, confusing users mightily. For instance, users are accustomed to Delete deleting a different character when the cursor sits between an English word and a Hebrew word, depending on where you came from (a single on-screen position corresponds to two logical-order positions, see?). You also can't store any hints about how things should be interpreted in the Unicode text itself if they are not part of the Unicode standard; you can only store such hints in a side-band channel, which is not available if you store plain text someone else might read. And if you don't do that, well, you can ignore Unicode altogether... except when handling paste commands, which will force you to implement bidi in all its glory, including a guess of what the document language was.
Now the part that I don't remember (it's been a decade since I last touched this shit) is whether Unicode mandates anything wrt editing, versus just implicitly compelling you to standardize on the same methods that make sense in light of their logical & rendering orders...
BTW - would I, or some bunch of Jews, Arabs etc. with a background in programming and linguistics beat the white folks in the quality of their solution? I dunno, it's a really really tough problem because you kinda don't want to reverse words in the file, on the other hand you kinda ought to "reverse" them on the screen - that is, spell some in the opposite direction of some others. You could keep them reversed in the file - a different tradeoff - or you could have some other bidirectional rendering algorithm, perhaps with less defaults/guesses. How it "really" ought to work I don't know, obviously neither left-to-right nor right-to-left writing systems were designed with interoperability in mind. Much as I hated bidi when I dealt with it and much as I despise it every day when it routinely garbles my email etc., it's hard to expect much better...
As to your comparison of Chinese characters in Unicode and old English words in a spellchecker... The latter is what, 10000x easier to fix/work around? (Not that I know anything about Chinese; I do see people in the comments counter his point about his own language, I just can't know who is right.)
Okay, how would you do it?
If you want bidirectional mixed language support, you're not going to get around implementing bidi rendering at some layer.
Character encoding is not that layer. You are making legitimate complaints, but they are largely in the realm of poor internationalization support. There may be finger-pointing between applications, toolkits, and OS middleware, but the character encoding should not be taking the bum rap for that.
Internationalization is a fundamentally difficult engineering problem and no character encoding is going to solve it.
> Could your program ignore it all and just let users edit text however you choose?
Sure, that's what happens in my text editor in my terminal, the unicode rendering rules are ignored, each code point gets one character on the screen, from left to right. Once things look how I expect them to in there, I test on real devices, in the real output context to make sure things are looking how they should; often I also have someone who can read and understand the text check too.
This works for me, but I'm a software developer, and I'm also not fluent in many languages, so seeing code points in logical order doesn't cause me problems with readability.
Was British rule actually a net negative, in retrospect? Have there been studies using objective (not emotional) criteria over countries that were colonies versus ones that weren't? I suppose you can't really quantify the value of the people who were destroyed by colonization, but you can look at the current population.
Also I just gotta wonder: suppose European or other relatively simple-to-encode languages didn't exist, and everyone used the OPs language. How would they have handled advancing computers? I've seen photos of Japanese typewriters and they look... unwieldy to say the least. And graphics tech took a while to get advanced enough to handle such languages, let alone input. (MS Japanese IME uses some light AI to pick the desired kanji, right?)
Disclaimer: I don't mean this in an offensive way, just a dispassionate curiosity.
>"Was British rule actually a net negative, in retrospect? Have there been studies done using objective criteria (not emotional) over counties that were colonies versus ones that weren't?"
It's both a positive and a negative. But if you want to calculate whether it was a "net" positive or "net" negative, then I'm afraid you're going to have to quantify very difficult-to-quantify concepts. And you'll have to do it on behalf of a lot of people, some of whom don't exist anymore to answer your question. In essence, next to impossible.
However, particularly in the case of colonialism, I will add that because it's such a difficult concept, it is exploited by manipulative scholars and politicians for personal/political agendas. So you have politicians in these countries claiming that it was a net negative, without any way to quantify it, and without acknowledging any good that came along with it. Now, of course, colonialism was bad, period. It can never be justified to enslave people, and it should never have happened.
Well that's why I said without taking into account people that were destroyed by it. That is, if we look purely at modern-day circumstances, are they better?
A comparison might be a country with overpopulation. If you kill x% of the population and enact some birth control plan, then 100 years later you might end up with a "net positive" discounting the people that were killed and the emotional effects of their loved ones.
Are there countries in Africa that weren't colonized, or some islands that weren't, that we could compare to ones that were and make some inferences?
Alternatively, you could compare countries that had "more" or "less" colonialist control/influence, and see if there is any correlation with their current performance. Though, as with all social-sciency questions, it's very difficult to isolate the variables enough to get deep insight into possible causal links.
It is really hard to quantify something like that. While the British siphoned off large amounts of natural resources and often caused widespread famines in India, they also gave us Western democracy, education in Western science, and access to all of Western knowledge.
You speak as though these things wouldn't have come without the British; I'm sure Japan today is a poor, backwards country, for they weren't blessed with the act of "civilizing" by the cross.
All this is out of context and has nothing to do with what OP talks about.
As for what you're talking about: it is a fact that India was one of the richest countries in the world before the British colonised it, and one of the poorest after the British left.
> suppose European or other relatively simple-to-encode languages didn't exist, and everyone used the OPs language. How would they have handled advancing computers?
Bengali isn't all that complicated; more so than English, yes, but definitely surmountable. Printing presses, typewriters, etc. have been used without incident for Bengali (and for other Indian languages) for a long time.
P.S.: It's not really a great idea to open a post with "Was British rule actually a net negative, in retrospect?" I understand you did not intend to offend, but at the very least it's a touchy subject, and a contentious and complicated history.
Although I agree with a lot of what you said, I perceive the last part, your post-scriptum, to be a little distorted. The Renaissance had many sources. One of them was the massive intellectual immigration from the crumbling Byzantine Empire, which served as a bridge for much intellectual work across time (from antiquity to the enlightened medieval period) and space (from the Near East to all over Europe). And it was mostly Greek in its cultural background.
I wonder if the author has submitted a proposal to get the missing glyph for their name added. You don't need to be a member of the consortium to propose adding a missing glyph/updating the standard. The point of the committee as I understand it isn't to be an expert in all forms of writing, but to take the recommendations from scholars/experts and get a working implementation, though more diverse representation of language groups would definitely be a positive change.
The author's explanation of which characters Chinese, Japanese, and Korean share is very limited. All three languages use Chinese characters in written language to varying extents, and in some cases the differences date back significantly less than a century. Though there are cases where the same Chinese character as represented in Japanese writing differs from its Traditional Chinese form (e.g. 国, Japanese version only because I don't have Chinese installed on this PC), which could be different still from its Simplified Chinese form, there are also many instances where the character is identical across all three languages (e.g. 中). Although I am not privy to the specifics of the CJK unification project, identifying these cases and using the same character for them doesn't sound unreasonable.
Edit- To be clear, Korean primarily uses Hangul, which basically derives jack and shit from the Chinese alphabet, and Japanese uses a mixture of the Chinese alphabet, and two alphabets that sort-of kind-of owe some of their heritage to the Chinese alphabet, but look nothing like it. If they are talking about unifying these alphabets, then they are out of their minds.
Not unifying them means that the fonts automatically work when you mix text/names written in these alphabets. It also means that mathematical/physical/chemical stuff (that typically uses Latin and Greek letters together) will just work. There is a similar reasoning behind all the mathematical alphabets in Unicode.
Furthermore, Unicode was supposed to handle transcoding from all important preexisting encodings to Unicode and back with no or minimal loss. Since ISO 8859-5 (Cyrillic) and 8859-7 (Greek) already existed (and both included ASCII, hence all the basic Latin letters), the ship had definitively sailed on LaGreCy unification.
On top of that, CJK unification affected so many characters that the savings really mattered, and it happened at a time when code points were only 16 bits, so it helped squeeze the whole repertoire in. All continental European languages suffered equally or worse back when all their letters had to be squeezed into 8 bits /and/ coexist with ASCII.
> Not unifying them means that the fonts automatically work when you mix text/names written in these alphabets. It also means that mathematical/physical/chemical stuff (that typically uses Latin and Greek letters together) will just work.
These are already completely separate symbols. Ignoring precomposition, there are at least 4 different lowercase omegas in unicode: APL (⍵ U+2375 "APL FUNCTIONAL SYMBOL OMEGA"), cyrillic (ѡ U+0461 "CYRILLIC SMALL LETTER OMEGA"), greek (ω U+03C9 "GREEK SMALL LETTER OMEGA") and Mathematics (𝜔 U+1D714 "MATHEMATICAL ITALIC SMALL OMEGA").
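You can confirm that straight from the character database; a quick sketch with Python's standard unicodedata module:

    import unicodedata

    # Four visually similar lowercase omegas, all distinct code points:
    for ch in ["\u2375", "\u0461", "\u03C9", "\U0001D714"]:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+2375 APL FUNCTIONAL SYMBOL OMEGA
    # U+0461 CYRILLIC SMALL LETTER OMEGA
    # U+03C9 GREEK SMALL LETTER OMEGA
    # U+1D714 MATHEMATICAL ITALIC SMALL OMEGA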
Unicode kind of does this already with the dotless 'i'; the uppercase of 'ı' and the lowercase of 'İ' are represented as the regular Latin 'I' and 'i' respectively, despite being semantically different letters.
Hungarian "a" is also a separate letter from Hungarian "á" but shares the same glyph with English "a" (edit: all vowels in their accented forms are considered distinct letters in Hungarian, while obviously they are the same letter with a modifier in some Latin-script languages).
Giving the forms 𝜙 and 𝜑 of "lowercase phi" different codepoints makes perfect sense to me. That doesn't mean that the upper-case variants of these can't share a codepoint. As presented elsewhere in this thread, "X.toUpper().toLower()" doesn't have to be "X". The same holds for "B → b" and "B → β" depending on the context. It's just that the savings from such a unification would be far smaller due to the smaller sizes of the relevant alphabets.
OK, but you still have a problem because I want to use the same font for Greek and Russian. What if my font is CURSIVE?
Russian and Greek have different cursive forms. You might unify κ and к, but actually the cursive form of κ looks like a Roman u.
So really if this were to happen you'd have "Russian" fonts and "Greek" fonts. Kind of like how Japanese and Chinese have to use different fonts for their languages.
I think their distinct calligraphic representations mean that unifying them would be destructive; whereas with regard to CJK, the unified characters are clearly represented the same way across the languages considered.
According to Wikipedia, this is being coordinated by the Ideographic Rapporteur Group, and "the working members of the IRG are either appointed by member governments, or are invited experts from other countries. IRG members include Mainland China, Hong Kong, Macau, Taipei Computer Association, Singapore, Japan, South Korea, North Korea, Vietnam and United States."
So this criticism of English speakers seems pretty unfounded! And the unification he's concerned about is being driven by a diverse group of experts in a variety of countries - so I'm not sure why the concern?
Exactly. The Han Unification project never tried to unify everything. They just took the set of common characters and unified them, leaving the rest alone. They may have made some mistakes in choosing which characters to unify, but for the most part they did a splendid job.
Korean (Hangul) has its own, massive block of over 11K code points. Japanese (Hiragana, Katakana, and other assorted symbols) also has its own block outside of the Unified Han block. Chinese characters that are clearly distinct get their own code points as well. How else would I write 国 and 國 in the same sentence?
Also, Antiqua and Fraktur used to be seen as different writing systems (the long s 'ſ', I and J being equivalent, and the Tironian et being some examples where they differ), yet this is largely ignored by Unicode (except when used in mathematics).
Yes, that's exactly right. In this case, Japanese uses the simplified version, but in others it uses the traditional version (or even a version that is slightly different from the current traditional version used in Taiwan or Cantonese.)
I found it extremely annoying that he doesn't even say specifically what the problem with his name is. So there's a letter that's unavailable? Which letter?
> Even today, I am forced to do this when writing my own name. My name is not only a common Indian name, but one of the top 1,000 names in the United States as well. But the final letter has still not been given its own Unicode character, so I have to use a substitute.
Not as descriptive as it could be, but this article isn't about him.
His writing there is pretty confusing. He starts by complaining about a glyph that was missing until 2005, but was either fixed in 2005 or approximated by combining some characters 'ত + ্ + = ৎ'. He doesn't really make it clear whether ৎ is a substitute for the glyph, or whether it's the correct glyph and this is a case of an input system not making it easy to enter; but it seems like the glyph was added in 2005 and he's complaining about the input method. Assuming it is a case of a clunky input system, pointing the finger at the Unicode consortium seems pretty weak, since so far as I understand it, the various OS vendors/app platforms handle that implementation.
Imagine if the letter Q had been left out of Unicode's Latin alphabet. The argument against it is that it can be written with a capital O combined with a comma. (That's going to play hell with naive sorting algorithms, of course, but oh well.) Oh, and also imagine your name is Quentin.
But the letter wasn't left out of Unicode; it's actually typed in the article. It's just internally represented as multiple codepoints, much like one part of my name (é) may be.
Frankly, this is irrelevant to the actual problem, which is the input system, and which has nothing to do with Unicode. Nothing prevents a single key from typing multiple codepoints at once.
> It's just internally represented as multiple codepoints
And in fact it is not, and even in the article it is U+09CE. One codepoint. If his input method irks him, he's as free to tweak it as I am to switch to Dvorak.
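This is easy to check in a Python shell (standard unicodedata module only):

    import unicodedata

    # khanda-ta has been a single code point since Unicode 4.1 (2005):
    print(unicodedata.name("\u09CE"))     # BENGALI LETTER KHANDA TA

    # ...whereas an 'é' can legitimately be one code point or two:
    print(len("\u00E9"), len("e\u0301"))  # 1 2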
Also folks, there's no "CJK unification" project. It's Han unification. Han characters are Han characters, just like Latin characters are Latin characters. Just because German has ß and Danish has Ø doesn't mean A isn't a Latin character rather than, say, a French one. Not to get all Ayn Rand-y, but A is A is U+0041 in all Western European/Latin alphabets. It makes sense for 中国 and 日本 to have the same encoding in Chinese and Japanese.
I hate to say it, but I think the author's objections seem to stem from his lack of understanding of character encoding issues. I don't know Bengali at all and so I will try to refrain from commenting on it, but I do speak and read Japanese fluently and Han Unification is a very, very good thing. Can you imagine the absolute hell you would have to go through trying to determine if place names were the same if they used different code points for identical characters -- just because of geopolitical origins?
Yes, there are some frustrating issues -- it has been historically difficult to set priorities for fonts in X and since Chinese fonts tend to have more glyphs, you often wind up with Chinese glyphs when you wanted Japanese glyphs. But this is not an encoding issue. Localization/Internationalization is really difficult. Making a separate code point for every glyph is not going to change that for the better.
I feel that way too. The distinction between codepoint, glyph, grapheme, character, (...) is not an easy one, and that's what he seems to be stumbling over. Unicode worries itself about only some of these issues, many of the other issues are about rendering (the job of the font) or input.
Combining characters are not just used for Bengali though. E.g. umlauted letters in European languages can also be expressed using combining characters, and implementations need to deal with those when sorting.
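Which is why implementations normalize before comparing or sorting. A minimal sketch in Python:

    import unicodedata

    precomposed = "\u00F6"   # 'ö' as a single code point
    combining   = "o\u0308"  # 'o' + COMBINING DIAERESIS
    print(precomposed == combining)   # False: different code point sequences
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True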
> Imagine if the letter Q had been left out of Unicode's Latin alphabet.
To properly write my European last name I have to press between 2 and 4 different simultaneous keys, depending on the system. Han unification is beyond misguided, but combining characters are not the problem.
Han unification as a whole is misguided? I'll grant you that some characters which were unified probably shouldn't have been, and maybe some that should have been weren't, but what's the argument for the whole thing being misguided?
Should the Norwegian A and the English A be different Unicode code points just because Norwegian also has Ø, proving that it is a different writing system? You may want to debate whether i and ı should be the same letter (they aren't), but most letters in the Turkish alphabet are the same as the letters in the English alphabet.
Well, the Turkish i/ı/İ/I is, I think, exactly the example I would have come up with of characters that look the same as i/I but should have their own code points, just like Cyrillic characters have their own code points despite looking like Latin characters.
Absolutely. So i/ı/İ/I do have their own codepoints. But the rest of the letters, which are the same, don't. Just like Han unification: letters which are the same are the same, and those which are not are not, even if they look pretty close.
The thing is that the Turkish "i" and "I" don't have their own codepoints; they share the Latin "i" and "I" codepoints, when they should have been their own codepoints representing the same glyphs. That way, going from I to ı and from i to İ wouldn't be a locale-dependent problem.
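You can watch the default (untailored) case mappings get Turkish wrong in any Python shell; locale-correct casing needs a tailored library (the PyICU call below is an assumption about what's installed, not the only option):

    # Unicode's default case mappings, wrong for Turkish text:
    print("i".upper())  # 'I'  -- Turkish expects dotted 'İ' (U+0130)
    print("I".lower())  # 'i'  -- Turkish expects dotless 'ı' (U+0131)

    # Locale-aware tailoring, e.g. via the third-party PyICU package:
    # import icu
    # print(icu.UnicodeString("I").toLower(icu.Locale("tr")))  # 'ı'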
When Chinese linguists came up with hanyu pinyin, they specifically wanted to use Latin characters (1) for Chinese phonetics, so that Chinese phonetic writing could use what we'd call the "white men's writing system".
Now, they did use the letter Q for the sound tɕʰ that was formerly often romanized as "ch". It is not really a "k" sound, as Q is in English.
Are people now saying that hanyu pinyin should use a different coding to English, because it would be more "respectable" for non-English languages to have their own code points even if the character has same roots and appearance? That is absolutely pointless. The whole idea of using Q for tɕʰ is that you can use the same letter, same coding, same symbol as in English.
(1) OK they did add ü to the mix, although that is usually only used in romanization in linguistics or textbooks, and regular pinyin just replaces it with u.
My first choice as theoretical Quentin wouldn't be "how can I frame this accidental, perhaps even flagrantly disrespectful omission as antiprogressive and dissect the credentials, experience, and ethnicity of the people who made the mistake via culture essay," it would probably be "where do I issue a pull request to fix this mistake or in what way can I help?"
Maybe that's just me. I look forward to the future where any mistake not involving a straight white Anglo-Saxon man or his customs can be built up as an antiprogressive agenda, and the best advocacy is taking down the people who made it rather than fixing the problem that is the, you know, problem.
(As an aside, imagine my surprise to see a Model View Culture link on HN given how much MVC absolutely hates and criticizes HN, including a weekly "worst of HN" comment dissection.)
Anyone can propose the addition of a new character to Unicode. It doesn't take $18,000 as some people think. You just need to convince the Unicode Consortium that it makes sense (preferably with solid evidence on use of the character). The process is discussed at: http://unicode.org/pending/proposals.html
I have a proposal of my own in the works to add a character to Unicode, so I'll see how it goes. There's a discussion of how someone successfully got the power symbol added to Unicode at https://github.com/jloughry/Unicode so take a look if you're thinking of proposing a character.
"My first choice as theoretical Quentin wouldn't be "how can I frame this accidental, perhaps even flagrantly disrespectful omission as antiprogressive and dissect the credentials, experience, and ethnicity of the people who made the mistake via culture essay," it would probably be "where do I issue a pull request to fix this mistake or in what way can I help?" "
Holy cow, CJK unification is a terrible idea. Maybe if it had originated from the CJK governments it might be an OK idea, but the idea of Western multinationals trying to save Unicode space by disregarding the distinctness of a whole language group is idiotic.
The fundamental role of an institution like the Unicode Consortium is to be descriptive, not prescriptive. If there is a human script passing certain low, low thresholds, it should be eligible to be included, in its distinct whole, in Unicode.
To oppose Han unification is to say that an 'a' in English and an 'a' in French should be different code points, because they're different languages.
Alternatively, if Unicode directly encoded words, rather than letters, of Western languages akin to ideographs in East Asian languages, it's like arguing that 'color' and 'colour' should be separate code points.
I don't disagree, while observing that it plays hell with security if (forgive my use of Latin; I don't have a codepoint translator handy) 'bat.com' and 'bat.com' are two different websites because the 'a' in the first is a Chinese-a and the 'a' in the second is a Korean-a.
(Of course, this calls into question the wisdom of expanding DNS into the Unicode space in the first place---a space that does nothing like guarantee 1-to-1 association between visual glyph and code for an application that has been built on the assumption that different codes are visually distinguishable. But that ship has sailed).
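For the curious, the spoof is trivial to construct; a sketch using Cyrillic lookalikes and Python's built-in IDNA codec:

    latin = "bat.com"
    spoof = "b\u0430t.com"  # U+0430 CYRILLIC SMALL LETTER A, not Latin 'a'
    print(latin == spoof)        # False: visually identical, different strings
    print(spoof.encode("idna"))  # the Punycode form exposes the difference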
This article is imbued with its own form of curious racism. In particular, I became suspicious of its motives at the line:
> "It took half a century to replace the English-only ASCII with Unicode, and even that was only made possible with an encoding that explicitly maintains compatibility with ASCII, allowing English speakers to continue ignoring other languages."
ASCII-compatibility was essential to ensure the adoption of Unicode. It's not because English speakers wanted to ignore anyone or anything; it's because without it, Unicode would never have been adopted, and we'd be in a much worse position today.
In other words, the explanation is technical, not racial.
It's sort of hilarious that he said that, given that English speakers were able to encode every character in 7-bit ASCII. The issues around standardising characters arose because non-English characters were being squabbled over between the French, Russians and a whole bunch of other non-English countries.
In essence, he hasn't understood that ASCII was used as the base for Unicode because it was widely used. In fact, it's actually ISO 8859-1 that was used, because of its wide coverage of a variety of languages, far more than any of the other 8859-x character sets.
I cannot speak to anything else he's said, aside from saying that trying to encode all the world's languages is bloody hard.
Even when a limited number of countries try to nut out a standard for 128 characters, it's a nightmare. And don't forget that they were competing with EBCDIC.
Things like this:
"No native English speaker would ever think to try “Greco Unification” and consolidate the English, Russian, German, Swedish, Greek, and other European languages’ alphabets into a single alphabet."
The author probably ignores that different European languages used different scripts until very recently. For example, Gothic and other distinct scripts were used in lots of books.
I have old books that take something like a week before you can read them fast, and they are in German!
But it was chaos, and it unified into a single script. Today you can open any book, Greek, Russian, English or German, and they all use the same standard script, although they include different glyphs. There is a convention for every symbol. In Cyrillic you see an "A" and an "a".
In fact, any scientific or technical book includes Greek letters as something normal.
It should also be pointed out that Latin characters are not Latin, but modified Latin. E.g., lowercase letters did not exist in the Roman Empire; they were introduced by other languages and "unified".
About CJK: I am not an expert, but I have lived in China, Japan and Korea, and in my opinion unification has been pushed by the governments of those countries because it has lots of practical value for them.
Learning Chinese characters is difficult enough. If they don't simplify it, people just won't use it when they can use Hangul or kana. With smartphones and tablets, people there are not handwriting Chinese anymore.
It makes no sense to name yourself with characters nobody can remember.
Right, I was going to argue against that too. Changing fonts every letter will always look weird. There's a lot of different ways to shape these letters that are all counted as the same.
> Learning Chinese characters is difficult enough. If they don't simplify it people just won't use it
CJK unification doesn't make learning Chinese characters any easier, it only means that characters that look exactly the same will use the same code point. This is not related to simplifying characters.
Also, I regularly see people hand-writing Chinese characters on tablets, so it does happen. And for those who use Pinyin to enter characters, it doesn't matter if the characters are simplified or traditional or unified or whatever, because they only have to pick the right one in their IME.
The poo emoji is not part of Unicode because rich white old cis-gendered mono-lingual oppressive hetero men thought it'd be a fun idea (outrage now!!).
Emoji were adopted from existing cellphone "characters" in Japan, and Japan is famously lagging in Unicode adoption because some Japanese names cannot (could not?) be written in Unicode. It all just seems to be very normal (=inefficient, quirky) design by committee.
The comparison to the play "My Fair Lady" is not very convincing; I would suggest the author remove it, as it weakens the argument. First, the fictional character says 'no fewer than' as an admission that there are more. Second, even if we consider this a valid complaint, the author himself points out in his example that this 'common sentiment' comes from a play written a century ago. Using a fictional character from a time when the world's knowledge of language was far smaller than it is today does not help support your goals.
Even though I'm not a native English speaker and couldn't write my name in ASCII, I really despise Unicode.
It's broken technically and a setback socially.
Unicode itself is such an unfathomably huge project that it's impossible to do it right: too many languages, too many weird writing systems, and too many ways of doing mathematical notation on paper that can't be expressed.
Just look at the code pages; they are an utter mess.
Computers and ASCII were a chance to start anew, to establish English as a universal language, spoken by everybody.
The pressure on governments that wanted to partake in the digital revolution would have forced them to introduce it as an official secondary language.
Granted, English is not the nicest language, but it is the best candidate we have in terms of adoption and relative simplicity (Mandarin is another contender, but several thousand logograms are really impractical to encode).
Take a look at the open source world, where everybody speaks English and collaborates regardless of nationality.
One of the main factors that makes this possible is that we found a common language, forced on us by the tools and programming languages we use.
If humanity wants to get rid of wars, poverty and nationalism, we have to find a common language first.
A simple encoding and universal communication is a feature, fragmented communication is the bug.
Besides.
UTF-8 is broken because it doesn't allow for constant time random character access and length counting.
Why do you think English is the best candidate for the universal language, and how do you define simplicity? First of all, pronunciation and spelling are almost unrelated, and you have to learn them separately. That results in really different accents throughout the world. Even if you look at AmE and BrE, they differ a lot at the word level. Which one do you want to choose? Besides, personally I find English really ambiguous, and the density of idioms in average text repelling, although that's only a subjective opinion.
English's use of the Latin alphabet seems like a plus, but there's at least one language that uses that simple alphabet better.
> Besides. UTF-8 is broken because it doesn't allow for constant time random character access and length counting.
And why would you want that? And how do you define length? Are you a troll?
English is the best candidate because it has the second-largest user base (1.2 billion, vs. 1.3 billion for Mandarin), http://en.wikipedia.org/wiki/List_of_languages_by_total_numb...
and is spoken twice as widely as the third most popular language, Spanish (0.55 billion).
If I got to pick the universal language, it would be Lojban (a few hundred speakers), but that is not a realistic goal; teaching the other 6 billion people a language that is already spoken by 1/7th of the population is at least plausible.
> Why would you want that...
Why would you not want that?! Many popular programming languages are based on array indexing through pointer arithmetic; having a variable-width encoding there is a horrible idea, because you have to iterate through the text to get to an index.
Length is the number of characters, which is just the number of bytes in ASCII, but which has to be calculated by looking at every character in UTF-8.
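To make that concrete: counting code points in UTF-8 is a linear scan, since every byte that isn't a continuation byte (0b10xxxxxx) starts a new code point. A sketch:

    def utf8_codepoint_count(data: bytes) -> int:
        # Count the bytes that begin a code point; O(n) in the byte length.
        return sum((b & 0xC0) != 0x80 for b in data)

    print(utf8_codepoint_count("hello".encode("utf-8")))   # 5
    print(utf8_codepoint_count("আদিত্য".encode("utf-8")))   # 6 (from 18 bytes)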
Even if 1.2 billion seems like a lot, that's still a small fraction of the world's population. So any choice of a universal language would force the majority of the world to learn a new one. That's why I think winning a popularity contest is a poor argument; we shouldn't look at that, and should instead focus on things like simplicity (which I don't find in English), speed of learning, consistency, expressiveness, etc. I'd be happy to use Lojban (it's easier for machines too, I guess) or any other invented language. If I had to pick one of the popular ones, I'd like Spanish more than English.
I was asking what your specific use cases are that prevent you from treating a UTF-8 string as a black-box blob of bytes. If dealing with international text, you'd rather use predefined functions anyway. If you want to limit yourself to ASCII, just do it and simply don't touch bytes >= 0x80.
And what is a character? Do you mean graphemes or codepoints? Or something else? A few years ago I thought like you – that calculating length is a useful feature. But most often, when you think about your use case, you realise either that you don't need length, or that you need some other kind of length: monospace width, rendered width, or some kind of entropy-based amount of information. Twitter is the only case I know of where you really want to count "characters", and I find it really silly: e.g. a Japanese tweet vs. an English tweet.
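To make the ambiguity concrete, here are three defensible "lengths" for one short string (the grapheme count uses \X from the third-party regex module; assuming it's installed):

    import regex  # third-party; supports \X for grapheme clusters

    s = "nai\u0308ve"  # 'naïve' with a combining diaeresis
    print(len(s.encode("utf-8")))         # 7 bytes
    print(len(s))                         # 6 code points
    print(len(regex.findall(r"\X", s)))   # 5 grapheme clusters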
I don't understand; I don't feel like character combination using the zero width joiner is on the same level as 13375p34k. It looks like the character just doesn't have a separate code-point, but is instead a composite, but still technically "reachable" from within Unicode, no?
That's effectively the only way to write several Indian languages with Unicode - ZWJ + ZWNJ and a decent font which supports all the ligatures.
The recent Sans Bullshit Sans font is a clear example of how ligatures work. And in Malayalam there are consonant patterns which do not match the Sanskrit model, which makes it rather odd to write half-consonants which are full syllables (വ്യഞ്ജനം + ് zwj).
And my name is written with one of those (ഗോപാൽ) & I can't say I'm mad about it because I found Unicode to be elegant in another way.
Somewhere in the early 2000s, I was amazed to find out that the Unicode layouts for Malayalam as bytes in UTF-8 were sortable as-is.
As a programmer, I found that detail of encoding-to-sort-order very fascinating, as it meant that I had to do nothing to handle Malayalam text in my programs - the collation order is implicit, provided everyone reads the ZWJ and ZWNJ in their sorting orders.
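That property is by design: UTF-8 was constructed so that bytewise comparison of encoded strings matches code point order. A quick sketch:

    # If a block's code points follow collation order, byte sorting works.
    words = ["\u0D15", "\u0D05", "\u0D2E", "\u0D17"]  # a few Malayalam letters
    assert sorted(words) == sorted(words, key=lambda w: w.encode("utf-8"))
    print(sorted(words))  # അ ക ഗ മ, in code point (= collation) order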
It's like typing ` + o to get "ò", isn't it? You can argue that ò is actually an o with that grave accent, while that character is not ত + ্ + an invisible joining character, but that's an input method thing, and there is a ৎ character after all.
Most devanagari glyphs don't have their own codepoint. Marathi/Hindi/Sanskrit (which use devanagari) have a bunch of basic consonants and some vowels (which can appear independently or as modifiers). All the glyphs are logically formed by mixing the two, so the glyph for "foo" would be the consonant for "f"[1] plus the vowel modifier for "oo". When typing this in Unicode, you would do the same, type फ then ू, creating फू.
It gets interesting when we get to consonant clusters. As mentioned in [1], the consonants have a schwa by default, so the consonant for s followed by the consonant for k with a vowel modifier for the "y" sound would not be "sky", but instead something like "səky" (suh-ky).
So how do we write these? We can do this in two ways. One way is to use the no-vowel modifier, which looks like a straight tail[3] (on स, the consonant for "s", or "sə", the tail looks like स्), and follow that by the other consonant. So "sky" would be स् कै [2]. This method is rarely used, and the straight-tail is only used when you want to end a word with a consonant[4].
The more common way of doing multiple consonants is by using glyphs known as consonant clusters or conjuncts[5]. For "sky", we get स्कै, which is a partial glyph for स stuck to the glyph for कै. For most clusters you can take the known "partial" form of the first glyph and stick it to the full form of the second glyph, but there are tons of exceptions, e.g. द + ध = द्ध, ह + म = ह्म (the second character was broken), and whatnot. See http://en.wikipedia.org/wiki/Devanagari#Biconsonantal_conjun... if you want a full table.
There aren't individual Unicode codepoints for this, not even codepoints for the straight-tail form of the consonants. I typed स्कै as स्+कै which was itself typed as स + ् + क + ै. This isn't an irregular occurrence either, consonant clusters (with a vowel modifier!) are pretty common in these languages[6].
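To make that composition concrete, a sketch in Python with the code points spelled out (names as in the standard):

    # स्कै assembled from its logical parts; the font/shaper does the rest.
    parts = [
        "\u0938",  # DEVANAGARI LETTER SA
        "\u094D",  # DEVANAGARI SIGN VIRAMA (suppresses the inherent vowel)
        "\u0915",  # DEVANAGARI LETTER KA
        "\u0948",  # DEVANAGARI VOWEL SIGN AI
    ]
    skai = "".join(parts)
    print(skai, len(skai))  # one visual cluster, 4 code points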
I personally don't see anything wrong with having to use combining characters to get a single glyph. If it's possible and logical for it to be broken up that way, it's fine. With this trick, it's possible to represent Devanagari as a 128-codepoint block (http://en.wikipedia.org/wiki/Devanagari_%28Unicode_block%29), including a lot of the archaic stuff. It's either that, or you make a character for every combined glyph there is, which is a lot[7]. One could argue that things like o-umlaut get their own codepoint but स्क doesn't; yet o-umlaut is one of maybe 10 such characters for a given European language, whereas स्क is one of around 700 (and that number is being conservative with the "usefulness" of the glyph, see [7]).
The article is never quite clear about which glyph Aditya finds lacking for his name (sadly, I don't know Bengali, so I can't figure it out), but from the comments it seems like it's something which can be inputted in Unicode, just not as a single character. That's okay, I guess. And if it's not showing up properly, that's a fault of the font. (And if it's hard to input, a fault of the keyboard.)
It becomes a Unicode problem when:
- There is no way to input the glyph as a set of unicode code points, or
- The sequence of Unicode code points for the glyph can also mean and look like something else in a given context (fonts can only implement one rendering, so it's not fair to blame them)
[1]: well, fə, since the consonants are schwa'd by default. Pronounced "fuh" (ish)
[2]: the space is intentional here so that I can type this without it becoming something else, but in written form you wouldn't have the space. Also it's not exactly "sky", but close enough.
[3]: called paimodi ("broken foot") in Marathi
[4]: which is pretty rare in these languages. In some cases however, words that end with consonant-vowel combinations do get pronounced as if they end with a consonant (http://en.wikipedia.org/wiki/Schwa_deletion_in_Indo-Aryan_la...), but they're still written as if they ended with a vowel (this is true for my own name too, the schwa at the end is dropped). Generally words ending with consonants are only seen in onomatopoeia and whatnot.
[5]: called jodakshar ("joined word") in Marathi
[6]: Almost as common as having two side by side consonants in English. We like vowels, so it's a bit less common, but still common.
[7]: technically infinite, though that table of 700-odd biconsonantal conjuncts would contain all the common ones (assuming we still have the vowel modifying diacritics as separate codepoints), and adding a few triconsonantal conjuncts would represent almost all of what you need to write Marathi/Hindi/Sanskrit words. It wouldn't let you represent all sounds though, unless you still have the ् modifier, in which case why not just use that in the first place?
Getting rid of CJK unification would better model actual language change in the future (France, for instance, has a group that keeps a rigorous definition of the French language up to date -- I would enjoy giving them a subset of the codes to define how to write French).
But the general principle sounds odd. Should 家, the simplified Chinese character, and 家, the traditional Chinese character, have different codepoints? Should French not be written using characters from the lower, English code points just because it needs a couple of extra characters? Should Latin be written using a whole new set of code points even though it needs none not contained in ASCII?
There was an academic proposal in the '90s for something called "multicode" (IIRC) that did exactly this: every character had a language associated with it, so there were as many encodings of "a" as there were languages that used "a", and all of them were different; or at least every character was somehow tagged so that the language it "came from" was identifiable.
Fortunately, it never caught on.
The notion that some particular squiggle "belongs" to one culture or language is kind of quaint in a globalized world. We should all be able to use the same "a", and not insist that we have our own national or cultural "a".
The position becomes more absurd when you consider how many versions of some languages there are. Do Australians, South Africans and Scots all get their own "a" for their various versions of English? What about historical documents? Do Elizabethan poets need their own character set? Medieval chroniclers?
Building identity politics into character sets is a bad idea. Unifying as much as practically possible is a good idea. Every solution is going to have some downsides, some of them extremely ugly, but surely solutions that tend toward homogenization and denationalization are to be preferred over ones that enable nationalists and cultural isolationists.
> Building identity politics into character sets is a bad idea. Unifying as much as practically possible is a good idea. Every solution is going to have some downsides, some of them extremely ugly, but surely solutions that tend toward homogenization and denationalization are to be preferred over ones that enable nationalists and cultural isolationists.
glib white supremacists are the best kind of white supremacists. "it's progress!"
Re: the Han Unification debate that's going on in parallel here,
I think CJK unification makes sense from the Unicode point of view (although if they had to choose again after the expansion beyond 16-bit I doubt they'd bother with the political headache). The problem stems from the fact that only a few high-level text storage formats (HTML, MS Word, etc) have a way to mark what language some text is in. There's no way to include both Japanese and Chinese in a git commit log, or a comment on Hacker News.
Sure you can say "that's just the problem of the software developer!" but that's what we said about supporting different character sets before Unicode, and we abandoned that attitude. Hacker News is never going to add that feature on their own.
What's needed is either a "language switch" glyph in Unicode (like the right-to-left/left-to-right ones) or a layer on top of Unicode which implements one and gets universally adopted in OSes and rendering systems.
While it is good to bring awareness to this, we are still growing in this area. In fact, we should applaud the efforts so far: we even have a standard that somewhat works for most of the digital world. Does it need to evolve further? Yes.
I am sure the engineers and multilingual people who stepped up to do Unicode and organize it aren't trying to exclude anyone. Largely it comes down to who has been investing the time and money. It may even be easier to fund and move this along now; it was hard to fund anything like this before the internet really hit, when software evolution and standards largely relied on corporate funding.
In no way should the engineers or the group that got us this far (the UC) be chided or lambasted for progressing us to this step; this is truly a case of no good deed going unpunished.
Generally true, but the problem here is not an issue of bandwidth or racism. Unicode can represent this character, but does so with two codepoints, a technical decision the author doesn't feel is useful. He blames this on the dominance of white people in the work (a questionable assumption, given the extensive list of international liaisons and international standards bodies involved, which he doesn't mention). The participants in Unicode, including a native Bengali speaker who responded above, considered the argument presented but chose a different path to be consistent with how other characters are treated. The author needs to more carefully distinguish the codepoint, input, and rendering issues raised in his argument.
I think their argument was: The characters look the same (e.g. Russian's first character and the English A) but have different meanings.
So in this example, if you searched for the English word "Eat", which is also a completely legal Russian word (E, A, and T exist in both English and Russian), it would mean nothing remotely similar.
I don't know if they're right or wrong. I am just saying that might be the point they were trying to make. You could make a Greco Unified unicode set and it would work fairly well, but you might wind up with some confusing edge cases where it isn't clear what language you're reading (literally).
This could be particularly problematic for automation (e.g. language detection), since in some situations any Greco-unified language could look similar to any other (particularly as the text gets shorter).
English, French, German, Italian, Spanish and several other European languages have mostly identical character sets and even large numbers of similar or identical words. Computers detect these languages just fine. I think we'll be okay.
Quite a few English and Russian Cyrillic letters unify just fine. E and A unify, and have identical lowercase forms, e and a. They don't really have different meanings, no more so than the letters E and A in English and French. T is more interesting: it has the same phonetic sound, but a different lowercase appearance: t in English, т in Russian. In this case, unification would be pretty terrible.
For simple alphabet-type languages, the basic rule should be: if the uppercase and lowercase look the same, then unify mercilessly. P (English) and Р (Russian) should unify even though they represent different consonants. But not V (English) and В (Russian): they sound the same, but have totally different graphemes. On the other hand, unifying B (English) and В (Russian) does not make sense: the lowercase forms look different: b (English) and в (Russian).
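For reference, the status quo keeps all of these apart no matter how identical the glyphs look; you can see it in the character names (standard unicodedata module):

    import unicodedata

    # Identical-looking capitals, deliberately not unified:
    for ch in ["P", "\u0420", "B", "\u0412"]:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+0050 LATIN CAPITAL LETTER P
    # U+0420 CYRILLIC CAPITAL LETTER ER
    # U+0042 LATIN CAPITAL LETTER B
    # U+0412 CYRILLIC CAPITAL LETTER VE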
Sounds like the major problem with Unicode (and the author's complaints) was always where to draw the line. Han unification went too far and included too many characters that look different. With other languages, some common combinable characters were forced into diacritic representation rather than getting their own code points. To me, the first problem seems way more serious.
I think the real reason was to preserve round-trip compatibility (legacy char -> unicode char -> legacy char) with the existing encodings for those alphabets.
It would be impossible to do. Even back in 1991, when Unicode was conceived, almost all the encodings in use were ASCII-compatible.
For languages that would be affected by a Greco-unification, that meant the encodings in use before Unicode contained both the Latin script and their "national" script.
Implementing Greco-unification in Unicode would therefore mean that round-trip lossless conversion (from origin encoding to Unicode and back into the origin encoding) would be impossible, greatly limiting Unicode's adoption.
No such problem existed with Han characters; in fact, JIS X 0208 (the character set used for Shift-JIS) did a very similar thing to Unicode's Han unification.
In the absence of backwards-compatibility problems, I would be in favor of Greco-unification too.
I see many comments about Han unification being a bad idea, but I am not seeing any reason why it was such a bad idea. I am from a CJK country and I find it makes a lot of sense. Most commonly used characters should be considered identical regardless of whether they are used in Chinese, Japanese, Korean, or Vietnamese. Sure, there are some characters that are rendered improperly depending on your font, but I don't think that makes Han unification a fundamentally bad idea.
Is it really dependent on the font, or on some language metadata? Having it depend on the font seems stupid, since a font should ideally be able to represent all languages which use a script encodable in Unicode.
> He proudly announces that there are ‘no fewer than 147 Indian dialects’ – a pathetically inaccurate count. (Today, India has 57 non-endangered and 172 endangered languages, each with multiple dialects – not even counting the many more that have died out in the century since My Fair Lady took place)
So, how many were there really? At the time, I mean.
I believe the number of "dialects" named in My Fair Lady can be largely explained by the lack of clear distinction between language and dialect over the years. From [1]:
"There is no universally accepted criterion for distinguishing a language from a dialect. A number of rough measures exist, sometimes leading to contradictory results. The distinction is therefore subjective and depends on the user's frame of reference."
Getting upset about Henry Higgins's estimation of the number of Indian "dialects" in a play from many decades ago doesn't make sense to me. His character was deliberately portrayed as a regressive lout, and terminology has surely changed in the intervening years.
I think the numbers are somewhat disputed. The People's Linguistic Survey of India says there are at least 780, with ~220 having died out in the last half century.[1] The Anthropological Survey of India reported 325 languages.[2]
The discrepancies are made particularly tricky because of the somewhat ambiguous distinction between languages and dialects.
I'm not an anthropological linguist, so I can't say with any authority. I only listed two figures there, but there are many more disputed figures.[1] There are a lot of difficulties classifying languages in India because the line between language and dialect can be tricky to pin down.
The meanings of "language" and "dialect" are surprisingly tied up in politics. Aside from what various folks in India speak, consider...
In China, too, we say that people in different regions speak different dialects: the national standard Mandarin; Shanghainese; Cantonese; Taiwanese; Fukienese; and others. Someone who speaks only one of these languages will be entirely unable to talk to someone who speaks only a different one. I'm friends with a couple, the guy being from Hong Kong and the girl from Shanghai; at home, their common tongue is English. So in what way can these different ways of speaking be considered mere dialects?
But on the other side of the coin, there are the languages of Sweden and Norway. We like to call these different languages, but a speaker of one can readily communicate with a speaker of the other. Wouldn't these be better considered dialects of the same language? I was recently on vacation in Mexico, and at the resort there was a member of the entertainment staff who came from South Africa, a native speaker of Afrikaans. She told me that she recently helped out some guests who came from the Netherlands and spoke poor English (which is usually the lingua franca when traveling). Apparently Afrikaans and Dutch are so close that she was able to translate Spanish or English into Afrikaans for them, and they were able to understand it through their Dutch. Again, Afrikaans and Dutch seem to be dialects of the same language (and, I think, Flemish as well).
I think the answer is that language is commonly used as a proxy for, or excuse for, dividing nations. So if you want to claim that China is all one nation, you have to claim that those different ways of speaking are just dialects of the same language. Conversely, to claim separate national identities for Norwegians and Swedes, we have to say that those are different languages.
I Can Text You A Pile of Poo, But I Can’t Write My Name
...
- by Aditya Mukerjee on March 17th, 2015
What is the glyph missing from this?
I know it's not ideal, but some uncommon glyphs have always been omitted from charsets; for example, ASCII never included Æ (http://en.wikipedia.org/wiki/%C3%86), and it was replaced by "ae" in common usage.
> He proudly announces that there are ‘no fewer than 147 Indian dialects’ – a pathetically inaccurate count.
Wow. How can a country function like this? Is everyone proficient in their native language plus a 'common' one, or are all interactions supposed to be translated inside the same country? Regardless of historical and cultural value, if that's the case, it seems... inefficient.
I do realize that there are more countries like this, but the number of languages seems way too high. I am really curious how that works.
> How can a country function like this? Is everyone proficient in their native language plus a 'common' one
Yes, and there may be multiple "common" languages if the country is large/populated enough or for historical reasons (Switzerland has 4 official languages — though official acts only have to be provided in 3 of them, German, French and Italian — and they only have 8 million people)
> Regardless of historical and cultural value, if that's the case, it seems... inefficient.
Well, telling people to fuck off with their generations-old gobbledygook and imposing a brand-new language on them tends to make them kind of restless, so unless you're willing to assert your declarations in blood (or at least through the targeted suppression of non-primary languages)…
The latter has happened quite a bit; e.g. post-revolutionary France tried very hard to stamp out both dialects and non-French languages until very recently (the EU shows a distinct lack of appreciation for the finer points of linguicide and linguistic discrimination), and the UK did the same throughout the Empire.
> I do realize that there are more countries like this
Almost all of them, aside from former British colonies (where native languages were by and large eradicated), at least to an extent (many countries carried out linguistic unification as part of their rise as nation-states, with varying degrees of success).
India has two "official languages" used across the country in business, government, and education: Hindi and English. There are also more widely used regional languages recognized by state governments, such as Tamil, Gujarati, Punjabi, and the author's native Bengali. Educated Indians are usually multilingual, speaking Hindi, which they learn at school, and one or more regional languages.
This is not a unique situation. Every country with a large enough area and a long enough history has a vast array of minority languages and dialects. North America is just different because it was settled relatively recently and the indigenous population was almost entirely wiped out.
> No native English speaker would ever think to try “Greco Unification” and consolidate the English, Russian, German, Swedish, Greek, and other European languages’ alphabets into a single alphabet.
Ok, we all understand it sucks. What's the fix? You think Unicode is racist and terrible. It's stupidly difficult to work with, for sure, but "racist" is a stretch.
What do you propose? More complexity layered on a system that people don't understand isn't really a fix.
Also, the organizational question of how to open the floodgates for unpopular languages while still keeping Klingon and Tolkien Elvish out is something of a mystery.
I'm not that surprised Unicode includes some very shallow code points (pile of poo), because nobody really cares about those. Bengali, on the other hand, must not be messed up if it is to satisfy its billion-plus users.
And a native Bengali speaker has discussed his input into the Unicode discussions and why he disagreed with the author. What do you know about the linguistic input into the work or the participants? Did you check the huge international lists of liaisons that the author ignored? It costs $75 to join Unicode.
This is one of the most interesting HN post+comments I've read yet, in part because it mixes technology with culture and history. It also takes advantage of the diverse HN community and their own native languages.
As an American (English speaker) who has studied French, Hebrew and Japanese, I can appreciate the complexity of input balanced against standards and the needs of programmers.
It's a fucking hard problem, but I don't think blaming the Unicode consortium is the right response. They seem to be doing a reasonably good job of trying to get everything in, and really they need input from outsiders to do this well. It requires people with linguistic and technical backgrounds, which is probably why random governments may have a harder time providing input.
Further, from all the points people are making about uppercase/lowercase/hyphenation across languages, it sounds to me like there really needs to be a super-standardized open-source implementation of the things you want to do with text, not just a way to encode it. I don't think that exists, and it might be a good area for the Unicode people to branch into.
Out of curiosity, before Unicode came along what was the state of the art for encoding/writing/displaying Bengali?
Sometimes I think issues with Unicode might arise because it's trying to solve problems for languages that haven't yet arrived at a good solution for themselves.
Latin-using languages ended up going through a very long orthographic simplification and unification process after centuries of divergent orthographic development. These changes all occurred to simplify or improve some aspect of the language: improve readability, increase typesetting speed, reduce typesetting pieces. Early personal computers even did away with lowercase letters and some punctuation marks completely before they were grudgingly reintroduced.
I'm more familiar with Hangul (Korean), which has sometimes complex rules for composing syllables but has undergone fairly rapid orthographic changes once it was accepted into widespread and official use: dropping antiquated or dialectal letters, dropping antiquated tone markers, spelling revisions, etc. In addition, Chinese characters are now completely phased out in North Korea and are rapidly disappearing in the South.
It's my personal theory that the acceleration of this orthographic change has had to do with the increased need to mechanize (and later computerize) writing. Koreans independently came up with workable solutions to encode, write and display their system, sometimes by changing the system, sometimes by figuring out how to mechanically and then algorithmically encode the complex rules for their script. It appears that a happy medium has been found and Korean users happily pound away at keyboards for billions of keystrokes every year.
I'm digressing here, but pre-unicode, how had Bengali users solved the issue of mechanically and algorithmically encoding, inputting and displaying their own language? Is it just a case that the unicode team is ignorant of these solutions and hasn't transferred this hard-earned knowledge over?
I'm asking these questions out of ignorance of course, I don't know the story on either side.
On a second point, I'm deeply concerned about Unicode getting turned into a font repository for slightly different (or even identical) characters that just happen to end up in another language. For example, Cherokee uses quite a few Latin letters (and numbers) (and quite a few original inventions). Is it really necessary to store all the Latin letters it uses again? Would a reader of Tsalagi really care too much if I used W or Ꮃ? When does a character go from being a letter to being a specific depiction of that letter?
I'd imagine that from the point of view of a Unicode consortium member, the question of whether to include a particular Bengali glyph that some argued to be obsolete looks more like "should we lower the threshold for which characters in a language are considered deserving of a separate codepoint, potentially exposing ourselves to a deluge of O(#codepoints/#characters per language) requests for obscure variant characters until we actually run out of them", whereas the question of whether to include a sign for poop in the end boils down to "can we spare O(1) codepoints to prove to the world that we are not humourless fascists". The particular decision in this case might well be ill-informed, but I think any judgement that the Unicode Consortium is engaging in cultural supremacism (as opposed to doing the usual thing of wanting anglo-american capitalist money) is somewhat far-fetched.
The right solution, I think, would be to replace Unicode with a truly intrinsically variable-length standard such as an unbounded UTF-8 - many of the arguments that were fielded in favour of having the option of fixed-width encodings seem to have melted away now that almost everything that interfaces with users has a layer of high-level glue code, naive implementations of strings have been deemed harmful to security, and even ostensibly "embedded" platforms can get away with Java. Rather than having an overwhelmed committee of American industrialists decide the fate of every single codepoint, then, they could simply allocate a prefix to every catalogued writing system and defer the rest to non-technical authorities whose suggested lists would only require basic rubber-stamp sanity-checking.
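To make "unbounded UTF-8" concrete: the original UTF-8 pattern (a lead byte with n high bits set, followed by continuation bytes of the form 10xxxxxx) generalizes naturally past the four-byte cap that RFC 3629 imposes. A minimal sketch in Python; the function name is invented, and a truly unbounded scheme would still need one further extension rule past seven bytes:

    def encode_utf8_extended(cp: int) -> bytes:
        # UTF-8-style encoder without the four-byte cap (sketch only).
        if cp < 0x80:                       # ASCII: one byte, exactly as UTF-8
            return bytes([cp])
        for n in range(2, 8):               # try 2..7-byte sequences
            # An n-byte sequence carries (7 - n) payload bits in the lead
            # byte plus 6 bits in each of the n - 1 continuation bytes.
            if cp < 1 << ((7 - n) + 6 * (n - 1)):
                out = bytearray()
                rest = cp
                for _ in range(n - 1):      # continuation bytes, low bits first
                    out.append(0x80 | (rest & 0x3F))
                    rest >>= 6
                lead = (0xFF << (8 - n)) & 0xFF    # n leading 1 bits, then a 0
                out.append(lead | rest)
                return bytes(reversed(out))
        raise ValueError("would need a further extension rule past 7 bytes")

    # Agrees with real UTF-8 wherever the two overlap:
    assert encode_utf8_extended(ord("ৎ")) == "ৎ".encode("utf-8")

At seven bytes this already covers 2**36 codepoints, which would leave room for a prefix per catalogued writing system many times over.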
> "You can't write your name in your native language, but at least you can tweet your frustration with an emoji face that's the same shade of brown as yours!"
This seems fairly characteristic of the apparent belief of social justice activists - and I can't imagine that Unicode's inclusion of skin colours would be a result of anything other than pressure by the same - that they can improve the whole world with remedies conceived against the background of US American race/identity politics.
It had nothing to do with proving they were not humorless fascists. There was a legitimate need for a universal codepoint among Japanese cellphone operators. Keep in mind, Japanese is a language where entire concepts are frequently represented in a single character, so this isn't perhaps as odd as you think. "Poo" had a specific semantic in "cellphone Japanese" that the market demanded. To be used interoperably with other Unicode characters, various 'emoji' were added to Unicode. I, for one, retain the right to be called a humorless fascist.
The argument the author makes, "every letter in the English alphabet is represented, so why not every letter/grapheme in Bengali/Tamil/Telugu/name your language", is specious at best.
Except that the whole purpose of Unicode is to create a character encoding that "enables people around the world to use computers in any language" - taken from the Unicode Consortium website. Bengali is also not an obscure language. It is the 10th most spoken language in the world and the national language of Bangladesh.
Bangladesh is a member of the Unicode Consortium. Why haven't they ensured that Bengali is well supported? The Unicode Consortium's job isn't to support every language in the world; it's to coordinate between the multitude of different parties who do the actual work, to ensure that what they produce can coexist nicely.
And besides, the argument isn't "Why aren't all Bengali characters represented when all English characters are?" The argument is "Why aren't all Bengali characters represented when a pile of poo is?"
No, the argument is "Why didn't the Unicode authors make the same technical choice I would have based on my limited knowledge of this topic -- including not knowing that the pile of poo was created for use in the Japanese market; as well as not knowing that I could participate myself for just $75, not the $18,000 lie in my story; and not understanding the nuances of international standardization; or reviewing the list of international liaisons to the Unicode organization where much of the language-specific work is done?"
Most of the discussion here centers on how linguistic commitments ought to drive decision-making in determining the Unicode spec. It's already been covered how Unicode provides a number-to-symbol (i.e. codepoint) mapping; composition, rendering, input, etc. are left to the system implementation.
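To make that division of labour concrete, here is roughly how it looks from Python's standard unicodedata module (purely illustrative):

    import unicodedata

    # Unicode proper assigns each character a number, a name, and properties:
    for ch in ("ৎ", "\U0001F4A9"):           # khanda-ta and, yes, PILE OF POO
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+09CE BENGALI LETTER KHANDA TA
    # U+1F4A9 PILE OF POO

    # Composition is also specified at the codepoint level: NFC merges a
    # base letter plus a combining mark into a precomposed codepoint...
    assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"   # é
    # ...but glyph shaping, visual reordering, and input methods are all
    # the platform's job, outside the number-to-symbol mapping.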
I actually worked with Lee Collins (he was my manager) and a whole bunch of the ITG crew at Apple in the early 00's. The critique that this was just a bunch of white dudes that only loved their own mother tongues but studied other languages dispassionately as a superficial proof of their right to determine an international encoding spec is, to me, misinformed, and DEEPLY offensive.
I had just gotten out of college and it was AMAZING to see how much these people loved language! Like it was almost weird. Our department had a very diverse linguistic background and, oftentimes, when you had a question about a specific language property you could just go down the hall and just ask "Is this the way it should be?"
All the discussion here happened on the unicode mailing lists as well as in passing at lunch and at team meetings. Lots of people felt VERY passionately about particular things; I just liked watching the drama. But to write an article like this that somehow intimates that people didn't care is wrong. People cared A LOT.
It's been touched on in a couple of comments, but a big factor in Unicode was also the commercial viability of certain decisions. You have Apple, Adobe, Microsoft, etc. with all of their previous attempts at internationalization/localization. If you wanted these large companies to adopt a new standard you had to give them a relatively easy pathway to conversion and adoption.
I think the article, in general, lacks perspective and dismisses the work of the Unicode team as well as the different stakeholders. Historically, accessible computing was birthed in the United States. The tools and processes are naturally Ameri-centric. I'm not saying it's right, I'm just saying it is. The fact that the original contributors to Unicode were (mostly?) American shouldn't be a surprise; they had the most at stake in creating interoperable systems that spanned linguistic boundaries.
A purely academic solution may have been better. That's not certain. But it's pretty clear that any solution that didn't address the needs of pre-existing commercial systems would never be adopted. I'm surprised this hasn't been emphasized more.
I have more thoughts on how to solve the problem that the author complains about but I'll leave that for another day.
I can type anything in bengali without any issue -- হঠাৎ, কিংকর্তব্যবিমূঢ়, সংশপ্তক, বর্ষাকাল...
Not sure about other "second class" languages, but this open-source phonetic (Unicode) keyboard is extremely popular in Bangladesh, and everyone does all sorts of Bengali things with it --
"in My Fair Lady (based on George Bernard Shaw’s Pygmalion)" which of course draws on Ovid's Metamorphoses poem about Pygmalion from ancient Greek tradition.
I don't speak Bengali, but some European languages have a similar problem. In Czech, the letter "ch" is a digraph treated as a single letter, with its own place in the alphabet.
What is amazing here is that everybody is talking about the character "ৎ" instead of talking about the real issue.
I guess that the main point of the article is that the Unicode Standard is being decided by a bunch of North American and European people, instead of by the speakers of the languages Unicode is intended for.
In my opinion, for instance, Han Unification is a botch and no Japanese would consider that it makes sense.
Get your hundreds of millions of native speakers off their asses and solve the problem! The current internet is in Roman glyphs because that's what all the people doing the work use!
Given the number of Indians working in the software field, I think it's safe to say that there are quite a few Bengali speakers "doing the work" to keep the internet running. Actually, I'd posit that the majority of programmers are not native English speakers.
“Whatever path we take, it’s imperative that the writing system of the 21st century be driven by the needs of the people using it. In the end, a non-native speaker – even one who is fluent in the language – cannot truly speak on behalf the monolingual, native speaker.”
Not sure how the author can simultaneously say this, while criticizing the CJK unification, which makes total sense, and has never been a point of contention in the communities concerned with it.
I regularly work with both Chinese and Japanese texts, and have studied the Japanese language for more than five years; far from merely not opposing CJK unification, I benefit from it greatly. It means that I can search for the same character (yes, the same, I said it) and get results in Japanese, Chinese, and sometimes Korean contexts without having to consult some sort of Unicode etymology resource.
There should not be three codepoints for 中, 日, 力, and most other characters. The simplified and specific post-Han forms have their own codepoints anyway, so there is (to my eyes at least) no value in fragmenting instances of the same character.
Actually yes, the CJK unification is a problem for many people, including me when I want to read Japanese on a phone bought in Europe.
Example 1. Typically, any time you want to mix the two languages, you get into trouble.
Let's say you write a textbook for Chinese people to learn Japanese as a second language. Or a research article in Japanese citing old Chinese literature.
In your text, you'll have to mark specifically which parts are in Japanese and which parts are in Chinese, and use different fonts for them. If you don't, the characters look wrong.
Example 2. Most phone software doesn't switch fonts per language, so it picks one. Let's say you buy a Japanese phone: it's a Japanese font everywhere. For some reason, on a Samsung or Motorola phone bought in Europe, it's a Chinese font everywhere, and it looks ugly for Japanese. Wasn't Unicode supposed to be universal, and allow you to read any language on a device bought anywhere?
On the other hand, I don't think the "I can't write my name" issue is really a problem for Japanese people, or at least it's not a Unicode problem. Some people have rare characters in their names. They accept and understand not being able to type their name on a computer and having to settle for a close character. Unicode actually provided more characters than previous Japanese-only encodings like JIS, so there are people who can write their name thanks to Unicode.
> In your text, you'll have to mark specifically which part are in Japanese and which part are in Chinese and use different font for them. If you don't, the characters look wrong.
Most non-trivial text-rendering systems need to know the language of the text in order to properly render it. This is not only an issue for Japanese/Chinese but also e.g. English/German with different hyphenation rules[0], different rules for ligatures[1] and different rules/preferences for spacing after punctuation[2]. None of these can sensibly be solved by Unicode.
[0] ‘backen’ (to bake) becomes bak-ken when hyphenated at the end of a line, ‘airlock’ can’t be hyphenated at all between c and k.
[1] ‘Schiff’ (ship) should be rendered with a ligature ff, ‘Dampfflut’ (~ steam flow?) shouldn’t. ‘afferent’ probably should have it.
[2] Single space after ‘.’ and ‘,’ in German, whereas French tends to prefer more space after ‘.’.
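For ligatures specifically, Unicode does provide one codepoint-level escape hatch: ZERO WIDTH NON-JOINER asks the renderer not to ligate at that position, so a careful German typesetter can mark the exception inline. Hyphenation and spacing have no such hook and genuinely need language metadata. A tiny Python illustration, assuming a rendering stack that honours ZWNJ:

    import unicodedata

    ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER: "please don't ligate here"

    schiff = "Schiff"                  # renderer is free to use the ff ligature
    dampfflut = f"Dampf{ZWNJ}flut"     # renderer asked to keep the two fs apart

    # The hint is an ordinary codepoint and survives canonical normalization:
    assert unicodedata.normalize("NFC", dampfflut) == dampfflut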
I respectfully disagree. If Japanese ideograms and Chinese ideograms actually used different code points (i.e. no "Han unification"), then the problem wouldn't exist - the phone could trivially use a Japanese font for Japanese text, and a Chinese font for Chinese text.
No. Using different code points for the same character used in different languages creates big problems. It would be like having different code points for 'A' depending on whether it was used in English, Spanish, German, etc. If you somehow ended up writing "color" with both 'o' characters from the Spanish ABCs and the others from the English ABCs, you'd have a real mess when it came to sorting, searching, name matching (what language is "Hans"?) etc. It is far more convenient to allow the character sequence "color" or "Hans" to be language independent, even if the font choices, pronunciation, sort order, etc., are language dependent.
Chinese, Japanese, and Korean writers face similar issues. The characters they use to write the name of China or Japan, the ten digits, the characters for year, month, and day in dates, and so many thousands of others in Chinese characters are what they all consider to be the same characters. That is not all characters, but it includes so many that insisting on different code points by language would make a real mess. Hong Kong has many characters that are unique to HK Cantonese. So, should Cantonese have a full set of all Chinese characters that are the "Cantonese characters"? How about Shanghainese, then? Or Hakka? Teochew (Chaozhou) or a dozen Chinese languages? Full, independent sets of all Chinese characters for each? Suppose you accidentally used an input method in HK and wrote the name of some Beijing gov't ministry using characters that looked identical to their Mandarin counterparts but were entirely different codepoints? Now what? You can't find your search term? You mess up the database and have two identical-looking keys?
No, Han unification is not conceptually different from unifying ABCs used by English and Spanish speakers, Cyrillic used by Russians or Serbs, etc., except that there are many more characters, so the boundary between what should be unified and what shouldn't contains more items in the gray zone to cause debate. Having no Han unification at all wouldn't solve all problems, it would create all sorts of absurdity.
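The mess already exists in miniature between the scripts Unicode chose not to unify: visually identical letters in Latin, Greek, and Cyrillic have distinct codepoints, so lookalike strings silently fail to match. Per-language Han codepoints would reproduce this at the scale of tens of thousands of characters. A quick Python illustration:

    # Three capital letters that render identically in most fonts:
    for ch in ("\u0041", "\u0391", "\u0410"):   # Latin A, Greek Alpha, Cyrillic A
        print(f"U+{ord(ch):04X}", ch)

    # So two spellings of "HANS" that look the same never compare equal:
    print("HANS" == "H\u0410NS")   # False: the second uses the Cyrillic letter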
Sure, having no unification at all would be bad, but the issue is with the gray zone. Some characters are written identically in each CJK language, but among those that aren't the amount of difference varies widely. The trouble is that Unicode leaves separate codepoints for each version that somebody, somewhere decided were "different enough" (even when they are the same character historically and linguistically) but merges many characters with (consistent, well-defined) differences because somebody felt they were close enough for horseshoes. People often think that characters were only merged if they were linguistically the same, but that's not the case.
Also, comparisons like "different ABCs for English and Spanish" are spurious and unhelpful. If you could tell an English "b" from a Spanish one by looking at it, the comparison would be sound.
Actually, Cyrillic is an interesting case. The Unicode standard does define completely-separate codepoints for the Cyrillic letters, even for the ones that look "just like" letters of the Latin Alphabet. Greek letters that look exactly like Latin letters get the same treatment.
It's difficult to come up with a logical explanation for why European languages that use their own alphabet get their own codepoints, but ideographic languages need to be "unified", even though the actual letters as used in those languages look different.
The "Han unification" was fundamentally a bad idea, and persists for historical reasons. Back when (some) people thought a fixed-width 16-bit character representation would be "enough", it made sense to try to reduce the number of "redundant" code points. Now that Unicode has expanded to a much-larger code space, I would think they'd choose differently.
Unfortunately, that kind of sweeping change is unlikely any time soon.
When you phrase it this way, "using different code for the same characters" it sounds obvious, but the problem precisely is whether they are the same characters or not. Are they like sans serif (used in English) or Gothic (traditionally used in German), or are they like the Roman alphabet and the Cyrillic alphabet, different but from the same root?
Gather 5 linguists, you won't get them to agree on that. Unicode says they're the same, but not everyone agrees with them, and the practical problems are real.
But... you already posted a solution to your problem - use a Chinese font for Chinese and a Japanese font for Japanese. There is no problem. I mean, you would also presumably use a Western font for Latin and a separate font for Cyrillic, since Japanese fonts universally have absolutely dreadful kerning on Latin (and often omit Cyrillic entirely).
Having to know metadata about text is precisely what made the pre-unicode days so bad. If you're writing a word document, sure, choose your fonts, but if you're rendering a web page that doesn't happen to declare its language, things aren't so simple. (And if you're writing software meant to correctly handle user input in multiple languages, good luck...)
I run into this problem quite often. Has anyone proposed creating an alias for each character indicating that it is Chinese, Japanese, or Korean? That way you could mix words within documents, and when rendered, each language could be represented by a different font. The same could be used for variations within Chinese -- so that a Hong Kong Cantonese or Taiwanese namespace or alias would use traditional characters, and a Mandarin or Singapore Chinese namespace would use simplified characters.
I agree, but it seems like it would be tricky to strike exactly the right balance between unifying too much and too little.
The arguments for Han unification could have just as well been applied to unifying the Nordic languages – the Swedish Ä is really exactly the same letter as Danish/Norwegian Æ (except that Swedish words never ever use the latter and presumably Danish words never use the former), so it could be argued that it should have the same codepoint, forcing us to use country-specific fonts and making it impossible to use both languages in a simple text file.
So, for example, running "git log" on a project with an international contributor base would render either Swedish or Danish names incorrectly, depending on which font the user has selected. That would look extremely strange to me, so I imagine Chinese/Japanese/Korean users feel similarly.
(Luckily for us Nordic users, our letters were already separate characters in ISO-8859-1, which Unicode is a superset of.)
Ä and Æ are more different from each other than the characters in Chinese and Japanese which have been merged. They are used for the same thing, and they share an etymology, but they are not the same letter.
By the way, Unicode is about scripts, not languages. If we started distinguishing by language, we might need to start remembering that China doesn't have a single language. Should we duplicate all those characters again to cover Mandarin and Cantonese and Wu and Hakka and Hokkien?
Should we have a different letter for Æ in Bokmål and Nynorsk? How about Riksmål?
The equivalents of Æ and Ä in Han scripts have not been unified. The equivalents of a in Futura and a in Verdana have. There are tons of Chinese characters, so a few of them are bound to be borderline and maybe contentious. But overall, Han unification was unavoidable for a project like Unicode.
気, 气 and 氣 are actually not merged by Han unification, and could be described as similar to Ä and Æ: shared etymology, same meaning, and some languages decided to use a simpler character because the old one was too complicated to write.
The variations on characters which have been merged are usually even closer than that. More like the single- or double-storey "a", or the single- or double-loop "g".
Well, maybe - if the double-storey "a" were used exclusively in one country but considered incorrect in another.
I think what you're missing is that Han unification never particularly concerned itself with whether characters are "the same" or not. Somebody basically just decided that some differences were important and others weren't. I mean, the difference between 語 and 语 is analogous to printing and cursive, but for whatever reason it made the cut. Meanwhile the SC and TC versions of 骨 are approximately mirror images, but to show you I'd need two different fonts.
Anyway the merged differences are not just analogous to different fonts.
Not to say there aren’t problems with CJK unification. It’s difficult or impossible to represent old family names or newly coined characters without some means of composing characters from radicals—even if you do want “precomposed” characters a majority of the time, as is the case with, say, é (00E9) versus é (0065 0301).
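The é example is easy to reproduce, and it shows how Unicode handles "same character, two encodings" when it does define both: the forms are canonically equivalent, and normalization converts between them (Python):

    import unicodedata

    precomposed = "\u00e9"     # é as a single codepoint
    combining = "e\u0301"      # e + COMBINING ACUTE ACCENT

    print(precomposed == combining)                                 # False
    print(unicodedata.normalize("NFC", combining) == precomposed)   # True
    print(unicodedata.normalize("NFD", precomposed) == combining)   # True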
As far as these characters being “the same”, I think it’s better to say they’re analogues. A in English is analogous to A in Polish—both are Latin alphabets, but neither is the Latin alphabet, and these languages use “the same” letter quite differently. And, say, Å is a letter in Swedish, but in English it’s A with a separate accent mark.
These are also just as well analogous to Α in Greek or А in Cyrillic, so hell, I would actually support the author’s “Greco Unification” strawman if it could be done in a principled way.
> I would actually support the author’s “Greco Unification” strawman if it could be done in a principled way
A five-year-old tech note by a Unicode Consortium member at http://www.unicode.org/notes/tn26 gives 7 reasons "why the Latin, Greek, and Cyrillic scripts have been separately encoded, rather than being encoded as a single script", and 12 reasons why the Han script was unified.
Edit: These 7 reasons make a good case why such "Greco Unification" wouldn't work.
I'm not convinced that this isn't just post-hoc justification.
All seven reasons boil down to "It's always been done this way".
I don't see how "a" in the Latin alphabet as used in English is any different to the "a" in the Latin alphabet used in Polish - in particular they are rendered in the same way.
Yet the article claims that even though Chinese and Japanese versions of characters now appear wildly different they still represent the same script so bad luck. Should the Unicode consortium be concerned with the glyphs or the semantics? Seems they are selectively doing both.
I'm having trouble finding the reference now, but years ago I read a story about how an early version of Unicode's reference document depicted CJK characters in a font with Chinese-styled glyphs, slowing adoption in Japan, where there was a gut reaction that "Unicode is too Chinese". Later editions printed several variations of characters, and the objections mostly evaporated.
(Apologies if I've gotten the countries reversed in this story, it's been years and google is failing me)
> [...] CJK unification[...] has never been a point of contention in the communities concerned with it.
I am not very familiar with the CJK unification project, so take my points with a grain of salt.
> more than not opposing CJK unification, I benefit from it greatly.
I think that is a different point of view, isn't it? You are seeing your benefit whereas the author is seeing his. Here's an alternative solution: what if the search engine understood what you were searching for and returned results in all the languages? Unification can result in a lot of information loss, the same way compressing a photo comes at the cost of quality.
> so there is(to my eyes at least) no value in fragmenting instances of the same character.
But no one is fragmenting instances of the same character. They _are_ different characters from different languages. To take an example from the article, I am not sure how I feel about combining B and β. You are either ignoring the whole English-speaking population or the Greek-speaking one. Given that you have complete flexibility to assign a code to both of them, why not do it (responsibly)?
The example characters given (日、中、力) are, to my understanding, written the same way in both Chinese and Japanese. (Albeit, I studied Japanese.)
There are admittedly variations which should be encoded separately; however, unification of visually identical glyphs is a "good thing" imho.
I'm fluent in Japanese and speak some Mandarin Chinese as well. These 3 characters are identical, not similar.
For a different example, 国 and 國 used to be the same character, but China and Japan (left) have both diverged from the traditional form still used in Taiwan (right). Unicode treats them as separate.
今 looks slightly different in traditional Chinese vs. the other languages. In traditional Chinese, the little straight line between the two angled lines is sloped, while it is horizontal in simplified Chinese, Japanese, and Korean. Any reader of any of these languages would have no issue if the variant they are used to were replaced by the other one. They might think you have sloppy handwriting or an ugly font if they even notice, but that's about it. Unicode treats them as the same.
This seems again to be a matter for rendering rather than encoding. The English letter 'a' can be rendered as a ring with a tail (the way I handwrite it), or a ring with a cap and a tail (the way fonts usually render it). Both are the same letter, just rendered differently based on my (contextually sensitive) font.
I think not. I do want to be able to say both People's Republic of China 中华人民共和国 and Republic of China 中華民國 in the same text, and if I had to choose rendering either 国+华 or 國+華 then it wouldn't work.
Curious, as I'm not sure when that would actually happen in real life (in Chinese). Generally, in mainland China the ROC would always be rendered with 国, even officially [1]. And in Taiwan the PRC would be rendered with 國 [2].
It gets a bit weirder in Japanese where the word is distinctly not the same - one is a traditional version (proper noun) of the other and you could imagine a text using both (William vs Wilhelm vs Will).
But I still do want to be able to write texts that are like this discussion: mainly in English, but contain fragments in Chinese, and so that I can use both the traditional and simplified characters.
And it also makes total sense to me that 日本 is Japan, both in Japanese and Chinese, using the exact same Unicode characters.
If it's up to font rendering, you would specify the language tag for each of those individually which would render the appropriate font (as you are writing that country's name in its language).
Though that's far more difficult for the layman, and markdown certainly doesn't seem to have any language markers.
Except you can still recognize the 'a' as 'a' no matter which way it is rendered.
Not so with Chinese characters. For instance, the character for "fly" in simplified (飞) and traditional (飛) look very different. Someone who only learned simplified may not recognize the traditional character as being the same.
As far as I can see, Unicode has mostly settled down into a sort of "good enough" state: characters that have sufficiently different renderings have gotten separate "variant" codepoints for each rendering, while characters that are very similar (even if not completely identical as commonly written) are still only present as unified codepoints.
I've no idea if these variant codepoints are actually supposed to show up in user files, or are intended mainly for the use of font rendering systems, etc... the whole thing seems a bit of a mess, even if the information is technically present.
Judging from unicode.com, "冷" does seem to have separate codepoints: 冷 (Chinese/unified) and 冷 (Japanese z-variant). However, my browser renders both as similar characters. Similarly, on my phone, the same character gets input whether using a Chinese or a Japanese input method, and both get rendered using the Japanese rendering (it's a Japanese phone), which makes Chinese text look a little funny.
An interesting example is "晩" / "晚", which has one more stroke in the Japanese variant, but it's situated in a location which makes both variants look pretty much identical (and in small bitmapped fonts, they are identical). Nonetheless, Unicode includes codepoints for both...
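One caveat on duplicate codepoints for lookalike ideographs: many of them live in the CJK Compatibility Ideographs block, kept only for round-tripping legacy East Asian charsets, and they carry singleton canonical decompositions, so even NFC folds them back onto the unified codepoint. They can't reliably carry a variant distinction through any pipeline that normalizes. For example, with the first entry of that block (Python):

    import unicodedata

    unified = "\u8c48"   # 豈 CJK UNIFIED IDEOGRAPH-8C48
    compat = "\uf900"    # 豈 CJK COMPATIBILITY IDEOGRAPH-F900 (legacy duplicate)

    print(compat == unified)                                 # False as stored
    print(unicodedata.normalize("NFC", compat) == unified)   # True: folded away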
The fact that some characters are debatable does not change the fact that Han unification is a good idea. In a few cases you can disagree, but not unifying at all would be madness.
As for this particular character (cold), both variants are familiar to Japanese readers, with the one described as Japanese in your link being the one you'd typically see in print, while the other one is common in handwriting, and nobody in Japan would treat these two as different. From a Japanese point of view, this is definitely the kind of thing you change by switching fonts.
This PDF is the official list of basic Chinese characters, published by the Japanese government. Look on page 9; it shows both variants of this character in handwriting.
The fact that one of the variants is not familiar to Chinese readers complicates the issue, but there are at least reasonable grounds to argue that this is one character, not two.
I think it is possible to argue that Han unification was not done very well and that the UC made too many classification mistakes (although I personally think it is generally not that bad), but I don't think the argument that unification is entirely a bad thing has legs.
Encoding those as the same code point makes sense. On the other hand, just the codepoint is obviously not enough on its own for rendering the glyph. The Unicode consortium seem not to care about the actual rendering part of the whole stack and are happy just defining the low-level bits. But then why do we have skin colour coding for emoji and no language coding for CJK glyphs? The entire thing is a mess, but heaping another pile of standards on top of it will make it even more of a mess, I'm afraid.
I would like to take a moment to have people recognize that there are differences between the glyphs used for these characters, and some people find this to be a point of contention.
For example, in the case of the text fragment “福祉”, there are three glyph representations for each character.
Referenced below is an image demonstrating these three; first is Chinese, second is Korean, third is Japanese. [0] (I left out the Taiwanese etc. because they don't differ in this case from the CN)
The UC have clear rules for this, as explained in their section "Characters, not glyphs", which starts on page 8 of the chapter 2 pdf of the Unicode 7.0 standard. They make the case that codepoints are not for representing stylistic or glyphic distinctions, as seen here, but rather for semantic separations; that is, people still agree that these are the same characters, even if they typeset and write them differently in Chinese, Korean, and Japanese text.
They make cases specifically against a greco-latin unification, and for the CJK unification in Technical Note #26[1]
I think it is a problem when 2 characters that are written differently share the same code point.
Having 大 be the same code point for Chinese, Japanese and Korean is to me as obvious as having the same A for English and Italian. We agree on this point.
But some of the unified characters listed on Wikipedia ought to have different code points because they are no longer written the same across languages.
Take characters like 今 or 骨. Whether the C/J/K variants are the "same characters" or not is a semantic question, but philosophy aside, they are written differently in each language. That is to say, if you are localizing an app into a CJK language, your app would not be working correctly if it displayed Japanese text with a Chinese 今, or vice versa. If you showed it to a user you'd get a bug report.
But such characters have only one codepoint, so unless you have language metadata for your text, you can't render it correctly (which is half of what was so bad about the pre-Unicode days!).
(Note: Unicode later added "variation selectors", which as far as I know solve the problems mentioned above. I don't know why they aren't in widespread use; perhaps they are and I'm misunderstanding something.)
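For what it's worth, a variation selector is mechanically just an invisible codepoint appended after the base character: a font that supports the registered sequence picks the requested glyph shape, and any other font ignores the selector and shows its default. A rough Python illustration; 葛 is a character often cited for its Japanese glyph variants, though whether VS17 selects anything for it depends on the font and on the Ideographic Variation Database registration:

    base = "\u845b"         # 葛 CJK UNIFIED IDEOGRAPH-845B
    vs17 = "\U000e0100"     # VARIATION SELECTOR-17, the first selector the IVD uses

    variant = base + vs17   # an ideographic variation sequence
    print(len(variant))     # 2 codepoints, but displayed as one character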
I came in expecting to read an article bemoaning some niche language and playing the diversity card. I was not disappointed, but as I kept reading, the author made some very good points.
I don't really care that the organization is run by white men who speak English, because frankly the entire computing industry and telecommunications industry is based on that. I'm not going to argue about the original sin there, because the facts speak for themselves: they got shit done.
What I do find really disturbing, though, is that somehow both Linear A and Linear B are included, in their (as known) entirety, while Bengali is not--two languages spoken and written by nobody for millennia are included whereas a language spoken and written by millions of people is not. That's bad.
The discussion of somehow creating a CJK superset and then deviations to support each of those languages is worse, and the author's remark about trying to "unify" Greek, Roman, Cyrillic, and Swedish struck home.
Personally, I think that it's nice to have an ever-decreasing number of languages to support, preferably English. The first part of that is because it's annoying enough to unambiguously parse one human language much less dozens, and the second part is pure convenience and small-mindedness on my part.
That said, we don't have to buy into diversity or identity or anything else to note that something here is amiss. This is bad allocation of engineering resources.
EDIT:
And yes, one could make the argument that the computer/telecom industry's hardware is produced in Shenzhen, to which I would respond that the intellectual capital was handled by the West. Nintendo hired SGI, Sony hired IBM. There is certainly a wonderful and burgeoning domestic talent pool, but the paths were set elsewhere, to the best of my knowledge.
Han unification has been overly aggressive about merging some characters, but the basic principle is not as flawed as it is sometimes (as in this article) made to sound. The vast majority of Japanese and Chinese characters are not only similar, they are identical. Not all are. Some are clearly different characters deriving from a common historical root, and should not be unified. Sometimes, when characters are a bit different, it is a matter of font. Out of the tens of thousands of characters used by these languages, it is not surprising that mistakes have been made, and yes, Han unification has been a bit too far-reaching. But not unifying at all would have been much worse. You don't want a in German and a in English to be different letters just because Helvetica and Baskerville look different.
Also, since CJK languages are comfortable with having an unbounded set of letters to work from, they tend to be comfortable with declaring every odd thing a separate letter. Witness the mess that half-width katakana is. Or notice that the origin of most emoji is not in the disregard of white old men for "lesser" languages, but comes from importing existing characters from Japanese encodings.
Which also hints at one more factor. Historically, and still very much to this day, Japan has been much more active in standardization bodies than most other non western countries. Unsurprisingly, Japanese is much better supported by modern software than many other non western languages.
> You don't want a in German and a in English to be different letters just because Helvetica and Baskerville look different.
A much more apt comparison would be Antiqua and Fraktur. It was a commonly-held belief that Fraktur was the authentic German alphabet, separate from the alphabets used by other languages. Back in the 19th century, English-German dictionaries even used Antiqua for English words and Fraktur for German words.
No, it isn't a more apt comparison. Antiqua and Fraktur look much more different from each other than most Chinese and Japanese characters do from each other.
Here are two screenshots in Japanese and Chinese, taken from http://www.yomiuri.co.jp/ for one and http://www.cntv.cn/ for the other. Neither media outlet can be accused of being a cultural sellout misrepresenting the written tradition of its country.
These are the same characters, and unifying them is the only sane thing to do. The same is true for most characters. There are also some characters which are clearly distinct in both scripts, and unicode treats them as different. Some are grey-zone, and there is room for reasonable disagreement.
But as a whole, the case of unification of Chinese and Japanese is stronger than for Antiqua and Fraktur.
Your screenshots are showing the exact same glyphs because you don't have the "宋体" font installed that's specified by the Chinese web page (it's a script font so you'd notice instantly it's different), so your browser picked a Japanese font to render it instead.
The only reason they look the same in your screenshots is due to Han Unification. Your reasoning is "they're unified in Unicode which is proof they should be unified in Unicode".
The fallback font used on my computer isn't the most appropriate one, but Japanese also uses Song/Ming fonts, and this is much more similar to the difference between a serif and a sans serif font than between different characters.
> The vast majority of Japanese and Chinese characters are not only similar, they are identical. Not all are. Some are clearly different characters deriving from a common historical root, and should not be unified.
And what about traditional versus simplified? Which glyph set do I use? Oh wait, thanks to Han unification, I now need to rely on bloody environment variables to decide!
For Chinese text, rendering a string of text involves not just the string, but also the local runtime. Joy of joys.
> You don't want a in German and a in English to be different letters just because Helvetica and Baskerville look different.
If following the spec means my users become partially illiterate then there are problems with the spec.
Antiqua, Fraktur, Schwabacher, Textura and all those other charming variants of writing European languages don't have separate code points for all those presentational variations.
Do we white Western men happen to discriminate against ourselves?
The difference between traditional and simplified Chinese characters is more than simply different fonts. Part of the difficulty is that some simplified characters map to multiple traditional characters, which means that converting from one to the other may be lossy. There's also the Japanese equivalent of simplified characters (shinjitai), many of which differ from their Chinese counterparts, as well as characters that were invented in Japan and may or may not have Chinese equivalents (kokuji).
Simplified and traditional characters have different unicode codepoints. Japanese and mainland simplifications have different codepoints when they differ, and Japanese-only characters of course have their own codepoints.
The argument is about characters like 冷. It is given a single codepoint, but Chinese typefaces draw the bottommost stroke diagonally, and Japanese typefaces tend to draw it vertically. (When writing it by hand, both versions are acceptable in Japanese also, http://detail.chiebukuro.yahoo.co.jp/qa/question_detail/q105... )
The Cyrillic R (looks like P) and the Greek Rho (also looks like P) and the Latin P are not unified either. Although I think unifying some Greek and Cyrillic letters would have made sense.
Those are all stylistic differences. You can still read any of those fonts. Sure some designer can go overboard and make a font hard to read, but that is just a designer being overly fancy!
Think instead of the difference between Cyrillic and Latin.
Sure, if you squint hard enough they both have a common origin (Greek), but you'd be rather annoyed if you set up your phone, selected "English", and the OS used Cyrillic letters to spell out English words.
Likewise you'd be annoyed if Greek letters were used.
And you'd be even more upset if some products decided to use Cyrillic, some Greek, and some used Latin.
Actually, I can't. I can decipher quite a bit with lots of effort, but I wouldn't call that "reading".
And there are very few people around who can read all those (especially Textura) without problems.
On your other example: when I was in Russia, I found those "unknown" letters difficult, but fun. But the letters they share with Latin? I didn't see a difference.
And to your phone example: of course, I'd be annoyed. But that's exactly the point: you have to set up your local system correctly, to your expectations and standards.
I'd be much more annoyed if some web page showed an English text in some Chinese transliteration, just because the author (writing English!) was Chinese. But that's basically what you propose!
Sorry, I truly think you've run up an argumentative dead end.
> On your other example: when I was in Russia, I found those "unknown" letters difficult, but fun. But the letters they share with Latin? I didn't see a difference.
Again, Han unification does this.
Most characters are the same, great! But some are different. Sucks for those that are different.
> And to your phone example: of course, I'd be annoyed. But that's exactly the point: you have to set up your local system correctly, to your expectations and standards.
The problem here is that for almost everything else, Unicode is a mapping from a code point (or a set of code points) to some distinct and unique representation on screen.
Except there are some code points for which that isn't true.
> Sorry, I truly think you've run up an argumentative dead end.
You are not arguing any point, other than "let's keep this historical attempt at saving encoding space around, even though we have expanded the encoding space so that we don't need the savings anymore."
That isn't a strong argument.
My argument is "user's don't like this, it upsets them, we shouldn't do it."
If something we do as engineers angers or upsets our users, we are doing it wrong. Flat out.
As far as I know, those "some are different" characters have not been unified, but got separate code points. And that's where your whole argumentation falls apart, unless you claim that the Unicode consortium mis-classified lots of characters.
But then I'd just retreat, because I don't speak any Asian language and cannot verify the claim myself. I can only defer to the experts, and they say that issue has been taken care of.
As far as space savings are concerned, I replied to that in another comment to you. It's not okay to dismiss Han unification as just some space-saving attempt that's not needed anymore. Space saving was one motivation, but according to the consortium not the primary one.
Bengali is the seventh most widely spoken language in the world, with more native speakers than Russian. That's hardly a "niche language".
And, as pg has pointed out, making the computer industry English-centric and America-centric is greatly limiting us. Intelligence and will to power are equally distributed across the globe. But institutional barriers to success are not. This is a classic example of an institutional barrier, and it makes all of us poorer.
It's also the national language of Bangladesh. Once a language has become the official language of a recognized country it can hardly be considered a "niche" language.
Right, but some languages are insanely complex to implement. It might be a better idea to teach English to people around the globe rather than cater to every individual need (which will still leave people unable to communicate across languages).
I'm not saying other languages should go away -- but the world would also benefit from having a "universal" language, which is more or less English at this point (Mandarin is spoken by more people but is rarely spoken outside of Asia). If we want to maximize intelligence and will to power, it would be best if like-minded individuals could communicate regardless of where they're from or what their native language is.
It's one thing to say English will be the international language of trade and commerce. It's another thing altogether to say we won't bother making character sets that represent other languages, because it's too hard.
Since when has "too hard" ever stopped an engineer? That should be catnip for us!
Unicode is notoriously tricky to implement; there have been bugs where users could take over others' accounts by choosing a different Unicode representation of the same string.
I'd much rather have something that works and is simple than something that tries to make it easy for everybody and is broken.
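The usual root cause of those account-takeover bugs is a missing normalization step: two byte-wise different strings render identically, so they register as distinct accounts but collide somewhere else. The standard mitigation is to canonicalize identifiers once, at the boundary. A minimal sketch (the function name is made up; real systems add confusable-script checks along the lines of Unicode TR#36/TR#39):

    import unicodedata

    def canonical_username(raw: str) -> str:
        # NFKC collapses compatibility variants (full-width letters,
        # ligature codepoints, combining vs. precomposed accents);
        # casefold() then removes case distinctions.
        return unicodedata.normalize("NFKC", raw).casefold()

    # Three renderings of "admin" that naive comparison treats as distinct:
    for raw in ("admin", "ａｄｍｉｎ", "ADMIN"):
        print(repr(raw), "->", canonical_username(raw))   # all map to 'admin'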
> some languages are insanely complex to implement
I don't understand this; is there more to implementing a language than creating glyphs for its character set? I wouldn't think the linguistic complexity would matter at all, only the number of glyphs in the 'alphabet' or similar?
Languages aren't all just simple alphabets like English. Some languages use ligatures to combine characters. In English, ligatures like 'fi' and 'ffl' can be applied almost automatically and are optional, but other languages have stronger and more important rules.
As a simple example, in German the ligature ß is not a simple ligature for 'ss' but a combination of two earlier ligatures: long s with round s ("ſs") and long s with (round) z ("ſʒ"). Various spelling reforms have simplified the orthography, but "Maßen" and "Massen" are still different words.
Quoting from a Wikipedia page, "Urdu (one of the main languages of South Asia), which uses a calligraphic version of the Arabic-based Nasta`liq script, requires a great number of ligatures in digital typography. InPage, a widely used desktop publishing tool for Urdu, uses Nasta`liq fonts with over 20,000 ligatures"
Then there are rules for presentation. "Complex text layout ... refers to the typesetting of writing systems in which the shape or positioning of a grapheme depends on its relation to other graphemes." - http://en.wikipedia.org/wiki/Complex_text_layout . Cursive English is closest we have to complex text layout; while there are "cursive" fonts where each of the characters is in cursive the letters don't merge. Now imagine a language where smooth connections and fancy curlicues in the "right" places were essential for being seen as erudite, and where "right" depended on 5 years of learning.
Yes, if the way the characters look depends on the other characters in the word including some that are nowhere near being neighbours. Especially if there were weird and complex rules about how this changed depending on the kind of word being written.
There are a number of writing systems that are evil like that, including several Indic ones.
The code that handles this complicated process is often called a shaper. Choosing and combining the correct glyphs involves a complicated dance between that and the font(s), possibly including large tables (and code!) in the font itself on top of what the shaper does.
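One concrete trace of all this is visible in the codepoints themselves: Unicode stores each Arabic letter once, in its nominal form, and the shaper picks the positional glyph at render time. The legacy "presentation form" codepoints that older systems used for those positions still exist, and compatibility normalization folds them back to the base letter (Python):

    import unicodedata

    beh = "\u0628"   # ARABIC LETTER BEH, stored the same in every position

    # Legacy codepoints for its isolated/final/initial/medial glyphs:
    for cp in ("\ufe8f", "\ufe90", "\ufe91", "\ufe92"):
        print(unicodedata.name(cp),
              unicodedata.normalize("NFKC", cp) == beh)   # True for all four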
It really depends on the language, and it's not totally about the glyphs. Text entry is a huge challenge for languages like Mandarin where glyphs can have multiple pronunciations and meanings depending on context. Consider that Mandarin (which shares many, but not all glyphs with Japanese kanji) has upwards of 20,000 different glyphs, and that other languages have a similar level of complexity, and it becomes hard to find an encoding standard capable of handling all of that complexity and variance.
What constitutes a "glyph" isn't even consistent - in some languages a glyph is a syllable, in some (like English) it's less than a syllable, and in yet others a single glyph can be an entire word.
In a language like Japanese, multiple glyphs are often combined to create new composite glyphs with different meanings. For example, the word for "forest" is a glyph composed of three "tree" glyphs, but has an unrelated pronunciation.
How do you handle text entry between these differences? It may seem like a pedantic question, but it makes sense to define the characters in the way they will be written, or else the text entry scheme will be so complex you'll need an interpreter to convert from some entry scheme into the Unicode format. I think this is the problem the Unicode Consortium is grappling with - and it's not an easy problem. I don't claim to have the answers here; but I do recognize the complexity.
User interface isn't the problem, though - bitwise representation is the problem. How do we represent all the valid characters in Unicode? Data entry is an entirely separate issue (as is display).
Hypothetically you could construct a language whose glyphs are easy to generate procedurally on the fly by people who are fluent in that language, but whose full space of possible glyphs is staggeringly massive.
Suppose a language with tens or hundreds of thousands of "base" glyphs, but with a unique variant on each glyph depending on what is to the left and right of it. With that alone, for N base glyphs, you could have N^2 variants of each glyph.
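The back-of-the-envelope arithmetic (all numbers below are made up purely for illustration):

```python
# Hypothetical contextual script: every glyph has a variant for each
# (left neighbour, right neighbour) pair.
N = 100_000                       # assumed number of base glyphs
variants_per_glyph = N * N        # one variant per left/right context: N^2
total_forms = N * variants_per_glyph
print(f"{total_forms:e}")         # 1e+15 forms: hopeless to enumerate in a
                                  # code space, trivial to generate by rule
```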
I don't know if that sort of language exists. I don't see any reason why it couldn't though.
Yeah, there's only one minor problem with the article: those emoji symbols it bemoans were added for the Japanese market. Western and English-language support for them was very much an afterthought. What's more, the reason they were added was because Japanese mobile providers were already offering emoji support through various non-standard and not entirely compatible encodings.
Han unification makes things really hard for programmers. You end up with code that tries to guess what language a string is in to pick out which character set should be used!
But because of Han unification I all of a sudden DO need to know the language.
The same Unicode code point needs to be rendered differently for a user in Mainland China versus a user in Japan, or else the user may not be able to read the text! Even if the user can read the character, they are going to experience a degradation in reading speed and comprehension, and be generally frustrated. Not to mention that showing the wrong character is insensitive to the customer's culture, and if I pick just one set of characters and stick with it, I end up promoting (or at least being accused of promoting) cultural hegemony based on which character set I go with.
In what situations do you need to do this, but don't need to show any other data (dates and times, localized UI, user timezone, culturally appropriate fonts, RTLness) that involves knowing the user's languages and locale?
This can happen if the user is intentionally reading mixed-language text or text not in their computer's UI language, of course. In that case different CJK languages also have different preferred fonts, so having language tagging or just guessing is pretty important.
> In what situations do you need to do this, but don't need to show any other data (dates and times, localized UI, user timezone, culturally appropriate fonts, RTLness) that involves knowing the user's languages and locale?
For drawing a given glyph, there is normally a lookup into a font table that involves solely the string of Unicode code points coming in.
Except if any characters are in the CJK Unified Ideographs range. Then my function call suddenly has to jump out and read environment variables, which are hopefully set up correctly.
My code to do a lookup into a font file should not depend upon the user's environment variables because of a space-saving optimization made two decades ago.
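A sketch of the kind of special-casing being complained about; the block ranges come from the Unicode code charts, and the fallback to the environment is the objectionable part.

```python
def has_unified_han(text: str) -> bool:
    """True if any character falls in the main CJK Unified Ideographs blocks."""
    return any(
        0x4E00 <= ord(ch) <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= ord(ch) <= 0x4DBF   # CJK Unified Ideographs Extension A
        for ch in text
    )

# Without language metadata on the string itself, code like this is reduced
# to consulting LANG/LC_* to decide between Chinese and Japanese glyph forms.
```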
> For drawing a given glyph, there is normally a lookup into a font table that involves solely the string of Unicode code points coming in.
Why are you implementing OpenType? It's got working libraries already.
But if you are getting into that, glyphs in a font are stored by "glyph name", not necessarily by code point. And there are a bunch more steps than that:
- Font substitution: Find fonts that cover every character in the text. The order of your search list depends on the language.
- Text layout and line breaking: for best results, you don't want to line break in the middle of a word, and you need to place punctuation on the correct side of right-to-left sentences. I think both of these need dictionaries.
- Glyph substitution: you have to read the GSUB tables and apply a bunch of expected features, like ligatures, automatic fractions, and beginning-of-word special forms (see Zapfino), &c. This includes language-specific glyphs, but fonts can also just choose glyphs with a random number generator. (See the sketch after this list.)
- Drawing the glyph. Remember not to draw each one individually, or a translucent line of overlapping characters (like in Indian languages) will look bad.
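Two of those steps (coverage-driven font substitution and reading GSUB features) sketched with the fontTools library; the font file names here are hypothetical.

```python
from fontTools.ttLib import TTFont

def covers(path: str, text: str) -> bool:
    """Font substitution: does this font's cmap map every code point in the text?"""
    cmap = TTFont(path).getBestCmap()   # maps code point -> glyph *name*
    return all(ord(ch) in cmap for ch in text)

def gsub_features(path: str) -> list:
    """List the GSUB feature tags a font advertises ('liga', 'frac', 'init', ...)."""
    font = TTFont(path)
    if "GSUB" not in font:
        return []
    return sorted({fr.FeatureTag
                   for fr in font["GSUB"].table.FeatureList.FeatureRecord})

# Hypothetical fallback list; the right ordering is itself language-dependent.
fallbacks = ["NotoSans-Regular.ttf", "NotoSansBengali-Regular.ttf"]
chosen = next((p for p in fallbacks if covers(p, "আদিত্য")), None)
```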
Sorry, Han glyphs render the same in Chinese and Japanese.
Regarding simplified versus traditional, no one is seriously unifying those.
There's some minor disagreements as to when a minor stylistic or historical variant deserves a separate glyph, but this isn't about rendering different glyphs in Chinese or Japanese. If Unicode is doing its job no one should have difficulty reading unified Han characters in one font regardless of language.
Well, if you find Hiragana/Katakana it's Japanese, if you find Chữ Nôm it's Vietnamese. Otherwise it's Chinese (Well, given the definition of "language" is very hard in the context of Chinese).
From a purely theoretical perspective, Han unification looks like a great idea. Imagine the horror of normalization if it hadn't happened; the semicolon and the Greek question mark would have been a joke in comparison.
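The semicolon/Greek question mark case is real and easy to demonstrate with the standard library: U+037E canonically decomposes to the ordinary semicolon, so normalization silently rewrites one into the other.

```python
import unicodedata

s = "\u037e"                                   # GREEK QUESTION MARK
print(unicodedata.name(s))                     # -> GREEK QUESTION MARK
print(unicodedata.normalize("NFC", s) == ";")  # -> True: it becomes U+003B
```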
Imagine a world where the British always write the lowercase letter g as a single-story glyph (http://en.m.wikipedia.org/wiki/G#Typographic_variants). The colonies start writing it identically, but after a while, they start writing it as a double-story g.
After a century or so, nobody in the colonies writes the single-story variant, and all Brits always do.
The unicode consortium studies the case and concludes that there is only a single g with variations in the way it is written. Because of that, it creates a single code point for the lowercase 'g' character.
Now, suppose a web page stores the text 'goto' in Unicode as the code points for 'g', 'o', 't', and 'o'. To write the code that renders that string, you will find you need to know whether the text is written in British English or in colonial English.
No, you don't. It's the same letter and it's always "goto".
Your local setup determines the look of the glyph, so nobody sees an unfamiliar form.
But maybe you'd like to encode typefaces/fonts in the Unicode code points, as well? To make sure that I'm seeing the exact same arrangement of pixels you want me to see?
"G" and "g" are the same letter. They started off as stylistic forms of a unicameral alphabet. Over time they took on separate meanings, and now we have a bicameral alphabet, where the two forms have different code points.
Of course, over the last 2000 years, we've developed rules for how to use them. "I was reading a nice book on Polish polish on the way from Reading to Nice" contains three pairs of words where the capitalization changes the meaning and pronunciation. (In simplified form, "What do you know about polish?" is different than "What do you know about Polish?")
If there were a simple rule to specify capitalization, e.g., only the first letter of a sentence, and it were easy to detect the start of a sentence, then you might say it's pointless to have both "g" and "G"; we should have only a single form and let the local setup determine how to display it. (Something like the Greek sigma, which takes the form ς at the end of a word, though Unicode encodes the two lowercase forms as different code points.)
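The sigma case is directly checkable in Python, whose str.lower() implements Unicode's Final_Sigma context rule:

```python
# One uppercase sigma, two lowercase code points chosen purely from position:
print("ΟΔΥΣΣΕΥΣ".lower())    # -> 'οδυσσευς': medial σ inside, final ς at the end
print("\u03c3" == "\u03c2")  # -> False: σ (U+03C3) and ς (U+03C2) are distinct
```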
In Someone's nice example, it's easy to think of how the two divergent forms of 'g' might take on their own meaning. Perhaps the Americans have decided that double-story g was the sign of true patriots, and that single-story g was for traitors. (Akin to the shibboleth of how to say 'H' in Northern Ireland; aitch was Protestant, haitch was Catholic, and using the wrong version could get you into trouble.) Perhaps they started to use the new 'g' preference as a currency symbol, in the way that £ is the same letter as L, from the Latin libra pondo.
The premise of Someone's hypothetical was to explain to theon144 why it might be both hard and important to guess. The hypothetical assumed that the difference already existed. It echoes the larger context of Han unification that com2kid started, but with an example that's a lot easier for native English speakers to understand.
So while I agree that they haven't yet diverged, that's outside the context of the hypothetical, where they have diverged.
That's wrong. Someone's hypothetical had no diverging meaning involved, only a stylistic choice in the presentation of the same letter.
And he is still wrong when it comes to the claimed necessity to "guess".
The user has set up his system correctly. "Guessing" only comes into play if you want to force your stylistic variation on others.
And that is obviously a bad idea, for all the reasons called out before, like familiarity and readability.
Let's not kid ourselves. When it comes to Han unification opposition there is mostly one issue at play: plain racism. "Our holy script shall not be defiled by those dirty bastards". And that works in all directions.
The issue is that, in this example, both the colonials and the British think the two 'g' characters are different letters, just as people nowadays think 'g' and 'G' are different characters (historically, that is at least up for debate: http://en.m.wikipedia.org/wiki/Letter_case#History). Americans will want to see a different character when quoting Shakespeare inside American-English text (globalization starts with a different letter than globalisation).
Because Unicode has only one character, writing that text in a text editor or storing it in a text column in a database becomes impossible.
Yes, there are workarounds such as using escape characters or html, but those are a nuisance that could be avoided by including both variants in Unicode.
The unicode consortium is entitled to think differently, but they cannot expect everybody to be happy with their choice.
When reading 'gas', the client will have to figure out whether that is about a liquid, in which case it has to choose a colonial 'g' (when written with a British 'g', 'gas' never is a liquid). If the meaning is that of a gaseous substance, the client will have to do additional work to determine what kind of 'g' to write.
I really would like to downvote you for your harping about "diversity", whatever you mean by that. I can guess what you mean, as I live in the United States and am exposed to other people who harp on similarly.
I'm not going to downvote you though, just point out your weird usage of the term and mention that this is about proper information transmission, the foundation of computing and the Internet. So maybe get off your agenda regarding whatever you mean by "diversity" and start to see people of all types as fellow humans and not actors in your own private psycho-drama.
I can't think of any telecommunication or computing device that I own or have touched in the last few years that was built by white, English speaking men.
Regarding software, I googled a bit to find numbers, and one of the first articles actually giving numbers says 19 percent of software developers are in the largest English speaking country, the US: http://www.techrepublic.com/blog/european-technology/there-a...
> Personally, I think that it's nice to have an ever-decreasing number of languages to support, preferably English. The first part of that is because it's annoying enough to unambiguously parse one human language much less dozens, and the second part is pure convenience and small-mindedness on my part.
I'll go further and claim that it's all about convenience and small-mindedness.
It's so typical of a certain group of programmers to complain like this. Ugh, why do we have different languages and ways of doing things? Let's just converge on the one I grew up with, or use[1]. Hopefully you won't be forced to work on this kind of thing, or any kind of work that you don't care about/think is worth it. But until you do, maybe you could keep your self-centred navel gazing to yourself.
[1] In the case of English, most programmers who express this opinion write perfectly good English, though they might not be native speakers. They wouldn't be inconvenienced at all if suddenly everyone forgot about their own language and started speaking English, especially since they don't seem to care about languages beyond having some way of communicating with other people.
It sounds like the author is looking to be offended. Talking about Bengali being the seventh largest native language, then saying that US$18k is too expensive for a stake in solving this problem? Emojis with skin tones, something every human can use, that have arrived before a character that only Bengalis use is taken as an "outright insult"?
> Talking about Bengali being the seventh largest native language, then saying that US$18k is too expensive for a stake in solving this problem?
West Bengal and Bangladesh aren't exactly the richest places in the world.
> Emojis with skin tones, something every human can use, that have arrived before a character that only Bengalis use is taken as an "outright insult"?
Nobody is being inconvenienced by the inability to send emojis with a darker skin tone. People definitely are being inconvenienced by not being able to write a common character of their native language.
I work with a Bangladeshi, and he says that there are more millionaires in Bangladesh than here in Australia. Yes, the people are poor on average, but there's still a lot of money sloshing around.
As to your second point, the top-voted comment in the thread talks about writing exactly that character and gives an example. From the ensuing discussion with the article author, it seems that rather than an inconvenience per se, it's more of a subtle issue. sdg1 also mentions that there's little in the way of local standards, which may be hindering a formal uptake by unicode.
Ultimately the answer is to push the issue with the standards body. If you aren't going to front the money for voting power, then pester the people who do have the power, and make it easier for them by helping them to standardise the codepoints for the language. But whichever path is taken, they must be engaged professionally - saying things like "I take it as an outright insult that you haven't accommodated me yet" is not going to help.
>Emojis with skin tones, something every human can use, that have arrived before a character that only Bengalis use is taken as an "outright insult"?
Who cares if emojis can be used by every human? Bengali language support is something people actually need to function normally in the digital age without jumping through a million hoops.
But there is more to it when it comes to India: it is mostly that "nobody really cares".
It is sad that while India develops, it is rapidly leaving its many languages behind when it comes to the computer; indeed, while Chinese and Arabic keyboards are extremely common, one would be hard pressed to find Hindi or Nagari (let alone Bengali) keyboards in India. This is despite the nauseating linguistic jingoism in the country's political history. There has been sadly little to show for it, and it has in fact gotten worse since independence, as with many things; see
http://www.columbia.edu/cu/mesaas/faculty/directory/pollock_...
I would not be surprised, at this rate, if India turns into a monoculture, two centuries hence.
P.S.: Speaking of representation, the script the OP talks of belongs to the set of phonetically accurate scripts, one which, while populated mostly by Indic scripts, is given - quite disgracefully - the name "abugida", after a lone Semitic script from Ethiopia, Ge'ez, which ironically is probably derived from one of the Indic ones. Systematic biases are far too apparent in Indology.
My English professor's biggest pet peeve. He told us on day one.
It was:
1. Don't use cliches in your writing.
2. Don't ever use emoji in any communication.
p.s. Emoji weren't even around when I went to school. I bet Mr. Taylor would have had a field day though. Never been a fan of any smiley face, even rotated 90 degrees.
In case of the "missing" letter (called khanda-ta in Bengali) for the Bengali equivalent of "suddenly", historically, it has been a derivative of the ta-halant form (ত + ্ + ). As the language evolved, khanda-ta became a grapheme of its own, and Unicode 4.1 did encode it as a distinct grapheme. A nicely written review of the discussions around the addition can be found here: http://www.unicode.org/L2/L2004/04252-khanda-ta-review.pdf
I could write the author's name fine: আদিত্য. A search with the string in the Bengali version of Wikipedia pulls up quite a few results as well, so other people are writing it too. The final "letter" in that string is a compound character, and there's no clear evidence that it needs to be treated as an independent one. Even while in primary school, we were taught the final "letter" in the author's name as a conjunct. In contrast, for the khanda-ta case, it could be shown that modern Bengali dictionaries explicitly referred to khanda-ta as an independent character.
For me, many of these problems are more of an input issue, than an encoding issue. Non latin languages have had to shoe-horn their script onto keyboard layouts designed for latin-scripts, and that has been always suboptimal. With touch devices we have newer ways to think about this problem, and people are starting to try out things.
[Disclosure: I was involved in the Unicode discussions about khanda-ta (I was not affiliated with a consortium member) and I have been involved with Indic localization projects for the past 15 years]