
> there is an growing number of hieroglyphs as there is an infinite (growing) number of words; and every h. consists of more primitive parts - i think they should be coded so that the sequence of primitives would form a hieroglyph like in european languages letters form words

The Chinese and Japanese FAQ on the Unicode Consortium's website (http://unicode.org/faq/han_cjk.html) poses this very question:

> Q: Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?

Their reply:

A: The Han ideographic script is largely compositional in nature. The overwhelming number of characters created over the centuries (and still being coined) are made by adjoining two or more old characters in simple geometric relationships. For example, the Cantonese-specific character U+55F0 嗰 was created by adjoining the two older characters, U+53E3 口 and U+500B 個, one next to the other.

The compositional nature of the script—and, more to the point, the fact that this compositional nature is well-known—means that over time tens of thousands of ideographs have been created, and these are currently encoded in Unicode by using one code point per ideograph. The result is that some 71,000 code points are consumed by ideographs in Unicode 5.0, nearly three-quarters of the characters encoded.

The compositional nature of the script makes it attractive to propose a compositional encoding model, such as can be used for Hangul. Such a mechanism would result in the savings of thousands of code points and relieve the IRG from the burden of having to examine potential candidates for encoding.

Unfortunately, there are some difficulties involved with a compositional model for Han.

First of all, while the rules for drawing composed Jamos as Hangul syllables are relatively straightforward, those for Han are surprisingly complex. To use U+55F0 嗰 as an example again, although it is built structurally out of two pieces, the left piece occupies far less than 50% of the character's horizontal space. This reduction in size is a result of the nature of U+53E3 口 itself and doesn't apply to other characters. Either the rendering process would have to be sophisticated enough to take such ideographic idiosyncrasies into account, or the encoding model would have to provide more information than just the geometric relationship between the composing pieces. (This is the main reason why the existing Ideographic Description Sequence mechanism is inadequate even for drawing described ideographs.)
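(An aside, not part of the FAQ: the Hangul case it contrasts with can be seen directly in Python's standard `unicodedata` module. This is only a small sketch; the particular syllable is my own choice of example.)

```python
import unicodedata

# Three conjoining jamo: choseong HIEUH, jungseong A, jongseong NIEUN.
# The syllable is an arbitrary example, not one taken from the FAQ.
jamo = "\u1112\u1161\u11ab"

# NFC composes the jamo sequence into a single precomposed syllable.
syllable = unicodedata.normalize("NFC", jamo)

# NFD takes the syllable apart again into the same jamo sequence.
decomposed = unicodedata.normalize("NFD", syllable)

print(len(jamo), len(syllable))   # code points before and after composition
print(f"U+{ord(syllable):04X}")   # code point of the composed syllable
print(decomposed == jamo)         # round trip
```

Composition and decomposition here are purely algorithmic, which is what makes the compositional model workable for Hangul.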

Even more difficult is the problem of normalization, which would be necessary for operations such as comparison or searching. A normalization algorithm would first have to parse the sequence of composing Han for validity, and then make sure that all substrings are normalized. It should also be able to recognize a "canonical" form for a sequence of composing Han. Thus, U+55F0 嗰 could be spelled using three pieces (U+53E3 口, U+4EBB 亻, U+56FA 固) as well as with two. Indeed, since U+4EBB 亻 is a well-known variant form of U+4EBA 人, it could be spelled using that character, as well. Providing a canonical representation would have to take these multiple spellings into account.
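(Another aside, not part of the FAQ: a rough illustration of the multiple-spellings problem, using Ideographic Description Sequences as a stand-in for a compositional encoding. The two sequences below are my own construction from the pieces the FAQ names, with U+2FF0 as the left-to-right description character.)

```python
import unicodedata

IDC_LEFT_RIGHT = "\u2ff0"  # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

encoded = "\u55f0"  # the single encoded character

# Two-piece spelling: U+53E3 next to U+500B.
two_piece = IDC_LEFT_RIGHT + "\u53e3" + "\u500b"

# Three-piece spelling: U+53E3 next to (U+4EBB next to U+56FA).
three_piece = IDC_LEFT_RIGHT + "\u53e3" + IDC_LEFT_RIGHT + "\u4ebb" + "\u56fa"

# Existing normalization leaves all three strings as they are,
# so nothing maps the spellings onto one canonical form.
forms = [unicodedata.normalize("NFC", s) for s in (encoded, two_piece, three_piece)]
distinct = len(set(forms))

print(distinct)                    # number of distinct strings after NFC
print(two_piece == three_piece)    # plain comparison of the two spellings
```

All three strings describe the same glyph, yet a search for one would not find the others without a Han-specific canonicalization step on top.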

The open-ended nature of the script and possibilities for ambiguous spelling make it virtually impossible to guarantee that two characters made up by two different people would be treated as equivalent even if they look exactly the same and are intended to be equivalent.

Other computer processes such as machine-based translation or text-to-speech would probably have to skip such characters when they occur in plain text, because there is no simple, authoritative way for these processes to be able to determine even approximate definitions or pronunciations from the visual representation alone. Even if the data are available, the need to parse strings of variable length before looking them up creates complications.

Finally, East Asian governments, while aware of the compositional nature of the script, do not wish to actively encourage the coining of new forms because of the practical problems they create. In particular, new coinages are rarely an aid to communication, since they have no obvious inherent meaning or pronunciation. They are little more than dingbats littering otherwise intelligible text.

While the number of encodable ideographs has proven far greater than Unicode had originally anticipated, the standard is in no danger of running out of room for them any time soon. 71,000 ideographs encoded in 17 years amounts to just over 4000 ideographs per year. At this rate, it would take nearly two hundred years to fill up the available space in Unicode with ideographs.
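(Aside, not part of the FAQ: the arithmetic is easy to check. The FAQ does not say how much space it counts as available, so the `ASSUMED_ASSIGNED` figure below is my own round number, roughly consistent with 71,000 being nearly three-quarters of the total. The result is only an upper bound, since it ignores private-use and other reserved ranges.)

```python
IDEOGRAPHS_ENCODED = 71_000   # figure quoted in the FAQ
YEARS = 17                    # figure quoted in the FAQ

# The Unicode codespace is 17 planes of 65,536 code points each.
CODESPACE = 17 * 65_536

# Assumption (mine, not the FAQ's): about 100,000 code points already assigned.
ASSUMED_ASSIGNED = 100_000

rate = IDEOGRAPHS_ENCODED / YEARS
years_upper_bound = (CODESPACE - ASSUMED_ASSIGNED) / rate

print(f"{rate:.0f} ideographs per year")
print(f"about {years_upper_bound:.0f} years at that rate (upper bound)")
```

Subtracting the reserved ranges from the free space would bring the figure down toward the FAQ's "nearly two hundred years".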

And while the number of unencoded but useful ideographs is larger than originally anticipated, it is also finite and probably smaller than the number of ideographs already encoded. The bulk of useful unencoded forms is likely to come from placenames, personal names, or characters needed for Chinese dialects other than Mandarin and Cantonese. Many unencoded forms occurring in existing texts are actually variants of encoded characters and would best be represented as such.

While it currently takes several years for the IRG to fully process proposed ideographs so that they can be encoded, steps are being taken to streamline this, and further steps will be possible in the future should they prove necessary. Indeed, the bulk of the work currently done by the IRG would still have to be done for composed ideographs in order to provide support for them beyond rendering.



