
For those interested in Chinese-language computing, there's a fascinating rabbit hole.

In too-brief summary, character-based Chinese has almost died twice in modern history.

First, with the typewriter: a mechanically unique machine had to be invented to cover the Chinese character set.

Second, with early low-bit computers, where the address space was insufficient to cover the character set, and what's essentially a multi-character-to-single-character encoding had to be invented.
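
To make the address-space point concrete, here's a rough sketch (Python, using its built-in gb2312 codec purely for illustration, not the historical systems themselves) of how an ASCII letter fits in one byte while a Chinese character needs two:

    # Rough illustration: an ASCII letter is one byte, while double-byte
    # encodings like GB2312 spend two bytes per Chinese character, because
    # thousands of characters can't fit in a single 8-bit code space.
    for text in ["a", "中"]:
        encoded = text.encode("gb2312")
        print(text, len(encoded), "byte(s)")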

More detail: https://www.radiolab.org/podcast/wubi-effect-2308

It'll be interesting to see how LLMs evolve, given how distinct Chinese is from alphabetic writing systems.




Chinese characters being unsuitable for QWERTY typewriters, and the limitations of early computers, didn't "almost kill character-based Chinese"; they just meant Chinese computing lagged behind. Inputting Chinese characters with a limited key set wasn't even conceptually novel, given the long history of structural decomposition and phonetic representation of Chinese characters; it just couldn't be implemented before computers were powerful enough.


It is a fact that Mao supported romanization. <https://np.reddit.com/r/todayilearned/comments/54ky8m/til_th...> Had the PRC switched decades ago, Taiwan would no doubt today cite its sticking with the traditional written language as yet more proof of being the "real" China, while the PRC would point to the millions of illiterates who were taught to read in the Western alphabet as proof of the superiority of its approach. (And vice versa, of course, had Taiwan been the one that chose to romanize.)


> it just couldn't be implemented before computers were powerful enough.

Wubi was presented by Wang at the 38th UN General Assembly in 1984.

So while they missed a few years, that's not bad.


If, like me, you want to see the typewriter, here's [0] a short blurb with it in use. Seems like it'd be quite difficult to use. Fewer parts to break down than a standard typewriter, though, surprisingly.

0: https://www.bl.uk/history-of-writing/articles/the-double-pig...


While interesting, that podcast episode is just a watered-down version of Tom Mullaney's work. Here's the man himself covering this subject in more depth and with more seriousness:

https://m.youtube.com/watch?v=KSEoHLnIXYk&t=310s&pp=ygUSY2hp...


When you learn Chinese, they teach you that the more complex characters are just compounds of smaller components combined into a single character. I imagine tokenizing it isn't much different from other languages -- the complex characters can be tokenized into their smaller parts.
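
For what it's worth, here's a hedged sketch of one common approach today (plain Python, not any particular model's tokenizer): many LLM tokenizers start from UTF-8 bytes rather than graphical components, so a single character initially splits into several byte-level units, and BPE-style merges then decide whether those get glued back into one token:

    # Minimal sketch, not a real tokenizer: byte-level tokenization sees each
    # Chinese character as its UTF-8 bytes (three for most CJK characters),
    # rather than as radicals or other graphical components.
    text = "你好"  # "hello"
    for ch in text:
        print(ch, "->", [hex(b) for b in ch.encode("utf-8")])
    # 你 -> ['0xe4', '0xbd', '0xa0']
    # 好 -> ['0xe5', '0xa5', '0xbd']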



