For those interested in Chinese-language computing, there's a fascinating rabbit hole.
In too-brief summary, character-based Chinese has almost died twice in modern history.
First, with the typewriter, where a mechanically unique typewriter had to be invented to span the Chinese character set.
Second, with early low-bit computers, where the address space was insufficient to span the character set, and what's essentially a multi-character-to-single-character encoding had to be invented.
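The multi-character workaround described here survives in legacy encodings like GB2312, where each Chinese character is stored as a pair of bytes so that thousands of characters fit in an 8-bit-byte world. A quick sketch in Python (assuming nothing beyond the standard library's codecs):

```python
# One Chinese character, two legacy bytes: GB2312 packs each character
# into a two-byte pair, while UTF-8 needs three bytes for the same one.
ch = "中"  # "middle", as in 中国 (China)

gb = ch.encode("gb2312")
utf8 = ch.encode("utf-8")

print(len(ch))    # 1 character
print(len(gb))    # 2 bytes in GB2312
print(len(utf8))  # 3 bytes in UTF-8
```

The two GB2312 bytes are each drawn from a range outside printable ASCII, which is how software of that era told single-byte Latin text apart from double-byte Chinese text in the same stream.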
Chinese characters being unsuitable for QWERTY typewriters, and early computers' limitations, didn't "almost kill character-based Chinese"; it just meant Chinese computing lagged behind. Inputting Chinese characters with a limited key set wasn't even conceptually novel, given the long history of structural decomposition and phonetic representation of Chinese characters; it just couldn't be implemented before computers were powerful enough.
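The phonetic route mentioned above is essentially how most people type Chinese today: you type a Latin-letter syllable and pick from a candidate list. A minimal sketch, with a made-up two-entry dictionary standing in for a real IME database:

```python
# Toy pinyin input method: map a typed syllable to candidate characters.
# The dictionary below is illustrative only, not a real IME database.
candidates = {
    "ma": ["马", "妈", "吗", "骂"],       # horse, mother, question particle, to scold
    "zhong": ["中", "种", "重", "钟"],    # middle, kind, heavy, bell
}

def lookup(pinyin: str) -> list[str]:
    """Return candidate characters for a pinyin syllable, if any."""
    return candidates.get(pinyin, [])

print(lookup("ma"))   # ['马', '妈', '吗', '骂']
print(lookup("xyz"))  # []
```

A real input method also ranks candidates by frequency and context, which is exactly the part that needed computers powerful enough to be practical.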
It is a fact that Mao supported romanization. <https://np.reddit.com/r/todayilearned/comments/54ky8m/til_th...> Had the PRC switched decades ago, Taiwan would no doubt today cite its sticking with the traditional written language as yet more proof of being the "real" China, while the PRC would point to the millions of illiterates taught to read in the Western alphabet as proof of the superiority of its approach. (And vice versa, of course, had Taiwan been the one to romanize.)
if like me you want to see the typewriter, here's [0] a short blurb with it in use. seems like it'd be quite difficult to use. fewer parts to break down than a standard typewriter though, surprisingly.
While interesting, that podcast episode is just a watered down version of Tom Mullaney's work. Here's the man himself covering this subject in more depth and with more seriousness:
When you learn Chinese, they teach you that the more complex words are just compounds of smaller words put into a single character. I imagine tokenizing it isn't much different than other languages -- for the complex characters you can tokenize it into its smaller parts.
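Worth noting that modern byte-level tokenizers (e.g. BPE over UTF-8) don't see characters or components at all; a single Chinese character already arrives as three bytes. Structural decomposition is a separate idea. A sketch of both views, with a three-entry toy table standing in for a real decomposition database:

```python
# Two ways a machine can "split" a Chinese character:
# 1) as raw UTF-8 bytes (what byte-level tokenizers actually consume);
# 2) as structural components (the compound structure described above).
# The components table is a toy illustration, not a real database.
components = {
    "好": ["女", "子"],  # good = woman + child
    "明": ["日", "月"],  # bright = sun + moon
    "妈": ["女", "马"],  # mother = woman + horse (phonetic part)
}

ch = "明"
print(len(ch))                  # 1 code point
print(len(ch.encode("utf-8")))  # 3 UTF-8 bytes
print(components[ch])           # ['日', '月']
```

So a byte-level model splits 明 into three opaque bytes, not into 日 and 月; whether it ever recovers the component structure is learned, not built in.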
More detail: https://www.radiolab.org/podcast/wubi-effect-2308
It'll be curious to see how LLMs evolve, given how distinct Chinese is from alphabetic scripts.