
LLM outputs in Standard Chinese are fluent and coherent, but the grammatical style and word choice are clearly non-native. I think the source training material just isn't as comprehensive as English, and the contemporary Chinese Internet is packed full of Latin abbreviations and obscure homophones/wordplay due to its evolution under severe censorship.



Awesome observation re: censorship. Last I looked into training sets, Chinese and English were by far the frontrunners in available corpus size, and both benefit from a certain amount of government-enforced homogeneity (a standard variety is taught in compulsory education).

Arabic is safe from automation due to its limited exposure on the open web and the extreme variability of its dialects, and Icelandic will remain obscure due to small sample size and the fact that Icelanders mostly speak English online.


The Icelandic government is apparently partnering with OpenAI to "preserve their language". I suppose they have a GPT-4 customized for the language.


oh neat, thanks for the heads up


Arabic is also often written online in the so-called Arabic chat alphabet / Arabizi, so a lot of Arabic text available on the internet looks less like حروف عربية ("Arabic letters") and more like "7ruf 3rabiye".
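For readers unfamiliar with the convention: Arabizi substitutes digits for Arabic letters that have no close Latin equivalent. A minimal Python sketch (my own illustration, not from the thread; the digit table covers only a few of the most common conventions, which vary by region and writer):

    # Common Arabizi digit-to-letter conventions (a small, illustrative subset;
    # real usage is inconsistent across regions and writers).
    ARABIZI_DIGITS = {
        "2": "ء",  # hamza (glottal stop)
        "3": "ع",  # ayn
        "5": "خ",  # kha
        "7": "ح",  # ha (the "7" in "7ruf")
    }

    def undigit(text: str) -> str:
        """Replace Arabizi digits with Arabic letters; everything else passes through."""
        return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)

    print(undigit("7ruf 3rabiye"))  # -> حruf عrabiye (digits only; Latin letters untouched)

This ambiguity (is "3" a numeral or ع?) is part of why Arabizi-heavy corpora are hard to normalize for training.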


In terms of formal text, Chinese is still a very comprehensive language. However, in terms of user-generated content it has degraded severely, especially in the past ~5 years. Internet slang in English can be confusing too, but users have a choice about whether to use it. On Chinese social media, using abbreviations/homophones has become practically the baseline for certain common terms, because these platforms do so much preemptive self-censorship in response to the government's near-invisible red line, which fluctuates constantly.



