
LLM outputs in Standard Chinese are fluent and coherent, but the grammatical style and word choice are clearly non-native. I think the source training material just isn't as comprehensive as English, and the contemporary Chinese Internet is packed full of Latin abbreviations and obscure homophones/wordplay due to its evolution under severe censorship.



Awesome observation re: censorship. Last I looked into training sets, Chinese and English were by far the frontrunners in available corpus size, and both benefit from a certain amount of government-enforced homogeneity (a standard variety is taught in compulsory education).

Arabic is safe from automation due to its limited exposure on the open web and the extreme variability of its dialects, and Icelandic will remain obscure due to small sample size and the fact that Icelanders mostly speak English online.


The Icelandic government is apparently partnering with OpenAI to "preserve their language". I suppose they have a GPT-4 customized for the language.


oh neat, thanks for the heads up


Arabic is also often written online in the so-called Arabic chat alphabet / Arabizi, so a lot of Arabic text available on the internet looks less like حروف عربية ("Arabic letters") and more like "7ruf 3rabiye".
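For readers unfamiliar with the convention: Arabizi substitutes digits for Arabic letters that have no close Latin equivalent. A minimal Python sketch (my own illustration, not from the thread; the digit table covers only a few of the most common conventions, which vary by region and writer):

    # Common Arabizi digit-to-letter conventions (a small, illustrative subset;
    # real usage is inconsistent across regions and writers).
    ARABIZI_DIGITS = {
        "2": "ء",  # hamza (glottal stop)
        "3": "ع",  # ayn
        "5": "خ",  # kha
        "7": "ح",  # ha (the "7" in "7ruf")
    }

    def undigit(text: str) -> str:
        """Replace Arabizi digits with Arabic letters; everything else passes through."""
        return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)

    print(undigit("7ruf 3rabiye"))  # -> حruf عrabiye (digits only; Latin letters untouched)

This ambiguity (is "3" a numeral or ع?) is part of why Arabizi-heavy corpora are hard to normalize for training.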


In terms of formal text, Chinese is still a very comprehensive language. However, in terms of user-generated content it has degraded severely, especially in the past ~5 years. Internet slang in English can be confusing too, but users have a choice about whether to use it. On Chinese social media, using abbreviations/homophones has become practically the baseline for certain common terms, because these platforms do so much preemptive self-censorship in response to the government's near-invisible red line, which fluctuates constantly.



