
Do you know any good sources for large amounts of translated Cantonese text?



I would assume Hong Kong government records, since Cantonese is the official language there (at least for English<->Cantonese). I would also assume Guangdong province would be a source of material.


For most people in Guangdong, Cantonese is at most a spoken language. They learn Standard Written Chinese with Mandarin pronunciation at school and if they want to write down something said in Cantonese, they might substitute characters of equivalent meaning (e.g. 是 instead of 係) or with similar pronunciation instead of the "official" characters used in Hong Kong.

The Hong Kong government is not much different. Their actual language policy is "Chinese and English are the official languages of Hong Kong. Committed to openness and accountability, the Government produces important documents in both English and Chinese. Correspondence with individual members of the public is always in the language appropriate to the recipients. Simultaneous interpretation in English / Cantonese / Putonghua is made available to meetings of the Legislative Council and Government boards and committees as needed." https://www.csb.gov.hk/english/aboutus/org/scsd/1470.html

So they recognize Cantonese and Putonghua as different spoken forms, but only one written language. I've never seen a Hong Kong government website offer translation into both Cantonese and Mandarin; it's always just Standard Written Chinese with a choice of Traditional or Simplified characters.

Most written Cantonese content on the internet is probably produced by Hong Kongers in informal contexts such as forums, but then it's not clearly marked as such and might be mixed with Standard Written Chinese and English.
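To make the "not clearly marked" problem concrete, here is a rough sketch of how one might flag forum text as likely written Cantonese by counting characters that are specific to it (such as 係) against their Standard Written Chinese counterparts (such as 是). The marker sets, function names, and the 0.5 threshold are all my own assumptions for illustration, not an established method, and the lists are far from exhaustive.

```python
# Hypothetical marker sets, chosen for illustration only (not exhaustive).
# Characters that are typical of written Cantonese:
CANTONESE_MARKERS = set("係嘅咗唔喺嘢冇哋佢啲咁嗰")
# Common Standard Written Chinese function characters (Traditional and Simplified):
STANDARD_MARKERS = set("是的了不在們们他這这那")


def cantonese_score(text: str) -> float:
    """Return the share of marker characters that are Cantonese-specific.

    Returns 0.0 when the text contains no marker characters at all.
    """
    canto = sum(ch in CANTONESE_MARKERS for ch in text)
    standard = sum(ch in STANDARD_MARKERS for ch in text)
    total = canto + standard
    return canto / total if total else 0.0


def looks_cantonese(text: str, threshold: float = 0.5) -> bool:
    """Crude filter: True if at least `threshold` of the markers are Cantonese."""
    return cantonese_score(text) >= threshold
```

Calling `looks_cantonese()` on each forum post would give a first-pass filter. It will misfire on mixed text, on posts that substitute standard characters for the Cantonese ones, and on words like 的士 that use a "standard" marker inside Cantonese writing, so treat the result as a hint rather than a label.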

I think the largest collection of monolingual Cantonese text is probably Cantonese Wikipedia (which is small) and the largest collections of translated Cantonese I'm aware of are even smaller: Tatoeba with 6095 sentences https://tatoeba.org/eng/sentences/show_all_in/yue/und and CantoDict with 1547 sentences http://www.cantonese.sheik.co.uk/scripts/examplelist.htm

But there might be larger datasets I'm not aware of, which is why I was asking.


It's interesting and sad to see the forced assimilative process of erasing written Cantonese. I remember HK in the late 90s still had newspapers that published in written Cantonese. Just across the border in Shenzhen, without the British influence and prior to the explosion of industry and tech in the early 2000s, you could still see nonstandard signage in Cantonese in store windows.

I think getting rid of spoken Cantonese is likely a generational effort and not something that can be done in a decade or so, but I've both experienced and done field work on how the Wu dialects were more or less systematically erased from official, and now even private, realms. The Shanghai variety, itself developed only in the early 1800s from a pidgin of the Suzhou and Nanjing varieties mixed with northern influences, is actually quite well documented in writing by foreign sources. A pidgin that developed from it together with English and Portuguese also survives in English sources, and it has been studied in great detail by Chinese authors in English, but not in Chinese to anything close to the same degree.

Starting with the millennial generation, speaking the dialects in schools, even outside of class, became subject to punishment. With public education enforcing the rule from the pre-kindergarten level, across two or three generations even those whose first language is one of the dialects were more or less forced into becoming Mandarin speakers and lost their fluency. I have little reason to doubt that something similar will simply be forced upon Hong Kong as well.

Luckily sci-hub is your friend, and from a cursory search written Cantonese seems to be better represented than written Wu.


BC is a bit of a quiet haven for Cantonese culture. Check UBC, I know they do language preservation projects related to Cantonese.


Thanks for the tip. Their Cantonese program's website is here: https://cantonese.arts.ubc.ca/

Via the announcement for the "Language Archiving in the Digital Era" workshop https://cantonese.arts.ubc.ca/language-archiving-in-the-digi... ...

I found the "Corpus of Mid-20th Century Hong Kong Cantonese" https://hkcc.eduhk.hk/

In typical academic fashion, it's behind a login wall and doesn't offer an easy way to download the whole corpus. (Understandable, given that it's based on transcribing movies that are probably still copyright-protected, but annoying.) Also, no translations.

They mention 香港粵語語料庫 as a related project, but the link is dead. I found what appears to be the new website: http://compling.hss.ntu.edu.sg/hkcancor/

That corpus is CC-BY licensed (yay!) and puts the download page front-and-center, so I like it. There are no translations either, but recordings are included, so it might still be useful for a project of mine.

Thanks again!


Modern NMT techniques don't necessarily need translated (parallel) text.



