Here is some information about romanisation of Cantonese if you are interested:
The romanisation of Cantonese has an interesting history! The Yale romanisation system [1] is (IMO) the most readable. It was later refined as Jyutping [2], a method used in more academic contexts which IMO is less readable (both are available in GBoard as Cantonese input methods). However, most personal and place names in HK use an older system [3] developed in the 1880s by Christian missionaries.
When people use Cantonese romanisation in casual text chats on instant messaging or social media platforms, it's usually a mix of systems [1] and [3], rarely [2], and without the tone information (so there are lots of many-to-one mappings). It is also mixed in with bits of English, which makes it hard to understand (even for a local Hong Kong person) without good prior context of the entire conversation.
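To make the many-to-one point concrete, here is a minimal sketch in Python. The six syllables are the usual textbook illustration of the six Jyutping tones on "si"; the function names are just made up for this example, and real chat text is far messier than this.

```python
import re
from collections import defaultdict

# Jyutping syllables with tone digits, each paired with one common character.
# This tiny table is illustrative only, not a real input-method dictionary.
TONED = {
    "si1": "詩",
    "si2": "史",
    "si3": "試",
    "si4": "時",
    "si5": "市",
    "si6": "事",
}


def strip_tone(syllable):
    """Drop the trailing tone digit, as casual chat romanisation does."""
    return re.sub(r"[1-6]$", "", syllable)


def collapse(toned):
    """Group characters by their toneless spelling."""
    groups = defaultdict(list)
    for syllable, char in toned.items():
        groups[strip_tone(syllable)].append(char)
    return dict(groups)


print(collapse(TONED))
```

Once the digits are gone, all six characters sit behind the single spelling "si", so a reader (or a program) has to fall back on context to pick one.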
Not a Googler so I can only guess. But it seems like Google did try to treat Cantonese as a Chinese variant in the past and eventually dropped that approach, probably because they realised the two are too different.
I know Google is actively working on the Cantonese version of Google Assistant, though not sure when it'll be officially released.
Whatever it is, Cantonese has different pronunciation, vocabulary and even grammar from Mandarin, which means it takes a non-trivial amount of work to adapt a language model designed for one to the other.
Source: I'm a native speaker of one and fully fluent in the other.
afaik Google needs a multilingual corpus. so if Cantonese is mostly written using Chinese characters, the corpus will be in Chinese characters.
and if written Cantonese is mostly informal (conversation, shop signs) it will not often be multilingual. so the approach that has worked for most languages wouldn't work then.
and it surely wouldn't work for a completely different, lossy orthography - without independent training.
I remember this being similar to the reason claimed for the use of leet speak back in the day (i.e., that it prevented searching of messages by 'the man'). Interesting to see something so similar actually being applied in practice today.
As a side note: if that happened here in Flanders, not even Dutch speakers would be able to translate it. We have so many local dialects with their own pronunciation and region-bound words that only local people would understand what is written.
There's nothing which will translate that as far as I know. It's an entirely different problem from Chinese characters, since Cantonese is a separate language and the phonetics are not even standardized in these messages, so there's no 1 to 1 mapping to characters either.
If you type the phonetic words in using the Google Keyboard in JyutPing mode, you will usually get the correct Chinese characters. The thing that would defeat this, though, is the deliberate introduction of very colloquial Chinglish puns.
I tried feeding some of the phonetically spelled stuff into Google Translate and it was completely lost at sea. So no worries about the claim that this could stuff up NLP/search, then.
No, you can actually type this straight into Google Translate and it will work fine. After selecting Chinese as the language, you have to set the input method as "Cantonese." Of course, Google Translate is banned in China. ;)
This isn't a secret code or anything; it's a standard romanization that almost everyone who learns Cantonese formally will learn. The thing is---formal education in Cantonese and other non-Mandarin Chinese languages is banned in schools in China. Mainlanders that speak Cantonese as a primary language often don't even know how to use it (I talked to a Cantonese-speaking girl from Zhuhai, and she was like "Can you show me what Jyutping looks like?" Bizarre).
It's pretty smart, and a bit of a slap in the face to the establishment, which has been forcing Mandarin down people's throats for the past 50 years or so.
Many things are banned in China, especially web services, in order to create an internal (was possibly exportable incubation, but not since 'progression' over the past 5 years after the government - hegemony of leaders and follows on, rather than government workers, themselves the mules).
But translate.google.cn is not banned on the mainland, and Google services work fine in Hong Kong.
And forcing Mandarin down people's throats is not all bad in terms of literacy, considering that 70 years ago most of the country was illiterate, not only in language but also in ideas, such as basic Western ideas in medicine, which led to a huge reduction in infant mortality once the doctors and the literate went to the countryside.
And across the sea, river and swamp was Hong Kong, which figured this out much earlier and flourished.
But.. translate.google.cn is not blocked in China.
Mandarin being the primary language of education doesn't mean that Cantonese is prohibited; it's just not mandatory, so most native Cantonese speakers aren't going to get a formal education in it unless they specifically seek it out.
Many people type with Cangjie because that's what gets taught. Very few people know Jyutping, and even fewer type with it; most people encountering Jyutping for the first time wouldn't know how to pronounce it, largely because of the unfamiliar 'j' and 'eo'.
For day to day transliterations of names etc there's actually no standard, just some loose rules based on English pronunciation.
In HK, pronunciation of Cantonese words is usually just learnt orally without much formal Romanization. You wouldn't encounter Jyutping unless you're a linguist studying the language, or you study it as a non-native speaker. Similarly, a native English speaker most likely didn't learn English words by using e.g. the International Phonetic Alphabet.
As an aside, this has caused colloquial Cantonese pronunciation to "shift" over the years [0]; it's called "lazy tone". E.g. for 你 (you), the proper way is 'nei5', but most younger people say it as 'lei5'. It's a big debate whether this trend ought to be stopped or not.
Most people use either Cangjie or the number pad on their phones. Office communications are generally conducted in English but where Chinese is needed, Cangjie is the preferred method. Our keyboards all have both QWERTY and 手田水口廿卜 printed on them.
Yeah, I just tried it and didn't know which characters to pick after the initial "heung gong" in "heung gong ya gan yau". So even if it doesn't actually break automatic translation, it does break the user like you say :)
You also might try Bing Translate, which actually has Cantonese translation. Google Translate can take Cantonese input, but since it's really just a Mandarin translation engine, it won't correctly interpret Cantonese words and grammar.
Hmm, at your suggestion I've just tried Bing and it doesn't seem to know what to do with the Romanisation? Interesting to know that Bing explicitly has Cantonese translation catered for but Google doesn't. Wonder why that is?
The step that's missing for me is how to turn the Romanisation into Chinese characters.
The examples in the article miss the tones. It's like writing pinyin without tones (even with tones it's much less clear than using characters).
This means that understanding a sentence requires a good knowledge of the oral language and to read it all to extract the meaning from the overall context.
That seems to be the point: Software tools will be completely lost.
How quickly could China create a dataset to translate this into Cantonese though? Probably in a year or less. It’s a game of cat and mouse and the cat will win in the end.
I thought the idea of a game of cat and mouse is that you never actually reach a conclusion?
Pace the obvious and awesome power of the Chinese surveillance state, nothing is foregone.
Edit: it's true that you could eventually translate the literal stuff back into Chinese characters, I'm sure, but that doesn't mean things are predetermined in a wider sense.
Hong Kong is not the only place where people speak Cantonese. The entire region of Guangdong (Canton) speaks this dialect. Do they believe all "trolls" speak Mandarin?
I was lucky to have the chance to take a natural language processing course in the spring from a professor who was very knowledgeable and passionate about the subject.
Sentiment extraction, semantic meaning extraction, categorization... these are all really hard problems (to do automatically) even on properly spelled and grammatically correct text. I would imagine they are even harder in Chinese, which as I understand it has several different writing systems.
The HK protesters are clearly quite clever. If they keep using different obfuscation schemes for text, I could see it forcing the mainland to use human beings to read every post. Which I'm sure they have the resources to do, but it's still more expensive than using a machine.
Some strategies I would expect to be effective:
* Using alternative phonetic encoding (i.e. what is shown in the article, using Latin letters to spell out sounds rather than words)
* Homoglyph attacks
* Using deliberately incorrect or ambiguous grammatical structure
* Using deliberately incorrect spacing and punctuation (for example "m ee t. me;? b!y th e do.,c?s a!.t; m id ni;ght" will completely bewilder all the parsing packages I'm aware of)
* Convert the text to images and post those, possibly adding graphical text which will confuse OCR packages
Mix and match for even more fun!
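As a rough illustration of the spacing and punctuation trick, here is a small Python sketch. The vocabulary and the tokenizer are toy assumptions invented for this example (real NLP pipelines are more robust than this), but it shows how quickly a naive word lookup falls apart on the sample sentence from the list above.

```python
import string

# Toy vocabulary: an assumption for this sketch, not a real lexicon.
VOCAB = {"meet", "me", "by", "the", "docs", "at", "midnight"}


def naive_tokens(text):
    """Split on whitespace and trim punctuation from the ends of each piece."""
    pieces = (p.strip(string.punctuation).lower() for p in text.split())
    return [p for p in pieces if p]


def known_fraction(text):
    """Fraction of tokens that the toy vocabulary recognises."""
    tokens = naive_tokens(text)
    if not tokens:
        return 0.0
    return sum(t in VOCAB for t in tokens) / len(tokens)


clean = "meet me by the docs at midnight"
noisy = "m ee t. me;? b!y th e do.,c?s a!.t; m id ni;ght"

print(known_fraction(clean))
print(known_fraction(noisy))
```

Every token of the clean sentence is recognised, while almost none of the noisy tokens are, even though a human reads both the same way. A censor could of course strip all punctuation and re-segment, which is why mixing several schemes matters.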
There are also lots and lots of steganographic techniques, but those are a lot less accessible to laypeople.
I'm not familiar with NLP tools and techniques available for Chinese, but most parsers/taggers for English aren't really written with adversarial inputs in mind. It would probably be possible to deliberately construct valid (or at least decipherable to a human) English text that would crash the common tools available.
As an aside, the articles that keep coming out about the HK protesters' tactics are starting to seem a lot like Cory Doctorow's "Little Brother"[1], which is available for free, and definitely worth a read.
There is state-sponsored warfare occurring on the internet right now. As with all war, each side desires a specific outcome; they're influencing en masse - the people they want - to achieve their outcome.
[1] https://en.wikipedia.org/wiki/Yale_romanization_of_Cantonese
[2] https://en.wikipedia.org/wiki/Jyutping
[3] https://en.wikipedia.org/wiki/Standard_Romanization_(Cantone...