I was lucky to have the chance to take a natural language processing course in the spring from a professor who was very knowledgeable and passionate about the subject.
Sentiment extraction, semantic meaning extraction, categorization... these are all really hard problems (to do automatically) even on properly spelled and grammatically correct text. I would imagine they are even harder in Chinese, which as I understand it has several different writing systems.
The HK protesters are clearly quite clever. If they keep using different obfuscation schemes for text, I could see it forcing the mainland to use human beings to read every post. Which I'm sure they have the resources to do, but it's still more expensive than using a machine.
Some strategies I would expect to be effective:
* Using alternative phonetic encoding (i.e. what is shown in the article, using Latin letters to spell out sounds rather than words)
* Homoglyph attacks
* Using deliberately incorrect or ambiguous grammatical structure
* Using deliberately incorrect spacing and punctuation (for example "m ee t. me;? b!y th e do.,c?s a!.t; m id ni;ght" will completely bewilder all the parsing packages I'm aware of)
* Converting the text to images and posting those, possibly adding graphical text which will confuse OCR packages
Mix and match for even more fun!
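To make a couple of these concrete, here is a rough Python sketch of how the spacing/punctuation trick and a homoglyph swap get past a naive keyword matcher. Everything in it is an assumption for illustration: the keyword list, the tiny `HOMOGLYPHS` table, and the function names are made up, and a real filtering system is presumably much more sophisticated than a substring check.

```python
# Hypothetical table: Latin letter -> visually similar Cyrillic letter.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic a
    "c": "\u0441",  # Cyrillic es
    "e": "\u0435",  # Cyrillic ie
    "i": "\u0456",  # Cyrillic Byelorussian-Ukrainian i
    "o": "\u043e",  # Cyrillic o
    "p": "\u0440",  # Cyrillic er
}


def naive_filter(text, keywords):
    """Flag the text if any keyword appears as a plain substring."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def swap_homoglyphs(text):
    """Replace Latin letters with look-alikes from the table above."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def normalize(text):
    """One possible countermeasure: keep letters only, dropping spaces and punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


keywords = ["docks", "midnight"]  # made-up watch list
plain = "meet me by the docks at midnight"
noisy = "m ee t. me;? b!y th e do.,c?s a!.t; m id ni;ght"
swapped = swap_homoglyphs(plain)

print(naive_filter(plain, keywords))               # plain text
print(naive_filter(noisy, keywords))               # spacing/punctuation noise
print(naive_filter(normalize(noisy), keywords))    # noise stripped first
print(naive_filter(swapped, keywords))             # homoglyph swap
print(naive_filter(normalize(swapped), keywords))  # stripping does not undo the swap
```

The plain message is flagged, the noisy one is not, and stripping non-letters recovers "midnight" from the noisy one. The homoglyph version slips past both checks, since the Cyrillic look-alikes still count as letters. Each scheme needs its own countermeasure, which is why rotating or combining schemes drives up the cost of automated filtering.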
There are also lots and lots of steganographic techniques, but those are a lot less accessible to laypeople.
I'm not familiar with NLP tools and techniques available for Chinese, but most parsers/taggers for English aren't really written with adversarial inputs in mind. It would probably be possible to deliberately construct valid (or at least decipherable to a human) English text that would crash the common tools available.
As an aside, the articles that keep coming out about the HK protesters' tactics are starting to seem a lot like Cory Doctorow's "Little Brother"[1], which is available for free, and definitely worth a read.
1 - https://craphound.com/littlebrother/Cory_Doctorow_-_Little_B...