An Analysis of WeChat’s Realtime Image Filtering in Chats (citizenlab.ca)
157 points by bookofjoe on July 22, 2019 | hide | past | favorite | 66 comments



It looks like something similar is going to be legally required in the EU [1]. For encrypted chat apps like WhatsApp, the content filtering database can simply be baked into the client application [2].

[1] https://torrentfreak.com/eu-members-approve-upload-filters-f...

[2] https://www.bloomberg.com/opinion/articles/2019-01-25/how-to...


Er.... no. That's not what was passed by the EU.

The Copyright Directive is targeted at large "online content-sharing services" - in effect, services like YouTube, not messaging apps - and says that they are liable for what they host. This means that (1) they need to seek a license for any copyrighted content and (2) if they can't do that, find out a way to stop the copyrighted content from appearing.

Plus, member states are now in a two year period to get all this stuff into national law. So nothing is required right now, and chat apps are firmly and completely off limits for this legislation.


This isn't the same thing as the copyright directive. There is some ambiguity in the proposal about whether they would have to filter things that are shared privately (as opposed to posted publicly). But it repeatedly uses phrases like "uploaded to the service" which implies that things like Facebook Messenger, Instagram DMs, and so on would be covered.


I'm really hoping it gets watered down by the European Court of Justice, but that would take a long time.


I'm not sure e2e-encrypted applications can reasonably be expected to handle this. The bytes they are transmitting are effectively not copyrighted, and cannot be deciphered into copyrighted information without something the company does not have. I could also see a huge increase in client-side processing requirements, meaning this could translate to substantial financial loss and an undue burden.


> The bytes they are transmitting are effectively not copyrighted

Eh, that's too limited a definition of copyright. Technical implementation doesn't really matter to the law; only whether or not you are in fact transmitting copyrighted information. That is decidedly the case: the author is knowingly sending a certain piece of content, which the receiver can then access.

Here's a great article on it: https://ansuz.sooke.bc.ca/entry/23


On the contrary, client-side processing means it burns battery on users' devices rather than consuming any cloud resources. It could very well be cheaper for the companies to implement.


If it is running on my device, the software can be cracked. This will prove fun.


That some technically proficient people will find ways around it is beyond doubt; however, it may still prove to be an effective system for information control if the common user finds it difficult to circumvent.


I'm saddened by the fact that many of the brightest minds might be working on this.


Judging from this study, this is probably not that tricky a system to implement.

From the description, it seems they have a two-layer system: a synchronous layer and an asynchronous one. The synchronous layer filters images purely on their MD5 hashes, so it's basically a lookup-table check. The image then goes to the async OCR service to have its text extracted; if it's judged to violate the censorship rules, its hash is written back into the front layer's hash table.

Indeed, this is not very different from Facebook's automatic face recognition/tagging feature, which has been enabled forever. It's just that the volume of the system is pretty significant.
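The two-layer design described above could be sketched like this. Everything here is hypothetical (names, data structures), and `is_sensitive` is a trivial stand-in for the real OCR + keyword-matching step:

```python
import hashlib
import queue

# Hypothetical sketch of the two-layer filter described above.
BANNED_HASHES = set()          # front-layer lookup table (synchronous layer)
ocr_queue = queue.Queue()      # work queue for the async OCR layer

def md5_of(image_bytes: bytes) -> str:
    return hashlib.md5(image_bytes).hexdigest()

def send_image(image_bytes: bytes) -> bool:
    """Synchronous path: block only on a hash lookup, then hand off."""
    if md5_of(image_bytes) in BANNED_HASHES:
        return False               # filtered immediately
    ocr_queue.put(image_bytes)     # analyzed later, out of band
    return True                    # delivered (for now)

def async_ocr_worker(is_sensitive) -> None:
    """Asynchronous path: analyze each image; write violations back to layer 1."""
    while not ocr_queue.empty():
        img = ocr_queue.get()
        if is_sensitive(img):      # stand-in for OCR + keyword matching
            BANNED_HASHES.add(md5_of(img))
```

The notable consequence is the race: the first send of a new sensitive image slips through, and only identical resends after the async worker has run get blocked by the hash table.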


So all you would need is an app that very slightly modifies one random bit in every image you send, and it'll never be caught by the synchronous system? Does the async update retroactively delete older posted images? And if so, how long would that take?


Probably, an app that applies some random distortion to your source image, like a reCAPTCHA of your own.

Adversarial patches are also a thing, so one could be applied to distort the image in a way that's invisible to humans but misleads the machine learning models.

As for retroactive deletion, I think there is a possibility. If WeChat periodically scans your images and finds one violating the censorship rules, it will probably delete it and report it to the authorities.

Again, some encryption needs to be applied in this case. Just like in the old days.
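The bit-flip evasion being discussed rests on MD5's avalanche effect: any single-bit change yields an unrelated digest. A minimal sketch (the "image" is just placeholder bytes; a real tool would flip a bit inside pixel data so the file still decodes):

```python
import hashlib

def flip_one_bit(data: bytes, bit_index: int = 0) -> bytes:
    # Flip exactly one bit of the input. A real tool would pick a bit
    # inside the pixel data so the image still decodes and looks identical.
    i, b = divmod(bit_index, 8)
    out = bytearray(data)
    out[i] ^= 1 << b
    return bytes(out)

original = b"\x89PNG...pretend image bytes..."
tweaked = flip_one_bit(original, bit_index=200)

# Avalanche effect: the two digests share nothing useful in common,
# so a pure hash-lookup filter cannot relate them.
print(hashlib.md5(original).hexdigest())
print(hashlib.md5(tweaked).hexdigest())
```

This only defeats the synchronous hash lookup, of course; the modified image still lands in the async OCR queue.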


Technical measures like these don't have to be watertight to be effective with the 'right' laws in place; this has been demonstrated by the DMCA's restrictions on bypassing 'content protection measures'.

Those found to be using means to bypass image censoring will eventually see their 'social credit score' plummet and be kicked off WeChat and/or the 'net.


They say the async is operating on a timescale of seconds (presumably long enough for the batch OCR jobs to get around to it), so you can bypass it, but that won't do you much good.


Too smart for your own good.


Tell the physics department to open more faculty positions.


They would love to if they had funding for faculty positions.


Money's still green after all

(Or whatever color RMB is associated with)


Red


It’s a legal directive, not a monetary incentive


Considering the choice of MD5 as the hash function, not so bright I guess...


You're right.

Citizen Lab researchers exploited an MD5 weakness to answer questions about the system. While not a real problem in practice, it seems clear MD5 was not an ideal choice.

From the article, the researchers generated a forbidden and an allowed image with colliding hashes to prove WeChat was using MD5. The allowed image was subsequently banned as a result.

However, MD5 collision generation has some constraints. It's very hard to make an image collide with a particular known hash, but it's feasible (5 hours with a large GPU) to take two images and modify them until their hashes collide. Practically this means exploitation opportunities are rather limited, but a forced collision being possible at all seems non-ideal for an adversarial use case. There's also the risk that future cryptanalysis will further weaken MD5. Seems clear to me WeChat just should have used something like sha256.


As you pointed out, since this is an adversarial use case, a robust cryptographic hash function is the way to go. (For a non-cryptographic use case, SeaHash [1] would be the best choice; it is blazingly fast!)

BLAKE2b would have been the perfect choice given the adversarial nature, as it is both more secure and faster than MD5. [2]

MD5 is so broken that it's a really poor choice for any use case, cryptographic (fundamentally broken) or not (fundamentally slow).

[1] http://ticki.github.io/blog/seahash-explained/

[2] https://leastauthority.com/blog/BLAKE2-harder-better-faster-...
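For what it's worth, Python's standard hashlib exposes BLAKE2b directly, so comparing it with MD5 on the same input is a one-liner each (illustrative only; relative speed depends on platform and implementation):

```python
import hashlib

data = b"some image bytes"

md5_digest = hashlib.md5(data).hexdigest()        # 128-bit; broken for collisions
b2_digest = hashlib.blake2b(data).hexdigest()     # 512-bit by default; unbroken
# BLAKE2b can even be truncated to match MD5's digest size if a short
# fingerprint is all you want:
b2_short = hashlib.blake2b(data, digest_size=16).hexdigest()

print(len(md5_digest), len(b2_digest), len(b2_short))  # 32 128 32
```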


> Practically this means exploitation opportunities are rather limited, but a forced collision being possible at all seems non-ideal for an adversarial use case. There's also the risk that future cryptanalysis will further weaken MD5. Seems clear to me WeChat just should have used something like sha256.

With a billion people using phones, "good enough" is probably good enough.

Given the scale and scope of the Chinese security apparatus, anyone capable of using a GPU to hash out collisions is probably already known to the state. And the handful of collisions are probably not important enough to worry about -- a stealthy Winnie the Pooh image isn't a big deal.


Another constraint is going to be CPU/battery usage; MD5 probably has better hardware support in mobile processors, as well as being faster to compute than a 256-bit hash.


This is not cryptography. MD5 is a very popular choice for hashing static assets.


But this is an adversarial use case. You're not trying to cause md5 hash collisions in your own static assets, but WeChat users might benefit from that. See my other comment above.


Big whoop. The hash collisions of MD5 allow an "attacker" to prevent himself from posting his own image. A simpler and cheaper way to perform the same "attack" would be to just not post the image in the first place, as the outcome is the same.


Ignoring MD5/image format-specific collision realities, theoretically an attacker could submit a contraband image that collides with a valid, allowed image they may want to remove.

When action is taken on the first image, the collided image could also be censored.


Not with current technologies. There is a big difference between creating an image that collides with a specific target hash and just making two images that collide when you control both. The former is not currently possible, and the latter is.

That being said, you could probably create a pair of colliding images, give one to a news outlet or something, then later post the second (presumably banned) one. The app would on short notice need to decide between banning neither or banning both.


> That being said, you could probably create a pair of colliding images, give one to a news outlet or something, then later post the second (presumably banned) one. The app would on short notice need to decide between banning neither or banning both.

Yeah they did this - except the contraband was automatically recognised and both images were banned via hash.


What’s wrong with the application of MD5 to this process?


Why is MD5 a poor choice in this context?



Wrong, it's a digital fingerprint, not encryption. It suffices in most cases to prevent the same image from spreading once it has been blacklisted. Not all users are that tech savvy.


Interesting fragment:

> Moreover, we found that new accounts required approval from a second account that must have existed for over six months, be in good standing, and have not already approved any other accounts in the past month. Because of these requirements, we found that creating new WeChat accounts was prohibitively difficult.

I've noticed WeChat tightening up their accounts as well over the past months. I have been "lucky" to have created an account years ago, with a wallet that still works as well (as a non-Chinese, that is) without having to link a Chinese bank account/card. Friends of mine who visited recently were no longer able to do that with their accounts.


A tricky thing WeChat does in the UX is that a censored image looks like it was sent successfully in your chat history, but the other side never actually receives it.

It's like a black hole eating your messages without telling you.



This kind of thing reminds you of the fact that the reason we don't do business and trade with North Korea is because they're a geopolitical threat, not because they abuse their citizens. So long as a country like China's evil stops at the border the rest of the world could not give less of a shit.


China could easily go beyond its border without consequence. Look at how easily Russia took Crimea.


Or how easily China took Tibet.


Or the current claims in the South China Sea https://twitter.com/indopac_info/status/1102650429458931713


It was actually the Mongolians who took Tibet; the Chinese rulers then simply inherited it.


China taking Tibet is about the same as the U.S. taking the land of the Native American peoples and turning it into Indian reservations. Whites historically maltreated the Indians just as the Han people of China maltreated the Tibetans.


The moral stuff is just the justification, the rest is just realpolitik.


> Internet platform companies operating in China are required by law to control content on their platforms or face penalties

I recently learned that Signal [0] works in China. Are they forced to do the same?

[0]: https://en.wikipedia.org/wiki/Signal_Messenger


China manipulates data pretty much anywhere imaginable. See the Google Maps link [1] and the corresponding Baidu Maps [2] location. Notice how the Google Maps data has a huge disagreement between the road network and the satellite imagery? It's because if you do mapping in China, the government hands you a perturbation function to apply to each data layer. You have to warp your data per their function, and they can audit it. Baidu doesn't have to do this. However, both Google and Baidu maps are WAY off on GPS locations, 100+ meters off; everybody has to do that unless you have an accurate mapping license.

I realize this is a little off topic, but since I work on something which has a big China presence, I'm always running into their BS, and censorship is just one little piece of it. VPN connectivity to your non-China offices is also problematic, running TLS over the Chinese internet is also problematic, unless you use officially provided certs and keys, etc.

[1] https://www.google.com/maps/place/Beijing,+China/@39.7616007... [2] https://map.baidu.com/@12957558.390456071,4804287.368277797,...


About this perturbation function: almost the entire world uses the WGS84 datum. The Chinese use a datum that's similar but subtly different. If you didn't account for this datum you got shifted roads and features. The technical information about this datum has never officially been made public, but only licensed to certain companies. I'm fairly certain Google doesn't have such a license. You can find reverse engineered info online but there's no guarantee that those are correct.


You're referring to GCJ-02. From the best I can figure, it adds some form of multi-frequency noise to a shifted WGS84 coordinate, and it also seems that different companies are told to use different coefficients for the noise frequencies; that's how they can tell whether you're doing what you're told.

Regardless, it's difficult to work with. If you have a mapping license, you must also take serious precautions never to let the accurate map data leave China, or your Chinese employees are in deep trouble.
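The reverse-engineered form of GCJ-02 that circulates online (in "eviltransform"-style libraries) looks roughly like the sketch below. As the comments above note, the formulas and constants are community guesses with no official guarantee of correctness; this is illustrative only:

```python
import math

# Community reverse-engineered approximation of the GCJ-02 ("Mars
# coordinates") warp. Nothing here is official; the constants and sine
# terms are unverified guesses circulating online.
A = 6378245.0                 # semi-major axis used by the leaked formulas
EE = 0.00669342162296594323   # eccentricity squared

def _warp_lat(x, y):
    r = -100.0 + 2.0 * x + 3.0 * y + 0.2 * y * y + 0.1 * x * y + 0.2 * math.sqrt(abs(x))
    r += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    r += (20.0 * math.sin(y * math.pi) + 40.0 * math.sin(y / 3.0 * math.pi)) * 2.0 / 3.0
    r += (160.0 * math.sin(y / 12.0 * math.pi) + 320.0 * math.sin(y * math.pi / 30.0)) * 2.0 / 3.0
    return r

def _warp_lon(x, y):
    r = 300.0 + x + 2.0 * y + 0.1 * x * x + 0.1 * x * y + 0.1 * math.sqrt(abs(x))
    r += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    r += (20.0 * math.sin(x * math.pi) + 40.0 * math.sin(x / 3.0 * math.pi)) * 2.0 / 3.0
    r += (150.0 * math.sin(x / 12.0 * math.pi) + 300.0 * math.sin(x / 30.0 * math.pi)) * 2.0 / 3.0
    return r

def wgs_to_gcj(lat, lon):
    """Shift a WGS-84 point onto the (approximate) GCJ-02 grid."""
    dlat = _warp_lat(lon - 105.0, lat - 35.0)
    dlon = _warp_lon(lon - 105.0, lat - 35.0)
    radlat = lat / 180.0 * math.pi
    magic = 1 - EE * math.sin(radlat) ** 2
    dlat = (dlat * 180.0) / ((A * (1 - EE)) / (magic * math.sqrt(magic)) * math.pi)
    dlon = (dlon * 180.0) / (A / math.sqrt(magic) * math.cos(radlat) * math.pi)
    return lat + dlat, lon + dlon

# A Beijing-area point: the shift comes out to a few hundred meters.
print(wgs_to_gcj(39.9087, 116.3975))
```

Consistent with the "100+ meters off" observation above, the warp around Beijing works out to very roughly 150 m of latitude shift and 500 m of longitude shift, and it varies smoothly from place to place.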


Short video explanation of the China map topic:

https://www.youtube.com/watch?v=L9Di-UVC-_4


Change google.com to google.cn and it lines up.


What about OpenStreetMap?


Probably has the same issues.


You cited the actual Signal app, but provided no citation that Signal works in China. Can you please provide evidence of that?

Furthermore, just because it works in China doesn't mean that it won't cause your encrypted traffic to get flagged. This is why it's so difficult to build a threat model against a sophisticated network adversary like the Chinese government.



Signal works in Shanghai, albeit with occasional delays and hiccups.


Yeah, Signal seems to work fine in most of China.

Obviously there's not a snowball's chance in hell that Open Whisper Systems would ever even consider complying with any demand from the PRC.

I just don't think PRC authorities really care about Signal. It's not popular, not available in the app stores available to the Chinese masses, and anyone using it is probably just going to tunnel out anyway. They tend to ban things that are competitive, and Signal just isn't. It's a niche app.


> We found that the use of client re-encoding can make the image filtering more powerful in that it effectively results in an image’s hash representing all images with identical pixel values as opposed to merely a specific image encoding. The process of re-encoding extends the generalizability of hash-based filtering because any image containing the same pixel contents will be encoded to the same file and thus have an identical hash (see Figure 3). When this happens, any changes to the original image’s encoding or to its metadata will be ineffective at evading filtering, and some change to the image’s pixel values, if even a small one, will be required to change the resultant hash. When the client does not re-encode, then any change to an image’s file encoding, including to its metadata, is sufficient to change its hash.

So, just add a random off-color pixel to your image and the system will fail.
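The quoted re-encoding behavior can be modeled with a toy in-memory "image" (a pixel tuple plus a metadata dict; no real codec involved, and both functions are made up for illustration):

```python
import hashlib
import json

# Toy model: an "image file" is its pixel values plus arbitrary metadata.
# A real client would decode an actual JPEG/PNG; this just shows the idea.
def file_hash(pixels, metadata) -> str:
    # Hash of the full encoded file: metadata changes alter the hash.
    blob = json.dumps({"pixels": pixels, "meta": metadata}, sort_keys=True)
    return hashlib.md5(blob.encode()).hexdigest()

def reencoded_hash(pixels, metadata) -> str:
    # Hash after canonical re-encoding: only pixel values matter.
    blob = json.dumps({"pixels": pixels}, sort_keys=True)
    return hashlib.md5(blob.encode()).hexdigest()

pix = (255, 0, 0, 255)                       # same pixel contents...
h1 = reencoded_hash(pix, {"comment": "a"})
h2 = reencoded_hash(pix, {"comment": "b"})   # ...different metadata
print(h1 == h2)   # True: metadata edits don't evade a re-encoding client
print(file_hash(pix, {"comment": "a"}) == file_hash(pix, {"comment": "b"}))  # False
```

So, as the article says, against a re-encoding client only a change to the pixel values themselves (even a single off-color pixel) changes the resulting hash.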


No, according to the article this will only evade the hash-based lookup. All images that aren't in the hash database are then analyzed by some other, computationally more expensive system (OCR, …).


I doubt it's that simple. Perhaps they have an AI part that analyzes new images and adds their hashes to the blacklist.


Fascinating analysis given such limited information. It's amazing how much can be inferred.


While the crisis with fake news certainly warrants suspicion of the citizenry's ability to make informed decisions, how do pro-government citizens of countries without even the cultural ideal of free speech attempt to claim objective soundness for their political positions?


There's only one political position that anyone can speak, why would you need to justify it?


Would you also be open to question the existence of a "fake news crisis" itself, or were you convinced by the news reports that it exists?


Having seen its proliferation first-hand, I don’t think I need much convincing.



