It won't last, at least for China. Their government is working on a clone of Wikipedia, scheduled for 2018[0]. Once that's done they'll likely ban the original completely.
Wikipedia publishes database dumps every couple of days[1]. So it shouldn't be that expensive for smaller governments to create and host their own censored mirror. You'd maintain a list of banned and censored articles, then pull from Wikipedia once a month. You'd have to check new articles by hand (maybe even all edits), but a lot of that should be easily automated, and if you only care about Wikipedia in your native tongue (and it's not English) that's much less work.
The academics will bypass censorship anyway, since it's so easy[2], so an autocrat won't worry about intellectually crippling their country by banning Wikipedia. Maybe they don't do this because the list of banned articles would be trivial to get.
Better machine translation might solve this by helping information flow freely[3]. We have until 2018 I guess.
Wikipedia editors are pretty strict about what gets to remain a page. Everyone knows they delete articles unless they have lots of sources and public interest.
With 6x the articles on Baike, I can't imagine that there is that level of quality control. Unless there are 6x as many things worth documenting in China vs the rest of the world.
That doesn't surprise me; in Japan there are various Japanese wiki sites, usually with less information than even the Japanese versions of Wikipedia, but still with more articles. Usually they even have comment sections below the articles, which can become quite toxic, at least on certain articles.
Won't work - their moderation is really closed off. A few years ago they even renamed this article (https://en.wikipedia.org/wiki/Kievan_Rus%27) because they have an identity crisis - they try to pose as the oldest of the Slavic nations, the core nation that must therefore be obeyed (literally). So to shift history, they renamed the article to "Ancient Rus" to make people forget the Kiev part. (Not the only thing they do, of course.)
The Russian version of this article denies any involvement of Russia in the Russian-Ukrainian war, however weird that may sound. They are either complicit or so deep in denial that it is impossible to talk to them about the war.
Currently the Russian Wikipedia segment can't be trusted except for bare facts and non-political entries.
The Russian version of that article is currently Киевская Русь [1] (Kievan Rus), though Дре́вняя Русь (Ancient Rus) is listed as a synonym. So it seems that specific change has been reverted, right?
I don't believe so. I just tested one of my own websites, which only serves over HTTPS, from Hong Kong (admittedly a special case), and Beijing. It worked fine from both. Surprisingly, because I thought Adsense was blocked, an advert even appeared on the Beijing screenshot. On the other hand, it reported as temporarily unavailable from Shanghai.
Okay, thanks for testing for me. When I lived in Shanghai 5 yrs ago, I had a lot of trouble connecting to https and whenever possible, I would try and connect unencrypted.
That's interesting. My admittedly flawed understanding is that the Great Firewall of China isn't implemented with a unified set of policies, but rather varies from province to province, which might help explain your experience and my test results.
Can an expert comment on side-channel attacks on HTTPS and whether they're less viable on HTTP/2?
My assumption is that because wikipedia has a known plaintext and a known link graph it's plausible to identify pages with some accuracy and either block them or monitor who's reading what.
I also assume that the traffic profile of editing looks different from viewing.
> My assumption is that because wikipedia has a known plaintext and a known link graph it's plausible to identify pages with some accuracy
At least in theory, the latest versions of TLS should not be vulnerable to a known plaintext attack. TLS also is capable of length-padding, which would reduce the attack surface here as well for an eavesdropper.
My understanding is that HTTP/2 makes it even more difficult to construct an attack on this basis, because HTTP/2 means multiple requests can get rolled into one.
Of course, all this is assuming an eavesdropper without the ability to intercept and modify traffic. In practice, governments will probably just MITM the connection - we have precedent for governments abusing CAs like this in the past - and unless Wikipedia uses HPKP and we trust the initial connection and we trust that the HPKP reporting endpoint isn't blocked, then it's still possible to censor pages, without anybody else knowing[0].
[0] ie, the government censors will know, and the person who attempted to access the page will know, but neither Wikipedia nor the browser vendor would be able to detect the censorship automatically.
TLS 1.2 doesn't have an effective padding scheme, and with most sites (including Wikipedia) moving to AES-GCM and ChaCha20, the situation is actually worse than with the primitive CBC padding, which provided some protection.
TLS 1.3, which is still a draft, does have support for record-level padding, but I haven't seen any of the experimental deployments using it.
HTTP/2 does have support for padding, but again, it's not common to see it being used, at least not in the kind of sizes it would take to obscure content fingerprints.
Wikipedia is a particularly hard case to defend against traffic-analysis fingerprinting. First, the combination of page size and image sizes is just highly unique, even modulo large block/padding sizes. But more importantly, anyone can edit a Wikipedia page, so if the size of a target page isn't unique, it's very easy to go ahead and edit it to make it so. It would take very large amounts of padding to defeat this.
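To make that concrete, here's a toy sketch (the article sizes below are invented for illustration; a real observer would precompute them from the public dumps): a single observed transfer size already narrows the candidate set sharply.

    # Toy sketch of size-based fingerprinting; the sizes are made up.
    article_sizes = {
        "Cat": 48_312,
        "Dog": 51_007,
        "Tiananmen Square protests of 1989": 183_455,
    }

    def candidates(observed_bytes, slack=2_000):
        """Articles whose known size is within `slack` bytes of the
        observed (encrypted) transfer size."""
        return [title for title, size in article_sizes.items()
                if abs(size - observed_bytes) <= slack]

    print(candidates(184_100))  # -> ['Tiananmen Square protests of 1989']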
So it's definitely possible to fingerprint which Wikipedia page someone is browsing. But it's probably not easy to block it; the fingerprint is only detectable after the page has been downloaded. So it's not very useful for censorship.
> But it's probably not easy to block it; the fingerprint is only detectable after the page has been downloaded. So it's not very useful for censorship
Well, it's detectable after the request has been made and Wikipedia sends the response. Assuming that a government has the capabilities to block delivery of that response (which they do), they can still implement censorship at this level, before the page reaches the end user.
Except China has its own browser, made by a state-controlled company, that a lot of people use. This browser has already been demonstrated to accept the government CA, and ordinary people in China don't care.
And one thing to note is that people generally don't randomly pad the length of articles, so it's not _very_ difficult to figure out what articles you might be reading -- even over TLS.
Random padding wouldn't really help; an active attacker can force retries, so the random distribution can be mapped (and then subtracted out). To defeat traffic analysis you need to pad to a fixed length for all cases, or for a very large number of cases.
E.g. if every wikipedia page, plus all of the content it includes, came to exactly 10K, 20K, 30K, ... in size, then you could obscure what the user is reading.
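A minimal sketch of that bucket scheme (the 10K bucket size is just the arbitrary choice from the example above): every response is padded up to the next multiple of the bucket, so many different pages end up with identical on-the-wire lengths.

    # Sketch of fixed-bucket padding; the bucket size is an arbitrary choice.
    BUCKET = 10 * 1024

    def pad_response(body, bucket=BUCKET):
        """Pad `body` with filler bytes up to the next multiple of `bucket`."""
        remainder = len(body) % bucket
        if remainder:
            body += b" " * (bucket - remainder)
        return body

    # Any page between ~10K and 20K now looks the same size on the wire.
    assert len(pad_response(b"x" * 12_345)) == 2 * BUCKET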
I've seen the theory that you could work out which pages are loaded from wikipedia over SSL by looking at other metrics like content length etc, but thanks to stuff like gzip compression, caching headers etc, this is much harder to exploit in practice. Plus there's the huge overhead of maintaining a database to link the frequently changing metrics back to the appropriate page on wikipedia. There's a great link somewhere (which of course I now can't find) where somebody prototyped this idea and found it really pretty hard to implement.
In the event this was even tried, it would presumably be trivial to defeat with injection of random content somewhere in the server responses anyway. This of course all assumes we can trust the root certificate authority though :P
Yeah, good point. I presumed it would be a pain, but I never thought to see if someone actually tried it.
Though you shouldn't be compressing things over TLS. I think the only proper solution is to pad out all articles (and images) to the nearest 2kB or something so that you can't figure out the length (randomness can be thwarted by forcing refreshes).
The government could force PC manufacturers to deploy a root CA that it controls and then run a MITM proxy to read everything the user is doing. It could also redirect the Wikipedia domain to another domain that just acts as a reverse proxy, and deploy a legit cert on that other site.
AFAIK HSTS doesn't break TLS MITM. A valid x509 certificate is generated by the attacker (using a Certificate Authority trusted by the victim's browser) for the domain the victim is visiting and all is well for both TLS sessions (Client<->Attacker, Attacker<->Server). This all relies on the attacker having access to sign certs from the trusted CA.
Certificate pinning in the HTTPS client would mitigate TLS MITM (HPKP).
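For anyone wondering what a pin actually is: an HPKP pin is just the base64-encoded SHA-256 hash of the certificate's SubjectPublicKeyInfo. A rough sketch of computing one, assuming a PEM file on disk (the filename is a placeholder) and the `cryptography` package:

    # Compute an HPKP-style pin: sha256 over the SPKI, base64-encoded.
    # "wikipedia.pem" is a placeholder; requires the `cryptography` package.
    import base64, hashlib
    from cryptography import x509
    from cryptography.hazmat.primitives import serialization

    with open("wikipedia.pem", "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    spki = cert.public_key().public_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PublicFormat.SubjectPublicKeyInfo,
    )
    pin = base64.b64encode(hashlib.sha256(spki).digest()).decode()
    print('Public-Key-Pins: pin-sha256="%s"; max-age=5184000' % pin)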
There were a few censored pages on the Turkish Wikipedia when it was on HTTP. They were the "vagina" article and an election prediction article. Only those pages were censored.
Last month there were some articles on the English Wikipedia about ISIS and Erdoğan (whether true or not, I don't care). Then they blocked all of Wikipedia (all languages), because they were unable to block those individual pages.
Yup. Was there 2 weeks ago working with a group of Turkish engineers - I went online to get some technical information about a particular stream cipher, and WHOOPS! - Wikipedia is blocked, completely.
Fired up my VPN, accessed the page, thank you very much.
"The Net interprets censorship as damage and routes around it." - John Gilmore
That's just it; they can't! When you visit Wikipedia over HTTPS, the only thing actually visible in plain text is wikipedia.org, and that's only if your browser is using Server Name Indication (SNI).
Since the rest of the request, including the URL, is hidden, governments and other malicious agents between you and the server cannot actually see what pages you're requesting directly. They can only see that you are accessing wikipedia.org and transmitting some data. You may still be somewhat vulnerable to timing attacks that try to identify what pages you're viewing, but censorship can't happen at the page level over HTTPS; you have to block the whole thing in one go.
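You can see that split with a few lines of Python: the hostname passed as server_hostname goes out in the clear inside the TLS ClientHello (SNI), while the GET line with the path is only sent after the handshake, i.e. encrypted. (Just an illustrative sketch; the article path is an arbitrary example.)

    # Only the SNI hostname crosses the wire in clear text; the path does not.
    import socket, ssl

    ctx = ssl.create_default_context()
    with socket.create_connection(("en.wikipedia.org", 443)) as raw:
        # server_hostname is sent unencrypted in the TLS ClientHello (SNI).
        with ctx.wrap_socket(raw, server_hostname="en.wikipedia.org") as tls:
            # From here on the handshake is done, so the request path is encrypted.
            tls.sendall(b"GET /wiki/Censorship HTTP/1.1\r\n"
                        b"Host: en.wikipedia.org\r\nConnection: close\r\n\r\n")
            print(tls.recv(200))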
> Although countries like China, Thailand and Uzbekistan were still censoring part or all of Wikipedia by the time the researchers wrapped up their study
The top comment might be asking about the "were still censoring part" of the article.
Oh, huh! I missed that entirely, now I'm curious too. HTTPS should make that difficult, but China has been known to employ all sorts of weird shenanigans -- perhaps they're running a "trusted" MitM as part of the Great Firewall?
I know that certain companies (like Google and Microsoft) will actively censor themselves to continue to operate within China, but I figured Wikipedia would be against that practice on principle. Now I'm curious as to how it's done.
When I visited China a bunch of years ago, zh.wikipedia was completely blocked, and on English Wikipedia, only certain articles were blackholed (Tiananmen Square...)
Nitpick: Google opted to pull out of mainland China instead of self-censoring. They moved Chinese operations to Hong Kong, but operate uncensored there.
Google was perfectly willing to self-censor in China until they were hacked by the Chinese government in 2010. That's when Google China moved to Hong Kong.
This is how "domain fronting" (https://en.wikipedia.org/wiki/Domain_fronting) works as well -- encryption makes blocking an all-or-nothing deal, and blocking everything that goes to an extremely popular IP range / SNI causes too much collateral damage :)
For committed governments like China, TLS may just be an extra hurdle but they can get around it if they want. Basically China could simply implement a massive proxy that terminates TLS.
If your internet traffic is going to flow through infrastructure that a curious government owns, then you'll know that they're monitoring the traffic but there is no way to keep them from seeing what you're doing.
No, TLS is not vulnerable to a MITM unless a) your client trusts the certificates issued by the attacker, or b) the attacker successfully forges the certificate of the website you are trying to visit.
That is, assuming you don't click away your browser's security warning.
> TLS is not vulnerable to a MITM unless a) your client trusts the certificates issued by the attacker,
Or in other words it is vulnerable.
China can (and probably does) issue a certificate that all Chinese browsers must install; they can then MITM HTTPS, using that certificate to sign the substituted certs.
Companies do this routinely BTW. Since it's their equipment, it's considered just fine. (But be aware of it if you are using a company computer.)
"On Friday, March 20th, we became aware of unauthorized digital certificates for several Google domains. The certificates were issued by an intermediate certificate authority apparently held by a company called MCS Holdings. This intermediate certificate was issued by CNNIC."
There is a concern dating back many years that a government will mandate that UAs trust a particular government-controlled CA (that eventually, but maybe not at first, openly performs MITMs). This is one reason that browsers really want to keep control of their root programs and not be mandated by governments to include any particular trusted roots -- including to maintain a remedy against roots that do appear to deliberately facilitate MITMs.
Although there have been lots of concerns about CNNIC, I don't believe that the Chinese government currently either (1) routinely uses CNNIC to perform MITMs for censorship or mass surveillance purposes, or (2) purports to require UAs to trust CNNIC or another Chinese root in order to be used by Chinese users. I'm happy to be corrected if someone knows otherwise.
They wouldn't "routinely" abuse their root to monitor large populations. That would be too obvious and result in near-immediate loss of their precious root.
What's more dangerous, and much more likely, is that they might use forged certificates against specific individuals for a short period of time, for example, to intercept login credentials. The attack will go unnoticed as long as they also block the corresponding HPKP reporting URL (if the targeted site uses HPKP at all).
Revoking the root outside China will have no bearing within it. All devices sold and used in China could be forced to include that root. There is not a lot a user could do, especially on mobile if you have a locked phone and only access to the official app store.
There's a fine line between cartoon-villain evil, exemplified by people like Kim Jong Un who just doesn't seem to give a fuck, and just-enough-to-achieve-your-objectives-but-not-enough-to-make-too-many-people-notice evil, which is what China is aiming at.
Lots of people travel in and out of China with all sorts of computing devices. China does care about the reputation of their root and of their highly profitable electronic exports.
It isn't, but if you live in China and want to use the internet, you'll likely be forced to use a proxy that MITMs and serves its own certificate... My point is that TLS is not a solution to prevent government interference when the user has to rely on government infrastructure for access.
According to the paper, the answer is subdomains. For example, in one instance China blocked zh.wikipedia.org (the entire subdomain - they can't see what page you're visiting), but left the other 291 subdomains unblocked.
I wonder about this. If a government can hack into a server and steal the private encryption key, then they could just look like any other server in the server farm, right?
Given the recent Shadow Brokers release of the NSA tools, it seems to me that this was not only possible, but probable (not necessarily with Wikipedia, but any website).
Well, they can block it whole. Once they figure out they can't block parts, that's exactly what they will do. Either that, or a re-host on their own infrastructure with the offending parts removed, conveniently seizing whatever domain names Wikipedia has in that country for added authenticity.
Who says they can't see the URL? A sufficiently motivated government would probably be able to create forged certificates, and mass interception isn't really out of the question. Especially with browsers homogenizing on fast ciphers like AES-GCM and ChaCha20-Poly1305, I bet it's much more economical than you would think.
Cert pinning (HPKP) is one type of solution, but it's tricky to get right, especially for a large site like Wikipedia.
"In Turkey, Wikipedia articles about female genitals have been banned; Russia has censored articles about weed; in the UK, articles about German metal bands have been blocked..."
> Critics of this plan argued that this move would just result in more total censorship of Wikipedia and that access to some information was better than no information at all
I'm no critic of this plan but I still don't understand why this wouldn't result in more total censorship. Someone explain please?
Because Wikipedia is too useful. Note that it required a certain self-confidence that this was the case for Wikipedia to implement this strategy. And it's self-fulfilling - if Wikipedia allowed itself to be censored, then it would have fewer contributors and its usefulness would suffer.
There's a rather interesting analogy to be made with the GPL here. Critics argue that companies shy away from it because they cannot control it. Yet its entire goal is to not be controlled, and it draws its strength from the conviction that the body of GPL software is too useful to ignore. And again, that's self-fulfilling.
It takes courage, but it's important to know when you have the power to say "all of me, or none of me".
> Critics argue that companies shy away from it because they cannot control it.
No, they don't. Critics point out that companies avoid it, and non-critics ascribe this avoidance to "can't control it", which is false, because nothing under a third-party copyright under any non-exclusive license can be controlled by the licensee, but businesses avoiding the GPL don't generally avoid all non-exclusive licenses.
I think "can't control" refers to sublicensing in this context. People's dislike over copyleft stems from wanting to make software proprietary (or proprietary-friendly through lax licensing). Copyleft removes that control, and the GPL's main strength is that it is so ubiquitous that you cannot practically avoid it (in most cases).
Insofar as companies avoid it, they do so because it constrains their behaviour in some way. Call it what you will; my wording was perhaps sloppy.
For the increasing number of companies that do participate in the GPL ecosystem, they do so because the opportunity cost of not participating outweighs the concomitant behavioural constraints. This produces a strong network effect as GPL software gains contributors, making GPL software more useful.
Wikipedia's anti-censorship strategy is analogous in that the switch to HTTPS raised the opportunity cost of censorship to the loss of the entire Wikipedia "ecosystem", which for many regimes is more severe than the "cost" of not censoring. This too produces a network effect as Wikipedia gains more contributors, thus further increasing its value.
Yes, I understand that. I mean, why don't these censors block the whole wikipedia.org access then?
If they don't want their population to access a Wikipedia topic/article and can't block/determine if someone is accessing it, the easiest thing to do would be just block it right away. So why they won't do it?
(PS: I'm in no way in favor of censorship, I'm just trying to understand such mindset)
If you censor too much people may be pissed. It's much easier to decide "we censor specific articles about specific subjects" than "we censor all of wikipedia". Censoring a popular mainstream webpage may cause too much opposition. Maybe even the politicians who make the decision and their families like to look up things on wikipedia.
Then https will force them to either extreme which I think is a good thing. No option to slowly raise the temperature so the frogs won't jump out of the pot.
As well as forge an SSL certificate for *.wikipedia.org.
Last time I checked, Wikipedia had HSTS enabled. So trying to forge their DNS without also forging their SSL certificate would be equivalent to total censorship for anybody who has previously visited Wikipedia.
I think it's a fun/educational process to interact with some daemons over telnet. You can telnet into port 80 and create an HTTP request, for instance.
Certificate negotiation happens before the GET request happens, which means that the "URL" (or, rather, everything after the domain) is encrypted.
You can also see some of this process with curl, e.g. with its verbose flag (curl -v https://en.wikipedia.org/).
Telnet is a great way to realize that HTTP is just some simple text commands and not some mysterious binary protocol.
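If telnet isn't handy, a raw socket does the same job; here's a minimal stand-in for that exercise. Over plain HTTP everything, including the request path, travels in clear text.

    # Minimal stand-in for the telnet exercise: plain HTTP over a raw socket.
    import socket

    with socket.create_connection(("example.com", 80)) as s:
        s.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
        print(s.recv(300).decode(errors="replace"))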
WireShark also provides a good visualization of the HTTPS negotiation process and the various layers of HTTPS requests and responses. It does take a lot more to figure out than telnet though.
For all those who are not aware of what HTTPS encrypts:
HTTPS encrypts basically the whole protocol. This includes your request (the URL, your fingerprint -- e.g. browser, plugins installed, preferred languages) and the response (the content, the type of the response (text, video, audio file), and some other not-so-important things).
What HTTPS does not encrypt is the domain and the IP. The domain is leaked through DNS. DNSSEC will not help either, because it does not encrypt the DNS request; it merely signs it so that you can be sure it is authentic (not tampered with), but everyone can still read it. This includes the wifi hotspot you use, your ISP, your government, and anyone who taps the wires (theoretically even your neighbor and nearby people if you use mobile data, since the link between your device and your ISP is not strongly protected[1]).
Even if you were to encrypt the DNS traffic (or just use the host's IP directly), whoever intercepts your traffic could simply build a database of IP addresses and the DNS entries they correspond to (or do a reverse lookup; however, not every IP address has a reverse lookup configured for the domain you are visiting).
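As a small illustration: even without seeing the DNS query, an observer holding only the destination IP can often get a hostname back with a reverse lookup (or from a pre-built IP-to-domain map).

    # Forward lookup mimics the DNS leak; the reverse lookup is what an observer
    # could try with nothing but the IP (raises socket.herror if no PTR record).
    import socket

    ip = socket.gethostbyname("de.wikipedia.org")
    print(ip)
    try:
        print(socket.gethostbyaddr(ip)[0])
    except socket.herror:
        print("no reverse record for", ip)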
In Wikipedia's case, this can still be pretty bad. For instance, if an oppressive government notices that you visit the Wikipedia edition in a particular language pretty frequently (compared to the rest of the population), they might make assumptions about you and profile you. When you visit the German Wikipedia, you are actually visiting de.wikipedia.org instead of en.wikipedia.org, and that can be intercepted and seen.
This gets worse for static file servers which serve different images from different subdomains (e.g. static512.domain.tld). If DNS requests go out to static523, static123, static721, and static132, an attacker might be able to guess which article you are reading (or at least narrow down the choices), because there will not be many articles whose images are served by exactly those file servers. Thankfully Wikipedia does not do that - everything is served through upload.wikimedia.org - but newspapers, forums, etc. might not, or an article might even pull in content from a unique domain (e.g. an embedded chart or video that comes from a unique third party and is loaded automatically).
So all in all, HTTPS is pretty good, but you still leak a lot of metadata (the DNS requests are just the tip of the iceberg) that can be used to learn a lot about you. If you want to be safe, use Tor or a VPN. If you use a VPN, be aware that you just shift the trust from your current location to another one (the VPN provider, their ISP, and the government where the VPN server is located can read all that metadata, which might be no big deal or even worse, depending on where you actually live). Furthermore, some VPNs have been known to be easily broken, and your ISP/government still sees that you are using a VPN or even Tor.
[1]: One exception is LTE, but an attacker could still downgrade the connection to 3G or EDGE to intercept the domain.
This is important for some censorship circumvention schemes and also because some people have suggested that encrypting SNI is useless because DNS leaks the hostname [however, not necessarily along the same network path!!], while some people have also suggested that encrypting DNS queries is useless because SNI leaks the hostname.
Ever been to a beach in Europe, particularly a naturist beach? Not sure what the problem is here; apparently some puritans are fine with violence but consider the naked body disgusting.
To be fair to the Scorpions, here's a quote from Wikipedia about the original concept for the song:
'...Time is the virgin killer. A kid comes into the world very naive, they lose that naiveness and then go into this life losing all of this getting into trouble. That was the basic idea about all of it'
Different times...
https://en.wikipedia.org/wiki/Virgin_Killer
Not, strictly speaking, the UK government. The Internet Watch Foundation, a non-governmental organisation, placed the article/image in question on its blacklist, a list which most major UK ISPs use (notable exceptions at the time were the UK universities' and military networks IIRC).
AFAIK, whether or not the image is actually illegal under English law is somewhat unclear (the definition of "indecent" is rather woolly), though it's certainly a poor choice for an album cover.
Edit: "to its blacklist" -> "on"; added "a non-governmental organisation"
Currently HTTPS sends the domain in clear text before establishing a connection. That allows hosting (and blocking) websites by domain, not by IP. Maybe HTTPS should have an optional extension to send the URI in clear text before establishing the connection. That way, if censors decide to block Wikipedia, users could opt in to this behaviour and get Wikipedia unblocked except for a few selected articles.
> Absolutely not. The response to censorship should not be to make things easier for the censor.
It's not about making things easier for the censor. It's already easy. It's about making life easier for people who have to live with censorship (pretty much the entire world, I guess?).
> Anyway, the idea is unworkable as the user's client could simply lie about what URI it's going to send after the encrypted connection is setup.
- Unlike the host, a URI is a property of the request, not the connection, so sending it as part of the connection handshake doesn't really make sense.
- Unlike the host, there is a very long history of putting secret things into the URI. Even if the extension is built with this in mind, the number of security breaches that will result is greater than zero, with probability one. That's probably not the correct price to pay for convenient censorship infrastructure.
[0] https://news.vice.com/story/china-is-recruiting-20000-people...
[1] https://dumps.wikimedia.org/backup-index.html
[2] https://www.wired.co.uk/article/china-great-firewall-censors...
[3] https://blogs.wsj.com/chinarealtime/2015/12/17/anti-wikipedi...