I run a MITM proxy for adblocking/general filtering, and lately I've noticed that Cloudflare and other "bot protection" gets me blocked out of more and more of the sites I come across in search results, so this will be very useful for fixing that.
However, I should caution that in this era of companies being particularly user-hostile and authoritarian, especially Big Tech, I would be more careful with sharing stuff like this. Being forced to run JS is bad enough; profiling users based on other traits, and essentially determining if they are using "approved" software, is a dystopia we should fight strongly against. Stallman's Right To Read comes to mind as a very relevant warning story.
Cloudflare is likely one of the worst things that has happened to the internet in recent history.
Like, I get the need for some protective mechanisms for interactive content/posting/etc, but there should be zero cases where a simple GET of a public page requires JavaScript/client-side crap. If they serve me a slightly stale version of the remote resource (5 minutes or whatever), that's fine.
They've effectively just turned into a google protection racket. Small/special purpose search/archive tools are just stonewalled.
You can't turn it off as a Cloudflare customer either.
The best you can get is "essentially off", and that wording is deliberate: even with everything disabled there are still edge cases where their security will enforce a JS challenge or CAPTCHA.
At least on their basic plan there is also little to no indication of how often this is triggering, so you have no idea what effect the various settings actually have.
Not to be too dismissive of this, but for companies just trying to run a service while getting constantly bombarded by stuff like DDoS attacks, Cloudflare and its ilk let them serve a large portion of "legitimate" users, compared to none.
I don't really know how you resolve that absent just like... putting everything behind logins, though.
They solve the DDoS issue by requiring JS CAPTCHAs (which fundamentally breaks the way the internet should work), rather than serving a cached copy of the page to reduce load on the real host.
Requiring JS doesn't disambiguate between well-behaved automated (or headless; I use a custom proxy for a lot of my content browsing) user agents and malicious ones; it breaks /all/ of them.
Some people shoot themselves in the foot, yes. There is no reason not to have some amount of microcaching; even a very short TTL puts an upper limit on the request rate per resource behind the caching layer.
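To make that concrete, here's a rough sketch of a 5-second microcache in Python (not any particular CDN's or proxy's actual behaviour; `fetch_from_origin` is a hypothetical stand-in for whatever talks to the backend). However hard the front end is hammered, the origin sees at most one request per path per TTL:

```python
import time

CACHE_TTL = 5.0   # seconds; the "micro" cache window
_cache = {}       # path -> (expires_at, response_body)

def fetch_from_origin(path):
    # Hypothetical placeholder for the real backend call.
    return f"origin response for {path}"

def handle_request(path):
    """Serve from cache if fresh; otherwise hit the origin at most
    once per CACHE_TTL per path, bounding backend load."""
    now = time.monotonic()
    entry = _cache.get(path)
    if entry and entry[0] > now:
        return entry[1]                      # cache hit, origin untouched
    body = fetch_from_origin(path)           # at most 1/TTL per path
    _cache[path] = (now + CACHE_TTL, body)
    return body
```

A real deployment would do this at the reverse proxy or CDN layer rather than in application code, but the rate bound is the same idea.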
I've noticed even GitHub has a login wall now for comments on open source projects. They truncate them if you aren't logged in, similar to reddit on mobile, instagram, twitter, etc. Hopefully the mobile version doesn't start pushing you to install some crappy apps where you can't use features like tabbed browsing, tab sync with another machine, etc.
Not sure I'm buying that excuse. I think they want to nudge people to make accounts and log in. Really shady in the case of GitHub and many other sites that are successful because of user content, in my opinion.
Ironically, that makes the scraped copies more useful because they aren't truncated (at least for older pages) and I can actually get all the content. I wonder if that might be at least why Google seems to be giving them more weight in search result rankings.
> profiling users based on other traits, and essentially determining if they are using "approved" software, is a dystopia we should fight strongly against. Stallman's Right To Read comes to mind as a very relevant warning story.
Right to Read indeed... fanfiction.net has become really annoying over the last few months. Especially at night, when you have the FFN UI set to dark and then, out of nothing, a bright white Cloudflare page appears. Or the way the Cloudflare "anti-bot" protection leads to an endless loop when the browser is the Android web view inside a third-party Reddit client.
Maybe I'm just a techno-optimist, but I suspect big tech companies don't give a hoot about you running "unapproved" software, but rather care about their services being abused and "unapproved" software is just a useful signal that fails on a tiny percentage of total legit users.
You are a lot more charitable than I am. I believe the big tech companies use dark patterns to get us to sign up, improve their metrics and hoover up our data.
Just trying to keep services operational is a fine goal to pursue as an operator, but forcing users into narrow inbound funnels for the service is detrimental too. There needs to be better research into letting simpler modes of operation continue working.
A browser is becoming a universal agent by itself, but many people (maybe increasingly) use the terminal to access resources, and stonewalling those paths is never OK in my book.
You should really be impersonating an ESR version (e.g. 91). Versions from the release channel are updated every month or so, and everyone has auto-update enabled, so unless you keep it up to date your fingerprint is going to stick out like a sore thumb in a few months. On the other hand, ESR sticks to one version and shouldn't change significantly during its one-year lifetime. It's still going to stick out to some extent (most people don't use ESR), but at least you have some enterprises who use ESR to blend in with.
They should really be impersonating Chrome. If this takes off, Firefox has such a small user share that I could see sites just banning Firefox altogether, like they do with Tor
If there is a lot of abuse masquerading as Firefox, outstripping legit users, they can totally throw up a CAPTCHA for Firefox but not for Chrome. An outright ban isn't the only annoying outcome.
Thanks for the suggestion, I had no idea ESR was a thing. I've just added support for Firefox ESR 91 (it was pretty similar and required adding one cipher to the cipher list and changing the user agent).
I think ESR is the way to go too, but either way, I wonder if some tests can be written to confirm the coverage/similarity of the requests? It would entail automating both a Firefox session and the recording of network traffic, and it feels like it might end up as bikeshedding.
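One low-effort way such a test could work (just a sketch, not anything the project actually does; the port number and file names are arbitrary): listen on a local port, point both the real Firefox and the impersonating client at it, dump the raw ClientHello each one sends, and diff the two captures.

```python
import socket
import sys

def capture_client_hello(port, out_path):
    """Accept one TCP connection and dump whatever the client sends
    first -- for a TLS client, that's the ClientHello record."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(65536)   # first flight from the client
        with open(out_path, "wb") as f:
            f.write(data)
    print(f"wrote {len(data)} bytes to {out_path}")

if __name__ == "__main__":
    # e.g. python capture_hello.py 8443 firefox.bin
    capture_client_hello(int(sys.argv[1]), sys.argv[2])
```

Hit https://127.0.0.1:8443/ from each client (the handshake will fail, which is fine for this purpose) and compare the two dumps with a hex diff.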
I will try to impersonate Chrome next. However, I suspect this is going to be more challenging. Chrome uses BoringSSL, which curl does not support, so it means either forcing curl to compile against BoringSSL or modifying NSS to look like BoringSSL.
The counterargument is that service providers could just choose to block anything that looks like Firefox, since the market share is so small and it's being used to circumvent their precious protections.
Whilst it's not as big as it once was, the idea of a service provider blocking all Firefox user agents is still ludicrous to the point that I can't believe you're not trolling here.
"Some web services therefore use the TLS handshake to fingerprint which HTTP client is accessing them. Notably, some bot protection platforms use this to identify curl and block it."
As a user of non-browser clients (not curl though) I have not run into this in the wild.^1
Anyone have an example of a site that blocks non-browser clients based on TLS fingerprint?
1. As far as I know. The only site I know of today that is blocking non-browser clients appears to be www.startpage.com. Perhaps this is the heuristic they are using. More likely it is something simpler I have not figured out yet.
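For reference, the kind of fingerprint these platforms typically compute is JA3-style: a hash over a handful of ClientHello fields. A rough sketch of the computation (the example values at the bottom are made up, not taken from any real client):

```python
import hashlib

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """JA3-style fingerprint: join the ClientHello fields as decimal
    values and MD5 the resulting string. Two clients with different
    cipher or extension lists (or ordering) get different hashes."""
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Illustrative values only -- not copied from any real browser.
print(ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))
```

The hash depends entirely on which ciphers, extensions and curves the TLS library offers and in what order, which is why a curl build can hash differently from Firefox even when the HTTP requests themselves are byte-for-byte identical.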
There was a conversation on their mailing list contemplating dropping NSS support. https://curl.se/mail/lib-2022-01/0120.html If you have a use case for NSS in curl, you may want to speak up. Perhaps "I want curl to look exactly like a browser" is a significant use case?
Agreed, it is very important to bring this up on the mailing list. It might also be plausible to make curl look like Chrome if curl had BoringSSL support.
Currently, I cannot think of anything other than "noscript/basic (x)html"/IRC to get us out of this, at least for sites where such protocols are "good enough" to provide their services to users over the internet.
But how? Enlighten the brain-washed "javascript web" devs to make them realize how toxic what they do is? Regulations (at least for critical sites)?
And how do you deal with the other sites, the ones whose devs are scammers, perfectly aware of how toxic they are, and keep doing it anyway?
In my own country, for critical sites, I will probably have to go to court since "noscript/basic (x)html" interop was broken in the last few years.
Would be cool if there was something like this for Python. Last time I tried to scrape something interesting, I found that one of Cloudflare's enterprise options was easily blocking all of the main HTTP libraries due to the identifiable TLS handshake.
The site wasn't using it to block me, just to prompt a captcha, without doing so to 'real' browsers.
The HTTP requests were exact copies of browser requests (in terms of how the server would've seen them), so it was something below HTTP. I ended up finding a lot of info about Cloudflare and the TLS stuff on StackOverflow, with others having similar issues. Someone even made an API to do the TLS stuff as a service, but it was too expensive for me. https://pixeljets.com/blog/scrape-ninja-bypassing-cloudflare...
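One partial workaround I've seen discussed for plain Python (a sketch under the assumption that the site only keys on the cipher list; it doesn't touch extensions or their ordering, so it may well still be detected): mount a requests adapter that supplies a hand-built SSL context. The cipher string below is illustrative, not copied from any real browser.

```python
import ssl

import requests
from requests.adapters import HTTPAdapter

# Illustrative OpenSSL cipher names, NOT a real browser's list.
BROWSER_LIKE_CIPHERS = (
    "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:"
    "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:"
    "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384"
)

class CustomCipherAdapter(HTTPAdapter):
    """Hand urllib3 an SSLContext with our own cipher list so the
    ClientHello advertises it instead of the library default."""

    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        ctx.set_ciphers(BROWSER_LIKE_CIPHERS)
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("https://", CustomCipherAdapter())
# resp = session.get("https://example.com/")   # placeholder URL
```

The extension list and its ordering still come from OpenSSL defaults, so stricter fingerprinting will usually still tell it apart from a real browser; that gap is exactly what patching the TLS library itself (as the article does) is meant to close.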
Thanks for the response, I'd never come across that particular behaviour.
FWIW, I think when it comes to "copy as cURL", the HTTP header ordering may be different, and it's worth loading the page twice as some of the cookies get replaced.
I've used puppeteer as the article talks about; it manages the cookies better. I managed to do continuous requests without getting further CF blocks, as opposed to only a couple of hundred with cURL (due to the cookies drifting from what CF expects over time).
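For the Python-minded, roughly the same approach can be sketched with Playwright instead of puppeteer (an illustration, not a drop-in solution: example.com is a placeholder, and if the protection also fingerprints TLS or ties the clearance cookie to the browser's User-Agent, the reused cookies may still get challenged):

```python
import requests
from playwright.sync_api import sync_playwright

URL = "https://example.com/"   # placeholder target

def get_cookies_via_browser(url):
    """Let a real (headless) browser engine run the JS challenge,
    then hand its cookies back for plain HTTP requests."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        page.goto(url, wait_until="networkidle")
        cookies = context.cookies()
        browser.close()
    return {c["name"]: c["value"] for c in cookies}

session = requests.Session()
session.cookies.update(get_cookies_via_browser(URL))
# Caveat: follow-up requests from this session use requests' own TLS
# stack and User-Agent, so stricter checks may still trigger.
# resp = session.get(URL)
```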
IIRC CF does have a sliding scale of how protected you want a site to be, so perhaps the TLS stuff belongs further up the scale.
I think most of the scraping libraries have stagnated since it's hard to scrape without a headless browser these days...too many sites with client-side rendered content.
Very cool! Thanks for sharing - it’s always nice to learn about fingerprinting tricks and workarounds, from both a privacy and a “don’t unintentionally look like a bot” perspective.
Good blog post. Stuff like this makes me wonder if by 2030 (1) the internet will mostly consist of machine generated content; (2) machines written by normal people in Python won't be authorized to access the machine-generated content anymore due to Protectify; (3) most client traffic will originate from Protectify's network, so people like bloggers won't have any visibility into whether their readers are humans or machines; (4) video compression algorithms will become indistinguishable from deepfakes; and (5) airborne pathogens will make alternatives to the above impractical.
There are some industries (virtually all of Wall Street, for example, and certain parts of government) where the company needs to surveil 100% of what their employees do on the web from inside the office. These companies have been running MITM proxies for decades.
Wouldn't any website that rejects a non-browsery TLS client be blocking out these people as well?
They don't block you completely, just present you with a JS challenge that delays your access to the site. A browser, even if behind a MITM proxy, would be able to solve this challenge.
Handy. Does the TCP handshake, or other details of socket behavior, ever get used for assessing the remote process, and are there in turn libraries written to mimic known patterns?