I'm using Chrome on Linux and noticed that this year Cloudflare is very aggressive in showing the "Verify you are a human" box. Now a lot of sites that use Cloudflare show it, and once you solve the challenge it shows it again after 30 minutes!
What are you protecting, Cloudflare?
Also they show those captchas when going to robots.txt... unbelievable.
Cloudflare has been even worse for me on Linux + Firefox. On a number of sites I get the "Verify" challenge and after solving it immediately get a message saying "You have been blocked" every time. Clearing cookies, disabling UBO, and other changes make no difference. Reporting the issue to them does nothing.
This hostility to normal browsing behavior makes me extremely reluctant to ever use Cloudflare on any projects.
I'm a Cloudflare customer, and even their own dashboard does not work with Linux + a slightly older Firefox. I mean, one click and it's "oops, please report the error" to /dev/null.
At least you can get past the challenge. For me, every single time it is an endless loop of "select all bikes/cars/trains". I've given up even trying to solve the challenge and just close the page when it shows up.
It is Cloudflare; I see it too. It's a Cloudflare page, with all the branding and the spinning circle, and then a captcha pops up on the same Cloudflare-branded page.
I run a few Linux desktop VMs and Cloudflare's Turnstile verification (their auto/non-input-based verification) fails for the couple of sites I've tried that use it for logins, on the latest Chromium and Firefox browsers. It doesn't matter that I'm even connecting from the same IP.
I'd presumed it was just the VM they're heuristically detecting, but it sounds like some are experiencing issues on Linux in general.
Check that you are allowing web worker scripts; that did the trick for me. I still have issues on slower computers (Raspberry Pis and the like), however, as they seem to be too slow to do whatever Cloudflare wants as verification in the allotted time.
Sounds like my experience browsing the internet while connected to the VPN provided by my employer: tons of captchas, and everything defaults to German (the IP is from Frankfurt).
The problem is that you are not performing "normal browsing behavior". The vast majority of the population (at least ~70% don't use ad-blockers) have no extensions and change no settings, so they are 100% fingerprintable every time, which lets them through immediately.
Linux + Firefox. Not sure what happened to me yesterday, but the challenge/response thing was borked, and when I finally got through it all, it said I was a robot anyway. This was while trying to sign up for a Skype account, so it could have been an MS issue and not necessarily Cloudflare. I think the solution is to just not use obstructive software. Thanks to this issue I discovered Jitsi, and that seems more than enough for my purposes.
Yeah, Lego and Etsy are two sites I can now only visit with Safari. It sucks. With Firefox on the same machine, it claims I'm a bot or a crawler (not even on Linux, on a Mac).
Fwiw, I was getting blocked by Cloudflare for a long time on Firefox + Linux, and the only thing that fixed it was completely disabling the UA-adjuster browser extension I had installed.
I have Firefox and Brave set to always clear cookies and everything else when I close the browser... the number of captchas everywhere when I come back is a nightmare...
It is either that or keep sending data back to the Meta and Co. overlords, despite me not being a Facebook, Instagram, or WhatsApp user...
You don't even need to use a different browser - Firefox has an official "Multi-account containers" extension that lets you assign certain sites to open in their own sandbox so you can have a sandbox for Google, another for Facebook, etc.
So, what's a good strategy for managing containers? I've used this extension for years, and in the past I was a bit more conservative with my containers (personal, work, google, facebook, twitter, banking, etc.) and now I've gone a bit more ... "ham" as they say ... and I have 29. One example is travel, to keep fare searches from pervading news story ads. But I'm sure there's a way to strike a balance that I've just not yet found.
Great idea. I wasn't even aware of it and had resigned myself to the idea that tracking is inescapable, but I really need to take that back, and even stop using a lot of hostile services. On smartphones it's even worse.
I don't bother with sites that have Cloudflare Turnstile. Web developers supposedly know the importance of page load time, but even worse than a slow-loading page is waiting for Cloudflare's gatekeeper before I can even see the page.
Turnstile is the in-page captcha option, which, you're right, does affect page load. But they defer the loading of that JS as best they can.
Also, Turnstile is a proof-of-work check, and is meant to slow down and verify would-be attack vectors. Turnstile should only be used on things like login, email change, "place order", etc.
Managed challenges actually come from the same "challenges" platform, which includes Turnstile; the only difference being that Turnstile is something that you can embed yourself on a webpage, and managed challenge is Cloudflare serving the same "challenge" on an interstitial web page.
Also, Turnstile is definitely not a simple proof-of-work check; it performs browser fingerprinting and checks for web APIs. You can easily check this by changing your browser's user agent at the header level while leaving it as-is at the JS level (navigator.userAgent); the mismatch puts Turnstile into an infinite loop.
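To make that concrete, here is a minimal sketch (purely illustrative, not Cloudflare's actual logic) of the kind of consistency check a fingerprinting challenge can run once its script reports navigator.userAgent back to the server:

```python
# Illustrative only: flag clients whose HTTP User-Agent header disagrees
# with what their JS runtime reports via navigator.userAgent.
def ua_mismatch(header_ua: str, js_reported_ua: str) -> bool:
    """Return True when the header UA and the JS-reported UA disagree."""
    return header_ua.strip() != js_reported_ua.strip()

# A header-level UA override with an untouched JS runtime:
header_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."  # spoofed header
js_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) ..."      # navigator.userAgent
print(ua_mismatch(header_ua, js_ua))  # True -> the challenge keeps escalating
```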
The captcha on robots.txt is a misconfiguration in the website. CF has lots of issues, but this one is on their customer. Also, they detect Google and other bots, so those may be going through anyway.
Sure; but sensible defaults ought to be in place. There are certain "well-known" URLs that are intended for machine consumption. CF should permit (and perhaps rate limit?) those by default, unless the user overrides them.
Putting a CAPTCHA in front of robots.txt in particular is harmful. If a web crawler fetches robots.txt and receives an HTML response that isn’t a valid robots.txt file, then it will continue to crawl the website when the real robots.txt might’ve forbidden it from doing so.
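A small illustration with Python's standard robots.txt parser (the URLs and bot name are placeholders): when a challenge page is served in place of robots.txt, it parses as an empty rule set, so everything looks allowed.

```python
# If an HTML challenge page is served where robots.txt should be, it parses
# as an empty rule set and previously disallowed paths look allowed.
from urllib.robotparser import RobotFileParser

real_rules = "User-agent: *\nDisallow: /private/\n"
captcha_page = "<!DOCTYPE html><html><body>Verify you are a human</body></html>"

def allowed(robots_body: str, url: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch("ExampleBot", url)

print(allowed(real_rules, "https://example.com/private/page"))    # False
print(allowed(captcha_page, "https://example.com/private/page"))  # True: rules lost
```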
Using Pale Moon, I don't even get a captcha that I could solve: just a spinning wheel, and the site reloads over and over. This makes it impossible to use, e.g., anything hosted on sourceforge.net, as they're behind the clownflare "Great Firewall of the West" too.
Whoever configures the Cloudflare rules should be turning off the firewall for things like robots.txt and sitemap.xml. You can still use caching for those resources to prevent them from becoming a front door for DDoS.
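As a rough sketch of what that could look like via the API (this assumes the legacy Firewall Rules endpoint and an "allow" action; the zone ID and token are placeholders, and field names or actions may differ on your plan, where a WAF custom rule with a Skip action is the newer equivalent):

```python
# Sketch only: exempt well-known machine-readable paths from challenges.
# Assumes the legacy Firewall Rules endpoint; details may differ per plan.
import requests

ZONE_ID = "your-zone-id"      # placeholder
API_TOKEN = "your-api-token"  # placeholder

rule = {
    "action": "allow",  # stop challenging these requests
    "filter": {
        "expression": 'http.request.uri.path in {"/robots.txt" "/sitemap.xml"}'
    },
    "description": "Never challenge robots.txt or sitemap.xml",
}

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/firewall/rules",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=[rule],  # this endpoint historically took a list of rules
)
print(resp.status_code, resp.json())
```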
It seems like common cases like this should be handled correctly by default. These are cacheable requests intended for robots. Sure, it would be nice if webmasters configured it, but I suspect only a tiny minority does.
For example, even Cloudflare hasn't configured their official blog's RSS feed properly. My feed reader (running in a DigitalOcean datacenter) hasn't been able to access it since 2021 (a 403 every time, even though it has backed off to checking weekly). This is a cacheable endpoint with public data intended for robots. If they can't configure their own product correctly for their official blog, how can they expect other sites to?
I agree, but I also somewhat understand. Some people will actually pay more per month for Cloudflare than for their own hosting; the Cloudflare Pro plan is $20/month USD. Some sites wouldn't be able to handle the constant requests for robots.txt, because bots don't necessarily respect cache headers (if those are even configured for robots.txt), and the bots that look at robots.txt and ignore caching headers are too numerous.
If you are writing some kind of malicious crawler that doesn't care about rate limiting and wants to scan as many sites as possible, putting together a list of the most vulnerable ones to hack, you will scan robots.txt, because that is the file that tells robots NOT to index these pages. I never use robots.txt for some kind of security through obscurity. I've only ever bothered with robots.txt to make SEO easier when you can control a virtual subdirectory of a site, to block things like repeated content with alternative layouts (to avoid duplicate-content issues), or to get discontinued sections of a website to drop out of SERPs.
> sheer number of bots that look at robots.txt and will ignore a caching header
This is not relevant because Cloudflare will cache it so it never hits your origin. Unless they are adding random URL parameters (which you can teach Cloudflare to ignore but I don't think that should be a default configuration).
The thing is, it won't do that by default. You currently have to enable caching when creating a new account. I use a service that detects whether a website is still running, and it does this by using a certain URL parameter to bypass the cache.
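For illustration, that cache-bypass trick is just a unique query parameter per request (the parameter name below is made up), which gives every request a distinct cache key unless the CDN is told to ignore it:

```python
# A unique query string value per request makes each request a distinct
# cache key, so it reaches the origin unless the CDN ignores the parameter.
import time
import requests

def is_site_up(url: str) -> bool:
    cache_buster = {"_probe": str(time.time_ns())}  # hypothetical parameter name
    resp = requests.get(url, params=cache_buster, timeout=10)
    return resp.ok

print(is_site_up("https://example.com/"))
```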
Again, I think you are correct that the defaults should be saner, but if you've ever dealt with a network admin or web administrator who hasn't grappled with server-side caching vs. browser caching, you'll know it would most definitely end up with Cloudflare losing sales because people misunderstood how things work. Maybe I'm jaded at 45, but I feel like most people don't even know to look at headers when they think they've hit a caching issue. I don't think it's based on age; I think it's based on being interested in the technology and wanting to learn all about it. Mostly developers who got into it for the love of technology, versus those who got into it because it was high paying and they understood Excel, or who learned to build a simple website early in life, so everyone told them to get into software.
I scrape hundreds of Cloudflare-protected sites every 15 minutes without ever having any issues, using a simple headless browser and a mobile connection, while real users get interstitial pages.
It's almost like Cloudflare is deliberately showing the challenge to real users just to show that they exist and are doing "something".
It's not just Linux; I'm using Chrome on my macOS Catalina MBP and I can't even get past the "Verify you are a human" box. It just shows another captcha, and another, and yet another... No amount of clearing cookies, disabling adblockers, or connecting from a different WiFi helps. And that's on most random sites (like ones from HN links); I also don't recall ever doing anything "suspicious" (web scraping, etc.) on that device/IP.
A cheeky response is "their profit margins", but I don't think that's quite right, considering that their earnings per share is -$0.28.
I've not looked into Cloudflare much (I've never needed their services), so I'm not totally sure what all their revenue streams are. I have heard that small websites are not paying much, if anything at all [1]. With that preface out of the way: I think that we see challenges on sites that perhaps don't need them as a form of advertising, to ensure that their name is ever-present. Maybe they don't need this form of advertising, or maybe they do.
If you log in to the CF dashboard every 3 months or so, you will see pretty clearly that they are slowly trying to become a cloud provider like Azure or AWS. Every time I log in there is a whole new slew of services that have equivalents on the other cloud providers. They are using the CDN portion of the business as a loss leader.
They run their own DNS infra so that when you set the SOA for your zone to their servers, they can decide what to resolve to. If you have protection set on a specific record, then it resolves to a fleet of nginx servers with a bunch of special sauce that does the reverse proxying that allows for WAF, caching, anti-DDoS, etc. It's entirely feasible for them to exempt specific requests like this one, since they aren't "protect[ing] the whole DNS" so much as using it to facilitate control of the entire HTTP request/response.
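A quick way to see that delegation in practice (this uses the third-party dnspython library; "example.com" is a placeholder for a Cloudflare-proxied hostname):

```python
# For a Cloudflare-proxied record, the A answers are Cloudflare edge
# addresses rather than the origin, which is what lets their reverse
# proxies apply WAF, caching, and anti-DDoS to every request.
import dns.resolver  # third-party: dnspython

name = "example.com"  # placeholder; substitute a Cloudflare-proxied host

print("NS:", [str(r) for r in dns.resolver.resolve(name, "NS")])
print("A: ", [str(r) for r in dns.resolver.resolve(name, "A")])
```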
I run a honeypot and I can say with reasonable confidence many (most?) bots and scrapers use a Chrome on Linux user-agent. It's a fairly good indication of malicious traffic. In fact I would say it probably outweighs legitimate traffic with that user agent.
It's also a pretty safe assumption that Cloudflare is not run by morons, and they have access to more data than we do, by virtue of being the strip club bouncer for half the Internet.
User-agent might be a useful signal but treating it as an absolute flag is sloppy. For one thing it's trivial for malicious actors to change their user-agent. Cloudflare could use many other signals to drastically cut down on false positives that block normal users, but it seems like they don't care enough to be bothered. If they cared more about technical and privacy-conscious users they would do better.
> For one thing it's trivial for malicious actors to change their user-agent.
Absolutely true. But the programmers of these bots are lazy and often don't. So if Cloudflare has access to other data that can positively identify bots, and there is a high correlation with a particular user agent, well then it's a good first-pass indication despite collateral damage from false positives.
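As a toy illustration of "first-pass indication plus other data" (the signal names, weights, and threshold below are all made up, not anything Cloudflare has published): the user agent nudges a score, but other signals decide whether to challenge.

```python
# Toy scoring sketch with made-up weights: the user agent is one weighted
# input among several, so a suspicious UA alone stays below the threshold.
WEIGHTS = {
    "suspicious_ua": 0.3,     # e.g. UA family over-represented in bot traffic
    "no_cookies": 0.2,
    "datacenter_ip": 0.4,
    "failed_js_check": 0.6,
}
CHALLENGE_THRESHOLD = 0.7

def bot_score(signals: dict) -> float:
    """Sum the weights of the signals that fired for this request."""
    return sum(w for name, w in WEIGHTS.items() if signals.get(name))

request = {"suspicious_ua": True}  # Chrome-on-Linux UA, nothing else odd
score = bot_score(request)
print(score, "challenge" if score >= CHALLENGE_THRESHOLD else "pass")  # 0.3 pass
```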
The programmers of these bots are not lazy. This space is a thriving industry with a bunch of commercial bots, the ability of which to evade Cloudflare et al. is the literal metric that determines their commercial viability.
My data says otherwise and you have provided nothing to back up your claim other than saying we have an industry full of dirty money paying programmers to write unethical code. I'm sure it inspires them to do their best work.
Half these imbeciles don't even change the user-agent from the scraper they downloaded off GitHub.
I employ lots of filtering so it's possible the data is skewed towards those that sneak through the sieve - but they've already been caught, so it's meaningless.
I would hope Cloudflare would be way, way beyond a "first pass" at this stuff. That's logic you use for a ten-person startup, not the company that's managed to capture the fucking internet under their network.
> So if Cloudflare has access to other data that can positively identify bots
They do not - not definitively [1]. This cat-and-mouse game is stochastic at higher levels, with bots doing their best to blend in with regular traffic, and the defense trying to pick up signals barely above the noise floor. There are diminishing returns to battling bots that are indistinguishable from regular users.
1. A few weeks ago, the HN frontpage had a browser-based project that claimed to be undetectable
If you're thinking of Google's WEI, I'm thankful that went down in flames:
"Google is adding code to Chrome that will send tamper-proof information about your operating system and other software, and share it with websites. Google says this will reduce ad fraud. In practice, it reduces your control over your own computer, and is likely to mean that some websites will block access for everyone who's not using an "approved" operating system and browser."
Sure, but does that mean that we Linux users can't go on the web anymore? It's way easier for spammers and bots to move to another user agent/system than it is for legitimate users. So whatever causes this is not a great solution to this problem. You can do better, CF.
I'm a Linux user as well, but I'm not sure what Cloudflare is supposed to be doing here that makes everybody happy. Removing the most obvious signals of botting, because some real users look like that too, may be better for those individual users, but that doesn't make it a good answer for legitimate users as a whole. Spam, DoS, phishing, credential stuffing, scraping, click fraud, API abuse, and more are problems which impact real users just as extra checks and false-positive blocks do.
If you really do have a better way to make all legitimate users of sites happy with bot protections, then by all means, there is a massive market for this. Unfortunately you're probably more like me: stuck between a rock and a hard place, with no good solution and just annoyance with the way things are.
The method is the same; it just looks different when n=1. I.e., the method is "wait until you see something particularly anomalous occurring, probe, and see if the reaction is human-like". The more times you say "well, you can't count that as anomalous, an actual person can look like that too, and a bot could try to fake that!", the less effective it becomes at blocking bots.
This approach clearly blocks bots, so it's not enough to say "just don't ever do things which have false positives", and it's a bit silly to say "just don't ever do the things which have false positives, but for my specific false positives only - leave the other methods, please!"
Many/most bots use a Chrome-on-Linux user agent, so you think it's OK to block Chrome-on-Linux user agents? That's very broken thinking.
So it's OK for them to do shitty things without explaining themselves because they "have access to more data than we do"? Big companies can be mysterious and non-transparent because they're big?