
I run a honeypot and I can say with reasonable confidence that many (most?) bots and scrapers use a Chrome on Linux user-agent. It's a fairly good indication of malicious traffic. In fact, I would say malicious traffic with that user agent probably outweighs legitimate traffic.
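To make that concrete, here's a minimal sketch (my own illustration, assuming a combined-format access log where the user agent is the last quoted field, not the commenter's actual tooling) of tallying how much traffic presents a Chrome-on-Linux user agent:

    # Hypothetical example: count Chrome-on-Linux user agents in an access log.
    import re
    from collections import Counter

    UA_RE = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user agent

    def ua_counts(log_lines):
        counts = Counter()
        for line in log_lines:
            m = UA_RE.search(line)
            if not m:
                continue
            ua = m.group(1)
            if "Chrome" in ua and "Linux" in ua and "Android" not in ua:
                counts["chrome-on-linux"] += 1
            else:
                counts["other"] += 1
        return counts

    sample = [
        '1.2.3.4 - - [01/Jan/2025] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"',
        '5.6.7.8 - - [01/Jan/2025] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"',
    ]
    print(ua_counts(sample))  # Counter({'chrome-on-linux': 1, 'other': 1})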

It's also a pretty safe assumption that Cloudflare is not run by morons, and they have access to more data than we do, by virtue of being the strip club bouncer for half the Internet.




User-agent might be a useful signal but treating it as an absolute flag is sloppy. For one thing it's trivial for malicious actors to change their user-agent. Cloudflare could use many other signals to drastically cut down on false positives that block normal users, but it seems like they don't care enough to be bothered. If they cared more about technical and privacy-conscious users they would do better.


> For one thing it's trivial for malicious actors to change their user-agent.

Absolutely true. But the programmers of these bots are lazy and often don't. So if Cloudflare has access to other data that can positively identify bots, and there is a high correlation with a particular user agent, well then it's a good first-pass indication despite collateral damage from false positives.
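As an aside, the default user agent of a stock HTTP client shows both how little effort the lazy ones put in and how little effort changing it actually takes (a sketch using the Python requests library and httpbin.org purely for illustration):

    # Default user agent from a stock HTTP client: a dead giveaway.
    import requests

    r = requests.get("https://httpbin.org/user-agent")
    print(r.json())  # e.g. {'user-agent': 'python-requests/2.31.0'}

    # Spoofing it is one extra line, which is why the UA is only a weak signal on its own.
    chrome_ua = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                 "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
    r = requests.get("https://httpbin.org/user-agent", headers={"User-Agent": chrome_ua})
    print(r.json())  # now reports the Chrome-on-Linux string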


The programmers of these bots are not lazy - this space is a thriving industry with a bunch of commercial bots, whose ability to evade Cloudflare and the like is the literal metric that determines their commercial viability.


My data says otherwise and you have provided nothing to back up your claim other than saying we have an industry full of dirty money paying programmers to write unethical code. I'm sure it inspires them to do their best work.

Half these imbeciles don't even change the user-agent from the scraper they downloaded off GitHub.

I employ lots of filtering so it's possible the data is skewed towards those that sneak through the sieve - but they've already been caught, so it's meaningless.


I would hope Cloudflare would be way, way beyond a “first pass” at this stuff. That’s logic you use for a ten person startup, not the company that’s managed to capture the fucking internet under their network.


> So if Cloudflare has access to other data that can positively identify bots

They do not - not definitively [1]. This cat-and-mouse game is stochastic at higher levels, with bots doing their best to blend in with regular traffic, and the defense trying to pick up signals barely above the noise floor. There are diminishing returns to battling bots that are indistinguishable from regular users.

1. A few weeks ago, the HN frontpage had a browser-based project that claimed to be undetectable


> a browser-based project that claimed to be undetectable

For now


That's just part of the game. Sometimes you're ahead, sometimes you're behind, but there's never a decisive winner.


I mean, do we need to replace user agent with some kind of 'browser signing'?


If you're thinking of Google's WEI, I'm thankful that went down in flames:

"Google is adding code to Chrome that will send tamper-proof information about your operating system and other software, and share it with websites. Google says this will reduce ad fraud. In practice, it reduces your control over your own computer, and is likely to mean that some websites will block access for everyone who's not using an "approved" operating system and browser."

https://www.eff.org/deeplinks/2023/08/your-computer-should-s...


Sure, but does that mean that we, Linux users, can't go on the web anymore? It's way easier for spammers and bots to move to another user agent/system than for legitimate users, so whatever causes this is not a great solution to the problem. You can do better, CF.


I'm a Linux user as well, but I'm not sure what Cloudflare is supposed to be doing here that makes everybody happy. Removing the most obvious signals of botting because some real users look like that too may be better for those individual users, but it doesn't make a good answer for legitimate users as a whole. Spam, DoS, phishing, credential stuffing, scraping, click fraud, API abuse, and more are problems which impact real users just as extra checks and false-positive blocks do.

If you really do have a better way to make all legitimate users of sites happy with bot protections, then by all means, there is a massive market for this. Unfortunately you're probably more like me, stuck between a rock and a hard place, with no good solution and just annoyance at the way things are.


What does CF do when bots use a "Chrome on Windows" user agent string?


The method is the same; it just looks different when n=1. I.e. the method is "wait until you see something particularly anomalous occurring, probe, see if the reaction is human-like". The more times you say "well you can't count that as anomalous, an actual person can look like that too and a bot could try to fake that!" the less effective it becomes at blocking bots.

This approach clearly blocks bots so it's not enough to say "just don't ever do things which have false positives" and it's a bit silly to say "just don't ever do the things which have false positives, but for my specific false positives only - leave the other methods please!"
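For illustration only, here is a rough sketch of that "score, then probe" flow; the signals and weights are made up for the example and are not Cloudflare's actual logic:

    # Toy anomaly scoring: combine weak and strong signals, then challenge or block.
    def bot_score(req):
        score = 0
        ua = req.get("ua", "")
        if "Chrome" in ua and "Linux" in ua:
            score += 1                      # weak signal on its own
        if req.get("requests_per_minute", 0) > 120:
            score += 3                      # inhuman request rate
        if not req.get("executes_js", True):
            score += 3                      # no JS engine behind the claimed browser
        if req.get("datacenter_asn", False):
            score += 2                      # hosting-provider IP, not residential
        return score

    def handle(req, challenge_threshold=4, block_threshold=8):
        s = bot_score(req)
        if s >= block_threshold:
            return "block"
        if s >= challenge_threshold:
            return "challenge"              # probe: see if the reaction is human-like
        return "allow"

    print(handle({"ua": "Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0.0.0",
                  "requests_per_minute": 300, "executes_js": False,
                  "datacenter_asn": True}))  # -> block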


Many/most bots use a Chrome on Linux user agent, so you think it's OK to block all Chrome on Linux user agents. That's very broken thinking.

So it's OK for them to do shitty things without explaining themselves because they "have access to more data than we do"? Big companies can be mysterious and non-transparent because they're big?

What a take!


Can't the user agent be spoofed anyway?


I think they also fingerprint the browser. So changing the user agent alone won't help.
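A toy sketch of that idea (my own illustration, not Cloudflare's actual checks): cross-check what the user agent claims against other things the client exposes, like the platform its JS reports or its TLS fingerprint.

    # Spoofed UA vs. other observable properties: inconsistencies give the bot away.
    KNOWN_NON_BROWSER_TLS = {"example-ja3-of-a-python-http-client"}  # placeholder value

    def looks_inconsistent(ua, js_platform, tls_fingerprint):
        claims_windows = "Windows NT" in ua
        claims_chrome = "Chrome/" in ua and "Edg/" not in ua
        # UA says Windows, but the JS engine reports a Linux platform
        if claims_windows and js_platform.startswith("Linux"):
            return True
        # UA says Chrome, but the TLS handshake matches a non-browser HTTP stack
        if claims_chrome and tls_fingerprint in KNOWN_NON_BROWSER_TLS:
            return True
        return False

    spoofed_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
    print(looks_inconsistent(spoofed_ua, "Linux x86_64",
                             "example-ja3-of-a-python-http-client"))  # -> True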



