I wanted to try making an HTTP request from Telnet the other day. I tried Wikipedia, using the Host header. I got a 403 for not including a user agent, so I tried again with User-Agent: Telnet and it worked!

It's one of the most important headers for clients, since if you don't include it you might not get a 200.
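For the curious, the raw exchange looks roughly like this (illustrative, not a paste of the actual session; the article path and trailing response headers are made up, and the blank line after the request headers is what tells the server you're done):

  $ telnet en.wikipedia.org 80
  GET /wiki/Japanese_yen HTTP/1.1
  Host: en.wikipedia.org
  User-Agent: Telnet

  HTTP/1.1 200 OK
  ...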




In the particular case of Wikipedia, I think they check User-Agent to prevent people from unthinkingly wasting gigabytes of bandwidth scraping the site with tools like wget. Better ways exist to download large quantities of their content in a more usable form, such as the database dumps they publish.


They may do that (though requesting a single article works fine), but it's not very smart. Throttling heavy users - possibly returning a 429 with a link to the download pages - would make much more sense. It's not like wget users can't change their UA.
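The throttling response could be as simple as something like this (purely illustrative - not a response Wikipedia actually sends; dumps.wikimedia.org is the real dump site, the Retry-After value is made up):

  HTTP/1.1 429 Too Many Requests
  Retry-After: 3600
  Content-Type: text/plain

  You're requesting pages too quickly. Bulk content is available
  as database dumps: http://dumps.wikimedia.org/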


?

  bbot@magnesium:~> wget http://en.wikipedia.org/wiki/Japanese_yen
  --2012-07-15 13:54:29--  http://en.wikipedia.org/wiki/Japanese_yen
  Resolving en.wikipedia.org... 208.80.154.225, 2620:0:861:ed1a::1
  Connecting to en.wikipedia.org|208.80.154.225|:80... connected.
  HTTP request sent, awaiting response... 200 OK
  Length: 203481 (199K) [text/html]


wget identifies itself with a "Wget/<version>" User-Agent by default.
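i.e. something like this, with the exact version and platform string depending on the build:

  User-Agent: Wget/1.13.4 (linux-gnu)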


Yes, I am aware. The point of my comment is that Wikipedia obviously does not block wget.


The point is that if it becomes a problem they'll just block that particular useragent.


The point is that you can use -U to specify an arbitrary user-agent string, and -e robots=off to ignore robots.txt.
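For example (the Mozilla string is arbitrary, and the URL is just the article from the transcript above):

  wget -U "Mozilla/5.0" -e robots=off http://en.wikipedia.org/wiki/Japanese_yen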

User-agent blocking is completely braindead. It does nothing at all. The fact that somebody in 2012 can possibly think it works is astounding to me.


I return a 403 if the User-Agent or Host header is missing. And my firewall will lock you out completely if you send "User-agent" instead of "User-Agent" (among many other obvious giveaways in the User-Agent header).
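Roughly, the header check looks something like this (a simplified sketch in Python/Flask, not my actual setup; the "User-agent" capitalization check lives in the firewall and isn't shown):

  # Simplified illustration only, not the real stack.
  from flask import Flask, request, abort

  app = Flask(__name__)

  @app.before_request
  def require_basic_headers():
      # Reject requests that omit User-Agent or Host with a 403.
      if not request.headers.get("User-Agent") or not request.headers.get("Host"):
          abort(403)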


Why?


I block anything that looks like penetration testing or content scraping if there's no chance of false positives. Even when there's no vulnerability present, it conserves resources on dynamically generated sites.


While I can appreciate that, why not block based on patterns of use rather than on headers?



