I wanted to try making an HTTP request from Telnet the other day. I tried Wikipedia, using the Host header. I got a 403 for not including a user agent, so I tried again with User-Agent: Telnet and it worked!

It's one of the most important headers for clients, since if you don't include it you might not get a 200.
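For the curious, the raw exchange looks roughly like this (illustrative, not a paste of the actual session; the article path and trailing response headers are made up, and the blank line after the request headers is what tells the server you're done):

  $ telnet en.wikipedia.org 80
  GET /wiki/Japanese_yen HTTP/1.1
  Host: en.wikipedia.org
  User-Agent: Telnet

  HTTP/1.1 200 OK
  ...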




In the particular case of Wikipedia, I think they check User-Agent to prevent people from unthinkingly wasting gigabytes of bandwidth scraping the site with tools like wget. Better ways exist to download large quantities of their content in a more usable form, such as the database dumps they publish.


They may do that (though requesting a single article works fine), but it's not very smart. Throttling heavy users - possibly returning a 429 with a link to the download pages - would make much more sense. It's not like wget users can't change their UA.
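The throttling response could be as simple as something like this (purely illustrative - not a response Wikipedia actually sends; dumps.wikimedia.org is the real dump site, the Retry-After value is made up):

  HTTP/1.1 429 Too Many Requests
  Retry-After: 3600
  Content-Type: text/plain

  You're requesting pages too quickly. Bulk content is available
  as database dumps: http://dumps.wikimedia.org/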


?

  bbot@magnesium:~> wget http://en.wikipedia.org/wiki/Japanese_yen
  --2012-07-15 13:54:29--  http://en.wikipedia.org/wiki/Japanese_yen
  Resolving en.wikipedia.org... 208.80.154.225, 2620:0:861:ed1a::1
  Connecting to en.wikipedia.org|208.80.154.225|:80... connected.
  HTTP request sent, awaiting response... 200 OK
  Length: 203481 (199K) [text/html]


wget identifies itself with a "Wget/<version>" User-Agent by default.
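i.e. something like this, with the exact version and platform string depending on the build:

  User-Agent: Wget/1.13.4 (linux-gnu)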


Yes, I am aware. The point of my comment is that Wikipedia obviously does not block wget.


The point is that if it becomes a problem they'll just block that particular useragent.


The point is that you can use -U to specify an arbitrary user-agent string, and -e robots=off to ignore robots.txt.
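For example (the Mozilla string is arbitrary, and the URL is just the article from the transcript above):

  wget -U "Mozilla/5.0" -e robots=off http://en.wikipedia.org/wiki/Japanese_yen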

User-agent blocking is completely braindead. It does nothing at all. The fact that somebody in 2012 can possibly think it works is astounding to me.


I return a 403 if the User-Agent or Host header is missing. And my firewall will lock you out completely if you send "User-agent" instead of "User-Agent" (among many other obvious giveaways in the User-Agent header).
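Roughly, the header check looks something like this (a simplified sketch in Python/Flask, not my actual setup; the "User-agent" capitalization check lives in the firewall and isn't shown):

  # Simplified illustration only, not the real stack.
  from flask import Flask, request, abort

  app = Flask(__name__)

  @app.before_request
  def require_basic_headers():
      # Reject requests that omit User-Agent or Host with a 403.
      if not request.headers.get("User-Agent") or not request.headers.get("Host"):
          abort(403)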


Why?


I block anything that looks like penetration testing or content scraping if there's no chance of false positives. Even when there's no vulnerability present, it conserves resources on dynamically generated sites.


While I can appreciate that, why not block based on patterns of use rather than on headers?



