
It's irrelevant that spider traffic is mixed in. You can think of a spider hit as a future page view, assuming that page gets indexed.

The spiders and organic visitors should have been 301 redirected to the correct location. Spiders learn the 301 redirects, and some of their articles may even have benefited from better search rankings.




As I said in my post below, spiders that request article_url/reddit.png or article_url/google-analytics.com/ga.js do not get a 301 from me, because they're not following the href of an <a> tag. They're guessing at a URL that never existed. Those are legitimate 404 responses.
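To make the 301-versus-404 distinction concrete, here is a rough sketch in Python. The `MOVED` table and its paths are made up for illustration and are not from my actual server config: a path that really existed and moved gets a 301, and anything else, including guessed sub-paths, gets a 404.

```python
from http import HTTPStatus
from typing import Dict, Optional, Tuple

# Hypothetical map of old article paths to their new locations.
MOVED: Dict[str, str] = {
    "/old/article-1": "/articles/article-1",
}

def respond(path: str, moved: Dict[str, str] = MOVED) -> Tuple[int, Optional[str]]:
    """Return (status, redirect_location) for a requested path."""
    if path in moved:
        # The URL existed once, so tell the client where it lives now.
        return int(HTTPStatus.MOVED_PERMANENTLY), moved[path]
    # The URL never existed (e.g. a guessed relative path), so 404 is correct.
    return int(HTTPStatus.NOT_FOUND), None
```

With this table, `respond("/old/article-1")` redirects, while `respond("/old/article-1/reddit.png")` falls through to the 404 branch.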


I feel sorry for John_Onion. The comments here are coming from people who have never looked at server logs (e.g. 'how do you know those are all spiders?'). Looking at my logs, I would say at least 75% of the hits on my site (juliusdavies.ca) are spiders. They visit every single page a few times a year to see if it has changed. For my own purposes I mirror some open source manuals and specifications (http://juliusdavies.ca/webdocs/). These have been on my site, unchanged, for at least 3 years, and the spiders still come every couple of months and check every page.

These hits will never (and should never!) translate into even a single real user in my case.
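For anyone who wants to check their own logs, a quick-and-dirty way to estimate the spider share is to look at the User-Agent field. This is only a sketch: it assumes the common "combined" access log format, the `BOT_HINTS` list is an incomplete guess of mine, and spiders that fake a browser User-Agent will be missed.

```python
import re

# Assumed, incomplete list of substrings that crawlers tend to put in User-Agent.
BOT_HINTS = ("bot", "spider", "crawl", "slurp")

# In the combined log format the User-Agent is the last double-quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def spider_share(lines):
    """Return the fraction of parseable log lines whose User-Agent looks like a spider."""
    total = 0
    bots = 0
    for line in lines:
        match = UA_RE.search(line)
        if not match:
            continue  # skip lines that do not end in a quoted field
        total += 1
        user_agent = match.group(1).lower()
        if any(hint in user_agent for hint in BOT_HINTS):
            bots += 1
    return bots / total if total else 0.0
```

Something like `spider_share(open("access.log"))` gives a rough lower bound, where the file name is whatever your server actually writes to.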


Wouldn't exclusion via robots.txt be appropriate in this case?
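As a sketch of what that exclusion could look like, here is a hypothetical robots.txt that blocks the mirrored docs, checked with Python's standard `urllib.robotparser`. The rules and the `ExampleBot` agent name are assumptions for illustration, and this only affects spiders that actually honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to skip the mirrored docs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /webdocs/
"""

def allowed(url, agent="ExampleBot", robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Under these rules a compliant crawler would skip everything under /webdocs/ but could still fetch the rest of the site.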



