
You'd want to whitelist both, based on the user-agent string and the IP / CIDR origin.
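A minimal sketch of that combined check in Python (the user-agent token and CIDR range are only placeholders here; real ranges would have to come from the crawler operators' published lists):

    import ipaddress

    # Illustrative allow-list: UA substring -> CIDR ranges claimed for that crawler.
    ALLOWED = {
        "Googlebot": ["66.249.64.0/19"],  # example range only; check Google's published list
    }

    def is_whitelisted(user_agent: str, remote_ip: str) -> bool:
        ip = ipaddress.ip_address(remote_ip)
        for ua_token, cidrs in ALLOWED.items():
            if ua_token in user_agent:
                # UA matches a known crawler: accept only if the IP is in its ranges.
                return any(ip in ipaddress.ip_network(c) for c in cidrs)
        return False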

Keeping those lists maintained would take fairly constant effort, unless you could come up with a self-training or self-validating mechanism.
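One self-validating mechanism, sketched below in Python, is forward-confirmed reverse DNS, which is also how Google suggests verifying Googlebot: reverse-resolve the connecting IP, check that the hostname falls under the crawler's domain, then forward-resolve that hostname and confirm it maps back to the same IP. No hand-maintained CIDR list needed.

    import socket

    def verify_crawler_ip(remote_ip: str,
                          allowed_suffixes=(".googlebot.com", ".google.com")) -> bool:
        # Reverse lookup: IP -> hostname.
        try:
            host, _, _ = socket.gethostbyaddr(remote_ip)
        except OSError:
            return False
        # Hostname must belong to the crawler's domain.
        if not host.endswith(allowed_suffixes):
            return False
        # Forward lookup: hostname must resolve back to the same IP.
        try:
            return remote_ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False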

Killing Google's ability to crawl your site is pretty much as bad as, or worse than, the bot problem you're trying to solve. You should still be able to ID the bad guys, though, by noting which IPs the spoofed user-agents come from when they hit, say, some honey-pot links specifically disallowed for that user-agent in your robots.txt file.
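A rough sketch of that honey-pot idea, assuming a hypothetical /trap-for-googlebot/ path: disallow it for the crawler in robots.txt, link to it invisibly, and treat any request for it that claims that crawler's user-agent as a spoofer.

    # robots.txt served to crawlers:
    #   User-agent: Googlebot
    #   Disallow: /trap-for-googlebot/
    #
    # A compliant Googlebot never requests that path, so any hit claiming to be
    # Googlebot is presumed spoofed and its IP gets flagged.

    HONEYPOT_PREFIX = "/trap-for-googlebot/"  # hypothetical trap path
    flagged_ips: set[str] = set()

    def check_request(path: str, user_agent: str, remote_ip: str) -> None:
        if path.startswith(HONEYPOT_PREFIX) and "Googlebot" in user_agent:
            flagged_ips.add(remote_ip)  # candidate for blocking or further review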

In the general case it's important to let Google crawl your site, but you could probably get by fine without Facebook's crawler.
