
I understand the point of using a special user-agent to crawl webpages for indexing, but search engines should occasionally crawl with a "regular" browser UA string (full JavaScript and all, to simulate an actual browser), and from a different IP range too, of course. If the contents of the page are wildly different, penalise the site.
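A minimal sketch of that comparison step, assuming Python and an illustrative URL; a real check would also render JavaScript and come from a separate IP range, which this text-only diff skips:

    import difflib
    import urllib.request

    # Illustrative UA strings; real crawlers rotate many variants.
    GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")

    def fetch(url, user_agent):
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def cloaking_score(url):
        bot_html = fetch(url, GOOGLEBOT_UA)
        human_html = fetch(url, BROWSER_UA)
        # ratio() is 1.0 for identical markup; values near 0 suggest the
        # "wildly different" pages a penalty would target.
        return difflib.SequenceMatcher(None, bot_html, human_html).ratio()

    print(cloaking_score("https://example.com/"))

Where to set the similarity threshold is a judgment call; legitimate personalisation and A/B tests also lower the score.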



They do both, or at least Google does.


Even if they do both, if the bots always obey what is entered in robots.txt and humans do not, it won't be long before that becomes the primary signal for telling the two apart.
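For concreteness, this is the asymmetry being described: a polite crawler consults robots.txt before fetching, while a human's browser never performs that step. A minimal sketch using Python's stdlib parser, with an illustrative site and path:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetches and parses the live robots.txt

    # A compliant bot skips disallowed paths; a human following a link
    # arrives regardless, never having read robots.txt at all.
    if rp.can_fetch("Googlebot", "https://example.com/some-article/"):
        print("allowed: a polite crawler would fetch this")
    else:
        print("disallowed: requests here come from humans or impolite bots")

The site-side tell is then behavioural: a visitor that systematically avoids every disallowed path across a long session is respecting robots.txt, and no human does that.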


And that's totally fine - if a site added its articles (for example) to its robots.txt to exploit that, it would cripple its SEO. It wouldn't happen.


I've seen stuff in robots.txt get crawled anyway if enough people link to it. In Google's results it will still show up, though without any contextual information.


Google won't crawl it, but they can still include the link in search results, usually with the title guessed from the way it was referred to on another page, and no description.


The Google bot mostly ignores the paths specified in robots.txt.


What about sites that display different content depending on the visitor's origin (geo-targeting by IP, for example)?



