
That's a good point. I'm not sure how you'd get around non-HTML documents (e.g. PDFs), but web pages themselves can be excluded via a meta tag:

    <meta name="robots" content="noindex">
Source: https://support.google.com/webmasters/answer/93710?hl=en

Interestingly, that article includes the following disclaimer about why robots.txt shouldn't be used for your example:

"Important! For the noindex meta tag to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex tag, and the page can still appear in search results, for example if other pages link to it."

I must admit even I hadn't realised that could happen, and I was critical of the use of robots.txt to begin with.
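To make the interaction concrete, here's a rough sketch of the conflicting setup that disclaimer warns about, using a hypothetical page /private.html (the path is just for illustration): the robots.txt rule stops the crawler from ever fetching the page, so the noindex in its head is never read, and the URL can still end up in results via inbound links.

    # robots.txt -- blocks crawling, so the crawler never sees the tag below
    User-agent: *
    Disallow: /private.html

    <!-- in the head of /private.html -- only effective if the page can be crawled -->
    <meta name="robots" content="noindex">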




For PDFs you can use the X-Robots-Tag HTTP header [0] (example sketch below).

Nofollow is a good suggestion if you control the links to the resource, robots if you don't.

[0]: https://developers.google.com/webmasters/control-crawl-index...
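For illustration, one way to attach that header to every PDF response, assuming an nginx server (the location pattern and setup are my example, not from the linked doc):

    # nginx: send X-Robots-Tag for any URL ending in .pdf
    location ~* \.pdf$ {
        add_header X-Robots-Tag "noindex";
    }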


Ah, that's true indeed. The page, though, will appear as a link without any content, because the bot won't be able to index it.


Except it has indexed it; it just hasn't crawled it. But content or not, the aim you were trying to achieve (namely, keeping your content out of the index) has failed. Thus you are once again dependent on other countermeasures that render the robots.txt irrelevant.



