Hacker News

This is crazy, that's not what robots.txt is for. How can you complain about the security of a thing that is not meant to provide security?

According to your logic, newspapers are a "failed experiment because they rely on trust rather than security or thoughtful design". I published an article with my treasure map and told people not to go there, but they stole it.




That was an anecdote, since the previous poster raised the point about security. I'm definitely not claiming robots.txt should be used for security, nor that it was designed for security!

I said that following proper security and design practices renders obsolete all the edge cases for which people might use robots.txt. If you design your site properly, you shouldn't really need a robots.txt. That applies to every example of robots.txt usage that HN commenters have raised so far.

I would rewrite my OP to make my point clearer but sadly I no longer have the option to edit it.


> design your site properly then you shouldn't really need a robots.txt

But how? For example, if you don't want a page to be indexed by Google, you add this information to robots.txt. Nofollow doesn't work for every case, because any external website can link to it, and Google will discover it.
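
For anyone following along, the robots.txt approach being described is just a plain-text file at the site root. A minimal sketch (the path is made up for illustration):

```
# https://example.com/robots.txt
# Ask all well-behaved crawlers not to fetch anything under /private/
User-agent: *
Disallow: /private/
```

Note this only asks crawlers not to *fetch* those URLs; as discussed below, it doesn't by itself keep them out of the index.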


That's a good point. I'm not sure how you'd get around non-HTML documents (e.g. PDFs), but web pages themselves can be excluded via a meta tag:

    <meta name="robots" content="noindex">
Source: https://support.google.com/webmasters/answer/93710?hl=en

Interestingly, that article contains the following disclaimer about not using robots.txt for exactly your example:

"Important! For the noindex meta tag to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex tag, and the page can still appear in search results, for example if other pages link to it."

I must admit even I hadn't realised that could happen, and I was critical of the use of robots.txt to begin with.


For PDFs you can use the X-Robots-Tag HTTP header [0].

Nofollow is a good suggestion if you control the links to the resource; robots if you don't.

[0]: https://developers.google.com/webmasters/control-crawl-index...
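
For anyone unfamiliar with it: X-Robots-Tag is just a response header, so it works for any content type, not only HTML. A PDF response might look roughly like this (a sketch; the other headers will vary by server):

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex
```

You'd typically configure your web server to attach that header to PDF responses rather than set it per file.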


Ah, that's true, indeed. The page, though, will appear as a link without any contents, because the bot won't be able to index it.


Except it has indexed it. It just hasn't crawled it. But content or not, the aim you were trying to achieve (namely, your content not being indexed) has failed. Thus you are once again dependent on other countermeasures, which render the robots.txt irrelevant.



