
To be honest I see robots.txt as a failed experiment since it relies on trust rather than security or thoughtful design.



I don't think it's about security.

For example, I've got a link that does delegated login, like /login-with/github. When people click it, an OAuth flow starts. It's useless for robots to follow, so I disallow it in robots.txt. If they follow it anyway, nothing breaks and it's not a security issue, but if I can avoid starting unnecessary OAuth logins, that's an additional benefit.
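
i.e. something like this in robots.txt (a sketch; the prefix form is my guess at covering all the providers):

    User-agent: *
    Disallow: /login-with/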


robots.txt wasn't created for security, but it can have security implications if you publish a list of Disallow paths with the intention of hiding sensitive content (sadly I have seen that happen a lot), whereas a better approach would be IP whitelisting and/or user authentication.
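
To illustrate (with made-up paths), a robots.txt like this is effectively a public map of everything you'd rather nobody found:

    User-agent: *
    Disallow: /admin/
    Disallow: /backups/

Anyone can fetch /robots.txt and walk straight to the "hidden" paths.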

However I'm not claiming security is the only reason people use (misuse?) robots.txt. For example, in your case you could mitigate your need for a robots.txt with a nofollow attribute[1]. Sure, bad bots could still crawl your site and find the authentication URL without probing robots.txt, so the security implications there are pretty much non-existent. But you've already got a thoughtful design (the other point I raised) that mitigates the need for robots.txt anyway, so adding something like "nofollow" may be enough to remove the robots.txt requirement altogether.
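
In your case that could be as simple as the following sketch (the URL is from your example; the link text is invented):

    <a href="/login-with/github" rel="nofollow">Log in with GitHub</a>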

[1] https://en.wikipedia.org/wiki/Nofollow


This is crazy, that's not what robots.txt is for. How can you complain about the security of a thing that is not meant to provide security?

According to your logic, newspapers are a "failed experiment because they rely on trust rather than security or thoughtful design". I published an article with my treasure map and told people not to go there, but they stole it.


That was an anecdote, since the previous poster raised the point about security. I'm definitely not claiming robots.txt should be used for security, nor that it was designed for security!

I said that following proper security and design practices renders obsolete all the edge cases for which people might use robots.txt. I'm saying that if you design your site properly then you shouldn't really need a robots.txt. That applies to every example HN commenters have raised so far regarding their robots.txt usage.

I would rewrite my OP to make my point clearer but sadly I no longer have the option to edit it.


> design your site properly then you shouldn't really need a robots.txt

But how? For example, if you don't want a page to be indexed by Google, you add this information to robots.txt. Nofollow doesn't work for every case, because any external website can link to it, and Google will discover it.


That's a good point. I'm not sure how you'd get around non-HTML documents (e.g. PDFs), but web pages themselves can be excluded via a meta tag:

    <meta name="robots" content="noindex">
Source: https://support.google.com/webmasters/answer/93710?hl=en

Interestingly in that article, there is the following disclaimer about not using robots.txt for your example:

"Important! For the noindex meta tag to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex tag, and the page can still appear in search results, for example if other pages link to it."

I must admit even I hadn't realised that could happen, and I was critical of the use of robots.txt to begin with.


For PDFs you can use the X-Robots-Tag HTTP header [0].
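
e.g. the response serving the PDF would carry something like:

    X-Robots-Tag: noindex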

Nofollow is a good suggestion if you control the links to the resource, robots.txt if you don't.

[0]: https://developers.google.com/webmasters/control-crawl-index...


Ah, that's true, indeed. The page, though, will appear as a link without any contents, because the bot won't be able to index it.


Except it has indexed it. It just hasn't crawled it. But content or not, the aim you were trying to achieve (namely your content not being indexed) has failed. Thus you are then once again dependent on other countermeasures that render the robots.txt irrelevant.


> robots.txt wasn't created for security but it can have security implications if you publish a list of Disallow paths with the intention of hiding sensitive content

Using robots.txt to secure your server from bots is the equivalent of attempting to secure your house from robbery by planting a sign that says "please, don't rob my house". Surprisingly, it may work from time to time, but if you're into attempting security by wishful thinking, maybe don't be too surprised when it fails about as often as security by chance.


I know. With the greatest of respect, your counter-argument is literally just reiterating the point I was making. Albeit in the quote you've left off the part of my post where I said it's stupid to use robots.txt in this way.


Note that links marked with nofollow can still be followed by well-behaved bots: https://en.wikipedia.org/wiki/Nofollow#Interpretation_by_the...


robots.txt is not a security tool. It's a communication tool that gives advice. Just like sitemaps.

If you add security (logins) to protect content that doesn't need protecting, you inconvenience users.


I'd already covered the security point replying to another poster (https://news.ycombinator.com/item?id=14163792) but just to be clear, I'm absolutely not claiming robots.txt is a security tool. Quite the opposite: I'm saying that following good security and design practices renders the robots.txt file obsolete.

Your point about sitemaps helps illustrate that, because having a decent sitemap mitigates the need for Allow lines in robots.txt. It's another area of the web that robots.txt isn't well equipped to handle, and thus other, better tools have been built to highlight pages of interest to search engines.
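
e.g. a minimal sitemap.xml (placeholder URL) highlights a page of interest far more expressively than Allow lines ever could:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/page-you-want-indexed</loc>
      </url>
    </urlset>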


https://en.wikipedia.org/wiki/Robots_exclusion_standard#Secu...

robots.txt was proposed after a badly behaved bot DoSed a web server 20+ years ago; those were different times. With the robots.txt standard, those who want to play nice can now do so without having to ask for anything, while for the badly behaved ones it's still up to the admin to put the appropriate measures in place.


Wow has it really been more than 20 years!?? I feel old now...

I do get what you're saying, but if you have to implement "appropriate measures" anyway, then the robots.txt file becomes completely redundant.


I came here to say something about respecting the wishes of others, etc, but you know what? You're absolutely right. We shouldn't even need to have a conversation about trust and respect.

It should be non-negotiable if you don't want your personal content indexed by scrapers and archivers, and it should be enforced by design. It's a broken system.


Lots of laws are pretty similar. E.g. technically you could steal loads of things; practically, you don't. Defeating/ignoring mechanisms such as robots.txt (vs. maybe some security person in a store) still doesn't make stealing ok.


The morality of whether bots should obey robots.txt is a separate issue to the point I raised about how you shouldn't trust bots to obey it. To use your example of high street stores: shops put security tags on expensive items and clothing as a way of securing products against theft, because you cannot blindly trust everyone not to steal (though wouldn't it be great if that weren't the case). Equally, websites cannot trust that bots will obey robots.txt.

Which means any content that shouldn't be crawled needs to be behind nofollow attributes or (if it's sensitive) user authentication layers, and any content that does need to be indexed also needs to be in a sitemap. Once you have all of these extra layers implemented, the robots.txt becomes utterly redundant. Hence I say it's a failed experiment: the benefits it offers are superseded by better solutions.



