
That's not how robots.txt works. Google explicitly says [1] that:

> A robots.txt file [...] is not a mechanism for keeping a web page out of Google.

> You should not use robots.txt as a means to hide your web pages from Google Search results.

> If your web page is blocked with a robots.txt file, it can still appear in search results

[1] https://developers.google.com/search/docs/advanced/robots/in...




Fair, it seems like noindex is the better way. The exact technical details don't really matter; the point is that Google respects your wish not to be indexed if you express it.

The law as written is extremely draconian because it will force Google to link to these orgs (providing value) and then also pay them (providing value). It explicitly forbids Google from delisting them.
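For anyone curious what "expressing it" looks like in practice, here is a rough sketch of checking a page for a noindex signal, either in a robots meta tag or in an X-Robots-Tag response header. The helper name has_noindex and the sample inputs are made up for illustration, and this only approximates how a crawler reads these directives.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html, headers=None):
    """Hypothetical helper: True if the page or its headers ask not to be indexed."""
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
print(has_noindex("<html></html>", {"X-Robots-Tag": "noindex"}))
```

Note that a crawler has to be able to fetch the page to see either signal, so a page carrying noindex should not also be blocked in robots.txt.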


That very document explains why blocking a page in robots.txt may still get you indexed or quoted by Google (it's because Google indexes other sites, and they may quote or mention you), and then goes on to explain that they provide mechanisms to help solve that problem.

The general point made by the post you're responding to is perfectly correct: Google provides many ways for a web site to stop its content from being indexed. That includes all the newspaper sites. And they very explicitly don't use them. This is www.news.com.au's robots.txt:

https://www.news.com.au/robots.txt

Notice how they ban some spider called "NewsNow". But they don't ban Google.
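If you want to check this kind of thing yourself, Python's standard library can evaluate robots.txt rules. This is a minimal sketch: SAMPLE_ROBOTS_TXT is a made-up file that only mimics the pattern described above (one named spider banned, everyone else mostly allowed), not the actual contents of the news.com.au file, and example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, written for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: NewsNow
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt lets user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("NewsNow", "Googlebot"):
    print(agent, is_allowed(SAMPLE_ROBOTS_TXT, agent, "https://example.com/story"))
```

Keep in mind this only tells you whether crawling is permitted. As the quoted Google documentation says, a page blocked this way can still show up in search results.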



