
Why not use robots.txt instead of littering your html with googlebot instructions?



Hi, author here. Google stopped supporting robots.txt [edit: as a way to fully remove your site] a few years ago, so these meta tags are now the recommended way of keeping their crawler at bay: https://developers.google.com/search/blog/2019/07/a-note-on-...
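For anyone unfamiliar, the tag in question is the standard robots meta tag in the page's head (a minimal sketch; adjust the content value as needed, e.g. "noindex, nofollow"):

    <meta name="robots" content="noindex">

Google also documents a googlebot-specific variant (name="googlebot") if you only want to target their crawler.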


Did you actually read your link? That's not at all what it says.


To be clear, Google stopped supporting the noindex directive in robots.txt a few years ago.

Combined with the fact that Google might list your site [based only on third-party links][1], robots.txt isn't an effective way to remove your site from Google's results.

Sorry, could have been clearer.

[1]: https://developers.google.com/search/docs/advanced/robots/in...


This page has a little more detail: https://developers.google.com/search/docs/advanced/crawling/...

"If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex. "


>noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.

Seems clear enough to me


Quote from the linked article:

“ For those of you who relied on the noindex indexing directive in the robots.txt file, which controls crawling, there are a number of alternative options:”

The first option is the meta tag. It does mention an alternative directive for robots.txt, however.
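That alternative is Disallow, which blocks crawling but (per the rest of this thread) doesn't guarantee the URL stays out of the index. A minimal robots.txt sketch:

    User-agent: Googlebot
    Disallow: /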


What about blocking the Googlebot by its IPs, combined with the user agent? Wouldn't that stop the crawlers?

Google crawler IPs: https://www.lifewire.com/what-is-the-ip-address-of-google-81...


That will stop the crawlers but you could still show up in the search results, because of other web pages. From GP:

> If other pages point to your page with descriptive text, Google could still index the URL without visiting the page


Did you think that mighty Google would pay attention to your puny "noindex" tag? Ha!


According to Google's own docs, this should work.

> You can prevent a page from appearing in Google Search by including a noindex meta tag in the page's HTML code, or by returning a noindex header in the HTTP response.

Source: https://developers.google.com/search/docs/advanced/crawling/...
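For completeness, the header variant is X-Robots-Tag; a minimal sketch of a response carrying it:

    HTTP/1.1 200 OK
    X-Robots-Tag: noindex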


I mean, technically that says your site won't appear in search results, not that your site won't be used to profile people, determine other sites' ratings based on your site's content, etc.

They won't show your site's content, but that doesn't mean they won't use it.


I thought that (i.e. removing the site from google search) was the goal.

I'd review the other usage on a case by case basis; e.g. determining ratings of other sites seems fair use to me. I'd guess you're allowing others to use your site's content when you're making your site public (TINLA).


maybe, but I guess I would be cantankerous enough to see the goal as preventing google from profiting off your site.


Until they change the rules again...


yes, I do think that


Google will still index the URL even if you block them from crawling the page via robots.txt. They will index the page, and it can still rank well. Google just puts up a message in the results saying they're not allowed to crawl the page.


robots.txt stops crawling - you can still get indexed via other mechanisms.

You want noindex robots tags on all your pages, and let Google see those.

You can use GSC (Google Search Console) to remove a site or page from the index.


Yes, pretty sure this is the way to go.

You can even specify which bots are allowed to index and which aren't.


Or even better, iptables rules :P
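A minimal sketch of that (66.249.64.0/19 is a commonly cited Googlebot range, but check Google's published list of crawler IP ranges before relying on it):

    # drop all traffic from an example Googlebot range
    iptables -A INPUT -s 66.249.64.0/19 -j DROP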


Doesn’t that mean you have to know every IP address used by Googlebot, now and in the future?


The way to check Googlebot (in a way that is resistant to future expansion of Googlebot's IP ranges) is to do a reverse DNS lookup on the requesting IP, then a forward DNS lookup on the resulting hostname to verify that the rDNS isn't a lie: https://developers.google.com/search/docs/advanced/crawling/...
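A minimal sketch of that check in Python (assumes the documented googlebot.com / google.com hostname suffixes; these stdlib calls block, so in practice you'd cache the results):

    import socket

    def is_googlebot(ip):
        # reverse DNS: IP -> hostname
        try:
            host, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # forward DNS: hostname -> IP, to confirm the rDNS isn't spoofed
        try:
            return socket.gethostbyname(host) == ip
        except socket.gaierror:
            return False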


Indeed, this was one of the things I considered (note I'm not OP), but then I didn't really want to rely on DNS. https://duckduckgo.com/?q=it's+always+DNS


Not a very hard problem; after all, many websites allow full access to Googlebot IP ranges yet show a paywall to everyone else (including competing search engines).

I also happen to ban Google's ranges on multiple less-public sites, especially since they completely ignore robots.txt and crawl-delay.


Is that how archive.vn works? I've always wondered how they are able to get the full text of paywalled sites like the Wall Street Journal, which gives 0 free articles per month.



