Hi, author here. Google stopped supporting robots.txt [edit: as a way to fully remove your site] a few years ago, so these meta tags are now the recommended way of keeping their crawler at bay: https://developers.google.com/search/blog/2019/07/a-note-on-...
To be clear, Google stopped supporting the noindex directive in robots.txt a few years ago.
Combined with the fact that Google might list your site based only on third-party links [1], robots.txt isn't an effective way to remove your site from Google's results.
"If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex. "
> noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.
“For those of you who relied on the noindex indexing directive in the robots.txt file, which controls crawling, there are a number of alternative options:”
The first option is the noindex meta tag. It does mention an alternative robots.txt directive (Disallow), however.
> You can prevent a page from appearing in Google Search by including a noindex meta tag in the page's HTML code, or by returning a noindex header in the HTTP response.
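Not from the docs, but a rough sketch of what that looks like in practice. Assuming a Flask app (my choice of framework, nothing in the quote requires it), you'd emit the meta tag in the HTML and/or set the header on the response:

    # Hypothetical example: serve a page with both forms of noindex --
    # the robots meta tag in the HTML and the X-Robots-Tag response header.
    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/private")
    def private_page():
        html = (
            "<!doctype html><html><head>"
            '<meta name="robots" content="noindex">'  # HTML form
            "</head><body>Not for search engines.</body></html>"
        )
        resp = make_response(html)
        # Header form; also works for non-HTML resources like PDFs.
        resp.headers["X-Robots-Tag"] = "noindex"
        return resp

Either one on its own is enough; the header is handy when you can't edit the markup.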
I mean, technically that only says your site won't appear in search results, not that your content won't be used to profile people, determine other sites' ratings, etc.
They won't show your site's content, but that doesn't mean they won't use it.
I thought that (i.e. removing the site from Google search) was the goal.
I'd review the other uses on a case-by-case basis; e.g. determining ratings of other sites seems like fair use to me. I'd guess you're allowing others to use your site's content when you make it public (TINLA).
Google will still index the URL even if you block them from crawling the page via robots.txt, and it can still rank well. Google just puts up a message in the results saying they're not allowed to crawl the page.
The way to check for Googlebot (in a way that will be resistant to future expansion of Googlebot's IP ranges) is to do a reverse DNS lookup on the requesting IP, then a forward DNS lookup on the returned hostname to verify the rDNS isn't a lie: https://developers.google.com/search/docs/advanced/crawling/...
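For reference, a minimal sketch of that forward-confirmed reverse DNS check (my own code, not taken from the linked doc):

    # Verify a claimed Googlebot: reverse-resolve the IP, check the hostname
    # belongs to Google, then forward-resolve the hostname back to the IP.
    import socket

    def is_googlebot(ip: str) -> bool:
        try:
            host, _, _ = socket.gethostbyaddr(ip)      # reverse DNS
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
        except OSError:
            return False
        return ip in addrs                             # forward-confirmed

Checking the User-Agent string alone isn't enough, since anyone can spoof it; that's why the forward lookup matters.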
Not a very hard problem; after all, many websites allow full access to Googlebot IP ranges yet show a paywall to everyone else (including competing search engines).
I also happen to ban Google ranges on multiple less-public sites, especially since they completely ignore robots.txt and crawl-delay.
Is that how archive.vn works? I've always wondered how they are able to get the full text of paywalled sites like the Wall Street Journal, which gives 0 free articles per month.