Hi, author here. Google stopped supporting robots.txt [edit: as a way to fully remove your site] a few years ago, so these meta tags are now the recommended way of keeping their crawler at bay: https://developers.google.com/search/blog/2019/07/a-note-on-...
To be clear, Google stopped supporting the noindex directive in robots.txt a few years ago.
Combined with the fact that Google might list your site based only on third-party links [1], robots.txt isn't an effective way to remove your site from Google's results.
"If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex. "
> noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.
“For those of you who relied on the noindex indexing directive in the robots.txt file, which controls crawling, there are a number of alternative options:”
The first option is the noindex meta tag. It does mention an alternative robots.txt directive (Disallow), however.
> You can prevent a page from appearing in Google Search by including a noindex meta tag in the page's HTML code, or by returning a noindex header in the HTTP response.
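Not from the docs, but a rough sketch of what that looks like in practice. Assuming a Flask app (my choice of framework, nothing in the quote requires it), you'd emit the meta tag in the HTML and/or set the header on the response:

    # Hypothetical example: serve a page with both forms of noindex --
    # the robots meta tag in the HTML and the X-Robots-Tag response header.
    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/private")
    def private_page():
        html = (
            "<!doctype html><html><head>"
            '<meta name="robots" content="noindex">'  # HTML form
            "</head><body>Not for search engines.</body></html>"
        )
        resp = make_response(html)
        # Header form; also works for non-HTML resources like PDFs.
        resp.headers["X-Robots-Tag"] = "noindex"
        return resp

Either one on its own is enough; the header is handy when you can't edit the markup.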
I mean, technically that only says your site won't appear in search results, not that your content won't be used to profile people, determine other sites' ratings, etc.
They won't show your site's content, but that doesn't mean they won't use it.
I thought that (i.e. removing the site from Google search) was the goal.
I'd review the other uses on a case-by-case basis; e.g. determining ratings of other sites seems like fair use to me. I'd guess you're allowing others to use your site's content when you make it public (TINLA).
Google will still index the URL even if you block them from crawling the page via robots.txt, and it can still rank well. Google just puts up a message in the results saying they're not allowed to crawl the page.
The way to check for Googlebot (in a way that will be resistant to future expansion of Googlebot's IP ranges) is to do a reverse DNS lookup on the requesting IP, then a forward DNS lookup on the returned hostname to verify the rDNS isn't a lie: https://developers.google.com/search/docs/advanced/crawling/...
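For reference, a minimal sketch of that forward-confirmed reverse DNS check (my own code, not taken from the linked doc):

    # Verify a claimed Googlebot: reverse-resolve the IP, check the hostname
    # belongs to Google, then forward-resolve the hostname back to the IP.
    import socket

    def is_googlebot(ip: str) -> bool:
        try:
            host, _, _ = socket.gethostbyaddr(ip)      # reverse DNS
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
        except OSError:
            return False
        return ip in addrs                             # forward-confirmed

Checking the User-Agent string alone isn't enough, since anyone can spoof it; that's why the forward lookup matters.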
Not a very hard problem; after all, many websites allow full access to Googlebot IP ranges yet show a paywall to everyone else (including competing search engines).
I also happen to ban Google ranges on multiple less-public sites, especially since they completely ignore robots.txt and crawl-delay.
Is that how archive.vn works? I've always wondered how they are able to get the full text of paywalled sites like the Wall Street Journal, which gives 0 free articles per month.