Removing my site from Google search (btao.org)
215 points by todsacerdoti on Oct 3, 2021 | 109 comments



It seems like the amount it will hurt Google is directly proportional to the amount it will hurt the owner of the site (assuming they want people to read their message).

I'm sure someone at Google is pretty happy that they don't have to show this page in their search results. Nobody can accuse them of bias against anti-Google pages -- the site owner did it to themselves.

Seems like as perfect an example of "cutting off nose to spite face" as I can imagine. (ok, refusing the vax and dying of COVID to get back at the left might be a better example, but this one is close)


That may end up being true, but if it were to inspire a movement off of Google at a large enough scale, it could have a significant impact.

Kinda like everything at human scale: if I stop using fossil fuels the planet still burns; if we all stop, we have a chance.


The way to create real change is to come up with viable alternatives. I use DDG as my primary search engine but still find myself using Google on a daily basis using "!g".

Rarely do principles alone keep people using inferior products or technologies.


>Seems like as perfect an example of "cutting off nose to spite face" as I can imagine.

If Google was purely beneficial, then yes. If no, then it's a tradeoff.


Maybe. If enough people use alternative search engines it might not matter a lot for him.


What's the point of the remark at the end of your comment? Completely unnecessary bait for irrelevant political bickering.


Did you just compare boycotting Google with not getting vaxed? I used to vote left. But reading people like you arguing like this reminds me to never ever do so again.


Go read some Foxnews comments to get some balancing dumb from the right. People say stupid shit online and neither side has a monopoly.


Interesting. That just means I can make you vote any way I want by making the dumbest argument for the other side. You advertise a strategy that guarantees that you will always cast a manipulated vote. Entertaining.


You seem to imply that not voting for the left automatically means voting for the right. May I remind you that there is another option you probably forgot about? Did you ever hear about the concept of abstaining from voting?


That is also a desired outcome. If I desire a green outcome, I only have to make blues stop voting. In multi-party system if I can make reds and yellows also stop voting it’s great. Fortunately, they will all comply even if I tell them I’m doing this. In fact, they (and you) will particularly comply when told for fear of the shame of backing down.

This pride is particularly exploitable and is why you can orchestrate all sorts of outcomes among demographics that have this vulnerability.


In a two-party system (a necessary de facto outcome of a single-member-district plurality system), failure to vote for a candidate is mathematically equivalent to a vote for the opposing party.


That's nonsense. By that logic both sides can claim "They didn't vote FOR us, so essentially they voted for the other side", which means that by not voting you voted for both.


Interesting. Judging from your comment, I guess you are living in a country with a two-party system and the death penalty? Well, I don't.


I fail to see the mathematical equivalence. One is: side A: +1, side B: 0.

The other is: side A: 0, side B: 0.


That comment was just an analogy.


"I used to do X, but now I understand better and won't do X again" is a manipulation technic.


And comparing everything you don't like with anti-vax is not manipulative? C'mon, be more creative.


The opposite of stupidity is not intelligence


Oh hey, I thought I was the only one. lucb1e.com and another site are also not indexed, though I blocked it based on the user agent string. That way it doesn't get page data or non-HTML files from my server. I introduced this when they were pulling this AMP thing: https://lucb1e.com/?p=post&id=130 It personally doesn't impact me, but it impacts other people on the internet, and I figured it was the only thing I could do to try to diversify this market (since I myself already switched to another search engine).

There are zero other restrictions on my site. Use any search engine other than google. Or don't, up to you.
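A minimal sketch of what that kind of user-agent block can look like, assuming an nginx front end (illustrative only, not necessarily my exact setup):

    # Inside the server { } block: refuse requests whose User-Agent
    # claims to be Googlebot. Trivially bypassed by a spoofed UA, but it
    # stops the normal crawler from fetching page data or non-HTML files.
    if ($http_user_agent ~* "googlebot") {
        return 403;
    }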


That's a good idea, but Google sometimes crawls without the Google user agent, so that's not going to be 100 percent foolproof.

You'd be better off just blocking all of the IP addresses that Google crawls from. There are lists of those out there.

When I used to cloak website content and only serve up certain content to Google, the only reliable way was IP cloaking, because Google crawls using "partners", such as Comcast IPs.

So if you really want to get your site out of the index, serve up the page with a noindex tag, or noindex in the server header, based on Google IP addresses.
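As an illustration only, a small Python sketch of that IP-based decision (the range below is a placeholder, not a complete or current list of Google's crawler ranges):

    import ipaddress

    # Illustrative Googlebot range only; Google's crawler ranges change over
    # time, so a real setup should load the currently published list instead.
    GOOGLEBOT_NETS = [ipaddress.ip_network("66.249.64.0/19")]

    def extra_headers(client_ip: str) -> dict:
        """Return a noindex header for requests coming from Googlebot IPs."""
        addr = ipaddress.ip_address(client_ip)
        if any(addr in net for net in GOOGLEBOT_NETS):
            return {"X-Robots-Tag": "noindex, nofollow"}
        return {}

    # extra_headers("66.249.66.1")  -> {"X-Robots-Tag": "noindex, nofollow"}
    # extra_headers("203.0.113.5")  -> {}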


Hey! Googler here!

We don't use our hardware located on partner networks to do indexing. Those machines are searching for malware and serving some YouTube videos and Play Store downloads.


You forgot to add the word "currently"


"Because google crawls using "partners", such as using Comcast IPs."

Is this different from when others use proxies to evade access controls?


Why not use robots.txt instead of littering your html with googlebot instructions?


Hi, author here. Google stopped supporting robots.txt [edit: as a way to fully remove your site] a few years ago, so these meta tags are now the recommended way of keeping their crawler at bay: https://developers.google.com/search/blog/2019/07/a-note-on-...
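For anyone wondering what the tag looks like, it's the standard robots meta tag (there's also an equivalent X-Robots-Tag HTTP response header):

    <!-- In each page's <head>; it only works if the crawler is allowed
         to fetch the page and see the tag -->
    <meta name="robots" content="noindex">
    <!-- or, to target Google's crawler specifically: -->
    <meta name="googlebot" content="noindex">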


Did you actually read your link? That's not at all what it says.


To be clear, they stopped supporting robots.txt noindex a few years ago.

Combined with the fact that Google might list your site [based only on third-party links][1], robots.txt isn't an effective way to remove your site from Google's results.

Sorry, could have been clearer.

[1]: https://developers.google.com/search/docs/advanced/robots/in...


This page has a little more detail: https://developers.google.com/search/docs/advanced/crawling/...

"If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex. "


>noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.

Seems clear enough to me


Quote from the linked article:

“ For those of you who relied on the noindex indexing directive in the robots.txt file, which controls crawling, there are a number of alternative options:”

The first option is the meta tag. It does mention an alternative directive for robots.txt, however.


What about blocking the Google bot by its IPs, combined with the user agent? Wouldn't that stop the crawlers?

Google crawlers IPs https://www.lifewire.com/what-is-the-ip-address-of-google-81...


That will stop the crawlers but you could still show up in the search results, because of other web pages. From GP:

> If other pages point to your page with descriptive text, Google could still index the URL without visiting the page


Did you think that mighty Google would pay attention to your puny "noindex" tag? Ha!


According to google's own docs, this should work.

> You can prevent a page from appearing in Google Search by including a noindex meta tag in the page's HTML code, or by returning a noindex header in the HTTP response.

Source: https://developers.google.com/search/docs/advanced/crawling/...


I mean technically that says that your site won't appear in search results, not that your site won't be used to profile people, determine other sites' ratings based on your site's content, etc.

They won't show your site's content, but that doesn't mean they won't use your site's content.


I thought that (i.e. removing the site from google search) was the goal.

I'd review the other usage on a case by case basis; e.g. determining ratings of other sites seems fair use to me. I'd guess you're allowing others to use your site's content when you're making your site public (TINLA).


Maybe, but I guess I would be cantankerous enough to see the goal as preventing Google from profiting off your site.


Until they change the rules again...


yes, I do think that


Google will still index the URL even if you block them from crawling the page via robots.txt. They will index the URL, and it can still rank well. Google just puts up a message in the results saying they're not allowed to crawl the page.


robots.txt stops crawling - you can get indexed via other mechanisms.

You want noindex robots tags on all your pages, and to let Google see those.

You can use GSC (Google Search Console) to remove a site / page from the index.


Yes, pretty sure this is the way to go.

You can even specify which bots are allowed to index and which are not.


Or even better, iptables rules :P


Doesn't that mean you have to know every IP address used by Googlebot, now and in the future?


The way to check Googlebot (in a way that will be resistant to future expansion of Googlebot's IP ranges) is to perform a reverse hostname lookup, with a forward DNS lookup as well to verify that the rDNS isn't a lie: https://developers.google.com/search/docs/advanced/crawling/...
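A rough Python sketch of that verification, just to illustrate the idea (reverse lookup, domain check, then a forward lookup to confirm):

    import socket

    def is_verified_googlebot(client_ip: str) -> bool:
        """Reverse-resolve the IP, check the hostname is under googlebot.com
        or google.com, then forward-resolve that hostname and confirm it maps
        back to the same IP (so a spoofed rDNS record doesn't fool us)."""
        try:
            host, _, _ = socket.gethostbyaddr(client_ip)
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
        except socket.gaierror:
            return False
        return client_ip in forward_ips

    # Requires network access, e.g.:
    # is_verified_googlebot("66.249.66.1")  # True for a genuine crawler IP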


Indeed, this was one of the things I considered (note I'm not OP), but then I didn't really want to rely on DNS. https://duckduckgo.com/?q=it's+always+DNS


Not a very hard problem; after all, many websites allow full access to Googlebot IP ranges yet show a paywall to everyone else (including competing search engines).

I also happen to ban Google ranges on multiple less-public sites, especially since they completely ignore robots.txt and crawl-delay.


Is that how archive.vn works? I've always wondered how they are able to get the full text of paywalled sites like the Wall Street Journal, which gives 0 free articles per month.


Alternatively, just use EFF's Privacy Badger and DuckDuckGo to stop feeding the beast?

Those are active steps you can take - I am not convinced a few meta tags will stop Google spidering your site (even if it is invisible in results), and it is of questionable value if you are still using Google search and not blocking their scripts.


It's not either/or; you can do both.


Joke’s on you, Google search results are already atrociously useless as they are and nobody cares.


A more interesting issue is the opposite -- many large sites have robots.txt rules that Disallow all crawlers except Google. A new search engine can either respect robots.txt 100%, with the result that some major properties are completely unavailable in its index, ignore robots.txt in these special cases where the robots.txt configuration is unreasonable, or crawl anything that allows Google to crawl it. None of these options are great.
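For illustration, the robots.txt pattern in question typically looks something like this (a sketch, not copied from any particular site):

    # Google's crawler may fetch everything...
    User-agent: Googlebot
    Disallow:

    # ...every other crawler is told to stay out entirely
    User-agent: *
    Disallow: /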


Any idea why this would be (other than incompetence)?


I once read it's about bandwidth. Presumably, serving search-engine crawlers requires a lot of bandwidth, and some sites don't want this cost. They make an exception for Google because they want to be on Google.



Replying to Google bots with errors will keep you out of the index, though they may keep retrying forever in case the page starts working again. A lot of times when I am retiring a site I will look at traffic logs and it will all be Google or Bing requesting content that is gone - sometimes years after the content was taken down. I think it is just greed on their part, forever hoping that if content was there before it might appear there again. No telling how much bandwidth gets consumed by that sort of traffic each year. If you are terminating traffic directly it can be really interesting to see all the connections that get opened but never receive an http request before being closed. A lot of that seems to be broken browser plugins, people scanning for live ports, or maybe seeing what certificate is offered in the TLS handshake.


The first sentence grabbed my attention, and I was looking forward to learning about the "threat that surveillance capitalism poses to democracy and human autonomy". But then the article fell flat: he gave no examples of that threat, and neither did the linked article in The Guardian.

Are there specific examples of this type of harm? The only complaints that he made were that Google makes a lot of money (which I have no problem with), and that Google's conduct feels "creepy" to him (which is merely an emotional reaction).

He did hint at Google "modifying your off-screen behavior", and I was eager to learn about that as well... but then he left that unexplored too, and gave no follow-up or examples of that intriguing scenario.


He referenced this page: https://www.socialcooling.com/


Thanks, ok I just looked at that one too. Most of the examples on that page don't pertain to Google, and the ones that do pertain to Google are not harmful (e.g. targeted ads, or someone's emotional reaction to data collection).

Since his ire is specifically directed at Google, I'm still left wondering what specific harm he is envisioning from Google's activities.


I suspect the easiest way to remove your site is to place on it content that Google frowns upon.


Time to bring webrings back.


I use simple HTTP auth with an easy username and password on most of my sites. It is rarely a problem for anyone I invite, except perhaps Instagram's browser, and there is no crawler traffic.
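In nginx terms it's just something like this (a sketch, not my exact config):

    # HTTP Basic auth for the whole site; the htpasswd file holds the
    # deliberately easy username/password, created with e.g.
    #   htpasswd -c /etc/nginx/.htpasswd guest
    location / {
        auth_basic           "invited guests";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }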


That's a shitty solution. The whole point is to keep the website public.


I think it depends on who your audience is.

For a "general" audience, I would use a proof-of-work puzzle 10 seconds long and a basic-question captcha with human review.

But for a site which is primarily for tech-savvy people, with an emphasis on backward compatibility (HTTP auth is supported almost universally, even by Mosaic), I can't think of a better option. Not that interested in SEO, since the software is my main target.


I like Google, but wouldn’t mind a better search engine, even at the cost of my privacy, so long as I had a choice for what could be shared.


Can you explain what you mean by this? I’ve read it a few times and don’t understand


I think OP means they don't mind sharing their information with the search engine (be it Google, another engine that provides better results, or even a better Google in terms of results), _as long as_ OP has control over exactly what is being shared.

As an aside, I do see the trend for some companies to provide this control nowadays. Even Google is doing it (e.g. you can auto-delete your information, or turn collection off completely): https://myaccount.google.com/data-and-privacy

Of course, whether or not you believe Google is doing what you have configured in the backend is another question... and there is nothing anyone can do to actually make you believe it short of giving you complete access to the entire Google backend. Or is there a way to verify without exposing? Maybe an interesting research topic...


Sure. I probably could've been much clearer.

I don’t think Google taking your information and sharing it with advertisers is a great sin. Somewhat annoying but nothing particularly harmful.

I do think the search results are easily manipulated and it can be frustrating trying to find relevant information. Like most people I end up defaulting to Reddit for search queries just to find something that isn’t a blog by someone shilling their product.

But I understand the invasion of privacy would irritate some people and maybe in the long term it would be a net negative. So if there was a search engine that explicitly asked for certain information and you had the option to share, that would probably go a long ways.


You're in luck, since there's active development in this space: Neeva.com and kagi.com are two of the many alt search engines.


I listed 25 alternative indexing search engines, including Neeva: https://seirdy.one/2021/03/10/search-engines-with-own-indexe...

I'd forgotten about Kagi, thanks; someone mentioned it to me on IRC but it slipped my mind. I'll add it to the list later today.


seems like it would hurt your traffic


Only if there is relevant traffic from Google to begin with, which is highly unlikely for a site like this. A high percentage of results in almost every Google search comes from the closed circle of the same top 10,000 sites or so.

This is the beauty of a protest like this: this site does have valuable content, and if enough sites like it joined the protest, it could actually hurt the relevance of the Google index. The content that Google eventually figures out is valuable would no longer be available for it to index.


I don't think that's so unlikely: on my blog ~30% of visitors come from searches


Sorry my statement was both generalized and specific at the same time, and that did not turn out well. How many visits does your blog have daily? And what would it take you to remove your site from Google index?


> How many visits does your blog have daily?

~200k sessions in the past year, so ~550/d. Breakdown:

* ~30% search

* ~30% no referer

* ~25% HN

* ~7% Twitter/FB/etc

* ~8% other

> what would it take you to remove your site from Google index?

I don't see why I would want to exclude my site from any index? Being in indexes helps people find my writing, which I like!


> I don't see why I would want to exclude my site from any index? Being in indexes helps people find my writing, which I like!

It's essentially a form of boycott. If one believes Google is a problematic entity (too many fingers in too many aspects of our lives), it's a way to sever connections with them at some personal cost.

At least, if you care about search traffic - one might argue the assumption that Google-like search is the default way to navigate the web is one worth reconsidering and encouraging alternatives to anyway.


So, first, I don't think Google search is harmful, so I'm not especially interested in boycotting, but I'm happy to grant this for the discussion.

There's already an interesting question about when boycotting is a worthwhile tactic, but in this situation, there's the additional complexity of there being two ends at which one could boycott a search engine:

* Producer: don't allow the search engine to include your stuff

* Consumer: use a different search engine

This is not the only place you find this dynamic. For example, if I thought Chrome was harmful, I could choose:

* Producer: make my site incompatible with Chrome and suggest people switch

* Consumer: use a different browser

Or, with email:

* Producer: don't email people with @gmail.com addresses

* Consumer: use a different email provider

Thinking through these cases and similar ones, if you think Gmail / Chrome / Search is harmful then the "consumer" side makes sense: the alternatives are nearly as good so you're not giving up much, and you're helping increase diversity. On the other hand, the "producer" side ones are much less attractive, because they're a much larger sacrifice and the benefit doesn't seem that big.

(Disclosure: I work at Google, but not on Search, Chrome, or Gmail)


> (Disclosure: I work at Google, but not on Search, Chrome, or Gmail)

I appreciate that you mention in your bio that you do work in Ads at Google, which seems directly relevant to OP's point about boycotting Google by blocking indexing. If Google can't or otherwise doesn't index your content, Google can't profit from selling ads that it would otherwise show alongside search results for it. If Google de-indexing became popular among a group of content creators, other search engines may not be similarly blocked, and other alternatives to find said content would be found or created, all to the detriment of Google Ads placement, which is a profitable - and inseparable - component of Google Search.


I don't work on that kind of ad: I work in display ads. If you go to a newspaper or other publisher and see ads alongside the content, there's a good chance that my team owns the JS that handles requesting those ads and putting the responses on the screen.


The producer side would be the much more effective option if they could get a significant fraction of people to do the same. That might not happen now, but if enough people get fed up with Google maybe it will.


Just join a web ring like in the old days.

https://en.wikipedia.org/wiki/Webring


If that's your goal. Personally I host content that people can use or not. I'll link friends if I want them to see it. Visitors don't cost me anything, it doesn't really bring me anything (other than ego?) to have visitors either. Hence I saw fit to also block google (two years ago already apparently, I thought it was much more recent) and it didn't negatively impact my site in any way.


That depends on where your traffic originates from. Back when I tracked people on my site, I found I got very little from search results. Most of it (> 95%) came from links from social media and Github. On a blog that's heavily about privacy I wouldn't expect much to come from Google.

Also, so what if the numbers go down? If your reason for writing a blog is to see a number on a screen, then what does that actually give you?


> so what if the numbers go down? If your reason for writing a blog is to see a number on a screen, then what does that actually give you?

Traffic numbers are not an end in themselves, but are a decent proxy for "are other people getting value out of what I write?"


This assumes that the increase in traffic due to Google is beneficial, which it rarely is for personal diary sites.


Imagine wikipedia and the top newspapers doing this ... users will start to use another search engine.


If enough sites started to do this, Google would stop respecting it. This isn't blocking Google from indexing, it's asking Google not to index.


Normal users will start to use another newspaper


Wikipedia should do this, as Apple and Google are showing Wikipedia results as their own, robbing Wikipedia, IMO, of importance. Wikipedia is large enough that it should have its own search engine, likely with more relevant results.


It does not help Wikipedia to do that. The content on Wikipedia is licensed so that Apple and Google can show the content from Wikipedia (and this is by design, not a loophole). If the users can get access to the encyclopedic content more conveniently, that is still in line with the project's goals, even if that content reaches the user indirectly via a third party.


Identical situation when Facebook was asked to not show previews of news articles, because of the ePrivacy directive. Could this fall under the same legislation?

https://www.mysk.blog/2021/02/08/fb-link-previews/


No, people will start to read other news sites.


I did the same, though using the HTTP header “X-Robots-Tag: none”.
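In nginx, for example, that's a one-line sketch (Apache's Header directive works similarly):

    # "none" is equivalent to "noindex, nofollow"
    add_header X-Robots-Tag "none" always;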


Removing your website from Google search is the least of your worries. Every meaningful aspect of your life is now being monitored by corporations and governments. It is too fucking late. That fucked up social scoring system being used in China to oppress people is coming here, only instead of the government doing it directly, it will mostly be performed by corporations to keep up the appearance of a "free" society. Corps will collect your data, assign you a rank, and act accordingly.


So I did a search for the title of his blog post to see what comes up; this HN page is the top hit.


TFA cites a book on surveillance capitalism:

> …the essence of the exploitation here is the rendering of our lives as behavioral data for the sake of others' improved control of us.

I don't doubt that surveillance capitalism is a problem. I do doubt that the underlying motive here on the part of capitalist concerns is improved control of us. Capitalism wants profit to move towards capital. Improved control might be a way of doing that, but it's only one tool in the arsenal of capital, and it's a fragile one at best.

It seems to me that the main motive of capitalism is to continue to drive high levels of desire-driven spending, and that it is at least as likely that surveillance capitalism is mostly about understanding how best to do that as it is about actual control.

Yes, we know that marketing & advertising have demonstrated over the course of at least a century that consumer control is possible and desirable. It's just not clear that capitalism needs to increase or improve the level of control over what's already established.


Depends on how control is defined. Ultimately they want to steer your purchasing behavior, by showing you super focused ads on even more stuff that you would like to buy. That would be some kind of control.


So if "surveillance capitalism" is apparently the new scare, would "surveillance socialism" be better?

Or are we supposed to imagine that under socialism, there would be no need for Big Tech surveillance? This I most certainly disagree with.

I'm just noticing a trend lately where the word "capitalism" is being attacked on many fronts, and I personally find that troublesome.

Like Churchill said..."Capitalism is absolutely the worse system there is, besides everything else of course"


I don't really understand why you would assume that the only alternative to surveillance capitalism is "surveillance socialism".

This sophism seems to be built on two errors:

* that there is nothing outside of pure capitalism and pure socialism

* that adjoining "surveillance" to capitalism means we're talking about an inevitable aspect of our society that is combined to capitalism, rather than a specific subset of the way business is done in this age.

To be honest, this lack of ability to conceive alternative social systems is concerning. The deformed Churchill quote comes as a cherry on top.


Of course there are an infinite number of alternatives...I just quickly picked something that sounded decent that tried to make my point about my observation lately that Capitalism was getting knocked around everywhere I looked.

In the last Econtalk podcast that I listened to last night, they discussed the loneliness "epidemic" and the author ended up blaming Thatcher-based Capitalism as perhaps the main reason why people are so lonely today!

I thought that was quite the stretch but imagine my chagrin when here was another spurious back-handed attack on it.


> Capitalism [is] getting knocked around everywhere I looked.

That's what happened with every social system in the past, and, while failures certainly have happened, we've always found ways for that criticism to result in improvements to our societies.

It would be very surprising for the current shape of our society or, more generally, capitalism to be an exception to the rule, unless you subscribe to the "end of history" thesis.


No need to believe Fukuyama. The exception-to-the-rule concept can be based on the idea that capitalism is novel in its ability to absorb and transform any protest against it. The divine right of kings, to pick just one example, was unable to pull off this trick, and perished under questioning. So far, capitalism has done remarkably well at incorporating protest and criticism that targets it, in a way that does seem quite novel.


Presumably the "capitalism" in "surveillance capitalism" is to make it clear they're talking about private companies - as distinct from the traditional concerns about government surveillance.


I can’t tell if you’re being willfully obtuse or not. Harvesting people’s data for the express purpose of manipulating them into thinking/buying things that they otherwise wouldn’t is wrong in every sense of the word.

Capitalism has its problems just like everything else. Pretending it doesn’t is just as disingenuous as pretending that socialism would fix everything. If you’re concerned about people attacking capitalism, help fix the problems. Simple as that.


Churchill was referring to democracy, not capitalism.


I'm interested in why this comment is getting so knocked down?

I asked what I think is a valid question and would like to hear honest reactions from people about my observation.

I feel I have every right to be a cheerleader for Capitalism, as my father escaped communist Cuba in 1959 as Castro was coming to power and used the US's system and tons of hard work to create an extremely comfortable life for himself, while friends and family members who stayed there lived rather wretched lives.

He never forgot how lucky he was to be able to get out of there just in time and told me time and time again that the US while having flaws, was by far the best place in the world to live, so my original comment comes from this background.

I don't give a flying fuck if the comment gets modded down, but I would like to know just what in it is so offensive to those modding it down, so I can learn something.


I have this weird feeling that you probably think the United States government being overthrown and replaced with the Soviet Union 2.0 is a bad idea.

I suggest "improving society somewhat" so that doesn't happen, but I have a weird feeling that you'd also consider that socialism.


I block all Google trash from my servers and the companies I work for; we have a separate domain for all search and a separate service for websites that can be indexed by Google.

I purposely block all their IPs from my servers, AWS, and Azure as well, and then whitelist any service we want from those providers individually and carefully.

Google spies on everything online; they are the largest collector of exposed data, and if they weren't so big they'd be in jail for theft, IMO.



