
I've been in disagreements with SEO people quite frequently about a "Noindex" directive for robots.txt. There seem to be a bunch of articles that are sent to me every time I question its existence[0][1]. Google's own documentation says that noindex should go in an HTML meta tag, but the SEO people seem to trust these shady sites more.

I haven't read through all of the code, but assuming this is actually what's running on Google's scrapers, this section [2] seems to be pretty conclusive evidence to me that this Noindex thing is bullshit.

[0] https://www.deepcrawl.com/blog/best-practice/robots-txt-noin...

[1] https://www.stonetemple.com/does-google-respect-robots-txt-n...

[2] https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613...


Google is also really generous with how they will let you spell "disallow": https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613...
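
Roughly, the idea in that file looks like this (a minimal sketch, not the exact code; the real matcher and its kAllowFrequentTypos flag are in the linked source):

    #include <string>
    #include "absl/strings/match.h"

    // Sketch: accept a handful of common misspellings of "disallow" so that
    // a typo in someone's robots.txt doesn't silently void the rule.
    bool KeyIsDisallow(const std::string& key) {
      return absl::StartsWithIgnoreCase(key, "disallow") ||
             absl::StartsWithIgnoreCase(key, "dissallow") ||
             absl::StartsWithIgnoreCase(key, "dissalow") ||
             absl::StartsWithIgnoreCase(key, "disalow") ||
             absl::StartsWithIgnoreCase(key, "diasllow") ||
             absl::StartsWithIgnoreCase(key, "disallaw");
    }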

:D


I'm not surprised. Some people think humans read robots.txt and get super angry when the crawler doesn't understand.


I read robots.txt, but I'm not a massive corporation.


> (absl::StartsWithIgnoreCase(key, "disallaw")))));

Ah, the southern version. :)


This is great. #kAllowFrequentTypos


Ive never made a tpyo in my life!!!


Whats a tpyo? My typo array for "typo" only has tipo, typpo and thai-pho


If you're gonna include thai-pho you also need to include thai-fur...


Makes sense, because not everyone speaks English as a native language. "Disalow" is pretty close to "disallow", phonetically.


Yuck though! Imagine if you were writing a compiler. Would you make it accept “unsinged” “unnsigned” “unssined” and “unsined” as keywords, just to catch spelling mistakes? Not sure I like that pattern.


It's a little different in that case, since the person using the parser is also the person writing the input to the parser. So if the input fails the parser, the author of the code can simply correct it. As I understand it, there's no single standard that captures how all robots.txt files are formatted, so there's no "standard parser" that the authors of these files could be expected to pass.


That is not an excuse. Non-native speakers can learn to spell.


Google has been very clear lately (via John Mueller) regarding getting pages indexed or removed from the index.

If you want to make sure a URL is not in their index then you have to 'allow' them to crawl the page in robots.txt and use a noindex meta tag on the page to stop indexing. Simply disallowing the page from being crawled in robots.txt will not keep it out of the index.
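
In other words, the supported combination looks roughly like this (illustrative snippet, not lifted from Google's docs):

    # robots.txt: make sure the page is NOT disallowed, otherwise
    # Googlebot never fetches it and never sees the noindex
    User-agent: *
    Disallow:

    <!-- on the page you want kept out of the index -->
    <meta name="robots" content="noindex">

    # or, equivalently, as a response header
    X-Robots-Tag: noindex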

In fact, I've seen plenty of pages still rank well despite the page being disallowed in robots.txt. A great example of this is the keyword "backpack" in Google. You'll see the site doesn't want it indexed (it's disallowed in robots.txt) but the site still ranks well for a popular keyword.


That's correct. If a URL is blocked using robots.txt, Google will never be able to see the "noindex" tag on the page.

URLs blocked in robots.txt can get discovered through other links and they will get displayed in the search results.

However, you will not see any information like the meta description on these blocked URLs.

There's a good explanation about this here, including a video from former Googler, Matt Cutts: https://yoast.com/prevent-site-being-indexed/


> However, you will not see any information like the meta description on these blocked URLs.

True, but that's not the only thing. If it ever was in the index, it takes forever to be removed, if it gets removed at all. Send 404 or 410, Disallow it or set it to noindex - you may get lucky or you may not. You can of course "hide it from search results", but that only works for 90 days (iirc, may be 120, something in that range). Those leftovers will typically lose rankings, but they often stay indexed, easy to spot with a site: query.


Reindexing a page is dynamic based on noteworthiness and volatility iirc, but individual links can be reindexed on the fly since the Percolator index. The 90d number was from an old system when indexes were broken into shards that had to be swapped out wholesale.

Percolator white paper: https://ai.google/research/pubs/pub36726


I don't mean reindexing, I mean "hiding from the index" ("Remove URLs" in GSC). It works instantly, but only for a limited time, after which it will re-appear in the index if you haven't gotten it out of the index (via 410, noindex or disallow). Since these other ways don't always work, if you're unlucky and want it to stay gone, you need to hide it again (and again and again). I've had clients that were hacked and had spammy content injected into their site and it took (literally!) years for that to get removed (we tried combinations of 404, 410, noindex and disallow).


Yeah, the URL removal tool is not meant for permanent removals, but for temporary, 90-day removals:

https://support.google.com/webmasters/answer/1663419?hl=en


Exactly, there is no guaranteed way to remove anything, HTTP status, meta-tags, headers, and robots.txt only have advisory status. They are usually followed when a resource is hit first, but once it's in the index, "keeping the result available" seems to be a top priority. I do understand the idea - it might still be a useful result for a user, but otoh if it's 410 (or continuously 404), it won't be of any use because the content that was indexed is no longer available (especially in case of 410).

Granted, these are edge cases, in most circumstances, 410 + 90 day hiding means they are hidden instantly and don't resurface. These edge cases do make me take Google's official statements on how to deal with things with a grain of salt though: bugs exist, and unless you happen to know somebody at Google there's no way to report them.


Send 410 Gone with a noindex meta tag in html and X-Robots-Tag?

https://www.searchenginejournal.com/google-404-status/254429 "How Google Handles 404/410 Status Codes" -- "If we see a 410, they immediately convert that 410 into an error rather than protecting it for 24 hours"
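
So the response for a removed page would look something like this (headers only, illustrative):

    HTTP/1.1 410 Gone
    X-Robots-Tag: noindex
    Content-Type: text/html; charset=utf-8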


> You'll see the site doesn't want it indexed (it's disallowed in robots.txt) but the site still ranks well for a popular keyword.

Which site? [Edit: I have now found https://www.gcsbackpack.com/ on page 6 of the results, and this was presumably the intended site.]


Doesn't that indicate that Google doesn't respect robots.txt then?


No, disallow means that you are not allowed to crawl the page. You have to crawl the page to know you cannot index it. But how do you index a page you do not crawl? Well, if another page that you can crawl and index points to the uncrawlable page as authoritative on a keyword, then it will be in the index for that keyword, even though you do not have the actual crawled content of the page.


It really feels like they need to allow a list of `noindex` pages in the robots.txt then...


The whole point is they don’t want you to easily opt out of it.


Keep walking down that path to find v. heavy regulation.


Not if you buy the government first. I think they’re already a victim of their own greed though.

It’s bad enough I started using DDG for search because the results are now more relevant. Google’s advertising algorithms are designed to subtly nudge sites into paying for placement — which means there’s a “non-content” element to the search results that makes it into the user experience. I feel like there was a tipping point a year or two ago where the results just stopped being useful — The best analogy I can find is how search engines used to be in the days before AltaVista. Then AltaVista came out and the results were far more relevant (if not perfect). Google -> DDG feels like that in 2019.

That “non-content” element will only grow over time as Google seeks revenue growth — growth across all of Google’s non-advertising revenue streams combined is not enough to move the needle compared to the scale their ad business has — of which search ads are by far the most profitable. So they will further try to monetize search; it’s their cash cow but I think a small player like DDG could easily overtake them as the quality of Google’s search results (to the end user) continue to decline.


Agreed re: DDG search quality. It's my own default and preferred choice. Google remains useful for Scholar and Books, but relevance is rapidly declining and deceptive ads on SERPs are on the rise.


Right, but how do you index a page you weren't supposed to crawl in the first place?


It's like recommending a book you haven't read, and newspapers do that every day.

Basically Google finds the link in other places -> oh that must be interesting, I'm indexing it, without even reading it. So they don't have the actual content, and just use the texts from the sites that link to it.


But they do have the actual content, since they show the meta title and description, on top of what I assume is heavy NLP to drive the search engine itself.


Robots.txt stops a page from being crawled; the noindex tag stops it getting into the index.

Google is also slow to honour 404s and drop pages, which can hang around for ages; Bing is much faster to remove 404 pages.


Usually, returning a "410 Gone"[0] response for the URL and running the URL through the URL Inspection Tool [1] can help make things a bit faster. But yeah, it does take a while to get these 404s removed.

[0] https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/410
[1] https://support.google.com/webmasters/answer/6065812?hl=en
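
If the site happens to be behind nginx, a sketch of serving the 410 (the path is made up):

    # answer 410 Gone instead of 404 for URLs that are permanently removed
    location = /old-page {
        return 410;
    }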


That distinction exists in many systems. E.g. for cloud events, 404 is considered with skepticism because it could be a race condition in provisioning or a transient issue, whereas 410 requires data streams to be cut off.


Then you should serve 503


5xx means that the server made a mistake. 4xx means that the caller made a mistake. Sending a request to a GONE url is canonically classified as a “user” or sender error.


> but the SEO people seem to trust these shady sites more.

It makes more sense when you realize that the SEO people (with a few exceptions) are usually pretty shady as well. You rarely hear them recommending that you write better content to get better results, it's always nonsense like "put nofollow on everything so your score doesn't leak".


You need to find better SEO people then :)

But I understand, there's a lot of snake-oil and "one weird trick to rank first" that brings a bad name to the SEO world.

I've seen people go on Fiverr and expect to find top-notch SEOs there.

There's more to SEO than just writing good content. There's a lot of technical stuff that can bite you and your awesome content will never rank.

Stuff like improving site structure, canonicals, learning to deal with multi-language versions of your content, implementing proper redirects, etc., is something that a good SEO should be able to fix and improve.
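
For example, the sort of markup a technical SEO pass typically checks or adds (the example.com URLs are placeholders):

    <!-- collapse duplicate/parameterised URLs onto one canonical version -->
    <link rel="canonical" href="https://example.com/widgets/">

    <!-- mark up the language variants of the same content -->
    <link rel="alternate" hreflang="en" href="https://example.com/en/widgets/">
    <link rel="alternate" hreflang="de" href="https://example.com/de/widgets/">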


This really has not been my experience with SEO people in the 2010s. They have focused on page load, no errors, good redirect schemes, etc.


I mean if you're hiring a SEO person isn't this literally what you're paying for -- tricks to increase your search ranking without changing your content?


Not at all. SEOs are more likely to find all the ways you're currently hanging yourself. Some common examples I see are:

- Putting important text inside of images

- Duplicate content out the wazoo

- Not making use of canonicals

- No sitemaps, html or xml

- Page performance issues

- Broken mobile support

And of course, poor content. You can't rank if you don't have content.


I'd argue most of those are necessary for good content (if we don't view content separately from presentation)

> Putting important text inside of images

I'm sure the reason for this is that it's hard to parse text from images, and while Google could use their AI to figure it out, they don't bother. But it also prevents blind people from being able to read the text, so it does worsen the experience.

> Duplicate content

This makes the site harder to navigate for users as well.

> Page performance issues

Quite obviously makes the experience worse.

> Broken mobile support

-..-


SEO should be a bridge between technical and non-technical people that build out sites.

No site's output is 100% because of the tech team - content writers can put in weird code, marketers can add all sorts of stuff to, say, Tag Manager, the robots.txt is likely from 2008. And a site built with code as the primary goal is likely lacking in some marketing oomph somewhere.

Someone whose job it is to find the right balance, and aim to maximise the returns from the single largest source of traffic, is pretty valuable.


Except... they were correct.

Google has now clarified that they're removing the code behind the undocumented items, with noindex called out explicitly.

https://webmasters.googleblog.com/2019/07/a-note-on-unsuppor...

It wasn't officially supported / the recommended way - but it worked (in many cases).


> evidence to me that this Noindex thing is bullshit

For those who (like me) don't know a lot about this, which side of the argument is bullshit? Have you just been proved right or wrong?


It looks like it's too late for me to edit my comment, but I've been proved right. Putting a Noindex directive directly in robots.txt is frequently suggested, but this seems like definitive proof that that does nothing (at least with Google).

As far as I can tell the inception of this idea was that it was briefly mentioned by some Google employee in an interview. Maybe it was supported in the past or maybe he just misspoke, but I bet even now we'll see people still using this tag.
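
For anyone wondering, the directive in question was a robots.txt line in the style of Disallow (the path here is hypothetical), as opposed to the documented on-page noindex discussed upthread:

    # robots.txt -- the unofficial directive, never in Google's docs,
    # now explicitly unsupported:
    Noindex: /private/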


I recommend finishing the whole comment


I read the whole comment. Still confused as well.


While that's great, there should be instances where crawlers should ignore noindex directives. For example, all .gov sites.


I'm not sure I understand your reasoning, why should Google honor noindex everywhere but on .gov websites? What about other countries' government TLDs? What about publicly traded companies? What about personal websites of elected officials? What about accounts of elected officials on 3rd party websites?

That seems like a can of worms not really worth opening.


This might be controversial but everything is fair game everywhere. If you can crawl it, tough luck. It's there and everyone can get to it anyways, why not a crawler?


Because the rules a well-functioning society runs by are more nuanced than "Is it technically possible to do this?"

If you'd like a specific example of why people might seek this courtesy, someone might have a page or group of pages on their site that works fine when used by the humans who would normally use it, but which would keel over if bots started crawling it, because bot usage patterns don't look like normal human patterns.


A society is composed of humans. But there are (very stupid) AIs loose on the Internet that aren't going to respect human etiquette.

By analogy: humans drive cars and cars can respond to human problems at human time-scales, and so humans (e.g. pedestrians) expect cars to react to them the way humans would. But there are other things on, and crossing, the road, besides cars. Everyone knows that a train won't stop for you. It's your job to get out of the way of the train, because the train is a dumb machine with a lot of momentum behind it, no matter whether its operator pulls the emergency brake or not.

There are dumb machines on the Internet with a lot of momentum behind them, but, unlike trains, they don't follow known paths. They just go wherever. There's no way to predict where they'll go; no rule to follow to avoid them. So, essentially, you have to build websites so that they can survive being hit by a train at any time. And, for some websites, you have to build them to survive being hit by trains once per day or more.

Sure, on a political level, it's the fault of whoever built these machines to be so stupid, and you can and should go after them. But on a technical, operational level—they're there. You can't pre-emptively catch every one of them. The Internet is not a civilized place where "a bolt from the blue" is a freak accident no one could have predicted, and everyone will forgive your web service if it has to go to the hospital from one; instead, the Internet is a (cyber-)war-zone where stray bullets are just flying constantly through the air in every direction. Customers of a web service are about the same as shareholders in a private security contractor—they'd just think you irresponsible if you deployed to this war-zone without properly equipping yourself with layers and layers of armor.


Honestly that is the site owner's problem. If it can be found by a person, it's fair. I genuinely respect the concept of courtesy but I don't expect it. People can seek courtesy, but they should temper their expectations of whether or not it will happen.


So in your view is DoS attack not actually an attack and site owners should just have to handle the traffic?


Techies forget the rule of law. A DoS has intent. A bot crawling a poorly designed website and accidentally causing the site owner problems does not have malicious intent. They can choose to block the offender, just like a restaurant can refuse service. But intent still matters.


This thread is about what behavior we should design crawlers to have. One person said crawlers should disregard noindex directives on government sites, and you replied that they should ignore all robots.txt directives and just crawl whatever they can. If you intentionally ignore robots.txt, that has intent, by definition.


Not intentionally ignore it by going out of their way to override it, just not be required to implement a feature in their crawler. Apparently parsing those files sounds tricky, with edge cases. Ignoring that file is absolutely on the table. People can of course adhere to it, but it's not required and in my opinion shouldn't even be paid attention to.

In my younger years the only time I ever dealt with robots.txt was to find stuff I wasn't supposed to crawl.


If you don’t want something public, don’t allow a crawler to find it or access it. The people you want to hide stuff from are just going to use search engines that ignore robots.txt


If you don't want someone or a bot to find it, don't put it online.



