Tell HN: The front page of Hacker News has been deindexed from Google
66 points by Roedou on June 27, 2013 | 38 comments
You can confirm this by searching for 'hacker news' in Google; the #1 ranking URL is /newest, rather than the front page. This isn't term-specific: the site doesn't appear for other terms it usually ranks well for, such as "news.ycombinator.com" or "hn".

I've checked the usual technical culprits (canonical/robots meta tags in the HTML head, HTTP headers, robots.txt issues) but I don't see anything untoward.
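In case anyone wants to repeat those checks, here's roughly what they look like - a quick Python sketch, nothing official, and the regex patterns are just illustrative:

    # Sketch of the checks above (Python 3, stdlib only): fetch the front
    # page and robots.txt, print anything that could block indexing.
    import re
    import urllib.request

    BASE = "https://news.ycombinator.com"

    def fetch(url):
        req = urllib.request.Request(url, headers={"User-Agent": "hn-deindex-check"})
        with urllib.request.urlopen(req) as resp:
            return resp.headers, resp.read().decode("utf-8", "replace")

    headers, html = fetch(BASE + "/")
    # A noindex in either place would explain a deindexed page.
    print("X-Robots-Tag:", headers.get("X-Robots-Tag"))
    for tag in re.findall(r'<meta[^>]+name=["\'](?:robots|googlebot)["\'][^>]*>', html, re.I):
        print("meta:", tag)
    for link in re.findall(r'<link[^>]+rel=["\']canonical["\'][^>]*>', html, re.I):
        print("canonical:", link)

    # And the robots.txt rules themselves:
    _, robots = fetch(BASE + "/robots.txt")
    print(robots)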

I'll keep looking into it, but I'm posting this here in case the admins/mods have made any changes recently that could have had an effect. There's a possibility that the URL has been removed by Google for some particular reason, though I can't think of many pages that deserve it less than HN.

I'll update this thread if I see anything, but hopefully someone else will post an answer before I figure it out...




It's not that PG has a grudge against Google (or vice versa) or anything like that. I believe that search engine bots crawl Hacker News hard enough that PG blocks most crawling by bots. In the case of Google, he does allow us to crawl from some IP addresses, but it's true that Google isn't able to crawl/index every page on Hacker News.

Here's a link where I answered the same question about three weeks ago: https://news.ycombinator.com/item?id=5837004 , so this isn't a new issue. In fact, PG has been blocking various bots since 2011 or so; https://news.ycombinator.com/item?id=3277661 is one of the original discussions about this.

And to show this isn't a Google-specific issue, note that Bing's #1 result for the search [hacker news] is a completely different site, thehackernews.com: http://www.bing.com/search?q=hacker+news

In general, I think PG's priority is to have a useful, interesting site for hackers. That takes precedence and is the reason why I believe PG blocks most bots: so that crawling doesn't overload the site.


Thanks for that, Matt; I didn't see that recent post or your comment, so sorry for dragging you back here to repeat yourself.

Looks like I'm going to have to stop relying on searching 'hn' when using a different computer, and start typing in the full URL. First world problems are such a burden.


No worries at all. I don't think the HN thread from three weeks ago made it to the front page (I happened to see it while browsing on /newest). I figured someone would notice and ask about this, so I'm happy to have the chance to explain.


Hey Matt,

I'm sorry to reach out to you directly on a public forum like this, but my company's website encountered a major negative SEO attack last month and we were hit with a manual penalty by Google today. I thought you might be interested to hear about what happened, and of course I would like to resolve it, as I do my best to always keep my company's SEO efforts within Google's guidelines. Please reach out via email to me at mbrody@myclean.com if we can help each other fix this! Thanks again for everything you do to help make the web a better place, and in advance I understand if you're too busy to respond.

Best regards,

Mike B.


Don't apologize to just Matt; your pseudo-apology and better-sent-as-an-email question pissed me off. Why would you take up three extra lines with a BS platitude and a signature? Please keep a personal request for assistance to better-suited channels.


Just downvote and move along; why the hostility?



Doesn't Googlebot respect Crawl-Delay in robots.txt? PG has set it to 30 seconds - https://news.ycombinator.com/robots.txt - which IMHO should not cause any load issues given HN's overall traffic profile.
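For reference, the directive is easy to read back programmatically; a minimal Python sketch (crawl_delay() needs Python 3.6+):

    # Read HN's Crawl-delay with the stdlib robots.txt parser.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser("https://news.ycombinator.com/robots.txt")
    rp.read()
    print(rp.crawl_delay("*"))            # should print the 30 mentioned above
    print(rp.can_fetch("Googlebot", "https://news.ycombinator.com/"))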


Googlebot doesn't respect the crawl-delay setting in robots.txt. https://groups.google.com/forum/#!topic/Google_Webmaster_Hel...

As I understand it, the best way to lower the crawl rate is to log into Google Webmaster Tools and manually lower your crawl rate. The crawl rate delays expire every 90 days, so I set a calendar reminder to renew the crawl delay every 3 months. https://support.google.com/webmasters/answer/48620?hl=en


Mmm, seems kind of like a feature. In fact, maybe PG should block Google entirely in robots.txt. It seems like HN has been getting mentions in other media with increasing frequency. If you can't find the site just because Google doesn't list it, then I have to wonder what you are actually doing here. This wouldn't be the first way that HN sets a bar for new users, either; the "Create Account" form is already hidden under "submit".

HNSearch works great for HN-specific searches anyway.


This has happened before, and usually has a non-pitchforky explanation (e.g. PG pulled it temporarily because of a network/server issue). I'm sure it will be back soon, and we will have a rather reasonable answer as to why. There are way too many Google employees who frequent and enjoy HN for it to be banned for some arbitrary reason.


And of course, the network has specific functions for censorship as required by child protection laws. "Just a network error" really doesn't guarantee that the network wasn't doing something nefarious itself.


If you are using DuckDuckGo, you can use the !hn bang to send your query to hnsearch.com


This is trivial to do in any modern browser without DDG.


I found this old thread, where pg had blocked most of the Google bots, and it caused Google to think the site was down:

https://news.ycombinator.com/item?id=3277661

Could be a similar issue? I'll take a look.


pg also commented that he doesn't want traffic from Google anyway: https://news.ycombinator.com/item?id=5808990

In which case, he should add <meta name="googlebot" content="noindex"> to the HTML head of every page.
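For what it's worth, the same directive can also be sent as an X-Robots-Tag HTTP header, which covers non-HTML responses too. A purely hypothetical sketch of that as WSGI middleware - HN actually runs on Arc, not Python, so this is illustration only:

    # Hypothetical: add a Google-only noindex as a response header
    # instead of editing every page's markup.
    def noindex_middleware(app):
        def wrapped(environ, start_response):
            def sr(status, headers, exc_info=None):
                headers.append(("X-Robots-Tag", "googlebot: noindex"))
                return start_response(status, headers, exc_info)
            return app(environ, sr)
        return wrapped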

(I have to say, that's a smart way of avoiding any Eternal Septembering, but it'd be a shame. I often use Google to find old HN threads that I vaguely remember from months or years ago.)


You may want to consider using HNSearch.


(Google's search is often better for this purpose: it has features like synonym/typo fixing, and since it indexes entire pages it lets you match keywords across entire threads. HNSearch is myopic, matching only within individual posts.)


Yup...

Even sites with good search functions are often still way outclassed by Google search with "site:..." - and most sites don't have good search functions...


Matt Cutts browses this site. Maybe he knows the reason why?



The .org site hasn't been delisted, so it's obviously not based on content:

https://www.google.co.nz/search?q=site:news.ycombinator.org


This is most likely the same reason Digg's front page was deindexed. There's no "content" per se; it's just links. Someone will notice, add an exception, and all will be well.

Unlike Digg, though, HN has a substantial amount of content in its comment pages, which are heavily indexed.

Edit - All the comment pages are still indexed just fine. It's /only/ the front page, which, imo, doesn't really matter anyway.


That wasn't the reason for Digg's issue at all: Google had tried to manually deindex some pages from the site, but made a mistake and pulled the whole domain. They reincluded it shortly after.


The comments are the real content of this site.


Sounds like overaggressive spam detection.


This sounds like the case. Google is getting aggressive with its Panda updates, and as a previous commenter noted, the HN homepage is just links. Since that triggers Panda, it's a good bet that Google went a little overboard (not unprecedented).


> the HN homepage is just links. Since that triggers Panda

To be more specific, Panda is triggered by low quality/duplicate content. 'Penguin' is triggered by spammy/bad backlinks.

I'm not saying you're wrong (a page of links would look pretty low quality to google's algo), I just wanted to add on for clarity's sake.


Yes, I see where I was unclear. It's not the links themselves, but the lack of original, robust content.


The Panda update definitely pushed lazy/low-quality pages down the SERPs, but doesn't tend to deindex pages.

Also: while the page doesn't have any of its own unique content, it presumably still has high engagement and low bounce rate.


That's a good point. Hadn't thought the issue through that far.


Please don't post on HN to ask or tell us something (e.g. to ask us questions about Y Combinator, or to ask or complain about moderation). If you want to say something to us, please send it to info@ycombinator.com.

http://ycombinator.com/newsguidelines.html


I too had noticed this. It's unfortunate, because searching via Google with site:news.ycombinator.com in the query is much better than HN's own search when you have a good idea what you're looking for (spearfishing search vs BFS).


This isn't the first time this has happened, and I suspect it won't be the last.


The PageRank has fallen from a 6 to a 3 as well.


Google is evil. Screw them. I refuse to use Google or their services. Make the switch. They deindex a lot of sites they don't agree with. Not saying that is the case here but they've been known to do it.


Link backing up your claims?


Switch to what?



