It's not that PG has a grudge against Google (or vice versa) or anything like that. I believe search engine bots crawl Hacker News hard enough that PG blocks most crawling. In the case of Google, he does allow us to crawl from some IP addresses, but it's true that Google isn't able to crawl/index every page on Hacker News.
And to show this isn't a Google-specific issue, note that Bing's #1 result for the search [hacker news] is a completely different site, thehackernews.com: http://www.bing.com/search?q=hacker+news
In general, I think PG's priority is to have a useful, interesting site for hackers. That takes precedence and is the reason why I believe PG blocks most bots: so that crawling doesn't overload the site.
Thanks for that, Matt; I didn't see that recent post or your comment, so sorry for dragging you back here to repeat yourself.
Looks like I'm going to have to stop relying on searching 'hn' when using a different computer, and start typing in the full URL. First world problems are such a burden.
No worries at all. I don't think the HN thread from three weeks ago made it to the front page (I happened to see it while browsing on /newest). I figured someone would notice and ask about this, so I'm happy to have the chance to explain.
I'm sorry to reach out to you directly on a public forum like this, but my company's website encountered a major negative SEO attack last month and we were hit with a manual penalty by Google today. I thought you might be interested to hear about what happened, and of course I would like to resolve it, as I do my best to always keep my company's SEO efforts within Google's guidelines. Please reach out via email at mbrody@myclean.com if we can help each other fix this! Thanks again for everything you do to help make the web a better place, and I understand in advance if you're too busy to respond.
Don't apologize to just Matt; your pseudo-apology and better-sent-as-an-email question pissed me off. Why would you take up three extra lines with a BS platitude and a signature? Please keep personal requests for assistance to better-suited channels.
Doesn't Googlebot respect Crawl-Delay in robots.txt? PG has set it to 30 seconds - https://news.ycombinator.com/robots.txt - which IMHO should not cause any load issues given HN's overall traffic profile.
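(A minimal sketch for anyone who wants to check what HN actually advertises: Python's stdlib urllib.robotparser can fetch that robots.txt and report the Crawl-Delay it declares for a given user agent. Whether a particular crawler honors the directive is a separate question, as the answer below explains.)

    # Minimal sketch: read HN's robots.txt and report its Crawl-Delay.
    # Uses only the Python standard library.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://news.ycombinator.com/robots.txt")
    rp.read()

    for agent in ("Googlebot", "*"):
        # crawl_delay() returns the delay in seconds that applies to
        # this user agent, or None if robots.txt doesn't declare one.
        print(agent, "->", rp.crawl_delay(agent))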
As I understand it, the best way to slow crawling is to log into Google Webmaster Tools and manually lower your crawl rate. That setting expires every 90 days, so I set a calendar reminder to renew it every three months.
https://support.google.com/webmasters/answer/48620?hl=en
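(The 90-day expiry means the renewal reminder is just date arithmetic; a trivial sketch, assuming the 90-day window described above and that the crawl rate was reset today:)

    # Trivial sketch: when does a crawl-rate setting made today lapse?
    # Assumes the 90-day expiry described in the comment above.
    from datetime import date, timedelta

    set_on = date.today()
    renew_by = set_on + timedelta(days=90)
    print("Renew the crawl-rate setting by", renew_by.isoformat())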
Here's a link where I answered the same question about three weeks ago: https://news.ycombinator.com/item?id=5837004 , so this isn't a new issue. In fact, PG has been blocking various bots since 2011 or so; https://news.ycombinator.com/item?id=3277661 is one of the original discussions about this.