Poll: Should this site be visible to Yahoo, MSN crawlers?
40 points by pg on April 16, 2008 | 94 comments
A large fraction of our HTTP requests come from Yahoo crawlers-- almost 20% yesterday. Their crawler seems significantly stupider than Google's. Yesterday we got 12,423 requests from the Google crawler, of which 4,148 (33%) were for x (= mostly useless) URLs, and 43,087 requests from Yahoo crawlers, of which 30,652 (71%) were for x URLs.

Unlike most sites, I'm looking for ways to constrain our growth. News.YC is deliberately not intended to become a massively popular site. So the thought occurred to me: why not just ban Yahoo crawlers? And MSN too, while we're at it. I don't know anyone who uses either of them for search. I'd just as soon have the site be invisible to them. But what does the community think? Do any hackers use Yahoo or MSN search?

Be invisible to Yahoo, MSN search.
311 points
Remain visible to Yahoo, MSN search.
92 points



My vote is to constrain growth as much as possible, at least that which comes from stupid sources. Smart hackers will find this site just fine without Yahoo or MSN, probably even Google.

As "evil" as blocking sites and crawlers may sound, I think these types of measures will be necessary to preserve the quality of content here. Whatever actions further that objective have my vote.


Okay, that's one reason for not indexing the site. However, the poll question implies blocking it for traffic reasons. I'd vote for robots.txt'ing out the /x URLs, then re-running the vote with the question of "should we deliberately ban search engines so that we get less growth in visitors."


If we're sacrificing equal competitive access just to "constrain growth as much as possible", maybe we could block certain browsers, too?

IE users aren't smart enough to change their defaults; Safari users are mindless Apple fanboys; Opera users are weird and talk with funny accents. Smart hackers will take the trouble to launch FF to access News.YC.

Anything to preserve the quality of content here!


If you want to really constrain it, require a browser specific to YC... perhaps written in Arc. Also: secret handshake ;)

(I can think of a justification for the first part, though--the time it takes to launch a separate browser that one cannot use for any other sites is perhaps about the time it takes to cool down and forget any external emotional influences on your reading.)


No, FF users are overzealous advocates of FOSS in general and Firefox in particular. Links, Lynx, and w3m users are too proud of their terminals. Netscape users don't realize it's the 21st century, AOL users are idiots. Flock and Seamonkey users are weird. No comment on Konqueror and Epiphany.


Not evil at all. Letting someone crawl your site is a business decision. If you think the resource cost on your servers is worth the added traffic that the search engine can send, then go for it. For this site, I would guess the resource cost is more (because of storing continuations) and the benefit of traffic is less (since pg doesn't even want it). Therefore, it's a good decision on pg's part and Yahoo's loss.


It is evil. A lot of the evil things out there are business decisions (e.g. Microsoft's monopoly practices). Even if this decision has a positive net effect on the community, the world is likely to be worse off as a result (all non-Google searchers get suboptimal search, on this site and elsewhere).


It'd be interesting if source browsing and javascript hacking were the only ways to affect the site. Then only hackers would be able to contribute.


I voted to remain visible.

I assume that resources are not a problem for YC. This fact, coupled with YC not having a search feature, leads me to believe that it's better for the YC community that this site is indexed by more than just one search engine.

One big reason, as pointed out by Tichy, is that not all of us use Google as our primary search engine. By cutting off Yahoo, MSN, etc., but not Google, you would essentially be forcing us to use Google to search the content of this site. Do that and you could be going down a slippery slope.

And I agree with the "ban all or ban none" mentality.


What's the slippery slope?


It's a form of censorship, albeit a subtle one with possibly good reasons.

That's the reason I voted for "Remain visible to Yahoo, MSN search." Had option two been "Be invisible to all search engines," then I would've voted for that one. But then YC would need to offer its own search.


At least two dimensions:

- tying the site to specific products. pg may decide the Firefox browser is the best predictor of being a hacker and ban all other browsers. Would you like that?

- introducing bias into the community. The search engine may be a correct predictor of being a hacker, but what about the false negatives: do we want to exclude the hackers working on the MSN search engine from this community, for example? I certainly don't.


That would be a problem if the decision were technology-centered. But it isn't. Technology is only incidental to making HN membership more challenging, and making a certain technology the de facto standard is antithetical to challenge.

See why Fischer thinks chess is dead.


pg wants to make Google search technology the de facto standard for news.yc.


Support?


Is there a reason for any "/x?" URLs to be indexed?

If not, a simple robots.txt would fix the crawler-traffic problem without prejudicially banning any upstart search engines:

  User-Agent: *
  Disallow: /x?
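  # Disallow matches by URL prefix, so this one line covers every /x?fnid=... link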


This poll is mixing the two issues of allowing Yahoo/MSN search, and constraining News.YC growth.

If the main aim is to constrain growth, wouldn't it make more sense to block the popular Google and allow a less-used engine for utility search?

I got the impression a few weeks ago (from this comment -- http://news.ycombinator.com/item?id=146623) that you were OK with inlink-driven growth; has something changed?


I can understand not caring how many people read the site, so long as the hacker community is well served. I guess I'm just not quite clear on the downside of growth outside this community; more useless comments? Comment spam? I also wonder if blocking those Yahoo and MSN crawlers would make it harder for them to compete with Google, and I consider competition a good thing. By allowing them access, they may continue to improve their search product.

Perhaps most importantly, I worry that blocking these crawlers might be seen as a political statement against Yahoo and MSN, one which they might take into account when considering the acquisition of a ycombinator funded startup.

-- Fred


Why does it matter how many useless urls they visit? If you're trying to fix the "expired link" issue -- there are better ways to fix that.

I think you should ban google too if you ban the others. The only disadvantage I can see is that people from here who want to find something they've seen on news.yc can't use google anymore.


> [...] can't use google anymore.

I think pg should just make a search link on top of the site pointing to

http://www.searchyc.com/

or

http://nycs.bigheadlabs.com/


Blocking a crawler is a one-liner that could be done in a couple of seconds. Tackling the limitations of a continuation-based architecture is a bigger problem.
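
If it's the robots.txt route, a sketch, assuming the crawlers honor robots.txt and identify themselves by their usual tokens (Yahoo's is "Slurp", MSN's is "msnbot"):

  User-Agent: Slurp
  Disallow: /

  User-Agent: msnbot
  Disallow: /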


I use Yahoo search occasionally and have often been impressed at it finding exact material, with a precise search, that Google couldn't.

Separate from the local implications on News.YC, society benefits by having multiple competing search engines. Seceding from all search indexes but Google contributes to a monopoly/monoculture.

Can you just warn off the stupid crawlers from valueless URLs with some combination of robots.txt, META robots directives, and 'nofollow'? (I'm happy to help craft these signposts, with a description of the URLs to be spared.)
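
For instance, a 'nofollow' hint on the ephemeral links would look something like this (the fnid value here is just a placeholder):

  <a href="/x?fnid=EXAMPLE" rel="nofollow">More</a>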


Can't you robots.txt out the "x" URLs?


Let's block all search traffic and then take this even further -- let's have all incoming links get forwarded to a dummy Hacker News with nothing but lame stuff, so that anyone trying to access the site from a blog or something with an HTTP_REFERER will get redirected to the lame site. That'll definitely keep out the unwashed masses.

:)


Even better: replace HN with a static snapshot of it as it exists right now! Then we know it will always be this good! :-D


I was thinking it would be funny if it was a data scrape of Digg with an HN skin applied.

Although that would probably turn out badly.


why not block all search engines?


Because News.YC doesn't have its own search engine. So I use Google to search the site.


Why not solve the real problem and set up a search engine for News.YC? This has other advantages such as controlling what shows up in the results, different ways to sort them, etc.

It's easy to do with open-source software such as Solr (lucene.apache.org/solr) or Hounder (hounder.org, shameless plug because it is our product), which powers search on wordpress.com.

Edit: it looks like someone already did this: searchyc.com.


You must be new here.

The day PG & Co adopt some piece of software that is made in Java... well, I can't even finish the sentence.


Part of the "& Co" is Paul Buchheit, who wrote some of the news.yc JavaScript. Paul Buchheit's company FriendFeed is using Lucene for search.


I didn't know that.

Either way, I think that PG is the kind of guy who would rather work on his very special design (staying in Arc-land as long as possible) than adopt systems where he no longer has control.

Letting Paul work on the JavaScript makes sense in this case. PG has no control over client-side code anyway, so he prefers not to bother with it.


Google uses Java, and since that is evidently the only acceptable search engine, I'm not sure pg's posited elitism really goes that far.


I didn't say that PG would refuse to use any system that was developed in Java. What I meant was that he would never put himself in a situation where the application would depend on Java for development or for adding more features.

To integrate with Google, you talk HTTP, not Java. To adopt Lucene, you'd have to talk Java. That's quite a difference, isn't it?


You can push everything you want indexed into Solr by HTTP -- so even though it's Java/Lucene, you don't have to get your hands dirty.


They use some Java, but a whole lot more Python and C++...


Usually Google gets better results than a custom search engine for your site, because Google also knows the links from other web pages to your site.


Did you not find searchyc.com useful?


Googling for my name lists my Hacker News profile as one of the top results. I think that is really cool.


That is cool. My username returns my profile here as the number one hit. I had no idea it would score so highly.


Not as cool as mine. My username returns a wikipedia page about some asteroid. Cool stuff.


Googling my username brings up my Hacker News profile too.



> There's no other search.

http://www.searchyc.com

(edit: I'm not the one who voted you down)



They stopped indexing looong ago.


thanks, I didn't know that! Perhaps they should say so on the page...


Done :)


Hey, can you publish the source code of the search tool?


Wow, that was quick :-)


Does that actually work for people? I've tried using Google to search News.YC for an older topic on a couple of occasions, and the results have been very hit-or-miss. Is it just me?


(plug)

If you remember some key words from the title, you can do a quick find in http://www.kirubakaran.com/phr0zen (snapshots of the front page). It is not really search search, but I find it useful all the time.


Right on. Then implement a search feature.


I would set NOFOLLOW but not bother with NOINDEX, assuming those crawlers handle the same meta tags Google does. That way news.yc pages people link to from other sites would be indexed, but x URLs would be left alone.
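
Concretely, that would mean something like this in the <head> of each x page (a sketch, assuming the crawlers respect the standard robots meta tag):

  <meta name="robots" content="nofollow">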

I do not currently use Yahoo or MSN search, but would rather not have the option of using them cut off.


How many people actually come from Yahoo and MSN searches? I would guess that the publicity news.yc would receive from this action on the tech blogs would send more people here than simply allowing those spiders to continue indexing. Otherwise, I'm all for blocking them.


Allow all or ban all. Let's make fairness a rule on the internet; it gives every startup a chance to emerge alongside the giants. If Yahoo had not been fair to Google, no one knows what would have happened.


That statement is roughly equivalent to "imprison no one or imprison everyone".


Isn't that fair?


That is also hopelessly naïve.


You don't have to be insulting to make a point. I understand you may be using Google and like it, but you must think about people using Yahoo as well. It is selfish of you to base your judgment on your own preference. If I'm missing something, let me know.


I sincerely apologize. I deserved that. My grandfather passed away today.


Sorry for your loss.


Are you planning to ban Yahoo/MSN because the fetches for useless URLs put extra load on the news.YC servers?

I use yahoo search to lookup News.YC. If possible, please do not force us to switch to GoogSearch.


I use Yahoo at home, from the Firefox toolbar. I'm growing increasingly suspicious of Google (call me crazy) and for many things it's about as good.

That said, I would have no problem with banning Yahoo from HN. I've never used Yahoo to search HN, and Yahoo's "site:foo" support is one of their relative weaknesses, so I doubt I ever will. (It "redirects to Site Explorer", which looks basically like normal search but uglier and with less information.)

You could even redirect them to The Site Which Shall Not Be Mentioned.


For someone (like me) using Yahoo's web search API (which is far better than Google's, imho), it would be ideal if you left Yahoo's crawler to do its job here.


Why not just robots.txt out all search engines? Of course, if you did that, it'd be pretty nice to have a "Search" button up there in the orange bar.


How do you robots.txt for search engines? If anything, you could ban all robots, I suppose. I wouldn't like that -- it is akin to those social networks where you can't get your data out.


I found reddit in mid-2006 when I was searching for widgets (or whatever they were called) for my "My Yahoo" page. Without that I probably would never have found it. So I am very thankful that it was listed in the top 10 or 20 of their widgets (which is different from being in their search results, I know).


Either ban them all (Google included) or ban none. But if you must, just limit what they can and cannot crawl.


I tend to use Yahoo search instead of Google, but I guess I'll just use searchYC or something to search YC.


I think it is a horrible idea.

Blocking search engines is evil. This site helps the search engines (allowing them to rank other results better), and the search engines help people (by providing the missing search functionality and allowing users to find useful content, on this site or elsewhere).

Blocking traffic from MSN and Yahoo might bring fewer non-hackers to the site, but it would also introduce a bias in the community that is not good. I don't want Microsoft hackers (presumably using MSN search) to be excluded from this community.

I believe nowadays Google is only marginally better than Yahoo or MSN search. I don't need more hacker fashions.

The last thing is, the problem with the crawlers and traffic load is easily solvable using robots.txt. Don't mess with the system more than you need to.


I'm feeling kinda dumb here, but what are x urls?


They're short-lived URLs on this site like the one I'm using to add this reply:

http://news.ycombinator.com/x?fnid=D5W545W084


I use both Yahoo and Google.

Why can't you just use <a rel="nofollow"...> or a robots file to stop them from indexing x links?


Is there a way to block only the useless calls? The x requests that fail? Or to robots.txt out anything that isn't a main story page, so crawlers wouldn't follow every individual comment link?

It might be good to still be indexed on the primary page for each discussion... Then again, I only use Google, so it doesn't bother me any.


I don't use Yahoo or MSN search.

I forget how I found out about this site; I believe it was via a mention on another topic-oriented discussion site.

If the goal is to limit growth, I would recommend that News.YC NOT be easy to find, as well as continue to NOT be SEO friendly (ie: nofollows on links, etc.).


I agree with making the site invisible to Yahoo, MSN, Ask, etc.

At the same time, please let the newer/upcoming search engines crawl the site; otherwise Google would become IE circa 2000-2004 and there would be very little innovation in search.

My 2 cents.


This patchwork deny/allow is nearly as bad as banning everyone but Google. Every small site will make the decision of who's young enough to deserve a chance differently -- making only Google the safe choice for searching.

A principle of equal access by any well-behaved search engine is the only policy which supports startups and diversity. The market already has a winner-take-all quality; why make it even worse for competition?


I should clarify: I don't mean it to be a patchwork system. A simple "deny Yahoo, MSN" works fine with me, since Paul is making a business decision. What is not fine with me is allowing only Google and denying everyone else.

Like you, I support startups and want diversity and innovation, which is why I would like search startups to crawl this website.

Hope that clears it up.


I use Yahoo search. No problem if it would be blocked though.


MSN's crawler is a spammer: they "check" links with referral spam from the livebot IP addresses. They admitted to it, they said they stopped it, but they haven't.

Please ban them.


I use Yahoo search quite a bit (primarily, actually). Their context-guessing suggestions are really cool.

You didn't ask, but if it comes up I use Ask.com to search blogs.


I wonder if one could improve the efficiency of the crawlers by dynamically updating the robots.txt based on votes and time.
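
A minimal sketch of that idea in Python (the scoring fields, thresholds, and item tuples are invented for illustration; only the /x? and /item?id= URL shapes come from the site):

  # Regenerate robots.txt so crawlers skip stale or low-scoring items.
  # items is a list of (item_id, points, age_days) tuples -- hypothetical.
  def generate_robots(items):
      lines = ["User-Agent: *", "Disallow: /x?"]
      for item_id, points, age_days in items:
          if points < 5 or age_days > 30:
              lines.append("Disallow: /item?id=%d" % item_id)
      return "\n".join(lines) + "\n"

  # e.g. generate_robots([(1, 2, 40), (2, 50, 1)]) disallows only /item?id=1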


I've never used MSN search. (For some reason I don't think I ever will.)


I'm all in favor of constraining growth.

And I'd say that if there were a way to ban Google search but still have a decent way to search HN, then I'd say ban Google, too. But at the very least, I think Yahoo and MSN search could be banned.


Anyone who deserves this community will have enough in the way of intuition/connections to find you.


Maybe one way of constraining growth would be to limit access, say by invitation only or some other method.


Bottom line: no.


Google knows how to crawl. Slurp 3 is out just today, so maybe Yahoo will improve, but even if you give it to them on a plate (sitemap.xml) they still seem to screw it up. The traffic is not worth their bandwidth, imo.


Block 'em.


Hacker news should remain invisible to the crawlers.


Rather than looking for ways to constrain growth, why not use this as an opportunity to invent a better social news site?


I think constraining growth IS the way he's trying to invent a better news site.


It's one way. Another way is to use a different algorithm.




