
The basics of search engine ranking are so well understood that if your engine does not pull up results comparable to or better than your competition's, you have a serious algorithmic problem. The results, after all, are a given. The method you use to arrive at them creates some variation but does not magically wipe out certain results completely and replace them with spam pages (unless of course your algorithm is seriously broken).

I dare to say the above because I built a proof-of-concept engine about 3 years ago. It took several months and was eventually abandoned because I foresaw that the funding needed to do all the crawling (bandwidth costs) and to scale up the design to hold billions of pages instead of 100's of millions of pages (pretty good for an amateur effort) was beyond my ability to raise.

The quality of the results had little to do with it. The differences when comparing with Google were mostly in how the results were ranked, with Google outperforming my little toy considerably. If a page was present and relevant I invariably found that it was somewhere near the top, but usually not in the 'right' order.

To miss out on relevant content that had been 'seen' would have taken a serious effort.

Keep in mind that ranking is only a factor AFTER you have the relevant results for a query; it is a sorting issue. That does not mean that Google or any other engine actually does the sorting when it displays your search results page, but effectively that is what is happening; it is just that, for efficiency reasons, the process behind the scenes is set up completely differently. Stuff does not magically disappear because of ranking (unless it gets pushed beyond page 100 in Google's case, but that means there are more than 1000 links 'relevant' for a given query). You'd have to really mess up for one engine to rank a page near the top for a query while the other ranks the same page 1,000 slots lower for that same query.
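
To make the "retrieve first, sort second" point concrete, here is a minimal sketch in Python. The corpus, the document ids and both scoring functions are made up for illustration and are not taken from any real engine. It only shows that two different rankers over the same index return the same set of results in a different order.

```python
from collections import defaultdict

# Hypothetical toy corpus; ids and text are invented for this example.
DOCS = {
    "a": "linux kernel release notes",
    "b": "linux desktop review and linux tips",
    "c": "windows release notes",
    "d": "linux release party photos from the linux meetup",
}

def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def retrieve(index, query):
    """Step 1: the candidate set, i.e. every document containing all query terms."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

def rank(candidates, score):
    """Step 2: ranking is only a sort over the candidates (ties broken by id)."""
    return sorted(candidates, key=lambda d: (-score(d), d))

def term_count_score(query):
    """Ranker one: more occurrences of the query terms is better."""
    terms = query.split()
    return lambda d: sum(DOCS[d].split().count(t) for t in terms)

def brevity_score(query):
    """Ranker two: shorter documents are better (ignores the query on purpose)."""
    return lambda d: -len(DOCS[d].split())

INDEX = build_index(DOCS)
QUERY = "linux release"
candidates = retrieve(INDEX, QUERY)
order_one = rank(candidates, term_count_score(QUERY))
order_two = rank(candidates, brevity_score(QUERY))
```

Both orderings contain exactly the documents in `candidates`; the rankers disagree only about which one comes first.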

I hope this is all clear. These are difficult concepts, and the best way to learn about this stuff is to implement a toy search engine/crawler combo yourself. 80legs.com makes it a lot easier nowadays than it was in the past.




Hi. I'm an employee of Microsoft, which is why I have to be so vague about them. I was an employee of Powerset.com, which implemented a natural language search engine of non-trivial size. We were recently purchased by Microsoft. I've worked closely on Powerset's search engine infrastructure and with their linguistic packages for just about 2 years now, and I've started to understand some of what Bing is doing in the last 6 months.

The basics of search engine ranking are well understood, but compare those basics to the results that Google and Bing pull up and you start to see some serious discrepancies from naive rankers. This is because query type strongly influences ranking decisions, and there is implicit knowledge that STRONGLY affects ranking. A trivial example is the geoip data of a querier, which can be used to aid ranking for queries like "presidential scandals". But trending news headlines might also be an input to your ranking algorithm. A very good ranker like the ones Google and Bing employ is a complex beast with special inputs, heuristics, and secret tricks which make them as good as they are.
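
As a rough illustration of how such implicit inputs could change an ordering, here is a small Python sketch. Everything in it is an assumption made for the example: the `Page` and `QueryContext` types, the weights, and the URLs are hypothetical, and none of it reflects how Google or Bing actually rank. It only shows the same two pages being ordered differently depending on where the querier is.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Page:
    url: str              # made-up URL
    base_score: float     # text relevance from a naive ranker
    country: str          # country the page is mostly about
    topics: frozenset     # coarse topic labels

@dataclass(frozen=True)
class QueryContext:
    country: str          # e.g. derived from the querier's geoip data
    trending: frozenset   # topics currently in the news headlines

# Hypothetical weights; a real ranker would tune or learn these.
GEO_WEIGHT = 0.5
TREND_WEIGHT = 0.3

def contextual_score(page, ctx):
    """Naive relevance plus boosts from implicit, query-independent inputs."""
    score = page.base_score
    if page.country == ctx.country:
        score += GEO_WEIGHT
    score += TREND_WEIGHT * len(page.topics & ctx.trending)
    return score

def rank(pages, ctx):
    """Highest contextual score first; ties broken by URL."""
    return sorted(pages, key=lambda p: (-contextual_score(p, ctx), p.url))

PAGES = [
    Page("example.org/us-scandals", 1.0, "US", frozenset({"politics"})),
    Page("example.org/fr-scandals", 1.1, "FR", frozenset({"politics"})),
]
from_us = [p.url for p in rank(PAGES, QueryContext("US", frozenset()))]
from_fr = [p.url for p in rank(PAGES, QueryContext("FR", frozenset()))]
```

With identical query text, the querier in the US sees the US page first and the querier in France sees the French page first, even though a naive ranker would give both the same list.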

I confess, I am predisposed to be very irritated at your post. You made a toy engine (and no, holding 100's of millions of pages in a keyword index really isn't a huge deal these days), a toy ranker, and now you're an expert on the state of the art? But before I can get too angry I have to admit that when I left mog.com to go to powerset.com, totally ignorant about all but the basics of search, I too had a similar opinion. I figured we'd just use some variant of LSA for relevance and be done with it. Boy, was I wrong. So I can't get too angry at you about it.


> So I can't get too angry at you about it.

Thank you for that :)

Ok, so I'll take your word for it then, you are obviously the expert in the field.

I did not mean to step on your toes. I hope that I'm as well informed about the subject as an 'outsider' can be; if there is anything that you can reveal about the real reasons for these discrepancies then consider me all ears.

I'm not above wanting to learn about this stuff (that's why I'm on HN in the first place). My perspective to date (based on my own effort and whatever I could read up on that is publicly accessible) was that the results are not the hard thing; the spam is where the real problems lie.

'Fudge factors' does not sound encouraging by the way; possibly you are proving the original poster's point here in some unintended way ;)

EDIT: it would be nice if you could state clearly that you are not aware of any direct effort on the part of Microsoft that influences the search results in a way that either promotes Microsoft and their products and/or changes the results when they are critical of Microsoft, including the blacklisting of critical pages. I think that would go a very long way to laying these rumours to rest. It's Microsoft's trust image that comes to the surface here, and it seems that that is not very high. Chinese walls between search and the rest of the company would have been the way to go here.


Sure.

I am not aware of any direct effort on the part of Microsoft to influence the search results in a way that deliberately attempts to obscure negative press about Microsoft's products. I would not be surprised if some security-related things were in fact concealed, but I can see how that might be considered unreasonable in some circles. In any event, I'm certainly not aware of any generic policy in this regard outside of age-related filtering.

Now, I am a low man on the totem pole working in a satellite office. So I wouldn't necessarily know. If I did know of such a scenario, it'd be a firing-level violation of my NDA to talk about it here, but it'd also be a violation of the ethics guidelines to lie about it publicly, so I probably wouldn't be posting here at all about this subject if I knew anything like that.

If I did discover Microsoft doing this, I would probably resign. But I don't believe they're doing it. Microsoft is very serious about this Bing project. I've met and talked with a lot of the people who manage the product, and they're serious, talented people who clearly understand (and have directly said to us) that a search engine is about results and people trusting those results. It would be incredibly risky to deliberately filter things in Bing and risk discovery at the formative stage of Bing-as-a-brand's reputation.

PS. "Fudge factors" are just some things like saying, "Wikipedia and c2.com are awesome, give them a nice boost." It's pretty clear that Google loves Wikipedia even more than what we know about its ranking algorithm's major features would suggest. Once upon a time all wikis scored very highly in Google because of the way PageRank and link text worked, but they've since reduced that effect, it seems. Wikipedia's ranking never really went down.
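
For readers who want to see what such a boost could look like, here is a minimal Python sketch. It is purely illustrative: the `SITE_BOOST` table, its values, and the input scores are invented, and this is not a description of Bing's, Powerset's or Google's ranker.

```python
from urllib.parse import urlsplit

# Hypothetical hand-picked multipliers for trusted sites.
SITE_BOOST = {
    "en.wikipedia.org": 1.5,
    "c2.com": 1.3,
}

def boosted(url, score):
    """Multiply the algorithmic score by the site's fudge factor, if it has one."""
    host = urlsplit(url).hostname or ""
    return score * SITE_BOOST.get(host, 1.0)

def rerank(results):
    """results is a list of (url, score) pairs; highest boosted score first."""
    return sorted(results, key=lambda r: -boosted(*r))

# Made-up scores as they might come out of the unboosted ranker.
RESULTS = [
    ("https://example.com/search-engines", 0.9),
    ("https://en.wikipedia.org/wiki/Web_search_engine", 0.7),
    ("https://c2.com/cgi/wiki?SearchEngine", 0.5),
]
reranked_urls = [url for url, _ in rerank(RESULTS)]
```

Here the Wikipedia page overtakes the page that scored higher on its own, while the c2.com page gets a boost that is not large enough to change its position.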

PPS. I really can't talk about specific features of Bing or Powerset's ranking algorithm. One reason for this is my NDA. The other reason is that they're not my primary domain of expertise, so I'd feel uncomfortable lecturing about them instead of their actual architects, who are often unsung heroes of a search team.


Thanks, that is really appreciated, nice to see you being such an upstanding guy!

It's funny how absolutely crucial search is and how we depend on it, but how little we actually know about what goes on inside. I think that is part of what drives these wild goose chases based on limited querying; if there were more transparency then this would not take hold. At the same time the spammers would waste no time or effort trying to exploit such knowledge, so out of necessity it needs to be under wraps.

Or maybe a 'many eyeballs' approach here would help too.



