Man, people are really eager to ascribe malice to Microsoft. What the author posits is probably not the case. Let's go over two things the external viewer should understand about Bing:
1. It represents a major technology refresh. As such, there are bound to be rough spots as the search engineers fine-tune their new index and ranking. Same goes for the sub-sections of the site (like the reference, video, and image searches): most of them have not rolled out all their features just yet and are still improving their results.
2. One area where Google retains an advantage is with very long query strings. Some sub-components of Bing (e.g., reference) can handle long query strings with aplomb, but in general keyword search engines suck with long queries (there's a toy sketch of why at the end of this comment). Google has been working on this steadily for years, and Bing still has a ways to go before matching Google's performance there.
But then again, few engines answer this query the way Google does, so I wonder if maybe Google has some tuning for this specific kind of query in their ranker.
Check out the Yahoo results: http://idisk.me.com/dfayram/Public/Pictures/Skitch/yahoo-res...
Check out the Altavista (yeah, they're still up!) results: http://idisk.me.com/dfayram/Public/Pictures/Skitch/altavista...
Check out the Ask.com results: http://idisk.me.com/dfayram/Public/Pictures/Skitch/ask-20090...
Only Ask.com actually pulls back that page. And Ask.com's early business model was devoted to English-ish question queries.
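Coming back to the long-query point in item 2: here's a toy sketch (documents and scoring invented purely for illustration) of why strict keyword matching falls apart as queries grow, and why engines need a softer fallback like coverage scoring:

    docs = [
        "why does my visual studio build fail with error lnk2019",
        "visual studio build errors explained",
        "troubleshooting linker errors in big projects",
    ]

    def and_match(query):
        # Conjunctive retrieval: every query term must appear in the doc.
        terms = query.split()
        return [d for d in docs if all(t in d.split() for t in terms)]

    def coverage_rank(query):
        # Softer fallback: rank docs by the fraction of query terms covered.
        terms = set(query.split())
        scored = [(len(terms & set(d.split())) / len(terms), d) for d in docs]
        return sorted(scored, reverse=True)

    q = "why does my visual studio build keep failing with a linker error"
    print(and_match(q))            # [] -- no single doc contains every term
    for score, d in coverage_rank(q):
        print(round(score, 2), d)  # partial matches still surface

The longer the query, the less likely any one document contains all its terms, so the conjunctive strategy returns nothing while the coverage ranking degrades gracefully.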
People need to understand that search engine ranking is a harder problem than they realize. Unless you're working on it right now, you probably have no clue about the huge amount of subtle thought, heuristics, and outright fudge factors that go into making a good, modern ranking function. Things that were reasonable 2 years ago are now insufficient. This is not a solved problem, and nobody shares their algorithms just to improve all of their competitors' performance.
The basics of search engine ranking are so well understood that if your engine does not pull up results comparable to or better than your competition's, you have a serious algorithmic problem. The results, after all, are a given. The method you use to arrive at them creates some variation, but it does not magically wipe out certain results completely and replace them with spam pages (unless, of course, your algorithm is seriously broken).
I dare to say the above because I built a proof-of-concept engine about 3 years ago. It took several months and was eventually abandoned because I foresaw that the amount of funding needed to do all the crawling (bandwidth costs) and to scale the design up from hundreds of millions of pages (pretty good for an amateur effort) to billions was beyond my ability to raise.
The quality of the results had little to do with it. The differences when comparing with Google were mostly in how the results were ranked, with Google outperforming my little toy considerably. But if a page was present and relevant, I invariably found it somewhere near the top, just not always in the 'right' order.
To miss out on relevant content that had been 'seen' would have taken a serious effort.
Keep in mind that ranking is only a factor AFTER you have the relevant results for a query; it is a sorting issue. That does not mean that Google or any other engine literally sorts the results when it displays your search results page, but effectively that is what is happening; for efficiency reasons the process behind the scenes is just set up completely differently. Stuff does not magically disappear because of ranking (unless it gets pushed beyond page 100, in Google's case, but that would mean there are more than 1,000 links 'relevant' for a given query). You'd have to really mess up to get two engines to rank the same page for the same query more than 1,000 slots apart.
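To make that concrete, here's a minimal sketch (index, documents, and scoring are all invented toys) of the retrieve-then-rank split; the point is only that ranking reorders a candidate set, it never removes members from it:

    from collections import defaultdict

    docs = {
        1: "microsoft stac lawsuit wikipedia",
        2: "disk partitioning wikipedia",
        3: "stac electronics history",
    }

    # Stage 1: retrieval -- an inverted index maps terms to candidate docs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def retrieve(query):
        # Union of all docs containing at least one query term.
        found = set()
        for term in query.split():
            found |= index.get(term, set())
        return found

    def rank(query, candidates):
        # Stage 2: ranking is just a sort over the candidate set.
        terms = set(query.split())
        return sorted(candidates,
                      key=lambda d: len(terms & set(docs[d].split())),
                      reverse=True)

    q = "microsoft stac wikipedia"
    print(rank(q, retrieve(q)))  # reorders candidates, never removes them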
I hope this is all clear; these are difficult concepts. The best way to learn about this stuff is to implement a toy search engine/crawler combo yourself. 80legs.com makes that a lot easier nowadays than it used to be.
Hi. I'm an employee of Microsoft, which is why I have to be so vague about them. I was an employee of Powerset.com, which implemented a natural language search engine of non-trivial size. We were recently purchased by Microsoft. I've worked closely on Powerset's search engine infrastructure and with their linguistic packages for just about 2 years now, and I've started to understand some of what Bing is doing in the last 6 months.
The basics of search engine ranking are well understood, but compare those basics to the results that Google and Bing pull up and you start to see some serious discrepancies from naive rankers. This is because query type strongly influences ranking decisions, and there is implicit knowledge that STRONGLY affects ranking. A trivial example is the geoip data of a querier, which can be used to aid ranking for queries like "presidential scandals". But trending news headlines might also be an input to your ranking algorithm. A very good ranker like what Google and Bing employ is a complex beast with special inputs, heuristics, and secret tricks which make them as good as they are.
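To give a flavor, and emphatically not describing anything Bing or Powerset actually does, side signals folded into a base relevance score might look vaguely like this toy sketch (signal names, weights, and data all invented):

    def adjusted_score(base_score, page, querier_country, trending_terms):
        score = base_score
        # Implicit knowledge: geoip of the querier biases regional results.
        if page.get("country") == querier_country:
            score *= 1.2
        # News heuristic: pages matching trending headline terms get a bump.
        if trending_terms & set(page.get("terms", [])):
            score *= 1.1
        return score

    page = {"country": "US", "terms": ["presidential", "scandals"]}
    print(adjusted_score(1.0, page, "US", {"scandals"}))  # ~1.32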
I confess, I am predisposed to be very irritated at your post. You made a toy engine (and no, holding 100's of millions of pages in a keyword index really isn't a huge deal these days), a toy ranker, and now you're an expert on the state of the art? But before I can get too angry I have to admit that when I left mog.com to go to powerset.com, totally ignorant about all but the basics of search, I too had a similar opinion. I figured we'd just use some variant of LSA for relevance and be done with it. Boy, was I wrong. So I can't get too angry at you about it.
Ok, so I'll take your word for it then, you are obviously the expert in the field.
I did not mean to step on your toes. I hope that I'm as well informed about the subject as an 'outsider' can be; if there is anything you can reveal about the real reasons for these discrepancies, then consider me all ears.
I'm not above wanting to learn about this stuff (that's why I'm on HN in the first place). My perspective to date (based on my own effort and whatever I could read up on that is publicly accessible) was that the results are not the hard thing; the spam is where the real problems lie.
"Fudge factors" does not sound encouraging, by the way; possibly you are proving the original poster's point here in some unintended way ;)
EDIT: it would be nice if you could state clearly that you are not aware of any direct effort on the part of Microsoft to influence the search results in a way that either promotes Microsoft and its products and/or changes the results when they are critical of Microsoft, including the blacklisting of critical pages. I think that would go a very long way towards laying these rumours to rest. It's Microsoft's image as a trustworthy party that surfaces here, and it seems that trust is not very high. Chinese walls between search and the rest of the company would have been the way to go here.
I am not aware of any direct effort on the part of Microsoft to influence the search results in a way that deliberately attempts to obscure negative press about Microsoft's products. I would not be surprised if some security-related things were in fact concealed, but I can see how that might be considered unreasonable in some circles. In any event, I'm certainly not aware of any generic policy in this regard outside of age-related filtering.
Now, I am a low man on the totem pole working in a satellite office, so I wouldn't necessarily know. If I did know of such a scenario, it'd be a firing-level violation of my NDA to talk about it here, but it'd also be a violation of the ethics guidelines to lie about it publicly, so I probably wouldn't be posting here at all about this subject if I knew anything like that.
If I did discover Microsoft doing this, I would probably resign. But I don't believe they're doing it. Microsoft is very serious about this Bing project. I've met and talked with a lot of the people who manage the product, and they're serious, talented people who clearly understand (and have directly said to us) that a search engine is about results and people trusting those results. It would be incredibly risky to deliberately filter things in Bing and risk discovery at the formative stage of Bing-as-a-brand's reputation.
PS. "Fudge factors" are just some things like saying, "Wikipedia and c2.com are awesome, give them a nice boost." It's pretty clear that Google loves Wikipedia even more than what we know about its ranking algorithm's major features would suggest. Once upon a time all wikis scored very highly in google because of the way page rank and link text worked, but they've since reduced that effect, it seems. Wikipedia's ranking never really went down.
PPS. I really can't talk about specific features of Bing or Powerset's ranking algorithm. One reason for this is my NDA. The other reason is that they're not my primary domain of expertise, so I'd feel uncomfortable lecturing about them instead of their actual architects, who are often the unsung heroes of a search team.
Thanks, that is really appreciated, nice to see you being such an upstanding guy!
It's funny how absolutely crucial search is, and how much we depend on it, yet how little we actually know about what goes on inside. I think that is part of what drives these wild goose chases based on limited querying; if there were more transparency, this would not take hold. At the same time, the spammers would waste no time or effort trying to exploit such knowledge, so out of necessity it needs to stay under wraps.
Or maybe a 'many eyeballs' approach here would help too.
The writer is a bit unfair on the "is microsoft evil" one, because those are news results. So assuming Bing gives equal weight to the keywords (which is fair enough; it has no clue MS is the focus of your question), the news results will surely depend a lot on when the story was published.
In which case the Google story (with 2 keywords in the title!) seems a fair one to come top :)
The thing that really surprises me about all this is the surprise. Bing was not set up with Chinese walls in place (the way it should have been done) and has been produced by a company that sees every communication with others as a marketing effort (this may be a good thing, I'm not sure).
They'll do everything they can to portray themselves in a good light.
For instance, when you search on Bing for microsoft vs stacker, the Wikipedia page that specifically addresses the lawsuit is not even in the top 10 search results, whereas it is arguably the best page on the subject on the web. Coincidence? Possible. Probable? I don't think so.
Bing has its uses, but to get objective information about its owners you'd have to look elsewhere.
The interesting thing with all this is that people are now so conditioned to find stuff using search engines that if a search engine does not list a page it might as well not exist.
"Bing" "microsoft vs stacker wikipedia". The first result is a wikipedia page on disk partitioning.
Google the same, and you get as the first two results Wikipedia pages on Stac Electronics and MS-DOS, and as the third result this HN page.
As you can see, Google did what you wanted. However, the litigation was between Stac Electronics and Microsoft, so now try "microsoft vs stac wikipedia".
Bing: returns the Wikipedia category "microsoft criticisms", and the third result is "microsoft litigation".
Google: returns the Wikipedia pages "stac electronics" and "microsoft litigation".
The conclusion I draw from this is that Bing and Google just have very different algorithms, relatively speaking. Wikipedia appears to rank much lower in Bing than in Google, at least in cases where there isn't a page with a very similar name. Here are two searches to compare:
http://www.bing.com/search?q=microsoft+vs+stac
http://www.google.com/#q=microsoft+vs+stac
While I understand why people bash MS, as well as Google (I agree with most of it), I think sometimes it's best to take an unbiased look. I cannot see any attempt by MS to remove useful results; the first one is the original text of the lawsuit!
I think he's moaning because he's in the SEO business. Basically, his entire working knowledge is shattered by a viable Google alternative. Now he's got to understand two ways of working (PageRank vs whatever Bing uses). I think that has caused some anti-Bing bias.
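For anyone who hasn't seen it, the core of PageRank is plain power iteration over the link graph; here's a minimal sketch with a made-up three-page web (real engines layer many more signals on top of this):

    links = {  # invented toy web: page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
    }
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    damping = 0.85

    for _ in range(50):  # power iteration until roughly converged
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                # Each page passes a damped share of its rank to its outlinks.
                new[q] += damping * rank[p] / len(outs)
        rank = new

    print(sorted(rank.items(), key=lambda kv: -kv[1]))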
Try searching for "visual studio crap" and you'll see how unbiased it is.