Google already knows its search sucks (and is working to fix it) (venturebeat.com)
88 points by shawndumas on Jan 13, 2011 | 68 comments



I enjoyed that, but I'm not sure about a couple of the claims.

I switched from AltaVista to Google because it gave me better results almost all of the time - if that weren't the case, I, and however many millions of others, wouldn't have switched. The 'expensive data centre' theory may have sped up the demise of AV et al, but I don't think it's fair to say Google succeeded by having low costs rather than a superior product.

I'd also like to see data to back up the claim that "The vast majority of users are no longer clicking through pages of Google results". Again, not my experience, but I recognise that I am a datapoint of one. I do note that increasingly Google's own answers (especially maps and images) are providing me with the direction I need in response to a search query, but even then I usually click through.

Edit: Re-read that. At first I thought it meant clicking through to the pages that are returned as results; it may mean not clicking through the pages of Google results (1-infinity below). Still, I thought <5% of people ever clicked through to the second page (most people refined their search if the first results weren't what they wanted) so I'm not sure if anything has changed.

I think Google search is damaged: not yet a product failure, but not yet at the point where it's "no longer" a problem, either.


I was at Inktomi in the late 90s/early 2000s. We were trying to keep up with Google in terms of scale, and we could do everything they could do in terms of link analysis, relevance, etc. Objectively, our results were as good as theirs. They took over in terms of index size because they could add cheap Linux machines easily. We were tied to our expensive Sun hardware, it took us three years to switch to Linux (long story) and then it was too late.

tl;dr: relevance doesn't matter if you don't have the result the user wants.


Google's build-out has been nothing short of astonishing. I remember when Google indexed the web once a month - and the web was much smaller then. Now my sites get crawled several times a day - and new articles on my news site show up in Google News in less than an hour. The mind boggles.


I just did a search for an error message in a piece of software. The one result from Google was my own pastebinned traceback from five hours earlier.

Google indexes random pastebins faster than I can forget I posted them.

This still cracks me up, but now I laugh nervously: http://www.ftrain.com/robot_exclusion_protocol.html


The main reason their cache was useful back then is that many pages were long gone and the links were 404s. Inktomi marketing always tried to play up our freshness compared to Google. In 1999 they had some pages in their cache that were 3-5 months old. We were pushing new indexes once a week, no page was older than a month. Seems quaint now.


I find that the cache is useful now because articles get posted to social news sites, the influx of traffic brings down the site, and Google Cache is a handy mirror that usually seems to have the page already.


Ex-Microsoft here. It's been a while since I worked there, but my recollection is the same: blind, automated testing stripped of UI elements showed that Microsoft's search results were at parity with Google's in terms of relevance.

People sometimes forget that all of these large companies have teams of very smart engineers and researchers. Google may be a talent sink, but they're not a talent monopoly.


Swu said it best:

"Google lacks a feature that it should have added year ago:

A search user who is logged in should have the ability to block entire domains from all future results.

The benefits of this are many. The cost is very low.

Why is this option not already available? Google - we depend on you. Do it."


This is a feature that would be used by 0.00000001% of Google's userbase. It would have the effect of shutting up that tiny subset of Google's users while allowing Google's results to continue to degrade silently. Google may not agree that it's in their long-term interests to hide a problem that may eventually make them competitively vulnerable.

It's also possible that the fast path of "simple search" -> "standard SERP" is so optimized that serving that 0.00000001% of users might involve a headache disproportionate to the payoff.


>Why is this option not already available?

I think the answer to this question is the same reason why google's results have become spammy. Allowing users to exclude specific domains, or even having a "report as content farm" button, are in direct conflict with google's business model, to a certain degree.


I wonder how much is due to user-friendliness. An invisible global blacklist on my search results? How could that possibly go wrong...


Well, does it have to be an invisible global blacklist? Would it be possible to create personal blacklists and have one for each user? I mean, Google already customizes search results for logged in users, so this wouldn't be too far of a stretch.
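
A rough sketch of how a per-user blacklist could work as a post-filter over the normal results (all the names and data here are made up for illustration; this is not how Google actually implements personalization):

    from urllib.parse import urlparse

    # Hypothetical per-user blocklists, keyed by user id.
    user_blocklists = {
        "alice": {"contentfarm.example", "spammy-reviews.example"},
    }

    def filter_results(user_id, results):
        """Drop any result whose domain the signed-in user has blocked."""
        blocked = user_blocklists.get(user_id, set())
        return [r for r in results if urlparse(r["url"]).netloc not in blocked]

    results = [
        {"title": "Useful answer", "url": "http://stackoverflow.com/questions/123"},
        {"title": "Scraped copy", "url": "http://contentfarm.example/questions/123"},
    ]

    print(filter_results("alice", results))
    # Only the stackoverflow.com result remains.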


> Would it be possible to create personal blacklists and have one for each user?

I think that's pretty clearly what we were talking about. There are invisible global blacklists already, as should come as no surprise (even if you haven't run into one of the hits omitted thanks to the DMCA).


Maybe Google will charge money for the "premium" search service. Take money from advertisers to show their ads, _and_ take money from users to hide the same ads. I'll hate Google when that day comes.

At least I take comfort in the fact that programming-related searches now return more StackOverflow links than ExpertSexChange links.


I believe there are Firefox and Chrome extensions to blacklist sites from Google search results.


How much of Google returning better results than AltaVista, though, came down to Google being able to expand its index with new pages faster? What I remember is exactly what the article claims: that competing search engines couldn't "keep up" with Google's increasing scale.


My experience, or at least my recollection, was quite different. I had no complaints whatsoever with AltaVista not adding pages to the index quickly enough; rather, it is that the results increasingly appeared to be random pages that happened to contain the search term, and you might need to wade through several pages of results to find what you were looking for. With Google, the page you wanted seemed to almost uncannily be located toward the top of the list. At the time, it seemed like magic.


I remember Google mostly for having good results, yes, but more for being FAST. It was just so fast, while all the other search engines took seconds to show results.


I distinctly remember 404s on Yahoo, which was why I switched.


> But the secret to Google’s success was actually not PageRank, although it makes for a good foundation myth.

From an algorithmic standpoint, I'm reasonably certain that pagerank wasn't the primary factor that catapulted Google's results in front of competitors.

I think that Google's algorithmic secret—initially, at least—was to use inbound link text as part of the index for a page. You don't hear people talk about it much, but this feature is one of the more powerful (and difficult) optimizations you can make to a hyperlink database search index.
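
A toy sketch of what indexing inbound link text means (the data structures here are invented for illustration, nothing like Google's actual index): terms from a link's anchor text get credited to the target page, even if that page never uses those words itself.

    from collections import defaultdict

    # Toy inverted index: term -> set of page URLs credited with that term.
    index = defaultdict(set)

    def index_page(url, body_text, outgoing_links):
        """Index a page's own text, then credit its anchor text to the link targets."""
        for term in body_text.lower().split():
            index[term].add(url)
        # The key trick: anchor text describes the *target* page, so index it
        # under the target URL, not the page that contains the link.
        for target_url, anchor_text in outgoing_links:
            for term in anchor_text.lower().split():
                index[term].add(target_url)

    index_page(
        "http://example.com/blog",
        "my thoughts on search engines",
        [("http://example.com/widgets", "best widget catalog")],
    )

    print(index["catalog"])  # {'http://example.com/widgets'}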


So inbound links can be faked but Facebook likes can't? Please.

Also, Google won because of better search, to such an extent that all they needed was word-of-mouth to completely demolish the competition in the late nineties, and I don't recall any indications that Yahoo or AltaVista couldn't scale their technology. Back then, using Google was simply like using a different, infinitely more usable internet.

Finally, anecdotally, Google is a lot more difficult to game than any other search engine. It is pretty clear to me that inbound links carry a lot more weight with Bing and Yahoo, whereas Google includes several other metrics in how it weighs search results (including, fairly or not, significant emphasis on how long a site has been around).


Likes can be faked, but what if Facebook only counts likes in my circle -- or weights them heavily?

The real problem is that likes on pages aren't used enough for them to be useful.


Until now, you mean. Facebook is training its users to add everything they like, and that includes web pages that you can share, too.


I am not a Google fanboy; in fact, I've been using duckduckgo as my default search engine for a couple of weeks now. However, I've done it for ddg's privacy features, not because I think google search "sucks" more than it used to.

Maybe I'm blind. Maybe I don't pay enough attention. Maybe I just use my search engine differently than most (or at least than those who seem to complain a lot lately) but I fail to find a concrete example where google (or ddg for that matter) gives me results that are obviously "wrong" (whatever that means).

I think we've all learned to speak "google". When you query a search engine you don't speak english. You don't ask (or do you?) "I want articles concerning java on the android platform on the hacker news website". Instead, you say "site:news.ycombinator.com android java". And most of the time I get what I want. Maybe I'm more easily satisfied than most. At any rate, these days it's important to learn this language to effectively use any search engine.

In this particular article, the author states that searching for "iPod Connectivity" doesn't yield many results that "actually answer your query". My question is: as a human being, what kind of results do you expect when you search for "iphone connectivity"?

I'm not playing dumb. Is there one good way to interpret this query? Are you shopping? Are you looking for specs? What kind of connectivity are you looking for anyway?

I did the query just now. The third link targets apple.com, the fifth amazon.com. The rest is a bunch of websites doing reviews or selling iphone parts. I can't really judge whether those sites are legit or not; some do look fishy. But at any rate, why do you say they are "bad answers"?

I do have my gripes with google. Expertexchange used to be a huge pain in the ass, but it's mostly gone these days (probably more thanks to stackoverflow than to google, that's for sure). Google has also _never_ given me any results on google groups, even though it will gladly give me some ad-ridden usenet mirror. I think Google has effectively contributed to destroying usenet that way.

It seems people want google to answer queries such as "What phone should I buy?" and have google give the right answer in the first link. It seems some people took the church of google a little too literally. The day Google knows how to answer that, it will probably follow up by sending a mechanical Schwarzenegger clone back in time to kill Bill Gates' mother.

To sum it up: of course Google may want to fight the various ad farms out there even harder, but before you criticize the answer, ask yourself whether you've asked the right question.


> I think we've all learned to speak "google". When you query a search engine you don't speak english. You don't ask (or do you?) "I want articles concerning java on the android platform on the hacker news website". Instead, you say "site:news.ycombinator.com android java". And most of the time I get what I want. Maybe I'm more easily satisfied than most. At any rate, these days it's important to learn this language to effectively use any search engine.

You've learned to speak "Google". The majority of the world hasn't. My mom, for example, would search for "how to bake apple pie", rather than "apple pie baking" or any other such set of key terms.

In fact, your approach reminds me of the approach that I used with AltaVista (back when it was the premier search engine). The reason I used this approach is because I was taught to do this by one of my teachers. Back then, it was recognized that search engines don't recognize human language and that search queries had to be carefully crafted to return optimal results.

Of course, very few people think about that now. No one spends time crafting a search query - they just type their question into the Google box and click search. As Google search results decline in quality we might see a resurrection of the old way of thinking about and crafting search queries to return the optimal result set.


I agree with your notion of Google contributing to the destruction of usenet.


This article whitewashes the issue by first pointing out the problem, and then making some vague hand-wavy claim that Google is providing direct answers immediately that obviate further search.

This is nonsense.

I think the whole world knows that for many kinds of general searches (e.g., appliance reviews when shopping), Google has been completely overrun by spammy content.

I sure hope they're working on something.


Can you give specific examples where the search returns a lot of spammy content?


Common examples given are consumer goods, like household appliances. I tried 'hoover', 'washing machine', and 'hair dryer', and I got relevant results: deep links into relevant, safe stores, both web-only and the web arms of brick-and-mortar chains. However, I'm searching from abroad (France) and not in English, obviously.


I use thefind.com for product search instead of google. They bought like.com, which will hopefully help them out in this area. - Peter Yared (author of the article above)


I wish Google and the other search engines well.

A couple of minor additions to the article for context based on what little bit I've been learning:

- About 25%-30% of Google searches each day are searches that Google has never seen before

- While 90% of bad results are just, well, bad results, there is significant room for interpretation. In the example provided, is there a reason Google should rank first in a search for PageRank? If so, what is it (described technically, not emotionally)? It may be (but probably isn't) that these other PR firms are actually, honestly more cited than Google is. I'm sure this isn't the case, but whenever somebody says "and the result wasn't what I liked" I try to take a careful look at what they are saying. Sometimes it's that they had an academic reason for the search that wasn't validated. Sometimes it's that their opinion of what is popular differs from the rest of the internet's. Most of the time the system is gamed, yes, but there are times when the author is just expressing an emotional dislike of the results in good/bad terms. Search isn't something that has a "right" result. You are either generally kinda pleased with it or you aren't. (In fact, blind studies show other search engines consistently scoring higher than Google, but when the participants knew it was Google, then Google scored higher. There is a lot of room in this topic for personal opinion and stupid human tricks.)

Since it's all so much based on human reactions, what could happen is that Google could fix the problem and nobody would notice. Or they might not fix the problem and everybody thinks they did. It's all about perception.

I've tried DDG and Bing, and I'm still with Google. At least for now.


> In the example provided, is there a reason Google should return first in a search for PageRank? If so, what is it (described technically, not emotionally)? It may be (but probably isn't) that these other PR firms are actually honestly more cited than Google is.

What? No one said that the ranking algorithm is implemented incorrectly when it puts these sites above Google. The example is supposed to illustrate the (obvious) fact that it isn't always the best at ordering the most relevant results for a search. Describing "relevant" technically is not easy... that's the point.


> About 25%-30% of Google searches each day are searches that Google has never seen before

Can you cite that? That's very interesting.


I read it in "The Art of SEO" which I just finished.

I believe it was sourced as part of a speech from a Google VP sometime in 2007 or 2008.

What's happening is that the scammers are training the users to use longer and longer search queries. It's much harder to trick a system that is working off of 6 keywords than it is a system that is only using 2. Long-tail stuff continues to get more and more important.


> training the users to use longer and longer search queries

No, it isn't; it's a piece of cake ... because there just is no competition. So the only thing you have to do is create content that Google sees as valuable (and a webpage that fulfills basic usage-metrics requirements) ... but the return is also lower, and that's where content farms come in.

The Art of SEO is one of the worst SEO books ever. SEO is a business, and that book is the worst book ever from O'Reilly: four authors republishing their outdated blog posts, stapled together in a near-random way by an overwhelmed "editor".


Interesting-- see this thread http://news.ycombinator.com/item?id=2099774


I wrote a longer reply, but the more I write the more I realize how ignorant I am. I'm out of my depth. Beats me.

I think we can agree that if it is true that 25% of all searches are completely new to Google, it's not like they could have been gamed. Can't game a search that has never existed before. Right?


I've optimized for queries that have never happened before. A big fraction of that 25% comes from hyper-specific queries from a known pool of terms. One could optimize for, e.g.:

[size] [color] [quality] widgets

And have a page that ranked for:

Brobdingnagian Fuchsia Middling Widgets

Even if that term had never been searched. You wouldn't have to try too hard--your bespoke widget company could just list all of the types of widgets it could potentially manufacture.
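
A sketch of how cheap that kind of long-tail coverage is to generate (the attribute pools below are invented; the point is just the combinatorics):

    from itertools import product

    # Invented attribute pools for a hypothetical widget vendor.
    sizes = ["Brobdingnagian", "Lilliputian", "Standard"]
    colors = ["Fuchsia", "Teal", "Beige"]
    qualities = ["Middling", "Premium", "Budget"]

    # One landing page per combination: 3 * 3 * 3 = 27 pages, each of which
    # can rank for a query that nobody has ever typed before.
    for size, color, quality in product(sizes, colors, qualities):
        title = f"{size} {color} {quality} Widgets"
        path = "/widgets/{}-{}-{}".format(size, color, quality).lower()
        print(title, path)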


>"Google is in the unique position of being able to learn from billions and billions of queries what is relevant and what can be verticalized into immediate results."

That's the problem with search in a nutshell. I don't want what Google thinks is relevant, I want what I think is relevant. As the article points out, for searches which might be monetized, Google treats monetization as a highly relevant factor (unsurprisingly since the relevance of search terms to the advertising they sell is the basis of their business).

In other words, over time Google has worked hard to "curate" search results (even if the curation of links is primarily done in bulk rather than more selectively) so that results are tied to your geographic location (e.g. "football stadium" is likely to return vastly different links in the US compared to the rest of the world.)

Localization and monetization go hand in hand. For example, "weather" provides a generic local forecast with options for detailed forecasts from three commercial sites: Weather Channel, Weather Underground, and Accuweather. But, tellingly, it does not provide a direct link to the NOAA local forecast, which contains the most up-to-date, complete, and reliable information, even though providing such a link is trivial.

Yes Google looks at billions of searches, but in order to monetize those searches to the greatest degree possible. Their analysis is to see what the traffic will bear, not to make the results more relevant to the user.


"It’s a popular notion these days Google has lost its “mojo” due to failed products like Google Wave, Google Buzz"

Really? Taking a single one-sided TechCrunch article (which is in contrast to several other recent TC articles marveling at innovations produced by Google, such as real-time translation) as popular belief didn't make me want to continue reading.

But I did. Some interesting points about the vertical results being important, but the implication that the core results are a wasteland that Google has ceded seems pretty unsubstantiated.


Yeah it's convenient to not mention the massive successes like Chrome and Android...


Though these products/projects are certainly doing very well and look promising, I'll call them "massive successes" when they start making significant contributions to Google's profit, or even just its revenue.


It's not always about profit and revenue. By your definition, Apache and Linux could not be considered successes.

Personally, I think gaining 10-15% of the browser/smartphone market in a couple years is pretty good. And while the direct success of Chrome has been nice, it's also been a kick in the pants for the entire browser industry, which has responded with better, safer, and faster browsers for everyone.


Here's an easy place to start: detection of plagiarism. I think Joel Spolsky said something about SEO spammers copying content from stackoverflow, editing it to be search-engine optimal, and posting it without a link back. It seems feasible to detect this situation and actually exploit it!

It should be possible to 1) extract the optimizations and make them available to the original site and 2) start bringing the legal hammer down on the SEO spammers for violating the terms of service of sites like stackoverflow.
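
One well-known way to catch that kind of copying is shingle overlap; here is a rough sketch (not a claim about what Google actually runs): break each document into overlapping word n-grams, compare the sets with a Jaccard score, and flag the page that appeared later as the likely scrape.

    def shingles(text, n=5):
        """Set of overlapping n-word shingles for a document."""
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    original = ("To reverse a list in Python you can use the reversed builtin "
                "or slice notation with a negative step")
    scraped = ("To reverse a list in Python you can use the reversed builtin "
               "or slice notation with a negative step buy cheap widgets here")

    # High overlap despite the appended SEO filler flags a likely scrape;
    # crawl dates would tell you which copy is the original.
    print(round(jaccard(shingles(original), shingles(scraped)), 2))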


A lot of this is nonsense.

> If you search for any topic that is monetizable, such as “iPod Connectivity” or “Futon Filling”, you will see pages and pages of search results selling products and very few that actually answer your query.

I suspect that most people searching for those things want to buy them. I Googled "iPod Connectivity" and the results (Amazon, Consumer Reports, Apple) seemed like a good selection of links for someone who wants that.

> Case in point: The Google.com page that describes PageRank is #4 in the Google search results for the term PageRank, below two vendors that are selling search engine marketing.

Actually, the google.com page that's number 4 (for me) is very vague about PageRank (essentially useless for someone who wants to learn about it). The three links above it are a PageRank checker, Wikipedia's description of PageRank, and an article about how to optimize your PageRank. While the last article is somewhat scummy, there is a good chance that optimizing PageRank is what the searcher was looking for.

> The vast majority of users are no longer clicking through pages of Google results: They are instantly getting an answer to their question:

These kinds of "vertical search results" only appear for very simplistic queries; while it's handy to be able to instantly find out the "sf weather", anyone wanting even slightly more specific information needs to contend with the blue links.

I doubt they're "clicking through pages" (i.e. going to the second page and beyond), but that's hardly an indictment of the quality of the blue links-- rather the opposite.


> But the secret to Google’s success was actually not PageRank, although it makes for a good foundation myth.

I do love the story of algorithm wins, but is there any evidence at all showing how important PageRank was/is, either way? I think it is hard to be sure exactly why something is popular, even when it's happening, so I suspect there isn't much evidence either way. There are two parts to the issue:

    (1) did PageRank give "better" results, for what users wanted?
    (2) was this an active factor in their preference for google?
I've heard the argument that the speed of results is a very significant factor (as the article claims), and that was definitely a factor for me. Also, I recall research from google showing the dramatic effect on user satisfaction from even slight differences in latency (above a perceived-as-instantaneous threshold).

Another factor at the time was that google didn't have paid ranking, so search results were better in this sense - related to this is a psychological trust issue, which would make people feel more comfortable with google even if the improvement in search results was insignificant. You didn't have to suspect the results.

Possibly the crucial factor (at that time) was that everyone else had crowded homepages; whereas google was simple and sparse and just did search.

Both the last two have been eroded, as the competition copied google. One would think they could also catch up on latency - it's certainly easier for any search engine that processes fewer queries. Google retains the advantage of familiarity, which surprisingly is one of the strongest competitive advantages for consumer goods, where the technology doesn't change much (eg. gum and cola).

So... provided google keeps up with the competition technically, it will trounce them commercially.


I switched from AltaVista because Google.com was faster and less full of spam. I still think that is the case compared to other search engines for common searches.

That being said, more and more searches are using ever increasing numbers of search term words, and that can get spammy.

Google makes no secret of investing in search, and the presence here of Matt Cutts is testament to the fact that we(?) are listening, with very frequent algorithmic tweaks and responses.

I don't feel it's useful comparing Facebook's "like" system with a search engine. Whilst it is useful to know what friends think of things, this can also be gamed by smart marketers.

I don't think it sucks - but there are use cases and perhaps for certain users where the experience indeed sucks. I for one have been using the 'spam flag' extension for Chrome and feel that something can be done with the results. Perhaps this will help?

(my opinions are my own and not necessarily that of my employer).


> more and more searches are using ever increasing numbers of search term words, and that can get spammy

Why is that? I thought more words would give more focussed, less spammy results.

On another point it seems to me that Google would have to tread carefully here if the set of spammers has a large intersection with the set of folks who buy advertising from Google. I would (however naively) suspect this might be the case.


> Why is that? I thought more words would give more focussed, less spammy results.

I've noticed that if you use too many words, you more often end up with keyword-stuffed or aggregated/scraped pages, because they just happen to use all the words in your search query on the same page. Sometimes this is because there aren't actually any real results for the particular narrow niche you wanted, but sometimes it's just because none of the real results use all of the keywords you tried verbatim, and Google wasn't able to figure out that those non-verbatim matches were more relevant than the exact-match-but-crap pages (admittedly a hard problem).


Ah, I see what you mean. These link farms are getting clever, and using content "real" enough to pass the automated sniff test.


Part of it is also what counts as spam. Google doesn't count a lot of index-type pages as spam, so if you search for a conjunction of a few programming terms, [scala foo bar baz], you often get a page that indexes blogs on a topic, in this case Scala. The page will use all of your search terms, but often spread across different post titles, effectively erasing the conjunction operator in your query.
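
A tiny illustration of the effect with toy data: the index page as a whole "matches" the conjunction, even though no single post title does.

    query = ["scala", "foo", "bar", "baz"]

    # A blog index page, modelled as one title per post.
    post_titles = [
        "Getting started with Scala futures",
        "Foo patterns in functional programming",
        "Bar charts and baz styling",
    ]

    page_terms = set(" ".join(post_titles).lower().split())
    page_matches = all(term in page_terms for term in query)
    title_matches = any(
        all(term in title.lower().split() for term in query)
        for title in post_titles
    )

    print(page_matches)   # True  -- the whole page satisfies the conjunction
    print(title_matches)  # False -- no individual post actually covers the topic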


AltaVista would consistently show adult websites as the top link for most search results.

As far as Facebook goes, the 'Like' button is great for things of a viral nature and trending topics, but over the long term, I have my doubts it can be used as a search engine.

Prove me wrong, Zuck!


"The vast majority of users are no longer clicking through pages of Google results" -> That doesn't sound right.


I will almost always re-search rather than click to the next page of results.


Maybe it means most people will give up instead of trying to click through too many pages of results. It might be more effective to go to specific sites sometimes: for example, Wikipedia for general info, imdb for movies, stackoverflow for programming topics, amazon.com for product reviews.


Stack Overflow (+ the other Stack Exchange sites) gets 88.2% of its traffic referred from Google search. http://www.codinghorror.com/blog/

So even if it's a nice story ("direct traffic is becoming more and more important"), it's ultimately just bullshit.


But it is. If users don't get a result in the first page, they'll try a different search instead of continuing to the next page.


"But Google’s fixing it." What are they doing exactly? Don't believe it's mentioned in the article.


"Over the past couple of years, Google has progressively added vertical search results above its regular results. When you search for the weather, businesses, stock quotes, popular videos, music, addresses, airplane flight status, and more, the search results of what you are looking for are presented immediately. The vast majority of users are no longer clicking through pages of Google results: They are instantly getting an answer to their question:"

Is this Google fanboy speak? Sorry, but I don't care how quickly the gamed results appear or how local they are, they still don't answer my questions.

This article is pointless.


> Facebook, which can rank content based on the number of Likes from actual people rather than the number of inbound links from various websites, can now provide more relevant hits, and in realtime since it does not have to crawl the web.

Unfortunately, the Like button means too many things. It doesn't just mean you like the content on that one page, it means you like the creator of that web site enough to let them put updates in your news feed. And it means you don't mind telling all your friends you like it.


For any given search, the user can type a few words, click on a few results, and go back to the search results until he finds what he wants.

He can also change the words in the search in the process.

This last click when the user stops searching is, for this particular user, the most relevant result for any of the search strings he has used.

I don't know any search engine that uses this data.
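
A minimal sketch of mining that signal from a session log (the log format and the per-user weighting are invented; I'm not aware of an engine that exposes this):

    from collections import defaultdict

    # One user's search session: (query, clicked_url) events in order,
    # ending when the user stopped searching.
    session = [
        ("ipod connectivity", "http://spammy-reviews.example/ipod"),
        ("ipod connectivity problems", "http://spammy-reviews.example/ipod-2"),
        ("ipod wifi not working", "http://support.apple.com/ipod-wifi"),
    ]

    # Credit the final click -- the page the user settled on -- to every
    # query string used in the session, as a relevance vote for this user.
    votes = defaultdict(lambda: defaultdict(int))
    last_clicked_url = session[-1][1]
    for query, _clicked in session:
        votes[query][last_clicked_url] += 1

    print(dict(votes["ipod connectivity"]))
    # {'http://support.apple.com/ipod-wifi': 1}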


I'd say it's a stretch to say "sucks" but yes, spam needs to be filtered.


> The much acclaimed PageRank algorithm,

I stopped reading here. Really, do people still believe that PageRank has any significant weight in the complex ranking machinery of web search?


Just to clarify, because of the downmods.

Search engines rank pages according to a variety of signals (hundreds of them), some of which are global (PageRank, ...), some "local" (local link structure, page quality, ...), and some measuring the relevancy of the page to the query (BM25, ...).

All these signals are usually blended together with a function that is regressed using some machine-learning algorithm to fit human judgements and (possibly) click data. Look for "learning to rank" on Google Scholar for details; there are papers by Google, Yahoo and Microsoft.
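
A toy illustration of that blending step, assuming the per-result signal values are already computed (all the numbers and weights below are invented; in a real system the weights come out of the learned model rather than being hand-set):

    # Signal values for two results on one query (numbers invented).
    results = {
        "http://en.wikipedia.org/wiki/PageRank": {
            "pagerank": 0.9, "bm25": 0.7, "page_quality": 0.8,
        },
        "http://seo-vendor.example/pagerank-checker": {
            "pagerank": 0.4, "bm25": 0.9, "page_quality": 0.3,
        },
    }

    # In practice these weights are fit by a learning-to-rank model against
    # human judgements and click data; hard-coded here for illustration.
    weights = {"pagerank": 0.2, "bm25": 0.5, "page_quality": 0.3}

    def score(signals):
        return sum(weights[name] * value for name, value in signals.items())

    for url in sorted(results, key=lambda u: score(results[u]), reverse=True):
        print(round(score(results[url]), 2), url)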

Google itself admits that PageRank is not very relevant (http://sites.google.com/site/webmasterhelpforum/en/faq--craw...):

> [...] worry less about PageRank, which is just one of over 200 signals that can affect how your site is crawled, indexed and ranked.


I read the whole piece, but you did the right thing: the article is stupid, without any noteworthy information, and it seems to be based on fairy tales.


The only thing you want to give me is what's on your plate; how about unambiguous neutrality!


Too late, I've already switched to DuckDuckGo.


What a shamelessly self-promotional piece. It starts out pointing out how everyone else is slow for not noticing this as soon as he did, and concludes by reminding you that he was the very first to make fun of Google. Is no one else bothered by this? I find it hard to read without cringing because his goal is obviously to write history with him at the center, not to make any interesting point.

Furthermore, the central claim is a massive reference-less overstatement that I think is almost certainly false:

> The vast majority of users are no longer clicking through pages of Google results: They are instantly getting an answer to their question:



