Google’s decreasingly useful, spam-filled web search (marco.org)
327 points by ihodes on Jan 6, 2011 | 178 comments



I think people, possibly including me, get irked with Demand Media et al more because they're more successful than we think they deserve to be rather than because they actually decrease the value of the SERPs. For SERPS where DM ranks well, the results prior to DM existing generally pretty much sucked. Maybe that is a Google issue, maybe that is an Internet issue (memo to Internet: middle aged women exist, please write for them, kthxbye), but for whatever reason, if you routinely Googled for [how do i make a blueberry pie] every week for the last ten years I don't think you ever had an awesome search experience.

DM pages are adequate for much of what they rank for, in much the way that USA Today is an adequate newspaper, your local state school provides adequate degrees in history, etc etc. They're adequate in a scalable manner, though, and they understand Google much better than the average publisher, which means they get visibility in excess of what some people might expect.

P.S.

Demand Media: http://www.ehow.com/how_2933_make-blueberry-pie.html

Virtuous publishers on the Internet: http://www.pickyourown.org/blueberrypie.php

If I wanted to bake a blueberry pie, I'd go for that second page every day of the week, but it is highly non-obvious to me that it is a better result qua search engine result than the DM page. I love this example because I think Google fundamentally doesn't think [how do i make a blueberry pie] is looking for a blueberry pie recipe. Most searches will not actually convert to pies. For the 98% of searchers who merely want to satisfy their pie voyeurism need, the DM content may well be better.


(memo to Internet: middle aged women exist, please write for them, kthxbye)

Many years ago, when I was working at Joseph Beth Bookstore (where I suspect Borders stole a lot of their ops manual ideas from), management told me that the store's primary demographic was middle-aged women. Middle-aged women often have disposable income. Many of them are married homemakers and have tons of spare time. I suspect much of Oprah's success is built on that demographic.

If I were Oprah, I'd position her cable channel as just a part of her latest "new media" venture -- just one outlet for Oprah branded content. In her shoes, I'd make diverse investments in mobile and in TV-integrated platforms. I'd offer advertisers packages for reaching all those different outlets simultaneously. If Oprah does this, I'd recommend working with her. If she can't see this path clearly, then I'd recommend disrupting her.


My mother-in-law used to be an Oprah fan but she recently 'cut the cord' with the cable company. Personally I think people like her could really enjoy an experience like Digg or Reddit if all the stars aligned correctly.


Maybe something like reddit, but presented through things like Google TV and Apple TV. If someone can morph social media viewing/browsing into something that emulates channel surfing, then that idea will easily be worth hundreds of millions.

EDIT IDEA: If Oprah doesn't have someone doing the following for her, then some group with the right CV items should hop on it! Start finding or putting key moments from shows followed by middle-aged women up on YouTube. (Oprah will make up a lot of this content.) At the same time, launch a social media site tailored for middle-aged women and create APIs so that it can be integrated with TV convergence devices.

Not only will Oprah make more money on her own content this way, she will become the middlewoman for a big chunk of the other content aimed at this demographic.


I don't know what a 'SERP' is but I don't think anyone shows up at Google bright and early in the morning with the burning desire to deliver the 'USA Today' quality of search. In fact, we had it before Google came along and it left much to be desired.

I don't know much about blueberry pie searches either or if the quality of Google results is really in decline. It seems pretty reasonable to expect original content (StackOverflow) to show up before copies of the same content. Or the top search engine to aim for a quality standard above 'USA Today'. Otherwise we can just use USA Today instead of Google.


The StackOverflow scrapers ranking higher than SO is the thing that most irks me about the big G at the moment. That and all the sites like wareseeker that just act as pointless aggregators of FLOSS download sites or forum sites.

Last time I tried Bing though they weren't any better, and DuckDuckGo had really low coverage. Maybe it's time to try again.


I use DuckDuckGo as my main search engine, since I figured out the bang syntax. If I don't like the results, I just resend the same query with !g prefixed (!google works, too, and so does !bing).


I've been trying to clean those out, but if you see more big ones feel free to send them my way.


SERP = Search Engine Result Page


I think I see where you're going, but I disagree. To me, link #2 is superior even if I just want pie voyeurism. (mainly, it has instructional pictures) I would click on #2 every time.

However, if #2 is buried on the 3rd page of results like it sometimes is, then I would not....

There is, however, a definite gradation of Google spam. eHow is somewhere near the top... domain squatters somewhere near the bottom...


Picking Virtuous over Demand Media is your opinion.

Personally, I've been using eHow's results for cooking for about half a year now.

I don't need pictorial aids for cooking--I know how to cook. What I need is a recipe, and a series of concise steps to follow. It's easy to toss it up on the laptop or table, then glance at it every so often to make sure you haven't strayed too far from the recipe.

Demand Media is like a virtual cookbook, Virtuous is like calling your parents for help and having them hold your hand through every step.



> [how do i make a blueberry pie]

You know very well that this is not how search works. You don't "ask a question" to the search engine like you would ask your grandmother.

You type in words that you expect to be in the pages you're looking for, and the search engine lists pages that actually contain ALL of those words.

One of the main improvements of Google in the very early days was that it used the AND operator by default, whereas competing search engines used OR by default, resulting in an incredible amount of noise.

In essence, searching for "how do i make a blueberry pie" (with quotes) should return only spam, because only spammy and SEO optimized sites would contain the phrase as such. A real recipe would maybe contain the phrase "how TO make a blueberry pie" but not "how do i..."
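
To make that concrete, here's a toy sketch of AND-by-default versus OR-by-default term matching (the pages and their text are invented for illustration):

    # Toy illustration of AND-by-default vs. OR-by-default term matching.
    # The "pages" and their text are made up for this example.
    pages = {
        "pickyourown.org/blueberrypie": "how to make a blueberry pie from scratch",
        "ehow.com/blueberry-pie": "how do i make a blueberry pie easy recipe",
        "spam.example/pie": "blueberry pie pie pie buy cheap pills",
    }

    def search(query, require_all=True):
        terms = query.lower().split()
        scored = []
        for url, text in pages.items():
            words = set(text.split())
            hits = sum(t in words for t in terms)
            # AND semantics: every query term must appear; OR: any term is enough.
            if (hits == len(terms)) if require_all else (hits > 0):
                scored.append((hits, url))
        return [url for hits, url in sorted(scored, reverse=True)]

    print(search("how do i make a blueberry pie"))                     # AND: only the page containing every filler word
    print(search("how do i make a blueberry pie", require_all=False))  # OR: everything with "pie" matches, much noisier

Under AND semantics the natural-language phrasing only matches pages that happen to contain every filler word, which is exactly the point about spammy pages above.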

- - -

I think your point was that "middle aged women" don't know any of this.

It would be arguable (probably wrong, but still) that people who don't know this, who didn't make the effort to understand a little how all of this works, deserve the spam they get.

There is a good way to discriminate between good and bad content, and that is to know a little about what you're searching in order to search for words that will be present in good quality content and NOT in spammy pages.

For example, it's reasonable to expect a good recipe to give instructions in the metric system as well as imperial; if you add "celsius" to the search then the second (informative) recipe arrives first:

http://www.google.com/search?q=how+to+make+blueberry+pie+cel...


I think your point was that "middle aged women" don't know any of this.

No. That is overbroad, untrue, and would be very injurious to my professional reputation. I said that the Internet is skewed away from producing content responsive to their needs, which is about as controversial as saying that they are slightly underrepresented on HN relative to, I don't know, twenty-something males.

Non-technical users frequently use natural language search. The experience for natural language search is fairly poor. There are many classes of search which offer poor experience, but it is the one which leaps to my head first because I deal with non-technical users every day.

people who don't know this, who didn't make the effort to understand a little how all of this works, deserve the spam they get.

Words cannot express the depth of my distaste for this position. I will accept that "Google screwed up" or "I screwed up" if one of my users has a suboptimal Internet experience (which starts at Google because Google is the Internet and ideally ends at my site), but I cannot accept that she is responsible if she has a poor user experience. We've got the teams of PhDs, the highly paid SEO consultants, and the lifetime of building an accurate mental model of how the devil box works. She wants to teach kids to read, not learn magic incantations. It should -- "should" in the sense of "would be optimal for the business", "would be optimal for society", and "as a moral imperative for computing professionals" -- just work for her.


(I really don't understand your first/second sentence (what's injurious?) Also, I'm 40.)

> I will accept that "Google screwed up" (...) if one of my users has a suboptimal Internet experience

I respect that, and from a business point of view you're very right.

The problem is, how do you improve her experience without screwing up mine? Why can't I search for pages that actually contain all the words I'm looking for, as I typed them, and not words "that were present in the page linking to this page" or words Google thinks I want although I didn't type them in?

From a "moral" point of view (which you brought up), if she "wants to teach kids to read" maybe she could start by learning how to spell?


Rephrased: Square peg, round hole, commence hammering.

Most of the world does not think like a nerd. Not even remotely. Remember that when working.

Clue: remember all those things that many of us nerds thought the iPad really needed?

Corollary: why is WinARM likely in deep trouble? Because it's three years late to market, Microsoft has only recently shown itself capable of designing a UI for mortals, and the question is whether the Windows UI or the (ill-branded) Windows Phone UI will win the internecine politics. (And I'm betting on the mass of the Windows group.)

Implementation detail: your Google Internet searches are already tuned, if you're logged into Google.

And for completeness, when using the phrase "the problem is..." remember too that with problems arise opportunities, and through opportunities can arise profits.


I don't know why the above comment is being downvoted, but I'm guessing it's because it sounds "elitist" (which is apparently a very great crime).

To elaborate, then: I agree with the parent comment that it's Google's job to make everyone's experience optimal (and not the user's), and it's certainly in the best interests of Google (or any business) to cater to the needs of as many of its customers as possible (although in the case of Google, as has been pointed out many times before, users are in fact the product).

But I would argue that the real elitists are people who think "middle aged women" shouldn't be expected to actually learn how to use machines.

"Middle aged women" (why single them out?) use machines all the time, whether at work or at home. They're expected to know how to use a spreadsheet, a word processor, a food processor. And they do. But somehow this expectation is lifted for "the Internet". Why?

A search engine is not a person; it's certainly not a mind reader. A search engine is just a machine.


The objective is Clarke's third law:

"Any sufficiently advanced technology is indistinguishable from magic."

The winner in search will be the one who strives for that.

If _I_ were presented with "How do I cook a blueberry Pie" - I would immediately do the following search:

http://www.foodnetwork.com/search/delegate.do?fnSearchString...

Plus a few other pre-eminent and trusted food-networks (Allrecipes) - searching for "Blueberry Pie" on each one, I'd then scan the quality of the comments - looking for insight into other chefs (cross checking their history to see if they, in turn, can be trusted) who clearly have tried out the recipes, and have made relevant comments. I would then identify the recipe that looked most likely to work for me.

I would expect no less from a sufficiently advanced search engine in this, and all other domains.


The Food Network? Really? The home of "semi-homemade"...? ;-)

About Clarke's law, here's an observation by George Bernard Shaw: "Build a system that even a fool can use, and only a fool will want to use it."

Quotations aside, the process you're describing is certainly excellent; it's probably what Blekko is trying to pull off, in a scalable way. It'll be interesting to watch how it plays out.


The second iteration, of course, is to engage with every (valid, trusted, revenue-generating, etc.) customer who searched, determine the quality of the results, and then feed _that_ information back into the algorithms. You could then bias based on domain experts (a world-class chef's feedback on blueberry pies being more important than an anonymous user's).

It may be the case that AllRecipes, The Food Network, etc. are NOT the best place to search for a recipe, and that, indeed, http://pickyourown.org is knocking it out of the park this week.

There is a lot of room for search to improve - I think the company that beats Google (if it's not Google that does so first) will be the one that manages to start creating the Search<-->Consumer<-->Search feedback quality loop.

PageRank was just the beginning.


Google may already collect enough data to do this. They track clicks on the search results, so they can see whether you liked the results, whether you went back to a different result after visiting your first, and whether you modify your search terms for another search because the first one did not work out.
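
As a rough sketch of how those click logs could be turned into a per-result satisfaction signal (the log fields and the 30-second dwell threshold here are invented):

    from collections import defaultdict

    # Hypothetical click-log records: (query, clicked_url, seconds_on_page, reformulated_after)
    click_log = [
        ("2010 ira contribution limit", "irs.gov/retirement", 95, False),
        ("2010 ira contribution limit", "spamfarm.example", 4, True),
        ("2010 ira contribution limit", "spamfarm.example", 6, True),
    ]

    def satisfaction_scores(log, min_dwell=30):
        """Treat a long dwell with no follow-up search as 'satisfied';
        a quick bounce or an immediate reformulation as 'not satisfied'."""
        totals = defaultdict(lambda: [0, 0])  # (query, url) -> [satisfied_clicks, total_clicks]
        for query, url, dwell, reformulated in log:
            satisfied = dwell >= min_dwell and not reformulated
            totals[(query, url)][0] += int(satisfied)
            totals[(query, url)][1] += 1
        return {key: sat / total for key, (sat, total) in totals.items()}

    print(satisfaction_scores(click_log))  # irs.gov scores 1.0, the spam farm 0.0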


1) If you don't understand tax law, should you have to pay more tax? If tax law is simplified (so everybody pays a "fair" amount), why should people who have bothered to structure their affairs in a tax-efficient manner lose out?

2) Google has an "advanced" tab. If they really wanted to, they could have a "fuzzy match" section, and a "required literal" section. They could also do some funky stuff with lexemes, but it's just not worth it, even for advanced users.


Non-technical users frequently use natural language search. The experience for natural language search is fairly poor.

Spot-on.

When my wife has trouble finding something and asks for help, this is usually the issue. The best approach isn't to ask Google a question. It's to picture in your mind what your target page looks like, and enter that into the search query.

But this just underscores what all of us here already know: natural language processing is hard.


"You don't "ask a question" to the search engine like you would ask your grandmother."

It doesn't matter one jot 'how search works'. What matters is how the majority of their users think it works. And they think it works by asking a question. So publishers need to work with that.


You're right, and yet I disagree with you (which I guess means I'm wrong).

Your position is the path of least resistance: indulge users in their ignorance. It's optimized for the short term, and in other contexts leads to great catastrophes (fast food, for example).


I wish I could find a link to an article I was reading a while back, which pointed out that most users who actually use search engines beyond typing 'facebook login' every morning work out how to search within a couple of weeks at most. Given that, it seems like a bad idea to optimise for the first week at the expense of every day for the next few years.

Sadly, I have no idea where I found that and don't recall if they had any factual data to back it up :(.


I think your point was that "middle aged women" don't know any of this.

No. The web (and computers) should adapt to us, humans. Not the other way around. If people want to search for a question, then that's the correct way to search.

The reason 'spammy' sites show up high for that query is because they are better at knowing what people are searching for.


But that is not true of any other human activity. Every human activity is learned.

The correct way to eat is not to throw handfuls of food towards your face, hoping that some will end in your mouth. That is how toddlers eat, until they are taught otherwise.

The correct way to write is not to scribble on a table with the wrong end of the pen. Etc.

Yet you argue that the correct way to use a search engine should be "how you do it when you first try it and don't know anything about it". This is inconsistent with what you have been doing since you were born.


But that is not true of any other human activity. Every human activity is learned.

And most of those learned activities are eventually replaced by better ones. People learned that the right way to portion food was by tearing with their hands until someone invented the knife. People learned that the right way to draw was with a stick in the dirt until someone invented paper.

What you think is the right way to do anything is only that because you haven't discovered or been taught a better way to do it, and that includes searching the internet.

Yet you argue that the correct way to use a search engine should be "how you do it when you first try it and don't know anything about it".

Actually, I think the argument is that a good way to be successful in business is by offering your customers more perceived value than your competitors. In search engines, it's more valuable for your users to not have to learn how your algorithm works in order to craft a search query that will satisfy their needs. It's valuable to be able to use natural language to search.

Right or wrong, the business that refuses to provide that value because they disagree with it is going to fail miserably.


I totally, completely and unconditionally agree with your last point: it is absolutely in the best interests of Google to respond to every query with relevant results, however it's formulated.

If 99% of Google users ask Google questions like they would a person, then Google should be able to provide answers; that is true even if only 10% of users used it that way.

My point however is not about Google: it is about the user. The user would gain from learning how search engines work, and from formulating their queries accordingly.

It's often stated that users shouldn't be bothered to learn how to use your service, because "they have more important things to do" and "there are other services to choose from".

This is good advice for businesses (common sense, really); it is bad business to expect users to dedicate time and effort to use your system.

But I have a hard time accepting that this is an excuse for users to never learn anything. At the same time as it is Google's responsibility to serve users as best it can, it is each user's responsibility to try to improve their mastery of such tools.


> In search engines, it's more valuable for your users to not have to learn how your algorithm works in order to craft a search query that will satisfy their needs.

Well, yes. But in this case it's not an algorithm. You just ask it for documents containing your search terms. It's about as straightforward as it gets. It's not computer programming, it's not even doing long divisions or whatever.

You can't seriously assume that method of searching is going to completely stump anyone.

> It's valuable to be able to use natural language to search.

Maybe you're forgetting here that not the entire world speaks English. Unless your natural language search engine is able to understand nearly every language in the world, you're still forcing the users to formulate a query in English.

And I'm willing to argue that for, say, your average middle-aged German, formulating a query in English is going to pose a much bigger problem than doing a conjunctive term search, as the latter carries transparently to most languages in the world. (Not all of them--I should check which ones, btw--but even in those cases it's not nearly as difficult to find a fix as it would be to somehow port your brilliant semantic context analysis engine to speak yet another completely different language.)

Give it a couple of decades, maybe we'll have usable NLP then, but before that time, I am convinced that even for the layman, searching for documents to contain certain terms is a lot more straightforward, likely to yield desired results and in quite a few ways actually easier to use than what has to pass for NLP currently.


The examples you cite are mostly about physical actions (e.g. eating), and your advice to adapt yourself to the world is sound there because we can't change how the world and physics and matter work. However, a search engine is a totally non-physical thing. Software has no body and essentially no limits like a pen does. We can make software do anything we want (almost). We can try to make software that understands how people ask questions. That's what we should do.

A better example would be languages. They are entirely intellectual and change all the time. Don't like speaking a certain way? Then change it, and it might take off! No reason to limit us all to Latin, we've invented all the other languages.


> We can make software do anything we want (almost).

We can build new software to do anything we want (if we know how to program). But some individual somewhere cannot have Google behave in some specific way. In that sense Google is very much a physical object of the world.


> We can build new software to do anything we want ...

I'm being really pedantic here, but you actually can't. There are actually uncountably many problems which are impossible to solve with an algorithm. There's a (by necessity) incomplete list of examples at [1].

[1] http://en.wikipedia.org/wiki/List_of_undecidable_problems

[Edited for clarity.]


I'd agree with you, if we were talking about something that is even slightly complicated. But it's not. Searching with the old-school Google syntax is asking the question "what documents contain these words X Y Z?".

It's not that hard, in fact very easy, to wrap your head around that. It's easier than looking up a phone number in the phone book, or a business in the Yellow Pages. It's even easier than using the term index in the back of a book.

While asking a question might be easier than that still, it also has disadvantages. It's impossible to ask an exact question on a very specific subject. Try it in real life. You'll find you need context, or at least a series of back and forth questions to arrive at the answer you want to have.

If a search engine would implement the same method, any advantage from using questions over a conjunction of terms is negated. People would get tired of typing long question phrases, and unless the natural language processing of the search engine is absolutely perfect, they'd get pretty frustrated quickly because the search engine would interpret some of the questions wrongly.

For simple queries like "where can i find a recipe for blueberry pie?" or "what is the capital of Denmark?", we already have seen search engines can do this pretty well. For anything more complicated, they fail sometimes, if not most of the time.

It's not as hard for people to search "recipe blueberry pie" or "capital Denmark" as you think, either. People can get used to barking commands like this quite easily. Look at Star Trek ("earl grey, hot") or many other sci-fi series featuring a semi-intelligent computer. People almost expect it to communicate in terse phrases like that.

The big advantage, also for the layman, is that a conjunctive term query will nearly always yield the result they were expecting, because it doesn't leave much room for ambiguity. And when it does, it's because the query terms hit a range of documents that weren't intended but still match. Again I refer to sci-fi, as well as fantasy: it may be frustrating, but it's also somewhat endearing (putting the user in a position of superiority), in the sense of an intelligent robot taking a request too literally. Or a genie in a bottle. Or an ET alien. Or Commander Data. Or whatever.

On the other hand, if the computer interprets a full question in the wrong way, people get more annoyed, because the "too literal" explanation doesn't work here, after all, you typed a complete sentence, it should be obvious what you meant, right? So instead, people get the idea that the computer is simply not listening. And that makes it a lot harder to come up with a "better" question that would give them the answer they need.

Now, of course, as soon as search engines are able to do perfect language processing, as well as guessing a whole lot of additional clues from whatever bits of context they can get, typing questions into a search engine might be the better and easiest way. But natural language processing isn't really going there any time soon. Just look at the search engines that try this, how well they are doing now compared to how well they were doing 5 years ago, and there's not much progress made.

On the other hand, in the area of text-query processing, organising of data, cataloguing data, we have made tons and tons of progress. There's tagging, social network friends recommendation, and vastly superior methods of indexing large amounts of text documents for logical queries.


> You type in words that you expect to be in the pages you're looking for, and the search engine lists pages that actually contain ALL of those words.

It's not as simple as that though is it? Otherwise how would those GoogleBombs or whatever they're called work where searching for 'warmongering idiot' or something turn up George Bush's biography on whitehouse.gov. I think search result quality is a bit more involved than just testing whether a page contains a set of words.


> I think search result quality is a bit more involved than just testing whether a page contains a set of words

Google bombs work because Google indexes, along with the words of the current page, the words in the links leading to that page (the anchor text), and even, I believe, words in the linking page outside of the link itself.

And then fields are weighted to calculate relevance. Those long, long urls with the title of the page in the url started to appear and proliferate when people noticed that words in the url were given an important boost factor by Google.
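
A toy illustration of that field weighting (the weights and the example page are invented), showing how a page can rank for a phrase that never appears in its own body:

    # Toy field-weighted scoring: a page scores on its own text, its URL words,
    # and the anchor text of links pointing at it. Weights are invented.
    FIELD_WEIGHTS = {"body": 1.0, "url": 2.0, "anchor": 3.0}

    page = {
        "body": "biography of the president",
        "url": "whitehouse gov president biography",
        "anchor": "warmongering idiot",  # text of inbound links, not on the page itself
    }

    def score(query, fields):
        terms = query.lower().split()
        total = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            words = set(fields.get(field, "").lower().split())
            total += weight * sum(t in words for t in terms)
        return total

    print(score("warmongering idiot", page))  # 6.0: ranks even though the body never contains the phrase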

This broke the Internet a little, by the way. That's why we need link shorteners now, with URLs containing whole goddamn paragraphs.


I am reasonably search-savvy, and when I am starting to research something out of my comfort zone, I often start with a "how do I…" search. I sometimes find exactly what I need with it, but more often I get enough information to perform other, better searches.

Your assumption that people don't search that way is dead wrong, though: I've seen many non-technical people do just that sort of search. My mother being one.

I know: the plural of anecdote is not data and all, but I'm fairly certain most people don't think of search the way you suggest.


Good point.

On the other hand it took google more than a couple of weeks to get rid of kods.net, too. Where ehow is readable, kods was, as far as I could tell, just a lot of computer generated oracle words mixed from other sources.


We're working on it (as always.) There is a big improvement inspired by the stackoverflow post on its way shortly.

If people want to help out, the best thing to do is to post examples of specific queries. Those become the "fixed points" around which we can tune until we get it right. The more example queries the better, and I'll make sure they get to the right people.

A good way to get example queries is to look through your search history, which if turned on can be found here: http://www.google.com/searchhistory


Hey moultano, I was just thinking about this problem: how can I tell that a site is spammy? Overwhelmingly, they look really, really similar. For example, if you search for "diy solar homes" you'll see a wonderful example of some really spammy sites - they pop up book offers on load, and they have this kind of template with big garish fonts and a whole lot of information laid out carelessly on the page.

Then there's the "what you need, when you need it" category, and then there's the "put your google search in the title even though the page has no relevant results" category (mostly software download sites - i.e. try searching for "application to use nokia e72 with itunes" and you get this site filebuzz in the top two spots that has a whole lot of ads and a bunch of crappy non-related downloads).

So if you add a "uniqueness" index - i.e. find ways to "semantically tag" not just the textual content, but the layout and font choices etc. of particular sites - that will catch the blatant affiliate spam bullshit (diy solar homes, what you need when you need it, etc.), and then if you figure out a way to prevent those "filebuzz"-type sites from sticking my search term in the title tags (I actually have no idea how this is done) you'll eliminate like 95% of the spam.


The top result for [diy solar homes] looks pretty good. http://www.builditsolar.com/ Looks like it has a lot of resources, though I don't know where it gets them or whether it has any claim to them. Was this one of the bad results for you?


BuildIt Solar looks like a relatively genuine attempt at building an online resource for information about solar homes, as well as some kits they sell themselves.

Likewise http://www.treehugger.com/ looks like it has genuine content - although it's obviously a little bit thin on the ground.

These guys are clearly a legitimate business selling a product (well, legitimate website anyway):

http://www.supremeheating.com.au/pool-heating-top/solar-pool...

Now compare those three sites, to these four:

http://www.diy-solar-power-for-homes.com/

http://www.energy4living.hottipsonly.com/solar-power-for-hom...

http://www.diysolarpower4home.com/

http://www.solarwindpowerguide.com/diy-solar-heating/

this one, too, links back to earth4energy - strikingly similar to "energy4living" above:

http://greenerhomediy.com/create-solar-electricity-build-diy...

then we have this incredibly reputable and highly respected forum whirlpool:

http://forums.whirlpool.net.au/archive/1539413

As a human I find it relatively easy to pick out the massively spammy sites amongst the first-page results for the search "diy solar homes" - and I think that any time several sites that are similar to each other in some set of ways rank for the same search term, they should be "de-ranked".

So for example, thousands of people use the same wordpress themes, you can't just say "they're all spammers". But if the top 5 results for a particular search all share some measurable characteristics, you could safely say "hmm there's something spammy going on here".
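
As a rough sketch of that heuristic (the per-site "features" and the threshold are invented), flagging top results whose measurable characteristics overlap too much:

    # Crude sketch of "de-rank suspiciously similar top results": if several of
    # the top hits share most of their measurable features, flag them together.
    top_results = {
        "diy-solar-power-for-homes.com": {"popup_on_load", "affiliate:earth4energy", "garish_fonts"},
        "diysolarpower4home.com": {"popup_on_load", "affiliate:earth4energy", "garish_fonts"},
        "solarwindpowerguide.com": {"popup_on_load", "affiliate:earth4energy"},
        "builditsolar.com": {"project_pages", "no_ads"},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def suspiciously_similar(results, threshold=0.6):
        flagged = set()
        urls = list(results)
        for i, u in enumerate(urls):
            for v in urls[i + 1:]:
                if jaccard(results[u], results[v]) >= threshold:
                    flagged.update({u, v})
        return flagged

    print(suspiciously_similar(top_results))  # the three template clones, not builditsolar.com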


http://www.diysolarhomes.com/ (3rd result & 4th result) definitely looks spammy. It's like the site owner just bought a bunch of keyword-domain names in order to get high rankings on search. Almost all their links look like affiliate links.

If you click anywhere on the page, you'll get a popup asking you to buy some book of theirs.


Hi -- I run BuildItSolar. It's a non-commercial site for people who want to build renewable energy projects. It's a retirement hobby, not a business. Some of the projects are my own, but many are projects that people have built and sent in the details. It's a site that is of, for, and by DIYers :) Gary


I think they could find your search terms from the referring url, but I am not sure how they are able to get their pages with your terms into the search results and get the terms into their meta description.

They must just have compiled huge lists of relatively specific search terms and have pages against each? But I would think this would be easy to identify and downrank..

It is a puzzle :/


What happened to exact search queries:

For instance, if I search for "a-r" I receive results for "ar".

I hate this. It makes it impossible to filter irrelevant results.

Or try this query: "a-c" -"ac"

This will return 0 results.


And this is where the downfall of Google begins. I also hate that some exact queries are being broad matched without my consent.


I think your search terms are too short.

See "re-elect" -"reelect"


How about adding some sort of a feedback mechanism to search? For instance, when I search for something, maybe some way to mark a result as spam and optionally relevant?

The obvious problem is that spammers would attempt to game the feedback mechanism. But a combination of things like captcha to defend against robots, limits on how many times you can flag/upvote sites in a month (feedback credits), and exposing the feedback mechanism to only real, active-for-a-long-time users above a karma threshold (Google can definitely figure this out looking at the search history, gmail account etc), might be strong enough to beat the spammers.

You could start this as a Labs feature, and see if it works well.
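
Something along these lines, where flagging is gated on account age, karma, and monthly credits (all the names and thresholds here are invented for illustration):

    from dataclasses import dataclass

    # Invented thresholds; Google would obviously tune these from its own data.
    MIN_ACCOUNT_AGE_DAYS = 365
    MIN_KARMA = 100
    MONTHLY_FLAG_CREDITS = 20

    @dataclass
    class User:
        account_age_days: int
        karma: int
        flags_this_month: int = 0

    def can_flag(user: User) -> bool:
        """Only long-standing, reputable accounts get flag credits, and only a few per month."""
        return (user.account_age_days >= MIN_ACCOUNT_AGE_DAYS
                and user.karma >= MIN_KARMA
                and user.flags_this_month < MONTHLY_FLAG_CREDITS)

    def flag_spam(user: User, url: str, reports: dict) -> bool:
        if not can_flag(user):
            return False
        user.flags_this_month += 1
        reports[url] = reports.get(url, 0) + 1
        return True

    reports = {}
    alice = User(account_age_days=900, karma=450)
    print(flag_spam(alice, "spamfarm.example/ira-limits", reports), reports)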


Please don't dumb down Google. It happens all the time. It happened with stemming. It happened with "Instant". It happened, stealthily, quite a long time ago, when Google started to return results that contained most of the search terms, but not all, or when it returned results that contained words that appeared "in the pages linking to this page".

It happens when one tries to use allintext: and is identified as a robot (why??!?)

People should understand how to use a search engine instead of having machines (second-)guess what they're thinking.

People learn to drive; if people can't drive a car we don't give them a car that drives itself!

-- Oh, wait.


Great example - 35,000+ people a year die in the United States alone because of people's inability to drive safely.

Perhaps there is a better way.


Yes, the end of my comment was a (weak) attempt at humor, since there has been quite some talk lately about Google building cars that will drive themselves:

http://www.nytimes.com/2010/10/10/science/10google.html

Edit: actually, all I really wish for, is for allintext: to work all the time. Why would I be a robot if I'm logged in, on an account of a normally active Gmail box?

Also, why is it such a secret? Why not make it more visible? A checkbox on the home page...


Lots of bot queries do allintext: searches, while fewer humans do that query. Bots can hijack human accounts/cookies too. Email spam botnets often use the valid cookie of their owner's host computer.


Fewer humans do that query because it is kept almost secret (why do bots like allintext?)

I wouldn't mind answering a captcha from time to time, but instead Google just bans the use of allintext for several dozen minutes (or more). Really frustrating.


And number-range searches. I can't count how many times I've wound up with a number-range search query that wouldn't work for me from any IP.


So allintext is essentially a honeypot for bots? Otherwise it seems weird to offer the option at all, if all it does is get you banned for being a robot.


I just tried one of the example searches Marco lists and the result is very strange.

When I search for [2010 ira contribution limit], all the results are spammy. The real official answer (on irs.gov) doesn't even show up on the first page.

BUT, if I use Google Instant, it does show up as the first result. As soon as I hit Enter, it again disappears and only the spammy results remain.

The Instant guess suggests that it's because the IRS website ranks for the plural term with "limits" [2010 ira contribution limits], not the one with "limit".


When I search for [2010 ira contribution limit], all the results are spammy. The real official answer (on irs.gov) doesn't even show up on the first page.

The first result I get is irs.gov. Then again, we know what a bunch of shysters they are, so you still might be onto something. ;-)


Maybe because I'm in the UK.


Oh, in that case Google is definitely trying to tell you how to make a donation to a terrorist organization (IRA)


I get irs.gov at the top as well, and all the results are relevant. Are you logged into Google? If so, try logging out and retrying -- it would be interesting to see if the results are tailored to individuals.


Obviously you don't want this thread polluted with failed Google queries (and them working properly on Duck Duck Go). Maybe you should offer contact info, or generate some sort of form, or even better, have a 'this query didn't work right' button on search.


There is a "this query didn't work right" button on search - it's the "Give us feedback" link at the bottom of a result page, which links to:

http://www.google.com/quality_form?q=foo

The problem is that it gets polluted by lots of people who have no clue that Google is not the Internet, or (for that matter) their neighborhood handyman that they found through Google and who did an awesome job repairing their windows. A lot of people don't make a distinction between "I found what I was looking for" and "What I was looking for worked out for me", which makes a lot of the feedback a little less than useful.

This thread is as good a place as any - I'm guessing that the URL will get passed around, the appropriate teams will read it and adjust their algorithms, and if DuckDuckGo gets to improve their algorithms too, great - it's one more good search engine that people can use.


Why do people insist on blaming users when users use things in the "wrong" way? If you give a general feedback button to a user, they will give you back general feedback of all kinds - it's not that hard to understand, really. If you want to get feedback specific to the quality of the search results, then provide another button that says "Tell us if the results you got are not useful / just spam". Be clear about what exactly the button does, and you'll get meaningful reports from the users. Note that I said another button - you always need a general feedback button for people wanting to report something else, that's how you get general and specific feedback.

Also, Google is one of the worst companies, if not the worst company, at dealing with user feedback. You can't just expect people to give you feedback but never return the favor in any way. People feel that giving feedback to Google is like throwing things into a black hole - like talking to a machine. If you know no one will answer you and you will never know if anyone even read your feedback - not even a little thank-you note or a clue that the feedback was useful - there isn't much incentive to give feedback, now is there?


Google Maps (a different beast, of course) handles feedback exceptionally well. You're first thanked and promised a response, then a couple of days later you usually get a "you were right, we'll fix that and let you know when it's fixed", and then within a few months another "we fixed it, here's a link to what you reported (shown on Google Maps), please let us know if we still didn't get it right".

As a result, I enjoy reporting issues to Google Maps, because I know they will be addressed. Maps deals with a much more tangible and unchanging dataset than Search, though.


rmoulton at google if you prefer to email, but posting here is fine too. There is a "give feedback" link at the bottom of the search results which produces a lot of good data, but very little of it is "problems HN folks have with search."



Which result were you looking for in the DDG results?


What exactly were you looking for?


Please see my reply above to nostrademons about the usefulness (or lack thereof) of that button.


Where's the best place to post sample queries?


Right here. :)


This is why people say that Google doesn't get social. What if I don't want to submit a bad search right now, but in a month? How am I going to find this HN thread?

Create a better mechanism through which people can submit bad searches for human review.


I am sure you must be automatically tracking and analyzing queries where users go to page 2 and beyond. Or where they do not click on any result and instead change the query or abandon the page?


I would guess the general Google approach to this problem is to try to improve algorithms.

I wonder if a change towards "human input" might improve things more.

For example, what if the Chrome browser had a big feedback button so that if users wanted to help improve the Google search results they could rate the usefulness of the link they just followed?


If it makes you feel better, our first experiment along those lines was in 2001: http://www.cs.unc.edu/~cutts/toolbarbeta.html

Back then (and with "Remove result" and SearchWiki) it had issues because we were trying to get people to recognize spam, and people weren't that good at recognizing spam techniques like hidden text. The more recent complaints we've had are more like "here's content I don't like." So maybe it's time that we tried something similar again.


It seems to me like an explicit "Report Spam" button would help. The problem with SearchWiki & "Remove result" is that they have many other influencing factors as to why people would click those buttons.

People are pretty good at recognising "scams", but you need to tell them that's what they're looking for, not just how the content makes them feel. It seems like the report spam button was a large factor behind Gmail's spam filtering success. Would love to see the same approach applied here. I know I'd be hitting my report spam button in Chrome pretty often :)


Here's a Chrome extension we wrote to allow explicit "Report spam" feedback: https://chrome.google.com/extensions/detail/efinmbicabejjhja...


I still feel like it's not simple enough to be practical. I'd only really use a button that submitted the form behind the scenes. I think the extra step of filling out the fields is likely to drop my submission rate down to times I really get pissed off :)


Wouldn't spammers just "Report Spam" their competitors?


If that strategy worked, spammers could use it in Gmail to void the spam filter's utility. I'm sure the volume of legitimate requests would help drown out that noise.


Once Google has significant amounts of real human ratings on the usefulness of a site in general, or the usefulness of the site given a specific search, machine learning techniques could then be used to predict the usefulness of unrated sites.

I just know I see a lot of worthless, junky sites on the first page, and I wouldn't think it would be hard to recognize them using ML techniques, which require training data.
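
A minimal sketch of that idea (assuming something like scikit-learn is available; the page features and labels are invented):

    # Learn from human "useful / junk" ratings over simple page features,
    # then score unrated pages. Feature values and labels are invented.
    from sklearn.linear_model import LogisticRegression

    # features per page: [ad blocks on page, thousands of words of original text, outbound affiliate links]
    X_rated = [
        [8, 0.3, 12],  # rated junk
        [9, 0.2, 15],  # rated junk
        [1, 2.5, 0],   # rated useful
        [0, 1.8, 1],   # rated useful
    ]
    y_rated = [0, 0, 1, 1]  # 0 = junk, 1 = useful

    model = LogisticRegression().fit(X_rated, y_rated)

    X_unrated = [[7, 0.4, 10], [1, 3.0, 0]]
    print(model.predict(X_unrated))  # predicted usefulness of the unrated sites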


So don't let just any random Joe provide feedback. Crowdsource it to longtime holders of Google accounts who've got a track record.

Recognizing that not all users are created equal is (IMHO) an incredibly powerful insight that Google and many other companies overlook. Qualified, technically literate users will be happy to volunteer, but you have to ask.


There are many dangers to this as well. Fundamentally, as soon as people are aware of the power they hold it will get abused. See: digg bury teams, reddit circle jerks, SEO link farms, etc. And the smaller the number of people that have the power the more likely, more dangerous, and more harmful their abuses can get.

I think the cases where Google breaks are when some people figure out a trick/technique that lets them become a small power circle, which they then obviously abuse. When Google works, it's because they're able to algorithmically spread the power around and, at scale, see what is quality and what isn't. Therefore I think the better strategy isn't to concentrate power within an "elite" class; in fact it's the complete opposite: make sure the power is spread quite evenly among the masses.


I may come back and ask in a while, so I hope you're right. :)


I think the very first thing is to try to prevent malicious sites from being on the first page. For example, try searching for "lawsuit employment rejection". The very first hit is jaysgrafx.com/char-tritan-energy-power-com-employment-san-marcos-tx-job/. Do not click on it (it forwards you to 84bf4ada.logout3.cz.cc/).


Hey moultano, what about specific sites within an industry that abuse big time with doorway pages and link farms? How can we report that to you? In SERPs these domains are technically relevant to the search term, but only show up high in the results because of their deceptive practices.


Which stackoverflow post is this?



The Stack Overflow article seems to address the subject of content being scraped and republished, which is certainly a big issue. What about the equally concerning issue of the overwhelming amount of "fluff" content published with the sole purpose of passing anchor text weight back to a domain? Is anyone else noticing an exponential growth of these types of sites recently or is it just me?


In a lot of the comments around this lately, people have been saying that this is something Google can fix, or needs to fix.

I would suggest that the content farms' success in gaming specifically Google's algorithm was an inevitability (whatever the current state of the arms race) and the only thing that will weaken the effectiveness of their techniques is to expose their business model to a greater range of algorithms. If you have three or four search engines all working on slightly different principles, it becomes a lot harder to game them all with the same content, even if gaming any one of them would be trivial. In other words, competition in the SE space at the algorithmic level is something we sorely need to see.

In parallel, my suggestion for one new search engine to add to the mix: a crawler for unsubsidised content. That is, the results consist solely of pages that don't carry advertising of any kind. This wouldn't exclude ecommerce sites but would exclude most kinds of affiliate marketing. Subscriber-only sites could pay to be indexed at a flat rate, though guaranteeing that this fee wouldn't affect rankings might be tricky. Alternatively a journal-access style of subscription model could see the SE paying the content site owner when one of its paying users consumes their information.


http://xkcd.com/810/

The ideal solution would be to make it so that the easiest path for someone trying to abuse Google is simply to be constructive. If your abuse is what people want anyway, then you've won. It's sort of like capitalism: assume that people are nasty sonuvabitches who will lie and cheat their way to fame, money, and power, and then make it so the easiest way to get those is simply to make things that other people want.

Obviously, we're not there yet, and a lot of these spam pages are fairly useless to visitors.


The problem is that it's a tug-of-war situation and in a tug-of-war, you win by passing a minimum threshold; there's no such thing as a runaway victory. Rather than leading to a kind of informational meritocracy, what you're talking about leads to the situation we're in now, where content that is perceived by some proportion of SE users as 'good enough' grows at a rate that is much faster than natural, and drowns out all other content.

If you take advertising out of the equation then the motives for producing general-topic content revert to what they were a few years ago: personal expression, free expert opinion, community discussion, journalism in search of subscribers.


a crawler for unsubsidized content

Note that this doesn't require an all-new crawl/engine: just for an existing engine to offer an advanced operator that filters ad-drenched pages from results. Even just an operator that eliminated AdSense sites would be a big win for some queries.
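
A rough sketch of such a post-filter (the fetched pages are invented; matching on the well-known googlesyndication script URL is just one possible heuristic):

    import re

    # Drop results whose HTML carries the classic AdSense include.
    ADSENSE_PATTERN = re.compile(r"pagead2\.googlesyndication\.com", re.IGNORECASE)

    fetched_pages = {
        "pickyourown.org/blueberrypie.php": "<html>...recipe text, no ads...</html>",
        "adfarm.example/blueberry-pie": "<script src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></script>",
    }

    def filter_unsubsidised(result_urls, pages):
        return [url for url in result_urls if not ADSENSE_PATTERN.search(pages.get(url, ""))]

    print(filter_unsubsidised(list(fetched_pages), fetched_pages))  # keeps only the ad-free page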


While technically feasible, this would rule out a large number of high quality content sites which are currently ad-supported. For example, specialist community forums often carry ads just to cover costs. You need to allow for a different funding model for sites like that, such as subscriptions.


While I agree that we need to see more algorithmic competition in this space, I fail to understand how any of these new SEs would generate revenue apart from the current ad-supported/PPC model. If these SEs continue to use ads to make money, then the problem of content farm spam would simply migrate to these sites. All they would need to do is to figure out how best to game the new algorithm.


Just got a mail from Duck Duck Go that they're already working on that one:)


Neal Stephenson's novel Anathem has a section that talks about how the 'reticulum' (the internet, in the book's fictional world) was overrun with false copies of documents with slight changes made to them. 99.99% of all of the information on the internet was spam.

A huge industry of commercialized systems connected to the internet for the sole purpose of filling it with spam, and then the corporations would sell back filters and knowledge of which documents weren't spam to customers. Eventually, the algorithms used to modify documents developed a malicious edge, so that the thousands of spam copies of an original document would be deceptive in ways that would harm people (e.g., in Marco's electrical plug wiring example, the document would have been modified so that it could get you killed by telling you to touch the wrong wire or something.)

Inevitably, it spiraled out of control, and a sophisticated system of social trust and ranking was put in place by IT workers and systems administrators, which are a caste and race of people in the fictional world.

Good book. Prescient, even.


It wouldn't be the first prescient thing Neal Stephenson has put in a book. Whenever I see Google Earth, I think of Snow Crash.


Since Google Earth was inspired by Snow Crash I'm not sure you can classify it as prescient in the same way.


What do you mean? I don't remember any of the interesting parts of Snow Crash being about 3D globes with search results on them.


See excerpt here (also links to an interview with a Google Earth cofounder explicitly crediting Snow Crash for inspiration):

http://ogleearth.com/2005/09/snow-crash-redux/


This is much ado about nothing. Google has a few search problems, and they always have, and they always improve.

Also, if Marco is going to list some problems, how about listing some problem searches? I search for what he lists, and the top result is fine in most cases, and debatable in others.

You folks think Google sucks? I don't. It's awesome, and I rely on it more every day.


It really depends on what you're searching for. I purchased a handset that runs Android in September and was hoping to find out when it would be upgraded to a newer version of Android. The first two pages were almost completely filled with the same article.

To be fair though, I tried the same search on DuckDuckGo and Bing and got mostly the same results.

I was just expecting Google to be better than that.


Sounds like an obscure search, and if you aren't going to list the search terms you tried, it's really hard to judge whether it's user error or Google error. Google is only psychic to a point.


It's a lot less good than it used to be.


Sites like Demand Media see a gap for particular content and churn out cheap crap.

Blogs see an idea in the public consciousness and jump on the bandwagon with derivative posts.

Anyone see the parallel? Actually there is a difference: the DM writer got paid.

Product searches have been screwed for years. I've often wished I could filter out any price search engines and/or retailers from results. What's worse is that all these sites have places for reviews (of which there are never any) but hey the review keyword is there.

But as for this post there's nothing new here. It's a rehash of a bunch of other posts from the last month.

I can still find what I want with ease on Google. Am I just some kind of gifted searcher? I seriously doubt it.

It's like these posts are all making slippery slope arguments ("there are two content farm results on the first page; if this trend continues there will be 7,000 content farm results") rather than complaining about the actuality.

The other mistake made here is to assume Google's algorithm is static. This is false. It's a rapidly moving target.

Like another comment says: such noise (spam) isn't unique to Google so is the "problem" with Google's index or the Web itself?

If nothing else these posts all make the case that Google's index is algorithmic. I say this because at different times you'll see conspiracy theories about Google promoting certain properties over others.

Here's a question: if Google started blacklisting sites, how long would it take for complaints of censorship or favoritism to appear?


I agree with most of your points, but I would truly value Google providing me with the capability of blacklisting websites for myself.

Such a thing is already possible with just browser plugins, but I'd like that blacklist to follow me around and grow algorithmically (based on preferences of similar users), and hacking around a product's deficiencies is really not "voting with your wallet".

And in such a case Google couldn't be accused of censorship / favoritism.


Wonder how many duplicate topic and mostly duplicate content articles we're going to see about how Google provides duplicate content and duplicate topic answers to searches?

My irony meter is pegging.


It is a public debate, similar in nature to discourse that gradually focuses in on an approximate truth. Your comment is especially unfair when directed at such a consistently high-quality blog.


Could the OP be referring more specifically to HN? HN is akin to a curated system, and of course there are plenty of duplicate submissions on HN.

I'm not expressing an opinion one way or the other. Rather, that was my take on his comment, which amused me.


I apologize. My intent was to use sarcasm to comment about the nature of internet publishing and curating in general, and I believe it came across as criticism towards this author or work. That was not my intention. In fact it was just the opposite.

There is a lot of trash that gets returned in a Google search at times, yes, but there are also a lot of authors out there simply taking a subject and providing their own unique spin on it. Every time somebody publishes -- anything at all -- they open themselves up to exactly the types of criticism the author of this piece makes about other publishers on the net. I saw other commenters doing this on HN about this particular article, and the irony struck me as astounding. The only thing I would have added to his piece is a discussion about exactly what he means by "bad" or "good", since everybody fills in the blanks with their own prejudices when these concepts are mentioned, and that's not a good thing for purposes of the discussion.

I should have known better than to try to be subtle. Humor almost never consistently works the way you want it to on the net. Always backfires in some fashion.


What really irks me in the last few months is that Google increasingly doesn't actually answer my questions. More often than not, none of the results on the first page contain all of my search terms, and most of the time it is the most specific term that is missing everywhere. Or the big G has replaced that term with something completely unrelated. I have to prefix every search term with a + if I want to get a result quality that is even remotely similar to what used to be the default.


Example searches? That's the most constructive way to help us improve.


Hi Matt,

I noticed you were looking for some examples of spam and thought I would show you some blatant spamming for some real estate searches. I got a job recently in a small town and noticed, as I was searching for real estate in the area, that the same type of websites kept popping up. I mean, these sites are horrible. I dug further and I think I uncovered a pretty blatant spam ring. Here are the sites I have found thus far:

www.ogdenutahhomes.com www.utahhomesforsale.com www.kaysvilleutahrealestate.com www.royutrealestate.com www.ogdenrealestate.com www.tooeleutrealestate.com www.smithfieldutahrealestate.com www.utahcornerstone.com www.realestatelogan.com www.realestateinogden.com www.homes4saleinutah.com www.northernutahhomesearch.com www.searchcachehomes.com www.bountifuluthomes.com www.ogdenutahhomesforsale.com www.utahhomesforsale.mp www.loganrealestate.mp www.southernuthomes.com www.retireinlogan.com www.davis.countyutahrealestate.com www.alansharpbarker.com www.realestateprovoutah.com www.slcutrealestate.com www.homesforsalelogan.com www.cachevalleyfsbo.com www.cachevalleyhomesforsale.com www.homesinlogan.com www.homesinloganutah.com www.utahrealhomes.com www.searchloganhomes.com www.liveinlogan.com www.searchcachehomes.com www.providenceutahrealestate.com www.providenceutahhomes.com www.paw.utahcornerstone.com www.hyrumutahrealestate.com

All of these websites seem to be run by the same person. Feel free to pm me if you have any questions. Thanks!


Another story on the front page of HN today documents how the original paper linking vaccines to autism was a fraud (a downright lie).

Interestingly, when one searches for just "vaccine autism" in Google here are the titles of the first 4 results:

Medical fraud revealed in discredited vaccine-autism study

MMR vaccine controversy - Wikipedia, the free encyclopedia

'No evidence' for autism-vaccine link (cnn.com)

Journal: Study linking vaccine to autism was fraud - Yahoo! News

There are zero results from content farms in the first 20 results.

So, curiously, on such topics, content from content farms is completely filtered out by Google.

(If one searches for the same topics only on ehow.com for example, the results talk about a "controversy" and say nothing about an actual fraud).

Is the filtering the result of general PageRank algorithms (pages on eHow about autism receiving fewer links than pages about cooking or wiring outlets), or are pages already weighted by topic by Google? ("important topic" => "casual content" bad; "casual topic" => "casual content" ok)


Isn't it telling that you, if you're speaking on behalf of Google, i.e. working there on search, don't even know where your weaknesses are, don't even know where to begin? If you have to rely on community input just to spot a direction for improvement, you are probably in deeper trouble than all these articles suggest. Would Google even be where it is today if it hadn't spotted a weakness and improved on it a decade (or more) ago?

I've personally mostly given up on Google for basically _any_ kind of product search, because it reminded me of going through my spam-riddled email inbox before my provider had spam filters (or before I switched to Gmail). Before that, I had to give up the Google Groups Usenet interface because it was so cluttered with spam that it wasn't even funny any more how useless it was, and Google _still_ kept associating its name with it.

If you have to, kick the spammers out manually, to prevent users jumping ship, until you eventually sort it out algorithmically.

Edit: When you downvote a posting I put time into writing, you could at least tell me what I actually did that was downvote-worthy, so I can refrain from doing it in the future; it is in our common interest to avoid downvoting and being downvoted. So, can you please elaborate in hindsight?


I know plenty of weaknesses in Google, and we work hard on the problems that we think matter the most (e.g. in 2010, we worked a lot on hacked sites so that regular people wouldn't stumble into an awful experience).

But it's very helpful to get independent, outside examples. It moves the conversation past "Google sucks" to "Google sucks because of query X." Sometimes those queries are new, but often what's just as useful is hearing what people dislike about the current results for the search X.


I find it very hard to believe that Google's Matt Cutts would come down from the ivory tower to answer comments on a website.

Google suspended my Google Checkout account because two words in the title of my book (make money editing from home) sent up a flag. No matter how many times I offered to give them a copy of my manuscript, they declined.

The eventual apology was appreciated, as was the full reinstatement, but it doesn't change the fact that I was wrongly accused, wrongly convicted, and had no court of appeal to go to except the very people who shut me down.

I no longer use Gmail, Google Toolbar, Chrome, Google AdSense, Google Checkout, Picasa, Blogger, Feedburner, Google Webmaster Tools, Google as a search engine, or whatever else I can think of. Google violated my terms of service.

Oh, and the nastygrams looked outsourced, and the apology had typos and bad grammar. You guys should hire an editor.


Actually, I'm not done. I didn't do anything wrong. Google admitted this. It was their mistake, not mine. They admitted this in their apology, which I did appreciate.

For my account to be randomly suspended reminds me how much an Act of Google resembles an Act of God. That's what annoyed me so much. Google did evil that day.

And that, to bring it back to the point "Matt Cutts" made, is WHY Google sucks. It can't be fixed.


I find it very hard to believe that Google's Matt Cutts would [...] answer comments on a website.

http://news.ycombinator.com/threads?id=Matt_Cutts


Two examples I see in my history right now are

bsi +lnk scada +foof things i will not

In both cases the results (on www.google.com with a browser running with LANG=de_DE) without the plus are considerably worse, and foof is definitely not edible :-).


Thanks. You said two examples. Was this the single search [bsi +lnk scada +foof things i will not] or two different searches? Usually we write [X] or ["X Y"] to mean doing a search for X or "X Y" at Google, so adding square brackets would help me understand what the two searches were more clearly.

Or were you just frustrated that we assumed "foof" was intended to be "food" unless you added the '+' character to specify an exact match?


Ah, sorry, that should have been two lines; I fat-fingered the leading two spaces. Those should have been

  +foof things i will not
  bsi +lnk scada


Yes! Something changed in the search algorithm recently, for the worse. I search for X and get Y. Is this documented anywhere?

(I didn't know that a "+" prefix can fix this. Thanks for the info.)


It's easy to filter out spam once we identify it; so the question is: "what is spam"?

Some argue that content farms such as Demand Media aren't spammers, because the content they produce actually satisfies the casual searcher better than elaborate, scholarly exposés on the same subject do. Casual content for casual searchers.

Others consider pages issued by content farms to be the epitome of spam: spam that doesn't look like spam, and that ends up cluttering search results. Spam is not irrelevance: spam is clutter.

A corollary to "what is spam" is: "who should make the call"?

Originally, Google tasked itself with making this call, and it did a pretty good job at it.

But why not me? It should be possible for Google to distinguish between "casual" and serious content, and then let the user decide which they prefer.

Well, actually, that is already possible: it's called "reading level" and it's accessible on the advanced search page.

Searching for "how to wire an outlet" gives ~12 M results, the first of which comes from about.com.

When filtering the search to display only "advanced reading level" results, there are only 264,000 results left, the first two coming from Wikipedia (and the 3rd and 4th still coming from ehow.com).

So Google already knows what is "casual content" and already lets users filter it out.

Maybe a simple solution would be to add the filter directly in the search results page instead of having it buried in the advanced options.
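
Google hasn't published how "reading level" is computed, but a crude readability formula gives a feel for how casual and advanced content could be separated. Here's a minimal sketch in Python, using the Flesch-Kincaid grade level as a stand-in for whatever signal Google actually uses (the thresholds are arbitrary):

  import re

  def fk_grade(text):
      # Flesch-Kincaid grade level from sentence length and a rough syllable count.
      sentences = max(1, len(re.findall(r'[.!?]+', text)))
      words = re.findall(r"[A-Za-z']+", text)
      if not words:
          return 0.0
      syllables = sum(max(1, len(re.findall(r'[aeiouy]+', w.lower()))) for w in words)
      return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

  def reading_level(text):
      grade = fk_grade(text)
      return "basic" if grade < 9 else "intermediate" if grade < 13 else "advanced"

  print(reading_level("Turn off the power. Unscrew the plate. Attach the wires."))  # basic
  print(reading_level("Electrical receptacles must be bonded to the grounding "
                      "conductor per the applicable provisions of the code."))      # advanced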


One thing that I notice about the spam sites and scraper sites is that they often have very similar content and/or layout. What if Google were able to determine how similar certain sites were and consolidate those into a single result, like they do with Google News?

Then when I search for AMD Bulldozer news and there are 20 sites all with the same article from the same date, I don't have to change my search parameters to show just the last month. Instead, it would determine that the content was similar, smash it into a single result, and leave room for 9 more less-similar results that may better include what I want.
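
This isn't how Google News actually clusters stories (that isn't public), but a back-of-the-envelope sketch of the idea is simple enough: fingerprint each page with word shingles, fold near-duplicates into one cluster, and show one representative per cluster. The field names below are made up for illustration:

  def shingles(text, k=5):
      # Set of k-word shingles: a cheap fingerprint for near-duplicate detection.
      words = text.lower().split()
      return {tuple(words[i:i + k]) for i in range(max(0, len(words) - k + 1))}

  def jaccard(a, b):
      return len(a & b) / len(a | b) if (a | b) else 0.0

  def collapse_duplicates(results, threshold=0.8):
      # Greedily fold each result into the first cluster it closely matches.
      clusters = []  # list of (representative_shingles, [members])
      for r in results:
          s = shingles(r["text"])
          for rep, members in clusters:
              if jaccard(s, rep) >= threshold:
                  members.append(r)
                  break
          else:
              clusters.append((s, [r]))
      # One result per cluster; the freed-up slots go to less-similar pages.
      return [members[0] for _, members in clusters]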


Decreasingly? This has been a rollercoaster for years. I was more of a Webmasterworld regular a few years ago than now but around 2005-2006 a lot of people thought Google had gone to pot.

http://www.theregister.co.uk/2006/05/04/google_bigdaddy_chao... http://www.webmasterworld.com/google/3040496.htm http://www.seo-news.com/archives/2006/apr/6.html http://www.webmasterworld.com/forum30/34407.htm http://www.mattcutts.com/blog/feedback-webspam/

Plus ça change..


Could it possibly be that Google is in the middle of an innovator's dilemma?

Twitter, Hacker News, Tumblr, and Quora are all really shitty Google replacements. But I use them to get certain kinds of information. It isn't enough to justify a radical change at Google -- especially if they are even slightly focused on maintaining revenue.

There must be an opportunity for a more curated experience, where the browsing behavior of a few thousand selected people can be used to juice authority. I don't think the human editors need to know they are doing that job. Maybe they should use Chrome data for this.


A first step would be to hide the queries data (especially trending queries). It was an interesting curio but its major consumers now are spammers.


Am I using a different Google than him?

I type in "how to wire an outlet" and all the top results look useful. Sure there are some ads embedded on the pages and the top hit is about.com with a 10-page slideshow, but every hit looks like it explains exactly how to wire an outlet.

http://www.google.com/search?q=how+to+wire+an+outlet

Even when I try the spammiest searches, it looks like they're returning pretty relevant results:

http://www.google.com/search?q=best+price+on+viagra http://www.google.com/search?q=wrist+watch+deals


I ignored these "Google sucks because of spam" articles until this one. I tried his first worst example, [large sensor compact camera]. Almost all the results on the first page are good. #3 is suite101, which is one of these farms, but it actually contains good content too. So I will go back to ignoring these "Google sucks because of spam" articles, unless one shows up with some quantitative results.


My guess for the near-to-mid term is celebrity curation.

I keep thinking about how Roger Ebert, after decades of movie reviews, started branching out into political (anti-Tea-Party) commentary and other articles. If you knew that a trusted brand (for many) like Ebert was curating home TVs, or projectors, or blank DVD media in an unbiased way, wouldn't you want to see what he had to say?

Or Thomas Dolby on audio equipment, Sting on Tantric books, and so on. They'd make money through affiliate links or even subscriptions.


Isn't it possible that all of this recent bad press about Google is a consequence of "Instant"?

Here's my thinking:

- to get good results, one needs to type as many relevant words as possible

- Instant encourages people to type fewer and fewer words (not even words: a few keystrokes and you're on)

But if you type very few words, or if you search for "frequent" queries (generated by Instant in response to your few keystrokes) then all you get is spam.

Spam is optimized for frequent queries, not very specific ones. Instant should be renamed Instant spam.


That's not really the case; plenty of spammers optimize for long-tail searches because (1) they're easy to rank for, and (2) it's easy to create ambiguous, autogenerated content for them.


> plenty of spammers optimize for long tail searches

But how do they do it? By nature, there are many more long tail searches than frequent ones, and each one is rarer (or unique).

How do spammers find them?


Through keyword-generation tools such as the Google AdWords keyword tool and a host of free online ones (of questionable quality).


I started writing out a comment on the somewhat heretical notion that biasing search results against AdSense click-throughs would probably be a strong predictor for spam detection, but the comment got long enough that I folded it into a blog post:

http://news.ycombinator.com/item?id=2074621


It’s impossible to do any meaningful product research with Google.

Right now, I often start my product research within Amazon. However, that's only a start, as Amazon isn't great for everything. For large appliances, Consumer Reports is a good starting place. I guess I'm an example of the switch from search engines to "expert" sites.


Amazon is great but search on Amazon doesn't work too well; what works very well is to use Google and restrict the search to Amazon:

site:amazon.com some product


Why do people complain about product research? There is Google Products, where you search only products, not spam.


I have to say there's some truth to this. Why is it that I increasingly must search through the search results just to find the site that originally published the string returned in the first 3 to 8 results?

I don't want to patronize all these sites repackaging content created by others, yet they continually appear before the creator.


One wonders if Google is becoming the new Yahoo. If so, a big opportunity for the likes of DuckDuckGo and other nimble searchers. Today's upstarts can also run on the cloud, sidestepping the need to build Google-scale data centers (at least initially).


This kind of worries me.

On the one hand, Google isn't the best web search tool. I've switched to DuckDuckGo, and so has everyone who's seen me use it. But, I think Google still provides a valuable public service: indexing the entire web and handling that much traffic is not an easy task, and a lot of other things (like DDG) depend on that humongous cluster.

So on the one hand I want to see the best search engine win, but on the other hand if Google goes out of business (or more likely, starts losing money and canceling projects) then I'm afraid it'll take a lot of things out with it, with no clear replacement.


Indeed, perhaps it's time to give DuckDuckGo a try. They seem to actively filter out all the ad sites. I've been going nuts in the past couple of months with Google, searching for technical solutions to problems, command-clicking on links to open them in tabs, and discovering that most are clone ad sites scraping a question from Stack Overflow.


It was these that won me over: http://duckduckgo.com/bang.html . Most of the time I know what general sort of thing I'm looking for, and I'm happy to give the search engine hints if it'll help. Even if I don't, it'll essentially flat-out ask me what sort of thing I want, and then give me more specific results: http://duckduckgo.com/?q=ruby

But, I mean, I'm doing all this by typing it into Chrome's address bar. And a lot of the time, DDG is just returning results from Google's API. I want to use DDG, but I want to make sure Google still exists because I need it even if I don't use it.


I just use the browser, instead of the !bang feature in Duck Duck Go.

Chrome / Firefox have the option of adding a search shortcut.

For instance I type "py package-name" for searching inside Python's index, or "am product" for searching in Amazon, or "w something" for searching Wikipedia, or "t some words" for doing a google translate, or "hn something" for searching Hacker News.

Chrome even creates these shortcuts automatically, so you can just type "amazon.com android" and it will do a search on Amazon, although in Firefox it is easier to add your own.


FWIW, Opera has this too, and in fact had it first, by about 2005.


DuckDuckGo uses the index and ranking of Bing, not Google.


Isn't this how Facebook topples Google and completely dominates the internet? By incorporating your social graph into your search results, your relationships can influence what is returned by the search.

Suppose you could create some sort of "friend" list with HN users and that were used to prioritize your search results. If you get a result you don't like, click that you don't like it and the software will reduce the weights of the parts of your social graph which caused that result to be highly scored.
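
As a toy sketch of that feedback loop (all names and numbers here are hypothetical, and a real system would be vastly more involved):

  # Per-user weights over a hypothetical "friend" list (e.g. trusted HN users).
  friend_weights = {"alice": 1.0, "bob": 1.0, "carol": 1.0}

  def social_score(endorsers):
      # Boost a result by the summed weights of friends who endorsed or shared it.
      return sum(friend_weights.get(f, 0.0) for f in endorsers)

  def dislike(endorsers, penalty=0.8):
      # "I don't like this result": dampen the influence of whoever boosted it.
      for f in endorsers:
          if f in friend_weights:
              friend_weights[f] *= penalty

  endorsers = ["alice", "bob"]
  print(social_score(endorsers))  # 2.0
  dislike(endorsers)
  print(social_score(endorsers))  # 1.6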


I set up a second, filtered search using Google Custom Search and added it to my browser. I don't always use it, but it's easier to switch to when I encounter spammy topics (like code look-ups). It's pretty easy to blacklist fakes... and even useless SEO-heavy sites like experts-exchange, bigresource, etc.

Here's how, if interested: http://radleymarx.com/blog/better-search-results/


I am beginning to use this mode of Google search more; it cuts out a lot of the spam.


Is there a huge incentive for Google to improve if a lot of the content farms are monetised by AdSense and actually return Google money?

You could argue that they might lose their spot as the default search engine for a lot of people, but Microsoft has presumably thrown a huge amount of money and expertise at the problem and hardly dominated. I suspect this is not going to be a significant problem for Google in a hurry.


I'm not sure if prioritizing links over keywords is really going to help matters.

I know a lot of 'little guys' who know something about a topic and can write prolifically, but who suffer under the delusion that 'If I build it, they will come.' Success in SEO is largely possible because 95% of webmasters have no idea how to promote content.

I've also developed 'digital libraries' for major academic organizations and a common thread there is a complete lack of interest in indexability. There's a lot of fantastic content trapped in the ivory tower because nobody considered the 'unwritten standards' for how the web worked.

A big part of the problem is that it's very hard to get legitimate links these days. You used to be able to get into the Yahoo directory for free, but now you have to pay a $300 a year bribe. Before 2000, it was common for people to create large collections of links they liked. Today, major players like Engadget have a policy of not wasting their PageRank on other sites. Afraid of spam, many blogs and forums are on a hair trigger to stop people from dropping links in comments, relevant or not.

If legitimate links are harder to get, that 'lowers the bar' for spammers.

A real answer to spam would be to strengthen the signal so it can break through the noise. It might be helpful to be able to get more feedback from web users about the quality of pages, but this is tough. The horrible truth is that there are more pages on the web than there are viewers, so even if you could get feedback from 10% of viewers, many pages would be badly undersampled. Spammers would also target any feedback channels that exist, and with low response and sampling rates, it might be easier to overload the feedback channel than it is to create link noise.

Another answer is to beat Demand Media at their own game, the same way that Stack Overflow has beaten the spam sites that dominated programming questions two years ago.


I think search quality would go up if Google gave me the option of blocking domains from SERPs. I never want to see results from a content mill (eHow, Mahalo, etc.), in addition to all the made-for-AdSense sites I come across less frequently. They could also use the collective blocking data to help tweak the spam filter.


I think the integration of social media is a possible solution. Recommendations and likes (from Facebook or other places) are hard to artificially jack up and can also offer great results if they are tied to your friends. I like the direction Bing is going with this. It is the only way I can see to get large-scale human-edited results for the web.

Unless Google develops highly advanced AI (which is a possibility), computer algorithms can be gamed. Humans can be gamed as well, but because we are all so different, I don't think there is a single approach that would fool a large segment of the population at once.


"Recommendations and likes (from Facebook) or other places are hard to artificially jack up"

If recommendations and likes are added to Google's algorithm, people will find ways to artificially jack them up. For example, marketers have organized networks of Digg users to increase the number of Diggs. I find it hard to believe that Demand Media and others will not be able to artificially inflate Facebook likes.


I'm not quite sure how it will happen, but at some point I think it will be beneficial to commoditize the underlying crawl and index data, so that there can be more domain specific focus and more diverse sources of innovation applied to solving this and other search problems. One or two sources trying to be all things to all people and all problems isn't going to scale.

Blekko's slashtags are a good start, but it needs to go much further.


These bad search results are not an IT problem; they are the result of policy set by Google's top executives: act like you want to be good and do better for Google's users, but keep serving up the same old stuff because it is tied to the revenue cow and the paying clients. This problem would never have existed if Google considered the end-user experience more important than the advertisers.


Why does Google no longer offer the option to permanently remove a specific domain from your search results? My personal search quality would be dramatically improved if I could specify even a short blacklist.

In fact, dear lazyweb: is there a browser extension or greasemonkey script that makes Google return 100 results at a time and then filters out the best 10 based on a blacklist?


Set Google to return 100 results and then use a userscript or extension to filter them. E.g. https://chrome.google.com/extensions/detail/ddgjlkmkllmpdheg...
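
The extension does the work in the browser, but the core logic is just a domain blacklist applied to the raw result list. A rough Python sketch of that step (the blacklist entries are only examples):

  from urllib.parse import urlparse

  BLACKLIST = {"ehow.com", "experts-exchange.com", "bigresource.com"}  # example domains

  def filter_results(results, keep=10):
      # Given ~100 (title, url) pairs, drop blacklisted domains and keep the top few.
      kept = []
      for title, url in results:
          host = urlparse(url).hostname or ""
          if any(host == d or host.endswith("." + d) for d in BLACKLIST):
              continue
          kept.append((title, url))
          if len(kept) == keep:
              break
      return kept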


Couldn't part of this problem be solved with an algorithm that identifies when several pages have roughly the same content (i.e. an original Wikipedia article plus 5 copies of it elsewhere on the web) and then gives the oldest occurrence in the index a much higher rank?

That would kill the incentive to create these spam sites and give the user the result s/he was looking for.
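
Once near-duplicates are grouped (by whatever similarity measure), picking the canonical copy could be as simple as preferring the page the crawler saw first. A tiny sketch, assuming a hypothetical "first_crawled" timestamp is stored per page:

  def canonical_by_age(cluster):
      # From a set of near-duplicate pages, keep the one first seen by the crawler.
      return min(cluster, key=lambda page: page["first_crawled"])

  pages = [
      {"url": "http://scraper.example/copy", "first_crawled": 1294300800},
      {"url": "http://en.wikipedia.org/wiki/Original", "first_crawled": 1104537600},
  ]
  print(canonical_by_age(pages)["url"])  # the original, crawled years before the copy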


Maybe Google could scrape DDG on the fly for each search, then do a diff, and filter out any results that aren't in DDG... that would be the fastest way to remove spam ;)


People just can't come up with the right search queries and blame the search engine. I always find what I want via Google.


I was on the Internet long before Google showed up, and I'll be here long after Google is dead and forgotten.


Google, can't you solve these problems with money?

Pay an army of users to press ham/spam buttons, Mechanical Turk style.


Please, not a rehash of what we've been reading for the last couple weeks.


I do find it interesting how this has become a meme, but I think that Marco has added something interesting to the discussion. I like his categories of searches, and the decision he outlines that Google needs to make.

But, really, if you don't want to talk about this subject anymore, close the tab.


Apologies. I skimmed through the first few paragraphs and thought that it was a summary. I was attempting to help HN cut back on multiple posts on the same old content. Ah well, I guess I won't be doing that any more.


I don't think this is surprising - the top management seems more interested in building OSs and social networks. Search doesn't seem like their highest priority anymore.


There are more people working on search quality than ever in our history. The PR team for search quality does a great job pitching stories about search quality, how hard it is, etc., but lots of reporters prefer to write about the shiny things (or more tangible things--you can hold a phone in your hand) rather than improvements to search quality.

But Google continues to work hard on improving search every day, even if that doesn't always get covered.


One solution may be for Google to radically change their algorithms and policies for web search to de-emphasize phrase-matching and more strongly prioritize inbound links and credibility.

Inbound links, and the "credibility" calculated from them, are what killed the web the first time around. There was once a democratized web era when that actually worked -- when millions of people had their little Geocities pages and were linking to the cool stuff -- but in the modern era it's 99% consumers who cast no votes, and the last 1% is extraordinarily incestuous circular link love: Marco links to Coding Horror, who links to Daring Fireball, who links to Scoble, who links to Marco, etc.

People with neither information nor authority end up being the credible authority on matters they aren't authorities on. Scoble a few years back pointed out that, according to search engines, he was the most important Robert in the world. That is a frightening concept.

We will move from an era of search engines to an era of expert engines. Many of the questions I used to "ask" Google I now ask of Wolfram Alpha, and its approach has turned out to be quite useful. Expand that computer knowledge more broadly, improve the human-syntax parsing, and we'll have a winner. Several such systems are built around computer learning of the Wikipedia corpus.


I noticed recently that Google has started to give explicit answers to some search queries - e.g http://www.google.co.uk/search?q=json+content+type


Yup, that is exactly the sort of "machine knowledge" that is beneficial.

I suspect that where their algorithm isn't fairly sure of the answer, it essentially does A/B testing with users -- a while back I asked it who the governor of Illinois was and it replied Blagojevich (long after he was ousted). I pointed this out to a friend, and when they tried it, they got Quinn. In both cases it gives you the option of flagging whether the answer is wrong, though it seems suspect to let the people asking the question in the first place declare its rightness, beyond egregiously wrong answers.


Several such systems are built around computer learning of the Wikipedia corpus.

Any ones in particular that you've found work well?


Powerset worked well enough that Microsoft bought them for $100 million.


Cuil, of course!



