Being in the index isn't very meaningful. Very little of our spam-fighting/ranking prevents sites from showing up for "site:" queries, because in general we think that if someone intends to go to a domain directly, the only reasonable thing to do is to show that domain.
Unfortunately I don't have a better way of assessing it to offer. Internally, we often look at impression-weighted precision as a metric, but I don't think there's an easy way we could expose that to you.
A more reasonable thing to do would be to take a sample of DDG's query logs, scrape the results from Google, then see what percentage of Google's results come from your spam domains, but that requires sending a lot more queries to get any useful data.
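To make that concrete, here's a toy sketch of what such a check could look like. The data shapes and the helper name are mine, not anything Google or DDG actually exposes; it assumes you've already scraped the result URLs for each sample query by some separate means.

```python
# Hypothetical sketch: given scraped SERPs for a sample of queries, measure
# what fraction of the results come from a known set of spam domains.
from urllib.parse import urlparse

def spam_result_rate(serps, spam_domains):
    """serps: dict mapping query -> list of result URLs (already scraped).
    spam_domains: set of domains considered spam.
    Returns the fraction of all sampled results that came from spam domains."""
    total = 0
    spam_hits = 0
    for query, urls in serps.items():
        for url in urls:
            domain = urlparse(url).netloc.lower()
            if domain.startswith("www."):
                domain = domain[4:]
            total += 1
            if domain in spam_domains:
                spam_hits += 1
    return spam_hits / total if total else 0.0

# Made-up example data:
serps = {"cheap widgets": ["https://example-spam.com/page",
                           "https://en.wikipedia.org/wiki/Widget"]}
spam_domains = {"example-spam.com"}
print(f"{spam_result_rate(serps, spam_domains):.1%} of sampled results were spam")
```

The catch, as noted above, is getting enough queries and scraped result pages for the percentage to mean anything.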
That said, if anyone has a meaningful sample query set, I'm certainly interested in running it against my spam index. I see a lot of hits on it via other search APIs.
I don't have accounts right now, so it's a bit tricky. If there were some way people could export their Google search history, then I could use that. Feel free to email it in anonymously.
In general we want to measure what matters to users, so removing a lot of spam that nobody sees doesn't really make anything better.
By the same token, if you are trying to launch a new spam classifier that has some false positives, and one of those false positives is yahoo or facebook, it doesn't really matter how good the classifier is: it will never be worth the collateral damage.
As a result, rather than measuring precision/recall as a percentage of domains or as a percentage of urls we usually try to measure it as a percentage of results that appeared on a search result page or results that users click on, mined from our logs.
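To put rough numbers on that distinction, here's a toy sketch of domain-level precision versus impression-weighted precision. The data and function names are invented for illustration; this is not Google's actual metric code.

```python
# Toy comparison: counting domains vs. weighting by how often each domain
# actually shows up on result pages (impressions mined from logs).

def domain_precision(flagged, true_spam):
    """Fraction of flagged domains that really are spam (each domain counts once)."""
    return sum(d in true_spam for d in flagged) / len(flagged)

def impression_weighted_precision(flagged, true_spam, impressions):
    """Same idea, but each flagged domain counts in proportion to its impressions."""
    total = sum(impressions.get(d, 0) for d in flagged)
    correct = sum(impressions.get(d, 0) for d in flagged if d in true_spam)
    return correct / total if total else 0.0

flagged = {"spammy-parked.biz", "facebook.com"}   # classifier output, one false positive
true_spam = {"spammy-parked.biz"}                  # ground truth
impressions = {"spammy-parked.biz": 40, "facebook.com": 1_000_000}

print(domain_precision(flagged, true_spam))                              # 0.5 -- looks tolerable
print(impression_weighted_precision(flagged, true_spam, impressions))    # ~0.00004 -- clearly not shippable
```

The gap between those two numbers is exactly the yahoo/facebook false-positive point above: a classifier that looks fine per-domain can be disastrous once you weight by what users actually see.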
This is one of the bajillion reasons why it's absurd to expect Google to throw away all its logs data. The logs are essential to coming to any meaningful conclusions about the current quality of our search, let alone finding ways to improve it.
"As a result, rather than measuring precision/recall as a percentage of domains or as a percentage of urls we usually try to measure it as a percentage of results that appeared on a search result page or results that users click on, mined from our logs."
So this basically means that you are able to discern content spam on authoritative domains (facebook, wordpress.com, etc.) based on CTR, bounce rate, and impressions compared to the surrounding SERP results, rather than by comparing that data against the parent domain as a whole?
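Just to illustrate the comparison being asked about (this is purely the questioner's hypothesis, not anything Google has confirmed, and the numbers and names are made up):

```python
# Speculative sketch: judge one result's engagement against the other results
# shown on the same SERP, rather than against its parent domain overall.

def relative_ctr(result_ctr, serp_ctrs):
    """Ratio of one result's CTR to the average CTR of the page it appeared on."""
    return result_ctr / (sum(serp_ctrs) / len(serp_ctrs))

serp_ctrs = [0.32, 0.18, 0.02, 0.11, 0.09]   # CTRs of the results on one SERP
suspect = 0.02                                # e.g. a spammy page hosted on wordpress.com
print(relative_ctr(suspect, serp_ctrs))       # well below 1.0 -> underperforming its slot
```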
On a tangential note, it's a shame that Google doesn't offer a service like BOSS (though BOSS could be enhanced by offering revenue sharing as an alternative to charging per query).
It seems the Google search APIs have actually gone backwards over the last few years.
Yahoo's goal with BOSS is to fragment the search market; Google is so far ahead that Yahoo knows they can't compete head-on.
Google doesn't want the search market to be fragmented; they want to dominate the market. I think that's why Google doesn't have a good search API offering.
I think DDG is a great example of why Google doesn't have a BOSS-like API. It seems pretty clear that DDG is violating the TOS of the Yahoo API by mixing search results. Yahoo seems to be looking the other way (for now), but you can bet that Google would be less forgiving.
There's not a search engine out there that wants to allow you to muck with their relevance algorithm by changing the results, and Google has more to lose from DDG-like activity than it might gain.
The writer states himself that the results of "site:" don't mean anything: "Of course this says nothing about how much they appear in the rankings." So what's the point of this article?
They mean something, i.e. that they "are in their index in some form." I've been blacklisted before, and when you're blacklisted, you don't show up in site: queries.
That said, I wanted to acknowledge that this isn't ranking data. However, perhaps as a result of this post, I'll be able to get some and re-post those results.
I'm sure you would agree with this, but in case others are reading: simply blacklisting these sites wouldn't be the best thing to do. Many of them are just expired or parked pages.
Google visits domains all the time so they should be aware quite quickly when things move from spam/parked to non-spam/non-parked. Therefore, I don't see why they shouldn't all be out of the index until they have useful content on them.