Being in the index isn't very meaningful. Very little of our spam-fighting/ranking prevents sites from showing up for "site:" queries, because in general we think that if someone intends to go to a domain directly, the only reasonable thing to do is to show that domain.
Unfortunately I don't have a better way of assessing it to offer. Internally, we often look at impression-weighted precision as a metric, but I don't think there's an easy way we could expose that to you.
A more reasonable thing to do would be to take a sample of DDG's query logs, scrape the results from Google, then see what percentage of Google's results come from your spam domains, but that requires sending a lot more queries to get any useful data.
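To make that concrete, here's a toy sketch of what such a check could look like. The data shapes and the helper name are mine, not anything Google or DDG actually exposes; it assumes you've already scraped the result URLs for each sample query by some separate means.

```python
# Hypothetical sketch: given scraped SERPs for a sample of queries, measure
# what fraction of the results come from a known set of spam domains.
from urllib.parse import urlparse

def spam_result_rate(serps, spam_domains):
    """serps: dict mapping query -> list of result URLs (already scraped).
    spam_domains: set of domains considered spam.
    Returns the fraction of all sampled results that came from spam domains."""
    total = 0
    spam_hits = 0
    for query, urls in serps.items():
        for url in urls:
            domain = urlparse(url).netloc.lower()
            if domain.startswith("www."):
                domain = domain[4:]
            total += 1
            if domain in spam_domains:
                spam_hits += 1
    return spam_hits / total if total else 0.0

# Made-up example data:
serps = {"cheap widgets": ["https://example-spam.com/page",
                           "https://en.wikipedia.org/wiki/Widget"]}
spam_domains = {"example-spam.com"}
print(f"{spam_result_rate(serps, spam_domains):.1%} of sampled results were spam")
```

The catch, as noted above, is getting enough queries and scraped result pages for the percentage to mean anything.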
That said, if anyone has a meaningful sample query set, I'm certainly interested in running it against my spam index. I see a lot of hits on it via other search APIs.
I don't have accounts right now, so it's a bit tricky. If there were some way people could export their Google search history, then I could use that. Feel free to email it in anonymously.
In general we want to measure what matters to users, so removing a lot of spam that nobody sees doesn't really make anything better.
By the same token, if you are trying to launch a new spam classifier that has some false positives, and one of those false positives is yahoo or facebook, it doesn't really matter how good the classifier is: it will never be worth the collateral damage.
As a result, rather than measuring precision/recall as a percentage of domains or as a percentage of urls we usually try to measure it as a percentage of results that appeared on a search result page or results that users click on, mined from our logs.
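To put rough numbers on that distinction, here's a toy sketch of domain-level precision versus impression-weighted precision. The data and function names are invented for illustration; this is not Google's actual metric code.

```python
# Toy comparison: counting domains vs. weighting by how often each domain
# actually shows up on result pages (impressions mined from logs).

def domain_precision(flagged, true_spam):
    """Fraction of flagged domains that really are spam (each domain counts once)."""
    return sum(d in true_spam for d in flagged) / len(flagged)

def impression_weighted_precision(flagged, true_spam, impressions):
    """Same idea, but each flagged domain counts in proportion to its impressions."""
    total = sum(impressions.get(d, 0) for d in flagged)
    correct = sum(impressions.get(d, 0) for d in flagged if d in true_spam)
    return correct / total if total else 0.0

flagged = {"spammy-parked.biz", "facebook.com"}   # classifier output, one false positive
true_spam = {"spammy-parked.biz"}                  # ground truth
impressions = {"spammy-parked.biz": 40, "facebook.com": 1_000_000}

print(domain_precision(flagged, true_spam))                              # 0.5 -- looks tolerable
print(impression_weighted_precision(flagged, true_spam, impressions))    # ~0.00004 -- clearly not shippable
```

The gap between those two numbers is exactly the yahoo/facebook false-positive point above: a classifier that looks fine per-domain can be disastrous once you weight by what users actually see.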
This is one of the bajillion reasons why it's absurd to expect Google to throw away all its logs data. The logs are essential to coming to any meaningful conclusions about the current quality of our search, let alone finding ways to improve it.
"As a result, rather than measuring precision/recall as a percentage of domains or as a percentage of urls we usually try to measure it as a percentage of results that appeared on a search result page or results that users click on, mined from our logs."
So this basically means that you are able to discern content spam on authoritative domains (facebook, wordpress.com, etc.) based on CTR, bounce rate, and impressions compared to the surrounding SERP results, rather than by comparing that data against the parent domain as a whole?
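Just to illustrate the comparison being asked about (this is purely the questioner's hypothesis, not anything Google has confirmed, and the numbers and names are made up):

```python
# Speculative sketch: judge one result's engagement against the other results
# shown on the same SERP, rather than against its parent domain overall.

def relative_ctr(result_ctr, serp_ctrs):
    """Ratio of one result's CTR to the average CTR of the page it appeared on."""
    return result_ctr / (sum(serp_ctrs) / len(serp_ctrs))

serp_ctrs = [0.32, 0.18, 0.02, 0.11, 0.09]   # CTRs of the results on one SERP
suspect = 0.02                                # e.g. a spammy page hosted on wordpress.com
print(relative_ctr(suspect, serp_ctrs))       # well below 1.0 -> underperforming its slot
```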
On a tangential note, it's a shame that Google doesn't offer a service like BOSS (though BOSS could be enhanced by offering revenue sharing as an alternative to charging per query).
It seems the Google search APIs have actually gone backwards over the last few years.
Yahoo's goal with BOSS is to fragment the search market; Google is so far ahead that Yahoo knows they can't compete head-on.
Google doesn't want the search market to be fragmented; they want to dominate the market. I think that's why Google doesn't have a good search API offering.
I think DDG is a great example of why Google doesn't have a BOSS-like API. It seems pretty clear that DDG is violating the TOS of the Yahoo API by mixing search results. Yahoo seems to be looking the other way (for now), but you can bet that Google would be less forgiving.
There's not a search engine out there that wants to allow you to muck with their relevance algorithm by changing the results, and Google has more to lose from DDG-like activity than it might gain.
The writer states himself that the results of "site:" don't mean anything: "Of course this says nothing about how much they appear in the rankings." So what's the point of this article?
They mean something, i.e. that they "are in their index in some form." I've been blacklisted before, and when you're blacklisted, you don't show up in site: queries.
That said, I wanted to acknowledge that this isn't ranking data. However, perhaps as a result of this post, I'll be able to get some and re-post those results.
I'm sure you would agree with this, but in case others are reading: simply blacklisting these sites wouldn't be the best thing to do. Many of them are just expired or parked pages.
Google visits domains all the time so they should be aware quite quickly when things move from spam/parked to non-spam/non-parked. Therefore, I don't see why they shouldn't all be out of the index until they have useful content on them.