This is a really great description of why building crawlers (and indexes) is such a hard problem. Basically 90% of the "web" is now crap, and by crap I mean stuff you would never ever want to visit as a real human being. Our crawler once found an entire set of subdomains with nothing but Markov chain generated "forum" pages, and of course SEO links for PageRank love (note to SEO types, this hasn't fooled Google for at least 6 years).
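For anyone who hasn't seen this stuff up close: a word-level Markov chain is about all it takes to churn out plausible-looking "forum" text at scale. Here's a toy Python sketch (made-up seed text, chain order of 2), purely illustrative and nothing from our actual crawler:

```python
import random
from collections import defaultdict

def build_chain(corpus, order=2):
    """Map each `order`-word tuple to the words observed to follow it."""
    words = corpus.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=50):
    """Walk the chain from a random starting key to emit spam-grade text."""
    key = random.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(key):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# With a tiny seed this just echoes the input; feed it a scraped corpus
# and you get endless pages of grammatical-looking nonsense.
seed = "post your questions here and our helpful forum members will answer them quickly"
print(generate(build_chain(seed), length=20))
```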
The explosion of cheap CPU and storage means that a single server with a few terabytes of disk can serve up a billion or more spam pages. And seemingly everyone who gets into the game starts with "I know, we'll create a lot of web sites that link to this thing I'm trying to get to rank in Google results ..." Worse, when it doesn't work they don't bother taking that crap down; they just link to it from more and more other sites in an attempt to improve its host authority. That doesn't work either (for getting PageRank).
But what it means is that 99.9% of all new web pages created on a given day are created by robots or algorithms or other agencies with no motive to provide value, merely to provide "inventory" for advertisements. You are lucky if you can pull a billion "real" web pages out of a crawl frontier of 100 billion URIs.
Wow, very interesting comment, thanks! Wouldn't it make sense to build (and maintain) a kind of "official" reference list of all pure spam domains? Or does such a list already exist?
Well, every crawler has to have a list like this; the Blekko crawler tries to keep these pages out of the index (with varying levels of success). But it's not particularly useful for non-crawlers, and since every crawler has its own (possibly unique) way of evaluating hosts, it isn't really transportable.
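To give a rough idea of what "keep these pages out of the index" means in practice, here's a minimal Python sketch that filters a crawl frontier against a denylist of spam hosts, checking parent domains so spammy subdomains inherit the verdict. The host names are made up; a real crawler's list would be enormous and constantly re-evaluated.

```python
from urllib.parse import urlsplit

# Hypothetical denylist; a production crawler would load millions of entries
# and refresh them as hosts are re-scored.
SPAM_HOSTS = {"example-spam-farm.com", "cheap-pills.example.net"}

def is_spam_host(url):
    """Return True if the URL's host, or any parent domain of it, is denylisted."""
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    # Check the host itself plus every parent domain above it.
    return any(".".join(parts[i:]) in SPAM_HOSTS for i in range(len(parts) - 1))

frontier = [
    "http://blog.example-spam-farm.com/buy-now",
    "https://news.ycombinator.com/item?id=1",
]
crawlable = [u for u in frontier if not is_spam_host(u)]
print(crawlable)  # only the non-spam URL survives
```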
That said, if you have ever wondered why domains that used to have web sites on them suddenly become huge spam havens, it is because spammers buy up the domain as soon as it expires and try to exploit its previous reputation as a non-spam site to push link authority into some crawl (generally Google's).
Our index and API are primarily used by web marketers. We have no interest in building a general search engine, but there's nothing about the index itself that is specific to SEO.
For the most part, we just provide general facts about the web, and we've been contacted by academics on more than one occasion for data sets.
Is that kind of data called an "index" by any community, though? It seems quite confusing to me, and you can see from some of the other comments here that it's confusing to others as well.
They have an index of the web intended to mimic Google's index, which they use to provide reports to marketers. With their service, a search engine optimizer can answer questions such as: how many incoming links does this URL have?
We provide a public-facing API that can answer lots of questions about URLs, FQDNs, and root domains. The example you gave is one of them. We calculate a large number of statistics about every URL, FQDN, and root domain that we know about on the web.
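For a sense of what such a lookup looks like, here's a small Python sketch using a hypothetical endpoint and field names (not our actual API), just to show the shape of a per-URL metrics query:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and response fields, only to illustrate the idea
# of a per-URL metrics lookup.
API = "https://api.example.com/v1/url-metrics"

def url_metrics(target_url, api_key):
    """Fetch link counts and related statistics for a single URL."""
    query = urllib.parse.urlencode({"target": target_url, "key": api_key})
    with urllib.request.urlopen(f"{API}?{query}") as resp:
        return json.load(resp)

# e.g. metrics = url_metrics("https://example.com/some-page", "MY_KEY")
#      print(metrics["incoming_links"])   # answers "how many incoming links?"
```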
We also calculate MozRank, which is our version of Google's PageRank, as well as some in-house higher-level metrics like PageAuthority and DomainAuthority, which are machine learning models derived from all of the other metrics we compute.
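For readers who haven't seen how PageRank-style scores are computed, here is the textbook power-iteration version in Python on a toy link graph. This illustrates the general idea behind link-popularity metrics like MozRank; it is not our production formula.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly across all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))  # "c" ends up with the highest score
```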