This is a great description of why building crawlers (and indexes) is such a hard problem. Basically 90% of the "web" is now crap, and by crap I mean stuff you would never ever want to visit as a real human being. Our crawler once found an entire set of subdomains with nothing but Markov-chain-generated "forum" pages, and of course SEO links for PageRank love (note to SEO types, this hasn't fooled Google for at least 6 years).
The explosion of cheap CPU and storage means that a single server with a few terabytes of disk can serve up a billion or more spam pages. And seemingly everyone who gets into the game starts with "I know, we'll create a lot of web sites that link to this thing I'm trying to get to rank in Google results ..." Worse, when it doesn't work they don't bother taking that crap down, they just link to it from more and more other sites in an attempt to improve its host authority. That doesn't work either (for getting PageRank).
But what it means is that 99.9% of all new web pages created on a given day are created by robots or algorithms or other agencies without any motive to provide value, merely to provide "inventory" for advertisements. You are lucky if you can pull a billion "real" web pages out of a crawl frontier of 100 billion URIs.
Wow, very interesting comment, thanks! Wouldn't it make sense to build (and maintain) a kind of "official" reference of all pure spam domains? Or does this list already exist?
Well, every crawler has to have a list like this; the Blekko crawler tries to keep these pages out of the index (with varying levels of success). But it's not particularly useful for non-crawlers, and since every crawler will have its own way of evaluating hosts (possibly uniquely), it isn't really transportable.
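If you're curious what "keeping these pages out" looks like in practice, here is a minimal sketch (Python, with made-up host names and a hand-rolled helper, not Blekko's actual code): the frontier gets filtered against a set of known-spam hosts before anything is fetched.

    from urllib.parse import urlparse

    # Hosts the crawler has decided are pure spam (hypothetical examples).
    SPAM_HOSTS = {
        "example-spam-farm.com",
        "cheap-pills-forum.net",
    }

    def is_spam_host(url):
        """Return True if the URL's host, or any parent domain, is on the spam list."""
        host = urlparse(url).hostname or ""
        parts = host.split(".")
        # Check the host and every parent domain, so sub.spam-site.com is
        # caught by an entry for spam-site.com (but never block on the TLD alone).
        for i in range(len(parts) - 1):
            if ".".join(parts[i:]) in SPAM_HOSTS:
                return True
        return False

    def filter_frontier(frontier_urls):
        """Drop URLs on known-spam hosts before they ever get fetched."""
        return [u for u in frontier_urls if not is_spam_host(u)]

    # Only the first URL survives.
    print(filter_frontier([
        "http://example.com/page",
        "http://blog.example-spam-farm.com/markov-forum-thread-123",
    ]))

The hard part isn't this lookup, it's deciding which hosts belong in the set in the first place, and that scoring logic is what differs from crawler to crawler and doesn't transport.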
That said, if you have ever wondered why domains that used to have web sites on them suddenly become huge spam havens, it is because spammers buy them up as soon as they expire and try to exploit their previous reputation as non-spam sites to push link authority into some (generally Google's) crawl.