This is a really great description of why building crawlers (and indexes) is such a hard problem. Basically 90% of the "web" is now crap, and by crap I mean stuff you would never ever want to visit as a real human being. Our crawler once found an entire set of subdomains with nothing but Markov chain generated "forum" pages, and of course SEO links for PageRank love (note to SEO types, this hasn't fooled Google for at least 6 years).
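For anyone who hasn't seen this stuff up close: a word-level Markov chain is about all it takes to churn out plausible-looking "forum" text at scale. Here's a toy Python sketch (made-up seed text, chain order of 2), purely illustrative and nothing from our actual crawler:

```python
import random
from collections import defaultdict

def build_chain(corpus, order=2):
    """Map each `order`-word tuple to the words observed to follow it."""
    words = corpus.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=50):
    """Walk the chain from a random starting key to emit spam-grade text."""
    key = random.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(key):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# With a tiny seed this just echoes the input; feed it a scraped corpus
# and you get endless pages of grammatical-looking nonsense.
seed = "post your questions here and our helpful forum members will answer them quickly"
print(generate(build_chain(seed), length=20))
```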
The explosion of cheap CPU and storage means that a single server with a few terabytes of disk can serve up a billion or more spam pages. And seemingly everyone who gets into the game starts with "I know, we'll create a lot of web sites that link to this thing I'm trying to get to rank in Google results ..." Worse, when it doesn't work they don't bother taking that crap down; they just link to it from more and more other sites in an attempt to improve its host authority. That doesn't work either (for getting PageRank).
But what it means is that 99.9% of all new web pages created on a given day are created by robots or algorithms or other agencies with no motive to provide value, merely to provide "inventory" for advertisements. You are lucky if you can pull a billion "real" web pages out of a crawl frontier of 100 billion URIs.
Wow, very interesting comment, thanks! Wouldn't it make sense to build (and maintain) a kind of "official" reference list of all pure spam domains? Or does such a list already exist?
Well, every crawler has to have a list like this; the Blekko crawler tries to keep these pages out of the index (with varying levels of success). But it's not particularly useful for non-crawlers, and since every crawler has its own (possibly unique) way of evaluating hosts, it isn't really transportable.
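To give a rough idea of what "keep these pages out of the index" means in practice, here's a minimal Python sketch that filters a crawl frontier against a denylist of spam hosts, checking parent domains so spammy subdomains inherit the verdict. The host names are made up; a real crawler's list would be enormous and constantly re-evaluated.

```python
from urllib.parse import urlsplit

# Hypothetical denylist; a production crawler would load millions of entries
# and refresh them as hosts are re-scored.
SPAM_HOSTS = {"example-spam-farm.com", "cheap-pills.example.net"}

def is_spam_host(url):
    """Return True if the URL's host, or any parent domain of it, is denylisted."""
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    # Check the host itself plus every parent domain above it.
    return any(".".join(parts[i:]) in SPAM_HOSTS for i in range(len(parts) - 1))

frontier = [
    "http://blog.example-spam-farm.com/buy-now",
    "https://news.ycombinator.com/item?id=1",
]
crawlable = [u for u in frontier if not is_spam_host(u)]
print(crawlable)  # only the non-spam URL survives
```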
That said, if you have ever wondered why domains that used to have web sites on them suddenly become huge spam havens, it is because spammers buy up the domain as soon as it expires and try to exploit its previous reputation as a non-spam site to push link authority into some crawl (generally Google's).
Our index and API are primarily used by web marketers. We have no interest in building a general search engine, but there's nothing about the index itself that is specific to SEO.
For the most part, we just provide general facts about the web, and we've been contacted by academics on more than one occasion for data sets.
Is that kind of data called an "index" by any community, though? It seems quite confusing to me, and you can see from some of the other comments here that it's confusing to others as well.
They have an index of the web intended to mimic Google's index, which they use to provide reports to marketers. With their service, a search engine optimizer can answer questions such as: how many incoming links does this URL have?
We provide a public-facing API that can answer lots of questions about URLs, FQDNs, and root domains. The example you gave is one of them. We calculate a large number of statistics about every URL, FQDN, and root domain that we know about on the web.
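For a sense of what such a lookup looks like, here's a small Python sketch using a hypothetical endpoint and field names (not our actual API), just to show the shape of a per-URL metrics query:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and response fields, only to illustrate the idea
# of a per-URL metrics lookup.
API = "https://api.example.com/v1/url-metrics"

def url_metrics(target_url, api_key):
    """Fetch link counts and related statistics for a single URL."""
    query = urllib.parse.urlencode({"target": target_url, "key": api_key})
    with urllib.request.urlopen(f"{API}?{query}") as resp:
        return json.load(resp)

# e.g. metrics = url_metrics("https://example.com/some-page", "MY_KEY")
#      print(metrics["incoming_links"])   # answers "how many incoming links?"
```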
We also calculate MozRank, which is our version of Google's PageRank, as well as some in-house higher-level metrics like PageAuthority and DomainAuthority, which are machine learning models derived from all of the other metrics we compute.
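For readers who haven't seen how PageRank-style scores are computed, here is the textbook power-iteration version in Python on a toy link graph. This illustrates the general idea behind link-popularity metrics like MozRank; it is not our production formula.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly across all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))  # "c" ends up with the highest score
```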