Carnegie Mellon Public IR/Web Mining Datasets

Anon84 · on Feb 13, 2009

I wonder HOW/IF they'll ever release this particular dataset:

    web08-bst.v1

         * Description: A 25 terabyte dataset of about 1 billion web pages crawled in November, 2008.
           The crawl order was best-first search, using the OPIC metric. The crawl was started from about
           25 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200
           million page crawl, or ii) were ranked highly by a commercial search engine for one of 16,000
           sample queries in one of 10 languages. This dataset covers web content in English, Chinese, Spanish,
           Japanese, French, German, Arabic, Portuguese, Korean, and Italian.
         * Creators: J. Callan, M. Hoy, C. Yoo, and L. Zhao.
         * Status: In progress. Expected to be available to other researchers by March, 2009.

I know of several research groups and start-ups that wouldn't mind playing with it.
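
For anyone wondering what "best-first search, using the OPIC metric" looks like in practice, here is a rough sketch. It is a heavily simplified take on OPIC (Online Page Importance Computation) and not the crawler the CMU group actually ran: each seed starts with an equal share of "cash", the page holding the most cash is fetched next, and its cash is split evenly among its outlinks. The `graph` dict, the URL names, and the `best_first_crawl` function are all made up for illustration; a real crawler would fetch and parse pages instead of reading a dict, and real OPIC also keeps a per-page history and revisits pages.

```python
def best_first_crawl(graph, seeds, budget):
    """Return the order in which pages are fetched.

    graph  -- hypothetical toy web: dict mapping a URL to its outlink URLs
    seeds  -- starting URLs, each given an equal share of the initial cash
    budget -- maximum number of pages to fetch
    """
    # Cash currently held by each known-but-unfetched page.
    frontier = {url: 1.0 / len(seeds) for url in seeds}
    fetched = []
    seen = set()

    while frontier and len(fetched) < budget:
        # Best-first: pick the page with the most cash (ties broken by URL).
        url = min(frontier, key=lambda u: (-frontier[u], u))
        cash = frontier.pop(url)
        fetched.append(url)
        seen.add(url)

        outlinks = graph.get(url, [])
        if not outlinks:
            continue

        # Split this page's cash evenly among its outlinks.
        share = cash / len(outlinks)
        for target in outlinks:
            # Simplification: cash sent to already-fetched pages is dropped.
            if target not in seen:
                frontier[target] = frontier.get(target, 0.0) + share

    return fetched


if __name__ == "__main__":
    toy_web = {
        "seed": ["x", "y", "z"],
        "x": ["z"],
        "y": ["z"],
        "z": [],
    }
    print(best_first_crawl(toy_web, ["seed"], budget=4))
```

In the toy graph, `z` gets fetched before `y` even though both were discovered at the same time, because `x` passes its cash on to `z` first. That is the general idea: pages that many already-crawled pages point to get pulled forward in the queue.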

gtani · on Feb 13, 2009

http://www.kdnuggets.com/datasets/index.html