I wonder how (or if) they'll ever release this particular dataset:
* Description: A 25 terabyte dataset of about 1 billion web pages crawled in November, 2008.
The crawl order was best-first search, using the OPIC metric. The crawl was started from about
25 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200
million page crawl, or ii) were ranked highly by a commercial search engine for one of 16,000
sample queries in one of 10 languages. This dataset covers web content in English, Chinese, Spanish,
Japanese, French, German, Arabic, Portuguese, Korean, and Italian.
* Creators: J. Callan, M. Hoy, C. Yoo, and L. Zhao.
* Status: In progress. Expected to be available to other researchers by March, 2009.
I know of several research groups and start-ups that wouldn't mind playing with it.
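
For anyone unfamiliar with the crawl order mentioned in the description, here is a rough Python sketch of what "best-first search, using the OPIC metric" could look like. It is not the crawler the creators used, whose details the description doesn't give. My assumptions: a toy in-memory link graph stands in for real fetching, each page is visited only once, and the cash of a page with no outlinks is simply dropped. The names `opic_crawl`, `graph`, `seeds`, and `budget` are made up for illustration.

```python
from collections import defaultdict


def opic_crawl(graph, seeds, budget):
    """Visit up to `budget` pages, always taking the page holding the most cash."""
    cash = defaultdict(float)     # importance waiting to be passed on
    history = defaultdict(float)  # cash each page held when it was crawled
    for url in seeds:
        cash[url] = 1.0 / len(seeds)  # seeds share one unit of cash equally

    frontier = set(seeds)
    crawled = set()
    order = []
    while frontier and len(order) < budget:
        # Best-first step: the uncrawled page with the most cash goes next
        # (sorting first breaks ties alphabetically, so runs are repeatable).
        url = max(sorted(frontier), key=lambda u: cash[u])
        frontier.remove(url)
        crawled.add(url)
        order.append(url)

        # OPIC step: record the page's cash, then split it among its outlinks.
        amount = cash[url]
        history[url] += amount
        cash[url] = 0.0
        outlinks = graph.get(url, [])  # assumption: stands in for fetch + parse
        for target in outlinks:
            cash[target] += amount / len(outlinks)
            if target not in crawled:
                frontier.add(target)
    return order, dict(history)


# Hypothetical four-page web: each key links to the pages in its list.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
order, history = opic_crawl(toy_graph, seeds=["a", "d"], budget=4)
print(order)
```

In this toy run, page `c` is fetched before `b` because two crawled pages link to it, so it has collected more cash by the time the crawler chooses between them. That is the general idea behind letting an importance estimate steer the crawl.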