Carnegie-Mellon Public IR/Web Mining Datasets (cmu.edu)
11 points by Anon84 on Feb 13, 2009 | 2 comments



I wonder how, or even if, they'll ever release this particular dataset:

    web08-bst.v1

         * Description: A 25 terabyte dataset of about 1 billion web pages crawled in November, 2008. 
         The crawl order was best-first search, using the OPIC metric. The crawl was started from about 
         25 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200 
         million page crawl, or ii) were ranked highly by a commercial search engine for one of 16,000 
         sample queries in one of 10 languages. This dataset covers web content in English, Chinese, Spanish,
         Japanese, French, German, Arabic, Portuguese, Korean, and Italian.
         * Creators: J. Callan, M. Hoy, C. Yoo, and L. Zhao.
         * Status: In progress. Expected to be available to other researchers by March, 2009.
I know of several research groups and start-ups that wouldn't mind playing with it.
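
For anyone unfamiliar with the crawl order mentioned in the description, here is a rough sketch of best-first crawling with an OPIC-style priority, assuming a tiny in-memory link graph instead of real fetches. The function name `opic_crawl_order`, the toy graph, and the tie-breaking rule are all made up for illustration, and this is not the CMU crawler's actual code. It also simplifies OPIC: each page is fetched only once, and cash from pages with no outlinks is simply dropped.

```python
def opic_crawl_order(graph, seeds, budget):
    """Return the order in which pages are fetched, best-first by OPIC cash.

    graph  : dict mapping a URL to the list of URLs it links to (toy stand-in
             for fetching and parsing a page; an assumption of this sketch)
    seeds  : starting URLs, each given an equal share of the initial cash
    budget : maximum number of pages to fetch
    """
    cash = {url: 1.0 / len(seeds) for url in seeds}  # undistributed importance
    history = {}                                     # cash a page held when fetched
    visited = set()
    order = []

    while len(order) < budget:
        # Frontier = discovered but not yet fetched pages.
        frontier = [u for u in cash if u not in visited]
        if not frontier:
            break
        # Best-first: highest cash wins; ties broken alphabetically
        # (an arbitrary choice made here for determinism).
        url = min(frontier, key=lambda u: (-cash[u], u))

        visited.add(url)
        order.append(url)
        history[url] = history.get(url, 0.0) + cash[url]

        # Split the page's cash equally among its outlinks, then reset it.
        outlinks = graph.get(url, [])
        if outlinks:
            share = cash[url] / len(outlinks)
            for target in outlinks:
                cash[target] = cash.get(target, 0.0) + share
        cash[url] = 0.0

    return order


# Toy link graph (hypothetical data).
toy_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
print(opic_crawl_order(toy_graph, ["a", "d"], budget=3))
```

In this toy run, page `c` gets fetched before `b` because it collects cash from both `a` and `d`, which is the basic idea: pages that many already-fetched pages link to move to the front of the queue.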




