
Let me know if you know of this approach being used in production somewhere, processing millions of web pages. I would be very interested to know how they overcome the difficulties!

I can imagine the cost/benefit of the approach is favorable for the largest search engines like Google and Bing, which are trying to squeeze the last few percentage points of precision out of their results. For everybody else, the engineering and scaling difficulties are probably too big. I'd love to be proven wrong.




Google and Bing are doing billions of web pages, not millions. I process millions of web pages myself with 3 computers -- millions aren't a lot these days, although I'm not currently using clustering methods.

Rather, I'm selective with my inputs so that I start with unscrambled eggs; that lets me improve precision not by "a few percentage points" but by cutting the false-positive rate by an order of magnitude.

My use of ML so far has been modest, limited to solving a few straightforward problems. Personally I think search is boring (at web scale it's too big a game for small players, and search as we know it probably can't get much better because queries aren't precise -- better performance will require changing the game), but I've been forced to put effort into it because end users expect it.



