
Let me know if you know of this approach being used in production somewhere, processing millions of web pages. I would be very interested to know how they overcome the difficulties!

I can imagine the cost/benefit of the approach is favorable for the largest search engines like Google and Bing, which are trying to squeeze the last few percentage points of precision out of their results. For everybody else, the engineering and scaling difficulties are probably too big. I'd love to be proven wrong.




Google and Bing are doing billions of web pages, not millions. I process millions of web pages myself with 3 computers -- millions aren't a lot these days, although I'm not currently using clustering methods.

Rather, I'm selective with my inputs so that I start with unscrambled eggs; that lets me improve precision not by "a few percentage points" but by cutting the false-positive rate by an order of magnitude.

My use of ML so far has been modest, limited to solving a few straightforward problems. Personally I think search is boring (at web scale it's too big a game for small players, and search as we know it probably can't get much better because queries aren't precise -- better performance will require changing the game), but I've been forced to put effort into it because end users expect it.



