
Methods that cluster similar web pages are mostly too CPU-intensive for processing larger sets (we're talking millions of web pages). They are also harder to scale from a data-locality perspective: you need to figure out which pages belong together and then bring their data to the same place.

Looking at pages in isolation is much more horizontally scalable. You can take a look at Webstemmer (http://www.unixuser.org/~euske/python/webstemmer/index.html) for a method exploiting similarities.




Great reply. However, I think something is only worth doing if it's impossible.

Your argument is like Chomsky's argument about the poverty of the stimulus, just in reverse. There are heuristics that let us radically prune the N^2 possible relationships between things into a much smaller set that will let us do things that would be otherwise unscalable.
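To make the pruning idea concrete, here is a minimal sketch of one such heuristic: MinHash signatures with LSH banding. This is a technique I'm picking for illustration, not something named above. Pages are only compared when they land in the same bucket, so most of the N^2 pairs are never looked at. The function names, the 3-word shingles, and the 32-hash / 16-band settings are all illustrative assumptions, not tuned values.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in a text (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=32):
    """One minimum per seeded hash; similar sets tend to share minima."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def candidate_pairs(docs, num_hashes=32, bands=16):
    """Return only the pairs of doc ids that share at least one LSH bucket."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for band in range(bands):
            # Each band of the signature is its own bucket key.
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)

    pairs = set()
    for ids in buckets.values():
        # Pairwise work happens only inside a bucket, not across all docs.
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Calling `candidate_pairs({"a": text_a, "b": text_b, ...})` gives back the short list of pairs worth an expensive comparison. Bucketing is one pass per page, which also helps with the data-locality worry, since the bucket key tells you where a page's data should go.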


Let me know if you know of this approach being used somewhere in production processing millions of web pages. I would be very interested to know how they overcome the difficulties!

I can imagine the cost/benefit of the approach is favorable for the largest search engines like Google and Bing that are trying to squeeze the last few percentage points of precision out of results. For everybody else, the engineering and scaling difficulties are probably too big. I'd love to be proven wrong.


Google and Bing are doing billions of web pages, not millions. I process millions of web pages myself with 3 computers -- millions aren't a lot these days, although I'm not currently using clustering methods.

Rather, I'm selective with my inputs, so I start with unscrambled eggs. That lets me improve precision not by "a few percentage points" but by cutting the false positive rate by an order of magnitude.

My use of ML so far has been modest, limited to solving a few straightforward problems. Personally I think search is boring (at web scale it's too big of a game for small players, plus search as we know it probably can't get much better because the queries are not precise; better performance will require changing the game), but I've been forced to put effort into it because end users expect it.



