Hacker News new | past | comments | ask | show | jobs | submit login
Yinyang K-Means: A Replacement for Classic K-Means with Consistent Speedup (jmlr.org)
45 points by jcr on July 15, 2015 | hide | past | favorite | 8 comments



Huge, if true.

Researchers included a link to (C++) source code: http://research.csc.ncsu.edu/nc-caps/yykmeans.tar.bz2


I agree it's useful although there also exists a more scalable approximate variant of classical k-Means and the new Yinyang k-Means (referenced in the paper), namely Mini-batch k-Means:

http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

This approximate method is implemented (at least) in sofia-ml and scikit-learn.


Another approach is to transform the dataset into a smaller but representative dataset, called a core-set, and running k-means on that core-set instead which produces a "(1+e )-approximation for the optimal cluster centers".

See http://people.csail.mit.edu/dannyf/kmeancoreset.pdf


> Overall average Speedup (X) of Yinyang over Standard (9.36), Elkan (6.12), Drake (3.08)

There wasn't a single mention on GPU use. So I'm wondering if the yinyang algorithm would port well to the GPU.

Just a quick google led me to reports of GPU speedups of 2x to 75x.


I haven't read the paper yet, but I'm about to. Before I do, I want to ask: do people usually implement O(n^2) distance calculations in k-means when trying to find the closest centroid that each point is to? I think that's what I usually see, which seems so dumb to me. Are metric trees that obscure?



That would be interesting to investigate but you have to keep in mind that the centroids move between each iterations. So you have to re-build a new metric tree at each iteration. kd-tree / build time complexity are between `n log(n)` and `n (log(n))^2`. You don't benefit from the classifical speed-up of metric trees where you build the tree once and then do many queries on the same tree.

Also metric trees such as k-d tree and ball tree do no bring any speed-up in high dimensions.


>Also metric trees such as k-d tree and ball tree do no bring any speed-up in high dimensions.

This is not true. Only their naive implementations don't bring speedups in high dimensions and it depends on the number of data points, query points, and dimensions. Best bin first is one common optimization.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: