Yinyang K-Means: A Replacement for Classic K-Means with Consistent Speedup

huac · on July 15, 2015

Huge, if true.

Researchers included a link to (C++) source code: http://research.csc.ncsu.edu/nc-caps/yykmeans.tar.bz2

ogrisel · on July 15, 2015

I agree it's useful although there also exists a more scalable approximate variant of classical k-Means and the new Yinyang k-Means (referenced in the paper), namely Mini-batch k-Means:

http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

This approximate method is implemented (at least) in sofia-ml and scikit-learn.

thecopy · on July 15, 2015

Another approach is to transform the dataset into a smaller but representative dataset, called a core-set, and running k-means on that core-set instead which produces a "(1+e )-approximation for the optimal cluster centers".

See http://people.csail.mit.edu/dannyf/kmeancoreset.pdf

ilurk · on July 15, 2015

> Overall average Speedup (X) of Yinyang over Standard (9.36), Elkan (6.12), Drake (3.08)

There wasn't a single mention on GPU use. So I'm wondering if the yinyang algorithm would port well to the GPU.

Just a quick google led me to reports of GPU speedups of 2x to 75x.

jordigh · on July 15, 2015

I haven't read the paper yet, but I'm about to. Before I do, I want to ask: do people usually implement O(n^2) distance calculations in k-means when trying to find the closest centroid that each point is to? I think that's what I usually see, which seems so dumb to me. Are metric trees that obscure?

pornel · on July 15, 2015

I use vantage point trees for finding the closest point:

https://github.com/pornel/vpsearch (https://github.com/pornel/pngquant/blob/master/lib/nearest.c)

ogrisel · on July 15, 2015

That would be interesting to investigate but you have to keep in mind that the centroids move between each iterations. So you have to re-build a new metric tree at each iteration. kd-tree / build time complexity are between `n log(n)` and `n (log(n))^2`. You don't benefit from the classifical speed-up of metric trees where you build the tree once and then do many queries on the same tree.

Also metric trees such as k-d tree and ball tree do no bring any speed-up in high dimensions.

CyberDildonics · on July 15, 2015

>Also metric trees such as k-d tree and ball tree do no bring any speed-up in high dimensions.

This is not true. Only their naive implementations don't bring speedups in high dimensions and it depends on the number of data points, query points, and dimensions. Best bin first is one common optimization.