Some comments on K-Means: one large limitation of K-Means is that it assumes roughly spherical clusters, and it tends to fail badly for any other cluster shape.
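A minimal sketch of that failure mode, using sklearn's two-moons toy data rather than the article's dataset, so the data here is purely illustrative:

    # K-Means on non-spherical (crescent-shaped) clusters.
    from sklearn.datasets import make_moons
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Agreement with the true moon membership is poor, because K-Means cuts
    # the plane with a straight boundary between the two centroids.
    print(adjusted_rand_score(y_true, labels))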
It's interesting that the author compared results on the same data set for the different algorithms. Each clustering approach is going to work best on a specific type of data set. It would be interesting to compare them across several different data sets to get a better feel for strengths/weaknesses, etc.
I agree that on some level more data sets would be nice, but I felt that it cluttered and obscured the exposition. Instead I used the one synthetic dataset, but crafted it to have various properties (noise, cluster shape, variable density, non-standard distributions) that will confound many different clustering approaches ... it is meant to be the "hard" case, with all the difficulties and confounding factors rolled into one dataset.
It's included in the upper left corner of the plots. To be fair, these are for the sklearn implementations, some of which are excellent, but I can't speak for the performance of all of them.
"It would be interesting to compare them across several different data sets to get a better feel for strengths/weaknesses, etc."
I agree. I think that would be the logical next step for this article: show various real world examples and describe why certain clustering algorithms might be better suited to those types of problems. But all in all, awesome article, thanks for the education.
This is very specific to 2D data. I bet the story is a lot different for high-dimensional data. The challenges you encounter with this sort of clumping in 2D are unlikely to occur in high-dimensional data due to the "curse" of dimensionality. Clustering in high dimensions has its own quirks and gotchas too, but they're quite distinct from the gotchas of low dimensionality.
The examples are all in 2D because it allows someone to visualise what's going on. In higher dimensions things get messier and you have to rely on cluster quality measures ... which are often bad (or, more often, defined to be the objective function that a particular clustering algorithm optimizes, and hence give a false sense of how "well" the clustering has done). I've worked with HDBSCAN quite successfully on mid-range dimensionality data (50 to 100 dimensions). I agree that it doesn't work as well on truly high dimensional data, but then little can once the curse of dimensionality truly kicks in. Your best bet, at that point, is to assume that your data actually lies on some lower dimensional manifold (i.e. the intrinsic dimensionality is much lower), apply suitable dimension reduction techniques (t-SNE, Robust PCA, etc.) and then cluster.
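A rough sketch of that reduce-then-cluster workflow, assuming the hdbscan package is installed and using PCA as a stand-in for whatever dimension reduction suits the data (the data below is just a random placeholder):

    # Reduce first, then cluster. Assumes the hdbscan package is installed;
    # PCA stands in for whatever reduction (t-SNE, Robust PCA, ...) fits your data,
    # and the random matrix below is a placeholder for real high-dimensional data.
    import numpy as np
    from sklearn.decomposition import PCA
    import hdbscan

    rng = np.random.default_rng(0)
    X_high = rng.normal(size=(1000, 200))

    X_low = PCA(n_components=20).fit_transform(X_high)   # assumed intrinsic dimension
    labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X_low)
    print(np.unique(labels))   # -1 marks points treated as noise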
These all look like clustering based on a 2d space, but does anyone know methods to tackle clustering on a network?
Is it just a matter of tweaking the definition of density / distance to the number of hops, or is it a different problem entirely? I can see how with 0 or 1 hops the data would be a very smushed distribution, versus 2d distance is much more rich and spread out.
If you want to use an algorithm like HDBSCAN on graphs then I suggest you look into Spectral Clustering (which traditionally uses K-Means, but could use HDBSCAN instead). Otherwise you may want to consider graph-specific algorithms such as Louvain.
Look at spectral clustering. Spectral methods usually take the distances between points to generate a graph, then operate directly on that graph. If you already have a graph/network, you can use those methods directly on it.
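As a sketch (not from the article), sklearn's SpectralClustering can take a precomputed affinity matrix, so an existing adjacency matrix can be clustered without ever defining coordinates:

    # Spectral clustering directly on a network: sklearn's SpectralClustering
    # accepts a precomputed affinity matrix, so an adjacency matrix works as input.
    import numpy as np
    from sklearn.cluster import SpectralClustering

    # Toy graph: two triangles joined by a single edge.
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]])

    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(A)
    print(labels)   # the two triangles should end up in different clusters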
I would say there are two aspects of clustering that are important: accuracy of the model, and accuracy of the model fitting process.
A model is a guess about the underlying process that "generates" the data. If you're trying to use hyperplanes to divide data that lies on a manifold, then you are going to have poor results no matter how good your fitting algorithm is.
On the other hand, even if you know the true model, high levels of noise can prevent you from recovering the correct parameters. For instance, Max-Cut is NP-hard, and the best we can do is a semidefinite programming approximation. Beyond a certain noise threshold, the gap between the SDP solution and the true solution becomes very large very quickly.
A soft-margin SVM would be useful here. And there are manifold-aware topological clustering algorithms, like persistent homology techniques; Gunnar Carlsson is the man on this new clustering offshoot.
Oddly enough HDBSCAN can be recast as a persistent homology computation; the trick is in simplicial complex construction; you need something slightly more density aware than traditional Rips complexes. I am currently working on designing a version of Carlsson and Memoli's Multiparameter Hierarchical Clustering based on the persistent homology interpretation of HDBSCAN.
This is a great resource. As someone who knows little about clustering, this was clear and very informative. It covers potential pitfalls much better than other similar documents I've seen, which is a useful approach.
The t-SNE algorithm isn't a clustering algorithm, but a nonlinear embedding. A clustering algorithm assigns a label (or no label) to each point in the data set. t-SNE instead finds low-dimensional coordinates for each point such that points that are nearby in the original data are nearby in the lower-dimensional representation.
That being said, I've seen t-SNE used as a preprocessing step prior to clustering, as in ACCENSE[1].
As someone interested in the subject but with 0 insight or knowledge: would these algorithms be a good match for short text clustering? For example, identifying identical products in a price comparison app based on their similar but not identical titles/names and possibly other attributes.
The answer is maybe. You need to reduce that text to a set of features that you can pipe into a clustering algorithm, and what’ll really make or break your approach is how you convert the text.
There are two types of features and related clustering algorithms: categorical and numerical. Numerical features are the most commonly supported by out-of-the-box clustering tools (meaning most of the algorithms made easily available to you will expect numerical features). Categorical features can be reduced to binary numerical features, though whether that makes sense for your data depends: for example, if products have a set of categories they can be sorted into, you may want to just use that data along with K-Modes clustering or some other approach that handles categorical features. (Though to me those don’t really seem to produce useful results.)
For short text clustering I’d try an ngram frequency approach, wherein you reduce the text to a set of features describing the frequency of ngrams from, say, unigram up to trigram. You’re attempting to balance the number of features you end up needing to process with the amount of locality information you need to get a useful result. If you end up with far too many features, you could attempt to cull them with an approach such as PCA, but it may not even be necessary.
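To make that concrete, here's a tiny hedged sketch; the product titles are made up, and sklearn's tf-idf vectorizer is used in place of raw n-gram counts:

    # Character n-grams (1- to 3-grams) for short, noisy product titles.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    titles = ["Acme Widget 500 black",
              "ACME Widget500 (black)",
              "Acme Widget 500 - blue",
              "Globex Gadget Pro 2",
              "Globex GadgetPro II"]

    # Lowercased character n-grams are fairly forgiving of small spelling and
    # formatting differences; tf-idf vectors are L2-normalised, so plain K-Means
    # behaves roughly like cosine-based clustering here.
    X = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)).fit_transform(titles)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # Acme and Globex titles should land in separate clusters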
Do note that I’m only a little above 0 in terms of insight or knowledge, so take my advice with a good heap of salt. Clustering of this data would group together products that are described “similarly”, but “similarly” in this case doesn’t imply the products are in fact similar, but only that they’re described in a similar manner.
You may also want to explore graph clustering over vector clustering, which—while there may be no out of the box solutions readily available—is likely a much better fit for text in general.
If you're interested in clustering text documents, the canonical algorithm would be latent Dirichlet allocation, which is a topic modeling algorithm. You can find latent Dirichlet allocation in sklearn; however, it sounds like you're looking for something that returns a raw similarity score, in which case it might be interesting to check out word2vec. Perhaps check out this Stack Overflow answer: https://stackoverflow.com/questions/22129943/how-to-calculat...
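For reference, a minimal sketch of the LDA route in sklearn (toy documents; LDA generally wants much longer texts than product titles):

    # Topic modelling with sklearn's LatentDirichletAllocation (toy documents;
    # LDA generally needs far more text per document to give sensible topics).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["cheap android phone with big battery",
            "android smartphone battery life review",
            "espresso machine with milk frother",
            "best home espresso and coffee grinder"]

    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    topic_mix = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
    print(topic_mix.round(2))   # per-document topic proportions, not hard cluster labels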
In addition to everyone suggesting the classic n-gram approaches, it is now rather easy to use a word2vec (google it) representation of the words instead: obtain a mapping between words and an array of x numbers (either by finding a pretrained word2vec model on the internet or training one on texts from your own domain), and then just run clustering on those vectors.
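A hedged sketch of that idea using gensim; the model path is a placeholder, and averaging word vectors per title is just one crude way to get a single vector per product:

    # word2vec route via gensim. "vectors.bin" is a placeholder path for some
    # pretrained model; averaging word vectors per title is a crude but common
    # way to get one vector per product before clustering.
    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.cluster import KMeans

    wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    def title_vector(title):
        words = [w for w in title.lower().split() if w in wv]
        return np.mean([wv[w] for w in words], axis=0) if words else np.zeros(wv.vector_size)

    titles = ["acme widget 500 black", "acme widget500 black", "globex gadget pro 2"]
    X = np.vstack([title_vector(t) for t in titles])
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))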
Density Peak clustering is an interesting idea, but it is still centroid based and will have many of the same issues as K-Means (though it may pick better centroids for this case). It also involves a little bit of parameter tuning (particularly with regard to bandwidth), and can be relatively slow if not suitably accelerated with spatial indexing structures such as kd-trees or cover trees.
It would be interesting to know which agglomerative clustering linkage the author is using here. I suspect that for this two-dimensional, density-based cluster dataset, single-link agglomerative would perform much better than what is shown (likely average link).
It was actually Ward. I agree that single linkage would have performed better, however the noise would have greatly confused the issue. Robust Single Linkage (Chaudhuri and Dasgupta, 2010) would be the better choice in the presence of noise as with the test dataset (which was designed to be as difficult as possible for clustering algorithms, while being obvious to a human viewer).
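For anyone who wants to see the linkage difference for themselves, here's a quick sketch on toy moon-shaped data (not the article's test dataset):

    # Ward vs single linkage on toy moon-shaped data.
    from sklearn.datasets import make_moons
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import adjusted_rand_score

    X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
    for linkage in ("ward", "single"):
        labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
        print(linkage, adjusted_rand_score(y_true, labels))
    # Single linkage follows the crescents; Ward tends to cut each moon in half.
    # With more noise, single linkage starts chaining clusters together, which is
    # the failure mode Robust Single Linkage is designed to mitigate.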
That's fair, but the subtitle was intended to be a little controversial. I could have included more datasets, but ultimately that just clutters the exposition -- instead I chose a dataset that can illustrate several different ways clustering algorithms can break. Ultimately it is meant to be a teaser as to why you should use HDBSCAN; the real answer is to grab the code and run it on your data, or your favorite test datasets, and see how it performs for you (because there is no substitute for that). I'm actually pretty confident you'll find the results impressive enough to make HDBSCAN a go-to choice for clustering.
I don't know of any implementations yet. On the other hand it isn't that hard to get the basics working (see http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/mas... for an explanation of the algorithm). The trickier part is getting good performance, but for small datasets that doesn't really matter. Hopefully someone will put together an implementation for R soon.