How to Use t-SNE Effectively (distill.pub)
84 points by allenleein on Oct 15, 2016 | 15 comments



Has anyone here used t-SNE to visualize their high-dimensional data for machine learning? Were you able to confirm via t-SNE that your feature space was good?


t-SNE is used very widely in ML, both to analyze input data and to analyze learned representations.

To me, the canonical example of people using t-SNE this way is visualizing word embeddings, like this: http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/e...

Also, some lovely examples of using t-SNE to visualize conv net representations: http://cs.stanford.edu/people/karpathy/cnnembed/

I think people have found it very useful as one of the main tools for understanding what deep models are doing (along with optimization-based feature visualization) and to just check that your model is learning. I haven't really heard about people using it to pick input features, but that's probably mostly because I don't really work with anyone doing feature engineering.


> t-SNE is used very widely in ML, both to analyze input data and to analyze learned representations.

I was wondering whether t-SNE could be used to judge the "goodness" of your selected features for the prediction task. That is, could t-SNE serve as an indicator for feature selection? Has anyone used t-SNE successfully for feature selection, and in what cases?

Edit (off topic): cool homepage/blog.


I made an interactive visualization which clustered ~10,000 news headlines (converted to 50D with word2vec) via t-SNE to help illustrate groups of clickbait headlines: http://minimaxir.com/2016/08/clickbait-cluster/

I did not hyper-optimize the parameters for that visualization, so I find this post interesting. Additionally, it was pointed out afterward that I may have cheated slightly on labeling by feeding the news source to the t-SNE algorithm.


t-SNE is commonly used in single cell RNA sequencing experiments. These experiments use microfluidic technologies to profile the gene expression of thousands of single cells.

t-SNE is usually run on the top N (quite commonly N = 50) principal components. This tends to give good separation of different cell types.

A few example papers:

http://biorxiv.org/content/early/2016/07/26/065912
http://www.cell.com/abstract/S0092-8674(15)00549-8
http://www.cell.com/cell/abstract/S0092-8674(15)00500-0
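
As a rough illustration of the PCA step described above, here is a minimal NumPy sketch that projects a samples-by-features matrix onto its top 50 principal components. The random `expression` matrix, the `log1p` transform, and the helper name `top_principal_components` are assumptions made for the example, not anything taken from the linked papers.

```python
import numpy as np

def top_principal_components(X, n_components=50):
    """Project the rows of X (samples x features) onto the top principal components."""
    # PCA operates on mean-centered features.
    X_centered = X - X.mean(axis=0)
    # Thin SVD: rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    return X_centered @ Vt[:k].T  # shape: (n_samples, k)

rng = np.random.default_rng(0)
# Hypothetical stand-in for a cells x genes count matrix (200 cells, 1000 genes).
expression = rng.poisson(lam=2.0, size=(200, 1000)).astype(float)
reduced = top_principal_components(np.log1p(expression), n_components=50)
print(reduced.shape)  # (200, 50)
```

The `reduced` array is what would then be handed to a t-SNE implementation (for example scikit-learn's `TSNE`) in place of the raw matrix.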


This sounds interesting. I am in a field where the focus is much more on visualizing samples using different metrics with PCoA instead of regular PCA.

Skimming through the Zheng et al. bioRxiv paper, it all seems a little arbitrary to me: selecting 1,000 features, then 50 components. They argue that it is for computation time reasons, but is there any kind of benchmark suggesting this is a better strategy than just plotting the first two components, or using MDS, which also has the advantage in this scenario of being convex?
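
One point that may help frame the comparison: with plain Euclidean distances, classical MDS (which is what PCoA computes) has a closed-form solution and gives the same coordinates as PCA up to sign, so it only differs from plotting the first two components when another distance metric is used. A minimal NumPy sketch, with the random data and the helper name `classical_mds` as assumptions for illustration:

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical MDS / PCoA from a matrix of pairwise distances D (n x n)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)      # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                      # toy data: 30 samples, 5 features
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
coords = classical_mds(D, n_components=2)

# PCA scores for the same data, for comparison.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = Xc @ Vt[:2].T

# Columns may differ in sign, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(coords), np.abs(pca_scores), atol=1e-6)
print(same_up_to_sign)
```

With a non-Euclidean metric, only the construction of `D` changes; the eigendecomposition step stays the same.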


There are so many arbitrary choices made when analyzing single-cell RNA-seq data: coverage cutoffs to decide when a gene is expressed, arbitrary QA points to decide when a cell is "good quality", the PCs chosen for t-SNE, the genes identified as more variable than the estimated level of technical noise, and so on. It is very frustrating. This leads to huge issues with reproducibility: almost every paper uses its own in-house custom analysis pipeline, and these are rarely made open source.


The main reason is that no one wants to publish technical or "methods" type papers where they assess the technology. There are usually one or two initial big papers introducing the technology that make a big splash. No one subsequently wants to assess it or improve much on it, because it won't publish well and you likely will not get cited for it anyway.


Yes, I know the feeling.

Our field has some very arbitrary thresholds for noise on single features. It sounds like there is a slightly more principled strategy in single-cell genomics?


The Seurat R package (http://satijalab.org/seurat/) tries to give you more information on how to choose your PCs. But as with anything in biology, it becomes subjective at some point.


Thanks!

I happen to have some super high-dimensional data (~100k-1m dimensions), which takes a huge amount of time to work with because I have to custom-write everything, and I notice they claim all their underlying functions use sparse matrix representations. Have you tried it in a very high-dimensional context?


With single cell data, gene expression data from each cell is considered a dimension. So you end up with something in the range of 20-30k genes and possibly thousands of cells (~25k x thousands matrix). I don't think the technology is at the scale of hundreds of thousands of cells yet. So I am not sure if this package will handle 100k-1m dimensions.


The article mentions that:

"There may not be one perplexity value that will capture distances across all clusters—and sadly perplexity is a global parameter. Fixing this problem might be an interesting area for future research."

There are some suggestions in the literature for fixing this. Michel Verleysen's group suggested a "multi-scale" approach:

https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014...

http://dx.doi.org/10.1016/j.neucom.2014.12.095 (more details in this one, but behind a paywall)

Their approach is to calculate the input probabilities using multiple perplexities and use the average. They also suggest tweaking the output probabilities, but that uses a free parameter that isn't present in the standard formulation of t-SNE (their suggested algorithm takes the same approach as t-SNE, but uses a different cost function and output weighting function).
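
To make the averaging idea concrete, here is a minimal pure-Python sketch: for each point, the Gaussian precision is tuned by bisection to hit each target perplexity, and the resulting conditional distributions are averaged. This only illustrates the "multiple perplexities, then average" step as summarized above. It is not the authors' implementation, it leaves out their modified cost function and output weighting, and the toy points, perplexity values, and function names are all assumptions.

```python
import math
import random

def conditional_probs(sq_dists, i, perplexity, tol=1e-5, max_iter=100):
    """p_{j|i} for point i, with the Gaussian precision (beta) found by
    bisection so that the distribution's perplexity matches the target."""
    target_entropy = math.log(perplexity)            # entropy in nats
    others = [j for j in range(len(sq_dists)) if j != i]
    d = [sq_dists[i][j] for j in others]
    d_min = min(d)                                   # shift for numerical stability
    beta, lo, hi = 1.0, 0.0, math.inf
    p = []
    for _ in range(max_iter):
        w = [math.exp(-beta * (x - d_min)) for x in d]
        total = sum(w)
        p = [x / total for x in w]
        entropy = -sum(q * math.log(q) for q in p if q > 0)
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:                 # too flat: sharpen the kernel
            lo = beta
            beta = beta * 2 if hi == math.inf else (lo + hi) / 2
        else:                                        # too peaked: widen the kernel
            hi = beta
            beta = (lo + hi) / 2
    row = [0.0] * len(sq_dists)
    for j, q in zip(others, p):
        row[j] = q
    return row

def multiscale_probs(points, perplexities):
    """Average the conditional input probabilities over several perplexities."""
    n = len(points)
    sq = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    P = [[0.0] * n for _ in range(n)]
    for perp in perplexities:
        for i in range(n):
            row = conditional_probs(sq, i, perp)
            for j in range(n):
                P[i][j] += row[j] / len(perplexities)
    return P

rng = random.Random(0)
# Toy data: one tight cluster and one loose cluster, i.e. two different scales.
tight = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(10)]
loose = [(rng.gauss(5, 2.0), rng.gauss(5, 2.0)) for _ in range(10)]
points = tight + loose
P = multiscale_probs(points, perplexities=[2, 5, 10])
```

Each row of `P` is still a probability distribution over the other points. In standard t-SNE the conditional matrix would then be symmetrized before the embedding is optimized.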


Are there standalone packages that can perform t-SNE, or are we limited to using the versions in R, etc.?


There are plenty; this page contains a partial list: https://lvdmaaten.github.io/tsne/



