I cut the following sections, on user interface and machine learning, because they were too speculative. But they might be of interest here, so I'll post them.
===
Perhaps you are now persuaded that deep learning has something helpful to offer in visualization problems. But visualization is really about making interfaces for humans to interact with data. It’s a small subset of the general user interface problem. I think that machine learning, and deep learning in particular, also has a lot to offer for the general problem...
[Remainder moved to notehub, to keep this comment to a reasonable length: https://www.notehub.org/2015/1/16/perhaps-you-are-now-persua... ]
I really like the figure showing the nearest-neighbour graph of MNIST being stretched as it goes through the hidden sigmoid layer. Really helps build an intuition for those of us who think visually or geometrically.
He quotes Bret Victor at the end:
"When Hamming says there could be unthinkable thoughts, we have to take that as “Yes, but we build tools that adapt these unthinkable thoughts to the way that our minds work and allow us to think these thoughts that were previously unthinkable.”"
This is a fabulous article in general, and I love the Bret Victor quote as well.
And while I love the analogy -- and I think it's applicable to what we're calling data science -- thoughts, speaking broadly, are fundamentally different from sounds and smells and wavelengths of light. The essence of a thought is that we think it (it is thunk?). All of those other examples are subjective representations of physical phenomena, while what constitutes a thought, again broadly speaking, is less well agreed upon (understood?).
Still a terrific article, my pedantic nitpicking aside.
He makes an excellent point about the possibility of comparing word vectors trained on different corpora to make quantitative statements about differences in culture, either over time or between sub-cultures:
"I’d like to emphasize that which words are feminine or masculine, young or adult, isn’t intrinsic. It’s a reflection of our culture, through our use of language in a cultural artifact. What this might say about our culture is beyond the scope of this essay. My hope is that this trick, and machine learning more broadly, might be a useful tool in sociology, and especially subjects like gender, race, and disability studies."
Thanks! That was one of the most exciting parts of the post for me.
It would be really cool to have something like the Google Books ngram viewer [1] that would allow you to see how this changes over time, using a huge corpus. I imagine a graph where the x-axis is year, the y-axis is a linear combination of word vectors that the user defines, and the user can select words and see them plotted over time. A rough sketch of what I mean is below.
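Something like this, say (a hypothetical Python sketch: `models_by_year` is an assumed dict mapping each year to a gensim KeyedVectors model trained on that year's slice of the corpus; one caveat is that separately trained models live in different coordinate systems, so projecting each year's words onto an axis built inside that same year's model is one way to sidestep alignment):

    import matplotlib.pyplot as plt

    def axis_vector(kv, positive, negative):
        # user-defined axis: a linear combination of word vectors,
        # e.g. positive=["she"], negative=["he"], normalized to unit length
        v = sum(kv[w] for w in positive) - sum(kv[w] for w in negative)
        return v / (v @ v) ** 0.5

    def plot_over_time(models_by_year, words, positive, negative):
        years = sorted(models_by_year)
        for word in words:
            ys = [models_by_year[y][word] @ axis_vector(models_by_year[y], positive, negative)
                  for y in years]
            plt.plot(years, ys, label=word)
        plt.xlabel("year")
        plt.ylabel("projection onto user-defined axis")
        plt.legend()
        plt.show()

    # e.g. plot_over_time(models_by_year, ["nurse", "doctor"], positive=["she"], negative=["he"])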
Ha, I was thinking along exactly the same lines when I read that paragraph. Last week, I found myself reading a 1784 magazine article [1] about "Aerostatical Experiments" (the first hot air balloons) which referred to "inflammable air" which is what they called hydrogen in those days. Google n-gram viewer gives a beautiful illustration of when the name changed [2] -- this is an obvious switchover but I imagine many words change meaning and usage more slowly and in more subtle ways, so your proposal was quite exciting to think about. Let's hope someone takes up that line of research, either from the humanities side or the machine learning side.
Colah, your posts are really inspiring and thoughtful. Your idea of visualising the space of representations by looking at the properties of the pairwise distance matrix is quite illuminating. It might be a nice empirical way to get a glimpse of model complexity: If "simpler" models cluster close to more complex models, the simpler models are more desirable.
I wonder if all over-fitted models cluster in one region in the meta-SNE space, or do they show up as noise?
> If "simpler" models cluster close to more complex models, the simpler models are more desirable.
Well, it would suggest you aren't winning very much for your more complex model, at the very least.
> I wonder if all over-fitted models cluster in one region in the meta-SNE space, or do they show up as noise?
This corresponds to an empirical question: do models overfit in the same way, or in different ways?
One small experiment I did, which might offer some intuition here, was training lots of extremely small networks on MNIST, with hidden layers of only 1, 2 or 5 neurons. What do they look like in meta-SNE?
Well, it turns out that when you only have a very small number of neurons, they latch on to random useful features! These randomly selected features don't tend to be the same, so you end up with the models horribly disagreeing on what is similar and what is different.
As you increase the number of neurons, the space of features they look at, if not the features of individual neurons, becomes similar across models. And so the models agree more, and cluster more tightly.
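For concreteness, the meta-SNE recipe is roughly this (a minimal Python sketch; `get_hidden` is a stand-in for however you extract a model's hidden-layer activations, and everything else is standard scipy/sklearn):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import TSNE

    def model_signature(model, X):
        # represent a model by the pairwise distances between its
        # hidden representations of a fixed batch of inputs
        H = get_hidden(model, X)        # (n_points, n_hidden) activations
        D = squareform(pdist(H))
        return D / D.max()              # normalize so overall scale doesn't dominate

    def meta_sne(models, X, perplexity=5):
        # distance between two models = distance between their (flattened)
        # signature matrices; then t-SNE over the models themselves
        sigs = np.stack([model_signature(m, X).ravel() for m in models])
        return TSNE(n_components=2, perplexity=perplexity).fit_transform(sigs)

The tiny-network experiment above then amounts to running this over the 1-, 2- and 5-neuron models and seeing how tightly they cluster.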
...
Another fun idea for using meta-SNE is ensemble models. We know that training a bunch of models and then averaging their results (ensembling) can improve results a lot. When is this helpful? My guess is that the farther apart comparably good models are in meta-SNE space, the more ensembling will help, because they've learned different things. A sketch of how one might test that guess follows.
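One cheap test of the guess (a sketch reusing `model_signature` from the snippet above, and assuming sklearn-style classifiers with predict_proba plus a held-out X_val, y_val): if the guess is right, signature distance and ensembling gain should be positively correlated.

    import itertools
    import numpy as np
    from scipy.stats import pearsonr

    def ensemble_gain(m1, m2, X_val, y_val):
        # gain of the averaged pair over the better single model
        p1, p2 = m1.predict_proba(X_val), m2.predict_proba(X_val)
        acc = lambda p: np.mean(p.argmax(axis=1) == y_val)
        return acc((p1 + p2) / 2) - max(acc(p1), acc(p2))

    sigs = [model_signature(m, X).ravel() for m in models]
    gains, dists = [], []
    for i, j in itertools.combinations(range(len(models)), 2):
        gains.append(ensemble_gain(models[i], models[j], X_val, y_val))
        dists.append(np.linalg.norm(sigs[i] - sigs[j]))

    print(pearsonr(dists, gains))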
Ensemble (and also boosted) models: Very nice idea.
I like the takeaway that the meta-SNE idea is a powerful way to compare the space of models through the lens of pairwise distances, used as a proxy for the distance metric. Are distances the defining property of a vector space R^d? Could you have used some other quantity instead of pairwise distances?
There "the defining property" if you want to mod out isometries. :) They're nice, because they encode the geometry of the data.
You could very reasonably try things like cosine distance. And I did some experiments, to good results, with sqrt(d(x,y)), to emphasize really close-together data points as special. But these don't feel as well motivated.
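In code, these variants are just drop-in replacements for the plain euclidean signature in the sketch above (assuming `H` is the activation matrix from there):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def sig_euclidean(H):
        return squareform(pdist(H))

    def sig_cosine(H):
        return squareform(pdist(H, metric="cosine"))

    def sig_sqrt(H):
        # a concave transform: stretches small distances relative to large
        # ones, so very close pairs of points carry more weight
        return np.sqrt(squareform(pdist(H)))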
Hm. It might also be interesting to try with the p_ij values from t-SNE, which model the topology of the data. Then you'd really be getting meta. :)
Interesting. IIUC, what you're implying is that defining a metric determines the topology, so in that sense they're equivalent.
Isn't p_ij in t-SNE also derived from the distances themselves, something like p_ij ~ student_t(d_ij, degrees_of_freedom)? (I forget how the d.o.f. is actually computed in t-SNE.)
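Checking the t-SNE paper, I believe the Student-t actually only enters on the low-dimensional side; the high-dimensional affinities are Gaussian:

    p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
                   {\sum_{k \ne i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
    \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

    q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
                  {\sum_{k \ne l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

with each \sigma_i set by binary search to match a user-chosen perplexity, and the degrees of freedom fixed at one rather than computed. So everything is still a function of the distances.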
That reliance on distances leads me to one way this approach might be limited: it models similarities using distances, which are symmetric. If similarities aren't symmetric, then this visualisation could hide some information. For example: the specific entity "BMW car" is more similar to the more general entity "car" than "car" is to "BMW car". It seems this asymmetry could capture things (such as the generality of concepts) that aren't reflected in metric spaces (on first thought).
A tangentially related question: why do the fonts look like computer modern? Do you write the article in LaTeX and have it translated into HTML preserving the font style?