Decoding the Thought Vector (gabgoh.github.io)
252 points by kebinappies on Nov 30, 2016 | 22 comments



When I was working on a recommender for television shows, I ran SVD on a large User/Item matrix to create a low rank approximation, essentially reducing thousands of user features (TV show preferences) to user vectors representing twenty or thirty abstract "features". Then I looked at the actual item preferences of users who expressed each feature at the greatest and least magnitude. The features, in some cases, mapped to recognizable constructs. There were distinct masculine and feminine features, several obvious Hispanic / Latino elements, and strong liberal versus conservative indicators. Others were less explainable using common labels.
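A minimal sketch of that kind of analysis, in case it helps anyone picture it (the matrix here is random and the rank of 30 is just illustrative; the real pipeline was of course different):

    import numpy as np

    # Hypothetical user/item matrix: rows are users, columns are TV shows,
    # entries are preference scores. The data and the rank of 30 are
    # illustrative only.
    rng = np.random.default_rng(0)
    ratings = rng.random((2000, 500))

    k = 30                                    # level of compression (rank)
    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    user_vectors = U[:, :k] * s[:k]           # each user as ~30 abstract "features"

    # For one latent feature, look at the users who express it most and least
    # strongly, and at what those users actually watch.
    f = 12
    order = np.argsort(user_vectors[:, f])
    top_users, bottom_users = order[-20:], order[:20]
    print("shows favored by high-feature users:",
          np.argsort(ratings[top_users].mean(axis=0))[::-1][:10])
    print("shows favored by low-feature users:",
          np.argsort(ratings[bottom_users].mean(axis=0))[::-1][:10])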

It struck me at the time that the qualities that were expressed most strongly were the ones that ended up having names in our language. But there were others for which I would say to myself, there is something about this group (e.g. those with the greatest expressed value of F124) that I recognize, but can't quite put my finger on.

Of course, I was looking at people through a keyhole, their TV viewing preferences being the only information I had.

Also, I noticed that these "came into focus" most clearly at a certain level of compression (rank).

FWIW


I started reading the essay not knowing what to think, and it turned out to be more relevant to my work than I thought.

The issues discussed in the essay have been central to parts of psychology and the behavioral sciences for some time: how to interpret components such as these.

One thought about your "coming into focus at a certain level of compression" comment: I've done some analyses of these vectors as applied to text samples, and one thing that struck me was how unreplicable some of them were across datasets that are ostensibly similar (but not the same). Others, in contrast, reappeared across multiple corpora. To the extent that some of these components represent "real" features, they should reappear consistently across different datasets where you'd expect them to. That is, they should be robust to changes in the idiosyncratic features of the database.
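A rough sketch of one way to check that kind of replicability, assuming both corpora are represented in the same feature space (the data here is a random stand-in; the absolute cosine similarity below is essentially Tucker's congruence coefficient):

    import numpy as np
    from sklearn.decomposition import PCA

    # Two ostensibly similar datasets (random stand-ins for two corpora
    # embedded in the same feature space).
    rng = np.random.default_rng(1)
    X_a = rng.normal(size=(1000, 300))
    X_b = rng.normal(size=(1000, 300))

    k = 20
    comps_a = PCA(n_components=k).fit(X_a).components_   # (k, 300), unit-norm rows
    comps_b = PCA(n_components=k).fit(X_b).components_

    # A component that reflects a "real" feature should have a near-1
    # (in absolute value) counterpart in the other corpus.
    sim = np.abs(comps_a @ comps_b.T)
    print("best cross-corpus match per component:", np.round(sim.max(axis=1), 2))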


Did you ever compare that focus with a graph over the singular values?


It's a good question. FWIW, I would expect a reasonably sharp "L"-shaped curve in the focus. The assumption there, I guess, is that this metric of 'focus' is something well characterized by the low-frequency basis vectors given by the first few columns of the SVD's U and V.
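For anyone curious what that looks like, a synthetic low-rank-plus-noise example (purely illustrative) produces exactly that kind of scree curve:

    import numpy as np
    import matplotlib.pyplot as plt

    # Low-rank signal plus noise: the singular values drop sharply around the
    # true rank, giving the "L"-shaped (scree) curve described above.
    rng = np.random.default_rng(2)
    true_rank = 25
    signal = rng.normal(size=(2000, true_rank)) @ rng.normal(size=(true_rank, 500))
    noisy = signal + 0.5 * rng.normal(size=(2000, 500))

    s = np.linalg.svd(noisy, compute_uv=False)
    plt.semilogy(s, marker=".")
    plt.xlabel("component index")
    plt.ylabel("singular value (log scale)")
    plt.title("scree plot: the elbow is roughly where things come into focus")
    plt.show()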


Exactly what I saw. Your expectation is correct.


Computers have us figured out in a way that we don't have ourselves figured out.


A question: is this much better / different than a principal component analysis (or a factor analysis)?


Comparing SVD to PCA is a bit of an apples/oranges comparison. SVD is a numerical technique, whereas PCA is a method for analyzing a dataset. You can use SVD to perform PCA (although there are other ways to perform PCA without explicitly doing an SVD). I'm guessing that the GP performed PCA using SVD. There's a good Stack Exchange answer to exactly this question here:

http://stats.stackexchange.com/questions/121162/is-there-any...


One way to do PCA is to use SVD to find the matrix of eigenvectors you project your data onto, so the two are closely related.
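A small numpy sketch of that equivalence, on random correlated data (the centering step is what makes the SVD axes line up with the PCA axes):

    import numpy as np

    # PCA of X is the eigendecomposition of its covariance matrix; the SVD of
    # the centered data recovers the same axes.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated data
    Xc = X - X.mean(axis=0)

    # PCA via covariance eigenvectors
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
    order = np.argsort(eigvals)[::-1]
    pcs_eig = eigvecs[:, order]

    # PCA via SVD of the centered data: right singular vectors are the same axes
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs_svd = Vt.T

    # Corresponding directions agree up to sign
    agreement = np.abs(np.sum(pcs_eig * pcs_svd, axis=0))
    print(np.round(agreement, 6))   # ~1.0 for every component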


> Rather curiously, it turns "airplanes" into "knives". I do not understand why this happens.

I thought it was pretty obvious. The atoms are complected (a term from Rich Hickey of Clojure for, basically, a semantic unit that contains multiple interdependent concepts, for example the way variables complect state, values, and names). In fact, that's the conceit of the whole idea: the thought vector is being extracted from the sparse matrix, and sometimes that sparse matrix isn't all that sparse, so you will get complected concepts. It was obvious in the earlier pics. One atom might contain a piece of information needed by a hat, such that combined with certain other atoms it makes a hat, but combined with a different set it makes a headband.

For example, if you look, the knife atom is shared with scissors. It's one atom describing roughly "handheld sharp objects". The airplane atom + the many-items atom is actually a special mutation for many knives, whereas the many-items atom + the sharp-objects atom is likely scissors, and the sharp-objects atom + the airplane atom is probably a knife. They are all complected and interdependent.

Sure, an atom may generally mean a specific concept, but sometimes it will fall back to a combination-specific mutation. For example, there aren't often many planes in an image, and many planes look rather like many knives. And there is probably never more than one pair of scissors, and one pair of scissors looks rather like two knives. It's a way to describe three things and their number (knives, scissors, planes) using two atoms, an existing counting mutator, and the fact that scissors and planes are usually singular. It's a form of semantic compression, quite interesting, and I would imagine domain dependent.

That's my hypothesis anyway.
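For anyone who wants to poke at this, here is a rough sketch of that kind of sparse decomposition using scikit-learn's dictionary learning on random stand-in vectors (the post's actual setup is different, so this only illustrates the atoms-plus-combinations idea):

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    # Stand-in "thought vectors" (in the post these come from an autoencoder's
    # code layer; here they are random).
    rng = np.random.default_rng(4)
    codes_in = rng.normal(size=(500, 64))

    dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=200,
                            transform_algorithm="lasso_lars", random_state=0)
    sparse_codes = dl.fit_transform(codes_in)   # (500, 32): which atoms are active
    atoms = dl.components_                      # (32, 64): the atoms themselves

    # Each input is approximated by a combination of atoms, so what a single
    # atom "means" depends on which other atoms it is combined with.
    i = 0
    active = np.flatnonzero(sparse_codes[i])
    print("sample", i, "uses atoms", active.tolist())
    print("reconstruction error:",
          np.linalg.norm(codes_in[i] - sparse_codes[i] @ atoms))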


This explanation reminds me of a concept in ASL (American Sign Language) called Classifiers:

https://seattlecentral.edu/faculty/baron/Summer%20Courses/AS...

They are "class" modifiers that modify different nouns in different ways.


This is incredible and infinitely useful. For many years I've had a hunch that human symbols and abstractions have an algebraic quality, and I have always wanted to replace tags, categories, labels, and attributes with global, permanent indices. Mainly, I wanted to do this because the 'view' of a thing is mutable: e.g. the word for a concept changes over time even when the conceptual meaning remains constant. I can't wait to get home and play around with this.


Much of AI and machine learning boils down to the vector and the vector space, along with how well those features are engineered, constructed, scored, and ranked.


Nice interactive examples but I'm afraid the basic setup here doesn't make sense to me. The "atom" is defined as the average encoding of inputs with the feature ("faces with a smile"), but I'd think the proper definition should subtract off inputs without the feature (i.e. "smile" = "faces with a smile" minus "faces without a smile"). The way it's defined you end up adding an extra "average face" along with the feature of interest, which is clearly seen in "The Geometry of Thought Vectors" example -- the non-smiling woman isn't so much forced to smile as to have her face merged with that of a generic smiling woman.
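A toy sketch of the difference between the two definitions, with random vectors standing in for encoder outputs (the "average face" and "smile" directions are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(5)
    d = 128
    average_face = 2.0 * rng.normal(size=d)     # component shared by every face
    smile = 0.5 * rng.normal(size=d)            # the feature of interest

    # Random stand-ins for encoder outputs of faces with and without a smile.
    z_smile = average_face + smile + rng.normal(size=(500, d))
    z_neutral = average_face + rng.normal(size=(500, d))

    atom_as_mean = z_smile.mean(axis=0)                           # the post's definition
    atom_as_diff = z_smile.mean(axis=0) - z_neutral.mean(axis=0)  # proposed fix

    # The mean-based atom still contains the whole "average face"; the
    # difference-based atom isolates the smile direction.
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print("cosine(mean atom, smile direction):", round(cos(atom_as_mean, smile), 2))
    print("cosine(diff atom, smile direction):", round(cos(atom_as_diff, smile), 2))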


Is there any way to read these responsive sites on mobile when they are cut off by the column margins?


Request desktop site


Viewing with a landscape viewport sometimes works.


The "atom" terminology is a bit confusing to me. Isn't an "atom" just another thought vector? If "sunglasses" + "smiles" = "smiling-while-wearing-sunglasses", then "sunglasses" and "smiling-while-wearing-sunglasses" are both just vectors. Is there a reason for the distinction?

Also, if all thoughts can be described as vectors and linear combinations of vectors in "thought-space", I wonder what the axes represent and how many dimensions there are. Are all thoughts just a combination of 100 "unit thoughts"?

Really interesting post!


I find it counter-intuitive that thought vectors should have "linear structure" in multilevel autoencoders.

The whole appeal of neural networks is that they can model non-linear functions, so why would the autoencoder end up with an encoding that is essentially linear?


The goal of a neural network is to take a complicated manifold and, through each of its layers, flatten it out into progressively more and more linear manifolds. If you are building a classifier, then the inputs to the last layer will necessarily have to be linearly separable because the final layer is essentially linear --- the softmax operation just transforms the pre-activation values from logits to probabilities.

So if the NN is well trained, the second to last layer will have linearly separated the different classes (as much as possible, anyway). Earlier layers may not have completely linearly separated their inputs, but they are probably going to lie along simpler manifolds than even earlier inputs.

There's a good blog post on this here: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
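To make the "the final layer is essentially linear" point concrete, here is a minimal sketch of what that last layer computes (the shapes and weights are placeholders):

    import numpy as np

    def softmax(logits):
        # subtract the row max for numerical stability; softmax only rescales
        z = logits - logits.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def final_layer(h, W, b):
        # h: penultimate activations (n, d); W: (d, n_classes); b: (n_classes,)
        logits = h @ W + b        # a purely linear map of the penultimate features
        return softmax(logits)    # monotone rescaling: decision boundaries stay linear in h

    # Placeholder activations and weights, just to show the shapes involved.
    rng = np.random.default_rng(6)
    h = rng.normal(size=(4, 32))
    W = rng.normal(size=(32, 10))
    b = np.zeros(10)
    print(final_layer(h, W, b).sum(axis=1))   # each row of probabilities sums to 1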


Yes, I understand that conceptually. But the middle autoencoder layer is not really an output layer, so what would constrain it to linear representations? I assume normal output layers are constrained to linear representations by the supervised training process.


From the post:

> Rather curiously, it turns "airplanes" into "knives". I do not understand why this happens.

I would venture to guess that this happens because of the existence of a plural, ambiguous "thought" bridging the "airplane" and "knives" concept vectors, along the lines of airplanes -> propellers -> blades -> knives, and the "a group of" vector is causing the system to jump over that semantic ambiguity.



