Predictive learning vs. representation learning (seas.harvard.edu)
63 points by cottonseed on Aug 30, 2014 | 2 comments



I read all the way into the next post, "What the hell is representation?", and I don't really see why the question is all that confusing (first Red Alert for ignorance and bullshit goes here). My first halfway-informed guess: a representation is a three-tuple of a compression function, a resulting data structure, and a decompression function that, taken together, capture at least some of the causal/computational structure of the process generating the data.

Of course, this extreme generality is what makes representation learning Very Hard, but also very powerful when you can manage to get it working at all.

Compression function: we choose to assume (without ever being able to prove it; see Chaitin) that our data is not algorithmically random. Conveniently, most real-world data isn't, so learning about the underlying generating process should yield a description that compresses the data more efficiently than just writing down the sample set.

Data structure: the actual representation, which (ideally) explains (most of) the data on its own, plus or minus some kind of noise.

Decompression function: learning would be useless if we couldn't perform any kind of inference or prediction. We need a way to take the internal representation, plug in some parameter values (either ones we've learned from data or deliberately counterfactual ones), and make a prediction about further samples; that prediction is what constitutes the actual action taken to perform a useful task.
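
A minimal sketch of that three-tuple view, assuming a toy linear autoencoder (PCA via SVD) as the compressor; the names compress, codes, and decompress are mine, not from the post:

    # Minimal sketch of the (compress, data structure, decompress) view of a
    # representation, using a toy linear autoencoder (PCA via SVD).
    # All names and numbers here are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    # Data generated by a low-dimensional process plus noise: 2 latent factors
    # observed through 10 noisy measurements.
    latent = rng.normal(size=(500, 2))
    mixing = rng.normal(size=(2, 10))
    X = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

    # Compression function: project onto the top two principal directions,
    # which are learned from the data itself.
    X_mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - X_mean, full_matrices=False)
    components = Vt[:2]

    def compress(x):
        return (x - X_mean) @ components.T   # 10 numbers -> 2 numbers

    # Data structure: the compressed codes (plus the components themselves).
    codes = compress(X)

    # Decompression function: map codes back to observation space; this is
    # what lets us reconstruct or predict further samples.
    def decompress(z):
        return z @ components + X_mean

    reconstruction_error = np.mean((X - decompress(codes)) ** 2)
    print(f"mean reconstruction error: {reconstruction_error:.4f}")
    print(f"raw storage: {X.size} floats, compressed: {codes.size} floats")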

So it's a question, I think, of whether we're doing information-theoretic learning, or algorithmic information-theoretic learning. Mind, this could all be bullshit, as I'm a total novice at this stuff.


Everybody brings their own terminology to these issues, given their background. This is all just statistics. While I think the intuition behind your terminology makes some sense, it's probably the wrong frame. It's important for people to use the same terminology, with clear mathematical definitions. Unfortunately there's still a big gap between the statistics and machine learning communities.

Everything in "learning" follows from a good parameterization of p(y, x), the joint distribution of the data, observed and unobserved. If you have that, you can get everything else.
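
In symbols, that claim is just the standard rules of probability: both the marginal and the conditional fall out of the joint.

    % Everything else follows from the joint p(y, x):
    p(x) = \int p(y, x)\, dy,
    \qquad
    p(y \mid x) = \frac{p(y, x)}{p(x)} = \frac{p(y, x)}{\int p(y', x)\, dy'}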

The core idea of representation learning is important, even though it's obvious in hindsight. When doing statistics, many people assume the parameters governing the conditional distribution p(y | x) are distinct from those governing p(x), so if you're just interested in predicting y, you don't need to model p(x) to get the posterior distribution p(y | x). Representation learning says these parameters are coupled: if you understand the structure of the world -- if you have a good representation of p(x) -- you can estimate p(y | x) better with less data. This makes a lot of sense, because the distinction between x and y is totally arbitrary.
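
A toy sketch of that coupling, under assumptions I'm making up for illustration (a shared low-dimensional latent variable driving both x and y): learn a representation of p(x) from plentiful unlabeled data via PCA, then fit a predictor of y on a handful of labeled examples, either in that representation or on raw x.

    # Toy illustration of the coupling claim: a representation learned from
    # unlabeled x (here, PCA) lets a predictor of y generalize from far fewer
    # labeled examples than fitting on raw x. Numbers and names are made up
    # for the sketch.
    import numpy as np

    rng = np.random.default_rng(1)
    d_obs, d_latent, n_unlab, n_lab, n_test = 20, 2, 2000, 12, 500

    # The world: y and x are both driven by the same low-dimensional latent z.
    mixing = rng.normal(size=(d_latent, d_obs))
    w_true = rng.normal(size=d_latent)

    def sample(n):
        z = rng.normal(size=(n, d_latent))
        x = z @ mixing + 0.1 * rng.normal(size=(n, d_obs))
        y = z @ w_true + 0.1 * rng.normal(size=n)
        return x, y

    X_unlab, _ = sample(n_unlab)          # plenty of unlabeled x
    X_lab, y_lab = sample(n_lab)          # only a handful of (x, y) pairs
    X_test, y_test = sample(n_test)

    # "Representation of p(x)": top principal directions of the unlabeled data.
    mu = X_unlab.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_unlab - mu, full_matrices=False)
    encode = lambda x: (x - mu) @ Vt[:d_latent].T

    # Least-squares fit on the few labeled examples, scored on held-out data.
    def fit_and_score(train_feats, test_feats):
        w, *_ = np.linalg.lstsq(train_feats, y_lab, rcond=None)
        return np.mean((test_feats @ w - y_test) ** 2)

    mse_raw = fit_and_score(X_lab, X_test)                  # ignore p(x)
    mse_rep = fit_and_score(encode(X_lab), encode(X_test))  # use the representation
    print(f"test MSE, raw x:            {mse_raw:.3f}")
    print(f"test MSE, learned features: {mse_rep:.3f}")

In this synthetic setup the low-dimensional features typically give much lower test error, since twelve labels are enough to fit two coefficients but not twenty.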



