I really liked the mention of the "three narratives" of neural networks: 1) human brain 2) nonlinear "squashing and folding" of the input space 3) probabilistic / latent variables. Seeing different ways to look at something is always refreshing and opens up deep learning to rich analytical techniques developed in optimization, type theory, etc.
HN: how do YOU think about NNs? As a matter of pure preference, most of my daydreaming about NNs comes from my inductive biases toward (1) and (2), with a few ideas from MCMC methods and optimization thrown into the mix.
> Representation theory
One question I've always wondered about is whether it is better to get a network to (1) learn representation transformations A->B->C->D via a series of layers, or (2) try to learn A->D via a single (highly nonlinear) transformation.
Obviously the former lends itself better to analytical understanding, and this seems to be what emerges when you train on ImageNet. However, do researchers have any control over whether the learned net does (1) or (2)? I'm thinking that scattering "skip arcs" that pass gradients through layers (or an LSTM) would cause representations to mix more between layers.
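To make the "skip arcs" idea concrete, here's a toy numpy sketch (my own illustration, not from the post) of a residual-style connection: the layer only has to learn a small refinement on top of the identity, so successive representations stay mixed together rather than being wholesale re-encoded at each layer.

    import numpy as np

    def dense_relu(x, W, b):
        # affine map followed by a component-wise ReLU nonlinearity
        return np.maximum(0.0, x @ W + b)

    def residual_block(x, W, b):
        # skip connection: output = input + learned refinement
        return x + dense_relu(x, W, b)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 16))            # batch of 4 inputs, 16 features
    W = 0.01 * rng.normal(size=(16, 16))    # small weights: block starts near the identity
    b = np.zeros(16)

    print(np.allclose(residual_block(x, W, b), x, atol=0.1))  # True at initialization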
> Each layer is a function, acting on the output of a previous layer. As a whole, the network is a chain of composed functions. This chain of composed functions is optimized to perform a task.
This is certainly true for the majority of models in use today, but I wonder if this will remain true for future architectures. In practice, most layer types store some state, and implement "forward" and "backward" functions that may utilize that state.
One could argue that state is not necessary if we only consider the network to be defined by its feedforward passes (de-coupled from the optimization problem, which requires the "backward" functions). But the neuroscience narrative argues that function and learning cannot be decoupled.
In the case of a network that is supposed to update its own weights during normal operation (e.g. topic discovery), the functional narrative is less clear to me. Thoughts?
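To make the "state" point concrete, here's roughly the layer interface I have in mind (a hypothetical minimal sketch, not any particular library's API): the forward pass caches whatever the backward pass will later need, so the function view and the learning view share state.

    import numpy as np

    class ReluLayer:
        # minimal stateful layer: forward caches its input so backward can use it
        def forward(self, x):
            self.x = x                       # state shared with the backward pass
            return np.maximum(0.0, x)

        def backward(self, grad_out):
            return grad_out * (self.x > 0)   # gradient flows only where the input was positive

    layer = ReluLayer()
    y = layer.forward(np.array([-1.0, 2.0, 3.0]))
    g = layer.backward(np.array([1.0, 1.0, 1.0]))
    print(y, g)   # [0. 2. 3.] [0. 1. 1.]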
> Seeing different ways to look at something is always refreshing ... how do YOU think about NNs?
A year or two ago, when I was really doing the wandering researcher thing, I made a point of asking lots of people at different groups how they saw deep learning. I think these three narratives capture most of what I heard, although people blended them in different ways.
Certain views are more common in particular groups. For example, people at the Montreal group seem more inclined to think about things from the representations narrative than other groups. (Although, I think I have just about everyone else beat in my extreme version of seeing everything from the manifold perspective. :P )
I feel like there's some pretty interesting quasi-sociological work to be done on deep learning, because the community has grown so fast and there's a lot of variety in how people think.
>> ... chain of composed functions is optimized..
> This is certainly true for the majority of models in use today, but I wonder if this will remain true for future architectures.
I suspect there's something very fundamental about chains of composed functions. I can't formalize my feeling into a strong argument though.
> In the case of a network that is supposed to update its own weights during normal operation (e.g. topic discovery), the functional narrative is less clear to me.
I don't think it matters too much. The functional graph becomes more complicated, but it's just a matter of the output of one function going to multiple places.
Cool, thanks again for your great post. Big fan of your articles.
> The functional graph becomes more complicated
Indeed. I once tried to come up with a "higher order function" that takes in a feedforward network and produces a separate function that computes the backward pass (like Theano's autodiff, but with the abstraction at the layer level rather than at the level of individual ops).
Here's a diagram for a simple forward MLP (left to right), with the backward-pass network below (right to left). I found this hard to work with, because of the explosion in the size of the computational graph when you try to decouple optimization from function. I notice something similar when trying to unroll an RNN visually across time.
http://imgur.com/ATNwknh
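In code, the higher-order function I was after looks roughly like this (a plain-Python sketch; it assumes layer objects with forward/backward methods and ignores parameter gradients): it takes the list of layers and returns the composed forward pass plus a backward pass that replays the chain in reverse.

    def compose_network(layers):
        # layers: objects with forward(x) and backward(grad) methods
        def forward(x):
            for layer in layers:
                x = layer.forward(x)
            return x

        def backward(grad_out):
            for layer in reversed(layers):   # the backward network mirrors the forward chain
                grad_out = layer.backward(grad_out)
            return grad_out

        return forward, backward

    # usage (hypothetical): f, b = compose_network([layer1, layer2, layer3])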
Let me know if this is way off base from what you were talking about in your post.
Starting with narrative (1) is always amusing when talking to computer people. Computer people seem happy to dismiss most of the brain because "neural network: all we need is data!" But the brain isn't just a bowl of neurons freely connecting to each other. There's a quite rigid structure to brains that gives rise to things like the perception of free will and language acquisition (i.e. why no other animals have language).
The brain is organized in layers, but the layers don't care about respecting order: layer 5 can be directly connected to layer 0, forwards and backwards, as well as anywhere else in between. Brains rely a lot more on inhibiting signals than on generating signals. Plus, depending on how you count, we are made up of three brains put together: left, right, cerebellum, and those are seldom considered in our current fascination with single-task "AI" systems. You can put a shark fin on your head and swim for a while, but you're still not a shark.
As for "neuroscience narrative argues that function and learning cannot be decoupled" — saying learning is a bit of a stretch. The brain only cares about correlating coincident events ("neurons that fire together wire together") and wiring together neurons that fire at the same time ends up causing what we see as emergent behaviors (semantic memories bound to episodic memory, operant conditioning, etc).
The problem I worry about with the neuroscience narrative is that it seems easy to use it to justify anything. As I understand it, we don't understand the brain very well, and there are lots of different theories.
A fair number of neural net papers will make appeals to neuroscience to motivate something. Whenever I read these, I wonder to what extent neuroscience could be cited to argue the opposite.
There's a lot of ex-neuroscientists in deep learning. Amusingly, my experience has been that they're generally the most skeptical of the neuroscience narrative.
Of course, some amazing researchers seem to really like the neuroscience narrative -- this is obviously a strong point in its favor. I don't mean to dismiss it. I just think that one wants to be kind of careful when using it as a vantage point for thinking.
> I wonder to what extent neuroscience could be cited to argue the opposite.
That's a great point. Often, in brains and bodies in general, the same mechanism causes more than one result. Sometimes even the absence of the mechanism causes the same result too. Biological understanding has a high number of latent variables we may not even know exist. (Plus, arguing "because biology" is just a few steps away from arguing "because quantum" when motivated by vague optimism.)
The best overall insight from neuroscience is: AI, as envisioned by the 1950s pioneers, is possible. Brains are complex, but not irreducibly complex. Just slam some gametes together and a new brain appears. We can, at some level, eventually, get it all working digitally.
The best useful insight from neuroscience seems to be: the never-ending a/b test of natural selection converged on neurons+synapses as a universal computation system. But neurons+synapses aren't quite enough detail to replicate a brain; that's like saying our computers use transistors, so all we need is transistors to build a CPU. Sure, that's important, but the structure and topology and interconnections are what make the magic happen.
What's the end result? More design or more brute force? Our GPUs are still Moore's Lawing themselves every few years. We are in early days of experimentation-at-scale here.
Since we will always need an overwhelmingly higher number of model parameters than data, we can't intricately tune models by hand (alternative: make an ANN to optimize ANN architectures). One breakthrough we don't really have is a minimum-description for ANN architectures. If that were to come about, we could have "ANN DNA" then 'breed' ANNs together so their topologies and functions merge to generate a (hopefully) more fit offspring. Then just keep repeating until they rise up and destroy us all.
This is an unformed thought, but to me the philosophy of emergence, intelligence, and deep learning have something to do with each other. In emergent systems, you have a stack of levels of phenomena, and you can describe the phenomena at one level without reference to the details of the lower levels. It's like those details have been blurred out and you're left with a compressed representation with just enough information to be useful at the new level (e.g. pixels->edges, renormalization in physics). The success of architectures with multiple layers, including bottlenecks in layer size, seems very similar, and shows that this is a good way to look at the world.
This seems to be quite a natural way that humans see the world, and in fact we might not think of something that doesn't behave this way as being intelligent -- we want there to be some sort of compressed 'trick' that doesn't just brute force the data.
I am surprised you didn't use the word abstraction even once in your description of emergent systems. I admit I have some catching up to do on the state of that topic, but I couldn't help noticing the symmetry between the stack of phenomena as you described it and the abstraction layers of basic computational systems. APIs and programming languages expose an interface that completely abstracts away the implementation beneath it. And many of the layers we've added have indeed emerged, in the sense that they were built based on what we thought might be practical on top of established layers. Anyway, just some food for that unformed thought of yours ;)
For classification, I think of it simply as a nonlinear transformation + multivariable logistic regression where parameters are learned jointly. In particular the nonlinear transformation is assumed to be of the form of some number of affine transformations, each followed by a nonlinear component-wise mapping. I tend to intentionally avoid brain comparisons because: 1) there's more than enough of that already 2) I don't know enough of the neurophysiology to speculate. I'd like to see some mathematical analysis on what classes of function are more efficiently represented (and/or learned) by networks with increasing numbers of layers.
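Concretely, the picture I mean is something like this toy numpy sketch (weights assumed already learned jointly): a few affine maps, each followed by a component-wise nonlinearity, with a softmax (multinomial logistic regression) on top of the learned features.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)    # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def mlp_classify(x, weights):
        # nonlinear transformation: affine -> tanh, repeated
        for W, b in weights[:-1]:
            x = np.tanh(x @ W + b)
        W, b = weights[-1]
        return softmax(x @ W + b)                # logistic regression on the learned features

    rng = np.random.default_rng(0)
    weights = [(rng.normal(size=(8, 16)), np.zeros(16)),
               (rng.normal(size=(16, 3)), np.zeros(3))]
    probs = mlp_classify(rng.normal(size=(5, 8)), weights)
    print(probs.shape, probs.sum(axis=1))        # (5, 3), each row sums to 1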
I am writing a deep neural network library in Scala as part of my DQN implementation. It's developed in the FP paradigm with some type safety. If anyone is interested, the code is here.
https://github.com/A-Noctua/glaux
Right now only feedforward and convolutional layers with ReLU are implemented, and there's no documentation yet. But I would love to collaborate with anyone also interested in the area.
It seems like the author has realised that there are a few common functional programming patterns (folds, maps) that are also common ways of combining information and operating on data structures, and seen the parallel to some operations that we frequently want to do within neural networks. A 'function' is simply a thing that takes another thing and produces a third thing. This doesn't seem that revolutionary or insightful - do these ideas give us any extra knowledge about neural networks or is this just a nice parallel?
I think that the theory is pretty much already clear to people who have a lot of experience with neural networks. I think the contribution is to explicitly write it down in a clear and concise way for people to whom it wouldn't otherwise have occurred.
If this formalism works out then a few interesting directions spring to mind:
- We suddenly have a huge new toolbox from FP / category theory that we can apply to understanding and extending DNNs. E.g. what happens if we apply differentiation in the manner of zippers to these structures? I have no idea but it might lead somewhere.
- The deep learning world gets a precise way to describe network architectures, which makes communication much easier and research much more reproducible.
- With a formal model you can automate building and optimising implementations.
Very possible. It just feels like something from that general area, and I wanted to point it out.
I played around with fitting this into the Curry-Howard correspondence. For a moment, let's set aside neural networks and just talk about the probability distributions they operate on.
In Curry-Howard, one interprets values of a type as proofs of a theorem. Perhaps we could similarly interpret samples from a distribution as verifying the distribution. This might lead to a kind of fuzzy-logic version of Curry-Howard.
I'm not sure if this actually works or can be formalized -- I've only put a tiny bit of thought into it.
I've been idly chasing that idea for some time. There's some great literature out there on categories of distributions and (conditional) random variables as arrows between them. That would be a good place to start looking more deeply!
(God I wish I had the time to look into this myself!)
There are some linguistics papers going into a notion of probabilistic type theory.
It's definitely an interesting area to think about. After all, Homotopy Type Theory has been teaching us that types are spaces, with topological structure. So hypothetically, how would you construct probability measures over those spaces? Could you then take points/proofs/programs as samples? What would the samples tell you about the overall measure?
What would it all mean?
Well, the probability monad already gives us a way to compute with distributions, and thus do probabilistic programming, but I'm not sure that really tells us anything type-theoretical, since the distributions are all over preexisting data types.
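For what it's worth, the discrete version of the probability monad fits in a few lines (a quick sketch): "unit" is a point mass, and "bind" pushes a distribution through a conditional distribution and marginalizes out the intermediate variable.

    from collections import defaultdict

    def unit(x):
        # point-mass distribution on x
        return {x: 1.0}

    def bind(dist, kernel):
        # dist: {value: prob}; kernel: value -> {value: prob}
        # returns the marginal distribution over the kernel's outputs
        out = defaultdict(float)
        for x, p in dist.items():
            for y, q in kernel(x).items():
                out[y] += p * q
        return dict(out)

    coin = {"H": 0.5, "T": 0.5}
    noisy = bind(coin, lambda s: {"H": 0.9, "T": 0.1} if s == "H" else {"H": 0.2, "T": 0.8})
    print(noisy)   # {'H': 0.55, 'T': 0.45}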
The idea of deep learning as "differentiable functional programming" resonates with me. It makes me wonder if one could build an elegant monadic interface for deriving networks, where you could bind on the results of previous layers and use traditional functions to build new ones. That would be powerful.
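Just to sketch what I'm imagining (a hypothetical toy, and admittedly this degenerate version is nothing more than function composition): you'd chain layer-building steps with a bind-like operation, and a real version would thread extra structure, e.g. parameters or randomness, through the monad.

    import numpy as np

    class Net:
        # hypothetical bind-style builder: each step maps the previous layer's output
        def __init__(self, fn=lambda x: x):
            self.fn = fn

        def bind(self, layer):
            prev = self.fn
            return Net(lambda x: layer(prev(x)))

    W = 0.1 * np.ones((4, 8))
    model = (Net()
             .bind(lambda x: np.tanh(x @ W))        # a layer is just a function here
             .bind(lambda h: h.sum(axis=-1)))       # ordinary functions compose right in
    print(model.fn(np.ones((2, 4))).shape)          # (2,)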
Regardless of whether this post is right or wrong, it is a showcase for the kind of out-of-the-box thinking that can force people within and from outside the field to see deep neural nets from a different angle. For that alone, I think this is a fantastic essay.
Some comments:
The author writes: "using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is 'weight tying' – is essential to the phenomenal results we’ve recently seen from deep learning. Of course, one can't just arbitrarily put copies of neurons all over the place. For the model to work, you need to do it in a principled way, exploiting some structure in your data. In practice, there are a handful of patterns that are widely used, such as recurrent layers and convolutional layers."
Recurrent and convolutional layers are not just two examples of widely used "weight tying" patterns; they are THE TWO MAJOR WAYS in which virtually everyone "ties" weights -- across time or pixel space, usually. With the exception of tiny fully-connected models used to solve relatively small problems, every successful deep neural net model of meaningful scale I know of is either a convnet or an RNN!
If anyone knows of a counter example, I'd love to hear about it!
I would also have pointed out that "weight tying" not only allows the deep neural net to learn more quickly; it also massively reduces the "true" number of parameters required to specify the model, thereby reducing by a similar proportion the number of samples necessary to have reasonable PAC learning guarantees.
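To make the parameter-count point concrete (a toy numpy sketch): a 1D convolution is just a fully-connected layer whose weight matrix is constrained so that every row reuses the same small kernel, so a length-k kernel stands in for what would otherwise be n^2 free parameters.

    import numpy as np

    def conv_as_matrix(kernel, n):
        # expand a length-k kernel into the banded n x n matrix of the
        # equivalent fully-connected layer with tied weights
        k = len(kernel)
        W = np.zeros((n, n))
        for i in range(n - k + 1):
            W[i, i:i + k] = kernel          # every row is the same kernel, shifted
        return W

    kernel = np.array([1.0, -2.0, 1.0])
    x = np.arange(8.0)
    W = conv_as_matrix(kernel, 8)
    print("free parameters:", kernel.size, "vs untied:", W.size)
    print(np.allclose((W @ x)[:6], np.correlate(x, kernel, mode="valid")))   # True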
TreeNets are also pretty popular. Word embeddings are also a form of weight tying, and can't always be described as part of an RNN. If you want, you can think of most attentional mechanisms as involving a form of weight tying, in addition to the RNN tying.
There are also more obscure things, like deep symmetry networks (which are really a close relative of conv nets, with convolution replaced by group convolution).
Thank you. It seems to me that every time a new state-of-the-art result in AI is announced, a deep composition of convolutional or recurrent "neural functional programs" is involved.
I don't see deep compositions of treenets or word embedding layers, which tend instead to be used stand-alone as simpler models or as preprocessing layers for deep networks. I'd have to think about attentional models.
This is not a criticism. Rather, it's my way of suggesting that we need more experimentation with more interesting compositions using a broader range of "neural functional programs" -- which I believe is also one of your points.
And again, I think your essay is fantastic.
--
Edits: changed my wording to express what I actually meant to write.
It's nice to get people thinking about possible connections and sharing terminology here, but I'm not sure that many of the connections which the article manages to make precise are particularly new or deep.
A neural network is just a mathematical function[1] of inputs and parameters, and of course you can build them up using higher-order functions and recursion. Libraries like theano already let you build up a function graph in this way -- see theano.scan[2], for example, which is something like a fancy fixed-point combinator.
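The scan point is easy to see without any library (plain Python, not actual Theano code): an RNN is basically a fold of a cell function over the input sequence, and scan is the variant that also keeps the intermediate hidden states.

    import numpy as np

    def rnn_cell(h, x, W_h, W_x, b):
        # one step of a vanilla RNN: new hidden state from old state and current input
        return np.tanh(h @ W_h + x @ W_x + b)

    def scan(cell, h0, xs):
        # fold the cell over the sequence, keeping every intermediate state
        h, hs = h0, []
        for x in xs:
            h = cell(h, x)
            hs.append(h)
        return np.stack(hs)

    rng = np.random.default_rng(0)
    W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(3, 4)), np.zeros(4)
    states = scan(lambda h, x: rnn_cell(h, x, W_h, W_x, b),
                  np.zeros(4), rng.normal(size=(10, 3)))
    print(states.shape)   # (10, 4): one hidden state per time step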
The idea about types corresponding to representations seems like it would be hard to make precise, because at a type level everything's pretty much just R^n, or perhaps some simple submanifold of R^n. "Has representation X" isn't a crisp logical concept, and in most type theories I know of, types are crisp logical properties.
Even if you can accommodate fuzzy/statistical types somehow, I would tend to think of a representation as being the function that does the representing of a particular domain, rather than the range or the distribution of the outputs of that function. Two very different representations might have a similar distribution of outputs but not represent compatible concepts at all.
Still, there is a neat observation here: by combining two different representations downstream of the cost function you're optimising (e.g. by adding their outputs together and using that as input to another layer), you can force them, in some sense, to conform to the same representation, or at least to be compatible with each other. You could probably formalise this in a way that lets you talk about equivalence classes of representation-compatible intermediate values in a network as types.
It wouldn't really typecheck stuff for you in a useful way -- if you accidentally add together two things of different types, by definition they become the same type! But I guess it would do type inference. For most networks I'm not sure if this would tell you anything you didn't already know though.
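A back-of-the-envelope version of that "type inference" (purely hypothetical, nothing standard): give every intermediate tensor a type variable and union the variables whenever two tensors get merged (added together, say); the resulting equivalence classes are the representation-compatible groups.

    class RepTypes:
        # union-find over type variables for intermediate values in a network
        def __init__(self):
            self.parent = {}

        def var(self, name):
            self.parent.setdefault(name, name)
            return name

        def find(self, v):
            while self.parent[v] != v:
                self.parent[v] = self.parent[self.parent[v]]   # path compression
                v = self.parent[v]
            return v

        def unify(self, a, b):
            # adding two values forces them into the same equivalence class
            self.parent[self.find(a)] = self.find(b)

    t = RepTypes()
    for v in ("conv3_out", "skip_out", "lstm_h"):     # hypothetical tensor names
        t.var(v)
    t.unify("conv3_out", "skip_out")                  # the two get added somewhere downstream
    print(t.find("conv3_out") == t.find("skip_out"))  # True
    print(t.find("lstm_h") == t.find("conv3_out"))    # False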
If you want to find a useful connection between deep learning and typed FP, you could start by thinking about what problem you'd want your type system to solve in the context of deep learning. What could it usefully prove, check or infer about your network architecture?
[1] I was going to say "a differentiable function", but actually they're often not these days what with ReLUs, max-pooling etc. The non-differentiable points are swept under the rug in that I don't see anyone bothering with subgradient-based optimisation methods.
Over the past few months, I've been learning group theory, representation theory, and a little bit about reproducing kernel Hilbert spaces, and I think these concepts also tie in with the author's vision for the future of neural networks.
We are often interested in studying a set of objects in a way that is invariant to some type of transformation. Essentially, we want to retain all of the information describing an object except for that which is necessary to distinguish the object from all transformed versions of itself.
As a motivating example, let's say we want to compare a collection of photographs. One difficulty is that each photograph may have an arbitrary/unknown rotation applied to it. How, then, can we create a representation of each photo that does not depend on its rotation? The trivial solution is to map each photo to the element of a singleton. This solves the rotation problem but unfortunately removes all rotationally invariant information from the photos as well. So our goal is to find the boundary that splits the information describing each photo into "invariant" and "non-invariant" partitions.
One way to do this is to train a neural network on arbitrarily rotated versions of many photos. The network will then learn to discard the rotation-dependent information from each photo (compressing it, in a sense). The downside of using a neural network is that the training process can take a heck of a long time to figure out the invariants, and it isn't even guaranteed to work that well. And that's just for the preprocessing step; the rest of the network will take forever to train.
So what can we do? Another idea is to use the Gram matrix (a.k.a. the kernel matrix) corresponding to a set of n points in R^2. The Gram matrix preserves all non-rotational information about the set of points; however, it is also overcomplete, meaning that we started with 2n real numbers, and now we have n^2 real numbers (or n(n-1)/2 if we consider just the upper triangular portion of the Gram matrix). This is less than ideal.
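Here's the Gram-matrix observation in a few lines of numpy (a toy check): the matrix of inner products is unchanged by any rotation Q, but it has on the order of n^2/2 entries compared with the 2n original coordinates.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))                 # n = 10 points in R^2

    theta = rng.uniform(0, 2 * np.pi)            # a random rotation Q in SO(2)
    Q = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    G = X @ X.T                                  # Gram matrix of the original points
    G_rot = (X @ Q.T) @ (X @ Q.T).T              # Gram matrix of the rotated points

    print(np.allclose(G, G_rot))                 # True: the Gram matrix is rotation-invariant
    print(X.size, "coordinates vs", 10 * 9 // 2, "off-diagonal Gram entries")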
Representation theory furnishes a better solution to the problem by providing a means of identifying the fundamental degrees of freedom that characterize a set of transformations (i.e., a group). A representation ρ of a group G is a homomorphism from G to the general linear group on a vector space V. A representation is considered irreducible if there exists no nontrivial proper subspace W ⊂ V for which ρ(g)(w) ∈ W for all w ∈ W, g ∈ G.

Assuming you can find the irreducible representations, the remaining task is to figure out how to label (index) the mappings. A familiar example is the set of spherical harmonics, indexed by integers l ≥ 0, |m| ≤ l: for each degree l, the 2l+1 functions constitute an irreducible representation of the group SO(3) (rotations in 3D space). Every function f(x, y, z) over R^3 maps to a unique linear combination of the spherical harmonics (augmented with a set of radial functions), and applying a rotation Q ∈ SO(3) only mixes coefficients within each degree l. So rotation-invariant combinations of those coefficients (for example, the norm of the coefficient vector at each degree) contain only rotationally invariant information about the function, and nothing more (these are what you want to feed into your neural network!).
In a similar vein, the symmetric group Sn (the group of all permutations of n objects) has irreducible representations indexed by Young diagrams (the shapes corresponding to integer partitions). So you can isolate the permutationally invariant information that specifies a function as well.
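For the permutation case there's an elementary cousin of this idea (a toy sketch, not the full Sn machinery): symmetric functions of the values, such as the power sums p_k = sum_i x_i^k, are invariant under every permutation, and for n values the first n of them determine the multiset.

    import numpy as np

    def power_sums(x, n):
        # permutation-invariant features: p_k = sum_i x_i^k for k = 1..n
        return np.array([np.sum(x ** k) for k in range(1, n + 1)])

    x = np.array([3.0, 1.0, 2.0])
    x_permuted = np.array([2.0, 3.0, 1.0])       # same multiset, different order
    print(np.allclose(power_sums(x, 3), power_sums(x_permuted, 3)))   # True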