A note -- if you're linking to arXiv, it's better to link to the abstract (http://arxiv.org/abs/1312.6184) rather than directly to the PDF. From the abstract, one can easily click through to the PDF; not so the reverse. And the abstract allows one to do things like see different versions of the paper, search for other things by the same authors, etc.
Very interesting and well-written paper. They show that a deep neural network (which people thought had more explaining-power-per-parameter) can be "emulated" by a shallow neural network with the same number of parameters.
What Ba and Caruana have shown is that deep neural networks---as an architecture---are not more powerful. The previous general opinion (at least this is what I thought) was that deep architectures of the form:
predictions
layer 3 features of features of features
layer 2 features of features
layer 1 features
data layer
would be more efficient (i.e. more explaining power per parameter). Nope. Apparently, a shallow neural network of the form
predictions
layer 1 features (a lot more of them)
data layer
with an equal number of units in layer 1 as the total number of units in the deep network,
can obtain the same accuracy. Thus, deep neural networks and shallow neural networks have roughly equivalent explaining-power-per-parameter.
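To make "per parameter" concrete, here is a small sketch that counts weights and biases for the two shapes drawn above. The layer sizes are made up for illustration and are not taken from the paper; `mlp_param_count` and `shallow_width_for_budget` are hypothetical helper names.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected net, e.g. [100, 50, 10]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def shallow_width_for_budget(n_in, n_out, budget):
    """Largest hidden width h such that a one-hidden-layer net fits the budget.

    params = n_in*h + h + h*n_out + n_out, solved for h.
    """
    return (budget - n_out) // (n_in + 1 + n_out)

# Assumed sizes: 100 inputs, three hidden layers of 50 units, 10 outputs.
deep = [100, 50, 50, 50, 10]
deep_params = mlp_param_count(deep)

# Widest single hidden layer that stays within the same parameter budget.
width = shallow_width_for_budget(100, 10, deep_params)
shallow_params = mlp_param_count([100, width, 10])
```

Note that matching the number of units and matching the number of parameters are not the same constraint: with these assumed sizes, a shallow net with 150 hidden units (the deep net's total) has more parameters than the deep net.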
A separate result shows deep convolutional neural networks can be simulated by shallow NNs with a simulation overhead of ~10x.
Another cool observation they make is about "installing" the parameters of ML models. We'll use the following analogy to illustrate the point:
ML model <==> hardware
ML model parameters <==> software
Running program X involves (1) buying/renting a server, (2) compiling X for your architecture, and (3) running X. Of course there is always the alternate step (2') of installing pre-compiled binaries.
Similarly, we can think of using an ML model as a three-step procedure: (1) ML.init() (malloc for Model.parameters), (2) Model.train() (fit parameters to training data), (3) Model.predict(new_datum).
The authors point out the possibility of distributing model parameters as binaries. They demonstrate an alternate step (2') in which they train a ''FancyModel'', then set the parameters of the simple model so it "simulates" the FancyModel: ''Model.parameters = simulate(FancyModel)''. In other words, the parameters of an ML model need not be learned through the model itself.
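A toy sketch of what a `simulate(FancyModel)` step could look like. Everything here is an assumption for illustration: the "teacher" is a fixed function standing in for an already-trained deep model, the student is a one-hidden-layer net on scalar inputs, and the mimic loss is plain squared error on the teacher's outputs. It is not the paper's actual setup.

```python
import math
import random

def teacher(x):
    # Stand-in for a trained deep model: two stacked nonlinearities.
    return math.tanh(2.0 * math.tanh(1.5 * x) - 0.5)

def init_student(hidden, rng):
    # Each hidden unit holds [input weight w, bias b, output weight v].
    return [[rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)]
            for _ in range(hidden)]

def student(params, x):
    return sum(v * math.tanh(w * x + b) for w, b, v in params)

def mimic_loss(params, xs):
    # Mean squared difference between student and teacher outputs.
    return sum((student(params, x) - teacher(x)) ** 2 for x in xs) / len(xs)

def train_mimic(params, xs, lr=0.05, epochs=200):
    # Plain SGD; the targets are the teacher's outputs, never true labels.
    for _ in range(epochs):
        for x in xs:
            err = student(params, x) - teacher(x)
            for unit in params:
                w, b, v = unit
                h = math.tanh(w * x + b)
                dh = 1.0 - h * h  # derivative of tanh
                unit[0] -= lr * err * v * dh * x
                unit[1] -= lr * err * v * dh
                unit[2] -= lr * err * h
    return params

rng = random.Random(0)
xs = [i / 10 for i in range(-20, 21)]  # unlabeled inputs
params = init_student(8, rng)
loss_before = mimic_loss(params, xs)
train_mimic(params, xs)
loss_after = mimic_loss(params, xs)
```

The point of the sketch is only the data flow: the student's training signal comes entirely from querying the teacher, so any unlabeled inputs can be used.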
What I find fascinating is it seems deep architectures don't have any more representational power, but they do have more learning power. Something very interesting is going on...
The novel part is not that the shallow network can compute the same functions (that is interesting, but that particular result is decades old), but the notion that the shallow network can learn the same functions efficiently.
What we already knew: For any given neural network of N layers, there exists a 3-layer neural network computing the same function.
What we didn't know: There are (at least reasonably often) ways of training shallow networks that get similar efficiency to the ways of training deep networks given the same functions.
No, they require training a deep neural network first. Then training the shallow net to mimic the deep one.
The claim is that the shallow net can have the same number of parameters as the deep net. It's always been known that a large enough shallow NN can theoretically approximate any function. But that it can do so with few parameters is very surprising.
My best guess is that the majority of parameters in deep NNs are unused or redundant. I.e., they have a low weight, or don't influence the output very much, or another neuron computes mostly the same function. Many types of regularization like dropout and weight decay heavily encourage this.
Whereas the shallow NN is forced to maximize the usefulness of every single parameter to get the same result. It doesn't have to worry about overfitting. I am not sure if the paper accounted for this though; it's been a while since I read it.
No, a shallow NN definitely overfits in normal circumstances. However, in this case they are building a shallow NN that is not learning a normal function, but is being trained to emulate a deep network with an equivalent number of parameters.
The parent's claim is that in that case, we don't have to worry about overfitting, since we want every result of the target network to be emulated as closely as possible in the shallow network.
It seems reasonable that when you decompose the representation of the target function you are trying to learn into more layers, each of the layers can be "more gently" modulated to emulate the target function as observed through the training data. Differential techniques might therefore more readily find good approximations to the target function. Once the structure has been captured by finely slicing it into more layers, this structure can be compressed down into a shallower representation.
This is fascinating. "Model compression demonstrates that a small neural net could, in principle, learn the more accurate function, but with the current learning algorithms we are unable to train a model with that accuracy on the original training data; instead, we must train the complex ensemble first and then train the neural net to mimic it."
That's a profound result. It bears on how "abstraction" works, and gives some insight on how learned information becomes more rigid. Once you've compressed the neural net, you can't update the compressed form, which has much less state, based on new data.
It's a very interesting result, but as always with neural networks we have to keep in mind that what matters is not whether a model can be encoded in a different architecture (even at equal entropy), the question is whether the model can be learned in the first place. When you work with shallow nets trained with regular backprop + dropout, you see that their learning capabilities tend to "saturate" much quicker than deep nets. Often with shallow nets, after a point you don't get better results by adding more units or more training data. But deep nets are better able to make use of these extra parameters (extra layers) and extra training data.
Possibly because deep nets conceptually "break down" a learning problem into incremental steps (each new layer being a higher level of representation).
But then again, maybe the problem is simply that we don't have sufficiently good methods for training shallow NNs on large-scale problems. After all, it's only recently that we figured out how to efficiently train deep nets (either pre-training with Autoencoders or RBMs, or through Hessian-free optimization).
I like this paper because it turns some of the current intuition about deep nets around. The current understanding of why deep nets are so good at so many (perceptual) tasks is that the depth buys you a lot. Yoshua Bengio will point out that there are functions that require exponentially more gates to encode when using shallower circuits. This might lead people to believe that deep nets are working so well because they are more fundamentally capable of representing the solutions to problems that people care about in a terse way.
But this work shows (at least for this audio task) that there are solutions just as good in the solution space spanned by shallow nets with memory usage that we can afford; we just didn't know how to find them.
> maybe the problem is simply that we don't have sufficiently good methods for training shallow NNs on large-scale problems
In a sense, that's what this is though, right? It's a training algorithm for the simpler class of model. It just has to go via the deep model to get there.
The TIMIT (speech) benchmark they use is not freely available - you have to pay for it.
It is a great shame that papers use this benchmark - how do I reproduce their results? How do I compare my methods to theirs?
Meanwhile the image benchmarks MNIST, NORB, CIFAR-10, CIFAR-100 are freely available. Kudos to the people who made them.
We start with one individual with a trust level of 1 (a probability of correct work in the bayesian sense). All other contributors start with a trust level of 0.
Anyone with a trust level above MIN_TRUST (say, 0.6), called a trustee, can validate others' work. This status is dynamic, such that a trustee can stop being one, invalidating all its verifications.
Valid work is work that has a score above MIN_TRUST. Such work is included in the benchmark (with a possible added check, such as a lower bound for the number of votes received).
The score of a work is the lower bound of the Wilson score confidence interval for a Bernoulli parameter with a confidence level of 95%. Given `total` the number of votes from trustees and `valid` the number of votes that claimed this was a valid work:
z = 1.96
z2 = z * z
positive = valid / total
score(work) = (positive + z2 / (2*total)
- z * sqrt((positive*(1-positive) + z2/(4*total)) / total)) / (1 + z2/total)
The trust level of each contributor is computed as the proportion of validated work (by a trustee) amongst their work, minus the proportion of invalid work. In math:
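Taking the description above literally, the score and the trust level could be computed as in this sketch. The function names and argument names are hypothetical, the zero-vote and zero-work fallbacks are my assumption, and 0.6 is only the example value of MIN_TRUST given earlier.

```python
import math

MIN_TRUST = 0.6  # example threshold from the text

def wilson_lower_bound(valid, total, z=1.96):
    """Lower bound of the Wilson score interval (z=1.96 for 95%)."""
    if total == 0:
        return 0.0  # assumption: no votes means no score
    z2 = z * z
    positive = valid / total
    centre = positive + z2 / (2 * total)
    margin = z * math.sqrt(
        (positive * (1 - positive) + z2 / (4 * total)) / total)
    return (centre - margin) / (1 + z2 / total)

def trust_level(validated, invalid, total_works):
    """Proportion of validated work minus proportion of invalid work."""
    if total_works == 0:
        return 0.0  # assumption: new contributors start at 0
    return validated / total_works - invalid / total_works

def is_valid_work(valid, total):
    return wilson_lower_bound(valid, total) > MIN_TRUST

def is_trustee(validated, invalid, total_works):
    return trust_level(validated, invalid, total_works) > MIN_TRUST
```

One consequence of using the lower bound: a single positive vote is not enough to validate a work, while ten unanimous votes are.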
Research costs money. You need machines, a place to work, employees need to eat and pay the rent (even if you're the only employee.) Cost of database is just another cost of research. Most scientific fields have significantly higher costs than computer science. Just be glad you're not in high energy physics...
Basically, one-hidden-layer networks are already universal approximators; however, it's a nonconstructive result. This article is semi-constructive: they train a more complicated model, then emulate it with a simpler model, but do not know how to train the simple model directly.
Right, but the narrative associated with this result is that the single-layer might require a lot more units to achieve the same performance as a deep architecture.
It seems this is not necessarily the case---based on the empirical results shown.
I'd like to add something to the discussion wrt network architectures and representations of neural net parameters. Neural nets when being trained can be packed up as one parameter vector to represent this same structure. In fact, this is how a lot of linear optimizers train neural networks. To demonstrate a quick example:
A typical neural net layer is made up of a weight matrix W and a bias vector. This represents the connections of the network.
A typical feed-forward architecture will have a weight matrix of shape (number of inputs) x (number of outputs), with a bias vector whose length equals the number of outputs.
Say: W is 3 x 2 with a bias of length 2. We can then represent the neural net parameter space as a vector of length 8 (6 weights plus 2 biases).
The only thing the neural net needs to know how to do is "unpack" the parameters to retain the layer structure. I actually do this in deeplearning4j for training the weights and putting them into a search algorithm.
You would then repeat this for however many layers you have in your network going in order to the output layer.
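The packing and unpacking described above can be sketched in a few lines. This is a generic illustration with plain Python lists, not deeplearning4j's actual code; `pack` and `unpack` are hypothetical names, and the row-major, weights-then-bias ordering is an assumption.

```python
def pack(layers):
    """Flatten [(W, b), ...] into one list; W is a list of rows."""
    flat = []
    for W, b in layers:
        for row in W:
            flat.extend(row)
        flat.extend(b)
    return flat

def unpack(flat, shapes):
    """Rebuild [(W, b), ...] from a flat list, given (n_in, n_out) per layer."""
    layers, pos = [], 0
    for n_in, n_out in shapes:
        W = [flat[pos + r * n_out: pos + (r + 1) * n_out]
             for r in range(n_in)]
        pos += n_in * n_out
        b = flat[pos: pos + n_out]
        pos += n_out
        layers.append((W, b))
    return layers

# The 3 x 2 example from the text: 6 weights + 2 biases = 8 parameters.
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b = [0.7, 0.8]
theta = pack([(W, b)])
restored = unpack(theta, [(3, 2)])
```

An optimizer can then treat `theta` as a single vector, and the network only needs the list of shapes to recover its layers.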
Anyways: the reason deep architectures are used with more "learning" capacity is due to being able to intermix different kinds of activations. A typical example of this is a deep belief network where you have an initial layer that takes in continuous data and then the activations change it to binary.
It really depends on the problem you're solving as to whether this is relevant or not.
It's not a coincidence. I recently attended a talk by Dr. Hinton, who is working on the same thing now. He showed something similar using several different datasets.
I skimmed the paper so I may have missed this but is there a benefit to using a shallow network to simulate the deep network? While this is an interesting result I'd think that for now if I already have a trained deep network I'd just use it instead of training a shallow network to mimic it.
While it doesn't have to be shallow, there is an advantage in training a smaller model to mimic a more complicated one. The smaller net uses fewer computations.
Neural nets are often trained to be as large as possible on clusters of GPUs. Machine learning also commonly uses ensembles of dozens of different models. So there is a lot to be gained by compressing it down to a single model.
If true, this seems like a good thing. Is there any general way proposed in the paper to go from deep --> shallow while preserving the learning function?