I'm less skeptical than fchollet (creator of Keras, for those here who don't know), but agree that we need to wait until the usual suspects at Google, Facebook, Toronto, Montreal, Stanford, etc. have replicated this. In all likelihood the team will release code soon, either before or after NIPS, so we will all be able to check things out for ourselves.
One of the authors, Zhangyang Wang, just wrote this on his personal page: "We have discussed and decided to work on a software package release, perhaps accompanying it with a more detailed technical report in the future. Once the software package is ready, we will update everybody."
It's kind of an odd thing. I (random non-academic amateur) actually spent a bunch of time trying to parse the paper, which was kind of a combination of interesting ideas and incomprehensible ambiguities.
One real academic researcher also put some time into it. The good part of the paper is explained here. My guess is the problem is going from ARM to SARM.
Hi, I'm the author of the blog post. I added a blurb to the beginning of the blog post explaining all the drama, and precisely which claim was made and then withdrawn.
The problem is not in the ARM->SARM approximation, but in the bit on unsupervised pretraining. The paper could stand on its own without that section, but without that result it would have been a significantly more mediocre NIPS submission. Hope this clarifies things.
First, thanks for the excellent blog post; it gave me a better idea of what was happening.
As far as the ARM->SARM transformation goes, maybe that's just something I don't get. I can see how one goes from sparse encoding to repeated ARM-type transformations, and how this repeated application approximates the solution of a sparse encoding problem. And it is suggestive that these applications look like the layers of a neural net.
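To make that concrete, here is roughly what I have in mind: a toy numpy sketch of a few unrolled ISTA iterations for a sparse coding problem (the dictionary, lam, and iteration count are all made up by me; nothing here is from the paper).

```python
# Toy numpy sketch (mine, not the paper's code): a few unrolled ISTA steps
# for the sparse coding problem  min_z 0.5*||x - D z||^2 + lam*||z||_1.
# Each step is an affine map followed by a fixed nonlinearity, i.e. it looks
# like one layer of a feed-forward net.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_encode(x, D, lam=0.5, n_iter=5):
    """Approximate the sparse code of x under dictionary D by unrolling
    a handful of ISTA iterations (the repeated 'ARM-type transformation')."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):                  # each pass plays the role of a layer
        z = soft_threshold(z + D.T @ (x - D @ z) / L, lam / L)
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)               # unit-norm dictionary atoms
x = rng.standard_normal(64)
z = ista_encode(x, D)
print(np.count_nonzero(z), "of", z.size, "coefficients are nonzero")
```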
But when you switch to stacking, what are you doing? Solving one sparse encoding problem, then another? What analogy is there to say this works ... or that it would work better than just a single sparse encoding? At that point, is it just "try it and see"?
One of the impressions I got from scanning the literature is that deep nets are kind of generally treacherous beasts - just getting a locally optimal 1st layer may not be desirable off the bat. People have settled on backpropagation for very subtle reasons. See "Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping" by Caruana, Lawrence, et al., where backpropagation finds better solutions than the "more powerful" conjugate gradient method.
You are right. You are using the output of the previous sparse solution as input to the new one, i.e., stacking sparse coders.
Your second question, of why this is a good idea, is the million-dollar question. It's pretty much "let's try it and see", with some heuristic reasoning thrown into the mix (it mirrors the brain, it abstracts information, etc.).
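In code, "stacking" amounts to something like this toy sketch (my own made-up dictionaries, with a generic few-iteration sparse encoder standing in for each ARM; nothing here is from the paper):

```python
# Toy sketch of 'stacking': the approximate sparse code produced by one coder
# becomes the input to the next coder, which has its own dictionary.
import numpy as np

def encode(x, D, lam=0.5, n_iter=5):
    """A few ISTA iterations approximating the sparse code of x under D."""
    L = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        u = z + D.T @ (x - D @ z) / L                          # gradient step
        z = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft-threshold
    return z

rng = np.random.default_rng(0)
dicts = [rng.standard_normal((64, 128)), rng.standard_normal((128, 256))]
h = rng.standard_normal(64)      # pretend input signal
for D in dicts:                  # stack: each coder eats the previous coder's code
    h = encode(h, D)
print(h.shape)                   # (256,): the code of a code
```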
btw, I don't think people use early stopping anymore. It's been replaced by more powerful forms of regularization, such as dropout. The deep learning world is getting more tame, and that makes me happy.
gabrielgoh: thank you for this. Like others, I was optimistic that this had promise, and I was wrong. Now I want to understand how the authors got it wrong, so I've added your blog post to my reading list.
As far as I understand this, these guys claim they can train convolutional and many other types of deep neural nets faster by pretraining each layer with a new unsupervised technique, via which the layer sort of learns to compress its inputs (a local optimization problem), and then they fine-tune the whole network end-to-end with supervised SGD and backpropagation as usual. They have not released code, so no one else has replicated this yet -- as far as I know.
If the claim holds, the implication is that layers can quickly learn much of what they need to learn locally, that is, without requiring backpropagation of gradients from potentially very distant layers. I can't help but wonder if this opens the door for more efficient asynchronous/parallel/distributed training of layers, potentially leading to models that update themselves continuously (i.e., "online" instead of in a batch process).
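To make my reading of the recipe concrete, here is a toy sketch of the two-stage workflow, with plain PCA standing in for whatever local unsupervised objective the authors actually use (all names and sizes below are invented; this is just my mental model, not their code):

```python
# Rough sketch: pretrain each layer with a purely local, unsupervised
# "compression" objective (plain PCA here), then fine-tune the whole stack
# end-to-end with supervised gradient descent and ordinary backprop.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))            # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy binary labels

def pca_weights(H, k):
    """Local pretraining: top-k principal directions of the layer's input H."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Vt[:k].T.copy()                    # shape (d_in, k)

# --- Stage 1: greedy, layer-local pretraining (no backprop involved) -------
sizes, Ws, H = [32, 16], [], X
for k in sizes:
    W = pca_weights(H, k)
    Ws.append(W)
    H = np.tanh(H @ W)                        # output becomes the next layer's input
w_out = np.zeros(sizes[-1])                   # linear read-out on top

# --- Stage 2: end-to-end supervised fine-tuning with backpropagation -------
lr = 0.1
for step in range(200):
    hs, H = [X], X                            # forward pass, keeping activations
    for W in Ws:
        H = np.tanh(H @ W)
        hs.append(H)
    p = 1.0 / (1.0 + np.exp(-(H @ w_out)))    # sigmoid prediction
    g = (p - y) / len(y)                      # dLoss/dlogits for logistic loss
    grad_out = hs[-1].T @ g
    delta = np.outer(g, w_out) * (1 - hs[-1] ** 2)      # back through tanh
    grads = []
    for i in range(len(Ws) - 1, -1, -1):
        grads.append(hs[i].T @ delta)                   # gradient for Ws[i]
        if i > 0:
            delta = (delta @ Ws[i].T) * (1 - hs[i] ** 2)
    for W, G in zip(Ws, reversed(grads)):
        W -= lr * G
    w_out -= lr * grad_out

print("training accuracy:", ((p > 0.5) == (y > 0.5)).mean())
```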
I wouldn't be surprised if the claim holds. There is mounting evidence that standard end-to-end backpropagation is a rather inefficient learning mechanism. For example, we now know that deep neural nets can be trained with approximate gradients obtained by shifting bits to get the sign and order of magnitude of the gradient roughly right.[1] In some cases it's even possible to restrict learning to use binary weights.[2] More recently, we have learned that it's possible to use "helper" linear models during training to predict what the gradients will be for each layer, in-between true-gradient updates, allowing layers to update their parameters locally during backpropagation.[3] Finally, don't forget that in the late 2000's, AI researchers were doing a lot of interesting work with unsupervised layer-wise training (e.g., DBNs composed of RBMs, stacked autoencoders).[4]
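As a toy illustration of that first point (the general idea only, not the exact scheme of any of the cited papers): rounding each gradient entry to its sign and nearest power of two still lets plain gradient descent make progress.

```python
# Keep just the sign and order of magnitude of each gradient entry and
# descend anyway; on a simple quadratic this still converges toward zero.
import numpy as np

def round_to_power_of_two(g, eps=1e-12):
    """g -> sign(g) * 2^round(log2|g|): sign and order of magnitude only."""
    mag = np.maximum(np.abs(g), eps)
    return np.sign(g) * 2.0 ** np.round(np.log2(mag))

w = np.array([3.0, -2.0])
for _ in range(50):
    grad = 2 * w                              # exact gradient of ||w||^2
    w -= 0.05 * round_to_power_of_two(grad)
print(w)                                      # ends up near the optimum at [0, 0]
```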
This is a fascinating area of research with potentially huge payoffs. For example, it would be really neat if we find there's a "general" algorithm via which layers can learn locally from inputs continuously ("online"), allowing us to combine layers into deep neural nets for specific tasks as needed.
EDITS: Expanded the original comment so it better conveys what I actually meant to write, while keeping the language as casual and informal as possible. Also, I softened the tone of my more speculative observations.
>For example, we now know that deep neural nets can be trained with approximate gradients obtained by shifting bits to get the sign and order of magnitude of the gradient roughly right.[1] In some cases it's even possible to restrict learning to use binary weights.[2] More recently, we have learned that it's possible to use "helper" linear models during training to predict what the gradients will be for each layer, in-between true-gradient updates, allowing layers to update their parameters locally during backpropagation.[3] Finally, don't forget that in the late 2000's, AI researchers were doing a lot of interesting work with unsupervised layer-wise training (e.g., DBNs composed of RBMs, stacked autoencoders).[4]
All of those fundamentally use backpropagation though; they just approximate the gradients. Pretraining with autoencoders even used backpropagation to train the autoencoders. And researchers have entirely moved away from that strategy because it just doesn't work as well as pure supervised learning.
I don't think backpropagation will ever go away because it is so simple and powerful. Instead it will be tweaked and approximated like in your examples.
The paper itself is fairly sparse, but it references a number of ~2013 approaches to faster-to-train neural-net-like learning systems (PCANet, ScatNet, etc.).
The paper presents these and other approaches as instances of a classical, general form (regularized regression), but with a "stacked" property: each layer iterates however many times, then the next layer changes the parameters (or features) and iterates further.
From my barely-informed viewpoint, this sounds like a fascinating way to unify the earlier efforts and one which could yield a variety of other approaches - even if the particular variation they use doesn't work out. But I assume lots of more-informed people are going to be looking at this.
Note: even though the paper talks about Approximated Regression Machine (ARM) layers and uses equations that look sort of like the equations of regularized regression, the layers aren't about regularized regression but about sparse dictionary coding, a quite different approach.
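To spell out the distinction in symbols (my own notation, not the paper's): regularized regression solves for one set of weights given the data, while sparse dictionary coding solves a small optimization problem for a new code z for every input x.

```latex
% Regularized (ridge) regression: solve for weights w, given data X and targets y
\min_{w}\; \tfrac{1}{2}\lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2

% Sparse dictionary coding (what the ARM layers do): solve for a sparse code z
% of each input x, given a dictionary D
\min_{z}\; \tfrac{1}{2}\lVert x - Dz \rVert_2^2 + \lambda \lVert z \rVert_1
```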
This is a pretty good summary, but to note this is not a new unsupervised technique - it's a sparse coding method, and sparse coding has been around for more than a decade. Summarizing this paper without mentioning the word sparse coding seems wrong somehow. ;)
But, sadly, it doesn't work as advertised. My hypothesis is that it's extremely difficult to get great results with greedy training. Jointly learning multiple layers from huge datasets is why things started working so well.
Locally != independently; the training process is recursive, so interactions between layers are present. From the paper:
"The parameters of the entire SARM are solved recursively. The current ARM’s parameters are calculated using the output from the previous ARM. Then, the output of the current ARM is fed into the subsequent ARM (or the classifier, if the current one is the last ARM), as its input."
A twitter conversation reflecting some scepticism, but agreeing it would be interesting if it all checks out: https://twitter.com/fchollet/status/771862837819867136