Hi, I'm the author of the blog post. I added a blurb to the beginning of the post explaining all the drama, and precisely which claim was made and then withdrawn.
The problem is not in the ARG->SARG approximation, but the bit on unsupervised pretraining. The paper could stand on its own without that section, but without that result it would have been a significantly more mediocre NIPS submission. Hope this clarifies things.
First, thanks for the excellent blog post; it gave me a better idea of what was happening.
As far as the ARG->SARG transformation goes, maybe that's just something I don't get. I can see how one goes from sparse encoding to repeated ARG-type transformations, and how this repeated application approximates the solution of a sparse encoding problem. And it is suggestive that these applications look like the layers of a neural net.
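If I spell out what I think is meant (this is my own reading, using the standard ISTA update for sparse coding as the repeated transformation; the dictionary D, lambda, and the iteration count are just placeholders), each application really does look like one layer: a linear map followed by a shrinking nonlinearity:

    import numpy as np

    def ista(x, D, lam=0.1, n_iters=50):
        # approximately solve  min_z 0.5*||x - D z||^2 + lam*||z||_1
        # by repeating one "layer-like" update (ISTA)
        L = np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the gradient
        z = np.zeros(D.shape[1])
        for _ in range(n_iters):
            # linear step, like a fully connected layer ...
            z = z + D.T @ (x - D @ z) / L
            # ... followed by a soft-threshold, like a shrinking nonlinearity
            z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        return z

Run more iterations and you get a better approximation to the sparse code; unroll a fixed, small number of them and the whole thing reads like a feed-forward net.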
But when you switch to stacking, what are you doing? Solving one sparse encoding problem, then another? What analogy is there to say this works ... or that it would work better than just a single sparse encoding? At that point, is it just "try it and see"?
One of the impressions I got from scanning the literature is that deep nets are generally treacherous beasts - just getting a locally optimal first layer may not be desirable off the bat. People have settled on backpropagation for very subtle reasons. See "Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping" by Caruana, Lawrence, et al., where backpropagation finds better solutions than the "more powerful" conjugate gradient method.
You are right: you are using the output of the previous sparse solution as input into the new one, i.e. stacking sparse coders.
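Roughly, in a toy sketch of the stacking idea (using scikit-learn's sparse_encode as a stand-in for whatever sparse coder you prefer, and assuming the dictionaries have already been trained somehow):

    import numpy as np
    from sklearn.decomposition import sparse_encode

    def stack_sparse_coders(x, dictionaries, alpha=0.1):
        # feed the sparse code produced at one layer in as the input to the next
        h = np.atleast_2d(x)              # sparse_encode expects (n_samples, n_features)
        codes = []
        for D in dictionaries:            # each D has shape (n_atoms, current input dim),
                                          # so the next D's input dim must match this D's n_atoms
            h = sparse_encode(h, D, algorithm='lasso_lars', alpha=alpha)
            codes.append(h)               # this code becomes the next layer's input
        return codes

Nothing in that construction tells you why the second (or third) sparse coder should produce better features than the first; that is exactly the gap filled by "try it and see."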
Your second question, of why this is a good idea, is the million dollar question. It's pretty much "let's try it and see," with some heuristic reasoning thrown into the mix (it mirrors the brain, it abstracts information, etc.).
btw, I don't think people use early stopping anymore. It's been replaced by more powerful forms of regularization, such as dropout. The deep learning world is getting more tame, and that makes me happy.
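For anyone who hasn't seen it, dropout is simple enough to sketch in a few lines (my own toy version of inverted dropout, not any particular library's implementation):

    import numpy as np

    def dropout(h, p=0.5, training=True, rng=np.random.default_rng(0)):
        # inverted dropout: randomly zero units during training and rescale
        # so the expected activation matches the test-time behavior
        if not training:
            return h
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)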
gabrielgoh: thank you for this. Like others, I was optimistic that this had promise, and I was wrong. Now I want to understand how the authors got it wrong, so I've added your blog post to my reading list.