Decoupled Neural Interfaces Using Synthetic Gradients (deepmind.com)
130 points by yigitdemirag on Aug 29, 2016 | 26 comments



Super cool stuff in this paper.

At its heart, this is a new training architecture that allows parameter weights to be updated faster in a distributed setting.

The speed-up happens like so: instead of waiting for the full error gradient to propagate through the entire model, nodes can calculate the local gradient immediately and estimate the rest of it.

The full gradient does eventually get propagated, and it is used to fine-tune the estimator, which is a mini-neural net in itself.
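For a concrete picture, here is a rough sketch of the mechanism in PyTorch (my own illustration, not DeepMind's code; layer sizes and learning rates are arbitrary): a small "synthetic gradient" net predicts dL/dh for a layer's output h, the layer updates from that prediction immediately, and the true gradient, once it finally arrives, is only used to train the predictor.

    import torch
    import torch.nn as nn

    layer = nn.Linear(784, 256)                  # the layer being decoupled
    sg_model = nn.Linear(256, 256)               # mini-net that estimates dL/dh
    opt_layer = torch.optim.SGD(layer.parameters(), lr=0.01)
    opt_sg = torch.optim.SGD(sg_model.parameters(), lr=0.001)

    def local_update(x):
        h = torch.relu(layer(x))
        with torch.no_grad():
            g_hat = sg_model(h)                  # estimated gradient, available right now
        opt_layer.zero_grad()
        h.backward(g_hat)                        # update the layer without waiting for full backprop
        opt_layer.step()
        return h.detach()

    def refine_estimator(h, g_true):
        # called later, once the real dL/dh has propagated back from the rest of the model
        opt_sg.zero_grad()
        ((sg_model(h) - g_true) ** 2).mean().backward()
        opt_sg.step()

In the paper the estimator can also be conditioned on the label, and each decoupled module runs a loop like this on its own schedule.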

It's amazing that this works, and the implication that full back-prop may not always be needed shakes up a lot of assumptions about training deep nets. This paper also continues this year's trend of using neural nets as estimators/tools to improve the training of other neural nets (I'm looking at you, GANs).

Overall, excited to see where this goes as other researchers explore the possibilities when you throw the back-prop assumption out.


This is a big deal. One of the weaknesses of NNs is their time/memory complexity: to learn, you have to store every previous state and iterate backwards through them to the beginning of time. And if you want to learn "online", you need to do that at every single step. It's unlikely the human brain works that way, and that has been one of the big arguments for why the brain can't use something similar to backprop.


It has been proven that sparse backwards connections, even with tied weights, can substitute for backprop. In biological systems the feedback routes can't be the same as, or share connections with, the feed-forward routes.


Are you referring to http://arxiv.org/pdf/1411.0247v1.pdf ? That doesn't mention sparse backward connections, but it does show that feedback weights that aren't computing the actual derivative dloss/din can still support learning. The network 'learns to learn', so to speak.
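For anyone curious what "feedback alignment" looks like mechanically, here is a rough numpy sketch (my own toy illustration, not the paper's code): the backward pass uses a fixed random matrix B in place of the transposed forward weights, and learning still works.

    import numpy as np

    np.random.seed(0)
    W1 = 0.1 * np.random.randn(784, 256)
    W2 = 0.1 * np.random.randn(256, 10)
    B = 0.1 * np.random.randn(10, 256)        # fixed random feedback weights, never trained

    def step(x, y_onehot, lr=0.01):
        global W1, W2
        h = np.maximum(0.0, x @ W1)           # forward pass, ReLU hidden layer
        y = h @ W2
        e = y - y_onehot                      # output error (squared-error loss)
        dh = (e @ B) * (h > 0)                # B stands in for W2.T; that is the whole trick
        W2 -= lr * (h.T @ e)
        W1 -= lr * (x.T @ dh)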


I also dimly remember that sparsity in the random fixed backward connections still works. There is actually a figure about that in the paper you've linked to.

Interestingly, feedback alignment is also patented, but it is unclear whether it is actually helpful, except for explaining neuroscience. To my knowledge there has been no application of it in two years.


Does anyone know how they generated the illustration images?

Seems much better than relying on the client to render CPU-intensive JS.


Keynote for the static ones, After Effects for the gifs


Do most of the images not load for you guys too? (nevermind, works now!)


None of the images load for me, even after many refreshes.


Refreshed the page and it worked.


looks fine for me


Is this a big deal?


This area is a big deal - ML networks need to be much deeper and denser to provide human-level understanding, and training networks is currently a considerable bottleneck.


Does this method make it easier to spread a neural network over multiple GPUs/machines? I mean, does it reduce the amount of data being communicated between compute nodes, or does it just decouple the updates from the need to wait for the rest of the net to finish?


> Does this method make it easier to spread a neural network over multiple GPUs/machines?

Yes, but this isn't the primary focus of this work.

This is about a method of approximating the error signal (the gradient) as it flows back up the neural network.

This is important because allowing the use of approximate gradients means that earlier layers can be trained without waiting for error back-propagation from the later layers.

This asynchronous feature helps on a (computer) network too - there is no need to wait for back-propagation across the network.

As they point out, the true error does get back-propagated eventually. The analogy with an eventually consistent database system (and the effect that has on scalability) is pretty clear.
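A toy sketch of that "no waiting" point (my own illustration, not the paper's code; names and sizes are made up): the machine holding the front half updates immediately from the estimated gradient, and the true gradient, whenever it eventually arrives over the wire, is only used to refine the estimator.

    import queue
    import torch
    import torch.nn as nn

    front = nn.Linear(784, 256)
    estimator = nn.Linear(256, 256)               # predicts dL/dh for the front half's output
    inbox = queue.Queue()                         # (h, true dL/dh) pairs sent back later by the back half
    opt_f = torch.optim.SGD(front.parameters(), lr=0.01)
    opt_e = torch.optim.SGD(estimator.parameters(), lr=0.001)

    def front_step(x):
        h = torch.relu(front(x))
        with torch.no_grad():
            g_hat = estimator(h)
        opt_f.zero_grad(); h.backward(g_hat); opt_f.step()   # no waiting on the back half
        while not inbox.empty():                             # fold in any real gradients that have arrived
            h_old, g_true = inbox.get()
            opt_e.zero_grad()
            ((estimator(h_old) - g_true) ** 2).mean().backward()
            opt_e.step()
        return h.detach()                                    # shipped off to the back half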


ANNs are not that big of a deal, IMHO, when you compare them to other machine learning techniques, e.g. Support Vector Machines. Also, see https://en.wikipedia.org/wiki/Artificial_neural_network#Theo...

Though this article is so well presented, it deserves an award for how pretty it is.


My gut says this sort of training alternative to back-propagation has a lot of uses where SVMs have no applicability. The article talks a lot about RNNs (neural nets for sequence prediction), but I would guess it would have uses in online learning as well. Learning twice as fast in those situations seems pretty significant to me.


I can't believe I'm getting down-voted just because I'm not bullish on ANN.

As I said, in my humble (and educated) opinion, NNs don't really have a lot of practical use. So long as they have to be processed in parallel, SVMs will always have the advantage that they can be computed sequentially, meaning they can be run much faster and without the need for specialized hardware. SVMs and ANNs are solving the same problem in machine learning: they're both methods for classifying data. SVMs just do it much faster and by more practical means.


The classical solver used to train kernel SVMs and implemented in libsvm [1] has a time complexity between O(n^2) and O(n^3), where n is the number of labeled samples in the training set. In practice it becomes intractable to train a non-linear kernel SVM as soon as the training set is larger than a few tens of thousands of labeled samples.

Deep Neural Networks trained with variants of Stochastic Gradient Descent on the other hand have no problem scaling to training sets with millions of labeled samples which makes them suitable to solve large industrial-scale problems (e.g. speech recognition in mobile phones or computer vision to help moderate photos that are posted on social networks).

SVMs can be useful in the small-training-set regime (fewer than 10,000 training examples). But for that class of problems, it's also perfectly reasonable to use a single CPU (with 2 or 4 cores) and a good linear algebra library such as OpenBLAS or MKL to train an equally powerful fully connected neural network with 1 or 2 hidden layers. Hyper-parameter tuning for SVMs can be easier for beginners using the default kernels (e.g. RBF or polynomial), but with modern optimizers like Adam, implemented in well-designed and well-documented high-level libraries like Keras, it has become very easy to train neural networks that just work.
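For reference, the kind of "just works" setup described above looks roughly like this in Keras (the layer sizes, the 20-feature input, and the X_train/y_train names are placeholders of mine):

    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential([
        Dense(128, activation="relu", input_dim=20),   # 20 input features (placeholder)
        Dense(128, activation="relu"),
        Dense(10, activation="softmax"),               # 10 classes (placeholder)
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(X_train, y_train, batch_size=32, epochs=20)  # X_train/y_train: your own data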

Also, for many small- to medium-scale problems that are not signal-style problems [2], Random Forests and Gradient Boosted Trees tend to perform better than SVMs. Most Kaggle competitions are won with either linear models (e.g. logistic regression), Gradient Boosting, neural networks, or a mix of those. Very few competitors have used kernel-based SVMs in a winning entry AFAIK.

[1] https://en.wikipedia.org/wiki/Sequential_minimal_optimizatio...)

[2] By "signal-style" I mean problems such as image or audio processing.


> I can't believe I'm getting down-voted just because I'm not bullish on ANN.

Probably it is happening because ANNs definitely do have some advantages over SVMs for modelling real-world phenomena.

Specifically, ANNs are _parametric_, while SVMs are nonparametric, in the sense that for an ANN you have a bunch of hidden layers (of varying sizes, chosen based on the number of features) plus bias parameters, and that is your model.

SVMs, OTOH (at least in the kernelized case), consist of a set of support vectors selected from the training set, which in the worst case can be as large as the training set itself.

Modelling real-world phenomena, for example optimal air-conditioning in a data center based on a large number of external inputs, is far more amenable to ANNs than to SVMs. ANNs are, after all, universal approximators; with SVMs you have to guess the kernel...

edit-001: see, for example, this paper: http://deeplearning.net/wp-content/uploads/2013/03/dlsvm.pdf , where folks try their hand at deep learning via SVMs.
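A quick scikit-learn illustration of that parametric vs. nonparametric point (toy data, my own example): a kernel SVM's fitted model is the set of support vectors it keeps, and that set can grow toward the size of the training set.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.randn(2000, 10)
    y = (X[:, 0] * X[:, 1] > 0).astype(int)      # a nonlinearly separable target

    clf = SVC(kernel="rbf").fit(X, y)
    print(len(clf.support_), "support vectors kept out of", len(X), "training points")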


> I can't believe I'm getting down-voted just because I'm not bullish on ANN.

I don't think it's that - I think it is because you are factually wrong. In particular, this part is wrong: "ANNs are not that big of a deal, IMHO, when you compare them to other machine learning techniques, e.g. Support Vector Machines."

(Deep) Neural Networks are a very, very big deal because they work so much better than SVMs in every domain where there is sufficient training data and sufficient time (and enough GPUs!) to train them.

This post is a big deal because it shows a way to cut down that training time.


I believe SVM>ANN might be true if the classification task is relatively simple. Most of the state-of-the-art computer vision algorithms are based on ANNs, though (just as an example). Do you think SVMs will catch up in those areas?

(I upvoted you even though I don't agree with you, just thought getting greyed out was excessive)


wait... the model can be linear. So then it's just the second order terms of the error gradient? Or the Jacobian or something?


Hmmm - I'm also thinking that this is one of those things that probably has a much better explanation - and the science/maths will (hopefully) backfill why it works so well.

I half-remember from somewhere that as long as the gradient descent direction has the correct sign 'in expectation', then the SGD will ~work. So there's a whole lot of flexibility in there for having a good idea that at least doesn't fail horribly.

For instance, in other DeepMind work, they do lots of asynchronous weight updates - and the accuracy decrease from ignoring any kind of 'locking' is dwarfed by the speed increase of being able to run more stuff in parallel.

Another image I can't shrug off is that of Q-learning in a game, where the updates implicitly pass back from 'the future' (which also works ~better than it should). In this case, the linear model would just be an estimator of where the update values are going to land...
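A toy numpy check of the "correct sign in expectation" intuition mentioned above (my own example): minimise f(w) = w^2 with a very noisy but unbiased gradient estimate, and plain SGD still settles near the minimum.

    import numpy as np

    rng = np.random.RandomState(0)
    w = 5.0
    for _ in range(5000):
        g = 2.0 * w + rng.normal(0.0, 5.0)   # true gradient plus large zero-mean noise
        w -= 0.01 * g
    print(w)                                  # hovers near 0 despite the noise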


HOGWILD [1] lock-free weight updates are a great example of how forgiving SGD can be.

[1] https://arxiv.org/abs/1106.5730


It will be an approximation of the Jacobian because it is just backprop. It also seems they only use the linear model when they also condition on the correct class, which makes things easier. It is also only MNIST (10 classes).



