This reminds me of something that came up in Andrew Ng's online ML class. He said that it is important to check the correctness of your gradient calculation in backprop (by comparing it to a finite difference of the loss) because if you have a bug there, your algorithm might more or less work anyway, making it hard to tell that there was a bug. Apparently you can still get sensible output even with an incorrect gradient.
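For anyone who hasn't taken the class, the check is roughly this (a minimal sketch with a toy linear model and made-up names, not the actual course code): compute the gradient analytically, approximate it with a central finite difference of the loss, and make sure the two agree.

    import numpy as np

    def loss(w, x, y):
        # toy squared-error loss for a linear model
        return 0.5 * np.sum((x @ w - y) ** 2)

    def analytic_grad(w, x, y):
        # hand-derived gradient; this is the thing you want to verify
        return x.T @ (x @ w - y)

    def numeric_grad(w, x, y, eps=1e-5):
        # central finite difference of the loss, one coordinate at a time
        g = np.zeros_like(w)
        for i in range(w.size):
            d = np.zeros_like(w)
            d[i] = eps
            g[i] = (loss(w + d, x, y) - loss(w - d, x, y)) / (2 * eps)
        return g

    rng = np.random.default_rng(0)
    x, y, w = rng.normal(size=(10, 3)), rng.normal(size=10), rng.normal(size=3)
    print(np.max(np.abs(analytic_grad(w, x, y) - numeric_grad(w, x, y))))  # should be tiny, ~1e-9

If the printed difference is large, the analytic gradient has a bug, even if training with it still "sort of" works.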
On reading this, my first question is about the properties of the "random" feedback matrix. They illustrate what is happening using a tiny 1-width machine and a "random" matrix of "1". It seems like some analysis needs to be done on what kind of "random" is most appropriate to replace the gradient update for larger machines. There could be something really interesting going on such that you could generate some optimal non-random B according to whatever the network topology is.
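To make the "random B" idea concrete, here is a rough one-hidden-layer sketch (my own toy code based on my reading of the abstract, not taken from the paper): the forward pass and the weight updates are exactly the ones backprop would use, except the output error is sent backwards through a fixed random matrix B instead of through the transpose of the forward weights.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out, lr = 4, 16, 2, 0.01
    W1 = rng.normal(scale=0.1, size=(n_hid, n_in))
    W2 = rng.normal(scale=0.1, size=(n_out, n_hid))
    B = rng.normal(scale=0.1, size=(n_hid, n_out))  # fixed random feedback matrix, never learned

    for _ in range(10000):
        x = rng.normal(size=n_in)
        y = np.array([x[0] + x[1], x[2] - x[3]])  # arbitrary target function for the demo
        h = np.tanh(W1 @ x)                       # forward pass
        e = W2 @ h - y                            # output error
        dh = (B @ e) * (1 - h ** 2)               # backprop would use W2.T @ e here
        W2 -= lr * np.outer(e, h)                 # same local update rule as backprop
        W1 -= lr * np.outer(dh, x)

    print(0.5 * np.sum(e ** 2))  # error on the last sample; should have shrunk

As far as I understand the claim, this works because the forward weights gradually come into rough alignment with B, so B @ e ends up pointing in a useful descent direction; which distributions of B make that happen fastest seems like exactly the open question.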
The implications of this are huge; it should drastically reduce processing time for neural nets. I wonder whether, given this, networks could be updated asynchronously/continuously.
I don't really understand how it would reduce processing time; could you elaborate?
The main implications seem to be for neuroscience, as far as I can tell. Backprop is considered biologically implausible because it requires either bidirectional communication over synapses (which doesn't happen) or weight sharing between neurons. But this allows the forward and backward connections to be decoupled (i.e. they are different synapses).
This is really interesting stuff; my first reaction was "why does this even work?" I think I still don't fully understand what's going on.
This is not true. See Neural backpropagation [1]. There are known mechanisms for backward feedback across neural connections, for example Spike-Timing-Dependent Plasticity, where neural inputs that are well correlated in time and potential with output firings are strengthened over time. These phenomena are vital to learning and neural development.
From reading the abstract, it seems they are claiming that introducing some randomness into the gradient used for weight updates allows quicker convergence to a solution - I did not read the paper. I also don't exactly understand why it works; it sounds like they are claiming traditional backprop has room for improvement.
That's a very old strategy called jittering (also see stochastic gradient descent).
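In case the term is unfamiliar, jittering just means perturbing each training input with a small amount of noise every time it is presented, which acts as a regularizer. A throwaway sketch (my own, purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def jitter(x, sigma=0.05):
        # add small Gaussian noise to the input each time it is shown to the model
        return x + rng.normal(scale=sigma, size=x.shape)

    # inside whatever training loop you already have, train on jitter(x) instead of x:
    # for x, y in dataset:
    #     w -= lr * grad_loss(w, jitter(x), y)

That is noise on the data, though, not on the feedback path, which I think is the point of the reply below.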
This is something entirely different. They are not doing regular backpropagation at all, but somehow using neurons to learn how to backpropagate values. I haven't read the paper yet, just read their slides earlier, so that might not be correct.
Revolutionary paper indeed, if this idea generalizes to bigger, deeper networks! If so, why hasn't it been discovered before? It's like the Cambrian explosion of neural networks; very exciting times.
Okay, but how does this save computing time?
Multiplying by a fixed random matrix is not faster than multiplying by the (implicitly) transposed weight matrix.
So it only has philosophical implications, right?
Ok, got it: it will simplify building hardware-based neural networks! No more complicated look-ups of the transposed weight matrix needed.