On reading this, my first question is about the properties of the "random" feedback matrix B. They illustrate what is happening with a tiny width-1 network, where the "random" matrix is just the scalar 1. It seems like some analysis is needed on what kind of "random" (which distribution, at what scale) is most appropriate to replace the gradient update in larger networks. There could be something really interesting going on, such that you could generate an optimal non-random B from the network topology.
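
For anyone who wants to poke at this question, here is a minimal NumPy sketch of the idea as I understand it: the backward pass sends the output error through a fixed matrix B instead of the transpose of the forward weights. Everything specific here is my own assumption rather than anything taken from the paper: the function name `train_feedback_alignment`, the two-layer linear network, the layer sizes, the zero initialization, the learning rate, and the choice of uniform(-0.5, 0.5) for B. The line that draws B is the one to swap out when trying other kinds of "random".

```python
import numpy as np


def train_feedback_alignment(seed=0, steps=1000, lr=0.01,
                             n_in=10, n_hidden=8, n_out=3, n_samples=200):
    """Train a two-layer linear net with a fixed random feedback matrix B.

    Hypothetical helper for illustration; all defaults are assumptions.
    Returns (initial_loss, final_loss, alignment).
    """
    rng = np.random.default_rng(seed)

    # Toy regression task: learn a fixed random linear map T.
    X = rng.standard_normal((n_in, n_samples))
    T = rng.standard_normal((n_out, n_in))
    Y = T @ X

    # Forward weights start at zero (an assumption that keeps the toy simple).
    W1 = np.zeros((n_hidden, n_in))
    W2 = np.zeros((n_out, n_hidden))

    # Fixed feedback matrix. This distribution is the thing to experiment with.
    B = rng.uniform(-0.5, 0.5, size=(n_hidden, n_out))

    def loss():
        E = W2 @ W1 @ X - Y
        return 0.5 * np.mean(np.sum(E ** 2, axis=0))

    initial_loss = loss()
    for _ in range(steps):
        H = W1 @ X                 # hidden activations (linear units)
        E = W2 @ H - Y             # output error
        dH = B @ E                 # backprop would use W2.T @ E here
        W2 -= lr * (E @ H.T) / n_samples
        W1 -= lr * (dH @ X.T) / n_samples
    final_loss = loss()

    # Cosine between the feedback signal B @ E and the backprop signal W2.T @ E.
    E = W2 @ W1 @ X - Y
    fa, bp = B @ E, W2.T @ E
    denom = np.linalg.norm(fa) * np.linalg.norm(bp) + 1e-12
    alignment = float(np.sum(fa * bp) / denom)
    return float(initial_loss), float(final_loss), alignment


if __name__ == "__main__":
    before, after, cos = train_feedback_alignment()
    print("loss before:", before)
    print("loss after: ", after)
    print("cosine between B @ E and W2.T @ E:", cos)
```

Comparing the final loss and the cosine across different ways of drawing B (Gaussian, sparse, orthogonal, different scales) would be a cheap first experiment on which kind of "random" matters, though a toy linear problem like this says little about larger networks.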