>MXNet can consume as little as 4 GB of memory when serving deep networks with as many as 1000 layers.
So perhaps I'm not well versed enough in deep learning, but does this mean that they solved the vanishing gradient problem? How are they managing to do this?
Still, using 1000 layers is of little use in practice. State-of-the-art image classification models are in the range of 30-100 layers, with residual connections and a number of channels per layer that varies with depth so as to keep the total number of trainable parameters tractable. 1000-layer nets are interesting mainly as a memory-scalability benchmark for DL frameworks and as empirical validation that the optimization problem is feasible, but they have no practical use otherwise (as far as I know).
Vanishing gradients and memory efficiency are different problems. The memory-mirror option is what enables this extremely low memory usage, at the cost of roughly 30% more compute.
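To make the trade-off concrete, here is a minimal sketch of the underlying idea (activation checkpointing / recomputation): on the forward pass, store only every k-th activation; during backprop, recompute each segment's intermediate activations from its checkpoint. Memory drops from O(L) to roughly O(L/k + k) activations, at the cost of about one extra forward pass. All function names here are illustrative, not MXNet's actual API, and the sketch assumes k divides the number of layers.

```python
def forward_checkpointed(layers, x, k):
    """Run the forward pass, keeping only every k-th activation."""
    checkpoints = {0: x}
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % k == 0:
            checkpoints[i + 1] = x
    return x, checkpoints

def backward_checkpointed(layers, grads, checkpoints, k, gy):
    """Backprop, recomputing activations within each segment on the fly.

    grads[i](a, gy) returns the gradient w.r.t. layer i's input, given
    that layer's input activation a and the upstream gradient gy.
    Assumes k divides len(layers), so every segment start is checkpointed.
    """
    L = len(layers)
    for seg_end in range(L, 0, -k):
        seg_start = seg_end - k
        # Recompute this segment's activations from its checkpoint
        # (this is the extra compute the mirror option pays for).
        acts = [checkpoints[seg_start]]
        for i in range(seg_start, seg_end):
            acts.append(layers[i](acts[-1]))
        # Backprop through the segment, then discard its activations.
        for i in range(seg_end - 1, seg_start - 1, -1):
            gy = grads[i](acts[i - seg_start], gy)
    return gy

# Toy usage: six layers f(x) = 2x, so dy/dx through the whole chain is 64.
layers = [lambda x: 2.0 * x] * 6
grads = [lambda a, gy: 2.0 * gy] * 6
y, cps = forward_checkpointed(layers, 1.0, k=2)   # only 4 of 7 activations kept
g = backward_checkpointed(layers, grads, cps, 2, 1.0)
```

With k ≈ sqrt(L), this gives the well-known O(sqrt(L)) memory schedule; the 30% compute overhead quoted for the mirror option is in the same spirit, though the exact figure depends on the network.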