>MXNet can consume as little as 4 GB of memory when serving deep networks with as many as 1000 layers.
So perhaps I'm not well versed enough in deep learning, but does this mean that they solved the vanishing gradient problem? How are they managing to do this?
Still, using 1000 layers is of little use in practice. State-of-the-art image classification models are in the range of 30-100 layers, with residual connections and a number of channels per layer that varies with depth so as to keep the total number of trainable parameters tractable. 1000-layer nets are interesting mainly as a memory-scalability benchmark for DL frameworks and as empirical validation that the optimization problem is feasible, but they have no practical use otherwise (as far as I know).
Vanishing gradients and memory efficiency are different problems. The memory-mirror option is what enables this extremely low memory usage, at the cost of roughly 30% more compute.
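To make the trade-off concrete, here is a minimal sketch of the underlying idea (activation checkpointing / recomputation): on the forward pass, store only every k-th activation; during backprop, recompute each segment's intermediate activations from its checkpoint. Memory drops from O(L) to roughly O(L/k + k) activations, at the cost of about one extra forward pass. All function names here are illustrative, not MXNet's actual API, and the sketch assumes k divides the number of layers.

```python
def forward_checkpointed(layers, x, k):
    """Run the forward pass, keeping only every k-th activation."""
    checkpoints = {0: x}
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % k == 0:
            checkpoints[i + 1] = x
    return x, checkpoints

def backward_checkpointed(layers, grads, checkpoints, k, gy):
    """Backprop, recomputing activations within each segment on the fly.

    grads[i](a, gy) returns the gradient w.r.t. layer i's input, given
    that layer's input activation a and the upstream gradient gy.
    Assumes k divides len(layers), so every segment start is checkpointed.
    """
    L = len(layers)
    for seg_end in range(L, 0, -k):
        seg_start = seg_end - k
        # Recompute this segment's activations from its checkpoint
        # (this is the extra compute the mirror option pays for).
        acts = [checkpoints[seg_start]]
        for i in range(seg_start, seg_end):
            acts.append(layers[i](acts[-1]))
        # Backprop through the segment, then discard its activations.
        for i in range(seg_end - 1, seg_start - 1, -1):
            gy = grads[i](acts[i - seg_start], gy)
    return gy

# Toy usage: six layers f(x) = 2x, so dy/dx through the whole chain is 64.
layers = [lambda x: 2.0 * x] * 6
grads = [lambda a, gy: 2.0 * gy] * 6
y, cps = forward_checkpointed(layers, 1.0, k=2)   # only 4 of 7 activations kept
g = backward_checkpointed(layers, grads, cps, 2, 1.0)
```

With k ≈ sqrt(L), this gives the well-known O(sqrt(L)) memory schedule; the 30% compute overhead quoted for the mirror option is in the same spirit, though the exact figure depends on the network.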