
>MXNet can consume as little as 4 GB of memory when serving deep networks with as many as 1000 layers.

So perhaps I'm not well versed enough in deep learning, but does this mean that they solved the vanishing gradient problem? How are they managing to do this?




For deep convnets the vanishing gradient problem can mostly be solved by using residual architectures. See: https://arxiv.org/abs/1603.05027
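A minimal sketch of the idea in plain NumPy (shapes and weights are made up purely for illustration): a residual block computes y = x + F(x), so during backprop the gradient reaches x through the identity shortcut as well as through F, which is what keeps it from vanishing even when F's layers squash it.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def residual_block(x, W1, W2):
        # F(x): two small layers (W1, W2 are hypothetical weights)
        h = relu(x @ W1)
        f = h @ W2
        # identity shortcut: the "+ x" gives gradients a direct path back
        return x + f

    rng = np.random.default_rng(0)
    x = rng.standard_normal((1, 64))
    W1 = rng.standard_normal((64, 64)) * 0.01
    W2 = rng.standard_normal((64, 64)) * 0.01
    y = residual_block(x, W1, W2)   # even if F's output is tiny, y ~ x survives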

This is kind of related to solving the vanishing gradient issue in RNNs by using additive recurrent architectures like LSTMs and GRUs.
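Concretely, the additive part in an LSTM is the cell-state update c_t = f_t * c_{t-1} + i_t * g_t: the previous state is carried forward through an elementwise gate instead of being pushed through a squashing matrix multiply at every step. A rough single-step sketch in NumPy (the gate layout and sizes are just for illustration, not any framework's API):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, b):
        # W projects [x, h_prev] onto the four gates; sizes are illustrative
        z = np.concatenate([x, h_prev]) @ W + b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        # additive update: c_prev flows through a gate, not a weight matrix
        c = f * c_prev + i * g
        h = o * np.tanh(c)
        return h, c

    rng = np.random.default_rng(0)
    d, H = 8, 16
    W, b = rng.standard_normal((d + H, 4 * H)) * 0.1, np.zeros(4 * H)
    h, c = lstm_step(rng.standard_normal(d), np.zeros(H), np.zeros(H), W, b)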

Alternatively it's possible to use concatenative skip connections as in DenseNets: https://arxiv.org/abs/1608.06993
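For completeness, a toy sketch of a dense block (1-D features and made-up layer widths): every layer consumes the concatenation of all earlier outputs, so early layers again get short gradient paths.

    import numpy as np

    def dense_block(x, weights):
        # each layer consumes the concatenation of everything produced so far
        features = [x]
        for W in weights:
            inp = np.concatenate(features, axis=-1)
            out = np.maximum(0.0, inp @ W)   # ReLU layer
            features.append(out)
        return np.concatenate(features, axis=-1)

    rng = np.random.default_rng(0)
    k, d = 12, 32                      # growth rate and input width (illustrative)
    weights = [rng.standard_normal((d + i * k, k)) * 0.05 for i in range(3)]
    x = rng.standard_normal((d,))
    y = dense_block(x, weights)        # width grows to d + 3*k = 68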

Still, using 1000 layers is useless in practice. State-of-the-art image classification models are in the range of 30-100 layers, with residual connections and a number of channels per layer that varies with depth so as to keep the total number of trainable parameters tractable. The 1000-layer nets are just interesting as a memory scalability benchmark for DL frameworks and to validate empirically that the optimization problem is feasible, but are of no practical use otherwise (as far as I know).
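To give a sense of the parameter-count trade-off (back-of-the-envelope only): a 3x3 convolution with C_in input and C_out output channels has 9 * C_in * C_out weights, so widening layers costs far more parameters than stacking extra layers of the same width.

    def conv3x3_params(c_in, c_out):
        # weights only; biases and batch-norm parameters ignored
        return 9 * c_in * c_out

    print(conv3x3_params(64, 64))     # 36,864 -> fifty of these is ~1.8M params
    print(conv3x3_params(512, 512))   # 2,359,296 -> one wide layer dwarfs them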


Thank you!


Vanishing gradients aren't the same thing as memory efficiency. The memory mirroring option is what allows this extremely efficient memory usage, at the cost of roughly 30% more computation.
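As far as I understand it, the mirroring is essentially activation recomputation (gradient checkpointing): the forward pass keeps only a few checkpoint activations and the backward pass recomputes the rest segment by segment, trading the extra compute for much lower memory. A framework-agnostic sketch of the idea (not MXNet's actual implementation; weight gradients are left out to keep it short):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def forward(x, Ws, every=4):
        # keep only every `every`-th activation instead of all of them
        saved = {0: x}
        h = x
        for i, W in enumerate(Ws, start=1):
            h = relu(h @ W)
            if i % every == 0:
                saved[i] = h
        return h, saved

    def backward(grad_out, Ws, saved):
        # walk backwards over the checkpointed segments, recomputing each
        # segment's activations from its checkpoint (weight grads omitted)
        bounds = sorted(saved.keys()) + [len(Ws)]
        g = grad_out
        for start, end in zip(reversed(bounds[:-1]), reversed(bounds[1:])):
            acts = [saved[start]]
            for W in Ws[start:end]:
                acts.append(relu(acts[-1] @ W))
            for W, a_out in zip(reversed(Ws[start:end]), reversed(acts[1:])):
                g = (g * (a_out > 0)) @ W.T   # backprop one ReLU + matmul layer
        return g

    rng = np.random.default_rng(0)
    Ws = [rng.standard_normal((32, 32)) * 0.1 for _ in range(12)]
    x = rng.standard_normal((1, 32))
    y, saved = forward(x, Ws)                 # stores 4 tensors instead of 13
    dx = backward(np.ones_like(y), Ws, saved)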


Yes, but that's not what I asked about.


Vanishing gradients are addressed through architectural choices: ReLU activations instead of sigmoid or tanh, batch normalization, and (for recurrent nets) LSTMs.

These are orthogonal to memory management and neural net framework choices.
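The activation part is easy to see numerically: sigmoid's derivative never exceeds 0.25, so a chain of sigmoid layers multiplies the gradient down towards zero, while ReLU's derivative is exactly 1 wherever the unit is active. A quick illustration (activation derivatives only, ignoring the weight matrices):

    import numpy as np

    def sigmoid_grad(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)

    layers = 50
    print(sigmoid_grad(0.0))          # 0.25, sigmoid's best case
    print(0.25 ** layers)             # ~7.9e-31: gradient after 50 sigmoid layers
    print(1.0 ** layers)              # 1.0: ReLU on its active (z > 0) region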



