
Thanks so much!

And yes, that's quite true. When the parameter gradients don't quite vanish, the equation

<g_x, d x / d eps> = <g_y, d y / d eps>

becomes

<g_x, d x / d eps> = <g_y, d y / d eps> - <g_theta, d theta / d eps>

where g_theta is the gradient of the loss with respect to the parameters theta.

In defense of my hypothesis that interesting approximate conservation laws exist in practice, I'd argue that maybe parameter gradients at early stopping are small enough that the last term is pretty small compared to the first two.
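
If I'm reading the setup right (y = f(x, theta), with g_x and g_theta obtained by backprop through the same layer), the corrected identity is just the chain rule seen through a JVP/VJP pair, so it's cheap to check numerically. Here's a minimal sketch in JAX; the relu layer, the toy downstream loss, and the rescaling generator x -> exp(eps) x, W -> exp(-eps) W are stand-ins I made up for illustration, not anything from the original post:

    import jax
    import jax.numpy as jnp

    key = jax.random.PRNGKey(0)
    k1, k2, k3 = jax.random.split(key, 3)
    W = jax.random.normal(k1, (4, 3))   # layer parameters ("theta")
    x = jax.random.normal(k2, (3,))     # layer input
    v = jax.random.normal(k3, (4,))     # fixed vector defining a toy downstream loss

    def layer(W, x):
        return jax.nn.relu(W @ x)

    def loss(W, x):
        # Smooth function of the layer output; stands in for the rest of the network.
        return jnp.sum(v * layer(W, x) ** 2)

    # Loss gradients with respect to parameters, input, and output.
    g_W, g_x = jax.grad(loss, argnums=(0, 1))(W, x)
    y = layer(W, x)
    g_y = jax.grad(lambda y_: jnp.sum(v * y_ ** 2))(y)

    # Generator of the transformation at eps = 0:
    #   x(eps) = exp(eps) x, W(eps) = exp(-eps) W
    #   => dx/deps = x, dW/deps = -W, and dy/deps via the chain rule (a JVP).
    dx_deps, dW_deps = x, -W
    _, dy_deps = jax.jvp(layer, (W, x), (dW_deps, dx_deps))

    lhs = jnp.vdot(g_x, dx_deps)
    rhs = jnp.vdot(g_y, dy_deps) - jnp.vdot(g_W, dW_deps)
    print(lhs, rhs)                     # agree up to float error
    print(jnp.vdot(g_W, dW_deps))       # size of the correction term

With this particular generator dy/deps vanishes identically, so the check collapses to <g_x, x> = <g_W, W>; a generator that doesn't leave y fixed exercises all three terms.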

On the other hand, stepping back, the condition that our network parameters are approximately stationary for a loss function feels pretty... shallow. My impression of deep learning is that an optimized model _cannot_ be understood as just "some solution to an optimization problem," but is more like a sample from a Boltzmann distribution which happens to concentrate a lot of its probability mass around _certain_ minimizers of an energy. So, if we can prove something that is true for neural networks simply because they're "near stationary points", we probably aren't saying anything very fundamental about deep learning.
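
To make that last point concrete with a 1-D toy of my own (not from the post): give the "energy" two equally deep minima, one sharp and one flat. The Gibbs density exp(-L/T) puts almost all of its mass on the flat basin, so "a sample from the distribution" already singles out certain minimizers in a way that "some solution to an optimization problem" does not:

    import jax.numpy as jnp

    theta = jnp.linspace(-4.0, 4.0, 20001)
    dtheta = theta[1] - theta[0]

    # Two basins with the same minimum value L = 0: one sharp, one flat.
    L = jnp.minimum(50.0 * (theta + 2.0) ** 2,   # sharp minimum at theta = -2
                    0.5 * (theta - 2.0) ** 2)    # flat minimum at theta = +2

    T = 0.1
    p = jnp.exp(-L / T)
    p = p / (p.sum() * dtheta)                   # normalize to a density

    mass_sharp = jnp.sum(jnp.where(theta < 0.0, p, 0.0)) * dtheta
    mass_flat = jnp.sum(jnp.where(theta >= 0.0, p, 0.0)) * dtheta
    print(float(mass_sharp), float(mass_flat))   # roughly 0.09 vs 0.91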




Your work here is so beautiful, but perhaps one lesson is that growth and learning result where symmetries are broken. :-D



