And yes, that's quite true. When the parameter gradients don't quite vanish, the equation
<g_x, d x / d eps> = <g_y, d y / d eps>
becomes
<g_x, d x / d eps> = <g_y, d y / d eps> - <g_theta, d theta / d eps>,
where g_theta is the gradient of the loss with respect to the parameters theta.
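As a sanity check, the identity above can be verified numerically in a toy setting. The sketch below (my own construction, not from the original discussion) takes a single linear layer y = W x with the scaling symmetry x -> exp(eps) x, W -> exp(-eps) W, under which y is invariant, so the <g_y, d y / d eps> term is zero and the identity reduces to <g_x, d x / d eps> = -<g_W, d W / d eps>:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # toy linear layer, y = W x
x = rng.normal(size=4)

# Under the scaling symmetry x -> exp(eps) x, W -> exp(-eps) W,
# y = W x is invariant. At eps = 0:
#   d x / d eps = x,  d W / d eps = -W,  d y / d eps = 0.
y = W @ x
g_y = y                     # gradient of the toy loss L = ||y||^2 / 2 w.r.t. y
g_x = W.T @ g_y             # loss gradient pulled back to x
g_W = np.outer(g_y, x)      # loss gradient pulled back to W

lhs = g_x @ x                        # <g_x, d x / d eps>
rhs = 0.0 - np.sum(g_W * (-W))      # <g_y, d y / d eps> - <g_W, d W / d eps>
assert np.isclose(lhs, rhs)
```

Both sides work out to y^T W x, so the identity holds exactly here; with a nonzero g_theta term the right-hand side picks up the correction discussed above.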
In defense of my hypothesis that interesting approximate conservation laws exist in practice: at early stopping, the parameter gradients may still be small enough that the last term is negligible compared to the first two.
On the other hand, stepping back, the condition that our network parameters are approximately stationary for a loss function feels pretty... shallow. My impression of deep learning is that an optimized model _cannot_ be understood as just "some solution to an optimization problem," but is more like a sample from a Boltzmann distribution which happens to concentrate a lot of its probability mass around _certain_ minimizers of an energy. So, if we can prove something that is true for neural networks simply because they're "near stationary points", we probably aren't saying anything very fundamental about deep learning.