I wonder if an energy and work metric could be derived for gradient descent. This might be useful for a more rigorous approach to hyperparameter tuning, and maybe for characterizing the data being learned. We say that some datasets are harder to learn, or we measure difficulty by the overall compute needed to hit a quality benchmark. A measure more intrinsic to the data would be a step forward.
As in ANN backprop, gradient descent can use a momentum term to avoid getting stuck in local minima. The treatment was only heuristically physical when I learned it; perhaps it has been developed further since. Maybe constraining the momentum to carry only a "real" energy would align it with an ability-to-do-work calculation. That might also help with ensemble/Monte Carlo methods, by maintaining an energy account across the ensemble.
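A minimal sketch of what I mean, assuming a toy 1-D loss, arbitrary step size and momentum values, and the (speculative) convention of treating the momentum buffer as the velocity of a unit-mass particle:

```python
import numpy as np

# Speculative sketch, not an established method: heavy-ball momentum descent
# on a toy non-convex loss, with a running "energy account" -- kinetic energy
# of the momentum buffer and cumulative work done by the loss "force".

def loss(theta):
    return theta**2 + 2.0 * np.sin(3.0 * theta)   # toy loss with local minima

def grad(theta):
    return 2.0 * theta + 6.0 * np.cos(3.0 * theta)

lr, beta = 0.05, 0.9       # step size and momentum coefficient (assumed values)
theta, v = 3.0, 0.0        # initial parameter and velocity
total_work = 0.0           # cumulative work = sum of force * displacement

for step in range(200):
    g = grad(theta)
    v = beta * v - lr * g          # heavy-ball momentum update
    theta_new = theta + v
    total_work += -g * (theta_new - theta)   # "force" -grad times displacement
    theta = theta_new
    kinetic = 0.5 * v**2           # kinetic energy, unit-mass convention

print(f"theta={theta:.3f} loss={loss(theta):.3f} "
      f"kinetic={kinetic:.4f} work={total_work:.3f}")
```

To first order, each step's work term is the drop in loss, so the running total is roughly the "useful work" extracted so far, while the kinetic term is the energy parked in the momentum buffer that could still do work on the way down.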