Gradient Descent: The Ultimate Optimizer

samatman · on Oct 3, 2019

Can’t pass on the opportunity to share one of my favorite (and topical!) paper titles:

sdan · on Oct 3, 2019

> We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum.

Made me laugh. Can't believe this hasn't been done yet. Maybe I should stop looking at ML research upon a pedestal (regardless, OpenAI's PPO and TRPO papers were amazing, not to mention their Learning with Dexterity one)

gwern · on Oct 3, 2019

No, it's been done. For example, "Gradient-based Hyperparameter Optimization through Reversible Learning" https://arxiv.org/abs/1502.03492 , Maclaurin et al 2015 (one of their cites). The idea is pretty obvious: of course you'd like to learn the hyperparameters like they were any other parameter. But that's easier said than done.

The problem is that, well, to backpropagate through a hyperparameter, you would need to, say, track how it affects every iteration throughout the entire training run, rather than simply tracking one parameter through a single iteration on a single datapoint. And it's difficult enough to do gradient descent on a single hyperparameter, so it hardly helps to start talking about doing entire stacks! If you can't really do one, doing ad infinitum probably isn't going to work well either.

If you look at their experiment, they're doing a 1 hidden-layer FC NN on MNIST. (Honestly, I'm a little surprised that stacking hyperparams works even on that small a problem.)

Mehdi2277 · on Oct 3, 2019

Reading the paper, instead of computing the gradient across the training run they compute the gradient for a hyper parameter after each batch and update the hyper parameter then. That keeps the computational cost pretty light. Stacking gradient descent 50 times on itself only costs about double normal gradient descent. That's specific to their experiment, but when done on bigger models the added cost for computing the hyperparameter derivatives should become a smaller fraction.

I'm surprised by the lack of any bigger experiments (imagenet would be nice, but even cifar10 would help) given how computationally light this was. Also surprised that an 11ish author stanford paper did their experiments on 1 cpu.

mlthoughts2018 · on Oct 3, 2019

This is just Bayesian modeling. Move the parameters into the model, push the hyperparameters into a prior. Move the hyperparameters into the model, push the hyperhyperparameters into a hyperprior, etc.

Choosing hyperparameters from minor tuning or literature review or domain expertise or whatever is just a form of creating a prior.

oh_sigh · on Oct 3, 2019

It's been thought of, but the standard view was that it was way too expensive to perform when humans are pretty good at fiddling with them to get good enough results after a little bit of practice.

PeterisP · on Oct 3, 2019

Ah, the GDGS (gradient descent by grad student) approach where you estimate the gradient direction using an educated guess, tweak the system towards that, run an experiment, write down the results and repeat until you're happy (or run out of time).

CardenB · on Oct 3, 2019

See also: https://news.ycombinator.com/item?id=21142456

Has been done before

tagh · on Oct 3, 2019

This is already done in Gaussian process classification / regression

jupp0r · on Oct 3, 2019

“Gradient Descent: The Ultimate Optimizer”

Did they get stuck in a local optimum?

Straw · on Oct 3, 2019

Unfortunately, the 1-step optimal learning rate often differs massively from the long-horizon best choice:

https://arxiv.org/abs/1803.02021

Due to the fact that larger LRs can result in worse immediate performance but better progress in low signal-to-noise ratio directions.

0-_-0 · on Oct 3, 2019

I suspect that's because higher LR leads to better exploration of the opt. surface, i.e. it works as an implicit regularizer. The ideal solution would be to develop better regularizers to go with the better optimizers, instead of relying on the noise in the worse optimizer for implicit regularization.

Straw · on Oct 3, 2019

I completely agree, we shouldn't be depending on our optimizers do some approximate Bayesian inference- an optimizer should optimize only.

However, I think it's a different effect- even purely in terms of optimizing the training loss, on a quadratic (with noisy gradients), the short-horizon bias effect exists.

axiom92 · on Oct 3, 2019

Interesting!

Relevant work:

1. What's one of the most important hyperparameters? Architecture: https://arxiv.org/abs/1806.09055

2. Dropout: https://papers.nips.cc/paper/5032-adaptive-dropout-for-train...

3. Learning activation thresholds: https://datascience.stackexchange.com/questions/18583/what-i...

shoeffner · on Oct 3, 2019

Regarding 2: I saw a talk yesterday by Alex Hernández-García in which he showed that dropout and weight decay can be replaced with data augmentation in many cases. The relevant paper by him and König is https://arxiv.org/abs/1806.03852v2

edit: changed to /abs/ instead of /pdf/ link

skunkworker · on Oct 2, 2019

So we could see Adam optimization with hyper-parameter tuning by Adam? But IIRC this is never guaranteed to always find the global minima/maxima, so why would it in this scenario? Or is this just finding a way to let your optimizer self optimize?

myrryr · on Oct 3, 2019

Pretty much. pick something stupid, but good enough.

What IS interesting, is that the problems all end up in the same place after a few levels of recursion.

mcguire · on Oct 2, 2019

Because gradient descent never gets stuck in a suboptimal solution?

gHosts · on Oct 3, 2019

If I understand what they are looking at, they are looking at problems that are simple in the sense of having no local optima, but complex in the sense of having very high dimension.

ie. It's not a problem for the class of problems they are optimizing.

I've been out of that domain for awhile... but when I did a lot of optimizing / fitting in a high dimensional space with a function that had lots of local optima... I found the downhill simplex method with simulated annealing was most effective / robust.

dleary · on Oct 3, 2019

There is a problem with our intuition about gradient descent that blinds us, a little bit.

For a local minimum to exist, that means that the function we're looking at has to have a local minimum when sliced in every dimension.

In our intuition, this is relatively common, because we usually imagine 2d manifolds embedded in 3-space (ie, "hill climbing"). And when there's only 2 dimensions, it's not hard to have a "bowl", where the function curves upward in both dimensions.

In fact, a "saddle point", where one of the dimensions curves up and the other curves down, feels "unlikely" to most peoples' intuition, because we don't often run into physical objects with that shape.

But, saddle points in higher dimensions "should be" far more common than local minima. You only need one dimension to curve downward to have a "hole" where the "water" can escape and continue flowing downhill.

Gradient descent easily gets stuck in 2 dimensions, and so we are often surprised by how well it performs in higher dimensions.

wnoise · on Oct 3, 2019

Yes, saddle points are far more common than local minima in high dimension. Unfortunately they're really good at slowing down naive gradient descent...

naniwaduni · on Oct 3, 2019

which is great as long as you choose your problems based on what your tools are good at

dodobirdlord · on Oct 3, 2019

Local minima basically don't exist in high dimensional spaces, so in practice the tools work for a wide variety of problems.

gHosts · on Oct 8, 2019

Definitely not true. Not even close.

dodobirdlord · on Oct 10, 2019

Not much of a response. Care to expand here? This is a well-understood principle in the existing body of research on convergence in high dimensional spaces.

sp332 · on Oct 3, 2019

This has been very lucrative for computers so far.

adyavanapalli · on Oct 3, 2019

I'm sorry that I can't find the reference for this right now, but I read a paper a while back showing that (particularly stochastic) gradient descent usually finds good solutions (and that these solutions are very close to each other), although not absolute extrema, to highly dimensional neural networks. Whats more, you usually want to avoid finding absolute extrema solutions because they often suffer from overfitting behaviors.

SheinhardtWigCo · on Oct 2, 2019

Don’t you, too, sometimes?

monocasa · on Oct 2, 2019

Depends on my strategy. For instance a superoptimizer tries all possible options and selects the best one.

myrryr · on Oct 3, 2019

of course picking to use a superoptimizer itself may not be optimal, and so it goes :)

I think Don’t you, too, sometimes? As a response to that has a lot more going for it than it first looks like.

I mean, I look at it, and the choices I make in my life, and think, yeah.... I really do.

linuxdude314 · on Oct 3, 2019

It is nice to see someone doing this!

This reminds me of using GD for tuning PID (which has been done for a while).

https://www.researchgate.net/publication/287359696_PID_contr...

Rainymood · on Oct 3, 2019

I couldn't find the code anywhere on the web (nicely formatted, easy to use) so I tried to make the code more easily accessible here [1]

https://github.com/Rainymood/Gradient-Descent-The-Ultimate-O...

bravura · on Oct 3, 2019

There are some cites to similar recent work in this thread, but even older work is Stochastic Meta Descent.

This inspired my consulting company / ML q+a site name MetaOptimize and it’s motto: optimizing the process of optimizing the process of...

codesushi42 · on Oct 3, 2019

But then who tunes the hyperparameters for the gradient optimizer?

Jaxan · on Oct 3, 2019

As said in the abstract: another gradient optimizer!

codesushi42 · on Oct 3, 2019

But then who tunes that optimizer?

jchook · on Oct 4, 2019

mikepalmer · on Oct 3, 2019

11 authors, impressive!

jchook · on Oct 3, 2019

One for each gradient descent.

ur-whale · on Oct 3, 2019

I thought it was, to quote "and so on, ad infinitum"?

Maybe they chopped the long tail of authors?

bigred100 · on Oct 3, 2019

Hopefully this isn’t useful for anything because it’s the dumbest thing I’ve seen in my life