
No, it's been done. For example, "Gradient-based Hyperparameter Optimization through Reversible Learning" https://arxiv.org/abs/1502.03492 , Maclaurin et al 2015 (one of their cites). The idea is pretty obvious: of course you'd like to learn the hyperparameters as if they were any other parameter. But that's easier said than done.

The problem is that to backpropagate through a hyperparameter, you need to track how it affects every iteration of the entire training run, rather than just tracking an ordinary parameter through a single update on a single batch. And it's difficult enough to do gradient descent on a single level of hyperparameters, so it hardly helps to start talking about entire stacks of them! If you can't really do one level, doing it ad infinitum probably isn't going to work well either.
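To make that concrete, here's a minimal JAX-style sketch (my own illustration, not code from either paper) of what "backpropagating through the training run" means: the hypergradient of the final loss with respect to the learning rate has to flow back through every SGD step, so the whole trajectory ends up in the autodiff graph (or has to be reconstructed, which is what the reversible-learning trick in Maclaurin et al is for).

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    def train(log_lr, w0, batches):
        lr = jnp.exp(log_lr)          # optimize log(lr) so it stays positive
        w = w0
        for x, y in batches:          # every SGD step stays in the autodiff graph
            w = w - lr * jax.grad(loss)(w, x, y)
        x_val, y_val = batches[-1]    # stand-in for a held-out validation batch
        return loss(w, x_val, y_val)

    # d(final loss)/d(log lr) for a 100-step run: reverse mode has to walk back
    # through all 100 updates, storing (or, in Maclaurin et al, reversing) them.
    batches = [(jnp.ones((4, 3)), jnp.ones(4))] * 100
    hypergrad = jax.grad(train)(jnp.log(0.1), jnp.zeros(3), batches)

The memory/compute cost of that outer grad scales with the length of the inner training run, which is why naively doing it for every hyperparameter (let alone stacks of them) gets painful fast.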

If you look at their experiment, they're training a 1-hidden-layer fully-connected NN on MNIST. (Honestly, I'm a little surprised that stacking hyperparams works even on that small a problem.)




Reading the paper: instead of computing the gradient across the whole training run, they compute the gradient for a hyperparameter after each batch and update the hyperparameter right then. That keeps the computational cost pretty light. Stacking gradient descent 50 times on itself only costs about double normal gradient descent. That's specific to their experiment, but on bigger models the added cost of computing the hyperparameter derivatives should become an even smaller fraction.
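Roughly, that per-batch scheme looks like this (a JAX sketch of my reading of it, not the authors' code): one extra gradient per batch, of the post-step loss with respect to the learning rate, instead of unrolling anything across the run. That's why each level of stacking adds only a roughly constant overhead.

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    def step_then_loss(lr, w, x, y):
        # one SGD step using the current learning rate, then re-evaluate the loss
        w_new = w - lr * jax.grad(loss)(w, x, y)
        return loss(w_new, x, y), w_new

    def train(w, lr, hyper_lr, batches):
        for x, y in batches:
            # d(post-step loss)/d(lr): just one extra backward pass per batch,
            # no differentiation through the whole training run
            (_, w), d_lr = jax.value_and_grad(step_then_loss, argnums=0,
                                              has_aux=True)(lr, w, x, y)
            lr = lr - hyper_lr * d_lr  # update the hyperparameter immediately
        return w, lr

    w, lr = train(jnp.zeros(3), 0.1, 1e-3,
                  [(jnp.ones((4, 3)), jnp.ones(4))] * 100)

Stacking then just means treating hyper_lr the same way with its own hyper-hyper learning rate, and so on; each extra level is another cheap per-batch gradient rather than another full unroll.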

I'm surprised by the lack of any bigger experiments (ImageNet would be nice, but even CIFAR-10 would help) given how computationally light this is. Also surprised that an 11ish-author Stanford paper ran its experiments on a single CPU.



