
No, it's been done. For example, "Gradient-based Hyperparameter Optimization through Reversible Learning" https://arxiv.org/abs/1502.03492 , Maclaurin et al 2015 (one of their cites). The idea is pretty obvious: of course you'd like to learn the hyperparameters as if they were any other parameter. But that's easier said than done.

The problem is that to backpropagate through a hyperparameter, you need to track how it affects every iteration of the entire training run, rather than just tracking an ordinary parameter through a single update on a single batch. And it's difficult enough to do gradient descent on a single level of hyperparameters, so it hardly helps to start talking about entire stacks of them! If you can't really do one level, doing it ad infinitum probably isn't going to work well either.
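To make that concrete, here's a minimal JAX-style sketch (my own illustration, not code from either paper) of what "backpropagating through the training run" means: the hypergradient of the final loss with respect to the learning rate has to flow back through every SGD step, so the whole trajectory ends up in the autodiff graph (or has to be reconstructed, which is what the reversible-learning trick in Maclaurin et al is for).

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    def train(log_lr, w0, batches):
        lr = jnp.exp(log_lr)          # optimize log(lr) so it stays positive
        w = w0
        for x, y in batches:          # every SGD step stays in the autodiff graph
            w = w - lr * jax.grad(loss)(w, x, y)
        x_val, y_val = batches[-1]    # stand-in for a held-out validation batch
        return loss(w, x_val, y_val)

    # d(final loss)/d(log lr) for a 100-step run: reverse mode has to walk back
    # through all 100 updates, storing (or, in Maclaurin et al, reversing) them.
    batches = [(jnp.ones((4, 3)), jnp.ones(4))] * 100
    hypergrad = jax.grad(train)(jnp.log(0.1), jnp.zeros(3), batches)

The memory/compute cost of that outer grad scales with the length of the inner training run, which is why naively doing it for every hyperparameter (let alone stacks of them) gets painful fast.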

If you look at their experiment, they're training a 1-hidden-layer fully-connected NN on MNIST. (Honestly, I'm a little surprised that stacking hyperparams works even on that small a problem.)




Reading the paper: instead of computing the gradient across the whole training run, they compute the gradient for a hyperparameter after each batch and update the hyperparameter right then. That keeps the computational cost pretty light. Stacking gradient descent 50 times on itself only costs about double normal gradient descent. That's specific to their experiment, but on bigger models the added cost of computing the hyperparameter derivatives should become an even smaller fraction.
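Roughly, that per-batch scheme looks like this (a JAX sketch of my reading of it, not the authors' code): one extra gradient per batch, of the post-step loss with respect to the learning rate, instead of unrolling anything across the run. That's why each level of stacking adds only a roughly constant overhead.

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    def step_then_loss(lr, w, x, y):
        # one SGD step using the current learning rate, then re-evaluate the loss
        w_new = w - lr * jax.grad(loss)(w, x, y)
        return loss(w_new, x, y), w_new

    def train(w, lr, hyper_lr, batches):
        for x, y in batches:
            # d(post-step loss)/d(lr): just one extra backward pass per batch,
            # no differentiation through the whole training run
            (_, w), d_lr = jax.value_and_grad(step_then_loss, argnums=0,
                                              has_aux=True)(lr, w, x, y)
            lr = lr - hyper_lr * d_lr  # update the hyperparameter immediately
        return w, lr

    w, lr = train(jnp.zeros(3), 0.1, 1e-3,
                  [(jnp.ones((4, 3)), jnp.ones(4))] * 100)

Stacking then just means treating hyper_lr the same way with its own hyper-hyper learning rate, and so on; each extra level is another cheap per-batch gradient rather than another full unroll.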

I'm surprised by the lack of any bigger experiments (ImageNet would be nice, but even CIFAR-10 would help) given how computationally light this is. Also surprised that an 11ish-author Stanford paper ran its experiments on a single CPU.



