With squared loss, where it's easy for the loss to hit zero, then yes, there will be lots of global minima, all at a loss of zero. For losses that asymptote, like log loss, there may be no minimum at all.
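To make that concrete, here's a tiny numpy sketch (toy numbers, nothing from the article): squared loss can be driven to exactly zero, while log loss on a correctly classified point only approaches zero as the logit grows, so its infimum is never actually attained.

```python
import numpy as np

# Squared loss: any prediction that exactly matches the target gives loss 0,
# so an over-parameterized model can have many parameter settings with loss == 0.
y_true, y_pred = 3.0, 3.0
print((y_true - y_pred) ** 2)            # 0.0 -- an attained global minimum

# Log loss on a correctly classified point: -log(sigmoid(logit)), written
# stably as softplus(-logit). It shrinks toward 0 as the logit grows but
# never reaches it, so there is no parameter value where the minimum sits.
for logit in [1.0, 5.0, 10.0, 50.0]:
    print(logit, np.log1p(np.exp(-logit)))
```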
Thanks. I guess my worry was that, once you are doing extremely well and your loss is very low, gradients are no longer independent, and will tend to go mostly up. Is this wrong?
I’m an ML newb but I think this would be true only of a converged model. Your model could always technically diverge in another epoch if the learning rate is high enough and you process a batch of extreme outliers.
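Quick illustration of that, with made-up data: a one-parameter least-squares model that's essentially converged can still be thrown far from the optimum by a single step on a batch of extreme outliers when the learning rate is large.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x                              # true weight is 2.0

def loss(w, x, y):
    return np.mean((w * x - y) ** 2)

def grad(w, x, y):
    return np.mean(2 * x * (w * x - y))

w = 1.999                                # "doing extremely well", loss is tiny
print("loss before:", loss(w, x, y))

x_out = rng.normal(size=8) * 100         # batch of extreme outliers
y_out = 2.0 * x_out + rng.normal(size=8) * 500
lr = 0.1                                 # a rate that was fine on normal batches
w = w - lr * grad(w, x_out, y_out)       # one SGD step on the outlier batch
print("loss after: ", loss(w, x, y))     # the model has jumped far from the optimum
```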
Even then, you may still have converged on a local optimum, which was the takeaway I got from the article.
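A minimal sketch of that idea, on a toy 1-D function chosen only for illustration: gradient descent started on the wrong side of a hump settles into a local minimum and never finds the lower global one.

```python
def f(w):        # non-convex: global min near w ~ -1.3, local min near w ~ 1.1
    return w**4 - 3 * w**2 + w

def df(w):
    return 4 * w**3 - 6 * w + 1

w, lr = 2.0, 0.05                # start in the basin of the local minimum
for _ in range(200):
    w -= lr * df(w)

print(w, f(w))                   # ~1.13 with f ~ -1.07: stuck in the local minimum
print(-1.3, f(-1.3))             # the global minimum is much lower, f ~ -3.5
```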