I've always wondered how the neural network parameter fitting got away with using just steepest descent: In such fully or largely unconstrained optimization, going way back, decades, steepest descent was seen as often poor. The usual, first analogy was finding the bottom of a convex bowl that was long and narrow. Then steepest descent would keep going down hill parallel to essentially the short axis of the bowl and make nearly no progress along the long axis. So, sure, the idea was make a quadratic approximation to the bowl, find the orthogonal eigen vectors and values, and then make some rapid progress to the bottom. That could be conjugate gradients or Newton iteration. Then could build up an approximation to the quadratic approximation and get quasi-Newton iteration. And there were other ideas.
I would guess that somewhere among the hundreds or more parameters to be adjusted, there would be at least two that would determine a bowl such as I described.
But, but, okay, I can guess: The neural network has so many parameters that not doing well on a few of them doesn't make much difference.
They aren't technically using just steepest descent, there is momentum, adaptive rate, stochastic pieces in the most used Adam optimizer, so it's not like it's the most basic version of gradient descent. There are also some structural rules on top, for example if one wants to model discrete variables, they create some relaxations, resembling simplex, tightened by simulated annealing, hoping by the time temperature drops to 0 the optimum is similar to the discrete optimum they wanted to find. It would be super surprising if the gradient descent by itself did all the difficult part, which we know for sure isn't because otherwise shallow fully-connected networks would be sufficient to figure out everything. Instead the whole craft nowadays consists of figuring out how to assemble certain primitive pieces together so that steepest descent actually does a good job, like how convolutions reduce searchable space considerably or embedings allow to group related terms together.
So, "it's in there", in the pot, garlic, tomato paste, olive oil, crushed tomatoes, oregano, basil, rosemary, a tin of anchovies, along with the daily stochastic mystery ingredient!
Another factor is that what one wants from a neural network is the best-generalizing approximation. What is this best generalizing thing almost seems like a property of the world rather than a mathematical curve but still, to various degrees one would guess a rough approximation might turn out to more generalizing than an absolutely exact approximation (since you talking about approxing a set of data with with curve rather than, say, getting the closest approximation to known-correct equation)
Yup. Soon after I heard about curve fitting, I thought, maybe we can always fit a polynomial! So, for 2 points, we can use a polynomial of degree 1, 3 points, degree 2, etc. Think a little and see how to fit n points with a degree n-1 polynomial. Can just write it down or just program it. So, I rushed to do that.
I typed in a few points and got a nice fit; the polynomial went exactly through the points. Hmm ... I couldn't be the first to think of that! So, I typed in points something like a square wave and thought, "Let's see a polynomial fit that!". Well, it did fit -- a polynomial fits a square wave -- gotta be kidding.
So, I plotted out the whole thing. Below, spoiler!!!!
Sure, the polynomial went through the points but between the points it shot off toward either positive infinity or negative infinity and made a lot of progress before turning around.
Then there are spline fits, least squares spline fits, multivariate spline fits, and, ..., ironically, neural network fits.
It's important to recognize that the space of all programs that take x,y as input and then output 1 or 0 is equivalent to the mathematical space of all possible functions. We interpet the programs as indicator functions.
Neural networks are usually smooth because they must be differentiable. Stochastic binary neurons and other techniques take us into a larger space of possible mappings, discontinuous ones.
But yeah, ultimately the model we solve for, will affect what kind of interpolation the solution ends up having.
The biggest question is what kind of interpolating techniques the brain uses. The search for the magic model....
Local minima are basically equivalent to global minimum in the high dimensionalities achieved in deep learning. It's not even just a conjecture by the way. This has been mathematically proven.
Maybe GP is talking about the hypothesis that in high-dimensional space you will probably almost always find /some/ direction which decreases the error function. IIRC this hypothesis is well supported, but not yet proven. So the real problem may not be local minima but saddle points that just /appear/ to be minima due to slow optimisation progress. For more details, check out https://arxiv.org/abs/1406.2572
No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis [1]. Authored by Rong Ge, an absolute prodigy in the field if I might say. I'm sure we'll be seeing a great deal more from him in solidifying the theory behind the current day trial and error sort of deep learning we have today.
That theorem only implies global convergence for convex or pseudoconvex functions. For a generic function it only guarantees convergence to a local minimum.
I recall seeing some research on this topic but I got the impression (perhaps wrongly) it has to make so many simplifying assumptions to get a result that it’s not clear how much carries over to real systems where not all those assumptions hold.
The 'bowl' you described are these saddle points, and this 2014 paper was pretty big for tackling it. But, as others pointed out Adam or such optimizers are used more often than steepest descent.
There's a nice characterization of min, max, and saddle points in W. Fleming, Functions of Several Variables. It was a question I got asked once on an oral exam in optimization.
It's an old book; the results are still true! Likely most good research libraries have a copy. Photocopy a few pages and will have that little subject in good shape for a long time.
Fleming was long in the Division of Applied Math at Brown. He also wrote more advanced books, e.g., on optimal control.
He also does a lot on convexity and does the inverse and implicit function theorems and applies them to Lagrange multipliers. If get very far in optimization, especially with constraints, then may like Lagrange multipliers.
See also elsewhere the simpler but shockingly commonly useful Lagrange multiplier approach of Hugh Everett, right, THAT Everett, the quantum mechanics many worlds guy, got his Ph.D. in physics IIRC under J. Wheeler at Princeton.
He did the Lagrange multiplier work for optimum allocation of resources questions in, e.g., military situations near DC. Started Lambda Corporation to do that work where Lambda is commonly the variable used for Lagrange multipliers.
At times I've guessed that there might be a good use of Lagrange multipliers for taxi ridesharing where want to maximize revenue from sharing but also use a path that is a solution to the relevant traveling salesman problem.
Optimizing a neural network is not really minimization. Sampling the training data in batches introduces a stochastic element, so a better analogy is perhaps something like Langevin dynamics. Decreasing the learning rate would then very roughly correspond to decreasing temperature, as in simulated anealing. (These analogies are only approximate.)
This title is really confused - backpropagation is one implementation of automatic differentiation, and yields a gradient. What you do with that gradient next is up to you - you can just follow it (steepest descent), or try other fancier second order techniques.
Can you get second order with only the gradient? Even fancy accelerated gradient descents are still first order. I suppose there are BFGS-like quasi Newton methods, but those are still building up estimates of the second order info over multiple iterations.
You can get closer to second order if your error is a sum of squares of the form 0.5 * f^T * f, in which case you case compute the jacobian J of the fitness f, and approximate your hessian as J^T * J. That's the Gauss-Newton method and it's very effective if a linear approximation of f holds locally.
That's an approximation but that's arguably a second order method, since it acknowledges the nonlinearity coming from the squared norm of f.
It is because of articles/books like this one which try to complicate basic ideas more than they need to be. People have forgotten how to convey concepts in a simple manner before formalizing with symbolic notation. As an example, i really understood the basics of NN only after reading Sandhya Samarasinghe's book "Neural Networks for Applied Sciences and Engineering: From Fundamentals to Complex Pattern Recognition" which explains everything simply and directly.
Most people couldn't pass calculus 101 so it's mysticism for them. Operations research people on the other hand must be shaking their heads that somebody still uses it and probably can't believe whole state-of-art is based off their most stupid method.
You known, ML researchers are aware of the enormous body of research in optimization. A lot of these methods simply do not work or scale well to deep neural networks. Second order methods have been tried and are not worth the cost.
> We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients
with little memory requirement. The method computes individual adaptive learning rates for
different parameters from estimates of first and second moments of the gradients
Most of the popular variants of SGD use approximations of the hessian in one way or another
Not to be peculiar, but I don't know if approximating the hessian using the gradient counts as a second order method. I was talking about "full-blown" second order methods where you compute de hessian through AD.
Furthermore, I don't think by "moment of the gradients" they actually mean second derivatives.
Also from the paper: We introduce Adam, an algorithm for first-order gradient-based optimization ofstochastic objective functions...
It's written right in the abstract that the authors consider it a first-order method.
That gradient descent works so well for the non-convex loss landscapes involved is quite surprising and intriguing. There are a few OK handwavy explanations; supposedly most local minima of high-dimensional losses are saddle points, perhaps 'good models' have high entropy.
I would guess that somewhere among the hundreds or more parameters to be adjusted, there would be at least two that would determine a bowl such as I described.
But, but, okay, I can guess: The neural network has so many parameters that not doing well on a few of them doesn't make much difference.