Nice and clear way to think about it (and closer to the kind of numerical analysis I drilled most, too). I wonder: if we have a good polynomial description of f, can we "just" use a Taylor integration method, approximating the inverses needed along the way with some high-order approximation?
I need to write something about conjugate gradients. The first thing we did in my optimisation class at CIMAT was gradient descent and showed how it was so slow compared to everything else. In practical applications, there's way too much magic in choosing the so-called "learning rate" (known as "step size" in numerical optimisation). Even choosing an optimal step size per iteration (by doing a line search), gradient descent is still slow. Conjugate gradients are a bit more difficult to understand, but they're not that hard to implement if you just blindly copy a few formulae, and they are a dramatic speedup.
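To show how little "blind copying" is actually needed, here's a minimal sketch of the linear conjugate gradient method for a quadratic f(x) = 0.5·xᵀAx − bᵀx with A symmetric positive definite (the helper name and the toy A, b below are my own, just for illustration):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Minimise f(x) = 0.5*x@A@x - b@x for symmetric positive definite A,
    i.e. solve A x = b, with the linear conjugate gradient method."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                  # residual = negative gradient of f at x
    p = r.copy()                   # first search direction is steepest descent
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)  # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p   # make the next direction A-conjugate
        rs_old = rs_new
    return x

# Toy example: a random symmetric positive definite system.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))   # residual norm should be tiny
```

On quadratics this converges in at most n steps in exact arithmetic, which is where the "dramatic speedup" over fixed-step gradient descent comes from.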
All of these blog posts about gradient descent feel like everyone keeps going on about what a great algorithm bubblesort is because it's so easy to understand and implement.
Claiming that gradient descent is a great algorithm certainly wasn't the goal of this article. But GD is the basis from which most other iterative algorithms stem, and it's worthwhile having a good understanding of its inner workings as a prerequisite for more advanced algorithms.
I realise that, but since a lot more gets written about steepest descent than about everything else, it also gets implemented and used a lot more often, even when other alternatives are readily available and already implemented. For example, Matlab's trainscg method is relatively obscure, so it goes unused and is rarely re-implemented.
There's an implicit endorsement in these blog posts. People wouldn't be spilling all this ink nowadays on plain adalines, even if they're building blocks for backprop networks, right? So by writing so much about it, people get the impression that it has to be studied and implemented very carefully, to the exclusion of better methods.
As the above poster said, if you can introduce people to a topic by explaining a simplified or 'naive' solution/algorithm, then that could be a good springboard to learn more about the topic. Which is why there should be more comments presenting improvements and alternatives, rather than criticising what is obviously meant as a primer on optimisation. From the intro:
> Gradient descent is a standard tool for optimizing complex functions iteratively within a computer program. Its goal is: given some arbitrary function, find a minimum.
> For some small subset of functions - those that are convex - there's just a single minimum, which also happens to be global. For most realistic functions there may be many minima, so most minima are local.
> Making sure the optimization finds the "best" minimum and doesn't get stuck in sub-optimal minima is out of the scope of this article.
> Here we'll just be dealing with the core gradient descent algorithm for finding some minimum from a given starting point.
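For readers who haven't seen it, that core algorithm is only a few lines; a minimal sketch (the toy objective and fixed step size are my own choices, not the article's):

```python
import numpy as np

def gradient_descent(grad, x0, step_size=0.05, tol=1e-8, max_iter=10_000):
    """Follow the negative gradient from x0 until the gradient is tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - step_size * g
    return x

# Example: minimise f(x, y) = (x - 3)**2 + 10 * (y + 1)**2
grad_f = lambda p: np.array([2 * (p[0] - 3), 20 * (p[1] + 1)])
print(gradient_descent(grad_f, [0.0, 0.0]))   # approaches (3, -1)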
If only there were a faster known way to get the inverse of a matrix... Gradient descent seems like an ugly hack compared to that (having taken an ML class)
It reminds me of taking the derivative by manually measuring the slope between 2 points on the curve, instead of, you know, directly getting the derivative
Taking the inverse of a matrix only solves problems which can be described in a specific way. There are situations where you want to optimize an arbitrary function.
Also taking the inverse of a matrix isn't always easy. For example for very large matrices.
There are also problems where the optimal solution changes over time (e.g. trajectory-following control). The previous solution gives a good guess at each timestep (t > 0), and improving that with gradient descent can be more performant than completely solving for a matrix inverse each time.
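A toy sketch of that warm-starting pattern (the drifting least-squares target, step size, and step counts below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 10))

def grad(x, b):
    # Gradient of 0.5 * ||A @ x - b||^2 with respect to x.
    return A.T @ (A @ x - b)

step = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step size for this quadratic
x = np.zeros(10)                          # warm start, carried across timesteps
for t in range(50):
    b = np.sin(0.1 * t) * np.ones(100)    # slowly drifting target
    for _ in range(5):                    # a few cheap gradient steps per timestep
        x -= step * grad(x, b)

# Compare against the exact least-squares solution for the final target.
exact = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - exact))   # typically small: each timestep starts near the new optimum
```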
> Also taking the inverse of a matrix isn't always easy. For example for very large matrices.
Yeah, I know. I am saying that I think there is probably an undiscovered way to do faster matrix inversion, based on Occam's razor (which I'm taking to mean that "complex and ugly" suggests there is probably an undiscovered "simpler and more elegant" alternative... something I've seen over and over again in other areas... gradient descent being the "complex and ugly" here)
> The previous solution gives a good guess at each timestep (t > 0), and improving that with gradient descent can be more performant than completely solving for a matrix inverse each time.
That's a good point, but perhaps the hypothetical matrix inversion I'm assuming exists also allows you to do computationally efficient differential inversions based on small changes to the matrix.
I admit this was just a "gut feeling," but when I first learned about gradient descent vs. matrix inversion I had a very strong gut feeling that there was a better way out there.
My career is programming (where I have followed this "gut feeling" many times, usually successfully), and I did papers on matrices and determinants in high school; that may be informing my "gut".
Not sure if any mathy people are reading this, but I think there's gold to be had here, if you look hard enough
Golub and Van Loan's Matrix Computations is a thick book devoted entirely to methods for working with matrices. It's old, and not "much" has changed in the basics. If there is gold to be had there, it's buried damn deep, and it won't be an easy algorithm at all.
Gradient descent is neither complex nor ugly; it's the most immediate and straightforward way of searching for extrema. Iterative methods in general might be your only shot if the problem is not linear and linearisation is not an option. From a practical point of view, iterative methods let you terminate on your own terms, whereas many direct methods have no useful intermediate result to return if you stop them early.
Many, many, very smart people have worked on exactly this problem and the result is a large, somewhat disjoint, family of techniques for forming/applying approximate inverses in subquadratic time for delicate families of matrices (e.g., discretizations of elliptic partial differential equations).
Keep in mind that many linear systems can be solved in near linear time despite the inverse being dense.
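A small illustration of that last point (a toy tridiagonal system of my own choosing, using scipy's sparse tools): solving the system keeps everything sparse, while the explicit inverse of the same matrix is essentially dense.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2000
# 1-D Laplacian: tridiagonal, only about 3n nonzeros.
A = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

x = spla.spsolve(A, b)   # sparse direct solve, exploits the band structure
print(A.nnz)             # roughly 6,000 nonzeros in A

Ainv = spla.inv(A)       # the explicit inverse, by contrast, is dense
print(Ainv.nnz)          # essentially all n*n = 4,000,000 entries are nonzero
```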
> Many, many, very smart people have worked on exactly this problem
If I applied this logic to anything genuinely new that was ever discovered, then nothing genuinely new would ever be discovered. I don't know what the fallacy there is, but it must be one.
Modern example: many, many, MANY people worked on globally distributed data synchronization (consensus) problems while ensuring their security (Raft and Paxos were some of the solutions), and then Bitcoin came along. Bitcoin hasn't been hacked (although its exchanges have).
I'm just saying my gut says there's more to discover there.
I think your gut feeling needs some serious recalibration. Please consider "the Hilbert matrix" (http://blogs.mathworks.com/cleve/2013/02/02/hilbert-matrices...) and more generally condition numbers (https://en.wikipedia.org/wiki/Condition_number), and in regards to "computationally efficient differential inversions based on small changes to the matrix", consider what happens to a polynomial's roots when you make small changes in the coefficients.
Often you can directly invert, but with current methods it would just be prohibitively expensive computationally.
What I'm trying to say is that Occam's razor seems to suggest that there is a better way to do matrix inversions which we simply haven't discovered yet.
> What I'm trying to say is that Occam's razor seems to suggest that there is a better way to do matrix inversions which we simply haven't discovered yet.
Occam's razor doesn't suggest that. Such a method is an additional assumed entity that adds nothing to our ability to explain what is observed, and is thus the kind of thing Occam's razor calls one to reject, rather than suggests.
Now, belief in the existence of such a method might be a means of satisfying an aesthetic preference for a computationally convenient universe, but that's something very different than Occam's razor.
What about the minimum of a non-differentiable function? I've been using something like the Nelder–Mead method (downhill simplex / amoeba method) [1], but it gets slow near the minimum.
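For reference, scipy wraps Nelder–Mead directly; a minimal sketch on a non-differentiable toy objective of my own choosing (not your actual problem):

```python
import numpy as np
from scipy.optimize import minimize

# Non-differentiable toy objective: a shifted L1 norm plus a small quadratic bowl.
def f(x):
    return np.sum(np.abs(x - np.array([1.0, -2.0]))) + 0.1 * np.sum(x ** 2)

res = minimize(f, x0=[0.0, 0.0], method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 5000})
print(res.x, res.fun)
```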
Suppose you solve the differential equation

  dx/dt = -∇f(x(t))    (1)

Then

  d/dt f(x(t)) = ∇f(x(t)) · dx/dt = -‖∇f(x(t))‖^2 ≤ 0

In other words, if `x(t)` follows a path that solves (1), then `x(t)` follows a path that decreases the value of `f(x)`. The gradient descent algorithm is a numerical approximation to solving (1) using the forward Euler method:

  x_{k+1} = x_k - h ∇f(x_k)
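A quick numerical check of that correspondence (the quadratic and the step size below are my own toy choices): forward Euler steps on (1) are exactly gradient descent iterates, and a more accurate ODE integration of (1) follows roughly the same descending path.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy objective f(x) = 0.5 * x @ A @ x with a simple diagonal A.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x

# Gradient flow (1): dx/dt = -grad f(x), integrated accurately up to t = 1.
sol = solve_ivp(lambda t, x: -grad_f(x), t_span=(0.0, 1.0), y0=[1.0, 1.0])

# Forward Euler on (1) with step h, i.e. gradient descent with step size h.
h = 0.1
x = np.array([1.0, 1.0])
for k in range(10):
    x = x - h * grad_f(x)

print(x)             # gradient descent after 10 steps of size 0.1
print(sol.y[:, -1])  # gradient flow at t = 1.0; roughly similar, since Euler is only first order
```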