An Introduction to Gradient Descent and Linear Regression (atomicobject.com)
62 points by CrocodileStreet on June 26, 2014 | 23 comments



I get that this is supposed to be an introduction to gradient descent, and that the author chose a simple example to illustrate the method (and explained it very well - kudos!). However, they should probably at least mention that the linear regression problem has an analytical solution[0] which is much faster than using gradient descent.
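
For anyone curious, here is a rough sketch of that closed-form route in NumPy (toy data and variable names are mine, not the article's):

  import numpy as np

  # toy data: y is roughly 2*x + 1 plus a little noise
  x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
  y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

  # design matrix with a column of ones for the intercept
  X = np.column_stack([np.ones_like(x), x])

  # normal equations: solve (X'X) beta = X'y directly, no iteration needed
  intercept, slope = np.linalg.solve(X.T @ X, X.T @ y)
  print(slope, intercept)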

The strength of gradient descent is that it can be used for more complicated problems, where an analytical solution isn't known (for example - you can train a neural network using gradient descent! The standard algorithm uses back-propagation to compute the errors associated with each layer of the network, and then the "learning" step shifts each parameter in the direction of those errors - just like gradient descent).
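
(For contrast with the closed form above, a bare-bones sketch of a gradient-descent update for a y = mx + b fit, assuming the usual mean-squared-error cost; names are illustrative, not the article's code:)

  import numpy as np

  def gradient_step(m, b, x, y, learning_rate=0.01):
      # partial derivatives of the mean squared error with respect to m and b
      n = len(x)
      error = (m * x + b) - y
      grad_m = (2.0 / n) * np.sum(error * x)
      grad_b = (2.0 / n) * np.sum(error)
      # move each parameter a small step against its gradient
      return m - learning_rate * grad_m, b - learning_rate * grad_b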

[0] http://en.wikipedia.org/wiki/Linear_regression#Least-squares...


Also, he states that "Lines that fit our data better will result in lower error values". Strictly speaking, that is not true. The error function is a mathematical model for reality; the result is a 'best fit' only in a linear least-squares sense. I.e. you are fitting your model, not necessarily reality - what the data represents.

Sure, that is all implied in the article, and I don't believe that the author is confused at all by this point. But I witness so many people just running a linear regression, claiming it's "optimum", and going on to conclude grossly incorrect things. So I think it is worth stressing (repeatedly) that this is 'optimum' only in the very strict sense of minimizing your error function.

Let's take a problem I am working on at the moment - tracking an object in flight. The data is noisy, but in a complicated way. Yes, you have the typical Gaussian measurement noise, but then you also have outlier measurements that have nothing to do with the object you are trying to track. Blithely applying least squares unduly weights the bad measurements because of the distance squared amplifying the influence of points that lie far away from the real trajectory. I could run gradient descent on the data (we tend to use Levenberg-Marquardt), but boy, the output will not model reality very well. It's optimum, but it is also wrong.
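
(A toy illustration of that effect - not my tracking data, obviously - showing how a single wild measurement drags an ordinary least-squares line off the trend the other points follow:)

  import numpy as np

  x = np.arange(10.0)
  y = 2.0 * x + 1.0                 # clean linear trend
  y_bad = y.copy()
  y_bad[5] = 200.0                  # one outlier, nothing to do with the real trajectory

  # np.polyfit minimizes the sum of squared residuals
  print(np.polyfit(x, y, 1))        # ~[2.0, 1.0]
  print(np.polyfit(x, y_bad, 1))    # fit dragged well away from the trend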

So, for my data, the sentence should read "Lines that fit our data better will result in higher error values". Well, the relationship is more complicated than that, but you get the idea.

I'm not dismissing the article (I liked it), just discussing the next steps you have to start thinking about. If you know you have linear data with Gaussian noise you'll get a meaningful minimum using this approach. If that is not true, then do not trust this to give you meaningful information. Get thee to a University, go (which is what the author also suggests at the end).


Author here. These are all great points. I chose linear regression because it's simple and easy to understand. There is obviously a simple closed form solution to this problem. However, as already mentioned, gradient descent has some nice advantages like scaling. Further, a "close enough" solution/approximation might be sufficient depending on the application.


There are algorithms to add variables and observations to an estimated linear regression model, so scaling is not an advantage of gradient descent. You should mention in the post that there are better ways to estimate this model, even if you're just presenting it as an example.


One tiny improvement to the article would be to change the "visualization of the surface" images to show the part of the surface where the solution is eventually found. The 3D graph cuts off at y = -2, which makes things a little harder to understand.


Agreed. Someone might actually implement linear regression in the way described! That would be embarrassing.

Another thing worth mentioning is the behavior of gradient descent for problems where the Hessian is not well-conditioned (i.e., highly eccentric elliptical contours in the cost function). This would be well-illustrated by a simple example - I think a case where the

  y = a x + b 
line had a very large a would do the trick, because then a and b would be tightly coupled (choose the wrong "a" and the intercept would change a lot, due to the large slope).

The number of iterations will skyrocket, which is why gradient descent is avoided if possible in favor of conjugate gradient or related approaches.
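
(You can see the iteration blow-up without any regression data at all: run plain gradient descent on a quadratic whose Hessian has a large condition number. Rough sketch, numbers picked purely for illustration:)

  import numpy as np

  # quadratic cost 0.5 * w'Hw with an ill-conditioned Hessian (condition number 100)
  H = np.diag([1.0, 100.0])
  w = np.array([1.0, 1.0])
  lr = 0.01                          # step size limited by the largest eigenvalue

  steps = 0
  while np.linalg.norm(H @ w) > 1e-6:
      w = w - lr * (H @ w)           # the gradient of the quadratic is H w
      steps += 1
  print(steps)                       # well over a thousand steps for two parameters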


Actually, although the direct method is often faster than gradient descent, it is not faster than conjugate gradient.

And conjugate gradient is more numerically stable. So the question is, why would anyone competent use the direct method for linear regression?


[deleted]


Just to add to the other comments: you almost never want to invert the X'X matrix. (Which is k by k, where k is the number of regressors.) The QR decomposition is much more reliable. (for those that want a citation: Seber and Lee's Linear regression analysis is great.)
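
Roughly, with throwaway toy data (names and numbers are mine):

  import numpy as np

  rng = np.random.default_rng(0)
  X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # n=100, k=3
  y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=100)

  # QR route: factor X = QR, then solve R beta = Q'y -- no explicit (X'X) inverse
  Q, R = np.linalg.qr(X)
  beta = np.linalg.solve(R, Q.T @ y)
  print(beta)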


Nope, ordinary linear regression can be done in O(n) time and O(1) space.

http://en.wikipedia.org/wiki/Simple_linear_regression
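
A sketch of the running-sums version of the formulas on that page (one pass, constant memory; naive accumulation, so watch for cancellation on real data):

  def simple_linear_regression(pairs):
      # single pass, constant memory: accumulate the sums the closed form needs
      n = sum_x = sum_y = sum_xx = sum_xy = 0.0
      for x, y in pairs:
          n += 1.0
          sum_x += x
          sum_y += y
          sum_xx += x * x
          sum_xy += x * y
      slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
      intercept = (sum_y - slope * sum_x) / n
      return slope, intercept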


That's a common (notation-induced) mistake. You just need to solve a linear system.


I think n is the number of predictor variables, not the data size.


You need it for L1 and L2 regularization, which are popular these days, don't you?


No. L2 regularization (ridge regression) also has an analytical solution, and L1 regularization (lasso) can be more efficiently solved using the LARS (least-angle regression) algorithm.

Gradient descent would work for ridge regression (it would just be slow). It wouldn't even work for lasso, because the penalty function isn't differentiable at the origin.
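
For reference, the ridge closed form is just the normal equations with lambda added to the diagonal - a minimal sketch, names mine:

  import numpy as np

  def ridge_fit(X, y, lam):
      # closed-form ridge regression: solve (X'X + lambda*I) beta = X'y
      # (in practice you usually leave the intercept column unpenalized)
      k = X.shape[1]
      return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)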


This is explained very well and makes each step easy to understand. It's not the best way to do it, but it does explain why gradient descent is awesome and how the intuition works. Hopefully future posts go into greater detail about doing gradient descent and linear regression more efficiently, but for an intro post, this is most excellent.

Well done.


If you want a deeper introduction to this topic, and a surprisingly accessible piece of mathematics, I recommend the classic paper: http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradi...


This was my introduction - I used the techniques described to write an interplanetary trajectory timing optimization tool (finding good launch windows between various celestial bodies). It was an excellent read and eminently transformable into functional code.


It really should mention the closed-form solution; it's quite silly to compute a linear regression this way.

It should also give a nod to robustness: a pair of sufficiently crazy outliers would leave the line nowhere near the points, seemingly in contradiction of the claim that lines that "fit" better will have a lower error value. (They will, but only in the circular sense where you've defined fit in terms of that particular error function.)


"In our linear regression problem, there was only one minimum. This made our error surface convex."

Shouldn't it be the other way around?


Correct... if there is only one minimum, it is unimodal, but not necessarily convex. For instance, if there are inflection points, it will not be convex.


There is also a great course running on Coursera right now covering this:

https://class.coursera.org/ml-006


Do machine learning cats use Nesterov methods? Add some inertia to gradient descent, and boom, kick butt theoretical benefits and super fast in practice.


Yes - see e.g. "On the importance of initialization and momentum in deep learning" from ICML 2013 (http://www.cs.toronto.edu/~fritz/absps/momentum.pdf) for an overview in the context of NNs.
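
The update itself is tiny. A rough sketch of the Nesterov step, roughly in the formulation that paper uses (evaluate the gradient at a look-ahead point; names are mine):

  def nesterov_step(w, v, grad_fn, lr=0.01, momentum=0.9):
      # evaluate the gradient at the look-ahead point w + momentum*v,
      # fold it into the velocity, then take the step
      g = grad_fn(w + momentum * v)
      v = momentum * v - lr * g
      return w + v, v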


Accelerated gradient descent is pretty common everywhere.



