An Introduction to Gradient Descent and Linear Regression (atomicobject.com)
62 points by CrocodileStreet on June 26, 2014 | 23 comments



I get that this is supposed to be an introduction to gradient descent, and that the author chose a simple example to illustrate the method (and explained it very well - kudos!). However, they should probably at least mention that the linear regression problem has an analytical solution[0] which is much faster than using gradient descent.
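
For anyone curious, here is a rough sketch of that closed-form route in NumPy (toy data and variable names are mine, not the article's):

  import numpy as np

  # toy data: y is roughly 2*x + 1 plus a little noise
  x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
  y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

  # design matrix with a column of ones for the intercept
  X = np.column_stack([np.ones_like(x), x])

  # normal equations: solve (X'X) beta = X'y directly, no iteration needed
  intercept, slope = np.linalg.solve(X.T @ X, X.T @ y)
  print(slope, intercept)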

The strength of gradient descent is that it can be used for more complicated problems, where an analytical solution isn't known (for example - you can train a neural network using gradient descent! The standard algorithm uses back-propagation to compute the errors associated with each layer of the network, and then the "learning" step shifts each parameter in the direction of those errors - just like gradient descent).
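
(For contrast with the closed form above, a bare-bones sketch of a gradient-descent update for a y = mx + b fit, assuming the usual mean-squared-error cost; names are illustrative, not the article's code:)

  import numpy as np

  def gradient_step(m, b, x, y, learning_rate=0.01):
      # partial derivatives of the mean squared error with respect to m and b
      n = len(x)
      error = (m * x + b) - y
      grad_m = (2.0 / n) * np.sum(error * x)
      grad_b = (2.0 / n) * np.sum(error)
      # move each parameter a small step against its gradient
      return m - learning_rate * grad_m, b - learning_rate * grad_b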

[0] http://en.wikipedia.org/wiki/Linear_regression#Least-squares...


Also, he states that "Lines that fit our data better will result in lower error values". Strictly speaking, that is not true. The error function is a mathematical model for reality; the result is a 'best fit' only in a linear least-squares sense. I.e. you are fitting your model, not necessarily reality - what the data represents.

Sure, that is all implied in the article, and I don't believe that the author is confused at all by this point. But I witness so many people just running a linear regression, claiming it's "optimum", and going on to conclude grossly incorrect things. So I think it is worth stressing (repeatedly) that this is 'optimum' only in the very strict sense of minimizing your error function.

Let's take a problem I am working on at the moment - tracking an object in flight. The data is noisy, but in a complicated way. Yes, you have the typical Gaussian measurement noise, but then you also have outlier measurements that have nothing to do with the object you are trying to track. Blithely applying least squares unduly weights the bad measurements because of the distance squared amplifying the influence of points that lie far away from the real trajectory. I could run gradient descent on the data (we tend to use Levenberg-Marquardt), but boy, the output will not model reality very well. It's optimum, but it is also wrong.
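
(A toy illustration of that effect - not my tracking data, obviously - showing how a single wild measurement drags an ordinary least-squares line off the trend the other points follow:)

  import numpy as np

  x = np.arange(10.0)
  y = 2.0 * x + 1.0                 # clean linear trend
  y_bad = y.copy()
  y_bad[5] = 200.0                  # one outlier, nothing to do with the real trajectory

  # np.polyfit minimizes the sum of squared residuals
  print(np.polyfit(x, y, 1))        # ~[2.0, 1.0]
  print(np.polyfit(x, y_bad, 1))    # fit dragged well away from the trend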

So, for my data, the sentence should read "Lines that fit our data better will result in higher error values". Well, the relationship is more complicated than that, but you get the idea.

I'm not dismissing the article (I liked it), just discussing the next steps you have to start thinking about. If you know you have linear data with Gaussian noise you'll get a meaningful minimum using this approach. If that is not true, then do not trust this to give you meaningful information. Get thee to a University, go (which is what the author also suggests at the end).


Author here. These are all great points. I chose linear regression because it's simple and easy to understand. There is obviously a simple closed form solution to this problem. However, as already mentioned, gradient descent has some nice advantages like scaling. Further, a "close enough" solution/approximation might be sufficient depending on the application.


There are algorithms to add variables and observations to an estimated linear regression model, so scaling is not an advantage of gradient descent. You should mention in the post that there are better ways to estimate this model, even if you're just presenting it as an example.


One tiny improvement to the article would be to change the "visualization of the surface" images to show the part of the surface where the solution is eventually found. The 3D graph cuts off at y = -2, which makes things a little harder to understand.


Agreed. Someone might actually implement linear regression in the way described! That would be embarrassing.

Another thing worth mentioning is the behavior of gradient descent for problems where the Hessian is not well-conditioned (i.e., highly eccentric elliptical contours in the cost function). This would be well-illustrated by a simple example - I think a case where the

  y = a x + b 
line had a very large a would do the trick, because then a and b would be tightly coupled (choose the wrong "a" and the intercept would change a lot, due to the large slope).

The number of iterations will skyrocket, which is why gradient descent is avoided if possible in favor of conjugate gradient or related approaches.
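
(You can see the iteration blow-up without any regression data at all: run plain gradient descent on a quadratic whose Hessian has a large condition number. Rough sketch, numbers picked purely for illustration:)

  import numpy as np

  # quadratic cost 0.5 * w'Hw with an ill-conditioned Hessian (condition number 100)
  H = np.diag([1.0, 100.0])
  w = np.array([1.0, 1.0])
  lr = 0.01                          # step size limited by the largest eigenvalue

  steps = 0
  while np.linalg.norm(H @ w) > 1e-6:
      w = w - lr * (H @ w)           # the gradient of the quadratic is H w
      steps += 1
  print(steps)                       # well over a thousand steps for two parameters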


Actually, although the direct method is often faster than gradient descent, it is not faster than conjugate gradient.

And conjugate gradient is more numerically stable. So the question is, why would anyone competent use the direct method for linear regression?


[deleted]


Just to add to the other comments: you almost never want to invert the X'X matrix. (Which is k by k, where k is the number of regressors.) The QR decomposition is much more reliable. (for those that want a citation: Seber and Lee's Linear regression analysis is great.)
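
Roughly, with throwaway toy data (names and numbers are mine):

  import numpy as np

  rng = np.random.default_rng(0)
  X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # n=100, k=3
  y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=100)

  # QR route: factor X = QR, then solve R beta = Q'y -- no explicit (X'X) inverse
  Q, R = np.linalg.qr(X)
  beta = np.linalg.solve(R, Q.T @ y)
  print(beta)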


Nope, ordinary linear regression can be done in O(n) time and O(1) space.

http://en.wikipedia.org/wiki/Simple_linear_regression
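
A sketch of the running-sums version of the formulas on that page (one pass, constant memory; naive accumulation, so watch for cancellation on real data):

  def simple_linear_regression(pairs):
      # single pass, constant memory: accumulate the sums the closed form needs
      n = sum_x = sum_y = sum_xx = sum_xy = 0.0
      for x, y in pairs:
          n += 1.0
          sum_x += x
          sum_y += y
          sum_xx += x * x
          sum_xy += x * y
      slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
      intercept = (sum_y - slope * sum_x) / n
      return slope, intercept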


That's a common (notation-induced) mistake. You just need to solve a linear system.


I think n is the number of predictor variables, not the data size.


You need it for L1 and L2 regularization, which are popular these days, don't you?


No. L2 regularization (ridge regression) also has an analytical solution, and L1 regularization (lasso) can be more efficiently solved using the LARS (least-angle regression) algorithm.

Gradient descent would work for ridge regression (it would just be slow). It wouldn't even work for lasso, because the penalty function isn't differentiable at the origin.
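
For reference, the ridge closed form is just the normal equations with lambda added to the diagonal - a minimal sketch, names mine:

  import numpy as np

  def ridge_fit(X, y, lam):
      # closed-form ridge regression: solve (X'X + lambda*I) beta = X'y
      # (in practice you usually leave the intercept column unpenalized)
      k = X.shape[1]
      return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)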


This is explained very well and makes each step easy to understand. It's not the best way to do it, but it does explain why gradient descent is awesome and how the intuition works. Hopefully future posts go into greater detail about doing gradient descent and linear regression more efficiently, but for an intro post, this is most excellent.

Well done.


If you want a deeper introduction to this topic, and a surprisingly accessible piece of mathematics, I recommend the classic paper: http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradi...


This was my introduction - I used the techniques described to write an interplanetary trajectory timing optimization tool (finding good launch windows between various celestial bodies). It was an excellent read and eminently transformable into functional code.


It really should mention the closed-form solution; it's quite silly to compute a linear regression this way.

It should also give a nod to robustness: a pair of sufficiently crazy outliers would leave the line nowhere near the points, seemingly in contradiction of the claim that lines that "fit" better will have a lower error value. (They will, but only in the circular sense where you've defined fit in terms of that particular error function.)


"In our linear regression problem, there was only one minimum. This made our error surface convex."

Shouldn't it be the other way around?


Correct... if there is only one minimum, it is unimodal, but not necessarily convex. For instance, if there are inflection points, it will not be convex.


There is also a great course running on Coursera right now covering this:

https://class.coursera.org/ml-006


Do machine learning cats use Nesterov methods? Add some inertia to gradient descent, and boom, kick butt theoretical benefits and super fast in practice.


Yes - see e.g. "On the importance of initialization and momentum in deep learning" from ICML 2013 (http://www.cs.toronto.edu/~fritz/absps/momentum.pdf) for an overview in the context of NNs.
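
The update itself is tiny. A rough sketch of the Nesterov step, roughly in the formulation that paper uses (evaluate the gradient at a look-ahead point; names are mine):

  def nesterov_step(w, v, grad_fn, lr=0.01, momentum=0.9):
      # evaluate the gradient at the look-ahead point w + momentum*v,
      # fold it into the velocity, then take the step
      g = grad_fn(w + momentum * v)
      v = momentum * v - lr * g
      return w + v, v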


Accelerated gradient descent is pretty common everywhere.



