
If you want to solve f(x) = 0, where f: ℝ^k -> ℝ^k maps vectors to vectors, most schemes are based on a Taylor expansion around the current iterate x_n:

    f(x) = f(x_n) + Df(x_n)(x - x_n) + O(||x - x_n||^2)
Drop the quadratic term, set the result equal to 0, and solve for x to get the next iterate x_{n+1}:

    x_{n+1} = x_n - Df(x_n)^{-1} * f(x_n)
As written, this is Newton's method. The problem is that the Jacobian Df(x_n) is a k x k matrix, and inverting it (or solving the corresponding linear system) may require O(k^3) work.
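
Here's a minimal NumPy sketch of the plain Newton iteration (the circle/line test system below is just made up for illustration):

    import numpy as np

    def newton(f, jac, x0, tol=1e-10, max_iter=50):
        # Plain Newton: solve Df(x_n) d = f(x_n), then set x_{n+1} = x_n - d.
        # np.linalg.solve is the dense O(k^3) step discussed above.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            fx = f(x)
            if np.linalg.norm(fx) < tol:
                break
            x = x - np.linalg.solve(jac(x), fx)
        return x

    # Example: intersect the circle x^2 + y^2 = 4 with the line y = x.
    f = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[1] - v[0]])
    jac = lambda v: np.array([[2.0*v[0], 2.0*v[1]], [-1.0, 1.0]])
    print(newton(f, jac, x0=[1.0, 0.5]))  # ~ [1.414..., 1.414...]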

So, pretty much all schemes are based on approximations of the Jacobian.

Gradient descent, for example, replaces Df(x_n)^{-1} with a scalar a, so it's O(1) work to "compute" it:

    x_{n+1} = x_n - a * f(x_n)
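A sketch of the same loop with the Jacobian inverse replaced by a fixed scalar (the step size 0.1 is just a guess for this toy problem):

    import numpy as np

    def fixed_step(f, x0, a=0.1, tol=1e-10, max_iter=10_000):
        # Replace Df(x)^{-1} with a fixed scalar a: the "Jacobian" now costs O(1),
        # but you typically need many more iterations, and convergence depends on a.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            fx = f(x)
            if np.linalg.norm(fx) < tol:
                break
            x = x - a * fx
        return x

    # Same circle/line system as in the Newton sketch above.
    f = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[1] - v[0]])
    print(fixed_step(f, x0=[1.0, 0.5]))  # same root, but far more iterations than Newton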
A method like L-BFGS builds a relatively cheap approximation P to the inverse of the Jacobian as the iteration proceeds, resulting in roughly O(k) work per iteration:

    x_{n+1} = x_n - P * f(x_n)
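In practice you'd rarely hand-roll this; a hedged SciPy example, using the standard Rosenbrock test function as the cost (so its gradient plays the role of f):

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    # L-BFGS keeps only the last few (step, gradient-change) pairs and reconstructs
    # the action of P from them, so applying P is cheap compared to a dense solve.
    x0 = np.zeros(5)
    res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
    print(res.x)  # ~ all ones, the minimizer of the Rosenbrock function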
Other methods may exploit sparsity of the Jacobian, or solve the linear system Df(x_n)z = f(x_n) only approximately for z using e.g. iterative methods.
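
The last idea is what Jacobian-free Newton-Krylov methods do; a sketch with SciPy, reusing the toy system from above (solver settings left at their defaults):

    import numpy as np
    from scipy.optimize import newton_krylov

    # The Jacobian is never formed explicitly: its action on a vector is approximated
    # by finite differences, and the inner linear system is solved only approximately
    # with a Krylov method (GMRES here).
    f = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[1] - v[0]])
    print(newton_krylov(f, [1.0, 0.5], method="gmres"))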

Note: in optimization problems, the function f is typically the gradient of a cost function, and the Jacobian is then the Hessian of that cost function.
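
A concrete case of that correspondence, for a quadratic cost:

    g(x)  = 1/2 x^T A x - b^T x
    f(x)  = grad g(x)  = A x - b    (the gradient is what you set to 0)
    Df(x) = Hess g(x)  = A          (one Newton step solves A x = b exactly)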




Gradient descent also allows batching, which reduces the memory requirements. Can the other methods support that?


As far as I know, this is the #1 gradient descent superpower that makes it the preferred choice above all others. I don't think e.g. L-BFGS supports batching; I've certainly never seen it done.
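
For anyone unfamiliar: "batching" here means the stochastic / mini-batch variant. When the cost is a sum over data points, the gradient can be estimated from a random subset per step, so you never need the full dataset (or the full gradient) in memory at once. A rough sketch with a least-squares loss (sizes and step size are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 3)), rng.normal(size=1000)
    w, a, batch = np.zeros(3), 0.01, 32

    for step in range(2000):
        idx = rng.integers(0, len(X), size=batch)        # random mini-batch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch  # gradient of 1/2 * mean squared error
        w -= a * grad                                    # plain SGD update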


There's also Coordinate Descent, which was traditionally used for fitting LASSO regression.
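
A rough sketch of cyclic coordinate descent for the LASSO objective 1/(2n)*||y - Xw||^2 + lam*||w||_1 (normalization conventions differ between implementations):

    import numpy as np

    def soft_threshold(rho, lam):
        return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

    def lasso_cd(X, y, lam, n_iter=100):
        # Update one coordinate at a time; for LASSO the 1-d subproblem
        # has a closed-form, soft-thresholded solution.
        n, k = X.shape
        w = np.zeros(k)
        for _ in range(n_iter):
            for j in range(k):
                r_j = y - X @ w + X[:, j] * w[j]    # residual excluding feature j
                rho = X[:, j] @ r_j / n
                w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
        return w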



