Optimization in the context of machine learning (algorithmia.com)
143 points by bourbakis on Nov 25, 2018 | 26 comments



> To avoid getting stuck in local minima, we make sure we use the proper learning rate

This sentence is extremely misleading. It implies that gradient descent always reaches a global minimum if the "proper" learning rate is used. Guarantees of that form are only available for restricted families of functions, such as Lipschitz-smooth convex functions.
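
To make that concrete, here's a minimal sketch (my own, not from the article) of plain gradient descent on a simple non-convex function. No choice of learning rate rescues the bad starting point, which is why guarantees of convergence to a global minimum need assumptions like convexity:

    # Gradient descent on f(x) = x**4 - 3*x**2 + x, which has two minima.
    def grad_descent(x, lr, steps=1000):
        for _ in range(steps):
            grad = 4 * x**3 - 6 * x + 1   # f'(x)
            x -= lr * grad
        return x

    print(grad_descent(x=-2.0, lr=0.01))  # ~ -1.30, the global minimum
    print(grad_descent(x=+2.0, lr=0.01))  # ~ +1.13, a strictly worse local minimum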


And you don’t necessarily want a global minimum of a universal approximator...


This is... Misnamed? Optimization solvers are a different topic than what has grown to dominate machine learning. Isn't it?


Yes, it's talking about optimisation in the context of machine learning, which is a narrow slice of optimisation as a field.


I've pinched that phrase to make a less misleading title. Hope that's ok!


Of course, no problem!


I am surprised the SGD extensions of https://arxiv.org/abs/1704.04289 have not been picked up yet (via https://www.inference.vc/everything-that-works-works-because...).


It's interesting work, but I don't think it's relevant to real-world models, because real-world models violate the conditions they assume at the outset. Namely, real models are not trained to anything near convergence on the training set. Everyone uses early stopping to prevent them from overfitting, which they happily do if left to train forever.


Has there been any progress in the art around these issues? The idea of "optimise until it stops making sense to optimise" is terribly unsatisfying. I think it means "don't optimise this, but move to another state in a way that is approximately optimisation" or "optimise to this special state". Is there art to define "this special state", i.e. the generalising classifier? There wasn't anything (for sure) in 1997, although we thought there were two types of risk minimisation - structural risk (size of classifier) and empirical risk (error on datasets) - but no one had a good idea of how to use these concepts well. I did some work where I pruned optimised ensembles using GAs to get better generalisation than directly trained ones, but everyone hated it because it was inefficient.

Then people looked at deep networks and the structural risk minimisation stuff went out of the window I guess.


In simpler worlds, there's a variety of criteria such as Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Neither applies directly to NN optimization because they amount to "every additional parameter costs _this_ much".
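
For concreteness, here's a tiny sketch (my own, not from the parent) of those penalties, in terms of the maximised log-likelihood, the parameter count k and the sample size n:

    import numpy as np

    def aic(log_likelihood, k):
        return 2 * k - 2 * log_likelihood          # flat cost of 2 per extra parameter

    def bic(log_likelihood, k, n):
        return k * np.log(n) - 2 * log_likelihood  # cost per parameter grows with log(n)

    # With n = 1000, an extra parameter has to buy ~3.45 nats of log-likelihood
    # to improve BIC, but only 1 nat to improve AIC.
    print(aic(-500.0, k=10), bic(-500.0, k=10, n=1000))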

The more philosophical Minimal Description Length (MDL) principles and friends might yield something applicable for NN training, but haven't yet AFAIK.


Something I've never understood: all of the ML frameworks (TF, PyTorch, etc.) are seemingly pretty good at optimizing at least a particular set of cost functions (convex quadratic? though I know that with activation functions ML cost functions aren't really convex quadratic), so why aren't there general optimization libraries built on top of them? Why are GAMS and AMPL still a thing? Is it simply because they're more general? And as a corollary, how good are the optimizers baked into the ML frameworks relative to tools like GAMS and AMPL?


It might be instructive for you to try solving the linear programs in netlib[0] with stochastic gradient descent. ;)

The short version of the story is that different optimisation problems have different kinds of structure, and the user will want solutions of different quality and have different amounts of patience depending on the context. Practical optimisation is largely about finding ways to exploit the structure of a problem to get solutions of some quality in some amount of time. No known optimisation algorithm dominates all others.

In many machine learning contexts, you want a low-precision solution to a problem with an objective that's expensive to compute but can be reasonably approximated by random sampling. Often, when you use a linear program solver, you want a solution to a linearly-constrained problem where the geometry of the constraints may not be "easy" and any violation renders a solution completely useless.

Good LP solvers can solve "typical" well-formulated LPs with hundreds of thousands of variables and constraints in a matter of seconds on modern hardware[1]. These algorithms can't solve many of the problems you're going to feed into your deep learning framework at all. Conversely, your deep learning framework will have a hell of a time solving LPs.

As for why GAMS and AMPL are a thing, they're interfaces to solvers, not solvers themselves. You use them because they can handle pulling the data out of the database, translating your GAMS/AMPL program into a formulation suited to the solver you choose, running the solver, and handing back the results. It's straightforward but tedious to do this all on a case-by-case basis; GAMS and AMPL automate much of the uninteresting work.

[0] http://www.netlib.org/lp/

[1] http://plato.asu.edu/ftp/lpcom.html
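
To illustrate the structure point, here's a toy LP via scipy's linprog (my own example, not one of the netlib instances): the solver exploits the linearity and returns a vertex that satisfies the constraints exactly, which is exactly what an SGD-style penalised approach wouldn't give you.

    from scipy.optimize import linprog

    # minimise  -x - 2y   subject to   x + y <= 4,  x <= 3,  x, y >= 0
    res = linprog(c=[-1, -2],
                  A_ub=[[1, 1], [1, 0]],
                  b_ub=[4, 3],
                  bounds=[(0, None), (0, None)])
    print(res.x, res.fun)   # vertex (0, 4), objective -8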


For a bit of a wider view, see for example the OpenOpt problem tree: https://web.archive.org/web/20150206060441/http://openopt.or... Probably most interesting is "Non-Linear Problems (NLP)" there, and if you click in, you can see algorithms.


It always depends on your problems, your targets and your metrics. For many industrial applications the most critical aspect is false positives, so extreme or even redundant optimisation of the machine learning becomes cheaper for the business in the long term.


Wanted to share a wonderful addition to scipy I’ve recently discovered for myself (since the topic kind of fits):

https://docs.scipy.org/doc/scipy/reference/optimize.minimize...

This was added in 1.0+, along with a few other trust-* methods; it's a fairly modern multivariate constrained optimization method and deals pretty well with the noisy functions you typically encounter in ML.
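
For anyone curious, a tiny usage sketch; the link is truncated, so I'm assuming the method meant is 'trust-constr' (one of those trust-* additions):

    import numpy as np
    from scipy.optimize import minimize, LinearConstraint

    # Constrained Rosenbrock: minimise subject to x + y <= 1.
    rosen = lambda v: (1 - v[0])**2 + 100 * (v[1] - v[0]**2)**2
    res = minimize(rosen, x0=np.zeros(2),
                   method="trust-constr",
                   constraints=[LinearConstraint([[1, 1]], -np.inf, 1.0)])
    print(res.x)   # roughly (0.62, 0.38), the constrained optimum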


Ooook thanks for downvotes then..


Newton's method?


I don’t know much machine learning, but I’ve heard folks in some fields prefer a gradient descent algorithm, even though it doesn’t have a nice convergence rate, because the Hessian (or a Hessian approximation) would be too huge or too difficult to calculate.


There are always tricks to keep the problem size small for quadratic methods (such as L-BFGS). In ML, stochastic gradient descent (a bad name for doing gradient descent on batches of the data at a time) seems to have a bit of magic in its ability to go uphill and escape local minima, as well as letting the problem scale to much larger sets of data.
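
For example, a limited-memory sketch with scipy (my own, assuming a simple diagonal quadratic): L-BFGS builds its curvature estimate from the last few gradient differences, so it never forms the d x d Hessian even when d is large.

    import numpy as np
    from scipy.optimize import minimize

    d = 10_000
    a = np.linspace(1.0, 5.0, d)             # diagonal of a well-conditioned quadratic
    f = lambda x: 0.5 * np.dot(a * x, x)     # f(x) = 0.5 * x^T diag(a) x
    grad = lambda x: a * x
    res = minimize(f, x0=np.ones(d), jac=grad,
                   method="L-BFGS-B",
                   options={"maxcor": 10})   # keep only 10 correction pairs
    print(res.fun)                           # ~0, no 10k x 10k matrix ever stored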


Newton's method finds zeros of functions, which is different from optimisation (finding maxima or minima). If you use Newton's method on the derivative of the function you're trying to optimise, it will find critical points, which may be the optima you're looking for (or inflection points, saddles, etc.).
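
A quick sketch of that point (my own toy example): Newton's method applied to f' happily converges to a critical point that isn't a minimum.

    # f(x) = x**3 - 3*x has a local max at -1 and a local min at +1.
    fprime  = lambda x: 3 * x**2 - 3   # Newton finds roots of this
    fsecond = lambda x: 6 * x

    def newton(x, steps=50):
        for _ in range(steps):
            x = x - fprime(x) / fsecond(x)
        return x

    print(newton(-0.5))  # -1.0, the local *maximum*
    print(newton(+0.5))  # +1.0, the local minimum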


This is why in practice you do something like Levenberg-Marquardt, which combines gradient descent with a quadratic step inside a trust region that is grown or shrunk depending on whether the cost function is decreasing. And of course it is iterative, so a single bad step doesn't destroy everything. The advantage is faster convergence in practice.
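
If you want to play with it, scipy exposes a Levenberg-Marquardt variant through least_squares; a rough sketch (mine, not the parent's code) fitting y ~ a * exp(b * t):

    import numpy as np
    from scipy.optimize import least_squares

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 50)
    y = 2.0 * np.exp(-1.5 * t) + 0.01 * rng.standard_normal(t.size)

    residuals = lambda p: p[0] * np.exp(p[1] * t) - y
    fit = least_squares(residuals, x0=[1.0, 0.0], method="lm")
    print(fit.x)   # close to the true parameters (2.0, -1.5)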


Do you know of any good literature on LM or trust region? I'm interested in the details.


I strongly recommend reading the documentation for Ceres Solver [1]. It is intuitive and in-depth, and references everything you will need.

1. http://ceres-solver.org/nnls_solving.html#trust-region-metho...


Newton's method requires calculating the second derivative, which is quite a bit more expensive than SGD, especially with many parameters.


By Newton’s method, do you mean simple gradient descent?


no.




