Non-Convex Optimization for Machine Learning

Xcelerate · on Jan 24, 2018

From the preface:

> Put a bit more dramatically, [this monograph] will seek to show how problems that were once avoided, having been shown to be NP-hard to solve, now have solvers that operate in near-linear time, by carefully analyzing and exploiting additional task structure!

This is something I've noticed in my own research on inverse problems (signal recovery over the action of compact groups). And it's really quite mind-blowing. What this means is that you can randomly generate problems, and these will be NP-hard to solve. However, assuming the problem is not randomly generated (i.e., there is some regularity in the generative process that produced the data), there often appears to be some inherent structure that can be exploited to solve the problem quickly to its global optimum.

I feel like future research will focus on finding the line that divides the "tractable" problems from the "intractable" ones.

BoiledCabbage · on Jan 24, 2018

> I feel like future research will focus on finding the line that divides the "tractable" problems from the "intractable" ones.

My gut feeling is there is no "tractable" line. All subsets of problems are tractable.

Via analogy, think of auto-encoders or dimensional reduction. You may have 500^2 dimensions representing each pixel in an image recognition task. Representing all of that data in a lower dimensional space is impossible via pigeonhole. But if you only feed your net the subset of images of cats (or dogs, or only images commonly taken on cameras) it's actually a much smaller space and it can be represented efficiently in lower dimensions. And you can take any subset and it will have structure and because it has structure it can be represented more efficiently and solved efficiently. The fact that we're only interested in a subset of the problem space it what makes it efficient for us to solve.

Even if we generated "random" pixels, but used a method to only generate a subset of that noise (ie gave it structure), then we'd be able to reduce dimensions on it.

There is no line of portion of the space that's tractable or not, it's the fact we're working with a structured portion of the space that always makes it tractable.

I think it will be phenomenal when it's finally common place, but I think the comment they're making is "It's amazing that there's this think called compression and it works on, and works well on just about anything we want to compress. I understand theoretically it should make some things larger, but just about everything I try to compress it makes smaller!" Of course, it does because the only things we want to compress are things that have enough inherent structure that we want to save and share them. If all we were interested in were just the output of a random number generator (which wouldn't compress), then we'd never want to compress it anyway.

It's phenomenal actually. The only problems it works on are the problems we want to solve. Because the problems we want to solve have structure in them, they follow some rules (which is why we care about them to begin with).

cs702 · on Jan 24, 2018

I fully agree.

However, that still leaves an open problem. To the best of my knowledge there is no good way to measure "tractability," i.e., how much structure a problem has in comparison to other problems.

For example, I don't think there is a good way to measure how much more tractable -- i.e., how much more structure there is in -- the problem of classifying dog/cat photos versus the problem of classifying dog/cat/wolf photos versus the problem of classifying fruit/vegetable photos.

The various measures of "structure" that we have today (Shannon entropy, algorithmic information content, etc.) fall short for measuring the kind of hierarchical structure (multiple levels of function composition) that is prevalent in high-dimensional problems of interest.

Any research that might lead to a good way of measuring "hierarchical structure" (for lack of a better term) seems important to me.

BoiledCabbage · on Jan 24, 2018

> Any research that might lead to a good way of measuring "hierarchical structure" (for lack of a better term) seems important to me.

Which is an interesting question. Why is hierarchical structure so important for ML, but not for compression?

One view would be, our ML approach is "wrong" and could be improved. In almost all cases running compressed data through another compression scheme yields little to no improvement. Yet in ML we're focused on multiple transforms requiring a concept of hierarchy. Maybe it's because ML is the equivalent of a multi-layered modular compression schemes but in practice all compression is fairly monolithic. If instead we compressed using first a single layer doing key based look up of long identical strings, and then did run-length encoding we'd have items that look more like our ML approach. But, this doesn't intuitively feel right to me.

Another view is that ML works in the physical world and the physical world has inherent hierarchical structure. And we don't do this in the digital world of compression. But a counterpoint is that when we program, we exploit hierarchical structure in our code and data via functions, layers and abstraction.

Which leads me to think it's the third answer. Compression is lossless, and ML is lossy. Being lossy is what allows hierarchy to work. If you are lossless you can't ever leave working with the lowest level of abstraction or else you're by definition losing/changing information. The command equivalent "Send an email with content X to my sister" is absolutely a lossy summary of what is happening or will happen below the surface. Me giving my computer that command it will translate to different low level actions each time I do it. Similarly if someone asks what command I just ran and I give that description, it isn't enough information for them to recreate a CPU & network packet level exact reproduction of what I did. It is lossy. Machine Learning of "is this a cat?" is exactly like this. By contrast, in compression you have to have every bit recovered.

So I think the answer, back to your point, is we may not have an equivalent methodology of discussing information theory / content in a lossy domain. As a result of this we don't have a way of discussing information content in a hierarchical structure as they may be one and the same.

cs702 · on Jan 25, 2018

Yes, that makes sense to me: we don't have a formal, precise way of discussing (measuring) information content when faced with the kind of "hierarchical structure that is robust to loss" that is prevalent in natural high dimensional data.

(FWIW, I don't like the term "hierarchical structure" that much, but cannot think of a more appropriate term. It's as if there's no good terminology for discussing these issues, at least not yet.)

Xcelerate · on Jan 24, 2018

> The various measures of "structure" that we have today (Shannon entropy, algorithmic information content, etc.) fall short for measuring the kind of hierarchical structure (multiple levels of function composition) that is prevalent in high-dimensional problems of interest.

You don't think Kolmogorov complexity captures hierarchical structure? (Ignoring the fact that it isn't computable).

cs702 · on Jan 25, 2018

As you correctly point out, it's not computable, so we can't use it for measuring anything in practice. But let's ignore that.

The issue I have with it is that it theoretically measures the lossless compressibility of some sequence of bits, instead of the kind of "hierarchical decomposition that is robust to data loss" that I'm talking about here.

Think about this way: we can randomly zero out half the pixels in a cat photo, and a human being can still see it's the same cat, because it's composed of cat parts that "approximately look like" cat parts (pointy ears, whiskers, etc.), and those parts look like cat parts because they are in turn composed of things that "approximately look like" cat part parts, all the way down to pixels.

AFAIK, there's no way of measuring whether cat photos have more or less of this kind of robust-to-loss hierarchical structure than, say, dog photos, or fruit photos.

nbulka · on Jan 24, 2018

Check out algorithmic information theory:

https://arxiv.org/pdf/math/0210035.pdf

gabrielgoh · on Jan 24, 2018

A simple example of this that has been shown rigorously is compressed sensing. Finding the sparsest vector, subject to linear constraints Ax = b is NP hard for general matrices, but is solvable in polynomial time if A satisfies the RIP property (e.g. w.h.p if A is generated by randomly sampling gaussians for each entry). Quite surprising!

mturmon · on Jan 24, 2018

Another example (that may be related to yours) is the general linear programming problem, that is, for vector x,

  max c^T x

  s.t. A x <= b  [component-wise]

If problem instances (A, b) are chosen at random, from a rotationally-symmetric distribution, Borgwardt and others showed that, with high probability, the number of steps of the solution method is bounded by a polynomial in the number of dimensions. But on the other hand, there are explicit constructions of (A, b) that cause an exponential number of solution steps.

The usual interpretation is that "most" problems are friendly to the standard linear programming approach, but a few are not.

jjgreen · on Jan 24, 2018

Sadly, that not many A arising in practice have the RIP property.

wenc · on Jan 24, 2018

This has been known for a while now, and I would suggest it's not so much about "tractable" vs "intractable" problem classes.

It's important to understand what NP-hard means -- it doesn't mean exponential complexity, but worst case exponential complexity (a theoretical bound). It means there's always the possibility we will hit the worst case, but it doesn't mean we always do. If you exploit structure and apply heuristics, you can achieve very good average case performance.

In fact, we have been solving NP-hard problems in mission-critical environments quite efficiently for many years now.

Take the case of Mixed Integer Linear Programs (MILP) -- that is, Linear Programs (LPs) with integer variables. It is well known that they are NP-hard. However, modern solvers (CPLEX, Gurobi) implement a whole bunch of heuristics and techniques that make solver performance extremely good on most problems. MILPs are used to do large-scale planning, routing, etc. and industry relies on MILP solutions every day.

MILP solvers today are 10^9 times faster than in 1991, and most of that speedup owes firstly to algorithmic breakthroughs, and secondly to machine improvements. It's astounding.

https://www.fields.utoronto.ca/programs/scientific/11-12/bc6...

That said, MILPs remain NP-hard, which means there's always a chance you will run into cases where the solution will take exponential time to solve (full enumeration). It's an escapable property of this problem class.

But the take-home is: NP-hardness does not inherently mean intractability in general.

JD557 · on Jan 24, 2018

> I feel like future research will focus on finding the line that divides the "tractable" problems from the "intractable" ones.

IIRC, there's some research about "Phase transitions in Computation Complexity" that seems to be what you are looking for: http://www.lsv.fr/Events/fmt2012/SLIDES/moshevardi.pdf

(I think I saw that talk at Youtube a long time ago, but I can't find it now)

98324298f · on Jan 24, 2018

might you be thinking of Florent and Lenka Krzakalas research?

https://www.youtube.com/watch?v=i0N6mn9L-GA&list=PLvKBgJ4p0p...

tw1010 · on Jan 24, 2018

This is a really great observation. A lot of the time when people say "but that problem is NP!" is it still likely that you'd be able to solve it by exploiting some structural properties of the specifics.

blt · on Jan 24, 2018

I agree. I think we will see an explosion of techniques that exploit prior knowledge about the set of problem instances typically encountered in real life. The prior may come from analysis by hand or be learned from data.

We will need to move away from worst-case analyses. Randomized analyses will use more expectations over nontrivial probability distributions instead of uniform distributions over the entire problem space.

eachro · on Jan 24, 2018

Super curious, what kind of problems do you mean when you say "signal recovery over the action of compact groups"?

Xcelerate · on Jan 24, 2018

Consider an object. Now duplicate that object many times, add random noise to each object, and give it a random rotation in space (over the rotation group SO(3)). The goal is to recover the original object from the noisy, rotated copies of it.

faragon · on Jan 24, 2018

In this context:

Tractable problem = generalization from previous knowledge.

Intractable problem = not enough previous knowledge.

purushot · on Jan 24, 2018

(Disclaimer: I am an author of the monograph being discussed)

The authors are very glad at the discussion this monograph has generated and would be very happy if this leads to more research and advances in the foundations and applications of non-convex optimization. Solvers for various problems that arise in machine learning and signal processing applications, have seen great improvement in recent years. This includes MILP solvers that are very useful for scheduling and other problems and LASSO-style solvers. This monograph was an attempt to present a concise and lucid introduction to another set of very useful and highly successful algorithmic techniques for machine learning problems.

As one comment below mentioned, the scope of non-convex problems is quite vast as it pretty much includes any problem that is "not convex". However, the non-convexity that arises in machine learning and related areas is often very structured and this monograph was an attempt at collecting our collective knowledge on how to address these specific instances of non-convexity and develop extremely scalable solvers for them. A key component of this effort turns out to be a general avoidance of "relaxations", a common trick used to convert non-convex problems into convex ones so that well-understood techniques can be applied (LASSO techniques fall into this category).

Frequently, these relaxation-based approaches struggle to scale very well. In this monograph we present several algorithmic techniques that avoid relaxations for this reason. These techniques are very intuitive and offer not just excellent performance in practice (much faster than relaxation based techniques), but provably so (the monograph does indulge in formal analyses of the algorithms). This theoretical aspect of the monograph dates back at least a decade, to the initial advances in compressive sensing due to the seminal works of Candes, Tao and Romberg. These initial results were later shown to be a part of a much more general framework, which this monograph also seeks to present, by presenting applications of non-convex optimization techniques to sparse recovery, matrix completion, latent variable models, and robust learning.

We welcome comments and suggestions form all. This monograph is very much a work in progress, especially given the fast pace at which machine learning is moving forward and we wish to keep this text relevant to the community in the future as well, by making relevant updates and additions.

vladislav · on Jan 24, 2018

Perhaps it's unavoidable, but there are large swaths of relevant literature missing from this survey. In particular there is not a single mention of the non-convex Burer-Monteiro method for solving semidefinite programs, a pretty widely applicable technique.

fmap · on Jan 24, 2018

I wonder how the techniques in this monograph stack up against optimization techniques on manifolds (https://press.princeton.edu/absil). Projected gradient descent seems like an approximation to steepest descent on a suitable manifold, so I would expect conjugate gradient or Newton methods to perform better in practice.

XnoiVeX · on Jan 24, 2018

Why is this getting so few comments?

govg · on Jan 24, 2018

It's a monograph that is pretty dense with not so trivial math. Maybe there aren't many on HN who fully understand the material or have strong ideas about it.

graycat · on Jan 24, 2018

"Non-convex" is really vague: Really that means essentially just everything else. I'm reluctant to believe that much that is general and useful can be said about everything that is 'non-convex'.

There is likely an unclear assumption: The optimization problem is a minimization or we would be talking about non-concave.

To drag out NP-complete is a bit vague and maybe evasive: Sure, usually we think of NP-complete with optimization in the context of discrete optimization or, to be simple, integer linear programming. And for a little more, IIRC, minimizing a concave function subject to linear constraints is NP-complete. But for minimizing a non-convex objective function, the assumption is that we are dealing with continuous variables where the role of NP-complete is less common?

It's not so clear what properties the constraints have -- are they also to be from functions that are non-convex?

Usually in optimization with non-convex objective functions, as a first cut we look to satisfy the corresponding Kuhn-Tucker necessary conditions with also having some suitable constraint qualifications, but in the abstract I saw no mention of Kuhn-Tucker.

IIRC there is some old work on, e.g., pits, peaks, and passes, that might help find sufficient conditions for global optima. IIRC for some special case objective functions, there are some transformations that can help.

It may be that the OP would be more clear if we assume a context, say, image recognition? But the last time I heard about non-convex optimization was for investment portfolio rrebalancing, so maybe the context is not just image recognition?

In the work I've done in optimization, the most powerful approach I found was exploit special structure. So, basically give up on general approaches and attack each problem one at a time.

That the abstract does suggest that what is being considered are evaluations of heuristics suggests that the paper is for some particular class of problems and is more like a report on computational experience than math.

I might get interested in the paper except for the concerns above.

jpfr · on Jan 24, 2018

Can you explain where Kuhn-Tucker enters the picture here? Non-convex optimization problems do not necessarily define constraints for the domain.

Most non-convex optimization with bounds for the convergence speed exploits on some sort of structure. For example Lipschitz bounds for the gradient. There is some recent work that requires only a semi-metric around the optimizer. That opens up a large class of functions that could not be efficiently optimized before.

http://papers.nips.cc/paper/4304-optimistic-optimization-of-...

graycat · on Jan 24, 2018

> Can you explain where Kuhn-Tucker enters the picture here?

If there are no constraints, then, sure, the Kuhn-Tucker conditions do not have a non-trivial role. In that case, commonly we would state up front that we were considering unconstrained optimization.

Sure, such a Lipschitz bound could give us some help on lower bounds on the objective function value but for those bounds we would need some constraints! Then at least in principle, we are back to the Kuhn-Tucker conditions again.

I suspect that the paper is (A) not about all of non-convex optimization but (B) about some of optimization, maybe some part of image recognition, where the objective functions are non-convex. So, we're talking a subtle language issue. Or the paper is about some parts of non-convex optimization!!!

jpfr · on Jan 28, 2018

> Sure, such a Lipschitz bound could give us some help on lower bounds on the objective function value but for those bounds we would need some constraints! Then at least in principle, we are back to the Kuhn-Tucker conditions again.

You don't need constraints for the bounds. That's the beauty of it.

The idea is to sample from the domain and use Lipschitz bounds (or a semi-metric in the cited paper) to upper-bound the target function via the distance to the samples. The question is how take samples in a principled way that garuantees a "good" speed of convergence.

There is no free lunch. So some assumptions have to be made on the non-convex target function. However, a semi-metric around the optimizer is much weaker assumption than a Lipschitz constant. The function doesn't even have to be continuous!

Of course performance is a lot slower than, say, linear programming. But this is non-convex optimization, a much harder problem.

graycat · on Feb 7, 2018

Without some constraints, a Lipschitz bound with non-zero gradient is no bound at all.

kleiba · on Jan 24, 2018

Also, it's not clear to me why this entry makes it to the front page over all the tons of other research paper / monographs. It's significance does not stand out to me, i.e., I cannot grasp why this is something I should look at instead of many other interesting reads from the field.

jpfr · on Jan 24, 2018

This was just released as part of the now series for machine learning. They usually publish superb reference material. Research that is foundational and/or has become a stable body of work, but is recent enough to be interesting.

Read these and you are on the edge of what is currently happening.

http://www.nowpublishers.com/MAL

mturmon · on Jan 24, 2018

There has been a big question about the place of convex relaxations for solving learning problems, and this responds to that question.

That is, you have a learning problem with some fit criterion that is non-convex. Maybe you choose a different objective function, or different parameterization of the problem, so that it is convex. Then you can use a convex solver to find the guaranteed optimum of the "wrong" problem. Using convex relaxations has been a trend for perhaps 20 years - it avoids a whole bunch of nagging "what if my solution is just a local max" questions.

Or, you can use older techniques, like expectation-maximization, or in general, alternating minimization, that are adapted for these kinds of learning problems. Learning problems often have observed variables, and hidden variables, and if you knew the hidden variables, the optimization problem would be easier. These algorithms work on such problems. But, they don't find the global maximum. So you have found a possibly-good solution to the "right" problem.

Lately, some very interesting results have shown that, under some conditions, there are probabilistic guarantees that the second approach will find the global optimum [e.g., https://arxiv.org/abs/1408.2156]. Since the second approach often scales much better, people are naturally interested.