Why Momentum Works (distill.pub)
503 points by m_ke on April 4, 2017 | 95 comments



Some of the multi-author articles on Distill have a very important (IMO) innovation: they quantify precisely the contribution each author has made to the article. I would like to see this become the norm in scientific papers, so that on the one hand it'd be clear whom to ask if questions arise, and on the other the various dignitaries won't get their honorary spot on the author lists of papers they were barely involved with scientifically.


As others mentioned, this is common in other fields, it's just not done in machine learning.

We've been reading through the policies of lots of journals that seem thoughtfully run and borrowing good ideas. :P


Other journals, e.g. Nature, already have this. It's not a Distill innovation. But I agree, this should be the norm in journals/confs/etc.


I was fully expecting an article about some braindead product that nobody needs, called Momentum. Imagine my surprise finding physics and a healthily low percentage of BS.


This title, and some others on the main page, made me realize that while a title might make sense in the context of its own site, we are collectively really bad at using titles that are appropriate for passing around on the internet. This article (like many others) could have been about just about anything based on the title alone.


it's not really physics so much as numerical methods.


your contribution is appreciated


I'm curious about the method chosen to give short term memory to the gradient. The most common way I've seen when people have a time sequence of values X[i] and they want to make a short term memory version Y[i] is to do something of this form:

  Y[i+1] = B * Y[i] + (1-B) * X[i+1]
where 0 <= B <= 1.

Note that if the sequence X becomes a constant after some point, the sequence Y will converge to that constant (as long as B != 1).

For giving the gradient short term memory, the article's approach is of the form:

  Y[i+1] = B * Y[i] + X[i+1]
Note that if X becomes constant, Y converges to X/(1-B), as long as B in [0,1).

"Short term memory" doesn't really seem to describe what this is doing. There is a memory effect in there, but there is also a multiplier effect in regions where the input is not changing. So I'm curious: how much of the improvement is from the memory effect, and how much from the multiplier effect? Does the more usual approach (the B and 1-B weighting, as opposed to the B and 1 weighting) also help with gradient descent?
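
A quick numerical sketch of the two fixed points (plain Python; B and X are arbitrary illustrative values):

    # Feed both accumulators a constant input X with B = 0.9.
    # The (B, 1-B) form converges to X; the (B, 1) form to X/(1-B).
    B, X = 0.9, 1.0
    y_ema = y_mom = 0.0
    for _ in range(200):
        y_ema = B * y_ema + (1 - B) * X
        y_mom = B * y_mom + X
    print(y_ema, y_mom)  # ~1.0 and ~10.0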


I assume that multiplying by a given factor shouldn't matter, since you still have the learning rate as a factor (which itself multiplies the gradient). This might just mean that the learning rate should be lower or higher with this method.


The question is then really about which method makes it easier to tune parameters or which helps intuition the most.


this is a good way to think about this.


Very good question! I have considered this issue too. This form of weighting is the kind used in ADAM, and is qualitatively different from the updates described here. The tools of analysis in this article can be used to understand that iteration too (this amounts to a different R matrix), and I would be curious to see if it too allows for a quadratic speedup.

[EDIT] As per halfling's comment, this is just a change of the learning rate by (1-beta)
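
A quick sanity check of that equivalence on a toy 1-D quadratic (plain Python; the beta, step size, and starting point are arbitrary):

    # f(w) = 0.5*w^2, so the gradient is just w. The EMA form
    # z = B*z + (1-B)*grad with step a matches the plain form
    # z = B*z + grad with step a*(1-B), step for step.
    B, a = 0.9, 0.1
    w1 = w2 = 2.0
    z1 = z2 = 0.0
    for _ in range(50):
        z1 = B * z1 + (1 - B) * w1
        w1 -= a * z1
        z2 = B * z2 + w2
        w2 -= a * (1 - B) * z2
    print(w1, w2)  # identical trajectories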


I'm really loving the choice of articles, especially since you're just getting started.

Edit: I'm referring to the journal, not the author.


Thanks! We're just lucky to have authors like Gabe (@gabrielgoh) come to us with incredible articles. :)


How do I know if an article would fit there? For example, I was thinking about adjusting my http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html (already with some interactive components) or writing about RoI pooling (https://deepsense.io/region-of-interest-pooling-explained/ - by my colleague, but more interactive).

Would it be on-topic? (After changing style accordingly.)


Please check out our journal policies page: http://distill.pub/journal/

In brief, Distill needs to see three things to publish an article: outstanding communication, advancing the dialogue, and scientific integrity. Distill often works with authors to help them bring their articles up to our standards.

Additionally, as a primary publication, Distill will not republish content already published elsewhere, or publish "translations" of papers where someone rewrites the content of a previous paper. (This relates to advancing the dialogue.)

If you reach out to editors@distill.pub, we're happy to discuss pre-submission inquiries about journal scope and related topics.


Thank you! Of course I read it; I'm just (as it is a new thing) still guessing what is a good fit, and what isn't.

(For some reason I thought that this t-SNE article was published elsewhere. Now I see that it was on Distill, just before its big start.)


Yep, we're still clarifying our policies. If you have questions, please email us!

(Distill needs to be extra careful about a lot of this stuff because we're trying to build legitimacy for a kind of work that many people are inclined to not treat as academic contributions. So on some things, like being a primary publication, we may end up taking a more defensive posture than we would in an ideal world.)


I wish all the best to this wonderful initiative! (As a side note, my co-advisor dreamt about this style of communication in science (though, in physics): https://physicsnapkins.wordpress.com/2012/04/16/a-personal-d...)

I totally understand that you need to set a high reference level at the very beginning, even if at the cost of being a bit "conservative" (I am not sure if that's the best word here).


I'm curious, did the author write the whole article including figures, or did someone else give life to the figures?

I can see this type of interactive journal becoming very popular in other fields, but not if the author has to create the diagrams him/herself.


Author here - I've created all the diagrams, though I've received really helpful editorial input from Shan Carter and Chris Olah. If you feel like doing some archeology, you can see for yourself the really ugly drafts in the github history - it isn't pretty!

I think these visualizations are deceptively easy to create. Javascript is a powerful language with many libraries, and in my experience, it just took a few nudges at exactly the right spots from Shan to go from an idea in my head to a fully fledged diagram in Distill. The tricky part has always been figuring out what to visualize, and if you're a researcher with a clever idea for a visualization, I recommend you reach out to the Distill team.


Thank you very much for the article, I'm really enjoying it. I have a few comments:

- When hovering over some notes I see the citations perfectly, but when there is math involved, like for example the one about spectral decay, the math is not rendered (I'm using Ubuntu; I have tested both Firefox and Chrome).

- I see you can open issues on GitHub to send corrections. I wonder if there are other channels of communication too, for example to report problems like the previous one or to discuss the article.

- I see people like the figures a lot. I think they are great too, but what I really love is the writing and the math exposition.

- I think that a default animation for the figures, if they are meant to be manipulated, would be great.

I will say it again: great article. I have bookmarked the journal and will proceed to read everything published, which looks very promising too.


As Gabe said, we mostly expect authors to produce diagrams, with us helping to edit them into outstanding articles. This is one example of the editing:

https://github.com/distillpub/post--momentum/issues/1

We've also had a few designers volunteer to work with researchers on visualization. So, in special cases, we may match-make researchers with designers to produce a great article.


Likewise, I'm very impressed with the early article selection. The site itself is beautiful, and interactive figures like the one at the top of this link are incredibly helpful.

Overall -- huge fan!


It would be nice if the introduction made clear that the "Momentum" that "works" is some algorithm and not at all the physical concept "momentum".


At least it wasn't some random software project. My expectation was about 50/50 between "something to do with actual momentum" and "yet another Javascript framework".


distill.pub is a blog specifically for machine learning. While the poster could have added something to the title, the usual around here is to leave it as is.


Good point. I also had no idea what this is talking about. An introductory sentence about it in the article would have been good.


Reminds me of this paper, "Causal Entropic Forces":

http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf


This is one of my favorite things ever. You should also look up "Jeremy England's Entropic Life" for a good companion thought.



Yep! I encountered both of these within a few weeks of each other, after having come to a similar (but way less educated and rigorous) conclusion to Prof England's.


Relevant to this discussion about 'entropy and life' is the relation between the second law of thermodynamics and the purpose of life that this article presents: https://www.farnamstreetblog.com/2017/03/scientific-concepts...

Original article : https://www.edge.org/response-detail/27023


I have to say that the dynamics-of-momentum diagram is a thing of real beauty. The whole paper felt a little NYTimes-like, and then of course, I see that Shan Carter helped a little bit with it!


I can't follow the math but the presentation is gorgeous (Safari on a MacBook retina display). Really great, keep up the good work!


not following the math is understandable since it only takes a little unfamiliarity to make math seem hard to understand.

but if you've studied linear systems a bit, the math is laid out really well for understanding and turning into practical applications (which seems to be the whole point of distill.pub).

many math articles, in contrast, seem to obfuscate the "why" and the "what for" and concentrate on the theory and derivation (e.g., many wikipedia math articles). that might be great for mathematical elegance (or cynically, one-upmanship), but not so great for people trying to build things using that valuable knowledge.


Which part of the math did you find difficult to follow?


Better question: What background is the reader expected to have?

Until Xi, not a single variable is defined prior to inclusion into an expression. Even Alpha and Beta are only defined in the header diagram rather than the body of the text. Also, why are the iteration notations in the superscript rather than subscript?

And before someone chimes in and says that rudiments are necessary to understand this work, no they aren't. The logical steps here are exceptionally simple (and intuitive - as the introduction might lead you to believe) once you get past the delivery. This is a fantastic article that could become very accessible with the proper notation housekeeping.


These are valid criticisms, thank you very much for this.


But simply not on Firefox


Thanks for pointing that out! We fixed the diagram bug in firefox. There's still bad performance for ~30s after page load -- we're looking into why that's happening -- but after that the page seems to work well.


Generating the nice formulas from TeX syntax seems to eat most of those 30s for me - maybe you could do that during the build instead of pushing that work to the client?


Cool that you fixed it already! I should have started with some praise: great selection of articles


Wow...pretty amazing how easily that site killed my desktop.


The site killed my Firefox on Ubuntu.


it spun my firefox hard


Curious, has this method been used for solving linear systems? How would it perform e.g. against conjugate gradient?

And how would it perform for non-positive-definite systems?


Author here - yes! It can be used to solve symmetric PSD systems, as the solution of Ax = b is also the minimizer of 0.5*x'Ax - b'x. Conjugate gradient can be seen as a special form of momentum with adaptive alphas and betas.
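
A minimal sketch of that idea (Python/NumPy; the matrix, step size, and beta below are arbitrary choices, not tuned):

    import numpy as np

    # Solve Ax = b for symmetric positive definite A by running
    # momentum (heavy ball) on f(x) = 0.5*x'Ax - b'x.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((20, 20))
    A = M @ M.T + 20 * np.eye(20)     # symmetric positive definite
    b = rng.standard_normal(20)

    alpha = 1.0 / np.linalg.eigvalsh(A).max()
    beta = 0.9
    x = np.zeros(20)
    z = np.zeros(20)                  # accumulated velocity
    for _ in range(500):
        grad = A @ x - b              # gradient of the quadratic
        z = beta * z + grad
        x = x - alpha * z

    print(np.linalg.norm(A @ x - b))  # residual should be tiny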


> Conjugate gradient can be seen as a special form of momentum

Just to be clear, though, CG doesn't use the negative gradients as search directions, as steepest descent would.


For those that don't read materials about optimization as much as they maybe should, what is "w⋆"? It is used without introduction, and I don't know what it is. Perhaps this is a convention I am not aware of?


Optimal w. It is a convention but still ought to be introduced.


the commenters below are correct - but I will push a change for this right away! It is my fault for not introducing it.


In general, something star is the optimal value.


Pretty cool - it brings in classical control theory. I always kinda missed that classical controls weren't really brought up in discussions of gradient descent.


People, listen. "Damping" means to reduce the amplitude of an oscillation. "Dampening" means to make something wetter.

</rant>



This is indeed related. See the section on polynomial regression!


Hm. So that helps with high-frequency noise. Any progress on what to do when the dimensions are of vastly different scales? I have an old physics engine which had to solve nonlinear differential equations in about 20 variables. During a collision, the equations go stiff, and some dimensions may be 10 orders of magnitude steeper than others. Gradient descent then faces very steep knife edges. This is called a "stiff system" numerically.


Author here - I believe the problem of a "stiff system" you're referring to is exactly the problem of pathological curvature!

Some points not touched on in the article. If the individual dimensions are of different scales, this problem can be easily fixed with a diagonal preconditioner. Even something like ADAM or Adagrad (unconventional, I know, in this domain) can be used.

There's also a small industry around more sophisticated preconditioners for the linear systems in PDEs, see Multigrid, for example, or preconditioned conjugate gradient.
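
A toy sketch of the diagonal-preconditioner fix (Python/NumPy; the curvatures are made up to mimic wildly different scales):

    import numpy as np

    # Quadratic with per-dimension curvatures spanning 12 orders of
    # magnitude; rescaling each coordinate by 1/curvature makes the
    # problem perfectly conditioned.
    d = np.array([1e-8, 1e-4, 1.0, 1e4])  # hypothetical curvatures
    A = np.diag(d)
    b = np.ones(4)

    x = np.zeros(4)
    P_inv = 1.0 / d                       # diagonal preconditioner
    x = x - P_inv * (A @ x - b)           # one preconditioned step
    print(np.linalg.norm(A @ x - b))      # 0.0 -- exact for diagonal A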


The stiffness may be local. It definitely is in a physical simulation for hard collisions. Machine learning data is usually normalized into [0..1], so if you get a really steep slope, something is pathological.


I'm not an expert on anything covered in the article, but we have a similar physics-based model at my work (complex non-linear equations); we use a technique called Sequential Quadratic Programming (SQP) to find an optimal solution. My understanding is that this gives better results than using gradient descent but will only work if the functions are continuous.

This could be worth looking into for you.


If you are curious to see the code used to produce that post, you can check it out here: https://github.com/distillpub/post--momentum

I was surprised to see that each post has its own HTML page and JavaScript library; I was expecting some form of rendering engine and a common JavaScript library.


It would be really, really great if you could somehow hook this up to Discourse so people could comment on and ask questions about the article. Allowing people to ask questions and having others answer like MathOverflow would I think bring a lot more clarity. Many different kinds of people want to understand material like this but may need the math unpacked in different ways.


I don't see how Discourse will be better than something like HN or reddit. There's also a submission to Reddit[1].

With Discourse, I think there will be more noise and a lot of time will be wasted scrolling through unnecessary replies. What's good about the thread-like nature of HN/Reddit is that you'll have proper context, and the rating system does its job so everyone's time won't be wasted.

Questions can be answered here and on Reddit too. I think Reddit can sometimes be more helpful than HN when it comes to answering 'easy' questions, so I would go there if you have any simple questions.

[1]: https://www.reddit.com/r/MachineLearning/comments/63f3uk/r_w...


I disagree. The most useful replies can be upvoted to deal with noise, and using mathematical formulas to help explain responses is absolutely crucial. I can't tell you how helpful it has been to be looking at an obscure proof, post a question to Math Overflow, and have the answer explained in an intuitive way with reference to the symbols and notation used.

These articles on distill I believe could greatly benefit from this. Let the community help distill.


I see where you're coming from. But maintaining another service, and the added expenses, community managers, spam control, etc., might be a bit much for something that is intended to be a publishing platform.

And if there is a question that requires more control, like math formatting, etc. I would actually suggest posting to cross validated[1] and then linking it here.

[1]: http://stats.stackexchange.com/


Understood. It's true that this type of thing could incur additional expenses and effort; I think, though, that it is truly worth it. It's going to take a push from the top to create a community around the idea, a community that can distill the idea to those that do not understand it. I really strongly believe that everything needs to be in the same place, the tooling needs to be good (perfect formatting of both code and symbols), etc.

I admire projects like Distill, but I can't help but think that an article like this suffers from what I will term 'the symbol grounding problem' (yes, a theft from classical AI). When you write an article like this, for some people it is incomprehensible because the symbols used are not grounded in concrete numerical examples. It's been my experience (and just look at some of the comments on this thread) that when you don't provide many analogies and examples of concrete computations to illustrate the inner workings of what the mathematical symbols encode, a very significant portion of those reading do not actually take away any understanding.

I truly do not want this to be the case, and I must strongly advocate that building infrastructure around helping the community pitch in is absolutely critical. It should not be only on the author to take on the burden; with a community it can be done much better. It's worth it to build something where you can publish an article and by default it is expected that questions will be asked and answers will be provided. I work in academia at a technical institute you have definitely heard of, and I just want to stress this point as much as possible: I see this problem every day, all day. If someone at Distill reads this, please consider it carefully.


HN doesn't work if someone has a question in 2 weeks. Both HN and Reddit have an incredible skew towards current things (in the news-aggregator sense, although current = "whatever was recently submitted", not necessarily = "news"), which doesn't really fit posts like this that stay relevant for longer. A subreddit might help, but even there things fall off after a while, regardless of the discussion status.


> The added inertia acts both as a smoother and an accelerator

    momentum = mass * velocity

    force = mass * acceleration

    momentum = (force / acceleration) * velocity
So, it looks to me like momentum is inversely related to acceleration. It doesn't seem right to call momentum an "accelerator".


Hi! This article is about momentum in the mathematical field of optimization. Acceleration also refers to a phenomenon in optimization. While there are deep connections to their physical analogues, they aren't quite the same thing.

If you want to make your analogy work, the momentum algorithm adds mass to an object. In terms of literal acceleration at any point in time it is neutral, but the added mass causes you to get through difficult areas much faster.

That said, much of this article is moving away from that physical analogy. :)


The use of the term "accelerator" is misleading as it's introduced during the physical metaphor. You should consider editing that sentence to clarify your meaning:

The added inertia smooths out variation in velocity, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima. Keeping our speed steadier, we arrive at the global optimum faster.

Note the alliteration :-)

> momentum accelerates convergence

Here it's more clear that momentum is accelerating convergence, not the "heavy ball" itself.

> inertia acts both as a smoother and an accelerator

On the second read, it's more clearly a contradiction. Speed can't be both held steadier and accelerated simultaneously. If you meant that momentum alternately smooths and accelerates, then it's even more strange. For that behavior, some sort of motor would be a more appropriate metaphor.


To the author: I found the sentences quoted above quite clear. Please do not change them. They helped me rapidly comprehend what the article was going to be about.


You don't find the idea that increased momentum causes increased speed a bit strange?


I think I've figured out my confusion. I'm thinking of momentum as mass, not the product of mass and velocity. To an extent, the authors seem to have the same confusion.


But velocity is the integral of acceleration, and for constant acceleration, this means

    velocity = acceleration*time
    momentum = mass*acceleration*time = force*time
So momentum is actually proportional to acceleration in this case.


Hmm. Think of the response that acceleration would have, all else equal, if momentum increased.

There's two ways to change momentum -- velocity or mass. If mass increases, acceleration decreases. If velocity increases, acceleration remains the same.

Edit: You've got a bit of an endogeneity problem in your equation. And I guess I do, too.

    momentum = mass * acceleration * time
Since momentum is equivalent to mass times velocity, one of its components is on both sides of the equation simultaneously. My original equation had the same problem, except with velocity instead of mass.

So... I think "momentum" here is leading us a bit in the wrong direction. What the author should have been saying is "mass". After all, the metaphor was "a heavy ball", emphasizing the weight. The initial velocity was unchanged across the comparison.

It bugs me that so many domains misuse the term "momentum". Business people confuse momentum for velocity. Now machine learning folks are confusing momentum for mass. It's common for people to use a component of the whole as a stand-in term, like "my wheels" instead of "my car". Less common to go the other way: "my arm hurts" instead of "my elbow hurts". In this machine learning case the imprecision of the metaphor harms understanding.


'Inertia' works also.


Is this related to the way bias frequency works in analog audio recording? https://en.wikipedia.org/wiki/Tape_bias


That's exactly what came to my mind as well. Can't answer your question though!


The math here is beyond my reasoning, but I loved playing with the sliders!


Turn the step size and momentum to maximum for some wonderfully glitchy chaos! (and a demonstration of why forward Euler integration only works well with relatively small step sizes)
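
A tiny illustration of that stability limit (plain Python; lam and the step sizes are made up):

    # Forward Euler on x' = -lam*x gives x <- (1 - h*lam)*x,
    # which diverges once the step size h exceeds 2/lam.
    lam = 10.0
    for h in (0.05, 0.19, 0.21):
        x = 1.0
        for _ in range(100):
            x *= (1 - h * lam)
        print(h, x)  # decays fast, decays slowly (oscillating), blows up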


Am I the only one who drew parallels to real life and how "just do stuff" often works better than the deliberate, slow step by step process?


In your polynomial regression example, I can't follow what you mean by p_i = \xi \rightarrow \xi^{i-1} when you're setting up the model.


I just mean p_i(\xi) = \xi^{i-1}. The notation is a little cumbersome here, but it pays off in the second equation.


More polynomial regression questions :) Are you using the optimal step size you derived in the previous section? If so, why don't the first and last eigendirections converge at once? If not, doesn't it suggest that there's a trade-off between speed of convergence and the ability to stop early?

In general, this is a nicely written article, though. Good work!


Good question! The parameter has been set to a touch below the optimum. Your observation is accurate: there is indeed such a tradeoff, though it is smaller than you might think. The qualitative behavior of the system is very sensitive to changes at that point, and the tiny bit of extra convergence you get by getting the step size exactly right is offset dramatically by the chances of diverging.


Sorry, got it. The fact that you introduce the data as \xi_i a line earlier might throw the reader a little.


It seems to offer little improvement for badly conditioned problems, i.e. k >> 1. The convergence rate approaches 1 with or without the damped momentum.
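
For concreteness, the article's two rates evaluated at a few condition numbers (Python; the k values are arbitrary):

    import math

    # Gradient descent's rate is (k-1)/(k+1); momentum's is
    # (sqrt(k)-1)/(sqrt(k)+1). Both tend to 1 as k grows, though
    # momentum's gets there more slowly.
    for k in (1e2, 1e4, 1e6):
        gd = (k - 1) / (k + 1)
        mom = (math.sqrt(k) - 1) / (math.sqrt(k) + 1)
        print(f"k={k:.0e}  gd={gd:.6f}  momentum={mom:.6f}")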


Figured I'd mention since I saw the author and an editor in here: in footnote 4, the LaTeX isn't rendering (chrome, OSX).


This looks like gradient descent passed through a Kalman filter. Which, on reflection, seems like a good idea for overcoming ripples.


isn't this akin, in effect, to successive over relaxation? https://en.wikipedia.org/wiki/Successive_over-relaxation

under-relax, converge real slowly

over-over-relax, oscillate

just-right-over-relax, get fast convergence


It's not quite the same. SOR is closer to coordinate descent in the way it acts.


Is the decreasing momentum related to the temperature in simulated annealing?


Page killed my browser.


So much beauty in the presentation.



