Hacker News

I'm curious about the method chosen to give short-term memory to the gradient. The most common approach I've seen, when people have a time sequence of values X[i] and want a short-term-memory version Y[i], is something of this form:

  Y[i+1] = B * Y[i] + (1-B) * X[i+1]
where 0 <= B <= 1.

Note that if the sequence X becomes a constant after some point, the sequence Y will converge to that constant (as long as B != 1).

For giving the gradient short term memory, the article's approach is of the form:

  Y[i+1] = B * Y[i] + X[i+1]
Note that if X becomes constant, Y converges to X/(1-B), as long as B is in [0,1).

"Short-term memory" doesn't really seem to describe what this is doing. There is a memory effect in there, but there is also a multiplier effect in regions where the input is not changing. So I'm curious how much of the improvement comes from the memory effect, and how much from the multiplier effect? Does the more usual approach (the B and 1-B weighting, as opposed to the B and 1 weighting) also help with gradient descent?
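A quick numerical check of the two recurrences above (the variable names, B = 0.9, and the constant input X = 1.0 are my own arbitrary choices for illustration):

```python
B = 0.9
X = 1.0  # input held constant after some point

y_ema = 0.0   # Y[i+1] = B*Y[i] + (1-B)*X[i+1]  (the usual weighting)
y_sum = 0.0   # Y[i+1] = B*Y[i] + X[i+1]        (the article's form)
for _ in range(500):
    y_ema = B * y_ema + (1 - B) * X
    y_sum = B * y_sum + X

print(y_ema)  # converges to X = 1.0
print(y_sum)  # converges to X / (1 - B) = 10.0
```

This makes the multiplier effect concrete: on a constant input, the second form settles at 1/(1-B) times the first.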




I assume that multiplying by a constant factor shouldn't matter, since the update is still scaled by the learning rate (which itself multiplies the gradient). This might just mean that the learning rate should be lower or higher with this method.
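A minimal sketch of this point (variable names and the gradient sequence are mine): with zero initialization, the two accumulators satisfy Y_sum[i] = Y_ema[i] / (1 - B) at every step, so rescaling the learning rate by 1/(1-B) makes the two update rules trace essentially identical trajectories:

```python
B, lr = 0.9, 0.01
xs = [0.5, -1.2, 3.0, 0.7, 0.7, 0.7]  # an arbitrary gradient sequence

y_ema = y_sum = 0.0
w_ema = w_sum = 1.0  # a scalar parameter updated by each scheme
for x in xs:
    y_ema = B * y_ema + (1 - B) * x   # usual B / (1-B) weighting
    y_sum = B * y_sum + x             # B / 1 weighting
    w_ema -= (lr / (1 - B)) * y_ema   # learning rate rescaled by 1/(1-B)
    w_sum -= lr * y_sum

print(abs(w_ema - w_sum))  # agrees up to floating-point rounding
```

So the choice between the two weightings is a reparameterization of the learning rate, not a different algorithm, when B is fixed.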


The question is then really about which method makes it easier to tune parameters or which helps intuition the most.


This is a good way to think about it.


Very good question! I have considered this issue too. This form of weighting is the kind used in Adam, and is qualitatively different from the updates described here. The tools of analysis in this article can be used to understand that iteration too (this amounts to a different R matrix), and I would be curious to see whether it also allows for a quadratic speedup.

[EDIT] As per halfling's comment, this is just a rescaling of the learning rate by (1-beta).



