> We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients...
Most of the popular variants of SGD use approximations of the Hessian in one way or another.
Not to be pedantic, but I don't know if approximating the Hessian using the gradient counts as a second-order method. I was talking about "full-blown" second-order methods where you compute the Hessian through AD.
Furthermore, I don't think that by "moments of the gradients" they actually mean second derivatives.
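If memory serves, the moment estimates in the paper (Algorithm 1) are just exponential moving averages of the gradient and its element-wise square:

$$
m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2
$$

So the "second moment" $v_t$ is an estimate of $\mathbb{E}[g_t^2]$, the uncentered variance of the gradient, built entirely from first-order information; no second derivatives appear anywhere.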
Also from the paper:

> We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions...
It's written right in the abstract that the authors consider it a first-order method.
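To make the first-order point concrete, here's a rough NumPy sketch of a single Adam step (the function and variable names are mine, not from the paper). The only thing it ever asks of the objective is a gradient:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. The only information about the objective is `grad`,
    a first-order gradient -- no Hessian or Hessian approximation appears."""
    # First moment: exponential moving average of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of the element-wise squared
    # gradient (an estimate of E[g^2], not a second derivative).
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction for the zero-initialized moving averages.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Per-parameter update: each coordinate gets its own effective step size.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

With `m` and `v` initialized to zeros and `t` counting up from 1, each parameter ends up with its own effective step size, which is what the abstract means by "individual adaptive learning rates".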