In very large-scale sparse settings, the optimization strategy is tightly coupled to the modeling, not orthogonal to it. The reason is that "data beats algorithm": a "dumb" model trained on 1TB of data with a "dumb" optimizer will easily outperform a fancier model whose more complex optimizer can only digest 1GB of data in the same amount of time. This paper is a classic on the kind of trade-offs you have to deal with in large-scale web settings: http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scal...
The kernel trick uses an implicit mapping into a higher-dimensional feature space. Factorization Machines, on the other hand, use an explicit mapping into the polynomial kernel space. But they learn the "right" polynomial jointly, by mapping the base features into a low-dimensional dense space where the higher-order terms of the polynomial are dot products (we're not taking dot products in the original feature space!). FM thus learns the right polynomial (a non-convex task) jointly with the original supervised learning task.
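To make that concrete, here is a minimal sketch of a degree-2 FM prediction in numpy (variable names are my own, not taken from any reference implementation): each feature i gets a k-dimensional embedding, and the pairwise interaction weights are the dot products of those embeddings, computed in O(d*k) via the usual squared-sum identity.

    import numpy as np

    # Sketch of a degree-2 Factorization Machine prediction.
    # x: feature vector of length d
    # w0, w: first-order weights; V: d x k matrix of low-dimensional embeddings
    def fm_predict(x, w0, w, V):
        linear = w0 + x @ w
        # Pairwise term: sum_{i<j} <V[i], V[j]> x_i x_j, computed cheaply as
        # 0.5 * (|V^T x|^2 - sum_i |V[i]|^2 x_i^2).
        xv = x @ V
        pairwise = 0.5 * (xv @ xv - ((x ** 2) @ (V ** 2)).sum())
        return linear + pairwise

    rng = np.random.default_rng(0)
    d, k = 10, 3
    x = rng.random(d)
    w0, w, V = 0.1, rng.normal(size=d), rng.normal(size=(d, k))
    print(fm_predict(x, w0, w, V))

The non-convexity comes from the products of embedding entries inside that pairwise term; the "kernel" weights are learned rather than fixed by the feature map.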
Thanks for replying in spite of the fact that the audience will have moved on to other threads by now. I am totally with you on the claim that to deploy a solution one has to be fully cognizant of its optimization-theoretic implications. What I meant by orthogonal is that the properties of the model do not depend on which algorithm was used to optimize it, provided you reach a unique optimum. Here it is tricky to make the same claim because, although the objective is convex in each block of variables, it is not jointly convex.
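A toy example of that distinction (my own, just to illustrate the blockwise-vs-joint convexity point): f(a, b) = (1 - a*b)^2, which mimics an FM-style product of parameters, is convex in a for fixed b and in b for fixed a, yet its Hessian at the origin is indefinite.

    import numpy as np

    # f(a, b) = (1 - a*b)^2
    # d2f/da2 = 2*b^2 >= 0 and d2f/db2 = 2*a^2 >= 0 (convex in each block),
    # but the full Hessian can be indefinite, so f is not jointly convex.
    def hessian(a, b):
        return np.array([
            [2 * b**2,      4 * a * b - 2],
            [4 * a * b - 2, 2 * a**2],
        ])

    print(np.linalg.eigvalsh(hessian(0.0, 0.0)))  # [-2.  2.] -> indefinite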
Where I am not with you is in the hint that kernels have feature maps that are necessarily opaque. In fact, inhomogeneous polynomial kernels are great examples where the feature map is known in closed form. Although such a map is available, it is not always efficient to use it. But what it lets you do is optimize the weights of each dimension of the mapped feature space while executing in the native space. In fact, and I am greatly tickled by the coincidence, I had sent a mail to colleagues where I suggested a form in which the standard Euclidean dot product in the poly kernel is replaced by a convex combination of other favourite kernels of choice. The training algorithm remains almost the same. I was not aware of this piece of work, but the mechanism is pretty much the same (a little more general: it essentially replaces the standard dot product that appears in the kernel expression by other kernel expressions; if you are thinking recursion now, well, that's intended) and can be pushed further.
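A rough sketch of that construction, in my own notation (the function names and parameters here are illustrative, not from any paper): take the inhomogeneous polynomial kernel (<x, z> + c)^p and swap the Euclidean dot product for a convex combination of base kernels. Since each base kernel could itself be built the same way, the recursion falls out naturally.

    import numpy as np

    # Build a polynomial-style kernel whose "dot product" is a convex
    # combination of arbitrary base kernels.
    def poly_over_kernels(base_kernels, alphas, c=1.0, degree=2):
        alphas = np.asarray(alphas, dtype=float)
        assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
        def k(x, z):
            inner = sum(a * kf(x, z) for a, kf in zip(alphas, base_kernels))
            return (inner + c) ** degree
        return k

    # Example base kernels: the plain dot product and an RBF.
    dot = lambda x, z: float(np.dot(x, z))
    rbf = lambda x, z, gamma=0.5: float(np.exp(-gamma * np.sum((x - z) ** 2)))

    k = poly_over_kernels([dot, rbf], alphas=[0.7, 0.3], c=1.0, degree=3)
    x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(k(x, z))

Since convex combinations and positive integer powers of PSD kernels are again PSD, the result stays a valid kernel and plugs into the same training algorithm.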