Factorization Machines (adroll.com)
30 points by dialtone on Sept 8, 2015 | 14 comments



Isn't the key insight -- replacing `w_{i,j}` with an inner product of learned vectors -- also known as the kernel trick in machine learning? (Here, they seem to use a polynomial kernel: https://en.wikipedia.org/wiki/Polynomial_kernel).

EDIT: Kernel tricks in general are nice because they also generalise well to infinite-dimensional spaces (e.g. the RBF kernel) and compute the dot products in that space without actually computing the embedding. RBFs: https://en.wikipedia.org/wiki/Radial_basis_function_kernel


Not quite - it's different from the kernel trick, which is impossible at this scale (there's no way you can train an RBF kernel in a decent amount of time when your space has 10^8 features and your training set has 10^9 observations). The idea of a factorization machine is to learn the polynomial kernel, and it provides a mechanism to do so that scales "linearly" (with some arbitrary constant that you're in control of).
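For concreteness, here's a minimal sketch of the model in question, in Python/numpy rather than anything from the article (all names made up): the quadratic weight w_{i,j} is replaced by the dot product of two learned k-dimensional vectors.

    import numpy as np

    def fm_predict(x, w0, w, V):
        # x: (n,) features, w0: bias, w: (n,) linear weights
        # V: (n, k) factors; feature i gets the embedding V[i]
        pairwise = sum(V[i] @ V[j] * x[i] * x[j]
                       for i in range(len(x)) for j in range(i + 1, len(x)))
        return w0 + w @ x + pairwise

Written naively like this the pairwise term is O(k n^2); a standard rewrite brings it down to O(kn), which is where the linear scaling comes from.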


Oops, sorry, I didn't mean to say you used an RBF kernel. I edited my comment above. I meant to say your embedding resembles the polynomial kernel (not RBF, which was just meant as a generalisation of such neat tricks :)).

What I meant to say was that you didn't need to compute the embedding explicitly. Since you embed into a space that has a nice structure, you can compute the dot product of the embedded vectors without having to compute the embedding explicitly.
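Concretely, for the degree-2 homogeneous polynomial kernel, here's a toy check (Python/numpy, nothing to do with the article's code) that the implicit form (x.z)^2 equals the dot product of the explicit quadratic feature maps:

    import numpy as np

    def phi(x):
        # explicit degree-2 feature map: all products x_i * x_j
        return np.outer(x, x).ravel()

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=5), rng.normal(size=5)
    assert np.isclose((x @ z) ** 2,      # implicit: O(n)
                      phi(x) @ phi(z))   # explicit: O(n^2)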


Could you elaborate on its differences from the kernel trick? It seems to me @xtacy is exactly right on that modeling point. Training the model using scale-appropriate optimization is an orthogonal concern, if that is what you were getting at, i.e. an off-the-shelf kernelization would not be sufficient to run this at scale.


In very large-scale sparse settings, the optimization strategy is tightly coupled to the modeling, not orthogonal to it. The reason is that "data beats algorithm": a "dumb" model that can be trained on 1TB of data using a "dumb" optimizer will easily outperform a fancier model whose only available optimizers are more complex ones that can only eat 1GB of data in the same amount of time. This paper is a classic when it comes to the kind of trade-offs you have to deal with in large-scale web settings: http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scal...

The kernel trick uses an implicit mapping into a higher-dimensional feature space. Factorization Machines, on the other hand, use an explicit mapping into the polynomial kernel space. However, they learn the "right" polynomial jointly, by mapping the base features into a low-dimensional dense space where the higher-order terms of the polynomial are dot products (we're not taking dot products in the original feature space!). FM then learns the right polynomial (a non-convex task) jointly with the original supervised learning task.
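Put differently, FM never stores the full matrix of pairwise weights; it constrains it to be low-rank. A toy sketch (Python/numpy, made-up sizes):

    import numpy as np

    n, k = 6, 2                          # n features, rank-k factors, k << n
    V = np.random.default_rng(1).normal(size=(n, k))
    W = V @ V.T                          # implied interactions: w_{i,j} = <v_i, v_j>
    print(np.linalg.matrix_rank(W))      # at most k; n*k parameters instead of n*(n-1)/2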


Thanks for replying in spite of the fact that the audience will have moved on to other threads by now. I am totally with you on the claim that to deploy a solution one has to be fully cognizant of its optimization-theoretic implications. What I meant by orthogonal is that the properties of the model do not depend on which algorithm was used to optimize it, provided you reach a unique optimum. Here it is tricky to make the same claim because, although the objective is convex in a subset of the variables, it is not jointly convex.
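A toy illustration of that last point (just an example I picked, not from the article): with squared loss, even the simplest factorized term f(u, v) = (u*v - 1)^2 is convex in u for fixed v and vice versa, yet its Hessian is indefinite, so it is not jointly convex:

    import numpy as np

    # f(u, v) = (u*v - 1)^2; Hessian at (u, v) = (0, 0) is [[0, -2], [-2, 0]]
    H = np.array([[0.0, -2.0], [-2.0, 0.0]])
    print(np.linalg.eigvalsh(H))         # [-2.  2.] -> indefinite, not jointly convex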

Where I am not with you is in the hint that kernels have feature maps that are necessarily opaque. In fact, inhomogeneous polynomial kernels are great examples where the feature map is known in closed form. Although such a map is available, it is not always efficient to use it. But what it lets you do is optimize the weights of each dimension of the mapped feature space while executing in the native space. In fact, and I am greatly tickled by the coincidence, I had sent a mail to colleagues where I suggested a form where the standard Euclidean dot product in the polynomial kernel is replaced by a convex combination of other favourite kernels of choice. The training algorithm remains almost the same. I was not aware of this piece of work, but the mechanism is pretty much the same (a little more general: essentially it replaces the standard dot product that appears in the kernel expression with other kernel expressions. If you are thinking recursion now, well, that's intended) and can be pushed further.
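Roughly, something like the sketch below (Python, placeholder kernels and numbers, just to make the recursion concrete): take the polynomial kernel expression and swap the Euclidean dot product for any other kernel, or a convex combination of kernels, and you can keep nesting.

    import numpy as np

    def poly_over(k0, degree=2, c=1.0):
        # inhomogeneous polynomial kernel built on top of an arbitrary base kernel k0
        return lambda x, z: (k0(x, z) + c) ** degree

    dot = lambda x, z: x @ z
    rbf = lambda x, z, g=0.5: np.exp(-g * np.sum((x - z) ** 2))

    base = lambda x, z: 0.7 * dot(x, z) + 0.3 * rbf(x, z)   # convex combination
    K = poly_over(base)                                      # polynomial on top of it
    K2 = poly_over(K)                                        # ...and it recurses

    x, z = np.ones(3), np.arange(3.0)
    print(K(x, z), K2(x, z))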


The Factorization Machine interaction term can be computed in O(kn) time, where k is a hyper-parameter of the model and n is the number of features in the model. The features are embedded in a k-dimensional space, so there is nothing infinite-dimensional going on with FM.
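Concretely, the O(kn) comes from a standard algebraic rewrite of the pairwise sum (the trick from Rendle's paper linked below; the code here is just a toy check in Python/numpy, not the article's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 50, 4
    x = rng.normal(size=n)               # feature vector
    V = rng.normal(size=(n, k))          # k-dimensional embedding per feature

    # naive: sum_{i<j} <V[i], V[j]> x_i x_j                              -> O(k n^2)
    naive = sum(V[i] @ V[j] * x[i] * x[j]
                for i in range(n) for j in range(i + 1, n))

    # rewrite: 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2] -> O(k n)
    s = V.T @ x
    fast = 0.5 * (s @ s - np.sum((V ** 2).T @ (x ** 2)))
    assert np.isclose(naive, fast)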


This is the main paper on factorization machines, which discusses some of the advantages:

http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf

In particular they're nice for recommendation algorithms.


If there are any Adroll folks on this thread, I'm really curious to know about Adroll's use of the D programming language, how it has worked out, anecdotes, stories, etc. That you use it already makes it an attractive place :)


Sure! Here are some insights: https://news.ycombinator.com/item?id=10136197

D is awesome for what we do with it. It enables our small team of data scientists to have a huge impact on the company, as we don't have to rely on frameworks such as your typical hadoop/spark cluster, which require engineers just to maintain them. We can also use the same code path offline and online, where we have very strong latency and QPS requirements.

D makes us ship more often, with fewer humans. The runtime is not perfect yet, but the language and compiler are awesome. It keeps our codebase small, simple and very efficient.


Now that's just bragging with the express purpose of making me envious. So how fast can you come to India :)


I don't quite understand the benefit of replacing the weighting with a vector. Could anyone offer additional insight?


Without factorization machines, you would have O(n^2) quadratic weights w_{i,j} to learn. This is bad for two reasons in a setting like online advertising, where you might have millions of features (i.e. n is large). First, it makes training your model very slow. Second, you won't have enough observations to learn all the weights w_{i,j} well. So using all quadratic weights in the online advertising setting is not feasible. But we still want to include feature interaction information. Enter FM.

To answer your question, FM is good for two main reasons. First, as a result of some mathematical magic, we can train the model in a time that scales linearly with the time needed to train a model without interaction terms. Second, each of the n features gets embedded in an inner product space with similar features ending up close to one another in some sense. This allows us to make a decent estimate of the interaction between features even if they do not appear frequently in the training data.
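To put rough numbers on the "not enough observations" point (the figures below are made up for illustration, not anyone's actual scale): with a million features, free pairwise weights are hopeless to estimate, while the factorized parameterization stays manageable.

    n, k = 1_000_000, 10                 # hypothetical feature count and factor dimension
    pairwise = n * (n - 1) // 2          # free weights w_{i,j}
    factorized = n * k                   # one k-dim vector per feature
    print(f"{pairwise:.2e} pairwise weights vs {factorized:.2e} FM parameters")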


The idea is that FM is a way to learn the polynomial kernel, by representing each high-order term as a low-dimensional dot product. It improves generalization error by learning a better polynomial kernel, because terms in the kernel are learnt jointly instead of separately, which helps a lot on sparse combinations of features.



