In the text domain, bag-of-words features lead to very sparse, high-dimensional input data. This often helps to separate the data, but it also makes it very hard to estimate correlations between features. If the features are already descriptive, a linear SVM might suffice to train an accurate classification model. But if the explanatory factors of the data are entangled, a non-linear model is usually required for good results.
What we need is a model that benefits from sparse data, scales well and is able to estimate correlations from very little data. The good news is that such a model already exists, and it is both elegant and very easy to train: “Factorization Machines”, FM for short. The objective for the pairwise version can be summarized as:
linear = b + sum(w*x)
corr = 0.5 * sum((V*x)**2 - (V**2 * x**2))
y_hat = linear + corr
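The pseudocode above can be made concrete with a minimal NumPy sketch of the forward pass; the shapes and random values below are made up for illustration:

```python
import numpy as np

def fm_forward(x, b, w, V):
    """Pairwise Factorization Machine score for one input vector x."""
    linear = b + w @ x  # the linear SVM part
    # (V.T @ x)**2 sums the factors first, then squares; subtracting the
    # squared-then-summed term (V**2).T @ x**2 leaves exactly the
    # pairwise interactions, each counted twice -- hence the 0.5.
    corr = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))
    return linear + corr

# Made-up shapes: N=6 features, K=2 latent factors.
rng = np.random.default_rng(0)
N, K = 6, 2
x = rng.random(N)
b, w, V = 0.1, rng.random(N), rng.random((N, K))
print(fm_forward(x, b, w, V))
```

The `corr` term is equivalent to summing `(V[i] @ V[j]) * x[i] * x[j]` over all pairs `i < j`, but computes in O(N*K) instead of O(N^2).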
In other words, it combines a linear SVM with an objective to capture pairwise correlations in the input data. The parameters are:
"x": input vector of dimension N
"b": global bias, scalar
"w": weight vector of dimension N
"V": factorized correlation matrix of dimension N x K
"K": number of latent factors
The matrix “V” provides a low-rank approximation W ≈ V * V.T (dimension N x N) of the full correlation matrix, and the shared parameters in “V” are the reason why learning is so efficient and why it is possible to estimate correlations from very little data. The objective also makes clear why the model benefits so much from very sparse data: for the sums, we only need to consider the non-zero elements of “x”. In our case, movie metadata, there are 10 non-zero entries on average (with N=1,000).
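The sparsity argument can be verified directly: evaluating the correlation term over only the non-zero entries gives the same result as the dense computation. A small sketch with made-up indices and values:

```python
import numpy as np

N, K = 1000, 5
rng = np.random.default_rng(1)
V = rng.normal(size=(N, K))

idx = np.array([3, 42, 977])   # the ~10 non-zero features (made up)
val = np.array([1.0, 0.5, 2.0])

# Dense evaluation of the correlation term over all N features ...
x = np.zeros(N)
x[idx] = val
corr_dense = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))

# ... equals a sparse evaluation that touches only len(idx) rows of V.
Vi = V[idx] * val[:, None]  # rows are v_i * x_i
corr_sparse = 0.5 * np.sum(Vi.sum(axis=0) ** 2 - (Vi ** 2).sum(axis=0))
print(corr_dense, corr_sparse)
```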
To learn something about the model with real-world data, we trained a simple classifier with the cross-entropy objective on the ratings of a single user:
y_hat_sm = sigmoid(y_hat)
objective = -(y * T.log(y_hat_sm) + (1 - y) * T.log(1 - y_hat_sm))
The model (I) was trained with “K”=5, weight decay and RMSprop. For comparison, we also trained a logistic regression model (II):
(I) objective = 16.71, 94.44% accuracy ( 54 errors)
(II) objective = 37.74, 89.39% accuracy (103 errors)
We did not tune any hyper-parameters, because we were mainly interested in the capacity of the models. As can be seen, the scores of both models are pretty decent, mainly because of the high-dimensional input space. However, the accuracy of the non-linear model is about 5% higher, and since the only difference is the “corr” part of the objective function, there is strong evidence that the correlation term actually improved the model.
Incorporating the model into our EPG system is straightforward: the outputs of the model, scores from 0..1, are used to rank all unseen movies. To mark the best matches, we highlight movies with a score of 0.9 or higher. And since we use a standard interface, drop-in replacements for personalized recommendations require no changes to the rest of the code.
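The ranking and highlighting step can be sketched in a few lines; the movie titles and scores below are made up:

```python
# Hypothetical model scores (0..1) for unseen movies.
scores = {"Movie A": 0.95, "Movie B": 0.42, "Movie C": 0.91}

# Rank all unseen movies by score ...
ranked = sorted(scores, key=scores.get, reverse=True)

# ... and highlight those scoring 0.9 or higher as best matches.
highlighted = [title for title in ranked if scores[title] >= 0.9]
print(highlighted)  # ['Movie A', 'Movie C']
```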
In a nutshell, Factorization Machines are a pretty clever way to make the most of your sparse data, with minimal computational overhead but a notable gain in model accuracy.