Without a proper method for online learning, we are forced to re-train our preference model from time to time. This includes updating the selection of the top-k features per domain, but also training a new model on the labels collected so far. We settled on Factorization Machines (FMs) in combination with a logistic loss because the model is surprisingly powerful yet simple, especially for high-dimensional sparse data. The actual optimization is done with L-BFGS, a second-order method, which avoids having to painfully select hyper-parameters like the learning rate. Thus, we only have to care about the regularization strength and the number of factors, which are selected by a very simple grid search.
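To make the setup concrete, here is a rough sketch (not our production code; the sizes, toy data, and regularization value are all illustrative) of a second-order FM with logistic loss trained end-to-end with SciPy's L-BFGS-B. The gradient is approximated numerically for brevity:

```python
import numpy as np
from scipy.optimize import minimize

def fm_predict(params, X, n_features, n_factors):
    """Raw FM score: bias + linear term + pairwise factor term."""
    b = params[0]
    w = params[1:1 + n_features]
    V = params[1 + n_features:].reshape(n_features, n_factors)
    # O(k*n) trick for the pairwise interactions:
    # 0.5 * sum_f ((X v_f)^2 - X^2 v_f^2)
    pair = 0.5 * np.sum((X @ V) ** 2 - (X ** 2) @ (V ** 2), axis=1)
    return b + X @ w + pair

def loss(params, X, y, n_features, n_factors, l2):
    """Logistic loss (labels in {-1, +1}) with L2 regularization."""
    z = y * fm_predict(params, X, n_features, n_factors)
    return np.mean(np.logaddexp(0.0, -z)) + l2 * np.sum(params[1:] ** 2)

rng = np.random.default_rng(0)
n, d, k = 200, 10, 4
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] * X[:, 1] + X[:, 2])  # interaction-driven toy labels

# bias = 0, linear weights = 0, small random factors
x0 = np.concatenate([[0.0], np.zeros(d), rng.normal(0, 0.01, d * k)])
res = minimize(loss, x0, args=(X, y, d, k, 1e-4), method="L-BFGS-B")
acc = np.mean(np.sign(fm_predict(res.x, X, d, k)) == y)
```

The `l2` value and factor count `k` stand in for the grid-searched hyper-parameters mentioned above.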
However, even with a second-order method, we have to carefully initialize the weights of the model. We set the bias b to 0, as well as the linear weights W; the factor weights are drawn from a normal distribution with mean = 0 and std = 0.01. We further fix the random seed for deterministic results. To study the effect of different initializations, we hand-picked a set of movies consisting of top movies, bottom movies, and some in between.
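A minimal sketch of that initialization scheme (the shape names `n_features` and `n_factors` are illustrative, not from our code):

```python
import numpy as np

def init_fm_weights(n_features, n_factors, std=0.01, seed=42):
    rng = np.random.default_rng(seed)  # fixed seed -> deterministic results
    b = 0.0                            # bias starts at zero
    W = np.zeros(n_features)           # linear weights start at zero
    V = rng.normal(0.0, std, (n_features, n_factors))  # small random factors
    return b, W, V

b, W, V = init_fm_weights(1000, 8)
```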
The first observation is that the extreme top/bottom predictions are insensitive to the choice of initial weights. A top movie will never leave the 90-99% score range, and a bottom movie will never reach 30%. But unseen movies, and those that are harder to predict, show huge fluctuations in their prediction scores depending on how the weights were initialized. For instance, the final score of an unseen movie can range from 0.49 to 0.69, which can be read either as a coin flip or as a marginal hint to watch the movie, neither of which is very useful.
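The measurement protocol behind this observation can be sketched as follows: train the same FM several times, varying only the seed of the factor initialization, and record the spread of the sigmoid score for one held-out item. All sizes and the toy data here are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def fm_score(p, X, d, k):
    b, w, V = p[0], p[1:1 + d], p[1 + d:].reshape(d, k)
    return b + X @ w + 0.5 * np.sum((X @ V) ** 2 - (X ** 2) @ (V ** 2), axis=1)

def fit(X, y, d, k, seed):
    # only the factor initialization changes between runs
    rng = np.random.default_rng(seed)
    x0 = np.concatenate([[0.0], np.zeros(d), rng.normal(0, 0.01, d * k)])
    obj = lambda p: np.mean(np.logaddexp(0.0, -y * fm_score(p, X, d, k)))
    return minimize(obj, x0, method="L-BFGS-B").x

rng = np.random.default_rng(0)
d, k = 6, 3
X = rng.normal(size=(80, d))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=80))
x_unseen = rng.normal(size=(1, d))  # a "movie" not in the training set

scores = [expit(fm_score(fit(X, y, d, k, s), x_unseen, d, k)[0])
          for s in range(5)]
spread = max(scores) - min(scores)  # fluctuation caused by the seed alone
```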
Bottom line: FMs are still our favorite choice, but since the underlying problem is non-convex, it is challenging to train a good model that generalizes to unseen data in a consistent way. Since most of the expressive power comes from the factor part of the model, we asked ourselves whether it would help to initialize those weights with some kind of pre-training, in order to start from a more suitable region of the weight space.