In the previous post, we talked about the need to include rare features in prediction models. However, as demonstrated in the literature, linear models might fail to learn a predictor when the input data is very sparse. The issue can be addressed with factorization machines (FMs), because the method considers all pairwise interactions of features. Nevertheless, to learn useful correlations, we need to add as many features as possible, because we do not know in advance which pairs might best explain the labels.
For instance, a single missing plot word might push a sample from the positive region into the negative one. To address the problem, we need a new feature domain that can explain why a user liked the sample. It is well known, for example, that actors and directors strongly influence whether users will watch a movie. So, if the plot domain is not conclusive, adding the person domain might help, because we now consider the importance of individual persons, correlations between persons, and correlations between persons and plot words. In other words, if the dependency cannot be explained with a linear model, the factorization of the weights might be able to reveal a latent connection between features that better explains the labels.
The challenge is now to handle more than 50K additional features for the persons. The new features are very sparse: a film usually has a single director, and most actors appear in only very few movies. Even if an actor played in lots of movies, say 300, the sparsity of that feature for a 15K dataset is still 98%, and on average the number is below 30, which means >99.8% sparsity.
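To make the arithmetic explicit, here is a minimal sketch that computes the sparsity of a single person feature as the fraction of movies in which it is zero. The helper name `sparsity` is ours and purely illustrative; the numbers are the ones from the paragraph above.

```python
def sparsity(nonzero: int, total: int) -> float:
    """Fraction of samples in which a feature is zero."""
    return 1.0 - nonzero / total


n_movies = 15_000  # size of the dataset mentioned above

# A very busy actor who appears in 300 of the movies.
busy_actor = sparsity(300, n_movies)

# A more typical person feature with 30 appearances.
typical_person = sparsity(30, n_movies)

print(f"busy actor:     {busy_actor:.1%}")
print(f"typical person: {typical_person:.1%}")
```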
The good news is that FMs only use dot products and squaring, which can be done efficiently with sparse matrices. Furthermore, we can cache values that do not depend on the input data, like the squared factors. We did not push the limits, but it is safe to say that even on a desktop machine, 50K features are no problem if the implementation exploits the sparsity in the data.
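As a rough illustration, and not the implementation we actually used, the following sketch scores one binary sample with a second-order FM while touching only its non-zero features. It relies on the standard identity that the sum of all pairwise interactions equals half of, per factor dimension, the squared sum of the active factors minus the sum of their squares. All names (`fm_score_sparse`, `fm_score_naive`, `V_sq`) and the sizes (8 factors, 12 active features) are assumptions made for the example; the naive version is only there to check the fast one.

```python
import random
from itertools import combinations


def fm_score_sparse(active, w0, w, V, V_sq):
    """FM score of a binary sample given by the indices of its non-zero features.

    The cost depends on the number of active features, not on the total
    number of features.
    """
    k = len(V[0])
    linear = sum(w[i] for i in active)
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] for i in active)        # sum of active factors
        s_sq = sum(V_sq[i][f] for i in active)  # sum of cached squared factors
        pairwise += s * s - s_sq
    return w0 + linear + 0.5 * pairwise


def fm_score_naive(active, w0, w, V):
    """Reference version that enumerates every pair of active features."""
    linear = sum(w[i] for i in active)
    pairwise = sum(
        sum(a * b for a, b in zip(V[i], V[j]))
        for i, j in combinations(active, 2)
    )
    return w0 + linear + pairwise


rng = random.Random(0)
n_features, k = 50_000, 8  # 50K features as in the text, 8 factors assumed

w0 = 0.1
w = [rng.gauss(0, 0.01) for _ in range(n_features)]
V = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(n_features)]

# The squared factors do not depend on the input, so they are computed
# once and reused for every sample until the factors are updated.
V_sq = [[v * v for v in row] for row in V]

# One movie: only a handful of plot words and persons are present.
active = sorted(rng.sample(range(n_features), 12))

fast = fm_score_sparse(active, w0, w, V, V_sq)
slow = fm_score_naive(active, w0, w, V)
print(abs(fast - slow) < 1e-9)
```

For non-binary features the same idea holds, with each factor multiplied by the feature value and each squared factor by the squared value. In practice one would use a sparse matrix library instead of Python lists.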
Bottom line: we lack the ability to understand the true preferences of users, and even if we captured those preferences, the data at hand often cannot fully explain them. This makes it necessary to add all kinds of features we can find, to explain as much as possible of the user preferences. Even if some features are not utilized immediately, they might be useful later in case of concept drift, or for preferences that are not expressed by the labels yet.