Estimating good models from high-dimensional but very sparse input data can be challenging. We still favor factorization machines because the method is simple, elegant, and powerful, and it performs very well on sparse data. Moreover, it considers pairwise interactions of input variables and is therefore well suited for a preference-based model.
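As a quick refresher, a factorization machine combines a linear part with factorized pairwise interactions. Below is a minimal numpy sketch of the prediction step for a single input; the variable names are ours, and the pairwise term uses the well-known O(nk) reformulation instead of the naive sum over all pairs:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine prediction for one (mostly zero) input x.

    x  : (n,) input vector
    w0 : global bias
    w  : (n,) linear weights
    V  : (n, k) latent factors; the pairwise weight for features
         (i, j) is the inner product <V[i], V[j]>
    """
    linear = w0 + w @ x
    # O(n*k) trick: sum over i<j via squared sums instead of all pairs
    s = V.T @ x                   # (k,)
    s2 = (V ** 2).T @ (x ** 2)    # (k,)
    pairwise = 0.5 * np.sum(s * s - s2)
    return linear + pairwise
```

For sparse inputs only the nonzero entries of `x` contribute, which is what makes the method cheap despite modeling every pairwise interaction.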
Despite all these advantages, we need to find an encoding of the input data that maximizes the gain for the model. In most papers the input data consists of categorical (1-out-of-K) or indicator variables that are normalized to sum to one. In contrast, our data is binary, and several active indicators often belong to a single variable like ‘themes’ or ‘plotwords’.
So, the first approach is simply the identity encoding that feeds the data to the model as-is. Since our features follow a power-law distribution, we need an optimizer that takes care of rare features. In other words, the learning rate should be adjusted depending on how frequently a feature is encountered. This amounts to a frequency-dependent scaling of the features, since an error on a high-frequency keyword is not as severe as one on a low-frequency keyword.
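One way to get such frequency-aware updates is an Adagrad-style per-feature learning rate: features that fire often accumulate a larger squared-gradient sum and thus receive smaller steps, while rare features keep an effectively larger learning rate. A minimal sketch (the function name and `eta` default are our choices):

```python
import numpy as np

def adagrad_update(w, grad, g2_sum, eta=0.1, eps=1e-8):
    """One Adagrad step on weights w.

    Frequent features accumulate a larger squared-gradient sum g2_sum
    and therefore get progressively smaller effective learning rates;
    rarely updated features keep steps close to eta.
    """
    g2_sum += grad ** 2
    w -= eta * grad / (np.sqrt(g2_sum) + eps)
    return w, g2_sum
```

After two identical gradients on the same coordinate, the second step is already smaller by a factor of sqrt(2), which is exactly the behavior we want for high-frequency keywords.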
Now that we have found -again- a good model for our data, we need to get back to the original problem: estimating the utility of an item with very few features, each of which might also have a very low frequency in the data set. Nevertheless, some of those rare features are essential for a good prediction, which is why we cannot ignore them.
For example, a feature might be present only 10 times in the training set, but always correlated with a positive preference. Think of a tag that describes a specific director, where movies with that tag _always_ get 5 stars from a user. Such a feature should have a strong influence on the final decision of the model. Stated differently, the expressive power of such features can be extremely useful to disentangle explanatory factors, which hopefully helps to get difficult cases right.
A small step further and we arrive at the ‘long tail’ problem that every recommendation system has to fight. Let’s consider an example. A user wants to choose a movie and decides to start with drama movies. The genre information is important for a first grouping, but since almost 1/3 of all movies belong to this genre, its descriptive power is limited. In decision-tree terms, we eliminated about 66% of the candidates, but 33% still remain. Without any further information, a rating that depends only on this genre is not very informative and does not explain the preferences well, because drama is a very broad topic. A narrower genre like ‘scifi-horror-drama’ would reduce the candidate set much further and carries a lot more information.
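The intuition can be made precise with self-information: observing a feature with relative frequency p conveys -log2(p) bits. A tiny sketch, using the 1/3 frequency from above and a hypothetical 1/100 frequency for the narrow genre:

```python
import math

def info_bits(p):
    """Self-information of observing an event with probability p, in bits."""
    return -math.log2(p)

broad = info_bits(1 / 3)     # 'drama' covers ~1/3 of all movies: ~1.58 bits
narrow = info_bits(1 / 100)  # assumed frequency of 'scifi-horror-drama': ~6.64 bits
```

So the rare combined genre tells us roughly four times as much, per observation, as the broad one.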
Bottom line: regardless of how we scale a set of common features, if some preferences cannot be explained by them, accuracy will suffer. However, if we add a set of rare features, even ones with very low frequency, we might be able to explain more preferences thanks to the additional expressive power they introduce. In a nutshell, it is often better to add another 10 features to model the input data, because the additional computational cost is usually negligible and the positive impact can be tremendous. Furthermore, if resources are limited, the model can be optimized by pruning features that do not contribute (much) to the final prediction.
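For the pruning step, one simple heuristic is to score each feature by the magnitude of its learned parameters and drop everything below a threshold. A sketch for an FM-style model with linear weights `w` and latent factors `V` (the scoring rule and threshold are our assumptions, not a prescribed method):

```python
import numpy as np

def prune_features(w, V, threshold=1e-3):
    """Keep features whose learned parameters carry non-negligible weight.

    A feature's influence is approximated by |w_i| plus the norm of its
    latent factor row V[i]; features scoring below the (tunable)
    threshold are dropped.
    """
    score = np.abs(w) + np.linalg.norm(V, axis=1)
    keep = score > threshold
    return keep, w[keep], V[keep]
```

The boolean mask can then also be applied to the input encoding, so pruned features never have to be materialized at prediction time.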