The art of hand-crafting features is very old, and with the advent of deep learning there is a trend to learn features instead of crafting them, which is a great thing. But as long as we have not found a way to turn very sparse bag-of-words features into generic “movie features”, we have to remain faithful to this black art.
Very recently, we rediscovered Factorization Machines (FM) with a logistic loss to build a fairly good model that estimates the preference for an arbitrary movie encoded with sparse textual features. Since the model is non-linear, it outperforms logistic regression, mostly thanks to its ability to consider pairwise interactions of features.
Nevertheless, the performance of FMs depends on the quality of the input features, although this limitation holds for any model. Therefore, we have to find an encoding that allows the model to learn abstract concepts and fine-grained preferences at the same time. To be more specific, we consider the following feature domains:
– plot keywords
– themes
– genres
– flags
Since in each of these domains a movie can have multiple values, no 1-hot encoding can be used. Now the question is: what is most important to precisely model preferences? The answer is surely subjective, but if we rephrase the question as “which domain carries the most information”, we can at least provide a reasonable ranking.
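To make the multi-value point concrete, here is a minimal multi-hot encoding sketch: every known feature value gets its own binary slot, and several slots can be active for one movie. The vocabulary and the example movie below are made up for illustration.

```python
# Toy vocabulary spanning the feature domains; purely illustrative.
vocab = ["superhero", "duo", "sidekick", "heroic mission", "fantasy", "sci-fi"]
index = {feat: i for i, feat in enumerate(vocab)}

def encode(features):
    """Multi-hot encoding: one slot per vocabulary entry,
    and any number of slots may be active at once."""
    x = [0.0] * len(vocab)
    for f in features:
        x[index[f]] = 1.0
    return x

x = encode(["superhero", "sidekick", "fantasy"])
# three active slots instead of a single 1-hot position
```

A plain 1-hot encoding would force us to pick exactly one value per domain, which would throw away most of the description.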
The most specific information is encoded in the keywords and the most abstract information in the genres. For instance, the words “superhero, duo, sidekick” are much more specific than the genre “fantasy”. The themes are in between: “heroic mission” tells more than the genre, but less than the plot keywords.
What about the flags? They are also fairly generic and describe attributes found in most movies. In other words, they are like a limited, condensed set of plot keywords, intended to notify users that a broader theme is present, for example whether parents can watch a movie with their kid, or the level of violence.
With those domains, a model is able to capture lots of fine-grained preferences, such as sci-fi movies for kids about dogs and superheroes, to name just one. However, as usual, the description of a movie is often incomplete, so we might only have partial information, and in the worst case the most crucial domain is missing. For a trained model this means we might see a very unbalanced mixture of the different domains.
More specifically, in the case of FMs, for a movie vector x we have a linear term
y_w = dot(x, W)
and a pairwise term
y_v = 0.5 * sum(dot(x, V)**2 - dot(x**2, V**2))
to decide whether a movie should be suggested or not:
y := sigmoid(y_w + y_v + bias) > 0.5
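The decision rule above can be sketched in a few lines of numpy. This is a minimal prediction-only sketch (no training), assuming W is the vector of linear weights and V the (n_features, k) factor matrix from the equations:

```python
import numpy as np

def fm_predict(x, W, V, bias):
    """Score one movie vector x with a trained FM.
    W: linear weights of shape (n,), V: factor matrix of shape (n, k)."""
    y_w = x @ W                                         # linear term
    # pairwise interactions via the standard O(n*k) FM identity
    y_v = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    p = 1.0 / (1.0 + np.exp(-(y_w + y_v + bias)))       # logistic link
    return p > 0.5                                      # suggest the movie?
```

Note the 0.5 factor in the pairwise term: it compensates for counting each feature pair twice in the squared sum.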
If all features are weighted equally, say 1.0 whenever a feature is present, the final decision might be dominated by less important but very frequent features. In a drastically simplified view, a group of flags alone could produce a high y-value, which means the movie would be suggested.
To check our assumption, we trained two models:
1) all feature values are binary: 1.0 or 0.0
2) the feature value encodes the importance of its domain: keywords: 1.00, themes: 0.75, genres: 0.50, flags: 0.25
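Encoding variant 2) can be sketched as follows: each active feature carries the importance of its domain instead of a flat 1.0. The data structures (the (domain, value) pairs and the slot mapping) are illustrative, not our actual pipeline:

```python
# Domain weights from encoding variant 2).
DOMAIN_WEIGHT = {"keyword": 1.00, "theme": 0.75, "genre": 0.50, "flag": 0.25}

def encode_weighted(features, slot, n_slots):
    """features: (domain, value) pairs; slot maps each pair to a
    vector position. An active slot gets its domain's weight."""
    x = [0.0] * n_slots
    for domain, value in features:
        x[slot[(domain, value)]] = DOMAIN_WEIGHT[domain]
    return x

slot = {("keyword", "superhero"): 0, ("genre", "fantasy"): 1, ("flag", "violence"): 2}
x = encode_weighted([("keyword", "superhero"), ("flag", "violence")], slot, 3)
# x == [1.0, 0.0, 0.25]
```

With this scheme a pile of flags contributes far less to y than a single matching keyword.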
For the training, all (hyper-)parameters were kept equal; only the data differed. We used the total number of errors as the measure for comparison:
1) 38 errors
2) 31 errors
The gain of 0.71% in accuracy is not very impressive, but it is also unlikely that the chosen encoding is optimal. We believe a better encoding would yield a higher gain, so the next step is to experiment with different weights to find the best combination.
In a nutshell, Factorization Machines have already helped us a lot to cope with very sparse data, but our basic feature encoding seems far from optimal. Therefore, we need to further investigate the weighting of features, which includes tf-idf schemes and a more systematic search for the best weighting values.
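As a starting point for the tf-idf experiments, an idf-style weighting could be derived directly from the catalog: features that appear in almost every movie (typically flags) get weights near zero, while rare plot keywords get the highest weights. The tiny corpus below is made up for illustration:

```python
import math

# Made-up toy corpus: one set of features per movie.
movies = [
    {"superhero", "fantasy", "violence"},
    {"dog", "family", "violence"},
    {"sidekick", "fantasy", "violence"},
]

# Document frequency: in how many movies does each feature occur?
df = {}
for feats in movies:
    for f in feats:
        df[f] = df.get(f, 0) + 1

# Inverse document frequency: log(N / df).
idf = {f: math.log(len(movies) / c) for f, c in df.items()}
# "violence" appears everywhere -> weight 0;
# "superhero" appears once -> the highest weight
```

Such data-driven weights could replace or refine the hand-picked 1.00/0.75/0.50/0.25 scheme.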