We try to build a system that is powerful and efficient on the one hand, but also as flexible and modular as possible. That is why we combine ideas from several domains to use their strengths and to overcome their weaknesses.
Recently we stumbled, again, upon the non-negative matrix factorization (NMF) approach and decided to give it a try. Similar to image data, our data is well suited for the approach since it contains only non-negative feature values. Again, it is not our intention to give a detailed introduction here, but to focus on integrating the algorithm as a black box for feature construction and topic modeling.
The setup is identical to the one in the other detours. We have a set of movies X; a single movie is described by a vector x that contains a subset of all possible features. A ‘1’ indicates that a feature is present, otherwise a ‘0’ is used. The result is a very sparse matrix X.
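To make the setup concrete, here is a minimal sketch of how such a sparse binary movie-feature matrix could be built. The toy movies and keywords are made up for illustration; the real data set is of course much larger.

```python
# Hypothetical sketch: building the sparse binary movie-feature matrix X.
# The movie keyword sets below are toy data, not the real data set.
from scipy.sparse import csr_matrix

movies = [
    {"romance", "love", "espionage"},
    {"romance", "double-life"},
    {"espionage", "assumed-identities"},
]

# Fixed keyword vocabulary: one column per possible feature
vocab = sorted(set().union(*movies))
index = {kw: j for j, kw in enumerate(vocab)}

rows, cols = [], []
for i, keywords in enumerate(movies):
    for kw in keywords:
        rows.append(i)
        cols.append(index[kw])

# A '1' marks a present feature; all other entries are implicit zeros
X = csr_matrix(([1] * len(rows), (rows, cols)),
               shape=(len(movies), len(vocab)))
```

With real data the matrix has thousands of rows and columns but only a handful of non-zeros per row, which is exactly the regime where a sparse representation pays off.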
Matrix factorizations have been used to solve various problems, including recommendations, though usually in a collaborative setup. We will use one as a black box to construct ‘topic neurons’. The idea is to approximate the data X as a product of two matrices W and H: X ≈ W*H. These matrices decompose the data into smaller parts that are easier to work with and better suited to interpret the data. In other words, H can be considered a matrix of basis vectors used to describe the data, and the reconstruction of the data is then a weighted combination of the different bases.
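As a black box, the factorization boils down to a few lines. The sketch below uses scikit-learn's NMF implementation (our choice here; the post does not depend on a particular library) on random toy data standing in for the real movie matrix:

```python
# Sketch of NMF as a black box: X ~ W * H, all entries non-negative.
# scikit-learn's NMF is an assumption for illustration; the toy matrix
# stands in for the real sparse movie-feature data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = (rng.rand(100, 50) < 0.1).astype(float)  # toy sparse binary data

model = NMF(n_components=8, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # per-movie weights over the topics
H = model.components_        # basis vectors ('topic neurons') over keywords

approx = W @ H               # reconstruction of X as weighted bases
```

The number of components (here 8) controls how coarse or fine the extracted topics are; with our limited data, small values are the safer choice.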
Since our ultimate goal is to describe high-level concepts of the data, we focus on the matrix H. For our experiments we used the genre ‘romantic comedy’. It should be noted that over-fitting is very likely because the number of movies in this genre is rather limited.
However, the model is still capable of extracting broader topics from the data, as we can see in this example:
romance -> 0.29
love -> 0.07
americans-abroad -> 0.03
assumed-identities -> 0.03
espionage -> 0.02
romantic -> 0.02
double-life -> 0.02
The interpretation of this neuron is rather simple: it is the well-known ‘fall-in-love-with-a-spy’ theme. Furthermore, the model also clustered similar words like ‘romance’, ‘love’ and ‘romantic’ together. But as the rapidly decreasing magnitude of the weights indicates, and as the manual analysis of other neurons confirms, a lot more data is required to get stable models.
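Keyword lists like the one above can be read directly off a row of H. A minimal sketch, assuming `H` and `vocab` from the factorization step (the toy values here mirror the example neuron):

```python
# Hedged sketch: listing the top-weighted keywords of one 'topic neuron',
# i.e. one row of H. Toy vocab and weights stand in for the real model.
import numpy as np

vocab = np.array(["romance", "love", "espionage", "double-life", "comedy"])
H = np.array([[0.29, 0.07, 0.02, 0.02, 0.0]])  # one neuron's keyword weights

neuron = H[0]
top = np.argsort(neuron)[::-1][:3]  # indices of the largest weights
for j in top:
    print(f"{vocab[j]} -> {neuron[j]:.2f}")
```

Sorting each neuron this way is also how we performed the manual analysis mentioned above.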
We repeated our experiments with other genres and found that the results strongly depend on the keyword distribution and the number of movies. This confirms our initial assessment that much more data is required before we can actually use the model in our recommendation tool chain.
In the second part of the detour, we will describe the results when we use the model to transform the data into feature space in order to cluster the movies.