The latent topics found by a non-negative matrix factorization (NMF) of the co-occurrence matrix are pretty impressive. All we need to do is convert the existing movie data – usually a top-k keyword encoding – into a matrix X, calculate C = dot(X.T, X), and factorize C, with the number of latent topics selected from the range [50, 100].
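As a minimal sketch of this step, assuming a toy binary top-k keyword encoding (the movie data and the tiny topic count here are made up for illustration; the post uses 50–100 topics), the co-occurrence matrix can be built and factorized with scikit-learn's NMF:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy data: 5 movies x 6 keywords, binary top-k keyword encoding
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0],
], dtype=float)

C = X.T @ X  # keyword-keyword co-occurrence matrix, shape (6, 6)

# Factorize C; n_components is the number of latent topics
model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(C)   # keyword-to-topic feature matrix, shape (6, 2)
H = model.components_        # topic-to-keyword matrix, shape (2, 6)
```

Each column of W can then be read as a latent topic over keywords.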
As stated earlier, it is difficult to apply the NMF directly to the movie data because it is very sparse. Thus, we cannot use the NMF to embed the data itself, but we can use its latent topics for that purpose. However, the naive approximation of the latent representation, given by x*W, where ‘x’ is the encoding of an arbitrary movie and ‘W’ is the feature matrix of the NMF, does not lead to a good clustering of movies. That is why we tried to optimize the embedding of each movie into the latent space of the NMF.
Here is our approach. We take the weight matrix ‘W’ from the NMF, l2-normalize it column-wise, and seek a separate coefficient vector ‘c’ for each movie sample. In other words, we turn this into an optimization problem that is solved for each movie ‘x’ separately, to determine the importance of each topic (coefficient) for that movie. A (too) simple cost function could be cost = (x – W*c)**2, i.e., the squared reconstruction error.
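A per-movie fit of this cost can be sketched as follows. The W and x here are random stand-ins, and the non-negativity constraint on ‘c’ is our assumption (natural for an NMF latent space, though the cost function above does not state it); scipy’s `nnls` is one way to solve it, and an iterative scheme like the author’s would behave similarly:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Hypothetical stand-in for the NMF feature matrix: 6 keywords x 2 topics
W = rng.random((6, 2))
# l2-normalize W column-wise, as described in the text
W = W / np.linalg.norm(W, axis=0, keepdims=True)

# Hypothetical keyword encoding of one movie
x = rng.random(6)

# Solve min_c ||W @ c - x||^2 subject to c >= 0
c, residual = nnls(W, x)
# c[i] is the importance (coefficient) of topic i for this movie
```

The same call is simply repeated for every movie, which also makes the procedure trivial to parallelize.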
To analyze the results, we selected movies with clear-cut topics and checked the magnitude of the coefficients to see if they matched human judgment. And indeed, if a topic was very relevant for a movie, the magnitude of the corresponding coefficient was large.
The drawback of the procedure is that we need to run an optimization for each (new) movie and cannot use a simple matrix multiplication as in neural networks. However, since the procedure converges on average in fewer than 30 steps, inference is cheap and can be easily parallelized with a minimal memory footprint. Plus, we can use these coefficients as input features for other models, which is beneficial because the new input space encodes a lot more information than the raw features.