Over the last few weeks we experimented with many different approaches and algorithms. We tried models based on Restricted Boltzmann Machines, matrix factorization and some lesser-known approaches. Along the way we learned a lot, above all patience: perfect algorithms almost never exist, so it takes time to refine and build a suitable model.
We are not finished with RBMs yet; in particular, a deeper model built by stacking multiple machines is still on our to-do list. In the meantime, a variant of non-negative factor models caught our attention: non-negative matrix factorization (NMF) with the Kullback-Leibler (KL) divergence as its cost function.
The KL divergence is well known from probability theory, where it measures how much one distribution differs from another. Since the data we work with is very sparse and contains only binary features, at least its numerical range resembles that of probabilities. And because we scale the data with a term-weight matrix, it could be argued that each sample is something like a probability vector that models the chance that a keyword is present.
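As a rough illustration (not the code used for the experiments), the cost can be written down in a few lines of NumPy. The sketch assumes the generalized KL divergence that is commonly used for NMF, D(V || WH) = sum(V * log(V / WH) - V + WH), which also works for non-negative data that does not sum to one. The function name and the `eps` guard are our own choices for this example.

```python
import numpy as np

def generalized_kl(V, WH, eps=1e-10):
    """Generalized KL divergence D(V || WH) between data V and reconstruction WH.

    Assumption: both arrays are non-negative and have the same shape.
    """
    V = np.asarray(V, dtype=float)
    WH = np.asarray(WH, dtype=float)
    # Entries with V == 0 contribute only WH, because 0 * log(0) is taken as 0.
    mask = V > 0
    log_term = np.sum(V[mask] * np.log(V[mask] / (WH[mask] + eps)))
    return log_term - V.sum() + WH.sum()
```

The cost is zero for a perfect reconstruction and grows as the reconstruction drifts away from the data. Treating the zero entries separately matters for sparse keyword data, where most entries of V are zero.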
Long story short, we implemented the cost function in our NMF code and tried it on a few specific movie genres. The results looked promising, so we decided to train a model on all genres, using the K = 400 most frequent keywords and 50 topics. Again, we studied the learned topics by examining the keywords with the largest weights in each topic. For a training run without any fine-tuning, the learned latent topics seem surprisingly consistent.
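To make the procedure more concrete, here is a minimal sketch of NMF with a KL cost. It assumes the standard multiplicative update rules of Lee and Seung, which may differ from the optimizer in our own code; `nmf_kl`, `top_keywords` and all parameter names are made up for this example. V is a (movies x keywords) matrix, W holds the topic activations per movie and H the keyword weights per topic.

```python
import numpy as np

def nmf_kl(V, n_topics, n_iter=200, eps=1e-10, seed=0):
    """Factorize V ~ W @ H by minimizing the generalized KL divergence.

    Assumption: Lee-Seung multiplicative updates with random initialization.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = V.shape
    W = rng.random((n_samples, n_topics)) + eps
    H = rng.random((n_topics, n_features)) + eps
    for _ in range(n_iter):
        # Update the topic-keyword weights while W is held fixed.
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        # Update the movie-topic activations while H is held fixed.
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def top_keywords(H, vocab, n_top=5):
    """Return the n_top keywords with the largest weight for each topic."""
    return [[vocab[j] for j in np.argsort(row)[::-1][:n_top]] for row in H]
```

With the setup described above, the call would be along the lines of `W, H = nmf_kl(V, n_topics=50)` on a matrix with 400 keyword columns, followed by `top_keywords(H, vocab)` to inspect the topics by hand. The updates keep W and H non-negative as long as they start out that way.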
To analyze the results in a more realistic scenario, we computed the feature representation of every movie and stored it in a database. Then we randomly sampled some movies and searched for their nearest neighbors in the feature space. As we discussed in a previous post, the metadata can sometimes be problematic or inconsistent, so we never expect perfect matches. For instance, the best matches for the movie ‘Pulp Fiction’ were: Reservoir Dogs, Analyze This, Get Carter, Ghost Dog, Donnie Brasco and Last Man Standing.
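The neighbor search itself can be sketched as a brute-force lookup. The example below assumes cosine similarity on the topic features, which is an assumption for illustration and not necessarily the metric behind the matches listed above; the function name and the in-memory feature array (instead of a database) are likewise hypothetical.

```python
import numpy as np

def nearest_neighbors(features, query_idx, k=6):
    """Return the indices of the k rows most similar to row query_idx.

    Assumption: similarity is the cosine of the angle between feature vectors.
    """
    X = np.asarray(features, dtype=float)
    # Normalize each row to unit length; the floor avoids division by zero.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.maximum(norms, 1e-12)
    sims = X_unit @ X_unit[query_idx]
    # Exclude the query movie itself from its own result list.
    sims[query_idx] = -np.inf
    return np.argsort(sims)[::-1][:k].tolist()
```

The returned indices would then be mapped back to movie titles. A brute-force scan like this is fine for a few thousand movies; larger collections would call for an index structure.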
Before we assess the performance, it should be noted that the number of movies in the training set was limited and that we learned only 50 topics. Given these limitations, we can at least say that the model was able to extract high-level concepts from the data.
Compared to the other model, which uses a different cost function, the results were much better, especially for movies with a ‘difficult’ genre or with keywords that can be assigned to a broader range of topics.