Without one big fat model that transforms movie data into a single feature space, we need a step-wise approach for clustering. First, a greedy step is required to drive the movie into the right region of the space, and then we can use a fine-grained model to find similar neighbors in that region.
A naive approach would be to use extra data, like the genre, for the first clustering step, and frankly, this works very well. If we want to find movies similar to Star Wars, we first walk into the science fiction region and then use our combined topic model to find the best matches among all movies in that region. In case of multiple genres, we could pool all movies from those genres and return the best matches across them. However, since we cannot determine a weight for each genre, the results might be far from optimal.
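The two steps of this naive approach can be sketched roughly as follows. This is only an illustration under assumptions: the feature vectors are made-up toy values standing in for the output of the combined topic model, cosine similarity is assumed as the matching score, and `similar_movies` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equally long feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similar_movies(query, movies, top_n=2):
    """movies maps a title to (set of genres, feature vector)."""
    query_genres, query_vec = movies[query]
    # Step 1 (coarse): keep only movies that share at least one genre with the query.
    candidates = [
        title for title, (genres, _) in movies.items()
        if title != query and genres & query_genres
    ]
    # Step 2 (fine): rank the remaining movies by similarity in the feature space.
    candidates.sort(key=lambda t: cosine(query_vec, movies[t][1]), reverse=True)
    return candidates[:top_n]

# Toy data: genres and vectors are invented for illustration only.
movies = {
    "Star Wars":    ({"sci-fi"},           [0.9, 0.1, 0.3]),
    "Star Trek":    ({"sci-fi"},           [0.8, 0.2, 0.4]),
    "Alien":        ({"sci-fi", "horror"}, [0.2, 0.9, 0.1]),
    "Notting Hill": ({"romance"},          [0.9, 0.1, 0.3]),
}

result = similar_movies("Star Wars", movies)
print(result)
```

Note that the romance movie is never considered, even though its toy vector is identical to the query's: the genre filter removes it before the fine-grained ranking runs. Every genre of the query also counts equally here, which is exactly the missing-weights problem.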
That is why we decided to create meta genres from the learned topics. To assign a movie to a meta genre, we determine the maximal excitation of the neurons of a specific topic model, and we repeat this step for all topic models. At the end, we have a list of values between 0 and 1 that describe how well the movie's keywords match each topic. Now we can sort the list and use the Top-K excitations as our meta genres. Stated differently, we use the meta genres as cluster IDs for a movie. We even went a step further and also stored the excitation values as an indicator of how well the movie fits into each meta genre. These values were then used as weights in the neighboring step.
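A minimal sketch of the meta genre assignment could look like this. It assumes, purely for illustration, that each topic model is a single layer of sigmoid neurons over a binary keyword vector; the weights, biases, keyword list and the `meta_genres` helper are all hypothetical and not our actual models.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def max_excitation(x, weights, biases):
    """Largest neuron activation of one topic model for keyword vector x.

    weights[j] is the weight vector of neuron j, biases[j] its bias.
    """
    return max(
        sigmoid(sum(w * xi for w, xi in zip(w_j, x)) + b_j)
        for w_j, b_j in zip(weights, biases)
    )

def meta_genres(x, topic_models, k=2):
    """Return the k (topic, excitation) pairs with the highest excitation."""
    scores = [(name, max_excitation(x, w, b)) for name, (w, b) in topic_models.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Keyword order (toy): high-school, teenager, superpower, spaceship
topic_models = {
    "teen":      ([[2.0, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [-1.0, -1.0]),
    "superhero": ([[0.0, 0.0, 3.0, 0.0], [0.0, 0.0, 1.0, 0.0]], [-1.0, -1.0]),
    "sci-fi":    ([[0.0, 0.0, 0.5, 3.0], [0.0, 0.0, 0.0, 2.0]], [-1.0, -1.0]),
}

keywords = [1, 1, 1, 0]  # a movie tagged with the first three keywords
result = meta_genres(keywords, topic_models, k=2)
for name, weight in result:
    print(f"{name}: {weight:.3f}")
```

The returned names serve as cluster IDs, and the excitation values are kept alongside them so they can later act as weights when neighbors are ranked.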
But of course this is all theory, and we were interested in how well this approach actually predicts the original genre of a movie. As a simple test, we predicted the Top-1 genre for an arbitrary set of movies and compared the predictions with the original genres. As expected, there were quite a few errors, especially for movies with few keywords or with very ambiguous keywords.
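Such a check can be sketched in a few lines. The sketch assumes that every meta genre has already been given the name of an original genre and that a prediction counts as correct if it appears among the movie's original genres; the titles and labels below are placeholders, not our real data or results.

```python
def top1_accuracy(predicted, original):
    """Compare Top-1 predictions with the original genre sets.

    predicted: title -> predicted Top-1 genre
    original:  title -> set of original genres
    Returns the fraction of hits and the list of mispredicted titles.
    """
    misses = [title for title, genre in predicted.items() if genre not in original[title]]
    accuracy = 1.0 - len(misses) / len(predicted)
    return accuracy, misses

# Placeholder data for illustration only.
predicted = {"movie_a": "sci-fi", "movie_b": "teen", "movie_c": "horror", "movie_d": "comedy"}
original = {
    "movie_a": {"sci-fi"},
    "movie_b": {"superhero", "action"},
    "movie_c": {"horror", "thriller"},
    "movie_d": {"drama"},
}

accuracy, misses = top1_accuracy(predicted, original)
print(accuracy, misses)
```

Keeping the list of misses around is handy, because inspecting them is how cases with sparse or ambiguous keywords show up.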
For instance, the first Spider-Man movie was tagged with the genres ‘teen’ and ‘superhero’. This is not too far from the truth if we consider the topics of the movie, because high school and teenagers are definitely important parts of it. On the other hand, it means that the chance of clustering the Spider-Man movies together is lower, since later installments might focus on other topics, which makes other tags more likely; for example, ‘sci-fi’ and ‘superhero’.
All in all, the feature space spanned by our model might confuse users, since they expect the Spider-Man movies to be clustered together because they recognize a common theme in them, which is the hero himself! Our model, in contrast, focuses on the latent topics it found, and the keywords are not sufficient to extract a ‘spider-man’ concept from all of the movies. This is a good example of how content-based models largely depend on the given metadata, whereas collaborative models might be able to infer a better hidden structure for Spider-Man if the rating behavior of users is similar.