After successfully training a larger model on the most frequent keywords, we decided to focus on a simple clustering scheme. To recap: we trained the model on the top-400 feature words with the number of latent topics set to 50, and the input data was combined with the co-occurrence matrix.
We spent some time thinking about a function to convert the input space into the feature space. Due to the very high sparsity of the original space, we decided to start with a simple mapping that calculates the overlap of a topic neuron t with the feature words x: |x AND t| / |t|. This is repeated for each topic neuron, so the result is a vector of 50 overlap measurements for each data sample x.
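The overlap mapping can be sketched as a single matrix operation. This is a minimal illustration assuming binary indicator vectors: X holds the top-400 keyword features per movie and T the learned topic-neuron masks (50 topics); the names and shapes are illustrative, not taken from our actual implementation.

```python
import numpy as np

def overlap_features(X, T):
    """Map each sample x to the 50 overlap scores |x AND t| / |t|."""
    X = np.asarray(X, dtype=int)    # (n_samples, n_words), entries 0/1
    T = np.asarray(T, dtype=int)    # (n_topics, n_words), entries 0/1
    inter = X @ T.T                 # |x AND t| for every sample/topic pair
    sizes = T.sum(axis=1)           # |t| for each topic neuron
    return inter / sizes            # (n_samples, n_topics) overlap matrix
```

Since both inputs are binary, the matrix product directly counts the shared keywords, and dividing by |t| normalizes each score to [0, 1].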
Before we continue, we need to elaborate on some restrictions of the training data. As mentioned before, each movie is described by a small set of keywords that capture its mood and topics. In general, it is very likely that the metadata is incomplete or that the keywords are not sufficient to capture the whole depth of a topic.
If we take ‘Captain America’ as an example, possible keywords are ‘hero’, ‘soldiers’, ‘superpower’ or ‘patriotism’. Since such metadata is usually provided by humans, we expect some variance and different emphasis regarding the topics. For our example, consider the case where the keywords focus on the ‘soldier’ theme; then the ‘superhero’ theme cannot be clearly inferred from the data.
As a preliminary step, we analyzed distances in the feature space, following an information retrieval approach to find similar movies. The issue described above explains why our approach returned so many ‘war’ movies when we performed top-k retrieval for ‘Captain America’. This is not an isolated case; it can happen with any movie, because a movie may be tagged with only a few keywords to describe all its topics, or the keywords may describe only a single topic.
Despite these hurdles, the system works very well for some genres. For instance, the nearest neighbors of the movie ‘Batman’ are: Batman Returns, Batman & Robin, Batman Forever, Blade, Spider-Man and Daredevil.
As we can see, most of the results fall into the superhero genre, as one would expect, and the closest hits are actually Batman films. A perfect result would rank all Batman films first, but as mentioned, the available metadata of the movies is not sufficient to infer such a ranking.
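The top-k retrieval step itself is straightforward once every movie has its 50-dimensional overlap vector. A hypothetical sketch follows; since we did not state the distance measure above, Euclidean distance is an assumption here, and the feature matrix F is toy data.

```python
import numpy as np

def top_k(F, query_idx, k=3):
    """Return indices of the k nearest movies to F[query_idx],
    excluding the query itself."""
    d = np.linalg.norm(F - F[query_idx], axis=1)  # distance to every movie
    order = np.argsort(d)                          # nearest first
    return [i for i in order if i != query_idx][:k]
```

For ‘Batman’, this is the call that would return Batman Returns and the other superhero titles as the closest rows of F.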
Here we see a clear limitation of all these methods: a human would be able to extract more specific latent topics for a Batman movie, which would allow clustering all of them together where appropriate. A machine, however, has to rely on the given metadata, and if the data is limited, so are the inferred latent topics. That is one reason why a collaborative approach is sometimes favored: it does not require the metadata to be provided explicitly; instead, the latent factors are derived from the given ratings.
In other words, we need to combine our approach with further metadata to perform a semantic clustering. We ran some tests with the genre information, but without a proper weighting, the influence of the genre leads to a very strong bias.
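One simple way to control that bias is to down-weight the genre features before concatenating them with the topic vector. This is only a hypothetical sketch: the weight w and the one-hot genre encoding are assumptions, not the setup we actually tested.

```python
import numpy as np

def combine(topic_feats, genre_onehot, w=0.1):
    """Concatenate topic-overlap features with genre features,
    scaling the genre block by w so it cannot dominate."""
    topic_feats = np.asarray(topic_feats, dtype=float)   # (n, 50)
    genre_onehot = np.asarray(genre_onehot, dtype=float) # (n, n_genres)
    return np.hstack([topic_feats, w * genre_onehot])
```

With w close to 1 the clustering collapses onto genres, which is exactly the bias described above; a small w keeps the genre as a tie-breaker only.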
Given these results, we do not expect that a ‘perfect’ clustering is possible, since some movies’ themes are simply not visible if we only consider the metadata.