While we were pondering how to improve our system, we stumbled upon a paper that focused on latent topics in short texts. Their starting point, short messages or titles, has a lot in common with ours. First, the number of words is rather limited, and second, the probability of repeated words is low.
The basic idea is simple: we construct a co-occurrence matrix of features (words) and use it to build a term-weight matrix, which is then combined with the training data. In contrast to our other experiments, we decided to train one big model based on the top-k keywords without considering the movie genre.
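To make the first step more concrete, here is a minimal sketch of how keyword co-occurrences could be counted. It covers only the counting. How the paper turns these counts into a term-weight matrix is not reproduced here, and the function name `cooccurrence_counts` and the toy movie data are our own assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered pair of keywords shares a document."""
    counts = Counter()
    for keywords in documents:
        # A sorted set ensures each pair is counted once per document
        # and always appears in the same (alphabetical) order.
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy data: each movie is a short list of keywords.
movies = [
    ["prison", "lawyer", "trial"],
    ["prison", "trial", "robbery"],
    ["school", "teacher", "student"],
]

counts = cooccurrence_counts(movies)
print(counts[("prison", "trial")])   # pair appears in the first two movies
print(counts[("prison", "school")])  # pair never appears together
```

Because the keyword lists are short and words rarely repeat, most pairs never occur together, so a sparse structure like `Counter` is a natural fit for a sketch like this.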
The result was quite promising, as becomes clear when we display the features with the largest weights for each neuron. Here are some examples:
– prison, prisoner, lawyer, heist, arrest, robbery, trial, handcuffs
– school, high-school, teacher, teenager, student, girl, boy, relationship, best-friend
– battle, soldier, combat, warrior, sword, fight, army, military, duel, desert
Each of these neurons clearly focuses on a single topic, which can be roughly described as ‘prison’, ‘teenager’ and ‘war’, respectively.
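Listing the strongest features per neuron, as done for the examples above, can be sketched as follows. We assume the learned weights are available as one row per neuron and one column per vocabulary word. The name `top_features`, the small vocabulary, and the weight values are made up for illustration and are not taken from our trained model.

```python
def top_features(weights, vocabulary, k=3):
    """Return the k words with the largest weight for each neuron."""
    result = []
    for row in weights:
        # Sort column indices by descending weight and keep the first k.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:k]])
    return result


# Hypothetical vocabulary and weights (rows: neurons, columns: words).
vocabulary = ["prison", "school", "battle", "lawyer", "teacher", "soldier"]
weights = [
    [0.9, 0.0, 0.1, 0.7, 0.0, 0.2],
    [0.0, 0.8, 0.0, 0.1, 0.6, 0.0],
]

for words in top_features(weights, vocabulary, k=2):
    print(", ".join(words))
```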
The next step is to use the learned topics to perform a semantic clustering of the movies. There are several methods to ‘transform’ new (movie) samples into the feature space, but our preliminary tests indicated that these methods do not lead to optimal results.
A straightforward method would be to calculate the overlap of the input words with each topic, then concatenate those outputs into a row vector that serves as input for a clustering algorithm. The idea behind this is that a movie can cover more than one topic, and each topic is modeled by the output of one neuron. For instance, if a movie is about a prisoner who is also a soldier, the final output vector could be [0.71, 0.0, 0.35], indicating that the first and third topics are present, with a clear focus on the first, while the ‘teenager’ topic is not present at all.
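One possible reading of this overlap idea is sketched below. Since we have not settled on a concrete measure, the score used here, the fraction of a movie's keywords that fall into a topic's word list, is purely an assumption, as are the function name `topic_overlap` and the example movie. The numbers it produces are therefore not meant to match the vector above.

```python
def topic_overlap(keywords, topics):
    """Return one score per topic: the share of keywords found in that topic."""
    words = set(keywords)
    if not words:
        return [0.0] * len(topics)
    return [len(words & topic) / len(words) for topic in topics]


# Topic word lists taken from the neuron examples above.
topics = [
    {"prison", "prisoner", "lawyer", "heist", "arrest", "robbery",
     "trial", "handcuffs"},
    {"school", "high-school", "teacher", "teenager", "student", "girl",
     "boy", "relationship", "best-friend"},
    {"battle", "soldier", "combat", "warrior", "sword", "fight", "army",
     "military", "duel", "desert"},
]

# Hypothetical movie about a prisoner who is also a soldier.
movie = ["prisoner", "arrest", "soldier", "escape"]

vector = topic_overlap(movie, topics)
print(vector)  # one entry per topic: prison, teenager, war
```

A weighted variant that uses the learned neuron weights instead of plain set membership would probably be closer to what we have in mind, but the resulting row vector could be fed to a clustering algorithm in the same way.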
However, more research is required to implement a suitable method to perform a semantic topic clustering on the movies with our feature data.