Sometimes it is fun to let your mind drift in no particular direction. That is how, for no good reason, we were reminded of a concept rooted in neural computation: sparse coding.
Our ultimate goal is to learn topic neurons that represent latent higher-level concepts in movies. A movie usually touches on several topics, but not all of them to the same depth. This corresponds to a very simple sparse model in which each topic is represented by a basis vector (a “dictionary”) and the feature representation of a movie is the vector of weights associated with the dictionaries.
We illustrate this with a simple example. For the sake of simplicity, we use only five topic dictionaries. Each represents a single high-level topic: dystopia, space, black-comedy, romantic and action. To represent a movie in the feature space, we treat its keywords as a single sample y. Furthermore, let X be the matrix whose columns are the dictionaries. Now we need to solve an optimization problem that minimizes the distance ‖y – Xw‖, where the vector w holds one weight per dictionary. There are several methods available, but we used Orthogonal Matching Pursuit because of its performance and its ability to produce sparse solutions.
Let us further assume that our example movie is a science fiction film set in a world largely controlled by omnipotent corporations. A possible encoding with respect to the dictionaries could look like this: (0.7, 0.5, 0.1, 0, 0). It can be interpreted as a clear focus on the first three topics (dystopia, space, black-comedy), while the other two topics are not present at all. It should be noted that negative weights are also possible.
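To make the encoding step more concrete, here is a minimal sketch of Orthogonal Matching Pursuit in plain Python. It is not our actual pipeline: the keyword vocabulary, the keyword lists per topic, and the helper names (`make_atom`, `solve`, `omp`) are assumptions made up for illustration, and the sample y is synthesized from the weights above so that the result is easy to check. In practice a library implementation (for example `orthogonal_mp` in scikit-learn) is usually the better choice.

```python
from math import sqrt

# Hypothetical keyword vocabulary and topic keyword lists (illustration only).
VOCAB = ["corporation", "surveillance", "spaceship", "alien", "satire",
         "irony", "wedding", "love", "explosion", "chase"]
TOPICS = {
    "dystopia": ["corporation", "surveillance"],
    "space": ["spaceship", "alien"],
    "black-comedy": ["satire", "irony"],
    "romantic": ["wedding", "love"],
    "action": ["explosion", "chase"],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_atom(keywords):
    """Turn a keyword list into a unit-norm vector over VOCAB."""
    v = [1.0 if k in keywords else 0.0 for k in VOCAB]
    norm = sqrt(dot(v, v))
    return [x / norm for x in v]

def solve(A, b):
    """Solve the small square system A z = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def omp(atoms, y, n_nonzero):
    """Greedy sparse fit of y ~ X w, where `atoms` are the columns of X."""
    residual, support, coef = y[:], [], []
    for _ in range(n_nonzero):
        # Pick the unused atom that correlates most with the current residual.
        scores = [0.0 if j in support else abs(dot(a, residual))
                  for j, a in enumerate(atoms)]
        best = max(range(len(atoms)), key=scores.__getitem__)
        if scores[best] < 1e-12:
            break  # nothing left to explain
        support.append(best)
        # Re-fit all selected atoms jointly by least squares (normal equations).
        gram = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        rhs = [dot(atoms[i], y) for i in support]
        coef = solve(gram, rhs)
        residual = [yk - sum(c * atoms[j][k] for c, j in zip(coef, support))
                    for k, yk in enumerate(y)]
    w = [0.0] * len(atoms)
    for c, j in zip(coef, support):
        w[j] = c
    return w

names = list(TOPICS)
atoms = [make_atom(TOPICS[name]) for name in names]

# Synthetic sample: built from the example weights, not from real movie keywords.
target = [0.7, 0.5, 0.1, 0.0, 0.0]
y = [sum(t * a[k] for t, a in zip(target, atoms)) for k in range(len(VOCAB))]

w = omp(atoms, y, n_nonzero=3)
print([round(x, 3) for x in w])  # [0.7, 0.5, 0.1, 0.0, 0.0]
```

Because the toy topics share no keywords, the dictionaries here are orthogonal and the weights are recovered exactly. With real, overlapping dictionaries the greedy selection only approximates the best sparse solution.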
This feature representation could easily be used for semantic clustering, since the compressed encoding of the movies focuses on high-level concepts and is robust against minor keyword-level differences between movies. In other words, if the same neurons are active with similar excitation levels for “noir” movies, those movies would end up in the same cluster, even if there are minor differences in their plots.
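The clustering idea can be sketched in a few lines. The movie names, their encodings, the 0.9 threshold, and the greedy grouping rule below are all assumptions chosen for illustration; any standard clustering method applied to the weight vectors would serve the same purpose.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

# Hypothetical encodings over (dystopia, space, black-comedy, romantic, action).
codes = {
    "movie_a": [0.7, 0.5, 0.1, 0.0, 0.0],
    "movie_b": [0.6, 0.6, 0.0, 0.0, 0.0],
    "movie_c": [0.0, 0.0, 0.0, 0.8, 0.3],
}

def group(codes, threshold=0.9):
    """Greedy grouping: join the first cluster whose first member is similar enough."""
    clusters = []
    for name, code in codes.items():
        for cluster in clusters:
            if cosine(code, codes[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no similar cluster found, start a new one
    return clusters

clusters = group(codes)
print(clusters)  # [['movie_a', 'movie_b'], ['movie_c']]
```

The two movies with similar weights on the dystopia and space topics end up together even though their encodings differ slightly, which is the robustness described above.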