We thought a lot about possible metrics for the similarity of movies. The problem is that movies combine a temporal component, images, and “text”. Even for a single domain like images or text, a good metric can be very difficult to define, but we can at least use a basic metric that focuses on the main aspects. For instance, if an image shows a cat, other images showing a cat should be similar to it in some sense. The same is true for texts: if two texts are about Deep Learning, they have something in common even if the details differ. In the case of movies, it is pointless to define a similarity per frame or scene, and thus we need a similarity at a higher level.
In a previous post, we argued that genres are far too coarse to serve as good labels for training a classifier, and the reason is obvious. Even if two movies are both marked as ‘drama’, the range of possible topics is virtually endless, so it is entirely possible that the two movies share nothing except being dramatic. We further argued that when metadata is missing, we could use data from social websites, such as tags or ratings, to enhance our own training data.
However, instead of using the social data as features, we could also use it as labels to define an improved metric that captures the “aesthetics” of movies. What do we mean by that? Let us assume that we have access to all movie ratings collected by some movie rental website. These ratings can be converted into latent features for movies by performing a matrix factorization (for example, NMF or SVD++). In other words, each movie is now described by an n-dimensional feature vector that captures the correlation of ratings for this particular movie, and it is well known that “similar” movies have similar representations in this feature space.
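To make the factorization step concrete, here is a minimal sketch. It uses a plain truncated SVD from NumPy as a stand-in for NMF or SVD++, and the tiny rating matrix, the function names `movie_latent_features` and `cosine`, and the choice of two factors are all assumptions made for illustration. Real rating data is sparse and would need a factorization method that handles missing entries.

```python
import numpy as np

def movie_latent_features(ratings, n_factors):
    """Return one n_factors-dimensional vector per movie (per column of `ratings`)."""
    # Center each movie's ratings so the factors reflect how ratings co-vary,
    # not how popular a movie is on average.
    centered = ratings - ratings.mean(axis=0, keepdims=True)
    # Thin SVD: centered = U @ diag(s) @ vt
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Keep the leading factors and scale them by their singular values.
    return vt[:n_factors].T * s[:n_factors]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy data: rows are users, columns are movies.
# Movies 0 and 1 are liked by the same users, as are movies 2 and 3.
ratings = np.array([
    [5, 4, 1, 1],
    [4, 5, 2, 1],
    [1, 2, 5, 4],
    [1, 1, 4, 5],
    [5, 5, 1, 2],
], dtype=float)

features = movie_latent_features(ratings, n_factors=2)
print(cosine(features[0], features[1]))  # movies rated alike
print(cosine(features[0], features[2]))  # movies rated differently
```

In this toy example, movies that the same users rate highly end up pointing in a similar direction in the latent space, which is the property the rest of the approach relies on.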
Equipped with this information, we can now cluster all movies and assign each movie a fixed label, the cluster ID. This pseudo-label assignment can then be used to train a classifier, whose last fully connected layer we use as our feature representation, or we can directly train a Siamese network by creating samples of “similar” and “non-similar” pairs from the cluster IDs. We will close with an example of why these new labels can be much more beneficial than ordinary labels.
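The clustering and pair-creation steps could look roughly like the following sketch. The small k-means routine with farthest-first initialization is only one possible clustering choice, not necessarily the one we would use in practice, and the names `kmeans_labels` and `make_pairs` as well as the toy latent vectors are hypothetical.

```python
import itertools
import numpy as np

def kmeans_labels(features, n_clusters, n_iters=20):
    """Assign each row of `features` a cluster ID with a basic k-means."""
    # Farthest-first initialization: start at the first point, then repeatedly
    # pick the point that is farthest from all centers chosen so far.
    centers = [features[0]]
    while len(centers) < n_clusters:
        dist_to_nearest = np.min(
            [np.linalg.norm(features - c, axis=1) for c in centers], axis=0
        )
        centers.append(features[dist_to_nearest.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iters):
        # Distance of every point to every center, then nearest-center assignment.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels

def make_pairs(labels):
    """Return (i, j, target) triples; target is 1 for the same cluster, else 0."""
    return [
        (i, j, int(labels[i] == labels[j]))
        for i, j in itertools.combinations(range(len(labels)), 2)
    ]

# Hypothetical latent movie features forming two well-separated groups.
features = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

labels = kmeans_labels(features, n_clusters=2)  # pseudo labels (cluster IDs)
pairs = make_pairs(labels)                      # training pairs for a Siamese network
print(labels)
print(pairs[:3])
```

The `labels` array can serve directly as classification targets, while `pairs` provides the “similar” and “non-similar” samples for a Siamese setup. For a large catalog, one would sample pairs rather than enumerate all of them.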
The Resident Evil movies all share a “horror” theme, but nevertheless some of them might be marked only as “action”. We assume that both concepts are very important for *all* of these movies, but if a movie only has “action” as its genre, the neurons in a network cannot build a connection from the input features to the horror theme. With pseudo labels, it is much more likely that most Resident Evil movies will be assigned to the same cluster, or at least that they will be spread over only a small number of clusters. Why? Because the ratings will form latent features that reflect both the genres and the concepts of a movie. Stated differently, the ratings carry much more information than the genre.
One use case would be to use the learned embedding to efficiently determine the nearest neighbors of a movie in the feature space.
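As a rough illustration of that lookup, the sketch below ranks movies by cosine similarity to a query movie. It is a brute-force search over made-up embedding vectors, and the function name `nearest_neighbors` is an assumption; for a large catalog, an approximate nearest-neighbor index would likely be needed to make the search truly efficient.

```python
import numpy as np

def nearest_neighbors(embeddings, query_index, k):
    """Return the indices of the k movies most similar to the query movie."""
    # Normalize rows so that a dot product equals the cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[query_index]
    # Exclude the query movie itself from the ranking.
    sims[query_index] = -np.inf
    # Sort by descending similarity and keep the top k.
    return np.argsort(-sims)[:k]

# Hypothetical learned embeddings, one row per movie.
embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])

neighbors = nearest_neighbors(embeddings, query_index=0, k=2)
print(neighbors)
```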