Learning a concept space that clusters similar items together is challenging for several reasons. The first is that we need to define what similarity between movies means (which is part of the original problem), and the second is that we need an efficient sampling strategy. But let us start at the beginning. Since we have no labels, except for the very noisy genres, we can only use pair-wise comparisons of items to build a training set. The ultimate goal is to learn a metric that pushes dissimilar items apart and pulls similar ones together. Regardless of the similarity measure used, a movie usually has more dissimilar partners than similar ones. Thus, we need to find a way to balance the positive and negative samples.
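To make the balancing idea concrete, here is a minimal sketch, not our actual pipeline. It assumes a Jaccard overlap of label sets as a stand-in similarity, a hypothetical threshold of 0.5 to split pairs into positive and negative, and a standard contrastive loss as one possible way to pull similar pairs together and push dissimilar ones apart. The items `m1` to `m4` and all function names are made up for illustration.

```python
import random

# Hypothetical toy items: id -> set of genre/sub-genre labels.
ITEMS = {
    "m1": {"horror", "action", "creature"},
    "m2": {"horror", "action", "thriller"},
    "m3": {"romance", "drama"},
    "m4": {"comedy", "romance"},
}

def label_similarity(a, b):
    """Jaccard overlap of two label sets (an assumed stand-in similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def sample_balanced_pairs(items, n_pairs, threshold=0.5, seed=0):
    """Return up to n_pairs positive and equally many negative (a, b, label) triples."""
    rng = random.Random(seed)
    names = sorted(items)
    positives, negatives = [], []
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            sim = label_similarity(items[a], items[b])
            (positives if sim >= threshold else negatives).append((a, b))
    # Negatives usually dominate, so cap both sides at the scarcer class.
    k = min(n_pairs, len(positives), len(negatives))
    pairs = [(a, b, 1) for a, b in rng.sample(positives, k)]
    pairs += [(a, b, 0) for a, b in rng.sample(negatives, k)]
    rng.shuffle(pairs)
    return pairs

def contrastive_loss(distance, label, margin=1.0):
    """Similar pairs (label 1) are penalized for being far apart,
    dissimilar pairs (label 0) only if they are closer than the margin."""
    if label == 1:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

pairs = sample_balanced_pairs(ITEMS, n_pairs=3)
```

In this toy set only one of the six possible pairs counts as similar, so the sampler returns one positive and one negative pair even though three of each were requested. That is the imbalance described above in miniature.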
Here is an example with the movie ‘Doom’. To label similar items, we use the genre and sub-genre information, with the sub-genres enhanced by a simple taxonomy. We assume that the movie is marked as horror/action for the genres and scifi/creature for the sub-genres. An obvious limitation is that its scifi aspect is only encoded in the sub-genre, so the similarity to other movies with scifi in the top genre is limited. It would be possible to use ‘scifi-horror’ as a sub-genre or scifi/horror in the top genre to address the problem, but creating a consistent labeling of movies is definitely not a trivial task, and in practice the labels usually diverge.
We emphasize this because without a consistent label to indicate item similarity, the learned model will be of very low quality. This is especially the case if many movies have only very few similar partners. We demonstrate this by comparing the movie ‘Doom’ with ‘Resident Evil’. On the feature level they are similar, since ‘Resident Evil’ is marked as horror/action, action-thriller/creature. Thus, they only differ in the scifi vs. thriller aspect, but they share the creature and horror topics. Without a doubt both movies have a scifi theme, and such an annotation would definitely make sense. But the more severe problem is that the sub-genre pair action-thriller vs. scifi has a noticeable distance in our taxonomy space, in contrast to scifi vs. scifi-horror or space-adventure.
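The effect of the taxonomy on such pairs can be illustrated with a small sketch. The tree below is hypothetical and much simpler than a real taxonomy. It only assumes that scifi-horror and space-adventure sit below scifi while action-thriller sits below action, and it measures distance as the number of edges on the path between two nodes.

```python
# Hypothetical sub-genre taxonomy: child -> parent.
PARENT = {
    "scifi": "root",
    "action": "root",
    "horror": "root",
    "scifi-horror": "scifi",
    "space-adventure": "scifi",
    "action-thriller": "action",
    "creature": "horror",
}

def path_to_root(node):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def taxonomy_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a))}
    for steps_from_b, n in enumerate(path_to_root(b)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes do not share a common ancestor")

near = taxonomy_distance("scifi", "scifi-horror")
far = taxonomy_distance("action-thriller", "scifi")
```

With this toy tree, scifi is one edge away from scifi-horror and from space-adventure, but three edges away from action-thriller. A pair like ‘Doom’ and ‘Resident Evil’ is therefore pushed apart on the sub-genre axis even though both share the same theme.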
Once more, our problem is the very limited amount of data and our dependence on consistent features to learn useful concepts for rare, non-mainstream items and for items whose features are ambiguous or even wrong. In other words, we believe that the right model is out there, somewhere, waiting…