With this post, we want to shed some light on the impact of inconsistent metadata. Let us assume that we have a trilogy of movies. The overall genre of these movies is supposed to be ‘horror’ and a main topic of all of them is ‘animal’; Jaws, for instance. As a simple pre-processing step, we weighted each feature according to its importance within the genre. We further assume that the first movie is not tagged as ‘horror’ but as something different, maybe ‘thriller’. In trilogies it is not uncommon for the genre of the first movie to differ, because the director of the sequels might have changed, or because a sequel only shares a subset of topics with the first movie.
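The weighting step can be sketched in a few lines. This is a minimal, hypothetical version: the catalog, the tags, and the relative-frequency weighting are all assumptions, a simple stand-in for whatever ‘importance within the genre’ measure one actually uses.

```python
from collections import Counter

# Hypothetical toy catalog: each movie has one genre and a set of topic tags.
movies = {
    "Jaws":   {"genre": "horror",   "tags": {"animal", "ocean", "shark"}},
    "Jaws 2": {"genre": "horror",   "tags": {"animal", "shark", "sequel"}},
    "Jaws 3": {"genre": "thriller", "tags": {"animal", "ocean"}},
}

def tag_weights(catalog, genre):
    """Weight each tag by its relative frequency within one genre,
    a simple proxy for 'importance within the genre'."""
    counts = Counter()
    n_movies = 0
    for movie in catalog.values():
        if movie["genre"] == genre:
            n_movies += 1
            counts.update(movie["tags"])
    return {tag: count / n_movies for tag, count in counts.items()}

weights = tag_weights(movies, "horror")
# A tag shared by every horror movie gets weight 1.0; rarer tags get less.
```

Any monotone importance measure (TF-IDF, mutual information, …) would serve the same purpose here; the point is only that features are scaled per genre before being fed into the model.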
But regardless of the chosen genres, the majority of people would probably agree that the genre ‘horror’ makes sense for all of the Jaws movies, maybe in combination with other genres, but the additional ‘horror’ tag would not hurt. Since we are interested in analyzing the impact of such inconsistencies, we also utilize the genre information as features and combine the metadata and the genre data in a single model. The learned latent space is then reduced to 2D to allow plotting the movies. We then checked the position of each movie in this space and analyzed its local neighborhood. Not surprisingly, the position of the first movie was not even close to the positions of the other two. While the neighbors of those two movies clearly fall into the horror/animal-horror genre, the neighbors of the first one were more diverse.
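The effect described above can be reproduced with a deliberately small sketch. Everything here is an assumption: one-hot tag-plus-genre features, a PCA-style 2D projection via SVD as a stand-in for the actual latent-space model, and made-up movie names. Even in this toy setup, the movie with the inconsistent genre label lands far from its two siblings.

```python
import numpy as np

# Toy one-hot features: topic tags plus the (possibly inconsistent) genre label.
vocab = ["animal", "ocean", "shark", "horror", "thriller"]
movies = {
    "Jaws":   ["animal", "ocean", "shark", "thriller"],  # inconsistent genre tag
    "Jaws 2": ["animal", "ocean", "shark", "horror"],
    "Jaws 3": ["animal", "ocean", "shark", "horror"],
    "Other":  ["thriller"],
}

names = list(movies)
X = np.array([[1.0 if term in feats else 0.0 for term in vocab]
              for feats in movies.values()])

# Reduce to 2D via PCA (SVD on the centered matrix), a simple stand-in
# for whatever latent space the real model learns.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T  # each movie becomes a 2D point

def dist(a, b):
    """Euclidean distance between two movies in the 2D projection."""
    return float(np.linalg.norm(coords[names.index(a)] - coords[names.index(b)]))
```

Checking the neighborhoods then amounts to comparing distances: the two consistently tagged sequels coincide in the projection, while the mislabeled first movie sits clearly apart from them, exactly the pattern described above.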
The lesson we learned is not new, but it drastically confirmed that such models require consistent data, or they will not make much sense at all. This confirms once more that we definitely need a concept space that allows us to cluster similar movies without any supervision or labels, or the model should at least ignore the teacher when it is obviously wrong.