We have experimented a lot with Siamese networks and taxonomies, and a crucial step is usually to determine how similar two items are, or at least whether they can be considered similar at all. A common feature for movies is the genre. For instance, Star Wars is without a doubt science fiction, while Annie Hall is partly a romance mixed with other genres. Now the question is whether the distance between Horror and Romance, as genres, is bigger than the distance between Action and Romance. This can hardly be answered objectively, so we take a different route.
We ask how strongly two genres are pairwise correlated in a given data set. In other words, we interpret the co-occurrence matrix of all top genres. The result is heavily biased by the data: if a data set only contains action thrillers, the similarity of those two genres is one. Here is a quick example for the genre ‘Horror’. The output, converted to percent, is how often the two genres were seen together for a movie in our data set: 58% Thriller, 28% Sci-Fi, 23% Mystery, 19% Drama, 17% Action, 12% Comedy. It should be noted that some genres, like Thriller and Drama, are very dominant in general, so we should not read too much into them. However, the correlation of Horror with Sci-Fi and Mystery is a clear pattern. An even more obvious pattern can be found for Romance: 56% Drama, 51% Comedy, 10% Adventure. Here the correlation drops off very fast after the first two genres, and it is very strong for well-known pairs like romantic comedy and romantic drama.
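To make the idea concrete, here is a minimal sketch of such a co-occurrence count in Python. The function name `genre_cooccurrence`, the toy movie list, and the normalization are assumptions for illustration: we divide each pair count by the number of movies that carry the first genre, which is one plausible reading of "converted to percent" and not necessarily the exact computation behind the numbers above.

```python
from collections import Counter
from itertools import permutations


def genre_cooccurrence(movies):
    """Return {genre: {other_genre: share}} for a list of genre lists.

    share = (movies tagged with both genres) / (movies tagged with `genre`).
    This row-wise normalization is an assumption, not the post's exact method.
    """
    genre_counts = Counter()
    pair_counts = Counter()
    for genres in movies:
        unique = set(genres)  # ignore duplicate tags on one movie
        genre_counts.update(unique)
        # Ordered pairs, so (Horror, Thriller) and (Thriller, Horror)
        # are counted separately and can be normalized per row.
        pair_counts.update(permutations(unique, 2))

    shares = {genre: {} for genre in genre_counts}
    for (a, b), n in pair_counts.items():
        shares[a][b] = n / genre_counts[a]
    return shares


# Hypothetical toy data, not the data set used for the numbers in the text.
toy_movies = [
    ["Horror", "Thriller"],
    ["Horror", "Sci-Fi"],
    ["Horror", "Thriller", "Mystery"],
    ["Romance", "Comedy"],
    ["Romance", "Drama"],
    ["Action", "Thriller"],
]

shares = genre_cooccurrence(toy_movies)
for other, share in sorted(shares["Horror"].items(), key=lambda kv: -kv[1]):
    print(f"{share:.0%} {other}")
```

The toy data also shows the bias mentioned above: the only Action movie is a thriller, so the Action to Thriller share is one, even though that says little about the genres in general.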
To test our results, we repeated the experiment with a different data set that uses slightly different genre names, and the outcome is almost the same. With this very simple measure, we can at least answer the question of whether two genres “make sense” together. This is not the same as a distance, but it helps to decide whether two samples should be pulled apart in the concept space or moved closer together.
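One way such a decision could look in code is sketched below. The helper `genres_fit`, the threshold of 0.2, and the rule of treating unlisted pairs as zero are all assumptions for illustration; the shares are the Horror and Romance percentages quoted above.

```python
# Co-occurrence shares quoted in the text, as fractions.
SHARES = {
    "Horror": {"Thriller": 0.58, "Sci-Fi": 0.28, "Mystery": 0.23,
               "Drama": 0.19, "Action": 0.17, "Comedy": 0.12},
    "Romance": {"Drama": 0.56, "Comedy": 0.51, "Adventure": 0.10},
}


def genres_fit(genres_a, genres_b, shares, threshold=0.2):
    """Return True if the two genre sets "make sense" together.

    Two items fit if they share a genre, or if any cross pair of genres
    co-occurs at least `threshold` of the time. The threshold value is a
    hypothetical choice, and pairs missing from `shares` count as 0.
    """
    best = 0.0
    for a in genres_a:
        for b in genres_b:
            if a == b:
                return True
            # Look the pair up in both directions, since the table
            # may only contain one of them.
            best = max(best,
                       shares.get(a, {}).get(b, 0.0),
                       shares.get(b, {}).get(a, 0.0))
    return best >= threshold


# True would mean "candidates to be closer", False "candidates to pull apart".
print(genres_fit(["Horror"], ["Sci-Fi"], SHARES))
print(genres_fit(["Horror"], ["Comedy"], SHARES))
print(genres_fit(["Romance"], ["Drama", "Comedy"], SHARES))
```

The output is only a yes/no hint for building pairs, not a distance, which matches how we use the measure here.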