The more we work on concepts, the clearer it becomes that our problem cannot be tackled with algorithms alone. Even for humans, it is very challenging to describe precisely what two movies have in common, or even just to write down the shared concepts that matter.
Thus, it would be much easier to start with high-level annotations and work our way down to the details. In fact, this is what some movie portal sites do. They start with general ‘themes’ to describe movies and, at the next level, provide descriptive keywords. For instance, the theme could be ‘heroic-mission’ and the keywords could be ‘treasure’, ‘jungle’ and ‘cave’.
As a consequence, two movies that share the themes ‘war-in-space’ and ‘heroic-mission’ should be closer in the concept space than two movies that share only two keywords. We can think of themes as regions in the concept space: the themes mark the boundaries of a larger region, and inside each region the keywords are used to build a finer similarity layer. Despite these theoretical advantages, challenges remain, because themes can overlap.
To turn this into a naive similarity measure, we could calculate the intersection of the themes of two movies and then repeat the procedure for the keywords. A hierarchy could be formed by using different weights or thresholds. But the limitation of this approach is obvious: without some notion of covariance between terms, relations among themes or among keywords are ignored entirely. For instance, ‘shark’ and ‘tiger-shark’ are not treated as similar, and because the data is so sparse, a covariance matrix might not help either.
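As a rough illustration, the naive measure could look like the sketch below. Everything in it is an assumption made for the example: the movies are made up, the names `overlap` and `naive_similarity` are hypothetical, the intersection is normalized by the union (Jaccard overlap), and the weights 0.7 and 0.3 are arbitrary values chosen only to let themes dominate keywords.

```python
def overlap(a, b):
    """Jaccard overlap: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_similarity(m1, m2, theme_weight=0.7, keyword_weight=0.3):
    """Weighted sum of theme overlap and keyword overlap.

    The weights are illustrative assumptions; a larger theme weight
    expresses the hierarchy 'themes first, keywords second'.
    """
    theme_sim = overlap(m1["themes"], m2["themes"])
    keyword_sim = overlap(m1["keywords"], m2["keywords"])
    return theme_weight * theme_sim + keyword_weight * keyword_sim


# Made-up movies, for illustration only.
movie_a = {"themes": {"war-in-space", "heroic-mission"},
           "keywords": {"spaceship", "laser"}}
movie_b = {"themes": {"war-in-space", "heroic-mission"},
           "keywords": {"alien", "planet"}}
movie_c = {"themes": {"romance"},
           "keywords": {"spaceship", "laser"}}

print(naive_similarity(movie_a, movie_b))  # same themes, no shared keywords
print(naive_similarity(movie_a, movie_c))  # same keywords, no shared themes
```

With these weights, the pair that shares both themes scores higher than the pair that shares only two keywords. Note that ‘shark’ and ‘tiger-shark’ would still count as completely different terms here.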
However, we can improve the situation with some simple heuristics. We mark two terms as related if a combined term (two or more terms separated by ‘-’) contains a term from the vocabulary that has no separator, as in ‘tiger-shark’ -> ‘shark’. This approach models simple relations between words, and it ensures that the right-hand side is always part of our vocabulary.
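A minimal sketch of this heuristic, assuming the vocabulary is simply a collection of strings; the function name `related_terms` and the sample vocabulary are hypothetical:

```python
def related_terms(vocabulary, sep="-"):
    """Map each combined term to the separator-free vocabulary terms it contains."""
    vocab = set(vocabulary)
    # Terms without a separator are the only allowed right-hand sides.
    simple = {term for term in vocab if sep not in term}
    relations = {}
    for term in vocab:
        if sep not in term:
            continue
        parts = sorted({part for part in term.split(sep) if part in simple})
        if parts:
            relations[term] = parts
    return relations


# Made-up vocabulary, for illustration only.
vocabulary = ["shark", "tiger-shark", "jungle", "war-in-space"]
print(related_terms(vocabulary))
```

Here ‘tiger-shark’ is linked to ‘shark’, while ‘war-in-space’ yields no relation because none of its parts is a vocabulary term on its own.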
In a nutshell, movie themes are very useful for modeling the conceptual similarity between movies at a higher level. Furthermore, they can be used directly for clustering, or as a building block to incorporate this knowledge into a deeper model.