With only keywords to describe a movie's content, we obviously do not have much context, and since most movies come with very few keywords, even human judgment can be hard. In domains like NLP, the solution is to train on a large text corpus to learn general statistics about words. For movies, a good source is much harder to find: websites with plot summaries vary strongly in length and depth, and for foreign movies in particular, they tend to provide very little information or none at all. In other words, we have to distill as much information as possible from the data at hand.
A popular approach is to consider the co-occurrences of words to learn something about the “semantic neighborhood”. For instance, if we consider the movie Doom and the word ‘demon’, we get the following list of related words: evil, possession, satan, monster, hell, … Those words clearly fit into the horror context and are also semantically “close” to the original word. However, a major drawback of the plain approach is that it ignores the IDF of the neighbor words. Very frequent words co-occur with almost everything and therefore carry little information, while less frequent words are usually much more informative. By combining the co-occurrence frequency with the IDF values, we can determine a local neighborhood for each word, and with a function of those values, we can convert each neighbor's weight into a distance to the original word.
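To make the idea concrete, here is a minimal sketch of an IDF-weighted co-occurrence neighborhood. Everything specific in it is an assumption for illustration: the toy keyword sets, the smoothed IDF formula, and the distance function `1 / (1 + count * idf)`, since the text above only says that "a function of those values" is used.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

# Toy data, made up for illustration: each movie is a small set of keywords.
movies = {
    "Doom": {"demon", "hell", "monster", "marine"},
    "The Exorcist": {"demon", "possession", "satan", "evil"},
    "Constantine": {"demon", "hell", "satan", "evil"},
    "Alien": {"monster", "spaceship", "marine"},
}


def smoothed_idf(movies):
    """IDF per keyword; the +1 smoothing keeps every value strictly positive."""
    n = len(movies)
    doc_freq = Counter(word for keywords in movies.values() for word in keywords)
    return {w: math.log((1 + n) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def neighborhoods(movies):
    """Map each word to {neighbor: distance}, smaller meaning closer."""
    # Count in how many movies each ordered pair of keywords appears together.
    cooc = defaultdict(Counter)
    for keywords in movies.values():
        for a, b in permutations(keywords, 2):
            cooc[a][b] += 1

    idf = smoothed_idf(movies)
    result = {}
    for word, counts in cooc.items():
        # Assumed distance: frequent co-occurrence with a rare neighbor
        # yields a large weight and therefore a small distance.
        result[word] = {
            neighbor: 1.0 / (1.0 + count * idf[neighbor])
            for neighbor, count in counts.items()
        }
    return result


hood = neighborhoods(movies)
# Print the neighbors of 'demon', closest first.
for neighbor, dist in sorted(hood["demon"].items(), key=lambda kv: kv[1]):
    print(f"{neighbor:12s} {dist:.3f}")
```

In this toy data, ‘possession’ and ‘monster’ each co-occur with ‘demon’ exactly once, but ‘possession’ ends up closer because it is the rarer keyword. That is the effect the IDF weighting is meant to capture.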
To evaluate the features, we use a kernel function to score pairs of words from a reference movie and a destination movie. More precisely, for each word in the “anchor” movie, we determine whether any of the words in the other movie lies in the semantic neighborhood of this word. For example, with the movie Doom as the anchor, we start with the word “demon”. If the word is present in the other movie, the pair gets the highest score. If it is not present, we check each word of the other movie for a match in the neighborhood; for instance, if “demon” is not present but “satan” is, we get the score kernel(‘demon’, ‘satan’). In case there are multiple matches, we combine all scores by some mechanism (average, by position, …) to get a single value. The per-word scores are then added up to a final score, which is treated as the similarity between the anchor movie and an arbitrary other movie. Sorted by this score in descending order, the top of the list represents the best matches for the anchor movie.
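The scoring procedure can be sketched roughly as follows. The neighborhood distances in `NEIGHBORS`, the kernel `exp(-distance)`, the score of 1.0 for an exact match, and the default of taking the best match via `max` are all assumptions chosen for illustration; the text leaves the kernel and the combination mechanism open.

```python
import math

# Hypothetical neighborhood distances (smaller = closer), made up for this example.
NEIGHBORS = {
    "demon": {"satan": 0.2, "hell": 0.3, "monster": 0.6},
    "hell": {"demon": 0.3, "satan": 0.4},
    "marine": {"soldier": 0.3},
}


def kernel(a, b, neighbors=NEIGHBORS):
    """Score a word pair: 1.0 for identical words, exp(-distance) for
    neighbors, 0.0 if b is not in the neighborhood of a."""
    if a == b:
        return 1.0
    distance = neighbors.get(a, {}).get(b)
    return 0.0 if distance is None else math.exp(-distance)


def similarity(anchor, other, neighbors=NEIGHBORS, combine=max):
    """Sum the per-word scores of the anchor movie against another movie."""
    total = 0.0
    for word in anchor:
        if word in other:
            total += 1.0  # exact match gets the highest score
            continue
        # Otherwise collect all neighborhood matches in the other movie.
        scores = [kernel(word, candidate, neighbors) for candidate in other]
        scores = [s for s in scores if s > 0.0]
        if scores:
            total += combine(scores)  # reduce multiple matches to one value
    return total


def rank(anchor, candidates, neighbors=NEIGHBORS):
    """Return (title, score) pairs, best match first."""
    scored = [(title, similarity(anchor, kws, neighbors))
              for title, kws in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


doom = {"demon", "hell", "marine"}
candidates = {
    "The Exorcist": {"satan", "possession"},
    "Constantine": {"demon", "hell", "angel"},
    "Notting Hill": {"romance", "bookshop"},
}
ranked = rank(doom, candidates)
for title, score in ranked:
    print(f"{title:14s} {score:.3f}")
```

Passing `combine=statistics.mean` instead of `max` would average multiple neighborhood matches rather than keep only the best one. Note that the score is not symmetric, because only the anchor's words and their neighborhoods are considered.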