The Data Knows Best

The recent activity with neural language models gave us some new ideas. Using the genre of a movie, we were already able to build a pretty good semantic space for movies. However, genre information is not a very accurate measure of the similarity of movies; furthermore, it is very subjective and thus not very consistent. We briefly discussed the idea of micro genres to do a better job of discriminating between movies, but building them is not trivial without additional data. So, what if we could estimate a genre distribution for each movie that better reflects the topics described by its keywords?

For this, we borrow a lot from the neural models. For example, if we see a context of (hell, demon), most of us would probably tag a movie with ‘horror’, while (police, robbery) would probably lead to a ‘crime’ tag. And that is the idea: we predict the genre from a 2-word context, using the actual genre of the movie as the label. No doubt a combination like (hell, demon) can also be present in an action movie, but if the data is consistent, it is reasonable to assume that most of the time it is present in movies with a strong horror theme. If a movie has more than one genre, we create one training sample per genre label for each context.
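To make the sample construction concrete, here is a minimal sketch of how the (context, genre) pairs could be generated for one movie. It is not the code we actually used: the function name `make_samples` is made up for illustration, and we assume here that a "2-word context" means every unordered pair of a movie's keywords.

```python
from itertools import combinations


def make_samples(keywords, genres):
    """Return ((word_a, word_b), genre) training pairs for one movie.

    Assumption: every unordered pair of distinct keywords is a context.
    A movie with several genres yields one sample per genre and context.
    """
    pairs = combinations(sorted(set(keywords)), 2)
    return [(pair, genre) for pair in pairs for genre in genres]


# Hypothetical movie with three keywords and two genre labels.
samples = make_samples(["hell", "demon", "priest"], ["horror", "thriller"])
for context, genre in samples:
    print(context, "->", genre)
```

With three keywords and two genres this gives three contexts and six samples, so a multi-genre movie simply contributes each of its contexts once per label.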

To learn something about the capabilities of such a model, we created a training set by considering all feature bi-grams of a movie. The goal is to maximize the likelihood of the genre, given the two feature words. To assess the quality of the outcome, we predicted the genre for each bi-gram and used a normalized histogram to get the probability vector.
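The normalized histogram step can be sketched as follows. This is only an illustration under assumptions: `genre_distribution` is a hypothetical helper, and the list of predictions stands in for whatever the trained model outputs for each bi-gram of a movie.

```python
from collections import Counter


def genre_distribution(predicted_genres):
    """Turn per-bi-gram genre predictions into a probability vector.

    Each prediction counts as one vote; the votes are normalized so
    that the resulting values sum to 1.
    """
    counts = Counter(predicted_genres)
    total = sum(counts.values())
    if total == 0:
        return {}  # no bi-grams, no distribution
    return {genre: n / total for genre, n in counts.items()}


# Made-up predictions for the bi-grams of a single movie.
predictions = ["action", "action", "action", "thriller"]
dist = genre_distribution(predictions)
print(dist)
```

Every bi-gram acts as one vote, which is why a movie with many action-flavored keyword pairs ends up with a large action share even if a few pairs point elsewhere.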

There is definitely room for improvement, but the results are at least reasonable. Here, we have the outcome of two 007 movies with the most important genres:
Sci-Fi: 2.2%, Comedy: 9.9%, Thriller: 7.6%, Action: 74.7%, Spy: 5.5%
Sci-Fi: 0.9%, Drama: 6.7%, Thriller: 5.7%, Action: 74.3%, Crime: 4.7%, Spy: 7.6%
As we can see, the key ingredients of a 007 movie are all there: a lot of action, paired with thriller and crime. And not to forget the futuristic gadgets that are never missing, which probably explain the notable sci-fi portion.

Bottom line: compared with a one-hot genre vector, or a vector that contains a 1 at each position where a genre is present, the new genre vector carries a lot more information. First, we get a rough estimate of how much of each genre is present in a movie; second, we no longer rely on a single vote to set the genre, but use the data itself to average many votes.

