Recently we whined a lot about the fact that good metadata for movies is very scarce. Since whining will not help at all, we thought about ways to get out of the misery and, as usual, the devil is in the details. Let us rephrase our goal: find a good feature space for movies that clusters similar concepts and can also be used for other tasks, like ranking, classification and so forth.
Maybe we focused a little too much on the big data and too little on the small data. What do we mean by this? It is best explained by an example. If we consider sci-fi movies, we can model lots of pair-wise relations with simple predicates. For instance, followed-by(A, B) indicates whether movie A is followed by movie B, e.g. followed-by(X-Men, X-Men 2)=1, and same-director(A, B) indicates whether both movies share a director. The outputs can then be converted into a (sparse) matrix that is used as extra knowledge. That is very similar to a correlation matrix, but with an arbitrary function to evaluate pair-wise interactions. The drawback is that the dense matrix is quadratic in the number of movies, so a sparse version is preferred (since there are lots of ‘unknown’ entries).
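To make the predicate idea concrete, here is a minimal sketch in plain Python. The toy catalogue, the field names and the helper `sparse_relation` are assumptions made up for illustration; the sparse matrix is simply a dictionary that stores the non-zero entries.

```python
from itertools import permutations

# Hypothetical toy catalogue; the field names are illustrative only.
movies = [
    {"title": "X-Men", "director": "Bryan Singer", "sequel": "X-Men 2"},
    {"title": "X-Men 2", "director": "Bryan Singer", "sequel": None},
    {"title": "Alien", "director": "Ridley Scott", "sequel": "Aliens"},
    {"title": "Aliens", "director": "James Cameron", "sequel": None},
]


def followed_by(a, b):
    """1 if movie a is directly followed by movie b, else 0."""
    return 1 if a["sequel"] == b["title"] else 0


def same_director(a, b):
    """1 if both movies share a director, else 0."""
    return 1 if a["director"] == b["director"] else 0


def sparse_relation(movies, predicate):
    """Evaluate a pair-wise predicate and keep only non-zero entries.

    Returns a dict {(row, col): value}; missing keys mean 0 / 'unknown'.
    """
    entries = {}
    for i, j in permutations(range(len(movies)), 2):
        value = predicate(movies[i], movies[j])
        if value:
            entries[(i, j)] = value
    return entries


followed = sparse_relation(movies, followed_by)
directed = sparse_relation(movies, same_director)
print(followed)
print(directed)
```

Note that this sketch still evaluates every ordered pair, so only the storage is sparse; for a real catalogue one would probably generate candidate pairs from an index (e.g. grouped by director) instead of looping over all of them.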
And that is only one example, since the use cases are endless. Ratings are a good example because similar rating behavior indicates that two movies have something in common, maybe CGI effects, good actors, or a nice story. The distribution of ratings could also be used as additional features, among many other things. And even if we don’t like the fact, money influences how movies are perceived. Thus, we could also use features like budget or box office gross as additional factors. Awards are similar to the budget since they also tell us something about the popularity.
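A rough sketch of how such side information could be turned into a feature vector is shown below. The 1 to 5 rating scale, the log scaling of the budget and the function names are our assumptions for this example, not a fixed recipe, and the numbers are invented.

```python
import math
from collections import Counter


def rating_histogram(ratings, scale=(1, 2, 3, 4, 5)):
    """Share of votes per rating value (assumes a 1-5 star scale)."""
    total = len(ratings)
    if total == 0:
        return [0.0] * len(scale)
    counts = Counter(ratings)
    return [counts[s] / total for s in scale]


def movie_features(ratings, budget_usd, awards):
    """Concatenate the rating distribution with money and award features.

    The budget is log-scaled (an assumption) so that a few very expensive
    movies do not dominate the feature space.
    """
    return rating_histogram(ratings) + [
        math.log10(budget_usd + 1),
        float(awards),
    ]


# Invented numbers, for illustration only.
features = movie_features([5, 4, 4, 3, 5, 5, 2, 4], budget_usd=75_000_000, awards=2)
print(features)
```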
The first challenge is to find good weights for all these factors because it is obvious that not all of them should be weighted equally. The impact of the weights can be demonstrated if we consider blockbuster and independent movies. For the first type, the budget is big, the special effects are likely top notch, and the cast often includes well-known actors. The latter is often more artistic, with a focus on the story and few or even no special effects.
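The effect of the weights can be illustrated with a tiny example. The two features, their values and the weights are invented, and the weighted Euclidean distance is just one possible choice: the same movie ends up next to a different neighbor depending on which factor dominates.

```python
def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5


def nearest(query, candidates, weights):
    """Name of the candidate closest to the query under the given weights."""
    return min(
        candidates,
        key=lambda name: weighted_distance(query, candidates[name], weights),
    )


# Hypothetical features in [0, 1]: [scaled budget, focus on story].
candidates = {
    "blockbuster": [0.9, 0.3],
    "indie": [0.1, 0.8],
}
query = [0.8, 0.9]  # an expensive but story-driven movie

budget_heavy = nearest(query, candidates, weights=[1.0, 0.1])
story_heavy = nearest(query, candidates, weights=[0.1, 1.0])
print(budget_heavy, story_heavy)
```

With the budget weighted heavily, the query lands next to the blockbuster; with the story weighted heavily, it lands next to the indie movie.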
Thus, it is more likely that movies of each type end up closer together, which is good for capturing the concept ‘indie movie’ but might be harmful in other cases. However, with a sufficient number of neurons and hidden layers, a neural network should be able to capture many different aspects of movies and combinations of them.