Measuring Affinity With Missing Data

It all comes down to the pairwise similarity of items. The problem is that similarity is often highly subjective, especially in the domain of movies. For instance, is the movie Doom more similar to Resident Evil or to Outpost? Clearly, all three movies share some themes, but turning this into a ranking is not trivial. Furthermore, we already know that the metadata of movies is incomplete and, in the worst case, even incorrect. Therefore, a judgement based purely on the metadata might not be possible at all.

If we take the Resident Evil films as an example, the sub-genre ‘creature film’ might not be present in the metadata for all of them. If one agrees that all these movies share a strong creature theme, a supervised clustering would then not lead to the expected results, because the tag is missing for some of the films. It is also possible that the theme is not missing at all, but that a human decided it is not distinctly present in the movie and thus should not be in the metadata.

Regardless of the metadata at hand, most people would probably agree that all Resident Evil movies share enough themes to be put in the same cluster or, stated differently, that those movies should somehow be treated as similar.

The idea is to build a similarity matrix for all movies that captures simple relations between pairs of movies, but also higher-order connections, such as information from the titles. Since titles do not follow any structure, we need to rely on heuristics, but a shared prefix is often used to indicate that movies belong to the same “universe”. Examples are “Resident Evil” and “Resident Evil: Apocalypse”. To improve the quality of such a matching, we could also measure the overlap of actors or directors between two movies, in combination with a bi-gram match of the titles, to name just a few possibilities.
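To make the heuristics above a bit more concrete, here is a minimal sketch of how such a matrix could be assembled. Everything in it is an assumption for illustration: the function names, the weights, the dictionary layout of a movie, and the cast identifiers ("p1", "p2", ...) are placeholders, not real metadata.

```python
from itertools import combinations

def bigrams(title):
    """Set of character bi-grams of the lower-cased title."""
    t = title.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}

def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def shared_prefix(t1, t2):
    """1.0 if the shorter title is a word-level prefix of the longer one."""
    w1 = t1.lower().replace(":", " ").split()
    w2 = t2.lower().replace(":", " ").split()
    short, long_ = sorted((w1, w2), key=len)
    return 1.0 if short and long_[:len(short)] == short else 0.0

def affinity(m1, m2, weights=(0.4, 0.3, 0.3)):
    """Weighted mix of the three heuristics; the weights are arbitrary."""
    w_prefix, w_bigram, w_cast = weights
    return (w_prefix * shared_prefix(m1["title"], m2["title"])
            + w_bigram * jaccard(bigrams(m1["title"]), bigrams(m2["title"]))
            + w_cast * jaccard(m1["cast"], m2["cast"]))

def affinity_matrix(movies):
    """Symmetric matrix with 1.0 on the diagonal."""
    n = len(movies)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        A[i][j] = A[j][i] = affinity(movies[i], movies[j])
    return A

# Toy input: titles from the text, cast ids are invented placeholders.
movies = [
    {"title": "Resident Evil", "cast": {"p1", "p2"}},
    {"title": "Resident Evil: Apocalypse", "cast": {"p1", "p3"}},
    {"title": "Doom", "cast": {"p4"}},
    {"title": "Outpost", "cast": {"p5"}},
]
A = affinity_matrix(movies)
```

With this toy input, the two Resident Evil entries get a clearly higher score with each other than with Doom or Outpost, simply because of the shared prefix and the partial cast overlap. In practice the weights would have to be tuned, and a director overlap could be added as a fourth term in the same way.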

Once we have built the affinity matrix, we can use it to refine possible models. For instance, we could use Spectral Clustering to convert the matrix into a set of labels which better reflect the semantic relations between movies. Then, we could train a classifier to assign unknown movies to the learned clusters, or directly use the feature space from the clustering step to find neighbors for movies.
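As a rough sketch of the clustering step, the snippet below feeds a precomputed affinity matrix into scikit-learn's SpectralClustering. The matrix is a hand-made toy with two obvious blocks, and the number of clusters is assumed to be known; with real movie data, both the matrix and the choice of n_clusters would need more care.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy affinity matrix for six movies: two groups of three with high
# within-group affinity and a small, non-zero affinity across groups.
A = np.full((6, 6), 0.05)
A[:3, :3] = 0.9
A[3:, 3:] = 0.9
np.fill_diagonal(A, 1.0)

# affinity="precomputed" tells scikit-learn to use A as-is instead of
# computing a kernel from feature vectors.
model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(A)
```

The resulting labels could then serve as training targets for a classifier that assigns unseen movies to a cluster, as described above.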

