In the last post, we advocated starting with simple models. To give a concrete example, let us assume that we want to rank all movies that are aired today given a reference “query” that is a user-selected movie. For the sake of simplicity, we assume that the query movie has a ‘vampire’ theme and thus, we only want to retrieve relevant movies with a similar topic. A good choice is a linear ranking SVM. Why? Because the model can be trained as a standard SVM classifier if we convert the training samples into pairs, so off-the-shelf software can be used. Plus, the parameters consist of a single vector that weighs each feature dimension, which makes it easy to understand what the model learned.
Here is a sketch of the training procedure:
– mark all vampire movies with +1 and all other movies with -1
– sample a pair of movies (a, b), one from +1 and one from -1
– x = (a – b), y = +1 if a is in +1 and b is in -1
– x = (a – b), y = -1 if a is in -1 and b is in +1
– continue to sample pairs until a given budget is reached
– train a linear SVM on the (x, y) data; output: weight vector W
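The steps above can be sketched in a few lines with scikit-learn. The keyword feature vectors here are random toy data (hypothetical), and the pair budget of 200 is an arbitrary stand-in for the sampling threshold:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy keyword feature vectors (hypothetical stand-ins for real movie features).
pos = rng.normal(loc=1.0, size=(20, 10))   # 'vampire' movies (+1)
neg = rng.normal(loc=-1.0, size=(20, 10))  # all other movies (-1)

# Sample pairs (a, b) and turn them into difference vectors with +/-1 labels.
n_pairs = 200  # the sampling budget (an assumption, tune as needed)
X, y = [], []
for _ in range(n_pairs // 2):
    a = pos[rng.integers(len(pos))]
    b = neg[rng.integers(len(neg))]
    X.append(a - b); y.append(+1)  # a in +1, b in -1
    X.append(b - a); y.append(-1)  # same pair with the order reversed
X, y = np.asarray(X), np.asarray(y)

# Train a standard linear SVM on the pair differences.
svm = LinearSVC(fit_intercept=False)  # the bias cancels out for differences
svm.fit(X, y)
W = svm.coef_.ravel()  # the learned weight vector
```

Note that no intercept is needed: for symmetric difference vectors the separating hyperplane passes through the origin.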
At test time, a ranking is induced by projecting each movie vector m onto the weight vector, s = W*m, and then sorting all movies by their distance to the query score: argsort(W*query – W*movies). An easy interpretation of the learned model is a hyperplane that separates vampire (+1) and non-vampire (-1) movies. However, since the model is linear, a perfect separation might not be possible!
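The test-time step is a single projection plus a sort. A minimal sketch, reading “distance” as the absolute difference of projected scores (an assumption; the function name is ours):

```python
import numpy as np

def rank_movies(W, query, movies):
    """Rank rows of `movies` by how close their projected score
    s = W*m is to the projected score of the query movie."""
    scores = movies @ W            # s = W*m for every movie
    q_score = query @ W            # s = W*query
    return np.argsort(np.abs(q_score - scores))  # closest first
```

For example, with W = [1, 0], a query [2, 0], and movies [[1, 0], [2, 0], [5, 0]], the projected scores are 1, 2, and 5, so the ranking is movie 1, then 0, then 2.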
For the vampire theme, the ranking is pretty good, which is not really surprising because the theme contains very discriminative keywords that are weighted higher than all other keywords. However, for tougher themes the results are often disappointing because a linear separation of the keywords is not possible.
Bottom line. Despite the fact that the ‘learning to rank’ approach is incredibly popular, the accuracy of simple, linear models largely depends on the quality of the input features. In our case, the expressive power of the raw features often did not suffice, because rare but precious keywords were often missing, and furthermore, we did not use any prior knowledge regarding the context of a keyword. The latter could be addressed with a co-occurrence matrix, but for movies with very few keywords, only 2-3 words, deriving a context is still difficult if the words do not belong to the same topic. In that case, a clear judgment of which topic a keyword belongs to is not reliable.