In a previous post, we used priors to explain how we could improve a reconstruction-based model. Of course, priors are not limited to one particular kind of learning; they can be used practically everywhere. The challenge is rather to condense that knowledge and encode it somehow into the cost function.
As a simple example, let us assume that we want to predict whether a movie belongs to a very specific sub-genre, and for the sake of simplicity, we consider a linear SVM model. In that case, it would be very beneficial if words that frequently co-occur also had similar weights in the final model. This prior knowledge can be encoded in a simple co-occurrence matrix P by analyzing all training samples and therefore does not require any labels. Since sparsity in P is beneficial, we set all entries of P below a certain threshold to zero to encourage the model to focus on stronger relations.
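Building such a matrix could look as follows; this is a minimal sketch, where the function name, the binary document-term input, and the relative threshold are illustrative choices, not prescribed by the text:

```python
import numpy as np

def cooccurrence_matrix(X, threshold=0.1):
    """Build a row-normalized word co-occurrence matrix P from a
    document-term matrix X (n_docs x n_words).  Entries below a
    relative threshold are zeroed to keep P sparse.  No labels needed."""
    X = (X > 0).astype(float)            # binarize: word present in doc?
    counts = X.T @ X                     # raw co-occurrence counts
    np.fill_diagonal(counts, 0.0)        # ignore self co-occurrence
    counts[counts < threshold * counts.max()] = 0.0  # sparsify weak links
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0        # avoid division by zero
    return counts / row_sums             # each row sums to 1 (or 0)
```

Row-normalizing P makes each row a distribution over a word's neighbors, which keeps the regularizer below comparable in scale across frequent and rare words.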
For starters, we use a simple model that was published at NIPS 2008. The model parameters θ consist of a weight vector “w” and a bias “b”:
loss(θ) = svm_loss(x, y, θ) + w.T*M*w
M is a matrix and can be written as:
M = alpha * (I - P).T * (I - P) + beta * I
The last term is simple ridge regularization, while the alpha term penalizes how much a weight “j” differs from its neighbors, or stated differently, how much w_j differs from the related weights specified by the co-occurrences encoded in P.
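Putting the pieces together, the full objective can be sketched like this; the function name, the use of a mean hinge loss for svm_loss, and the default values of alpha and beta are assumptions for illustration:

```python
import numpy as np

def regularized_svm_loss(w, b, X, y, P, alpha=1.0, beta=0.1):
    """Hinge loss plus the graph regularizer w.T M w, where
    M = alpha * (I - P).T (I - P) + beta * I."""
    I = np.eye(len(w))
    M = alpha * (I - P).T @ (I - P) + beta * I
    margins = y * (X @ w + b)                 # y in {-1, +1}
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    return hinge + w @ M @ w
```

Note that when P is the zero matrix, the alpha term collapses to alpha * ||w||^2, so the whole regularizer degrades gracefully to plain ridge when no co-occurrence information is available.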
The advantages are best explained with an example. If we want to classify “alien films”, it is very likely that there is a group of movies that combines sci-fi and horror themes. Thus, instead of focusing on just one aspect, words that occur in both contexts should receive higher and, furthermore, similar weights, which should improve the classification accuracy.
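The smoothing effect can be verified directly from the penalty itself: for two words linked in P, w.T M w is smallest when their weights agree. The tiny setup below, with a hand-crafted P linking two always co-occurring words, is purely illustrative:

```python
import numpy as np

# Two words that always co-occur: P links them symmetrically.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
I = np.eye(2)
M = 1.0 * (I - P).T @ (I - P)   # alpha = 1, beta = 0 to isolate the effect

def penalty(w):
    return w @ M @ w

similar    = np.array([1.0, 1.0])  # related words, matching weights
dissimilar = np.array([2.0, 0.0])  # same total weight, but disagreeing
# penalty(similar) is 0, penalty(dissimilar) is strictly positive
```

So the optimizer is pushed toward spreading weight across related words rather than concentrating it on whichever one happens to appear in the labeled data.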