Very often, models can be improved by incorporating external knowledge, which usually results in a lower training loss and better generalization. In the case of bag-of-words data, we could, for example, use co-occurrence data to guide the training, which we demonstrate with a simple example. For the input data, we use movie keywords with a simple inverse weighting scheme, and we only consider the most frequent keywords.
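As a rough sketch of this setup, the following builds a weighted bag-of-words matrix and a keyword co-occurrence matrix. The movie list, vocabulary, and the inverse-frequency weighting are all illustrative assumptions, since the exact scheme is not spelled out above:

```python
import numpy as np

# Hypothetical toy data; the real input would be movie keyword lists.
movies = [
    ["vampire", "supernatural"],
    ["cowboy", "ranch"],
    ["cowboy", "ranch", "supernatural"],
]
vocab = sorted({kw for m in movies for kw in m})
index = {kw: i for i, kw in enumerate(vocab)}

# Inverse weighting (one plausible choice): rarer keywords get higher weight.
counts = np.zeros(len(vocab))
for m in movies:
    for kw in m:
        counts[index[kw]] += 1
weights = 1.0 / counts

# Weighted bag-of-words input matrix X (movies x keywords).
X = np.zeros((len(movies), len(vocab)))
for row, m in enumerate(movies):
    for kw in m:
        X[row, index[kw]] = weights[index[kw]]

# Co-occurrence matrix C: C[i, j] counts how often keywords i and j
# appear together in the same movie.
B = (X > 0).astype(float)
C = B.T @ B
np.fill_diagonal(C, 0)
```

The co-occurrence matrix C is symmetric, and its entries directly encode how often two keywords were observed together.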
The model is an auto-encoder with a single hidden layer and tied weights. Without any extra knowledge, the auto-encoder learns a compressed representation of the data that is good at reconstructing it. With a co-occurrence matrix as prior information, we can encourage the model to learn representations that are semantically more useful. The prior can be written as:
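A minimal sketch of such an auto-encoder, assuming sigmoid activations and illustrative layer sizes (neither is specified in the text); "tied weights" means the decoder reuses the transpose of the encoder matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 50, 10  # illustrative sizes

W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_h = np.zeros(n_hidden)  # hidden bias
b_v = np.zeros(n_visible)  # visible bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    return sigmoid(x @ W + b_h)

def decode(h):
    # Tied weights: W.T instead of a separate decoder matrix.
    return sigmoid(h @ W.T + b_v)

x = rng.random(n_visible)
x_hat = decode(encode(x))  # the reconstruction
```

Tying the weights halves the number of parameters and acts as a mild regularizer, which is a common choice for single-hidden-layer auto-encoders.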
P ~ exp(x_hat * C * x_hat.T)
In the case of gradient ascent, the objective can be extended as follows:

L_new = L + log(exp(x_hat * C * x_hat.T)) = L + x_hat * C * x_hat.T

since log(exp(z)) = z, the prior contributes a simple quadratic term.
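The extended objective can be sketched as follows, assuming L is the negative squared reconstruction error (so that gradient ascent maximizes it; the text does not fix the form of L):

```python
import numpy as np

def extended_objective(x, x_hat, C):
    # Base objective: negative squared reconstruction error (assumed form).
    L = -np.sum((x - x_hat) ** 2)
    # Prior term: log(exp(x_hat @ C @ x_hat)) simplifies to the quadratic form.
    prior_term = x_hat @ C @ x_hat
    return L + prior_term
```

Reconstructions with strongly co-occurring keyword pairs get a larger prior term and therefore a higher objective value.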
If we think of x_hat as labels, P is higher for more likely label combinations, while P is low or zero for pairs of labels that rarely or never occur in the existing data. We therefore encourage the model to favor reconstructions with correlated pairs of keywords and, hopefully, to avoid pairs that are extremely unlikely or even useless.
In general, the aim of a prior is to encode expected patterns or structure in the data. In the case of the co-occurrence matrix, we tell the model how likely the joint occurrence of two keywords is. For instance, a good match for “vampire” could be “supernatural”, while “newspaper” is much less likely; similarly, “cowboy” and “ranch” form a plausible pair, but “cowboy” and “car” do not. Of course the latter is possible, but more likely combinations should be weighted higher.
In terms of training, this means that if x_hat, the reconstruction under the current model parameters, contains many “cowboy” & “car” pairs, the loss is much higher than for “cowboy” & “ranch” pairs. Without the prior, the model would only focus on the reconstruction error and would have no way to favor reconstructions that are more likely according to the co-occurrence data (the prior).
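A toy check of this claim, with a made-up three-keyword vocabulary and invented co-occurrence counts: the “cowboy” & “ranch” reconstruction scores higher under the prior term than the “cowboy” & “car” one.

```python
import numpy as np

vocab = ["cowboy", "ranch", "car"]
# Invented co-occurrence counts: cowboy/ranch co-occur often,
# cowboy/car never.
C = np.array([
    [0.0, 5.0, 0.0],
    [5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])

likely = np.array([1.0, 1.0, 0.0])    # reconstruction with "cowboy" + "ranch"
unlikely = np.array([1.0, 0.0, 1.0])  # reconstruction with "cowboy" + "car"

score_likely = likely @ C @ likely        # prior term for the likely pair
score_unlikely = unlikely @ C @ unlikely  # prior term for the unlikely pair
```

Since the prior term enters the objective with a positive sign, the likely pair is favored during training.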
In a nutshell, prior data can be extremely powerful if correctly incorporated into a model. For instance, in a multi-label setting, such as movie genres, we can encode disjoint and coherent groups of labels to avoid useless model outputs, or at least penalize them accordingly. In the data, combinations like “war/western” or “family/crime” are less frequent, while “horror/scifi” or “musical/romance” are more common.
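One hypothetical way to encode such genre groups is a compatibility matrix that penalizes predicted label pairs falling outside coherent groups; the genres and compatibility values here are invented for illustration:

```python
import numpy as np

genres = ["war", "western", "horror", "scifi"]
# compat[i, j] = 1.0 where two genres commonly co-occur, 0.0 otherwise
# (values invented for illustration).
compat = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

def incoherence_penalty(y_hat, compat):
    # Penalize every predicted pair (i, j) in proportion to how
    # incompatible the two labels are; divide by 2 since the
    # pair matrix counts each pair twice.
    pair_strength = np.outer(y_hat, y_hat)
    np.fill_diagonal(pair_strength, 0)
    return np.sum(pair_strength * (1.0 - compat)) / 2

war_western = np.array([1.0, 1.0, 0.0, 0.0])   # rare pair -> penalized
horror_scifi = np.array([0.0, 0.0, 1.0, 1.0])  # common pair -> no penalty
```

Adding such a penalty to the training loss discourages incoherent genre combinations without forbidding them outright.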