# Topic Regression

The factorization of the covariance matrix for sparse movie metadata works pretty well for discovering so-called latent topics in the data. For instance, there might be a “basis” for movies with a strong supernatural touch, like zombies or vampires, or one for all kinds of music themes, romance, military and so forth. In other words, the model clusters semantically related keywords into a “basis”. If a factor model is trained on the whole data, a movie is described as a linear combination of those topics, for instance 0.2 romance, 0.9 music and so forth.

However, the embedding of a factor model is fixed and does not allow us to infer encodings of unseen movies. Furthermore, simple dot products of the data and those bases do not work, since we also need to consider the overlap of a sample with a basis. For instance, if the overlap of a movie and a basis is just a single keyword, it is hardly appropriate to infer that the movie has a romantic component. This issue is especially important for very sparse input data, because otherwise the “romantic” neuron gets activated for any movie that has at least one keyword in common with the basis.
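To make the problem concrete, here is a minimal sketch with made-up numbers. The vocabulary, the basis values and the two movies are assumptions for illustration only and are not taken from the actual data; `naive_activation` and `keyword_overlap` are hypothetical helper names.

```python
import numpy as np

# Hypothetical vocabulary and "romance" basis, for illustration only.
vocab = ["love", "kiss", "wedding", "zombie", "vampire"]
romance_basis = np.array([0.8, 0.7, 0.6, 0.0, 0.0])

# Binarized keyword vectors of two made-up movies.
horror_movie = np.array([0, 1, 0, 1, 1])    # shares only "kiss" with the basis
romantic_movie = np.array([1, 1, 1, 0, 0])  # shares three keywords with the basis


def naive_activation(x, basis):
    """Plain dot product of a movie with a basis."""
    return float(x @ basis)


def keyword_overlap(x, basis):
    """Number of keywords that are present in both the movie and the basis."""
    return int(np.count_nonzero((x > 0) & (basis != 0)))


horror_act = naive_activation(horror_movie, romance_basis)
horror_overlap = keyword_overlap(horror_movie, romance_basis)
romantic_act = naive_activation(romantic_movie, romance_basis)
romantic_overlap = keyword_overlap(romantic_movie, romance_basis)
```

The horror movie still gets a clearly non-zero “romance” activation, although it shares just a single keyword with the basis.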

To tackle this issue, we train a regression model to infer how well a movie matches a latent topic. This is done by using the keyword overlap between a movie and the basis as the value to regress. We also use a threshold T and treat an overlap of at most T as 0 to avoid the issue described above. Finally, we normalize the overlap to the range [0, 1] so that we can use a cross-entropy cost.

The procedure can be summarized as follows. For a specific topic, we extract the anchor words, that is, the keywords with the highest magnitude in the basis. We only consider movies where at least one of those anchor words is present. Each anchor word has a weight of 1.0. The features of movies are also binarized, so we can measure the overlap of a movie x and a topic t as y = dot(x, t), which is simply the number of anchor words present in the movie. We select T=1, which means there has to be an overlap of at least two words; smaller overlaps are set to y = 0. At the end, all y values are normalized by y = y / max(y) and we store them together with the corresponding x values as the training set.
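The steps of this procedure can be sketched in NumPy as follows. This is only an illustration under assumptions: the function name `build_training_set`, the parameter `n_anchors` (how many anchor words to keep) and the toy data are hypothetical, and the real pipeline may select anchors differently.

```python
import numpy as np


def build_training_set(X, topic, n_anchors=3, T=1):
    """Build (x, y) pairs for one topic from a movie/keyword matrix X."""
    # Anchor words: the keywords with the highest magnitude in the topic.
    anchor_idx = np.argsort(-np.abs(topic))[:n_anchors]

    # Each anchor word gets a weight of 1.0, everything else 0.
    t = np.zeros(X.shape[1])
    t[anchor_idx] = 1.0

    # Binarize the movie features and count the anchor words per movie.
    Xb = (X > 0).astype(float)
    y = Xb @ t

    # Only keep movies where at least one anchor word is present.
    keep = y >= 1
    Xb, y = Xb[keep], y[keep]

    # Treat an overlap of at most T words as no overlap at all.
    y = np.where(y <= T, 0.0, y)

    # Normalize to [0, 1]; guard against an all-zero target vector.
    max_y = y.max() if y.size and y.max() > 0 else 1.0
    return Xb, y / max_y, anchor_idx


# Toy data: 4 movies, 5 keywords (made up for illustration).
X = np.array([
    [1, 1, 1, 0, 0],  # three anchor words
    [1, 1, 0, 1, 0],  # two anchor words
    [0, 1, 0, 1, 1],  # one anchor word, thresholded to 0
    [0, 0, 0, 1, 1],  # no anchor word, dropped
])
topic = np.array([0.9, 0.8, 0.7, 0.1, 0.0])

X_train, y_train, anchors = build_training_set(X, topic)
```

In this toy example, the movie without any anchor word is dropped, and the movie with a single anchor word stays in the training set as a negative example with y = 0.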

The final model is a logistic regression model with sparse weight connections, 1 for anchor words and 0 otherwise, with the aim of determining the weight for each anchor word and the bias. With the trained model, we can determine how much a movie matches a specific concept, for instance “romance”: a value of f(x) > 0 means that there is an overlap with the topic, and the higher f(x) is, the larger the overlap.
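A minimal sketch of such a masked logistic regression, trained with plain gradient descent on the cross-entropy cost, could look like this. The function names, the learning rate, the number of steps and the toy data are assumptions for illustration; the actual model may be trained with a different optimizer or framework.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_topic_regression(X, y, mask, lr=0.5, n_steps=2000):
    """Logistic regression where only the masked (anchor) weights are trained."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = X.shape[0]
    for _ in range(n_steps):
        p = sigmoid(X @ (w * mask) + b)
        # Gradient of the cross-entropy w.r.t. the pre-activation,
        # which also holds for soft targets y in [0, 1].
        err = p - y
        # The mask keeps all non-anchor weights at exactly 0.
        w -= lr * mask * (X.T @ err) / n
        b -= lr * err.mean()
    return w * mask, b


def predict(X, w, b):
    return sigmoid(X @ w + b)


# Toy training set: 3 movies, 5 keywords, the first three are anchor words.
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
], dtype=float)
y = np.array([1.0, 2.0 / 3.0, 0.0])  # normalized, thresholded overlaps
mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

w, b = fit_topic_regression(X, y, mask)
scores = predict(X, w, b)
```

Because the connections are fixed by the mask, the model only learns one weight per anchor word plus the bias, and keywords outside the topic cannot contribute to the score.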

The novelty of this approach is that we explicitly encode the magnitude of the overlap in combination with the relevance of each anchor word. The method has the advantage that the encoding leads to more sparsity, because a minimal overlap with a topic, for instance due to frequent keywords, does not lead to an activation of the topic.