In the previous post about sparse topic modeling, we discussed how to learn a sparse, distributed representation of topics in data. For training we used the co-occurrence matrix, and in the case of the sparse NMF method only the sparsified matrix W is kept; the output H is thrown away. W has dimensions (#words, #topics), which means we can encode not only data in terms of latent factors, but also words in terms of features. Each word is thus represented by #topics dimensions, or stated differently, a word is encoded by its weights across the topics. For example, with #topics=4 the embedding of a word might look like [0.00, 0.91, 0.00, 0.21], indicating that the word is absent from topics 0 and 2, contributes strongly to topic 1, and contributes marginally to topic 3.
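To make the "one row of W per word" idea concrete, here is a minimal sketch. The vocabulary, the values in W, and the helper names `embedding` and `active_topics` are made up for illustration and are not taken from the trained model; topics are counted from zero.

```python
# Toy factor matrix W with shape (#words, #topics) = (3, 4).
# All values are invented for illustration, not learned from data.
vocab = ["rock", "coach", "concert"]
W = [
    [0.00, 0.91, 0.00, 0.21],  # "rock"
    [0.78, 0.00, 0.05, 0.00],  # "coach"
    [0.00, 0.64, 0.00, 0.40],  # "concert"
]

# Map each word to its row index in W.
word_to_row = {word: i for i, word in enumerate(vocab)}


def embedding(word):
    """Return the row of W that serves as the embedding of `word`."""
    return W[word_to_row[word]]


def active_topics(word, threshold=0.0):
    """List (topic index, weight) pairs whose weight exceeds `threshold`."""
    return [(k, w) for k, w in enumerate(embedding(word)) if w > threshold]


print(embedding("rock"))
print(active_topics("rock"))
```

Because the factorization is sparse, most entries of a row are exactly zero, so the list of active topics stays short and is easy to inspect by hand.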
With this encoding, we can compute the cosine similarity between words to see what the learned representation looks like. For instance, the nearest neighbors of “rock-music” are:
– rock, rock-music, rock-band, rock-star, performer and concert
or for the word “coach”:
– team, coach, baseball, underdogs, football, sports
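A nearest-neighbor lookup of this kind can be sketched in a few lines. This is only an illustration under assumptions: the `embeddings` dictionary holds invented toy vectors, and `cosine` and `nearest_neighbors` are hypothetical helper names, not the code used for the results above. As in the lists above, the query word itself is not excluded from the result.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, embeddings, k=5):
    """Return the k (word, score) pairs most similar to `word`, best first."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec)) for other, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (rows of W); the values are invented for illustration.
embeddings = {
    "rock-music": [0.0, 0.9, 0.0, 0.2],
    "rock-band": [0.0, 0.8, 0.0, 0.3],
    "coach": [0.7, 0.0, 0.1, 0.0],
    "team": [0.9, 0.0, 0.0, 0.0],
}

for neighbor, score in nearest_neighbors("rock-music", embeddings, k=3):
    print(f"{neighbor}: {score:.3f}")
```

Since the embeddings are non-negative, every cosine score lies between 0 and 1, and two words that share no active topic get a score of exactly 0.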
We tried several words and, as expected, the nearest neighbors look plausible with regard to the co-occurrence data. However, for rare words the similarity scores drop off very quickly, because there is too little data to model their relation to other words adequately. In contrast to other embedding models, the rare-word problem cannot be addressed by adjusting the sampling strategy, because NMF operates on the whole matrix. The only way to address the problem is some kind of pre-processing of the raw data that influences the resulting co-occurrence matrix.
In a nutshell, the learned factor model can also be used as a word embedding, which might be useful for information retrieval tasks such as suggesting relevant keywords for search queries. Furthermore, if the word embeddings of all words present in a sample are averaged, the approach also allows us to embed whole documents or movies into a single vector representation.
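The averaging step could look roughly like the sketch below. The function name `embed_document`, the toy vectors, and the choice to skip words that have no embedding are assumptions made for illustration; the text above does not specify how unknown words are handled.

```python
def embed_document(words, embeddings):
    """Average the embeddings of all known words in `words`.

    Words without an embedding are skipped (an assumption of this sketch).
    Returns None if no word of the document is known.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy embeddings with invented values, for illustration only.
toy_embeddings = {
    "coach": [1.0, 0.0, 0.0, 0.0],
    "team": [0.5, 0.0, 0.5, 0.0],
}

# Keywords of a hypothetical movie; "unknown-word" has no embedding and is ignored.
movie_keywords = ["coach", "team", "unknown-word"]
print(embed_document(movie_keywords, toy_embeddings))
```

The resulting vector lives in the same topic space as the words, so the same cosine similarity can be used to compare documents with each other or with individual keywords.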