# A Little Bit Orthogonal

In the post Learning with diversity we already emphasized that it is very beneficial for models not to waste capacity by encoding information twice. For instance, the redundancy of an NMF factor model can be measured as the pairwise correlation of its basis vectors, which should be low for optimal models. In other words, basis vectors should have little in common, so the model does not encode a topic twice. In general, it is reasonable to assume that a model with low correlation between basis vectors is better at disentangling the explanatory factors in the data.

To get an intuition about the correlation of basis vectors in NMF models, we compared the ordinary NMF (I) method with Orthogonal NMF (II). We factorized a covariance matrix of 1,000 keywords and determined the pairwise cosine similarity after the basis vectors were L2-normalized. Then we calculated the mean for an easier comparison. And indeed the results are very different:

(I) correlation: mean ~0.15, var ~0.025

(II) correlation: mean ~0.04, var ~0.021

We repeated the experiment a couple of times and the results were always similar. That means the correlation of the basis vectors is about four times lower for the orthogonal NMF (ONMF).
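The measurement itself is straightforward. A minimal sketch with NumPy, assuming a basis matrix `W` whose columns are the basis vectors (here a random non-negative stand-in; in the experiment it would come from an NMF or ONMF fit of the keyword covariance matrix), with hypothetical dimensions:

```python
import numpy as np

def pairwise_cosine_stats(W):
    """Mean and variance of pairwise cosine similarity of the columns of W."""
    # L2-normalize each basis vector (column)
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    S = Wn.T @ Wn                      # k x k cosine similarity matrix
    k = S.shape[0]
    off = S[~np.eye(k, dtype=bool)]    # drop the diagonal (self-similarity = 1)
    return off.mean(), off.var()

# stand-in basis: 1,000 keywords, 25 hypothetical topics
rng = np.random.default_rng(0)
W = rng.random((1000, 25))
mean_sim, var_sim = pairwise_cosine_stats(W)
```

For a perfectly orthogonal basis the mean would be exactly zero; the reported values (~0.15 vs. ~0.04) show how far each method deviates from that ideal.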

There are several ways to use those topics to project the raw data into the feature space, and they usually involve solving an optimization problem. A much faster way, however, is to treat each basis vector as a simple linear neuron with a threshold. The threshold ‘b’ prevents a neuron from being activated if the overlap between the input ‘x’ and the basis ‘w’ is too low: y = x * w – b. With additional sparsity constraints on the basis vectors, the projection y = y_1||y_2||…||y_n contains very few activated neurons, namely those whose topic is very distinct in the input, which leads to a representation that is both very sparse and easily interpretable for humans.
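This fast projection can be sketched in a few lines. We interpret “prevents activation” as clipping negative responses to zero (a rectifier); the basis `W`, threshold `b`, and input `x` below are illustrative stand-ins, not values from the experiment:

```python
import numpy as np

def project(x, W, b):
    """Project input x onto the basis vectors in W as thresholded linear neurons.

    Each neuron computes y_i = x . w_i - b_i; responses below the threshold
    are clipped to zero, so only bases with sufficient topic overlap activate.
    """
    y = x @ W - b
    return np.maximum(y, 0.0)

rng = np.random.default_rng(1)
W = rng.random((1000, 25))                         # hypothetical basis: 1,000 keywords, 25 topics
b = np.full(25, 0.5)                               # per-neuron threshold (illustrative value)
x = rng.random(1000) * (rng.random(1000) < 0.01)   # sparse keyword input
y = project(x, W, b)                               # sparse, non-negative feature vector
```

Compared to solving an optimization problem per input, this is a single matrix-vector product, which is why it scales so well.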

Bottom line: regardless of whether we train an RBM, a neural network, or a factor model, it is always beneficial for the neurons of a model to learn orthogonal concepts. For the domain of text, this means learning as many distinct topics as possible, but the same arguments also apply to non-text data, like digits, faces, or speech.