A Little Bit Orthogonal

In the post Learning with diversity we already emphasized that it is very beneficial for models not to waste capacity by encoding information twice. For instance, the redundancy of an NMF factor model can be measured as the pairwise correlation of its base vectors, which should be low for a good model. In other words, the base vectors should have little in common, so that no topic is modeled twice. In general, it is reasonable to assume that a model with low correlation between its bases is better at disentangling the explanatory factors in the data.

To get an intuition about the correlation of base vectors in NMF models, we compared ordinary NMF (I) with Orthogonal NMF (II). We factored a covariance matrix of 1,000 keywords, L2-normalized the base vectors, and determined their pairwise cosine similarities. We then calculated the mean for an easier comparison. And indeed the results are very different:
(I) correlation: mean ~0.15, var ~0.025
(II) correlation: mean ~0.04, var ~0.021
We repeated the experiment a couple of times and the results were always similar. That means the correlation of the base vectors is about four times lower for the orthogonal NMF (ONMF).
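
For illustration, here is a minimal sketch of how such a correlation measurement can be done with ordinary NMF. The random keyword matrix, the number of topics, and the scikit-learn parameters are stand-ins, since the post does not spell out the exact setup or the orthogonal NMF implementation.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((1000, 1000))  # stand-in for the 1,000-keyword covariance matrix

def mean_base_correlation(W):
    # L2-normalize every base vector, then average the pairwise cosine similarities
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    sim = W @ W.T
    off_diag = sim[~np.eye(len(W), dtype=bool)]  # drop the self-similarities
    return off_diag.mean(), off_diag.var()

model = NMF(n_components=50, init="nndsvd", max_iter=500)  # ordinary NMF (I)
model.fit(X)
mean_sim, var_sim = mean_base_correlation(model.components_)
print("correlation: mean ~%.3f, var ~%.3f" % (mean_sim, var_sim))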

There are several ways to use those topics to project the raw data into the feature space, which usually involve solving an optimization problem, but a much faster way is to treat each base as a simple linear neuron with a threshold. The threshold 'b' prevents a neuron from being activated if the overlap between the input 'x' and the base 'w' is too low: y = x * w - b. With additional sparsity constraints on the base vectors, the projection y = y_1||y_2||…||y_n contains very few activated neurons, namely those whose topic is very distinct in the input. This leads to a representation that is both very sparse and easily interpretable for humans.
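
To make this concrete, here is a minimal sketch of that projection, assuming the base vectors are already learned and L2-normalized; the threshold b=0.1 and the random stand-in data are placeholders, not values from the post.

import numpy as np

def project(x, W, b=0.1):
    # y_i = max(0, x * w_i - b): one thresholded linear neuron per base vector
    y = W @ x - b
    return np.maximum(y, 0.0)

# stand-ins: W would be the learned base matrix, x a new (L2-normalized) document vector
rng = np.random.default_rng(0)
W = rng.random((50, 1000))
W /= np.linalg.norm(W, axis=1, keepdims=True)
x = rng.random(1000)
x /= np.linalg.norm(x)

y = project(x, W, b=0.1)
print("active neurons:", np.flatnonzero(y))

The sparser and more orthogonal the bases, the fewer neurons survive the threshold for a given input, which is exactly what makes the resulting representation easy to interpret.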

Bottom line: regardless of whether we train an RBM, a neural network, or a factor model, it is always beneficial when the neurons of a model learn orthogonal concepts. For the domain of text, this means learning as many distinct topics as possible, but the same argument also applies to non-text data like digits, faces, or speech.
