To manage expectations up front: in this post we won’t use Symmetric NMF to build a similarity matrix over the data that is then used to cluster items. Instead, we train a non-negative matrix factorization model stepwise to project our data into a latent feature space. We use a stepwise approach because the original data is very sparse.
First, we have the raw data X, a matrix of “n” rows (items) and “m” columns (features). In our case, the features are binary, either 0/1 or a scaled IDF value to down-weight popular keywords. Then we build the covariance matrix C=X.T*X and apply some post-processing to scale the values of C into [0, 1]. The first model we learn is a symmetric NMF: min |C – S*S.T|. The output matrix S has “m” rows (features) and “k” columns, where “k” is the number of latent topics.
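As a minimal sketch of this first step, the symmetric factorization can be fitted with the multiplicative update of Ding et al. for min |C – S*S.T| with S ≥ 0. The function name, the iteration count, and the damping factor `beta` are illustrative choices, not the exact setup used here:

```python
import numpy as np

def symmetric_nmf(C, k, n_iter=200, beta=0.5, eps=1e-9):
    """Sketch: factorize a non-negative symmetric matrix C (m x m)
    as C ~ S @ S.T with S >= 0, S of shape (m, k).
    Uses the damped multiplicative update S <- S * ((1-beta) + beta * (C S) / (S S^T S))."""
    m = C.shape[0]
    rng = np.random.default_rng(0)
    S = rng.random((m, k))          # random non-negative init
    for _ in range(n_iter):
        num = C @ S                 # gradient numerator
        den = S @ (S.T @ S) + eps   # gradient denominator, eps avoids division by zero
        S *= (1.0 - beta) + beta * (num / den)
    return S
```

With beta=0.5 this is the classic update; the multiplicative form keeps S non-negative throughout, so no projection step is needed.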
Since we had experimented a lot with factor models based on the covariance matrix, we expected some good latent topics, and an inspection of each topic, by considering the keywords with the highest magnitudes, confirmed that. Next, we need to project the raw data X into the learned feature space spanned by S. This is done by learning an NMF model: min |X – V*S.T|, where S is fixed and V is the projection matrix for the data, with “n” rows (items) and “k” columns (topics).
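The projection step can be sketched with the standard multiplicative update for the Frobenius loss, solving only for V while S stays fixed. Again, the function name and iteration budget are placeholder assumptions:

```python
import numpy as np

def project_items(X, S, n_iter=200, eps=1e-9):
    """Sketch: given data X (n x m) and a fixed topic matrix S (m x k),
    learn V >= 0 of shape (n, k) such that X ~ V @ S.T.
    Multiplicative update: V <- V * (X S) / (V S^T S)."""
    n, k = X.shape[0], S.shape[1]
    rng = np.random.default_rng(0)
    V = rng.random((n, k))          # random non-negative init
    StS = S.T @ S                   # precompute, S never changes
    for _ in range(n_iter):
        V *= (X @ S) / (V @ StS + eps)
    return V
```

Because S is fixed, each update only touches V, which makes this sub-problem much easier than the joint factorization discussed next.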
The question is: why not learn S and V jointly? The answer is related to pre-training in deep networks. In general, training a model is much easier if the weights have already been moved to a “good” region of the parameter space, especially when the model consists of lots of parameters. In the case of NMF and very sparse data, it is challenging to push S and V simultaneously into such a good region. However, if S is already sensibly initialized, the learning is much easier. Nevertheless, V still contains most of the parameters. As a simple example, assume m=1000 features, k=50 topics and n=20,000 examples. Then V has n*k=1e6 parameters, while S has only m*k=5e4; stated differently, V holds about 95% of all model parameters.
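The parameter split from the example above is quick to verify:

```python
# Parameter count for the example: m features, k topics, n items
m, k, n = 1000, 50, 20_000
params_S = m * k                 # 50,000 parameters in the topic matrix S
params_V = n * k                 # 1,000,000 parameters in the projection V
share_V = params_V / (params_S + params_V)
print(f"{share_V:.1%}")          # prints 95.2%
```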
The idea is not new and has also been used to learn latent topics from short texts. But since keyword frequencies in our data are at most one, proper pre-processing of the raw data is more challenging. Still, the learned projection does a pretty good job, not only on movies with easy themes, like “Jaws”, but also on movies like “1984” or “Fargo” that are more difficult to categorize.