More than half a year ago, we spent quite some time with matrix factorization approaches. Non-negative matrix factorization (NMF) is easy to implement and very useful for interpreting results, because non-negative factors are usually easier to read as additive parts of the data.
However, a major drawback of the NMF approach is that it is transductive, which means we cannot easily project unseen examples into the learned feature space. For that reason, we tried to find a replacement without this limitation and stumbled upon a special variant of the RBM. The non-negative RBM penalizes negative weights and thus encourages them to stay non-negative.
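To make the idea concrete, here is a minimal sketch of one contrastive-divergence weight update with a quadratic barrier on negative weights. The penalty form, the learning rate, and the barrier strength `alpha` are illustrative assumptions, not the exact formulation from the paper:

```python
import numpy as np

def nnrbm_update(W, v_data, h_data, v_model, h_model, lr=0.01, alpha=0.1):
    """One CD weight update with a quadratic barrier on negative weights.

    The barrier term alpha * sum(min(0, W)**2) leaves non-negative weights
    untouched and pushes negative ones back toward zero, so most weights
    end up non-negative after training. (lr and alpha are illustrative.)
    """
    # positive phase minus negative phase, averaged over the batch
    grad = (v_data.T @ h_data - v_model.T @ h_model) / v_data.shape[0]
    # gradient of the barrier: nonzero only where W < 0
    barrier = 2.0 * alpha * np.minimum(W, 0.0)
    return W + lr * (grad - barrier)
```

The appeal over NMF is that the trained weight matrix defines an explicit encoder, so a new example can be projected into the hidden space with a single forward pass.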
The paper focuses on decomposing objects into parts, but the approach is also very useful for finding latent topics in documents and text data in general. And since we improved our data pre-processing considerably last month, we decided to give the model another try. As usual, we used a binary top-k keyword encoding and, additionally, condensed sub-genres into a fixed-size taxonomy vector for better discrimination.
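A sketch of that encoding step might look as follows. The keyword lists, the value of k, and the taxonomy entries are made-up placeholders; only the scheme itself, binary top-k keyword indicators concatenated with a fixed-size sub-genre vector, reflects the text above:

```python
import numpy as np
from collections import Counter

def encode(keyword_lists, genre_lists, k=2,
           taxonomy=("action", "horror", "comedy")):
    """Binary top-k keyword encoding plus a fixed-size taxonomy vector.

    The k most frequent keywords across the corpus form the binary
    vocabulary; sub-genres are mapped onto the (illustrative) taxonomy.
    """
    counts = Counter(kw for kws in keyword_lists for kw in kws)
    vocab = [kw for kw, _ in counts.most_common(k)]
    X = np.zeros((len(keyword_lists), k + len(taxonomy)))
    for i, (kws, genres) in enumerate(zip(keyword_lists, genre_lists)):
        for kw in kws:
            if kw in vocab:
                X[i, vocab.index(kw)] = 1.0       # keyword indicator
        for g in genres:
            if g in taxonomy:
                X[i, k + taxonomy.index(g)] = 1.0  # taxonomy slot
    return X, vocab
```

One row per movie, so the matrix can be fed directly to the RBM as binary visible units.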
To check the results, we analyzed whether some neurons are sensitive to specific topics like ‘superhero’ or ‘creature film’. And indeed: one neuron, for instance, has its highest activations for movies with a strong superhero theme, like X-Men, Batman, or Spider-Man, while another one focuses on natural horror in combination with creatures, like Jaws and dozens of other ‘terror-in-the-water’ movies.
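This kind of inspection can be sketched as ranking movies by each hidden neuron's activation; a neuron looks topic-sensitive when its top-activating titles share a theme. The helper name and the sigmoid hidden units are assumptions for illustration:

```python
import numpy as np

def top_movies_per_neuron(X, W, b_h, titles, n=3):
    """For each hidden neuron, return the n titles with the highest
    activation sigmoid(X @ W + b_h). Illustrative inspection step."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b_h)))  # hidden activations
    order = np.argsort(-H, axis=0)            # movies ranked per neuron
    return [[titles[i] for i in order[:n, j]] for j in range(W.shape[1])]
```

Reading off a neuron's top titles is much cheaper than a full evaluation and was enough to spot the ‘superhero’ and ‘terror-in-the-water’ clusters described above.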
There is still plenty of room for improvement, but we are confident that the improved feature encoding helped a lot to train models that are semantically more powerful. However, we are not done yet, because right now we completely ignore one-time keywords, which are a valuable source of information.