We are still thinking about ways to improve our model, and we never expected that data as complicated as movies could be explained entirely by a single factor. The keywords are therefore just a starting point for building a hierarchical model that can use multiple factors (features) to explain the data. As we mentioned before, the reason why somebody likes a movie is very likely a combination of the actors, the genre, the story and many other factors, for instance the countryside, the special effects or even the cars that are driven in the movie.
Converting all the data into a suitable model can be computationally very expensive. A complete movie database is very likely to contain a couple of thousand actors, hundreds of directors and dozens of story words, not to mention the genres, country information, ratings and other information. Last but not least, all the data is very, very sparse: even if an actor played in 200 movies, the total movie coverage should be below 0.5%. Training a complete model is therefore very unrealistic, but smaller steps in the right direction also lead to progress.
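To make the sparsity argument concrete, here is a small back-of-the-envelope sketch. Only the 200 movies per actor come from the text; the database size of 50,000 movies is a hypothetical number chosen purely for illustration.

```python
def coverage(movies_with_feature: int, total_movies: int) -> float:
    """Fraction of all movies in which a binary feature (e.g. an actor) is active."""
    return movies_with_feature / total_movies


# Hypothetical database size; the real number is not stated in this post.
TOTAL_MOVIES = 50_000

actor_coverage = coverage(200, TOTAL_MOVIES)
print(f"{actor_coverage:.2%}")  # 0.40%
```

Put differently, the 0.5% bound holds for a very busy actor as soon as the database contains more than 40,000 movies, and most actors appear in far fewer than 200 of them.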
As a basic setup, we wanted to learn more about the possibilities of combining keywords with genre information. The idea is very simple: on the one hand, we are interested in learning a good model of the data, but on the other hand, we also want to learn good disjoint topics. For this experiment, we used the genre as a “one hot” encoding (Y) and the keywords as the data (X). Even though we are not interested in classification, we used a Discriminative RBM to train the model. This particular RBM is able to model both the data and the “labels”, which means that we can use the genre information as a lever to better separate latent topics in our data. The drawback is that we now have a supervision criterion that might act as a bottleneck, because the ultimate goal is no longer to explain the data, but to explain it according to the labels.
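The encoding of X and Y described above can be sketched in a few lines. The vocabulary, the genre list and the helper names `encode_keywords` and `encode_genre` are hypothetical and only serve as an illustration; they are not taken from our actual pipeline.

```python
def encode_keywords(keywords, vocab):
    """Binary keyword vector x: 1 if the keyword occurs in the movie, else 0."""
    present = set(keywords)
    return [1 if word in present else 0 for word in vocab]


def encode_genre(genre, genres):
    """One-hot vector y with a single 1 at the position of the genre."""
    return [1 if g == genre else 0 for g in genres]


# Hypothetical toy vocabulary and genre list.
VOCAB = ["cowboy", "sheriff", "high-school", "teenager", "love", "dog"]
GENRES = ["western", "comedy", "romance"]

x = encode_keywords(["cowboy", "sheriff"], VOCAB)
y = encode_genre("western", GENRES)
print(x)  # [1, 1, 0, 0, 0, 0]
print(y)  # [1, 0, 0]
```

The one-hot vector y is what ties each movie to exactly one genre, which is also why a movie with several genres cannot be represented by a single (x, y) pair.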
The model was trained with CD-1 (contrastive divergence with a single Gibbs step) and mini-batches for faster convergence. Whenever a movie had more than one genre, we duplicated the feature vector x for each genre. As usual, we checked the weights of the hidden nodes to find expressive keywords, and we used the spread of the keywords as a simple indicator of quality.
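The following is a minimal pure-Python sketch of the two steps above: duplicating x once per genre, and one CD-1 update on a mini-batch for an RBM with binary keyword units and a softmax label unit. It only covers the generative CD-1 update and is not our actual training code; the class `LabelRBM`, the helper `expand_multi_genre`, the learning rate, the number of hidden units and the toy data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expand_multi_genre(movies, n_genres):
    """Duplicate each keyword vector once per genre, paired with a one-hot label."""
    pairs = []
    for x, genre_ids in movies:
        for g in genre_ids:
            pairs.append((x, [1 if k == g else 0 for k in range(n_genres)]))
    return pairs


class LabelRBM:
    """RBM over binary keywords x and a one-hot (softmax) genre label y."""

    def __init__(self, n_x, n_y, n_h, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0, 0.01) for _ in range(n_x)] for _ in range(n_h)]
        self.U = [[self.rng.gauss(0, 0.01) for _ in range(n_y)] for _ in range(n_h)]
        self.b, self.d, self.c = [0.0] * n_x, [0.0] * n_y, [0.0] * n_h

    def hidden_probs(self, x, y):
        return [
            sigmoid(c + sum(w * v for w, v in zip(Wj, x)) + sum(u * t for u, t in zip(Uj, y)))
            for c, Wj, Uj in zip(self.c, self.W, self.U)
        ]

    def sample_bits(self, probs):
        return [1 if self.rng.random() < p else 0 for p in probs]

    def sample_visible(self, h):
        """Sample keywords (independent bits) and one genre (softmax) given h."""
        n_h = len(h)
        px = [sigmoid(self.b[i] + sum(h[j] * self.W[j][i] for j in range(n_h)))
              for i in range(len(self.b))]
        py = softmax([self.d[k] + sum(h[j] * self.U[j][k] for j in range(n_h))
                      for k in range(len(self.d))])
        k = self.rng.choices(range(len(py)), weights=py)[0]
        return self.sample_bits(px), [1 if i == k else 0 for i in range(len(py))]

    def cd1_update(self, batch, lr=0.1):
        """One CD-1 parameter update, averaged over the mini-batch."""
        stats = []
        for x0, y0 in batch:  # all statistics use the parameters before the update
            h0 = self.hidden_probs(x0, y0)
            x1, y1 = self.sample_visible(self.sample_bits(h0))
            stats.append((x0, y0, h0, x1, y1, self.hidden_probs(x1, y1)))
        step = lr / len(batch)
        for x0, y0, h0, x1, y1, h1 in stats:
            for j in range(len(self.c)):
                for i in range(len(self.b)):
                    self.W[j][i] += step * (h0[j] * x0[i] - h1[j] * x1[i])
                for k in range(len(self.d)):
                    self.U[j][k] += step * (h0[j] * y0[k] - h1[j] * y1[k])
                self.c[j] += step * (h0[j] - h1[j])
            for i in range(len(self.b)):
                self.b[i] += step * (x0[i] - x1[i])
            for k in range(len(self.d)):
                self.d[k] += step * (y0[k] - y1[k])


# Hypothetical toy data: (keyword vector, list of genre indices).
movies = [([1, 1, 0, 0], [0]), ([0, 0, 1, 1], [1, 2])]
pairs = expand_multi_genre(movies, n_genres=3)  # the second movie appears twice
rbm = LabelRBM(n_x=4, n_y=3, n_h=2)
for epoch in range(20):
    rbm.cd1_update(pairs)
```

A real implementation would use a matrix library and sparse inputs; the nested loops here are only meant to make each term of the update visible.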
The results were very promising. Even though some neurons did not seem to represent a human-readable topic, most of them succeeded in capturing some concepts in the data: cowboys/wild-west, high-school/teenagers, romance/love, pet/dog/family, war/soldier/prison and many more. The next step is to tune the parameters of the model. Since the one-hot encoding is not suitable for some features, we are also thinking about adjusting the model to use binary units instead of the softmax units.
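Reading topics off the hidden units boils down to ranking the keywords by their weight for each neuron. The sketch below shows the idea; the function name `top_keywords`, the vocabulary and the weight matrix are made up for illustration and are not our trained weights.

```python
def top_keywords(weights, vocab, n=3):
    """For each hidden unit, return the n keywords with the largest weights."""
    topics = []
    for row in weights:
        ranked = sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:n]])
    return topics


# Hypothetical vocabulary and hand-made weights (one row per hidden unit).
VOCAB = ["cowboy", "sheriff", "teenager", "prom", "soldier", "prison"]
W = [
    [2.1, 1.7, -0.3, -0.5, 0.2, 0.1],
    [-0.4, -0.2, 1.9, 1.5, -0.1, 0.0],
]

for j, words in enumerate(top_keywords(W, VOCAB, n=2)):
    print(j, words)
# 0 ['cowboy', 'sheriff']
# 1 ['teenager', 'prom']
```

A neuron whose top keywords belong together, as in these two toy rows, is what we would call a human-readable topic.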