Right now, we feel a bit as if we were on a journey around the world. Early on, we started with vanilla neural networks to build simple models to classify movies; then we switched to Support Vector Machines (SVMs). Both approaches worked well, but soon there was a demand to learn a semantic feature space to cluster movies and to better explain the data. That is why we switched to unsupervised learning algorithms, like Auto-Encoders, RBMs, and non-negative matrix factorization, to build a model that is able to capture high-level concepts of the underlying data.
Due to the nature of our data, some of these excursions turned out to be extremely daunting, and we soon rejected one approach in favor of another. Now we are almost back where we started. However, this is not a bad thing, as the following example illustrates.
Since our data is very sparse, training with some algorithms can be very tricky, but with the recently explored ReLU units things improved a lot. This type of unit can be used both in RBMs and in neural networks, and we decided to focus on neural networks first, because the Auto-Encoder (AE) model is simpler than an RBM. The results of AE models with ReLU units are already very good, but some fine-tuning is definitely required to better discriminate movies of different genres. That is why we decided to use the genres of movies as labels to perform supervised fine-tuning of our model.
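To make the role of the ReLU unit concrete, here is a minimal sketch. The key property for our setting is that negative pre-activations are mapped to exactly zero, so the hidden representation itself tends to be sparse, which matches our sparse input data well:

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: element-wise max(0, x).
    # Negative pre-activations become exactly zero, so the resulting
    # hidden representation is itself sparse.
    return np.maximum(0.0, x)

pre_activations = np.array([-1.5, 0.0, 0.7, -0.2, 2.3])
print(relu(pre_activations))  # zeros wherever the input was negative
```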
The whole process can be described as follows. First, we pre-train an AE model purely unsupervised on our data. Then we throw away the reconstruction layer and use the weights and biases of the AE to initialize a neural network with a softmax layer on top that uses the genre as label information. The whole network is then trained until convergence. Since we are not interested in the genre classification itself, we again throw away all layers except the hidden layer, which represents our AE with adjusted weights. Finally, we have a weight matrix ‘W’ and a bias vector ‘h’ that can be used to transform movies into the new semantic feature space.
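The three steps above can be sketched with plain NumPy. Everything here is illustrative, not our production code: the data is random toy data, the dimensions (50 input features, 16 hidden units, 3 genres) are made up, the AE uses tied weights for brevity, and plain gradient descent stands in for whatever optimizer one would actually use:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(0.0, x)

# Toy stand-ins: 200 "movies" with 50 sparse binary features and one
# of 3 hypothetical genres. All dimensions are illustrative only.
X = (rng.random((200, 50)) < 0.1).astype(float)
genres = rng.integers(0, 3, size=200)
Y = np.eye(3)[genres]                      # one-hot genre labels

n_hidden, lr = 16, 0.1

# --- Step 1: unsupervised pre-training of the AE --------------------
# Tied weights: the decoder reuses W.T, so only W, b_h, b_r are learned.
W = rng.normal(0, 0.01, (50, n_hidden))
b_h = np.zeros(n_hidden)
b_r = np.zeros(50)
for _ in range(200):
    H = relu(X @ W + b_h)                  # encoder
    X_hat = H @ W.T + b_r                  # reconstruction layer
    err = X_hat - X                        # squared-error gradient
    dH = (err @ W) * (H > 0)               # back-prop through ReLU
    W -= lr / len(X) * (X.T @ dH + err.T @ H)
    b_h -= lr * dH.mean(axis=0)
    b_r -= lr * err.mean(axis=0)

# --- Step 2: drop the reconstruction layer, add a softmax layer -----
V = rng.normal(0, 0.01, (n_hidden, 3))     # softmax weights
c = np.zeros(3)
for _ in range(200):
    H = relu(X @ W + b_h)                  # initialized from the AE
    Z = np.exp(H @ V + c)
    P = Z / Z.sum(axis=1, keepdims=True)   # softmax over genres
    dZ = (P - Y) / len(X)                  # cross-entropy gradient
    dH = (dZ @ V.T) * (H > 0)
    V -= lr * (H.T @ dZ)
    c -= lr * dZ.sum(axis=0)
    W -= lr * (X.T @ dH)                   # supervised fine-tuning of W
    b_h -= lr * dH.sum(axis=0)

# --- Step 3: keep only W and the hidden bias as the transform -------
def to_semantic_space(movies):
    return relu(movies @ W + b_h)

features = to_semantic_space(X)
print(features.shape)                      # (200, 16)
```

After step 3, `W` and the hidden bias are all that remain: the softmax layer served only to nudge the AE weights toward genre-discriminative features.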
To assess the quality of the transformation, we selected a set of well-known movies and determined the k nearest neighbors in the new feature space. For comparison, we repeated the procedure with the AE weights before fine-tuning. A short analysis of the results clearly shows that the fine-tuned model outperforms the raw model: for instance, its neighborhoods contain far fewer outliers, and movies cluster more distinctly by genre.
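The neighbor lookup itself is straightforward. A minimal sketch, with random vectors and made-up titles standing in for the actual learned features; cosine similarity is an assumption on our part that tends to suit non-negative ReLU features, and Euclidean distance would be a reasonable alternative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for movies in the learned semantic space:
# one feature vector per movie, plus titles purely for illustration.
features = rng.random((6, 4))
titles = ["Alien", "Aliens", "Toy Story", "Up", "Heat", "Se7en"]

def k_nearest(query_idx, features, k=2):
    # Normalize rows, then rank by cosine similarity to the query.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_idx]
    order = np.argsort(-sims)
    neighbors = [i for i in order if i != query_idx][:k]
    return [titles[i] for i in neighbors]

print(k_nearest(0, features, k=2))  # the 2 movies closest to "Alien"
```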
However, as also described in the literature, networks with ReLU units are very effective when lots of labeled data are available. In other words, pre-training might not be required at all if we have enough data at hand. And since a genre label is almost always available, we have a win-win situation, and the only challenge is to collect proper metadata for movies.
But regardless of how we train a model, our ultimate goal is to create a semantic space that can be used for clustering, retrieval, and finally for preference-based learning. Stated differently, even without pre-training, the first network would be an intermediate model that is only used as a feature extractor. The next model would directly utilize user ratings of movies.