Over the last few weeks, we tested many different approaches: different algorithms, different feature transformations, supervised methods and lots of unsupervised learning. We tuned the parameters of one model and discarded another. The lessons learned read like this:
– a good unsupervised model requires a sufficient amount of data
– the quality of the model largely depends on the ability of the raw features to describe the data properly
– great care must be taken in the selection of the hyper-parameters
– the fact that the data is very sparse needs to be incorporated into the model-building process
– shallow models are not expressive enough to describe higher-order correlations in the data
With the limited data at hand, it is very unlikely that training a larger model would succeed, but without one the final model is very limited. That is why we decided to go another way. Similar to greedy pre-training in other domains, we train smaller models on subsets of the data. For now, we use the existing genre information of movies to train useful ‘concept neurons’. These smaller models are then combined into a larger network, which is used to transform the input data into a hidden representation of the data (similar to an ordinary neural network with one hidden layer). As with stacked RBMs, we use the output of the model as the input to a new model (layer), with the difference that the previous layer does not consist of a single RBM but of many.
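The hidden representation mentioned above is the standard feed-forward transform shared by RBMs and one-hidden-layer networks. A minimal sketch with numpy; the weight shapes and names here are illustrative, not taken from our actual models:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_representation(v, W, b):
    """Map a visible (input) vector to hidden activations,
    exactly as in an ordinary network with one hidden layer."""
    return sigmoid(v @ W + b)

# toy example: 6 raw features, 3 hidden 'concept' units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))
b = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])  # sparse keyword vector
h = hidden_representation(v, W, b)             # 3 activations in (0, 1)
```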
This detour is required because the sparsity of the input data does not allow us to capture nuances of topics at a large scale. Therefore, we train models on specific genres to extract these details and then train a larger model to find relations between these topics.
A schematic of this approach looks like this:
– train an RBM model for each genre
  – keep N hidden nodes (arbitrary, largest L2-norm, …)
  – save the top-K features with the largest weights from each node
  – store the reduced model
– transform each movie into the new feature space
  – each RBM model contributes N input dimensions
  – inference is done on the reduced features/weights
– train a new RBM model with |genres| * N input dimensions
  – store the model
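The reduction and transform steps above can be sketched as follows; the function names, the random weights and the toy dimensions are placeholders, and the per-genre RBM training itself is assumed to have already happened:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reduce_rbm(W, b, n_nodes, top_k):
    """Keep the n_nodes hidden units with the largest L2 weight norm,
    then zero all but the top_k largest-magnitude weights per kept unit."""
    norms = np.linalg.norm(W, axis=0)        # one norm per hidden unit
    keep = np.argsort(norms)[-n_nodes:]      # indices of the kept units
    W_red, b_red = W[:, keep].copy(), b[keep].copy()
    for j in range(W_red.shape[1]):
        small = np.argsort(np.abs(W_red[:, j]))[:-top_k]
        W_red[small, j] = 0.0                # drop all but the top-K features
    return W_red, b_red

def transform(v, reduced_models):
    """Concatenate the N hidden activations of every reduced genre model;
    the result is the |genres| * N dimensional input of the top-level RBM."""
    return np.concatenate([sigmoid(v @ W + b) for W, b in reduced_models])

# toy setup: 3 genres, 10 raw features, 5 hidden units each; keep N=2, K=4
rng = np.random.default_rng(1)
genres = [(rng.normal(size=(10, 5)), np.zeros(5)) for _ in range(3)]
reduced = [reduce_rbm(W, b, n_nodes=2, top_k=4) for W, b in genres]
v = (rng.random(10) < 0.3).astype(float)     # sparse movie vector
x = transform(v, reduced)                    # |genres| * N = 6 dimensions
```

The new RBM is then trained on such 6-dimensional vectors in place of the raw features.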
To transform a new movie into the feature space, we first use the reduced models to produce the input for the final model and then infer the final feature representation. The whole process is very fast, since it consists only of matrix-vector multiplications and element-wise sigmoid operations.
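End-to-end inference is therefore just two such transforms in a row. A sketch under the same assumptions as before, with random placeholder weights standing in for the trained reduced models and the trained top-level RBM:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer(v, reduced_models, W_top, b_top):
    """Stage 1: each reduced genre model emits N hidden activations.
    Stage 2: the top-level RBM maps their concatenation to the final
    representation. Only mat-vec products and element-wise sigmoids."""
    x = np.concatenate([sigmoid(v @ W + b) for W, b in reduced_models])
    return sigmoid(x @ W_top + b_top)

# toy dimensions: 4 genres with N=3 kept units -> 12-dim top input, 5 topics
rng = np.random.default_rng(2)
reduced = [(rng.normal(size=(20, 3)), np.zeros(3)) for _ in range(4)]
W_top, b_top = rng.normal(size=(12, 5)), np.zeros(5)
v = (rng.random(20) < 0.1).astype(float)     # sparse keyword vector
topics = infer(v, reduced, W_top, b_top)     # final latent representation
```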
A preliminary analysis on randomly sampled movies showed that the new model indeed captures the latent concepts of movies much better than the shallow ones. But an obvious bottleneck is still the quality of the features at the lowest level: if a complex movie is described by only a few keywords, the best model in the world cannot infer latent topics from such a coarse representation.
The other challenge is that deeper models have many more parameters to tune and to learn, and thus more time is required to study the dynamics of the new model. However, the numbers speak for themselves, and we are positive that we haven’t reached the full potential yet.