In a previous post we talked about possibilities to train deeper models. Especially in our case, where movies are composed of several latent topics with many highly non-linear interactions between them, the benefit of models with multiple layers is obvious. Now the question is how to tackle the problem. A straightforward answer would be to learn one layer at a time, also known as greedy pre-training, and then fine-tune the stack of layers as one big model. However, successfully training such a model requires patience and lots of data. With our rather limited data at hand, metadata for a couple thousand movies, the outlook is not promising. Therefore, we decided to focus on less complex approaches with fewer layers.
We already tried RBMs to build simple models for movie genres, and we also tried to train global models. The local models, at least, looked very promising for deriving useful features, though not as a drop-in replacement for a similarity distance. A close relative of the RBM is the Deep Boltzmann Machine (DBM), and since some literature exists on how to efficiently train a 2-layer model, we decided to give it a shot.
Because a proper description of a DBM requires introducing lots of different concepts to understand all the details, we refer to the existing literature for an introduction. In a nutshell: from the outside, the model looks like two stacked RBMs, and it should be mentioned that the second layer does not need to have the same number of nodes as the first one.
To get to know the approach, we started with a simple setup that tries to learn topic neurons for a single movie genre. The evaluation is as usual: for each trained neuron, we sort the weight connections to the input data and output the K features with the largest weights. Since we now have a 2-layer model, we combined the weights of the two layers (W, V) into a single weight matrix W': W' = transpose(W)*transpose(V).
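The weight combination and the top-K inspection can be sketched in a few lines of numpy. This is a minimal illustration with made-up dimensions and random stand-in weights; in practice W and V come from the trained DBM, and the feature names are the raw movie keywords:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d raw keyword features, h1/h2 hidden units per layer.
d, h1, h2, K = 6, 4, 3, 3
features = ["dancer", "competition", "music", "school", "prom", "geek"]

# Stand-in weights for the two trained layers (hidden x visible convention):
W = rng.normal(size=(h1, d))   # first layer:  visible -> hidden1
V = rng.normal(size=(h2, h1))  # second layer: hidden1 -> hidden2

# Combine both layers into a single matrix, W' = transpose(W)*transpose(V),
# which maps each top-level neuron back onto the raw input features.
W_prime = W.T @ V.T            # shape (d, h2)

# For each top-level neuron, list the K features with the largest weights.
for j in range(h2):
    top = np.argsort(W_prime[:, j])[::-1][:K]
    print(f"neuron {j}:", [features[i] for i in top])
```

Note that the matrix product only conforms if the weights are stored as hidden-by-visible matrices; with the opposite convention, W' = W*V gives the equivalent d-by-h2 result.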
The evaluation of the results is a little difficult, since we have no real criteria to measure success. Nevertheless, the results of the 2-layer model seem perfectly reasonable and even more sophisticated than those of the single-layer model. Here are two examples:
– dancer, competition, music, art, singer
– school, prom, geek, opposites-attract, triumph
where each of these neurons learned a high-level topic from the training movies. If we only consider the weights from the first layer, the trained neurons also capture latent topics, but they contain much more noise in the form of irrelevant raw features (keywords) that nevertheless receive large weights.
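For comparison, the first-layer-only inspection sorts the rows of W directly instead of the combined matrix. Again a sketch with hypothetical shapes and random stand-in weights, just to show which matrix is being ranked:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: W maps d raw keyword features to h1 first-layer
# hidden units (random stand-in values for illustration only).
d, h1, K = 6, 4, 3
features = ["dancer", "competition", "music", "school", "prom", "geek"]
W = rng.normal(size=(h1, d))

# Rank each first-layer neuron's connections to the raw features.
# These rankings tend to be noisier than those derived from the
# combined 2-layer matrix, as noted above.
for j in range(h1):
    top = np.argsort(W[j])[::-1][:K]
    print(f"hidden-1 neuron {j}:", [features[i] for i in top])
```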