While experimenting with the DBM algorithm we found that our chosen model is very sensitive to parameter selection, especially the learning rate. There is a lot of literature about carefully tuning learning rates, or about using second-order optimization to get rid of the rate altogether. However, since our data set is very small and sparse, our options are limited, and we still think stochastic gradient descent is the way to go.
The crux is that if we already have trouble training a single layer without large fluctuations, jointly training two layers seems almost impossible. That is why we thought about alternatives. One very straightforward option is layer-wise pre-training: train single models and finally combine them into one big model. But with our experiences from the past, we decided to take smaller steps.
In previous posts we mentioned that the large variance of the L2 norms of the individual neurons is probably responsible for the unstable models we got in the past, or at least part of the problem. The literature on DBM training suggests a regularization that penalizes large deviations of individual neuron norms from the mean norm of all neurons. Since our experiments confirm that models with lower variance tend to be more useful, we decided to use the described regularization for our layer-wise training.
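To make the penalty concrete, here is a minimal NumPy sketch of one way such a regularization could look. The function name, the penalty weight `lam`, and the simplification of treating the mean norm as a constant in the gradient are our illustrative assumptions, not the exact formulation from the literature:

```python
import numpy as np

def norm_penalty(W, lam=0.01):
    """Penalize hidden units (columns of W) whose L2 norm deviates
    from the mean norm across all units.
    Returns the penalty value and its gradient w.r.t. W.
    Illustrative sketch; lam and the exact form are assumptions."""
    norms = np.linalg.norm(W, axis=0)   # L2 norm per hidden unit
    dev = norms - norms.mean()
    penalty = lam * np.sum(dev ** 2)
    # d/dW_ij of lam * sum_j (||w_j|| - mean)^2, treating the mean
    # as a constant (a common simplification)
    grad = lam * 2.0 * W * (dev / np.maximum(norms, 1e-12))
    return penalty, grad
```

If all columns already have the same norm, both the penalty and its gradient vanish, so the term only kicks in when individual neurons start to dominate.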
We started with a shallow, one-layer RBM with the regularization added to the objective function. The other parameters were unchanged from our previous experiments. For simplicity we first tested the model on data from a single movie genre. Again, we used the keyword frequency as a rule of thumb to estimate the precision of the trained model. Furthermore, we repeated the training several times to learn about possible fluctuations in the precision.
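As a rough sketch of how such a regularized update might look, here is a single CD-1 step for a binary RBM with the norm-deviation penalty folded into the weight gradient. All names, shapes, and hyperparameters (`lr`, `lam`) are illustrative assumptions, not the exact setup of our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b_vis, b_hid, v0, lr=0.05, lam=0.01):
    """One CD-1 update for a binary RBM, with a penalty on the
    deviation of per-unit L2 norms from their mean added to the
    weight gradient. Sketch only; values are assumptions."""
    # positive phase
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # negative phase: one Gibbs step
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # likelihood gradient (averaged over the batch)
    n = v0.shape[0]
    dW = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    # subtract the norm-penalty gradient (mean treated as constant)
    norms = np.linalg.norm(W, axis=0)
    dev = norms - norms.mean()
    dW -= lam * 2.0 * W * (dev / np.maximum(norms, 1e-12))
    W += lr * dW
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid
```

The penalty simply pulls over-long weight columns back toward the pack, which is the behavior we were after.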
The results were quite promising. In comparison to earlier results, the variance of the L2 norms was much lower and the diversity of the learned concepts was higher than before. Stated differently, the number of keywords that occur in most neurons was lower.
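One hypothetical way to quantify that diversity measure: for each neuron, look at its strongest keyword weights and count how many keywords show up in the top list of a majority of neurons. The function and thresholds below are our own illustrative choices, not the metric we actually reported:

```python
import numpy as np

def shared_top_keywords(W, top_k=10, share=0.5):
    """Rough diversity check: for each hidden unit (column of W),
    take the indices of its top_k strongest weights, then count how
    many keywords appear in the top list of more than `share` of all
    units. Fewer shared keywords means more diverse concepts.
    Illustrative sketch; top_k and share are assumptions."""
    n_keywords, n_hidden = W.shape
    counts = np.zeros(n_keywords, dtype=int)
    for j in range(n_hidden):
        top = np.argsort(np.abs(W[:, j]))[-top_k:]
        counts[top] += 1
    return int(np.sum(counts > share * n_hidden))
```

A lower return value corresponds to the improvement described above: fewer keywords dominating most neurons at once.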
Nevertheless, there is still a lot of work to do. First, it is not reasonable to believe that a fixed set of parameters will work for all movie genres; some genres have very few movies and/or features. Second, we need a reliable stopping condition, especially for genres with very few movies. There is an indication that the L2 norm stabilizes at some point, but this depends on many factors, for instance the weight decay, the learning rate, and the data itself, and is thus not very reliable.
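A stopping heuristic built on that observation could look like the following sketch: track the mean L2 norm per epoch and stop once its relative change stays below a threshold for a few epochs. The window size and tolerance are placeholder guesses that would need tuning per genre, which is exactly the reliability problem mentioned above:

```python
import numpy as np

def norm_stabilized(norm_history, window=5, tol=1e-3):
    """Heuristic stopping check: report convergence when the mean L2
    norm has changed by less than `tol` (relative) over each of the
    last `window` epochs. window and tol are illustrative guesses."""
    if len(norm_history) < window + 1:
        return False
    recent = np.asarray(norm_history[-(window + 1):], dtype=float)
    rel_change = np.abs(np.diff(recent)) / np.maximum(recent[:-1], 1e-12)
    return bool(np.all(rel_change < tol))
```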
Bottom line: Maybe we need to slow down a little and return to the shallow world to master some problems before we can climb down into the deep ravines.