We recently switched to PCD (persistent contrastive divergence) for training our RBM models. It took some time to get familiar with it, but after that it worked like a charm. As mentioned in the previous post, we favor models that are as simple as possible so that we can focus on tuning the parameters, and as we will see, that tuning is definitely required.
We illustrate this with the training of topic neurons for a specific movie genre. With about 200 samples and 300 features, we started by figuring out a proper batch size. In all experiments we linearly decayed the learning rate from an initial value L to zero. With full-batch training, the learning signal disappeared very quickly and none of the neurons were able to learn anything useful. A batch size of about 10% of the samples or lower led to stable results. We verified this by monitoring the growth of the L2 norm of the weight matrix, which we also used as an exit condition: training stopped once the change in the norm between two epochs fell below a threshold.
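To make the setup concrete, here is a minimal NumPy sketch of PCD training with mini-batches, a linearly decayed learning rate and a per-epoch record of the weight norm. It is not our actual training code: the function name `train_rbm_pcd`, the binary units, a single Gibbs step per update and all default values are assumptions for illustration, and "L2 norm of the weight matrix" is read here as the Frobenius norm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_pcd(data, n_hidden=20, epochs=50, batch_size=20, lr_start=0.05, seed=0):
    """Train a binary RBM with PCD; return the weights and the per-epoch weight norms."""
    rng = np.random.default_rng(seed)
    n_samples, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    # Persistent "fantasy" particles: kept across updates, never reset to the data.
    chain = (rng.random((batch_size, n_visible)) < 0.5).astype(float)
    norms = []
    for epoch in range(epochs):
        lr = lr_start * (1.0 - epoch / epochs)  # linear decay towards zero
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            # Positive phase: hidden probabilities driven by the data.
            v_pos = data[order[start:start + batch_size]]
            h_pos = sigmoid(v_pos @ W + b_hid)
            # Negative phase: advance the persistent chain by one Gibbs step.
            h_prob = sigmoid(chain @ W + b_hid)
            h_chain = (rng.random(h_prob.shape) < h_prob).astype(float)
            v_prob = sigmoid(h_chain @ W.T + b_vis)
            chain = (rng.random(v_prob.shape) < v_prob).astype(float)
            h_neg = sigmoid(chain @ W + b_hid)
            # Gradient step: data statistics minus chain statistics.
            W += lr * (v_pos.T @ h_pos / len(v_pos) - chain.T @ h_neg / batch_size)
            b_vis += lr * (v_pos.mean(axis=0) - chain.mean(axis=0))
            b_hid += lr * (h_pos.mean(axis=0) - h_neg.mean(axis=0))
        norms.append(float(np.linalg.norm(W)))  # Frobenius norm of W
    return W, norms
```

With 200 samples, `batch_size=20` corresponds to the 10% setting described above. Plotting the returned `norms` against the epoch is enough to see whether the norm grows steadily or stalls early.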
The precision of the model was measured by determining the spread of the words across all learned neurons. Stated differently, if a word is present in all neurons, its discriminative power is very low. The goal is to train latent topics that are distinct from one another, with minimal overlap between their keywords.
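One possible way to put a number on this spread is sketched below. It is an assumption, not the exact measure we used: each neuron is reduced to its `k` highest-weighted words, and the overlap between two neurons is the Jaccard index of their keyword sets. The helper names `top_keywords` and `mean_pairwise_overlap` are hypothetical.

```python
import numpy as np

def top_keywords(W, k):
    """Indices of the k largest weights per hidden unit; W has shape (n_words, n_hidden)."""
    return [set(np.argsort(W[:, j])[-k:].tolist()) for j in range(W.shape[1])]

def mean_pairwise_overlap(topics):
    """Average Jaccard overlap over all pairs of topics (0.0 means fully disjoint)."""
    scores = []
    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            union = topics[i] | topics[j]
            scores.append(len(topics[i] & topics[j]) / len(union))
    return sum(scores) / len(scores) if scores else 0.0
```

A value close to zero means the neurons picked up mostly different keywords, while a value close to one means they all learned roughly the same words.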
Empirically, a continual increase of the L2 norm of the weight matrix turned out to be a good indicator for this. If the norm grows very slowly during training, the learned topics are usually poor. With a good set of parameters, the norm keeps growing, but the difference between two consecutive epochs shrinks until the norm converges. Inspecting the learned topics then usually confirmed that the spread of the keywords was sufficient.
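The convergence behaviour of the norm can be turned into a simple stopping test. The sketch below assumes a list with one norm value per epoch; the function name `norm_converged`, the tolerance `tol` and the `patience` window are illustrative choices, not values from our experiments.

```python
def norm_converged(norms, tol=1e-3, patience=3):
    """True when the last `patience` epoch-to-epoch changes are all below `tol`."""
    if len(norms) <= patience:
        return False  # not enough history yet
    recent = norms[-(patience + 1):]
    return all(abs(b - a) < tol for a, b in zip(recent, recent[1:]))
```

Requiring several small changes in a row, rather than a single one, guards against stopping on one accidentally flat epoch while the norm is still growing.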
However, the accuracy comes at a price. Convergence on such a small training set is very slow. To ensure that we do not destroy the weak learning signal, we need many iterations. We found that it works best not to define a maximum number of epochs at all, but to let the training continue until the model has converged. And even then, it is still possible that the model gets stuck in a poor local optimum.
Nevertheless, we consider these insights a huge leap forward towards learning a joint latent topic model for movies. The trend is clear: we need enough hidden nodes, lots of time and, of course, more training data.