In the last days we went back and forth to find a suitable model for our data. The Restricted Boltzmann Machine still seems to be a very good choice in the sense that it naturally works with binary data and is also a generative model. However, the successful training of an RBM model can be challenging when the data at hand has special restrictions and is very limited.
For that -and other reasons- we went back to the most basic model. A simple RBM with no regularization at all. Plus, we used a fixed learning rate. To make sure our implementation is correct, we used a smaller version of the MNIST training data which allows us to visually inspect the learned filters. After a successful test, we implemented a new feature and repeated the procedure.
At the end, we had a basic RBM model with weight decay, momentum, mini-batches and support for decaying the learning rate over time.
We started the experiments with our data with the most basic version of the RBM. We then trained a model, checked the results and tuned-in the next feature. This way, we can better control the influence of each feature to the final model. Of course the final tuning of all parameters is still very challenging. A major problem of RBM training with non-visual data is that you cannot easily inspect the filters to see what they have learned. There are other indicators to check, like the monitoring of the L2 norm of the weight matrix over time, but that is not as reliable as the visual inspection.
In recent experiments, we monitored the total variance of the L2 norms of the neurons but such a measure is only possible for a specific cost function and not useful in general. In other words, for our data it is probably best to interrupt the training periodically to show the learned topics of a subset of the neurons. But since the neurons adjust over time, it is hard to tell at which intervals this produces useful results. Bottom line: the task is subject to further research.
Now, let us talk how to train models. There are several ways to train an RBM and Contrastive Divergence (CD) is probably the one that is best known. However, there are other methods and one of them is Persistent CD or PCD for short. As the name indicates, there is no reset between sampling and the state of the chain is preserved. Details about the procedure can be found in the literature. We only mention that the learning rate should to be smaller compared to CD.
We plan to evaluate our implementation with different cost functions and training methods and we will start with PCD since it is easy to implement.