It is no secret that we need more data, and data that is more heterogeneous than what we currently use. However, some approaches can handle even smaller data sets with very good results, which gives us some confidence that we can still improve our results a little.
Since we focused on energy-based models, we were able to continually improve the precision of our model step by step. A method that has gained a lot of attention recently is called ‘Dropout’. The idea is simple but very elegant: to get a result similar to the average over a large set of trained models, half of the hidden nodes, chosen at random, are omitted from the model in each step. In a nutshell, this method prevents a hidden node from relying on the others to clean up its mess. The random selection forces each node to do its job as well as possible without depending too much on the others.
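At its core, dropout is just a random binary mask applied to the hidden activations. A minimal NumPy sketch (the function name and shapes are our own illustration, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_hidden(hidden_activations, rate=0.5, rng=rng):
    """Zero out a random fraction `rate` of hidden units.

    Returns the masked activations and the mask itself, so the
    same mask can be reused within one training step.
    """
    mask = rng.random(hidden_activations.shape) >= rate
    return hidden_activations * mask, mask
```

With a rate of 0.5, each hidden unit is silenced in roughly half of the training steps, so no unit can count on any particular neighbour being active.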
As noted in the literature, over-fitting on smaller data sets is a serious problem. The dropout method acts as a strong regularizer that helps to minimize over-fitting. Because the implementation for an RBM is trivial, we decided to compare the results of dropout with those of our normal model.
However, since we cannot visualize the learned filters and the reconstruction error is not very reliable with our data, the question is how to measure the difference in the outcome of the two models. Possible measures are the spread of the learned keywords, how often a keyword occurs per node, and, of course, how conclusive the learned topics are.
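As a rough illustration of the spread measure, assume we can extract a list of top keywords per hidden node; counting distinct keywords and the average repetition per keyword could then look like this (the data structure and function are hypothetical, not our actual pipeline):

```python
from collections import Counter

def keyword_spread(node_keywords):
    """node_keywords: one list of top keywords per hidden node.

    Returns the number of distinct keywords (spread) and the
    average number of nodes each keyword occurs in (repetition).
    """
    counts = Counter(kw for kws in node_keywords for kw in kws)
    distinct = len(counts)
    avg_repetition = sum(counts.values()) / distinct if distinct else 0.0
    return distinct, avg_repetition
```

A higher distinct count with a lower average repetition would indicate that the nodes specialize on different keywords rather than duplicating each other.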
For our experiments, we used a dropout rate of 50%. To ensure that the implementation is correct, we first trained a model on a small set of handwritten digits and verified it by visualizing the learned filters.
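For an RBM trained with contrastive divergence (CD-1), applying a 50% dropout mask to the hidden units within each update could look roughly like the following sketch. The shapes, names, and single-sample update are simplifying assumptions, and we use mean-field hidden probabilities instead of sampled binary states to keep the example short:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step_with_dropout(v0, W, b_h, b_v, lr=0.01, rate=0.5, rng=rng):
    """One CD-1 update with dropout on the hidden layer.

    Assumed shapes: v0 (n_vis,), W (n_vis, n_hid), b_h (n_hid,), b_v (n_vis,).
    The same mask is used for the positive and negative phase, so a
    dropped unit contributes nothing to this step's gradient.
    """
    mask = rng.random(b_h.shape) >= rate            # keep ~50% of hidden units
    h0 = sigmoid(v0 @ W + b_h) * mask               # positive phase, masked
    v1 = sigmoid(h0 @ W.T + b_v)                    # reconstruction
    h1 = sigmoid(v1 @ W + b_h) * mask               # negative phase, same mask
    W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
    b_h += lr * (h0 - h1)
    b_v += lr * (v0 - v1)
    return W, b_h, b_v
```

Reusing the mask within a step means the weights of a dropped unit stay untouched for that step, which matches the idea of training a different thinned network each time.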
We started by learning a model for a single genre. The parameters (weight decay, momentum, learning rate, batch size, …) were fixed for both models, and we repeated the training several times. A very prominent difference was the total number of keywords learned by the models: on average, the dropout model learned about 5% more keywords than the normal model, which means dropout usually leads to a higher spread. Furthermore, repetition of keywords was often lower, both in the total occurrences of a keyword and with regard to the magnitudes of each keyword.
More work is required for a sound conclusion, but dropout showed a lot of potential, and the precision of the trained models was at least as good as that of our normal models. Therefore, we plan to continue our experiments with dropout.