In our last post, we wrote about the opportunities that arise when we switch from ordinary logistic units to something else. Since their introduction, rectified linear units (ReLUs) have been used in a broad range of problems, usually with success, which indicates that they do have potential. Furthermore, they are very elegant and easy to implement, since the activation of a ReLU unit is just f(z) := max(0, z), which means the derivative is either zero for negative inputs or constant otherwise. As a side effect, the unit promotes sparsity in the output, since the value is actually ‘0’ and not just a very small value that is nevertheless still positive.
That means that whenever the dot product of an input X and the weights W of a neuron is negative, the node does not contribute in this step. In other words, the ReLU function ensures that only a subset of all neurons is active for a particular input. This natural sparsity might further help to interpret the activation patterns of a given input.
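To make this concrete, here is a minimal NumPy sketch of the activation, its derivative, and the sparsity effect described above. The layer sizes and random inputs are purely illustrative, not taken from our actual model:

```python
import numpy as np

def relu(z):
    """Rectified linear unit: f(z) = max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 0 for negative inputs, 1 otherwise."""
    return (z > 0).astype(float)

# Toy layer: whenever the dot product of input x and the weight
# column of neuron j is negative, neuron j outputs exactly 0.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))  # 5 inputs, 3 neurons (illustrative sizes)
x = rng.normal(size=5)
h = relu(x @ W)
print(h)  # inactive neurons are exactly 0, not just a small positive value
```

The exact zeros in `h` are the sparsity we refer to: for a given input, only a subset of neurons carries any signal at all.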
The advantages of ReLU units reported in the literature are quite diverse. They range from faster learning to more interesting filters for audio/image data, or better results when the network is used as a discriminative model. For these and other reasons, we decided to take a closer look at the units.
We created a fixed data set for training and then compared logistic and ReLU units. As usual, we used the spread of the keywords as a performance indicator, but also the reconstruction error and the plausibility of the latent topics learned by each neuron. The results were pretty stunning: the ReLU units always led to a lower reconstruction error and fewer repeated keywords. Furthermore, training with ReLUs seems to require fewer epochs than with their logistic counterparts. With these results, which we frankly did not expect, we tried once more to train a larger model.
Instead of training a model per genre, we used the most frequent keywords, four hundred to be precise, and set the number of topic neurons to one hundred. We started with 100 epochs and annealed the learning rate from 100% to 20% of its initial value. To assess the quality of the model, we used a fixed set of movies and analyzed the K nearest neighbors in the feature space. Since the meta data of some movies is very sparse and probably incomplete, we sometimes got results that were a little confusing, but most of the time the results were plausible and in accordance with the crowd.
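The training schedule and the evaluation step can both be sketched in a few lines. The linear anneal and the cosine-based neighbor ranking below are plausible instantiations, not our exact code; the feature matrix is random stand-in data, and the function names are our own for this example:

```python
import numpy as np

def annealed_rate(base_lr, epoch, n_epochs=100, start=1.0, end=0.2):
    """Linearly anneal the learning rate from 100% to 20% of base_lr
    over n_epochs epochs."""
    frac = start + (end - start) * epoch / (n_epochs - 1)
    return base_lr * frac

def nearest_neighbors(features, query_idx, k=5):
    """Return the indices of the k movies closest to the query movie
    in feature space, ranked by cosine similarity."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f[query_idx]
    order = np.argsort(-sims)
    return [i for i in order if i != query_idx][:k]

# Stand-in features: one 100-dim topic activation vector per movie.
rng = np.random.default_rng(1)
features = rng.random((50, 100))
print(nearest_neighbors(features, query_idx=0, k=5))
```

With real topic activations instead of random vectors, the returned indices would be the candidate "similar movies" we inspected by hand.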
So, for a first shot, this was not too bad. And in the real-life test, with data from the electronic program guide (EPG) mapped to the feature data, we were pleasantly astonished more than once by the similar movies our approach suggested.