We believe that a reasonable model to capture user preferences requires a solid understanding of all aspects of the underlying data which can be hardly done with any existing stand-alone approach, regardless of how powerful or cool it is. To combine several models has been shown to improve the quality, but we prefer a unified approach to map knowledge from different modalities into a single feature space. However, before we start walking, we need to learn how to crawl, or equivalently, first baby steps, then giant steps.
This is why we collected all the wisdom we gained in the last year and started with a simple unsupervised model, the auto-encoder. Just a quick reminder, our data is high dimensional, binary (between 0..1) and very sparse. This is why we used ‘reconstructive’ sampling which recovers all “1”s, but only a sampled portion of “0”s to speed up the training. We use nesterov momentum and small mini-batches and we exponentially decay the learning rate. The objective is the cross-entropy function.
Our first goal was to get a better understanding what different types of neurons learn. To do this, we selected a specific activation function and then we compared the final loss, the magnitude of the weight matrix and the learned topics, by printing the connections with the highest values per neuron.
The first neuron type we analyzed is a rectified threshold function:
f = np.maximum(0, np.dot(x, W) - b)
where “x” (1 x n) is the input vector, “W” (n x u) the weight matrix and “b” the bias (u x 1). The function is similar to a ReLU but additionally, it only fires if a certain threshold is exceeded. The idea is that a feature is only activated, if “enough” of a topic in the input data “x” has been found.
The second neuron type is a winner-takes-all (WTA) function. The function requires that neurons are ordered in groups and only the winner, the one with the maximal value, is allowed to pass the output and all others are set to zero. In Theano it looks like that:
def wta(X, W, bias, groups=2)
units = W.shape / groups
pre_output = T.dot(X, W) + bias
reshaped = pre_output.reshape((X.shape, units, groups))
maxout = reshaped.max(axis=2)
max_mask = (reshaped >= maxout.dimshuffle(0, 1, 'x'))
return (max_mask * reshaped).reshape((X.shape, units*groups))
One advantage of WTA is the explicit sparsity, because there is always a looser and with a group size of 2, 50% of the output is zero and these neurons are biologically more plausible.
What is striking is the very different decline of the cost value during the first four epochs:
wta: 1.0306 -> 0.9259 -> 0.7971 -> 0.6969
thr: 1.0249 -> 0.9462 -> 0.9034 -> 0.8604
Because both networks are identical, the type of activation function seems to have a high impact how fast the model fits the data. Both models start with roughly the same loss, but after four epochs, the wta model decreased the initial loss by 32% in comparison with the 16% of the thr model.
The final objective value is also very different:
and it is interesting to note that the wta model passed the 0.3817 final value of the thr model at epoch 12!
If we take a look at the norm of the weight matrix, that encapsulates the learned topics, the difference is also noticeable:
However, since the gradient updates are proportional to the error, it is no surprise that the magnitudes of the updates for the thr model are larger, because the error is higher than the one of the wta model. Plus, for the wta model only about 50% of the weights get updated.
Another statistics is the number of “unique words” among the top-k words for each neurons:
wta: 807 (53.8%)
thr: 676 (45.0%)
As we can see, the diversity learned by the wta model is larger which theoretically means, it could capture more latent concepts.
But the most interesting part is the analysis of the learned topics, which was very disappointing for the wta model, because the most important words per neuron did not form obvious topics. The dump of the thr model was more interesting, but also lacked a clear topic structure. This does not mean the learned features are useless, because the model learned a distributed representation.
The last step was to find nearest neighbors for specific movies in the new space, to see how the model handles movies with different topics. We started with the movie “Doom” that combines topics from military sci-fi and horror. And again, the results were rather ambiguous, also for other tests, because both models were able to find movies with similar topics but often the movies only contained one of the topics, or both, but with different importance, or movies where the topic is present as part of a different context.
Bottom line, the choice of a loss function in combination with the activation function is essential to learn a good model, but a low value at the end of the training can be deceptive, since it does not automatically mean a good solution for the problem at hand. For instance, we were quite surprised that a shallow model generated so decent output, but at the same time the limitations were obvious. First, the models often failed to recognize the context, because it just relies on a set of unordered words and some words are used differently depending on its neighboring words. That is related to the problem that a single layer does not suffice to capture higher-order correlations and third, a model should use as much data as possible and not just the keywords.