In contrast to some academic datasets, real-life datasets are often unbalanced, with a lot noise and more variance than some models can cope with. There are ways to adapt models for those problems, but sometimes a classification is too restrictive. What do we mean by that? With a lot of variance for a single label, it might be possible to squeeze all samples into some corner of a high-dimensional space, but it is probably easier to learn the relative similar of samples.
With a triplet loss, we can move two samples with the same label closer together while we push them further away for a sample of a different label:
maximum(0, margin + f(anchor,neg) - f(anchor,pos)). The approach has the advantage that we can distribute samples across the whole feature space by forming “local clusters” instead of concentrating it in a dense region of the space. The quality of the model depends on the sampling of the triplets, since for a lot of triplets the margin is already preserved and no learning is done. But since the sampling can be done asynchronously, it is usually no problem to find enough violating triplets to continue the training.
If we take for instance items from an electronic program guide (epg) there might be a huge variance for the category “sports”, but only few words to describe the content. Thus, it might be sufficient that a nearest neighbor search -in the feature space- returns any item with the same label to perform a successful “classification”. This has the benefit that we can capture lots of local patterns with a shallower network instead of one global pattern in case of a softmax classification that requires a network that is likely complex. The price of the model is that we cannot simply classify new items with a single forward pass, but the model is probably easier to train because of the relative similarity. Plus, if the similarity of the anchor and the positive sample is higher than the similarity of the anchor and the negative one, with respect to the margin, there is no loss at all and no gradient step is required. In case of a nll loss, for example, there is always a gradient step, even if it is very small.
Furthermore, nearest neighbors searches[arxiv:1703.03129] can be done with matrix multiplications that are very efficient these days thanks to modern CPU and GPU architectures. It is also not required to scan the whole dataset, because it is likely that we can perform a bottom up clustering of the feature space to handle near duplicates.
The next question is how to combine a triplet loss with attention to aggregate data from related items, for example items with similar tags, to group embeddings of similar words. We are using the ideas from the post to encode items with a variable number of words into a fixed length representation.
After we decided to switch to PyTorch for new experiments, we stumbled about some minor problems. It’s no big deal and the workarounds are straightforward, but one should be aware of them to avoid frustration. Furthermore, it should be noted that the framework is flagged as “early beta” and this is part of the adventure mentioned on the website :-).
We extended an existing model by adding a skip-gram like loss to relate samples with tags both in a positive or negative way. For this, we are using the classical sigmoid + log loss:
sigmoid = torch.nn.functional.sigmoid
dot_p = torch.dot(anchor, tag_p)
loss_pos = -torch.log(sigmoid(dot_p)) #(1)
dot_n = torch.dot(anchor, tag_n)
loss_neg = -torch.log(1 - sigmoid(dot_n)) #(2)
The critical point is log(0), since log is undefined for this input, “inf” in PyTorch, and there are two ways how this can happen:
(1) sigmoid(x) = 0, which means x is a “large” negative value.
(2) sigmoid(x) = 1, which means x is a “large” positive value.
In both cases, -log(y) evaluates to zero and a hiccup occurs which leads to a numerical instability that makes further optimization steps useless.
One possible workaround is to bound the values of sigmoid to be slightly above zero and slightly below one, with eps ~1e-4:
value = torch.nn.functional.sigmoid(x)
value = torch.clamp(torch.clamp(value, min=eps), max=1-eps)
With this adjustment, sigmoid(dot_p) is always slightly positive and (1 – sigmoid(dot_n)) also never evaluates to zero.
It might be possible that pre-defined loss functions in PyTorch do not suffer this problem, but since we usually design our own loss function from scratch, numerical instabilities can happen if we combine certain functions. With the described kludge, we did not encounter problems any longer during the training of our model.
Again, we are pretty sure that those issues are addressed over time, but since PyTorch is already very powerful, elegant and fast, we do not want to wait until this happened. In other words, we really appreciate the hard work of the PyTorch team and since we made a choice to use a framework in an “early-release beta”, it’s only fair to be patient. Of course we are willing to help the project, for example by reporting bugs, but in this case someone else already did it (issue #1835).
We believe that a reasonable model to capture user preferences requires a solid understanding of all aspects of the underlying data which can be hardly done with any existing stand-alone approach, regardless of how powerful or cool it is. To combine several models has been shown to improve the quality, but we prefer a unified approach to map knowledge from different modalities into a single feature space. However, before we start walking, we need to learn how to crawl, or equivalently, first baby steps, then giant steps.
This is why we collected all the wisdom we gained in the last year and started with a simple unsupervised model, the auto-encoder. Just a quick reminder, our data is high dimensional, binary (between 0..1) and very sparse. This is why we used ‘reconstructive’ sampling which recovers all “1”s, but only a sampled portion of “0”s to speed up the training. We use nesterov momentum and small mini-batches and we exponentially decay the learning rate. The objective is the cross-entropy function.
Our first goal was to get a better understanding what different types of neurons learn. To do this, we selected a specific activation function and then we compared the final loss, the magnitude of the weight matrix and the learned topics, by printing the connections with the highest values per neuron.
The first neuron type we analyzed is a rectified threshold function:
f = np.maximum(0, np.dot(x, W) - b)
where “x” (1 x n) is the input vector, “W” (n x u) the weight matrix and “b” the bias (u x 1). The function is similar to a ReLU but additionally, it only fires if a certain threshold is exceeded. The idea is that a feature is only activated, if “enough” of a topic in the input data “x” has been found.
The second neuron type is a winner-takes-all (WTA) function. The function requires that neurons are ordered in groups and only the winner, the one with the maximal value, is allowed to pass the output and all others are set to zero. In Theano it looks like that:
def wta(X, W, bias, groups=2)
units = W.shape / groups
pre_output = T.dot(X, W) + bias
reshaped = pre_output.reshape((X.shape, units, groups))
maxout = reshaped.max(axis=2)
max_mask = (reshaped >= maxout.dimshuffle(0, 1, 'x'))
return (max_mask * reshaped).reshape((X.shape, units*groups))
One advantage of WTA is the explicit sparsity, because there is always a looser and with a group size of 2, 50% of the output is zero and these neurons are biologically more plausible.
What is striking is the very different decline of the cost value during the first four epochs:
wta: 1.0306 -> 0.9259 -> 0.7971 -> 0.6969
thr: 1.0249 -> 0.9462 -> 0.9034 -> 0.8604
Because both networks are identical, the type of activation function seems to have a high impact how fast the model fits the data. Both models start with roughly the same loss, but after four epochs, the wta model decreased the initial loss by 32% in comparison with the 16% of the thr model.
The final objective value is also very different:
and it is interesting to note that the wta model passed the 0.3817 final value of the thr model at epoch 12!
If we take a look at the norm of the weight matrix, that encapsulates the learned topics, the difference is also noticeable:
However, since the gradient updates are proportional to the error, it is no surprise that the magnitudes of the updates for the thr model are larger, because the error is higher than the one of the wta model. Plus, for the wta model only about 50% of the weights get updated.
Another statistics is the number of “unique words” among the top-k words for each neurons:
wta: 807 (53.8%)
thr: 676 (45.0%)
As we can see, the diversity learned by the wta model is larger which theoretically means, it could capture more latent concepts.
But the most interesting part is the analysis of the learned topics, which was very disappointing for the wta model, because the most important words per neuron did not form obvious topics. The dump of the thr model was more interesting, but also lacked a clear topic structure. This does not mean the learned features are useless, because the model learned a distributed representation.
The last step was to find nearest neighbors for specific movies in the new space, to see how the model handles movies with different topics. We started with the movie “Doom” that combines topics from military sci-fi and horror. And again, the results were rather ambiguous, also for other tests, because both models were able to find movies with similar topics but often the movies only contained one of the topics, or both, but with different importance, or movies where the topic is present as part of a different context.
Bottom line, the choice of a loss function in combination with the activation function is essential to learn a good model, but a low value at the end of the training can be deceptive, since it does not automatically mean a good solution for the problem at hand. For instance, we were quite surprised that a shallow model generated so decent output, but at the same time the limitations were obvious. First, the models often failed to recognize the context, because it just relies on a set of unordered words and some words are used differently depending on its neighboring words. That is related to the problem that a single layer does not suffice to capture higher-order correlations and third, a model should use as much data as possible and not just the keywords.