In contrast to some academic datasets, real-life datasets are often unbalanced, with a lot noise and more variance than some models can cope with. There are ways to adapt models for those problems, but sometimes a classification is too restrictive. What do we mean by that? With a lot of variance for a single label, it might be possible to squeeze all samples into some corner of a high-dimensional space, but it is probably easier to learn the relative similar of samples.
With a triplet loss, we can move two samples with the same label closer together while we push them further away for a sample of a different label:
maximum(0, margin + f(anchor,neg) - f(anchor,pos)). The approach has the advantage that we can distribute samples across the whole feature space by forming “local clusters” instead of concentrating it in a dense region of the space. The quality of the model depends on the sampling of the triplets, since for a lot of triplets the margin is already preserved and no learning is done. But since the sampling can be done asynchronously, it is usually no problem to find enough violating triplets to continue the training.
If we take for instance items from an electronic program guide (epg) there might be a huge variance for the category “sports”, but only few words to describe the content. Thus, it might be sufficient that a nearest neighbor search -in the feature space- returns any item with the same label to perform a successful “classification”. This has the benefit that we can capture lots of local patterns with a shallower network instead of one global pattern in case of a softmax classification that requires a network that is likely complex. The price of the model is that we cannot simply classify new items with a single forward pass, but the model is probably easier to train because of the relative similarity. Plus, if the similarity of the anchor and the positive sample is higher than the similarity of the anchor and the negative one, with respect to the margin, there is no loss at all and no gradient step is required. In case of a nll loss, for example, there is always a gradient step, even if it is very small.
Furthermore, nearest neighbors searches[arxiv:1703.03129] can be done with matrix multiplications that are very efficient these days thanks to modern CPU and GPU architectures. It is also not required to scan the whole dataset, because it is likely that we can perform a bottom up clustering of the feature space to handle near duplicates.
The next question is how to combine a triplet loss with attention to aggregate data from related items, for example items with similar tags, to group embeddings of similar words. We are using the ideas from the post to encode items with a variable number of words into a fixed length representation.
There are some algorithms that seem to be very powerful on the first look. For instance, we stumbled about a matrix factorization that is semi-supervised to cluster arbitrary data. To introduce a supervised signal, there are pairwise constraints to express that a pair of documents should be in the same cluster (W_like) or not (W_dislike). The model uses a similarity matrix S, plus W_like and W_dislike to build a matrix A: A = S + W_like – W_dislike.
We will illustrate an example from the movie domain. Let us assume that we have n=30K movies, which means S is a n * n matrix and so is W_* and we use k=50 clusters. When we assume that each value is encoded as a float32 value, we need about 3.5 GB memory. The required memory for the model parameters, is negligible, since we only need to store n * k for the cluster assignment and k * k for the centers, which is less than 6 MB memory.
Regardless of the efficiency and precision of models that are using a full similarity matrix, the bottleneck is the memory consumption during training. For 5K samples we need ~100 MB, but for 50K we need about ~9.5 GB, since the increase is not linearly. Combined with the fact that such models are transductive and do not allow to project unseen samples into the new space, the use is limited for datasets with lots of samples.
The impact could be alleviated if the similarity matrix could be stored as a sparse matrix since then the memory consumption could be drastically reduced. However, this requires that the similarity for lots of pairs of movies is zero. Plus, even if most of the full-fledged optimization frameworks support sparse matrices, they can be problematic.
In the last week, we tried a lot of different approaches. Most of the time, the results were okay but none of them knocked us from our feet. We speculated a lot about the reasons since there are some known approaches in the literature that work well even for data that is very sparse and limited. However, one big difference is the semantic of the underlying structure of the data. In case of micro-blogs the text still follows some patterns, even if it is different from natural language. When we consider our meta data, the situation is different. If a movie has only very limited keywords and one or two of them are even wrong, or the spelling is not correct, the record is close to be useless. In combination with the fact that our training set is also very small, it might happen that a weak learning signal gets too much noise and it does not learn anything useful at all.
Stated differently, we need an approach that is more robust in case of errors and that can cope with very sparse data. While we did some research, we stumbled about an old idea called “Siamese Networks”. This kind of network can be used for many things and we will investigate it to learn a similarity measure for our meta data. But first, we need to address some problems. The idea is to learn a projection from the keyword space to a concept space that allows to compare items with the help of latent topics instead of a direct comparison of the keywords. That is exactly what we want, but the drawback is that we now have a supervised model that requires labels to express the similarity of pairs of items.
Let us start with a brief description of the approach. The training is done on randomly sampled items f, g and a label y that describes how similar those items are. For simplicity, we use the cosine similarity. Furthermore, we have a projection matrix W that allows to map the items f, g into the new concept space by some function F; we use the dot product for now (h = f*W). A possible cost function is the squared loss 0.5*(label’ – label)**2, where label’ is the cosine similarity of f, g in the topic space. That was the easy part.
Now, we need to find a label that already exists and can be used to discriminate items at a basic level. In our initial setup, we used the genres for the label domain. We used the Jaccard coefficient as the label to determine the overlap of the genres which has the advantage that the result is bound by [0, 1], exactly as the cosine similarity. We implemented the model with Theano to avoid to manually juggle with the gradients. Since the output is compatible with our previous models, we used our existing machinery to check the similarity of some representative movies. Altogether, the results were at least as good as with the other models, with a tendency to be more consistent and fewer outliers.
Of course, more tests needs to be done here but the new approach seems to have a lot of potential and if we can improve the label generation the results can be hopefully further improved. Especially the situation where two movies have no overlap of the keywords, but the keywords are semantically related, can be modeled more efficiently with the pair-wise training.