In contrast to some academic datasets, real-life datasets are often unbalanced, with a lot noise and more variance than some models can cope with. There are ways to adapt models for those problems, but sometimes a classification is too restrictive. What do we mean by that? With a lot of variance for a single label, it might be possible to squeeze all samples into some corner of a high-dimensional space, but it is probably easier to learn the relative similar of samples.
With a triplet loss, we can move two samples with the same label closer together while we push them further away for a sample of a different label:
maximum(0, margin + f(anchor,neg) - f(anchor,pos)). The approach has the advantage that we can distribute samples across the whole feature space by forming “local clusters” instead of concentrating it in a dense region of the space. The quality of the model depends on the sampling of the triplets, since for a lot of triplets the margin is already preserved and no learning is done. But since the sampling can be done asynchronously, it is usually no problem to find enough violating triplets to continue the training.
If we take for instance items from an electronic program guide (epg) there might be a huge variance for the category “sports”, but only few words to describe the content. Thus, it might be sufficient that a nearest neighbor search -in the feature space- returns any item with the same label to perform a successful “classification”. This has the benefit that we can capture lots of local patterns with a shallower network instead of one global pattern in case of a softmax classification that requires a network that is likely complex. The price of the model is that we cannot simply classify new items with a single forward pass, but the model is probably easier to train because of the relative similarity. Plus, if the similarity of the anchor and the positive sample is higher than the similarity of the anchor and the negative one, with respect to the margin, there is no loss at all and no gradient step is required. In case of a nll loss, for example, there is always a gradient step, even if it is very small.
Furthermore, nearest neighbors searches[arxiv:1703.03129] can be done with matrix multiplications that are very efficient these days thanks to modern CPU and GPU architectures. It is also not required to scan the whole dataset, because it is likely that we can perform a bottom up clustering of the feature space to handle near duplicates.
The next question is how to combine a triplet loss with attention to aggregate data from related items, for example items with similar tags, to group embeddings of similar words. We are using the ideas from the post to encode items with a variable number of words into a fixed length representation.