When the vocabulary is very large, learning embeddings with the negative log-likelihood, and thus the softmax, can be pretty slow, because every update has to normalize over all items. That’s why negative sampling was introduced. It is very fast and can lead to competitive results while using far fewer resources. However, there is a serious drawback: similar to the triplet loss, the performance largely depends on the generation of proper positive and negative pairs. This is nothing new, but it has recently been confirmed again by an analysis reported in the StarSpace [arxiv:1709.03856] paper.
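To make the cost difference concrete, here is a minimal sketch in plain Python. It assumes dot-product scores and, for simplicity, negatives drawn uniformly at random; real implementations usually sample from a smoothed frequency distribution. The function names and the `k` parameter are illustrative and not taken from any particular library.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_nll(context, target_idx, item_vecs):
    """Full softmax loss: needs one score per item in the vocabulary."""
    scores = [dot(context, v) for v in item_vecs]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target_idx]

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def negative_sampling_loss(context, target_idx, item_vecs, k, rng):
    """Scores only the target and k sampled negatives (k + 1 dot products)."""
    negatives = []
    while len(negatives) < k:
        j = rng.randrange(len(item_vecs))  # assumption: uniform sampling
        if j != target_idx:
            negatives.append(j)
    loss = -log_sigmoid(dot(context, item_vecs[target_idx]))
    for j in negatives:
        loss -= log_sigmoid(-dot(context, item_vecs[j]))
    return loss
```

The softmax version touches every item vector per update, while the sampled version touches only `k + 1` of them, which is where the speed-up comes from. It also shows why the choice of negatives matters: they are the only contrastive signal the model ever sees.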
The problem is that selecting “trivial” negative candidates might result in zero or very low loss, since those items are likely to be well separated from the positive item already. Furthermore, there is often no clear strategy for deciding what the inverse of something is. For instance, let’s assume that we have two positive items related to “cooking” and now need one or more negative items as a contrastive force. Are items from “cars” better than items from “news”? Are they more inverse? A solution could be hard negative mining: finding items that clearly violate the margin and thus lead to a higher loss, which means some learning occurs. But the procedure is computationally very expensive and not feasible if we have thousands of candidates or more.
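The effect of trivial versus hard negatives can be illustrated with a small sketch. It assumes a dot-product similarity and a hinge loss with margin 0.2; the vectors, the margin value and the helper names are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: zero once the positive beats the negative by `margin`."""
    return max(0.0, margin - dot(anchor, positive) + dot(anchor, negative))

def mine_hard_negatives(anchor, positive, candidates, margin=0.2, k=1):
    """Return up to k (loss, candidate) pairs that violate the margin,
    largest loss first. Every candidate is scored, hence the cost."""
    scored = [(margin_loss(anchor, positive, c, margin), c) for c in candidates]
    violating = [pair for pair in scored if pair[0] > 0.0]
    violating.sort(key=lambda pair: pair[0], reverse=True)
    return violating[:k]

anchor = [1.0, 0.0]
positive = [0.9, 0.1]
trivial = [-1.0, 0.0]  # already far away: loss is 0, so no gradient
hard = [0.8, 0.3]      # scores almost as high as the positive: loss is roughly 0.1

mined = mine_hard_negatives(anchor, positive, [trivial, hard])
```

Only the hard negative survives the mining step. Note that the sketch scores every candidate for every anchor, which is exactly the cost that makes exhaustive mining impractical for large candidate sets.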
So, if we restrict the norm of each embedding and do not use an L2 weight decay scheme that always pushes the weights down, the model will eventually “converge”, but we don’t know how many steps are required. In other words, a straightforward (linear) model might often suffice, and the time is better invested in finding clever ways to perform the positive and negative sampling step.
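One common way to restrict the norm of an embedding is to rescale it after each update whenever it grows beyond a fixed radius. The following is a minimal sketch of that idea; the function name and the default radius of 1.0 are assumptions for illustration, not something prescribed above.

```python
import math

def clip_norm(vec, max_norm=1.0):
    """Rescale `vec` onto the ball of radius `max_norm` if it lies outside."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= max_norm:
        return list(vec)  # already inside the ball: leave unchanged
    scale = max_norm / norm
    return [x * scale for x in vec]
```

Unlike L2 weight decay, this leaves vectors inside the ball untouched, so the embeddings are only bounded, not constantly shrunk.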
It is astonishing and a little sad that the issue has not received more attention in research, and that often only trivial examples are given that cannot be used in real-world problems. Without a doubt the issue is challenging, but since it can often determine the performance of a model, it should be worth the time.