The popularity of (word) embeddings shows no sign of fading. Especially in information retrieval, many ideas are borrowed from embeddings and are now called ‘Neural IR’. We like the idea and also conducted several experiments with word embeddings to group items into folders, or equivalently, to predict a set of arbitrary labels. Recently, a new paper [arxiv:1608.06651] described ideas very similar to our folder approach, but instead of folders they assign documents to experts. Despite the different vocabulary, the goal is the same. They treat all experts (folders) as equally important, which means each folder is predicted with a probability of 1/#experts. Instead of using a noise-contrastive/negative-sampling loss, they use the NLL loss to predict all experts (see blog). The idea of re-shaping the probability of the experts to be multi-modal instead of uni-modal is clever, but has been used before.
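To make the loss concrete, here is a minimal numpy sketch of an NLL loss against a uniform target over the assigned folders, i.e. each of the k folders gets target mass 1/k. The function name and the toy folder assignment are ours, not from the paper, and the paper's exact formulation may differ:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def uniform_nll(logits, folder_ids):
    """Cross-entropy against a uniform target over the assigned folders:
    -(1/k) * sum_i log p(folder_i), where k = len(folder_ids)."""
    p = softmax(logits)
    return -np.mean(np.log(p[folder_ids]))

# toy example: 5 folders, a movie assigned to folders 0 and 3
logits = np.array([2.0, -1.0, 0.5, 1.8, -0.3])
loss = uniform_nll(logits, [0, 3])
```

Minimizing this loss pushes the predicted distribution toward the multi-modal target, with equal mass on each assigned folder.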
However, for our data, we got the impression that the learning is not stable. For instance, if we assume that a movie should be assigned to two folders, horror and scifi, we get predictions of horror=0.54, scifi=0.45 in one epoch, but horror=0.85, scifi=0.13 in the next. Stated differently, strong forces compete for the equilibrium, but the model is not powerful enough to satisfy all constraints simultaneously. As a result, the predictions vary a lot, even though the model assigns high probability to the correct folders.
The next step is to find out what is going on here. Maybe we just need more training steps to allow the model to settle down? Or maybe we need more dimensions for the embedding? It is also not unlikely that we need to perform adjusted sampling to address the long tail of the feature distribution.
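One candidate for such adjusted sampling would be the word2vec-style subsampling heuristic, where a feature with relative frequency f is kept with probability sqrt(t/f), so head features are aggressively downsampled while tail features always survive. This is an assumption on our side, not a step we have committed to; the threshold t and the toy frequencies below are placeholders:

```python
import numpy as np

def keep_prob(freq, t=1e-4):
    """word2vec-style subsampling: keep a feature of relative
    frequency `freq` with probability sqrt(t/freq), clipped to 1.
    Frequent (head) features are thinned out, rare (tail) features
    are always kept."""
    return np.minimum(1.0, np.sqrt(t / freq))

# toy frequencies: one head feature, one mid feature, one tail feature
freqs = np.array([0.2, 1e-3, 1e-5])
probs = keep_prob(freqs)
```

During training, each occurrence of a feature would then be dropped with probability 1 - keep_prob(freq), which flattens the effective feature distribution.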