Word embeddings are usually trained without strong supervision, which means the context is derived automatically from a sentence or from neighboring items. The training signal is discriminative, but no labels need to be derived explicitly for the samples. This makes it possible to train embeddings on arbitrary datasets, which is a clear advantage. However, with a very large vocabulary, the learned embedding might be too general for some tasks.
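To make the "no explicit labels" point concrete, here is a minimal sketch of how training pairs can be derived automatically from a sentence: each word's neighbors within a window serve as its context. The function name and window size are illustrative, not the exact pipeline used here.

```python
def context_pairs(tokens, window=2):
    """Derive (target, context) training pairs from a token sequence.

    No labels are annotated by hand: the neighbors within the window
    provide the discriminative signal (illustrative sketch).
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

Any tokenized corpus can be fed through this, which is exactly why such embeddings work on arbitrary datasets.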
For instance, consider grouping movies into virtual folders: the embedding space contains those folders, abstract tags, and words, and a movie is represented by averaging its word embeddings (CBOW-style). A naive approach would first train an embedding on co-occurrence statistics and then learn a classifier to model the relation between movies and folders. However, as demonstrated in [arxiv:1605.07891], a "global" embedding often lacks the context needed to encode the topicality of words. To borrow the example from the paper, the word "cut" has a different meaning in the global context than in the local context of taxes. The same is true for movies.
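The CBOW-style movie representation mentioned above is just mean pooling over the embeddings of a movie's words. A minimal sketch, assuming an embedding lookup matrix of shape (vocab_size, dim); the names are hypothetical:

```python
import numpy as np

def movie_vector(word_ids, embedding):
    """CBOW-style representation: average the embeddings of the words
    that describe a movie. `embedding` is a (vocab_size, dim) matrix."""
    return embedding[word_ids].mean(axis=0)
```

The resulting vector lives in the same space as the folder and tag embeddings, which is what makes a joint treatment possible.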
Since a movie can be assigned to multiple folders, ambiguities may arise, because not every word can be clearly assigned to a single topic. To some degree this is addressed by averaging all words to encode a movie, which also forms a local embedding driven by the assigned folders. In other words, if a word has different meanings depending on the folder, its interaction with words from specific folders leads to very different embeddings. Since all this information is encoded in the learned space, we must ensure that there is enough capacity to model all these relations.
Bottom line: instead of training a global embedding first and the classifiers afterwards, we jointly train the embedding and the classifiers to address the issue of local contexts.
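A toy sketch of this joint training, under simplifying assumptions (a plain linear multi-label classifier on top of the averaged word embeddings, sigmoid + binary cross-entropy, manual numpy gradients): the key point is that the gradient flows through both the classifier weights and the word embeddings, so word vectors adapt to the folders a movie is assigned to. All shapes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, F = 20, 8, 3               # vocab size, embedding dim, number of folders
E = rng.normal(0, 0.1, (V, D))   # word embeddings, trained jointly
W = rng.normal(0, 0.1, (D, F))   # folder classifier weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(word_ids, targets, lr=0.5):
    """One joint update: the BCE gradient is backpropagated through the
    classifier *and* the averaged word embeddings (illustrative sketch,
    not the exact model described in the text)."""
    global E, W
    h = E[word_ids].mean(axis=0)         # CBOW-style movie vector
    p = sigmoid(h @ W)                   # multi-label folder scores
    err = p - targets                    # dLoss/dlogits for sigmoid + BCE
    grad_W = np.outer(h, err)
    grad_h = W @ err
    W -= lr * grad_W
    E[word_ids] -= lr * grad_h / len(word_ids)   # embeddings move too
    return float(-(targets * np.log(p + 1e-9)
                   + (1 - targets) * np.log(1 - p + 1e-9)).sum())

# toy movie: words [1, 2, 3] assigned to folders 0 and 2
words, y = [1, 2, 3], np.array([1.0, 0.0, 1.0])
losses = [train_step(words, y) for _ in range(200)]
```

Because the embedding is updated against the folder labels, a word shared by several movies drifts toward a position that reflects its local, folder-specific usage rather than a single global meaning.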