Word2vec, which includes CBOW, owes much of its success and versatility to being trained on huge text corpora; as a result, a pre-trained model often suffices for a broad range of problems. In our case, however, the vocabulary is special: it consists of specific plot words that rarely occur in ordinary text. We are therefore forced to train a model from scratch, which is problematic because for low-frequency words it is very hard to learn a useful context, or even an embedding at all.
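To make the training setup concrete, here is a minimal sketch, in plain Python, of how CBOW derives its training examples (the function name and toy plot words are our own): each target word is predicted from the words in a symmetric window around it, so a word that occurs rarely also yields very few (context, target) pairs.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context, target) training pairs as CBOW sees them.

    Each target word is predicted from up to `window` words on each
    side; word order inside the context does not matter, since CBOW
    averages the context embeddings.
    """
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

# toy "plot" of a movie, reduced to keywords
plot = ["detective", "murder", "alibi", "confession"]
for context, target in cbow_pairs(plot, window=1):
    print(target, "<-", context)
```

A word that appears only once in the whole corpus contributes only a handful of such pairs, which is exactly why its embedding is so hard to estimate.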
The crux is this: with a small vocabulary of, say, 1,500 words, the model achieves a very low error and good embeddings, but it cannot sufficiently model movies as a whole. With a larger vocabulary of about 5,000 words, training becomes much harder, because the context estimates for low-frequency words are too noisy. We end up with a dilemma: one model is too small, and the one that could represent whole movies requires much more data. The solution seems obvious, namely to collect more data, but because of the long-tail distribution of words, there is no guarantee that new samples contain enough co-occurrence pairs for the weakly represented words.
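The frequency side of this dilemma can be illustrated with a few lines of Python (the toy corpus and the threshold are made up for illustration): the usual remedy, a minimum-count cutoff, directly trades vocabulary coverage against how reliably each surviving word can be trained.

```python
from collections import Counter

def vocab_with_min_count(corpus, min_count):
    """Keep only words that occur at least `min_count` times."""
    counts = Counter(w for doc in corpus for w in doc)
    return {w for w, c in counts.items() if c >= min_count}

# toy corpus with a long-tail frequency distribution:
# "murder" dominates, most other words occur only once
corpus = [
    ["murder", "detective", "murder"],
    ["murder", "alibi", "detective"],
    ["heist", "murder", "getaway"],
]

print(len(vocab_with_min_count(corpus, 1)))  # full vocabulary
print(len(vocab_with_min_count(corpus, 2)))  # rare words dropped
```

Raising the cutoff cleans up the embeddings but shrinks the vocabulary, mirroring the 1,500-word versus 5,000-word trade-off described above.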
Bottom line, the experiments we have conducted so far confirm that CBOW is suitable for our data, but in contrast to natural language, it is much harder to collect enough training data. However, since CBOW does not consider the order of words, we could enrich our dataset with any kind of data, as long as it fits the co-occurrence framework. For example, we could use user-generated tags, names of actors, or any other metadata as additional “words”.
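Because CBOW is order-agnostic, a movie's training "sentence" could simply concatenate heterogeneous tokens. A hypothetical sketch of how tags and actor names might be folded in as additional words (the prefixes and function name are our own invention, not part of any existing pipeline):

```python
def movie_tokens(plot_words, tags=(), actors=()):
    """Build one order-agnostic token sequence per movie.

    Tags and actor names are namespaced with a prefix so they cannot
    collide with plot words; multi-word names become a single token.
    """
    tokens = list(plot_words)
    tokens += ["tag:" + t for t in tags]
    tokens += ["actor:" + a.replace(" ", "_") for a in actors]
    return tokens

print(movie_tokens(["heist", "getaway"],
                   tags=["crime"],
                   actors=["Steve McQueen"]))
```

Each metadata token then co-occurs with the plot words of its movie, giving the rare plot words extra context pairs without requiring new plot text.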