The continuous bag of words model, CBOW for short, is an extremely simple but still very powerful method for learning relations between words. However, our domain is not natural language but sets of words that describe specific content, in our case movies, so we need to check whether the method is really appropriate for our data. Before we do that, let us take a closer look at the steps of the method.
The aim of the method is to predict a word with the help of its surrounding words, which are called the context. For instance, given the context words (“crime”, “vigilantes”), the word “police-officer” should have a high prediction probability. Intuitively this makes sense: if all words are drawn from the same topic, other words from this topic are very likely to co-occur.
The CBOW method learns two embedding matrices. The first, W, encodes typicality: words such as “police-officer” and “cop” are pushed together in the embedding space. The second, W_out, encodes the topicality of a word: words that occur in the context of a specific word, like “police” -> (“crime”, “vigilantes”), will be close in the embedding space. To be more specific, let us consider an example. The movie Doom might have the following, incomplete, plot words:
(“mars”, “marines”, “hell”, “demon”, “creature”)
which a human can easily summarize as a sci-fi military horror movie. Therefore, we expect “hell” and “demon” to be close in the W-space, and “marines” to be close to other military-related words in the W_out-space.
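To make the prediction step concrete, here is a minimal sketch of the CBOW forward pass in NumPy. It is an illustration under assumptions, not our actual training code: the toy vocabulary, the random initialization and the helper name predict_proba are made up for this example, and a full softmax is used for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary (hypothetical; the real model uses far more feature words).
vocab = ["mars", "marines", "hell", "demon", "creature", "crime", "police-officer"]
word_to_id = {w: i for i, w in enumerate(vocab)}

dim = 50  # embedding dimension
# Randomly initialized, untrained matrices; one row per word.
W = rng.normal(scale=0.1, size=(len(vocab), dim))      # used for the context words
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # used for the predicted word


def predict_proba(context_words):
    """Return a probability for every vocabulary word given the context."""
    ids = [word_to_id[w] for w in context_words]
    h = W[ids].mean(axis=0)   # average the context embeddings
    scores = W_out @ h        # one score per candidate word
    scores -= scores.max()    # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()    # softmax over the vocabulary


probs = predict_proba(["hell", "demon"])
best = vocab[int(np.argmax(probs))]  # arbitrary here, since nothing is trained yet
```

With trained matrices, a context like (“hell”, “demon”) should put most of the probability mass on words from the same topic, such as “creature”.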
So, despite the fact that we are not dealing with natural language, there seems to be no reason why the model should not work for the domain of movies, which are expressed as unordered word lists. To verify this, we trained a small CBOW model with a context size of 2, a vocabulary of 1,500 feature words and an embedding dimension of 50. We used negative sampling (k=3) and gradient descent with momentum. To simplify things, we stored a sparsified co-occurrence matrix and used it to draw the negative words for a word “j”: a word “i” qualifies as a negative only if the pair (“j”, “i”) is never observed in the training set.
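The training setup can be sketched roughly as follows. This is a simplified illustration, not our original code: the toy plot lists, the learning rate, the momentum value, the dense boolean co-occurrence matrix and the way two context words are sampled from the same movie are all assumptions made for the example. Momentum is only applied to the rows touched in a step, which is a shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 8, 50, 3   # toy vocabulary; k negatives per positive
lr, momentum = 0.05, 0.9        # hypothetical hyperparameters

W = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))
vel_W = np.zeros_like(W)        # momentum buffers
vel_out = np.zeros_like(W_out)

# Toy plot lists as word ids; cooc[j, i] is True if j and i share a movie.
plots = [[0, 1, 2, 3], [1, 2, 3], [4, 5, 6], [4, 5, 6, 7]]
cooc = np.zeros((vocab_size, vocab_size), dtype=bool)
for plot in plots:
    cooc[np.ix_(plot, plot)] = True


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def draw_negatives(target):
    """Draw k distinct words that never co-occur with the target."""
    candidates = np.flatnonzero(~cooc[target])
    return rng.choice(candidates, size=k, replace=False)


def train_step(context, target):
    """One negative-sampling update; returns the loss before the update."""
    ids = np.concatenate(([target], draw_negatives(target)))
    labels = np.zeros(k + 1)
    labels[0] = 1.0                     # first entry is the observed word
    h = W[context].mean(axis=0)         # averaged context embedding
    u = W_out[ids]                      # copy of the affected output rows
    p = sigmoid(u @ h)
    err = p - labels                    # gradient of the loss w.r.t. the scores
    grad_out = np.outer(err, h)
    grad_h = err @ u
    vel_out[ids] = momentum * vel_out[ids] - lr * grad_out
    W_out[ids] += vel_out[ids]
    vel_W[context] = momentum * vel_W[context] - lr * grad_h / len(context)
    W[context] += vel_W[context]
    return float(-np.log(p[0]) - np.log(1.0 - p[1:]).sum())


epoch_losses = []
for epoch in range(30):
    losses = []
    for plot in plots:
        for target in plot:
            others = [w for w in plot if w != target]
            context = rng.choice(others, size=2, replace=False)  # context size 2
            losses.append(train_step(context, target))
    epoch_losses.append(float(np.mean(losses)))
```

Restricting negatives to pairs that were never observed avoids punishing word pairs that do belong together, at the cost of keeping the co-occurrence information around during training.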
Similar to our other embedding experiments, the results are pretty impressive: the model was able to learn both typicality and topicality for pairs of words. However, the capability of the model to summarize a whole movie, which is done by averaging over all of its word embeddings, is limited. Why is that? First, a lot of descriptive words are not chosen for training because of their low frequency. Second, a plot list for a movie contains only 8 words on average, with a very high variance. Thus, it is often very challenging to put a word into the right context. For instance, “survivor” has a different meaning in a pirate movie than in a movie about an assault, but without a proper context of words, the averaging will likely fail to handle this correctly.
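The averaging step, and the two problems just described, can be illustrated with a small sketch. The vocabulary, the random embeddings and the helper name movie_embedding are hypothetical and only serve the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary; "treasure-map" is deliberately missing to mimic
# a descriptive word that was too rare to be selected for training.
vocab = ["pirate", "ship", "survivor", "assault", "hostage"]
word_to_id = {w: i for i, w in enumerate(vocab)}
W = rng.normal(size=(len(vocab), 50))  # stand-in for trained embeddings


def movie_embedding(plot_words):
    """Average the embeddings of all known plot words; None if none is known."""
    ids = [word_to_id[w] for w in plot_words if w in word_to_id]
    if not ids:
        return None
    return W[ids].mean(axis=0)


pirate_movie = movie_embedding(["pirate", "ship", "survivor", "treasure-map"])
assault_movie = movie_embedding(["assault", "hostage", "survivor"])
unknown_movie = movie_embedding(["treasure-map"])  # no usable word at all
```

The word “survivor” contributes exactly the same vector to both movies, and with so few words per movie the remaining words often cannot shift the average far enough to reflect the different meaning.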
To address the issue, we definitely need a larger vocabulary, but also higher-order relations like “themes” to better describe a movie as a whole. Furthermore, we plan to train an embedding model on generic movie descriptions from the EPG data to add some semantics to our search engine.