In a previous post we described a simple method to convert an arbitrary set of feature words (a movie) into a fixed-length representation. We trained an embedding of all those words, which are then averaged to get the final representation of a movie. The training was done by modeling the relations between words as a graph. The learned embedding worked very well despite the fact that the model was quite simple.
Very recently, some researchers came up with a clever idea [arxiv:1602.01137] that is based on the continuous bag-of-words (CBOW) method to match queries with documents. The idea of CBOW is to average the embeddings of the context words to predict the “center” word. For that, words from the “past” and the “future” are used, but because of the averaging, the actual order of the context words does not matter. The novelty is that they use the hidden-to-output weight matrix, W_out, which is usually discarded right after training. It should be noted that the original approach operates in the domain of text documents, which is only distantly related to our problem, but the idea is still useful for our movie domain.
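To make the CBOW step concrete, here is a minimal sketch of the forward pass with NumPy. The matrix shapes, the random toy data and the function name `cbow_probs` are illustrative assumptions, not the setup of the paper; the point is only that the context embeddings are averaged before W_out scores the vocabulary, so the context order cannot influence the result.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 4                      # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, D))       # input embedding (word -> hidden)
W_out = rng.normal(size=(V, D))   # hidden-to-output weights, kept after training

def cbow_probs(context_ids):
    """Average the input embeddings of the context words, then score
    every vocabulary word with W_out and apply a softmax. Because of
    the averaging, the order of context_ids does not matter."""
    h = W[context_ids].mean(axis=0)     # hidden layer, shape (D,)
    logits = W_out @ h                  # one score per vocabulary word
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

p = cbow_probs([3, 7])                  # probability of each word being the "center"
```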
Now it is time to talk about movies. But first, a quick reminder: a movie is represented by a set of words that somehow describe its content. All words have a frequency of one, and usually very few words are used to describe a movie. Therefore, we need to clarify how to train a CBOW model on this kind of data. Let’s start with a context size of C=2, which means every movie needs at least C + 1 = 3 words: two for the context and one to predict. For example, if a movie is represented by
– cowboy, law, gunfighter, bad-guy, revenge
we sample the word to predict, say w=cowboy, and the context, which could be c=(gunfighter, revenge). That means, given the words in c, we wish to maximize the probability of predicting the word w, and the order of the words in c does not matter.
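The sampling step above can be sketched in a few lines. This is a guess at one straightforward way to draw training pairs from an unordered word set; the function name is hypothetical and not from the post or the paper.

```python
import random

def sample_training_pair(movie_words, C=2):
    """Sample a target word w and C distinct context words from the
    remaining words. Since a movie is an unordered set of words, any
    C-subset of the other words is a valid context."""
    w = random.choice(movie_words)
    rest = [x for x in movie_words if x != w]
    return w, random.sample(rest, C)

movie = ["cowboy", "law", "gunfighter", "bad-guy", "revenge"]
w, c = sample_training_pair(movie)   # e.g. w="cowboy", c=["gunfighter", "revenge"]
```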
A central idea of [arxiv:1602.01137] is to distinguish between two cases: a document that merely mentions a specific word, and a document that is actually about that word. In our case, with very short feature vectors and no term frequencies, this distinction is not really applicable, but W_out is nevertheless very useful to better utilize the context of words. To be more specific, let us consider the two matrices: W for the input embedding and W_out for the “output” embedding. The paper uses the word “typical” for W and “topical” for W_out. For instance, if the input word is the name of a university, like “nyu”, and we use W to find the nearest neighbors, we usually find other universities, like “stanford” or “yale”. However, if we use W_out, we expect words that frequently co-occur with a university, like “department”, “faculty” or “graduate”.
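The typical/topical distinction boils down to which matrix the nearest-neighbor lookup runs against. A minimal sketch, assuming cosine similarity and randomly initialized stand-in matrices (real trained weights would be needed to actually see the typical/topical effect):

```python
import numpy as np

def nearest(word, vocab, M, k=3):
    """Return the k cosine-nearest neighbors of `word` according to the
    rows of M, where M is either W ("typical") or W_out ("topical")."""
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)  # row-normalize
    i = vocab.index(word)
    sims = Mn @ Mn[i]                                  # cosine similarity to all words
    order = np.argsort(-sims)                          # most similar first
    return [vocab[j] for j in order if j != i][:k]

vocab = ["nyu", "stanford", "yale", "department", "faculty", "graduate"]
rng = np.random.default_rng(1)
W = rng.normal(size=(len(vocab), 4))      # stand-in for the trained input embedding
W_out = rng.normal(size=(len(vocab), 4))  # stand-in for the trained output matrix
typical = nearest("nyu", vocab, W)        # with trained weights: other universities
topical = nearest("nyu", vocab, W_out)    # with trained weights: co-occurring words
```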
In the case of movies, with a very limited vocabulary, it is not obvious what typical and topical really mean. For instance, if the input word is “(police-)officer”, we would expect “cop” or “detective” to be typical, and words like “underworld”, “crime” or “gang” to be topical. But since there is no empirical data, the interpretation remains blurry. However, the experiments we have done so far back up the theory.
Bottom line: we decided to try CBOW because it is fast and agnostic to the order of the context words. With the evidence from the paper, it seems more appropriate to use W_out to model whole movie vectors, and because the experiments so far have confirmed our intuition, we plan to continue on this path.
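Building a movie vector from W_out then follows the same averaging recipe as in the previous post, just over the output rows instead of the input ones. A sketch under that assumption; `movie_vector` and the toy vocabulary are illustrative, not the actual pipeline:

```python
import numpy as np

def movie_vector(words, vocab, W_out):
    """Represent a movie by the average of the W_out rows of its words.
    Words that are not in the vocabulary are simply skipped."""
    idx = [vocab.index(w) for w in words if w in vocab]
    return W_out[idx].mean(axis=0)

vocab = ["cowboy", "law", "gunfighter", "bad-guy", "revenge"]
rng = np.random.default_rng(2)
W_out = rng.normal(size=(len(vocab), 4))  # stand-in for trained output weights
v = movie_vector(["cowboy", "gunfighter", "revenge"], vocab, W_out)
```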