CBOW: Encoding Movie Descriptions As Graphs

Unlike NLP on ordinary text, where word order carries meaning, the metadata of movies is not ordered. A set like {photograph, imagination, daydream} can be shuffled without changing its meaning. However, to train a word embedding, we need an anchor word and a context from which to predict that word. Without an order there is no consistent way to define either, which is why we treat the data as a graph.

To be more precise, a set like {a,b,c} can be treated as a clique in which all nodes are pairwise connected, and the idea generalizes to larger sets. The graph is undirected, and the weight of the edge between two nodes equals the number of times those two nodes co-occur. With this notation, we can sample an anchor word uniformly from all nodes and build the context from the co-occurrence statistics of the anchor word. To do so, we draw a value uniformly at random from [0, w_sum], where w_sum is the sum of all edge weights from the anchor word to its neighbors, and pick the node whose interval of cumulative edge weights contains that value. This way a neighbor is chosen with probability proportional to how often it co-occurs with the anchor, so the sampling focuses on frequent pairs, which is likely to help capture semantics and encode them into the embedding.
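The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the experiments: the function names `build_cooccurrence` and `sample_neighbor` are made up for this example, and the graph is stored as a plain nested dictionary.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(descriptions):
    """Turn each description (a set of words) into a clique and sum the edge weights."""
    weights = defaultdict(lambda: defaultdict(int))
    for words in descriptions:
        for a, b in combinations(sorted(set(words)), 2):
            weights[a][b] += 1  # undirected graph: store both directions
            weights[b][a] += 1
    return weights


def sample_neighbor(weights, anchor, rng=random):
    """Pick a neighbor of `anchor` with probability proportional to the edge weight."""
    neighbors = list(weights[anchor].items())
    if not neighbors:
        raise ValueError(f"{anchor!r} has no neighbors")
    w_sum = sum(w for _, w in neighbors)
    r = rng.uniform(0, w_sum)
    cumulative = 0
    for node, w in neighbors:
        cumulative += w  # each node owns an interval of length w
        if r <= cumulative:
            return node
    return neighbors[-1][0]  # guard against floating-point edge cases
```

For example, with the descriptions {a,b}, {a,b} and {a,c}, the edge a-b has weight 2 and the edge a-c has weight 1, so for the anchor "a" the node "b" is drawn about twice as often as "c".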

The training procedure is then straightforward. First, we build the co-occurrence matrix C. Then, we shuffle all descriptions and iterate over them one by one. For each description, we randomly select an anchor word “a” and build the context “c” with the method described above. A negative sample “n” is selected from C, where only nodes with a frequency #(n,c) < T are considered, or stated differently, words that rarely or never occur within the context.
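One pass over the data could look like the following sketch. Several details are assumptions made for illustration and are not fixed by the description above: the context has a fixed size `context_size` and is drawn with replacement from the other words of the same description, a negative sample must satisfy #(n,c) < T for every context word, and the names `cooccurrence_counts` and `training_triples` are hypothetical. The actual embedding update is left out.

```python
import random
from collections import Counter
from itertools import combinations


def cooccurrence_counts(descriptions):
    """Co-occurrence matrix C as a sparse, symmetric Counter keyed by (word, word)."""
    counts = Counter()
    for words in descriptions:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def training_triples(descriptions, threshold=1, context_size=2, seed=0):
    """Yield (anchor, context, negative) triples for one shuffled pass over the data."""
    rng = random.Random(seed)
    counts = cooccurrence_counts(descriptions)
    vocab = sorted({w for d in descriptions for w in d})
    order = [sorted(set(d)) for d in descriptions]
    rng.shuffle(order)
    for words in order:
        if len(words) < 2:
            continue  # no context can be built from a single word
        anchor = rng.choice(words)
        others = [w for w in words if w != anchor]
        # Assumption: context words are weighted by their co-occurrence with the anchor.
        edge_weights = [counts[(anchor, o)] for o in others]
        context = rng.choices(others, weights=edge_weights, k=context_size)
        # Assumption: a negative must be rare with respect to every context word.
        candidates = [
            n for n in vocab
            if n != anchor and n not in context
            and all(counts[(n, c)] < threshold for c in context)
        ]
        if not candidates:
            continue  # no word is rare enough for this context
        yield anchor, context, rng.choice(candidates)
```

Each yielded triple would then be fed to the CBOW objective: the context predicts the anchor, and the negative sample is pushed away.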

Bottom line: by treating movie descriptions, which are sets of words, as graphs, it is possible to define a consistent method to train a word embedding on this data, even though the selection of the anchor word and the context is stochastic.

