# movie2vec – Adapting Word Embeddings For Movies

The word2vec approach is an excellent example of a powerful method that does not come with high computational complexity. Because the algorithm does not require any labels, training data can be obtained very easily, even for non-English languages. With asynchronous SGD, such a model can be trained on a corpus with a vocabulary of millions of words without waiting weeks for good accuracy. Since there are already very good posts about the method, we do not describe it any further. Instead of focusing only on word embeddings, we use a more general approach [arxiv:1503.03578] that learns an embedding for any data that can be represented as a graph.

Over the last months, we increasingly ran into problems caused by the sparsity of the input data and by missing descriptions. Thus, we decided to shift our focus from feature-based learning to relation-based learning, and more specifically to embeddings, because pairwise relations can be modeled efficiently even with sparse input data. More precisely, we treat each movie as a vertex in a graph whose edges to other vertices (movies) model arbitrary relations between those movies. For instance, an edge could mean that two movies share a genre, a theme or keywords. However, as a first step, we decided to get to know the new method with a more lightweight model that operates on words.

The model can be briefly described as follows. The graph consists of vertices that represent plot words, and an edge states that the words of the pair (i,j) occur together in a movie. We use the inverse word frequency as weights for the edges: w[i] for (i->j) and w[j] for (j->i). This is similar to a matrix factorization of the co-occurrence matrix, but with SGD the method scales much better to larger vocabularies. In the end, the model learns an embedding for each word such that proximity in the feature space resembles the proximity of words in the original space. To convert the word vectors to movies, we start by averaging the vectors of all words that are present in a movie. This serves as a baseline for further experiments. Our base model consists of 4,450 words/vertices and 833,005 edges (* 2 to make the graph directed). With randomly drawn word samples, we checked that the embedding is reasonable (word => most similar words):

– class-reunion => faculty, reunions, campus, football-star

– possession => evil-possession, demonic-possession, supernatural-forces

– stock-car => nascar, car-racing, speed, racing, race-car

– tiger => elephant, lion, crocodile, animal
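
A lookup like the one behind the examples above can be sketched as follows. This is a minimal illustration, not our actual code: it assumes cosine similarity as the measure of closeness, and the function name `most_similar`, the toy vocabulary and the toy vectors are made up for the example.

```python
import numpy as np

def most_similar(word, vocab, vectors, top_k=4):
    """Return the top_k words closest to `word` by cosine similarity (assumed metric)."""
    index = {w: i for i, w in enumerate(vocab)}
    # Normalize the rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)
    scores = unit @ unit[index[word]]
    scores[index[word]] = -np.inf  # exclude the query word itself
    best = np.argsort(-scores)[:top_k]
    return [vocab[i] for i in best]

# Toy 2-d vectors, invented for illustration only.
vocab = ["tiger", "lion", "nascar", "race-car"]
vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
print(most_similar("tiger", vocab, vectors, top_k=1))
```

With real embeddings, `vectors` would hold one row per vocabulary word and `top_k` would control how many neighbors are listed.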

The results confirm that the embedding performs solidly for words with sufficient frequency. For some less frequent words, however, the number of edges was noticeably insufficient, which led to bizarre neighbors in the feature space. Since our training corpus is rather small, the issue can be addressed by collecting more movie meta data.

Bottom line, even with a small dataset we can learn a good model to relate words. However, in contrast to sentences, the meta data of movies has no order. This is not a problem, but it requires us to generate the training data differently. For instance, a sentence that consists of four words

s=(w1, w2, w3, w4)

implies that word w2 is adjacent to words w1 and w3 but not to w4, and therefore there is no edge between w2 and w4. In the case of a movie m={w1, w2, w3, w4}, we need to consider **all** pairs of words as edges:

(w1, w2), (w1, w3), (w1, w4), (w2, w3), (w2, w4), (w3, w4)

but otherwise the training remains the same.
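
The difference between the two ways of generating edges can be sketched in a few lines. This is only an illustration under assumptions: the helper names are hypothetical, and "word frequency" is taken here to be the number of movies a word appears in, which is one possible reading of the inverse-frequency weights.

```python
from collections import Counter
from itertools import combinations

def sentence_edges(words):
    """Ordered text: only directly neighboring words form an edge."""
    return list(zip(words, words[1:]))

def movie_edges(words):
    """Unordered meta data: every pair of distinct words forms an edge."""
    return list(combinations(sorted(set(words)), 2))

def weighted_directed_edges(movies):
    """Directed edges (i->j) weighted by the inverse frequency of the source word i.

    Assumption: frequency = number of movies that contain the word.
    """
    freq = Counter(w for words in movies for w in set(words))
    edges = {}
    for words in movies:
        for i, j in movie_edges(words):
            edges[(i, j)] = 1.0 / freq[i]  # (i->j): w[i]
            edges[(j, i)] = 1.0 / freq[j]  # (j->i): w[j]
    return edges

print(sentence_edges(["w1", "w2", "w3", "w4"]))  # 3 edges, neighbors only
print(movie_edges(["w1", "w2", "w3", "w4"]))     # all 6 pairs
```

Note that the number of edges per movie grows quadratically with the number of its words, which is why even a small vocabulary yields a fairly dense graph.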

The problem with averaging the embedding vectors is that not every word contributes equally to the concepts of a movie. Thus, we evaluated the quality of our baseline with a simple soft-max classifier without any hidden layers that predicts the genre of a movie. The first thing we noted during training was that the model converged very quickly, regardless of the chosen hyper-parameters. This is an indicator that the embedding already did a good job of disentangling various explanatory factors in the input data. To check the expressive power of the embedding, we added a single hidden layer to the model, but this did not improve the results, which suggests that the embedding is descriptive but does not capture some higher-order correlations. This is not surprising if we consider the simplicity of the method. Nevertheless, the soft-max model still delivers good performance considering its simplicity.
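
To make the baseline concrete, here is a small sketch of averaging word vectors into a movie vector and fitting a soft-max classifier without hidden layers on top. It is not our training code: the embedding values, the movies, the genre labels, the learning rate and the number of epochs are all invented for illustration, and plain batch gradient descent is used for brevity.

```python
import numpy as np

def movie_vector(words, embedding):
    """Average the vectors of all words of a movie that have an embedding."""
    known = [embedding[w] for w in words if w in embedding]
    if not known:
        dim = len(next(iter(embedding.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.5, epochs=200):
    """Linear soft-max classifier (no hidden layer), batch gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)
        grad = (P - Y) / len(X)  # gradient of the mean cross-entropy w.r.t. the logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy data, made up for the example: genre 0 = "animals", genre 1 = "racing".
embedding = {
    "tiger": np.array([1.0, 0.0]),
    "lion": np.array([0.9, 0.1]),
    "nascar": np.array([0.0, 1.0]),
    "racing": np.array([0.1, 0.9]),
}
movies = [["tiger", "lion"], ["lion"], ["nascar", "racing"], ["racing"]]
genres = np.array([0, 0, 1, 1])

X = np.stack([movie_vector(m, embedding) for m in movies])
W, b = train_softmax(X, genres, n_classes=2)
pred = softmax(X @ W + b).argmax(axis=1)
print(pred)
```

Because every word gets the same weight in `movie_vector`, a frequent but uninformative word pulls the movie vector just as strongly as a distinctive one, which is exactly the weakness of the averaging baseline.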

Even though we can now better utilize the existing data, we still need a larger training set to tackle issues like the rare word problem. Furthermore, even if the averaging of vectors provides a good baseline, it is far from being optimal and needs to be replaced with a more powerful model. Despite these issues, we are pretty happy with the results because even such a simple model is already very versatile with applications in clustering and classification.

In the end, we can say that the new direction in our research has already helped a lot to see the current issues from different perspectives, and we are positive that we can extend the basic model into a more powerful one that improves the representational power of movie vectors. At the top of our wish list is to use recurrent networks to convert a sequence of word vectors into a fixed representation of a movie.