Conceptual Similarity With Embeddings

For quite a while, we studied methods to project movie descriptions, represented by short lists of keywords and themes, into a feature space that preserves the semantic distance between movies. We tried classic models such as auto-encoders, restricted Boltzmann machines, siamese networks and many other neural network-based models to learn such an encoding, but none of them captured enough information to learn a good feature space. The major issues are low-frequency keywords, incomplete descriptions and a data set that is likely too small. For these reasons, we shifted our focus to embedding words rather than full movies, which eventually led to a scheme that allowed us to aggregate these word embeddings into movie embeddings. We started with classical word embeddings trained with CBOW, which learns an embedding matrix and a context matrix.

To illustrate one problem of movie embeddings, let’s consider a simple example of a siamese network. The oracle that decides whether two movies are similar might be the genre. For very narrow genres, like western, such a model is likely to work, because the vocabulary of western movies is very specific. However, even then the embedding of a new movie could fail: the movie might contain too many ambiguous words, and since no further information is used, it is likely to end up closer to some non-western movies in the feature space.

To be clear, the word embedding approach does not solve all problems, but it helps to tackle some of the known ones. For instance, a word might not be very specific to the western genre, but it co-occurs relatively often with terms from this genre. Thus, the embedding of the word in the context space is close to some western-specific words. This helps a lot if a movie is encoded as a centroid by averaging the embedding vectors of all words present in it.

To be more specific, let us assume that a movie consists only of a single word “w_a” and that its context in “W_out”, i.e. the neighbors of this word, is related to western. However, in the embedding space “W”, non-western words are more similar to it. As mentioned before, a movie is represented by the average of the embeddings of its words:

W_movie[i] = 1/n * (W_out[w_1] + W_out[w_2] + ...)
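The averaging step can be sketched in a few lines of NumPy. This is only an illustration: the helper name movie_centroid and the toy values in W_out are made up, and a movie is assumed to be given as a list of row indices into the context matrix.

```python
import numpy as np

def movie_centroid(W_out, word_ids):
    """Average the context vectors of the words present in one movie."""
    return W_out[word_ids].mean(axis=0)

# Toy context matrix: 4 words, 3 dimensions (values are made up).
W_out = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

# A hypothetical movie described by the words 0, 1 and 3.
centroid = movie_centroid(W_out, [0, 1, 3])
print(centroid)
```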

To find the best matches, we use the cosine similarity of “w_a” with all movie centroids. The intuition is that instead of using the typicality of a word, we use its topicality to find related movies. In terms of documents, the analogy is to find texts that not only mention this single word, but are actually about it. This is why the context is so important: even if a word is not directly related to a genre, we still want to find movies where the word fits into the broader topic. To be fair, we drew a lot of inspiration from [arxiv:1602.01137], where a similar approach is used to find matching documents for a query.
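A minimal sketch of this matching step, assuming the query word is looked up in “W” and the movies are centroids built from “W_out”. The function name rank_movies, the two toy movies and all vector values are hypothetical and only serve to show the mechanics.

```python
import numpy as np

def rank_movies(word_vec, centroids):
    """Return movie indices sorted by cosine similarity to word_vec, plus the scores."""
    q = word_vec / np.linalg.norm(word_vec)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores), scores

# Toy 2-d matrices for a vocabulary of 4 words (values are made up).
# Word 0 plays the role of w_a, words 1 and 2 are western-specific.
W = np.array([[1.0, 0.0], [0.2, 0.8], [0.1, 0.9], [0.5, 0.5]])
W_out = np.array([[0.8, 0.3], [0.9, 0.1], [1.0, 0.2], [0.0, 1.0]])

# Two hypothetical movies: a western (words 1, 2) and something else (word 3).
movies = [[1, 2], [3]]
centroids = np.stack([W_out[ids].mean(axis=0) for ids in movies])

ranking, scores = rank_movies(W[0], centroids)
print(ranking, scores)
```

Note that the western movie ranks first although it does not contain w_a at all; only the closeness of w_a to the movie's centroid matters.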

However, as noted in the previous post, it is not sufficient to model the co-occurrence of words, which is why we use the movie itself as the context. More precisely, instead of selecting “surrounding” words and predicting the “center” word, we use all words that are present in a movie as the context and some (abstract) concept as the center. The notation, despite this slight adjustment, remains the same. In the end, we want to bring the average vector, the movie, closer to valid concepts and push it further away from invalid concepts.
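The post does not spell out the loss function, so the following is only one possible reading: a negative-sampling style logistic loss, as commonly used with word2vec, applied to the averaged movie vector. The names E_words, E_concepts and concept_loss are hypothetical, and the toy values are made up.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)

def concept_loss(E_words, E_concepts, word_ids, pos_ids, neg_ids):
    """Pull the movie vector towards valid concepts, push it away from invalid ones."""
    h = E_words[word_ids].mean(axis=0)                     # the movie as average vector
    pos = log_sigmoid(E_concepts[pos_ids] @ h).sum()       # valid concepts
    neg = log_sigmoid(-(E_concepts[neg_ids] @ h)).sum()    # invalid concepts
    return -(pos + neg)

# Toy setup (assumed): 2 words, 2 concepts, 2 dimensions.
E_words = np.array([[1.0, 0.0], [1.0, 0.0]])
E_concepts = np.array([[2.0, 0.0], [-2.0, 0.0]])

# Concept 0 is valid for the movie, concept 1 is not ...
good = concept_loss(E_words, E_concepts, [0, 1], pos_ids=[0], neg_ids=[1])
# ... and the reversed labeling for comparison.
bad = concept_loss(E_words, E_concepts, [0, 1], pos_ids=[1], neg_ids=[0])
print(good, bad)
```

The loss is small when the movie vector already points towards its valid concept and large in the reversed case, which is the behavior described above. Gradient steps on such a loss would move the word vectors accordingly.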

To see if the model actually leads to a useful feature space, we calculated the embeddings of all movies that air in the next few days, selected some random samples and analyzed their neighbors.

#1: Red Dog -> {Dr. Dolittle, Shaggy Dog, Despicable Me, Dr. Dolittle 4, Peter Pan}
#2: Police Python 357 -> {Miami Vice, Columbo, Chaos, The Son of No One}
#3: Star Trek Insurrection -> {S.T. First Contact, S.T. Motion Picture, Apollo 13}

We can clearly see that semantically related movies are grouped together. For #1, the movies are for children and family, #2 has a strong crime theme and #3 is sci-fi with a space theme. But it should be noted that the results are best effort only: if no similar movie is aired in the next days, the nearest neighbor might be very far away.
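A neighbor lookup of this kind could look roughly as follows. This is a sketch under assumptions: nearest_movies is a hypothetical helper, the centroids are toy values, and cosine similarity is assumed as the distance measure between movies.

```python
import numpy as np

def nearest_movies(centroids, query_idx, k=5):
    """Return the k most similar movies to query_idx as (index, cosine similarity)."""
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf          # exclude the query movie itself
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

# Three toy movie centroids (values are made up).
centroids = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])

neighbors = nearest_movies(centroids, query_idx=0, k=2)
print(neighbors)
```

Returning the similarity along with the index makes the best-effort caveat visible: a neighbor with a low score is still listed, but it can be recognized as a distant match.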

In a nutshell, training an embedding with a loss function that better fits the movie scenario already leads to very good results, even though the approach is quite simple. Furthermore, the model is very versatile: we can use it to predict arbitrary tags for unseen movies, and we can use the learned embedding to find semantically related neighbors.

