
Generalizing word2vec

The idea behind the word2vec method is quite simple, but nevertheless very elegant and, as lots of recently published papers have confirmed, very versatile for a broad range of problems. Thus, a lot of problems can be transformed so that they can be solved by a generic word2vec implementation. The advantage is that we can reuse a mature and optimized method to solve various problems, but the drawback is that this might not be very flexible if we have special requirements, which can lead to kludges in the data preparation or the abuse of parameters to emulate a certain behavior.

However, if we just extract the core component of word2vec, we get a fairly generic black box for solving problems. In the skip-gram case, the input consists of a source word that is used to predict a sequence of context words. This approach also works if we don’t have actual words, but tokens that are somehow related. Especially for unordered data, like sets, where the context is not well defined, an adaptable preparation step makes a lot of sense.

For example, let’s assume that we have a set of titles and each title has a set of corresponding tags. The intuition is that if two titles share a lot of tags, they are related. In other words, the title is the source and all assigned tags form the context, which can be seen as a local neighborhood. In case we also consider tags from titles that are reachable through shared tags, we gradually move away from a local to a more global neighborhood.
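To make this concrete, here is a minimal sketch of how such (source, context) pairs could be generated; the mapping from titles to tags is made up for illustration:

# Minimal sketch: turn a title -> tags mapping into skip-gram style
# (source, context) pairs. The data is made up for illustration.
titles_to_tags = {
    "Doom": {"mars", "marines", "demon"},
    "Alien": {"spaceship", "alien", "marines"},
}

def tag_pairs(titles_to_tags):
    """Yield (source, context) pairs: the title predicts each of its tags."""
    for title, tags in titles_to_tags.items():
        for tag in tags:
            yield (title, tag)

for source, context in tag_pairs(titles_to_tags):
    print(source, "->", context)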

This has also been explored in the literature, where the local neighborhood can be described as a breadth-first search and the global one as a depth-first search. This is also related to a random walk, since it makes sense to decide stochastically which node to traverse next. For instance, we start at an arbitrary title, then we sample from the corresponding tags, then we sample from the connected titles, and so forth. The whole sequence is then the walk. In contrast to a sequence, a set is not ordered and thus we need a different notion of the window size. In [arxiv:1603.04259] it was proposed to consider all pairs in the set, or to shuffle each training sample.
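A possible random walk over such a bipartite title-tag graph could look like the following sketch, again with made-up data; the walk simply alternates between titles and tags:

import random

# Sketch of a random walk on the bipartite title <-> tag graph.
# We derive the reverse index so we can hop back from a tag to the
# titles that carry it.
titles_to_tags = {
    "Doom": {"mars", "marines", "demon"},
    "Alien": {"spaceship", "alien", "marines"},
    "Aliens": {"alien", "marines", "colony"},
}
tags_to_titles = {}
for title, tags in titles_to_tags.items():
    for tag in tags:
        tags_to_titles.setdefault(tag, set()).add(title)

def random_walk(start_title, length=6, rng=random):
    """Alternate title -> tag -> title -> ... and return the visited tokens."""
    walk = [start_title]
    node, on_title = start_title, True
    for _ in range(length - 1):
        neighbors = titles_to_tags[node] if on_title else tags_to_titles[node]
        node = rng.choice(sorted(neighbors))
        walk.append(node)
        on_title = not on_title
    return walk

print(random_walk("Doom"))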

Bottom line, word2vec can be used far beyond the field of NLP, which includes graph embedding, collaborative filtering, or personalization, just to name a few. Furthermore, in most scenarios a well-matured implementation [python:gensim.models.Word2Vec] can be used to train an embedding without the necessity to adapt a single line of code. In other cases, the input data might need to be encoded in a special way, but this is often straightforward and also does not require changing the code.
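For example, one possible encoding is to treat each title together with its tags as one “sentence” and choose a window large enough to cover the whole set. The snippet below sketches this with gensim, assuming the parameter names of gensim 4.x and made-up data:

from gensim.models import Word2Vec

# Hypothetical encoding: each title plus its tags becomes one "sentence".
# With a window at least as large as the longest set, every token ends up
# in every other token's context, roughly as proposed for unordered data.
sentences = [
    ["Doom", "mars", "marines", "demon"],
    ["Alien", "spaceship", "alien", "marines"],
]

# Parameter names follow gensim 4.x; skip-gram (sg=1) with a wide window.
model = Word2Vec(sentences, vector_size=50, window=10, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("marines"))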

Predicting The Next ‘Thing’

It is the dream of all machine learning practitioners to find a way to use the available data in combination with some unsupervised learning algorithm to train a useful representation of the data. Yes, we are drastically simplifying things here, but the point is to learn without the necessity to label the data, which is very expensive.

For example, there are tons of documents available which could be used for learning, but the problem is: what cost function do we want to optimize? In the case of word2vec and friends, we try to predict surrounding or center words without explicit labels. This works very well, but the result is an embedding of words, and besides simple aggregation methods, there is no general way to represent documents with a learned word embedding in a meaningful way. However, it is still a simple but powerful approach that can easily utilize huge amounts of unlabeled text data to learn a useful representation.

Another example is a recently published paper [arxiv:1704.01444] that also uses a large text corpus without labels, at least for the first model, to just predict the next character of a data block. So far, this is nothing new, but it is remarkable that a single unit learned to predict the sentiment of a data block. In other words, all those models learn by predicting the next “thing”, which can be, for instance, a word, a character, or some other token.

The interesting part is that such an “autoregression” model can be learned by just taking a sequence, removing the last item, and trying to predict it given the previous data. This also works for sets, but the process is not straightforward since sets are not ordered. Furthermore, it is not obvious how to select the item to predict, since there is no “previous” data.
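The following sketch illustrates the difference: for a sequence, the prefix predicts the next token, while for a set, each element in turn can play the role of the target. The helper names and data are made up:

def next_token_pairs(sequence):
    """For an ordered sequence: the prefix predicts the next token."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

def leave_one_out_pairs(items):
    """For an unordered set: each element in turn becomes the target,
    predicted from the rest of the set."""
    items = sorted(items)
    return [(tuple(x for x in items if x != target), target) for target in items]

print(next_token_pairs(["the", "quick", "brown", "fox"]))
print(leave_one_out_pairs({"mars", "marines", "demon"}))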

Bottom line, it has been demonstrated several times that it is possible to learn a good representation of data by just predicting the next token. Nevertheless, such methods are often limited to short texts, since processing longer texts requires remembering lots of data or context, especially for models based on RNNs.

However, since we are usually dealing with short descriptions of items, with the exception that we handle sets and not sequences, we adapted the method and trained a model to predict a keyword from the set, given the rest of the set, with moderate results. Despite the problems we encountered, we still believe that no (strongly) supervised model will ever be able to learn a powerful but also general representation of data. Thus, it seems a good idea to follow this research track and address the existing problems one by one, until we eventually find a method that addresses the major hurdles.

Joint Representation Learning of Attributes and Items

Learning dense embeddings for graph-like data is still tremendously popular. For instance, there is word2vec, pin2vec, node2vec, doc2vec or tweet2vec, and there is no end in sight. The idea of capturing the semantic information of tokens in a feature representation is very versatile and, despite its simplicity, also very powerful. However, it is not obvious how to appropriately(!) convert items, for example a document which is a sequence of tokens, into a single representation. The average of all tokens, or also the sum, works well, but it does not consider the order of the tokens and also neglects other possible structural information. To be clear, our proposal does not address the whole issue, but it at least allows us to capture the statistics of items from the dataset.

As our domain is not text, but movies, there is no clear notion of a sequence for meta data, but we can treat the problem as a bipartite graph with the items on the “left” side and the attributes on the other side. In this graph, items are not directly connected, but linked by common attributes. In other words, the length of the shortest path from item A to item B is 2, which means A->some_node->B. A simple example is that A and B are both sci-fi movies with a common theme of {spaceship, alien} along with other individual attributes, and thus they should be treated as at least latently similar.

In this setting, item nodes can be seen as anchors that are used to shape the feature space by using the local neighborhood, but also by walks from a source node to an arbitrary node that is reachable from the source. The power of the embedding lies in the sampling, but for now let’s just focus on the objective: min -log P(N(u)|u), where u is the source node and N(u) is the set of all neighbors of u, with
P(n_i|u) = exp(f(n_i)*f(u)) / sum(v, exp(f(v)*f(u))) for each neighbor n_i in N(u), where the sum runs over all nodes v. In plain English, we want to maximize the probability of observing the neighbors N(u) for the source node u. By using the softmax, we push each pair (n_i, u) closer together while pulling all other nodes apart.
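A small toy example of this objective, with made-up embeddings and a made-up neighborhood, could look like this in numpy; the normalization runs over all nodes, as in the formula above:

import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 6 nodes with 5-dimensional embeddings f(i); node 0 is the
# source u and nodes {1, 2} are its sampled neighborhood N(u).
f = rng.normal(size=(6, 5))
u, neighbors = 0, [1, 2]

def neg_log_prob(f, u, neighbors):
    """-log P(N(u)|u): sum of -log softmax(f(n_i) . f(u)) over the neighbors."""
    scores = f @ f[u]                                   # f(i) . f(u) for every node i
    log_softmax = scores - np.log(np.sum(np.exp(scores)))
    return -np.sum(log_softmax[neighbors])

print(neg_log_prob(f, u, neighbors))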

This is closely related to the word2vec objective, with an adapted method to generate training samples. In the original setup, we select a word from a sentence and try to predict the surrounding words, while here we select a node from the graph and try to predict the selected neighborhood. By customizing the sampling strategy for the neighborhood, we can model different aspects of the graph and thus guide the learned representation.

Bottom line, instead of learning an embedding just for the attributes, we jointly learn an embedding for movies and attributes. This combines a transductive setting, since new movies cannot be embedded without re-training, with an inductive one, since we can at least approximate the embedding of a new movie if we know its tags.

CBOW: Don’t Trust In Small Numbers

The reason why word2vec, which includes CBOW, is so successful and versatile is that it is usually trained on a huge text corpus. Therefore, a pre-trained model often suffices to work on a broad range of problems. However, in our case, the vocabulary is very special because it consists of specific plot words which rarely occur in a normal vocabulary. Thus, we are forced to train a model from scratch, which is problematic because for low-frequency words it is very hard to learn a context, or even an embedding at all.

So, the crux is that if we use a small vocabulary, let’s say 1,500 words, the model achieves a very low error and good embedding results, but it is not able to sufficiently model movies as a whole. But if we use a larger vocabulary, let’s say about 5,000 words, the training is much harder because for low-frequency words, estimating a context is often too noisy. In the end, we have the dilemma that one model is too small and the one that would be able to represent whole movies requires much more data. The solution to the problem seems obvious: we just need to collect more data. But because of the “long tail” distribution of words, it is not guaranteed that new samples contain sufficient word pairs for the weakly represented words.

Bottom line, the experiments we conducted so far confirmed that CBOW is suitable for our data, but in contrast to natural language, it is much harder to collect enough training data. However, since CBOW does not consider the order of words, we could enrich our dataset with any kind of data as long as it fits into the co-occurrence scheme. For example, we could use user-generated tags, names of actors, or any other meta data as additional “words”.

Using CBOW With Short Descriptions

The continuous bag of words, CBOW for short, is an extremely simple but still very powerful method to learn relations between words. However, since our domain is not natural language, but sets of words that describe specific content, in our case movies, we need to check whether the method is really appropriate for our data. But before we do so, we want to take a closer look at the steps of the method.

The aim of the method is to predict an arbitrary word with the help of the surrounding words, which are called the context. For instance, given the context words (“crime”, “vigilantes”), the word “police-officer” should have a high prediction probability. Intuitively this makes sense, because if all words are drawn from the same topic, other words from this topic are very likely to co-occur.

The CBOW method learns two embedding matrices. First, W, which encodes the typicality of words, like “police-officer” and “cop”, by pushing them together in the embedding space, and second, W_out, which encodes the topicality of a word, meaning that words occurring in the context of this specific word, like “police” -> (“crime”, “vigilantes”), will be close in the embedding space. To be more specific, let us consider an example. The movie Doom might have the following, but incomplete, plot words:
(“mars”, “marines”, “hell”, “demon”, “creature”)
which can easily be summarized by a human as a sci-fi, military horror movie. Therefore, we expect that “hell” and “demon” are close in the W-space, and that “marines” is close to other military-related words in the W_out-space.

So, despite the fact that we do not consider natural language, there seems to be no reason why the model should not work for the domain of movies, which are expressed as unordered word lists. To verify this, we trained a small CBOW model with a context size of 2 for a vocabulary of 1,500 feature words and an embedding dimension of 50. We used negative sampling (k=3) and gradient descent with momentum. To simplify things, we stored a sparsified co-occurrence matrix to draw negative words for a word “j”, which means that for each sampled negative pair (“j”, “i”), the word “i” is never observed together with “j” in the training set.
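To illustrate the kind of update involved, here is a toy sketch of a single CBOW step with negative sampling in numpy; the vocabulary, hyperparameters, and sampled negatives are made up, and this is not our actual training code:

import numpy as np

rng = np.random.default_rng(0)

# Toy CBOW step with negative sampling: a tiny vocabulary stands in for
# the 1,500 plot words; all sizes below are made up.
vocab = ["crime", "vigilantes", "police-officer", "mars", "demon"]
V, D, lr = len(vocab), 50, 0.05
W_in = rng.normal(scale=0.1, size=(V, D))    # "typicality" embeddings
W_out = rng.normal(scale=0.1, size=(V, D))   # "topicality" (output) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_step(context_ids, center_id, negative_ids):
    """One SGD update: the averaged context predicts the center word,
    pushing the center up and the sampled negatives down."""
    h = W_in[context_ids].mean(axis=0)                  # averaged context vector
    ids = np.array([center_id] + list(negative_ids))
    labels = np.array([1.0] + [0.0] * len(negative_ids))
    scores = sigmoid(W_out[ids] @ h)
    grad = scores - labels                              # dL/dscore per output word
    grad_h = grad @ W_out[ids]                          # gradient w.r.t. the context vector
    W_out[ids] -= lr * np.outer(grad, h)
    W_in[context_ids] -= lr * grad_h / len(context_ids)

# Context ("crime", "vigilantes") should predict "police-officer";
# negatives are words that never co-occur with the center word.
cbow_step(context_ids=[0, 1], center_id=2, negative_ids=[3, 4])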

Similar to our other embedding experiments, the results are pretty impressive, because the model was able to learn both typicality and topicality for pairs of words. However, the ability of the model to summarize a whole movie, which is done by averaging over all word embeddings, is limited. Why is that? First, a lot of descriptive words are not chosen for training because of their low frequency. Second, on average a plot list for a movie contains only 8 words, with a very high variance. Thus, it is often very challenging to put a word into the right context. For instance, “survivor” in a pirate movie has a different meaning than in a movie about an assault, but without a proper context of words, the averaging will likely fail to handle this correctly.
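The averaging step itself is trivial; a possible sketch, assuming a trained gensim model (hypothetical), could look like this:

import numpy as np

def movie_vector(model, plot_words):
    """Average the embeddings of all known plot words to represent a movie;
    out-of-vocabulary words are simply skipped."""
    vectors = [model.wv[w] for w in plot_words if w in model.wv]
    return np.mean(vectors, axis=0) if vectors else None

# Hypothetical usage with a trained gensim Word2Vec model:
# doom = movie_vector(model, ["mars", "marines", "hell", "demon", "creature"])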

To address the issue, we definitely need a larger vocabulary, but also higher-order relations like “themes” to better describe a movie as a whole. Furthermore, we plan to train an embedding model on generic movie descriptions from the EPG data to add some semantics to our search engine.