Generalizing word2vec

The idea behind the word2vec method is quite simple, yet very elegant and, as many recently published papers confirm, versatile enough for a broad range of problems. Many problems can therefore be transformed so that a generic word2vec implementation solves them. The advantage is that we can reuse a mature and optimized implementation; the drawback is that this is not very flexible if we have, for instance, special requirements, which often leads to kludges in the data preparation or to abusing parameters to emulate a certain behavior.

However, if we just extract the core component of word2vec, we get a fairly generic black box for solving such problems. In the skip-gram case, the input is a source word that is used to predict a sequence of context words. This approach also works if we don't have actual words but tokens that are somehow related. Especially for unordered data, like sets, where the context is not well defined, an adaptable preparation step makes a lot of sense.

For example, let's assume we have a set of titles and each title has a set of corresponding tags. The intuition is that two titles sharing a lot of tags are related. In other words, the title is the source and all assigned tags form the context, which can be seen as a local neighborhood. If we also consider tags from titles that are reachable through shared tags, we gradually move from a local to a more global neighborhood.
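To make this concrete, here is a minimal sketch of the preparation step, using made-up titles and tags: each title becomes the source token and each of its tags becomes one context token, yielding skip-gram training pairs.

```python
# Hypothetical data: each title is mapped to its set of tags.
titles_to_tags = {
    "title_a": {"tag_x", "tag_y"},
    "title_b": {"tag_y", "tag_z"},
}

def skipgram_pairs(title_to_tags):
    """Yield one (source, context) pair per (title, tag) edge."""
    for title, tags in title_to_tags.items():
        for tag in sorted(tags):  # sorted only to make output deterministic
            yield (title, tag)

pairs = list(skipgram_pairs(titles_to_tags))
```

Since "title_a" and "title_b" both emit a pair with "tag_y", their embeddings are pulled toward the same context vector, which is exactly the intuition that shared tags make titles related.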

This has also been explored in the literature, where the local neighborhood can be described as a breadth-first search and the global one as a depth-first search. It is also related to a random walk, since it makes sense to decide stochastically which node to traverse next. For instance, we start at an arbitrary title, sample one of its tags, then sample one of the titles connected to that tag, and so forth. The whole sequence is then the walk. In contrast to a sequence, a set is not ordered, so we need a different notion of window size; in [arxiv:1603.04259] it was proposed to consider all pairs in the set, or to shuffle each training sample.

Bottom line: word2vec can be used far beyond the field of NLP, for graph embedding, collaborative filtering, or personalization, just to name a few. Furthermore, in most scenarios a well-matured implementation [python:gensim.models.Word2Vec] can be used to train an embedding without changing a single line of its code. In other cases, the input data might need to be encoded in a special way, but this is often straightforward and likewise requires no code changes.
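One such encoding is sketched below with made-up records: titles and tags are turned into plain token lists, the "sentence" format that gensim.models.Word2Vec consumes. Prefixing the token type is an illustrative convention to keep title and tag ids from colliding in the shared vocabulary.

```python
# Hypothetical (title, tags) records to be encoded for a stock word2vec.
records = [
    ("some title", ["python", "nlp"]),
    ("another title", ["nlp", "embeddings"]),
]

# One token list per record: the title token followed by its tag tokens.
# Spaces are replaced so each title stays a single token.
sentences = [
    ["title:" + title.replace(" ", "_")] + ["tag:" + t for t in tags]
    for title, tags in records
]

# These lists can be passed unchanged to an off-the-shelf skip-gram model,
# e.g. gensim.models.Word2Vec(sentences, sg=1, min_count=1).
```

Because the model only sees opaque tokens, no change to the word2vec code itself is needed; all domain knowledge lives in this preparation step.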

