CBOW: Issues With Sampling On Sets

Training a decent word embedding model is, thanks to publicly available software and hints from papers, no big deal. You just need lots of training data, which is usually not a problem, and reasonable hyper-parameters. Converting a text corpus into training data is also easy. For instance, the sentence “i love horror movies”, with a context size of 2, could be converted into the samples:

(love, [i, horror])
(horror, [love, movies])

This is just one way, and not the best one, but it is enough for the example. Of course, the order of the words in a sentence is not arbitrary, so we need to preserve it when sampling.
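The conversion above can be sketched in a few lines. This is only an illustration: the function name `cbow_samples` and the parameter `half_window` are made up here, and anchors without a full window are simply skipped so that the output matches the two samples listed above.

```python
def cbow_samples(tokens, half_window=1):
    """Return (anchor, context) pairs from an ordered token list.

    half_window is the number of words taken on each side of the anchor,
    so a context size of 2 corresponds to half_window=1. Anchors that do
    not have a full window on both sides are skipped (an assumption made
    to reproduce the example in the text).
    """
    samples = []
    for i in range(half_window, len(tokens) - half_window):
        left = tokens[i - half_window:i]
        right = tokens[i + 1:i + 1 + half_window]
        samples.append((tokens[i], left + right))
    return samples


print(cbow_samples("i love horror movies".split()))
# [('love', ['i', 'horror']), ('horror', ['love', 'movies'])]
```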

Now, remember that our data consists of unordered sets of words:

{i, horror, love, movies}

That means we have many more choices for building samples:

(i, [horror, love])
(i, [love, movies])
(i, [horror, movies])

and these are just the samples for the word “i”. But, similar to the NLP task, each pair of words (a,b) has a co-occurrence frequency that describes the importance of the context “b” for the anchor “a”. Thus, if we sample low-frequency context words as often as high-frequency ones, the imbalance has a noticeable impact on model quality.
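To make the two points concrete, here is a small sketch that enumerates every possible context for one anchor and counts pair co-occurrences over a collection of word sets. The helper names and the second toy document are invented for illustration and are not taken from our data.

```python
from collections import Counter
from itertools import combinations


def set_contexts(words, anchor, context_size=2):
    """All unordered contexts of the given size for one anchor word."""
    others = sorted(w for w in words if w != anchor)
    return [(anchor, list(c)) for c in combinations(others, context_size)]


def cooccurrence_counts(documents):
    """Count in how many documents each unordered word pair occurs together."""
    counts = Counter()
    for words in documents:
        # Sorting gives every pair a canonical (a, b) order with a < b.
        counts.update(combinations(sorted(set(words)), 2))
    return counts


doc = {"i", "horror", "love", "movies"}
print(set_contexts(doc, "i"))  # three possible contexts for the anchor "i"

toy_documents = [doc, {"horror", "movies"}]  # hypothetical second description
print(cooccurrence_counts(toy_documents)[("horror", "movies")])  # 2
```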

The issue can be addressed by importance sampling, by which we mean sampling proportional to the edge weights: the neighbors of an anchor “a” have weighted edges, and we select an edge by drawing a uniform random number from the interval [0, sum(“weights of edges”)] and checking which edge's sub-interval the number falls into. Stated differently, if an anchor word has two neighbors with the weights [0.9, 0.1], roughly 90% of the draws choose edge 1 and only 10% choose edge 2.
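A minimal sketch of this selection step, assuming the weights are positive numbers. The function name `sample_neighbor` is hypothetical; the cumulative-sum and bisection approach is one straightforward way to find the interval a random number falls into (Python's built-in `random.choices` with `weights=` does the same job).

```python
import bisect
import itertools
import random


def sample_neighbor(neighbors, weights, rng=random):
    """Pick one neighbor with probability proportional to its weight."""
    # Right ends of the intervals, e.g. [0.9, 0.1] -> [0.9, 1.0]
    cumulative = list(itertools.accumulate(weights))
    r = rng.uniform(0.0, cumulative[-1])
    idx = bisect.bisect_left(cumulative, r)
    # Guard against floating point edge cases at the upper bound.
    return neighbors[min(idx, len(neighbors) - 1)]


rng = random.Random(0)
draws = [sample_neighbor(["edge 1", "edge 2"], [0.9, 0.1], rng) for _ in range(10000)]
print(draws.count("edge 1") / len(draws))  # close to 0.9
```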

With this modification, we can create training samples by randomly selecting a movie description, iterating over all of its words, and choosing a context for each word with importance sampling, which requires an existing neighbor list. The procedure introduces a stochastic element because the selection of the context is non-deterministic. Altogether, we have everything we need to train a model and analyze the results.
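The procedure could look roughly like the following sketch. Several details are assumptions rather than a description of our actual code: the neighbor list is a dict of dicts with co-occurrence counts as weights, context words are restricted to words of the selected description, words are drawn without replacement, and anchors with too few candidates are skipped. All names and the toy data are hypothetical.

```python
import random


def make_samples(description, neighbors, context_size=2, rng=random):
    """Build at most one (anchor, context) sample per word of a description.

    neighbors maps a word to {neighbor: co-occurrence weight}.
    """
    samples = []
    for anchor in sorted(description):
        # Assumption: only neighbors that occur in this description qualify.
        pool = {w: c for w, c in neighbors.get(anchor, {}).items()
                if w in description}
        if len(pool) < context_size:
            continue  # not enough connected words to form a context
        context = []
        for _ in range(context_size):
            words, weights = zip(*pool.items())
            pick = rng.choices(words, weights=weights, k=1)[0]
            context.append(pick)
            del pool[pick]  # draw without replacement
        samples.append((anchor, context))
    return samples


# Toy neighbor list, invented for illustration.
toy_neighbors = {
    "horror": {"movies": 9, "love": 1, "zombie": 5},
    "movies": {"horror": 9, "i": 2},
    "love": {"horror": 1},
    "i": {"movies": 2, "love": 1},
}
descriptions = [{"i", "horror", "love", "movies"}]
rng = random.Random(0)
print(make_samples(rng.choice(descriptions), toy_neighbors, rng=rng))
```

Because the context is drawn at random, repeated calls on the same description can return different samples.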

But before we can actually start, we need to decide how to select the vocabulary. For NLP tasks, this is no big deal because, with all the publicly available text, it is usually sufficient to select words with a frequency threshold. However, our data is much sparser and its distribution does not follow that of natural language. For instance, a word that occurs five times might still have a maximal co-occurrence of 1, which means the word is useless. In other words, a word needs to be well connected before useful relational patterns can be inferred for it.

We started with all words that occur at least 10 times in the data. Then, we set the minimal co-occurrence frequency to 5 and removed all orphan words. With these settings, the training got stuck after the loss reached a certain value; stated differently, the model did not improve any further. We assumed that the value 5 was too low to learn good relations from the training data, which is why we increased the minimal co-occurrence frequency to 15 and started the training again. With the new value, the loss dropped below the old plateau, but the model still got stuck, only at a later stage of the training.
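One plausible reading of this filtering step, written as a sketch: keep words above a frequency threshold, keep only pairs above a co-occurrence threshold, and drop every word that is left without a neighbor. The function name, the per-document counting, and the toy data are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def select_vocabulary(documents, min_count=10, min_cooc=5):
    """Return (vocabulary, neighbor list) after both thresholds are applied."""
    # Word frequency, counted once per document (an assumption).
    word_counts = Counter(w for doc in documents for w in set(doc))
    frequent = {w for w, c in word_counts.items() if c >= min_count}

    # Co-occurrence counts among the frequent words only.
    pair_counts = Counter()
    for doc in documents:
        pair_counts.update(combinations(sorted(set(doc) & frequent), 2))

    neighbors = {}
    for (a, b), c in pair_counts.items():
        if c >= min_cooc:
            neighbors.setdefault(a, {})[b] = c
            neighbors.setdefault(b, {})[a] = c

    # Words without any surviving edge are orphans and are dropped.
    return set(neighbors), neighbors


# Tiny toy corpus with lowered thresholds, just to show the effect.
toy = [{"a", "b"}] * 3 + [{"a", "c"}] + [{"d"}] * 5
vocab, nb = select_vocabulary(toy, min_count=2, min_cooc=2)
print(sorted(vocab))  # ['a', 'b']; "d" is frequent but an orphan
```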

In very early experiments, we just used the co-occurrence matrix and sampled from the “top-k” neighbors of a word to create a context. In that setup the model actually converged, as indicated by a very low loss value and a visualization of the embedding. We therefore conclude that our current way of sampling training examples is still not optimal and needs more thought.
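For comparison, the “top-k” variant can be sketched as follows. How exactly the context was drawn from the top-k neighbors is not spelled out above, so the uniform draw without replacement used here is an assumption, as are the function name and the toy neighbor list.

```python
import random


def top_k_context(anchor, neighbors, k=10, context_size=2, rng=random):
    """Draw a context uniformly from the k strongest neighbors of a word."""
    # Rank by descending co-occurrence, break ties alphabetically.
    ranked = sorted(neighbors[anchor].items(), key=lambda kv: (-kv[1], kv[0]))
    top = [word for word, _ in ranked[:k]]
    return rng.sample(top, min(context_size, len(top)))


toy_neighbors = {"horror": {"movies": 9, "zombie": 7, "love": 1, "i": 1}}
print(top_k_context("horror", toy_neighbors, k=2, rng=random.Random(0)))
```

Compared with weighted sampling over all neighbors, this variant never selects weakly connected words at all, which may be one reason the two setups behave differently.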

