Faster Training of Embedding Models With Theano

We are still working to build a larger dataset for our CBOW-based model, and also to adjust how the context for a specific word is sampled. The reason is that, in contrast to ordinary NLP datasets, we have unordered lists of features: in a set {a, b, c, d}, we cannot say that “a” is more distant from “d” than “b” is. Because of this, we currently use a very simple method that samples uniformly from the top-k neighbors of a word. The results are decent, but we definitely need more control to avoid over-emphasizing very frequent words while ignoring rare ones. First, however, we have to optimize the training procedure to handle much bigger datasets.
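To make the sampling step concrete, here is a minimal numpy sketch of drawing context words uniformly from the top-k neighbors; the function name and the neighbor ids are hypothetical, and in practice the ranked neighbor list would come from the model's own similarity scores.

```python
import numpy as np

def sample_context(neighbors, k, n_context, rng):
    """Sample context word ids uniformly from the top-k neighbors.

    neighbors: array of word ids, sorted by similarity (most similar first).
    """
    top_k = neighbors[:k]
    return rng.choice(top_k, size=n_context, replace=False)

rng = np.random.default_rng(0)
neighbors = np.array([17, 3, 42, 8, 5, 99])  # hypothetical ranked neighbor ids
context = sample_context(neighbors, k=4, n_context=2, rng=rng)
```

Since every word in the top-k is equally likely, frequent and rare neighbors are treated identically, which is exactly the lack of control mentioned above.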

The first issue is that a straightforward Theano implementation is not very fast, because of the advanced indexing required to train a lookup table. Thus, we decided to use Theano only to compute the gradients for a single training step and to take care of the updates ourselves. As a first step, we implemented simple momentum to speed up convergence, with quite astonishing results. Both experiments used exactly the same parameters and a fixed random seed; in other words, momentum with a coefficient of 0.9 was the only difference:

#samples    loss/sgd    loss/mom
  50,000      2.7553      1.2054
 100,000      2.6937      0.3228
 200,000      2.0521      0.1113

It is well known that vanilla gradient descent can start very slowly, so convergence can take a while. With momentum, the procedure is often much faster because each step gathers additional velocity as long as the direction of the gradient does not change. This helps especially in very flat regions of the parameter space: with the accumulated velocity, the procedure can “slide” through such a region much faster and hopefully reach a better one.
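The velocity build-up can be illustrated with the classical momentum update; this is a generic sketch, not the exact code we use. On a flat slope (constant gradient), each step moves further than the last, up to a terminal speed of lr/(1-mu).

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """One step of classical momentum: v accumulates speed while the
    gradient keeps pointing in the same direction."""
    v = mu * v - lr * grad
    return w + v, v

# constant gradient of 1.0 mimics a long flat slope
w, v = 0.0, 0.0
velocities = []
for _ in range(5):
    w, v = momentum_step(w, v, grad=1.0)
    velocities.append(v)
# |v| grows every step: 0.1, 0.19, 0.271, ... toward lr/(1-mu) = 1.0
```

With mu = 0 this degenerates to plain SGD, where every step on the same slope has identical length.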

In a nutshell, we define a Theano function without any “updates”, which means we provide shared variables for all parameters. In our case these are the word to predict “j”, the context words “i1”, “i2” and the negative words “n1”, “n2”, “n3”: f(j, i1, i2, n1, n2, n3) -> grad. The returned gradient is an array of six elements, one per parameter. All we have to do is update the embeddings of i1/i2 in the ‘W’ matrix and of j/n1/n2/n3 in the ‘W_out’ matrix, which is simple since all the parameters are integer indexes.
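The bookkeeping on our side then looks roughly like the following numpy sketch. The function name, the velocity matrices V/V_out and the gradient layout are assumptions for illustration; in the real setup the six gradient rows would come from the compiled Theano function.

```python
import numpy as np

def apply_update(W, W_out, V, V_out, idx, grads, lr=0.025, mu=0.9):
    """Apply per-row gradients to the embedding matrices with momentum.

    idx:   the integer ids (j, i1, i2, n1, n2, n3)
    grads: the six matching gradient rows from the Theano function
    V, V_out: velocity matrices with the same shapes as W, W_out
    """
    j, i1, i2, n1, n2, n3 = idx
    g_j, g_i1, g_i2, g_n1, g_n2, g_n3 = grads
    # context rows live in W ...
    for row, g in ((i1, g_i1), (i2, g_i2)):
        V[row] = mu * V[row] - lr * g
        W[row] += V[row]
    # ... while the predicted word and the negatives live in W_out
    for row, g in ((j, g_j), (n1, g_n1), (n2, g_n2), (n3, g_n3)):
        V_out[row] = mu * V_out[row] - lr * g
        W_out[row] += V_out[row]
```

Because only six rows are touched per step, this avoids the dense updates that make the advanced-indexing path in Theano slow.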

So, by combining momentum with the gradients obtained from Theano, we are now able to train models much faster and with a larger vocabulary. Nevertheless, the method is still a proof of concept: it is not very elegant and cannot easily be adjusted for more negative samples. But for now, our priority is building a larger corpus, since that is where the biggest gains are likely to come from.
