Training a word embedding model is old hat, and nothing here is really new, but we nevertheless thought it would be a good idea to summarize our experiences and spare some headaches for anybody who is not so familiar with Theano. To be fair, the Theano FAQ mentions the issue, so think of this post as a more detailed summary.
Let’s assume that we want to train an embedding model for words with a vocabulary size of N. Then we have a lookup matrix “W”:
W = theano.shared(np.random.normal(0, 0.1, size=(N, dim)))
where dim is the number of dimensions we use for the embedding. Here and below, np is numpy and T is theano.tensor.
For the sake of simplicity, a training step consists of a pair (ref, pos) that should be moved together and a pair (ref, neg) that should be pushed apart. In other words, each step only changes three rows of W.
This is a possible loss function with the update:
cost = -T.log(T.nnet.sigmoid(T.dot(W[ref], W[pos]))) + -T.log(1 - T.nnet.sigmoid(T.dot(W[ref], W[neg])))
grad_W = T.grad(cost, W)
updates = [(W, W - lrate * grad_W)]
So, what is the problem here? Even if we only adjust three rows of the matrix, we calculate the gradients for the whole matrix. This is very wasteful and also awfully slow.
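To make the waste concrete, here is a minimal NumPy sketch (not Theano) that writes out the gradient of the loss above by hand. The helper names sigmoid and dense_grad, the toy sizes and the index values are our own assumptions for illustration, and ref, pos and neg are assumed to be distinct. The gradient has the full shape of W, yet only three of its rows are nonzero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense_grad(W, ref, pos, neg):
    """Gradient of the pair loss w.r.t. the full matrix W (same shape as W)."""
    r, p, n = W[ref], W[pos], W[neg]
    g_pos = sigmoid(r @ p) - 1.0   # d/ds of -log(sigmoid(s)) at s = r.p
    g_neg = sigmoid(r @ n)         # d/ds of -log(1 - sigmoid(s)) at s = r.n
    grad = np.zeros_like(W)        # N x dim, almost entirely zeros
    grad[ref] = g_pos * p + g_neg * n
    grad[pos] = g_pos * r
    grad[neg] = g_neg * r
    return grad

rng = np.random.default_rng(0)
N, dim = 1000, 8                   # toy sizes, chosen for illustration
W = rng.normal(0, 0.1, size=(N, dim))
grad = dense_grad(W, ref=3, pos=17, neg=42)

# Indices of the rows that carry a nonzero gradient.
nonzero_rows = np.flatnonzero(np.abs(grad).sum(axis=1))
print(grad.shape, nonzero_rows)
```

With a vocabulary of several hundred thousand words, the dense update still has to allocate and subtract the whole N x dim array on every step, even though all but three rows are zero.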
The solution is to use advanced indexing and only calculate the gradients for the subset of weights that are actually used in the step:
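idx = T.ivector()  # holds the three row indices: ref, pos, neg
W_part = W[idx]  # shape (3, dim): the rows for ref, pos, neg, in that order
ref_vec, pos_vec, neg_vec = W_part[0], W_part[1], W_part[2]
cost = -T.log(T.nnet.sigmoid(T.dot(ref_vec, pos_vec))) + -T.log(1 - T.nnet.sigmoid(T.dot(ref_vec, neg_vec)))
grad_W_part = T.grad(cost, W_part)
updates = [(W, T.set_subtensor(W_part, W_part - lrate * grad_W_part))]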
For those who are not familiar with Theano, let’s see what is going on here. The variable “idx” is a vector of integers, in our case ref, pos and neg, which are used to retrieve particular rows of “W”. In other words, W_part is a (3, dim) sub-tensor whose rows W_part[0], W_part[1] and W_part[2] stand for W[ref], W[pos] and W[neg]. In the symbolic graph it stays tied to “W”, so we can think of it as something like a view of “W”.
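As a rough analogy, the same kind of indexing can be tried in plain NumPy. This is only a sketch with a made-up toy matrix and hypothetical index values. One difference is worth noting: NumPy's integer-array indexing returns a copy rather than a symbolic sub-tensor, so the write-back has to be done explicitly.

```python
import numpy as np

W = np.arange(12, dtype=float).reshape(6, 2)  # toy lookup matrix, N=6, dim=2
idx = np.array([4, 0, 5], dtype=np.int32)     # ref, pos, neg (hypothetical values)

W_part = W[idx]            # shape (3, 2): rows 4, 0, 5 in that order

# Integer-array indexing in NumPy yields a copy, so editing W_part
# leaves W untouched until the rows are written back explicitly.
W_part[0] += 100.0
unchanged = W[4].copy()    # still the original row of W

W[idx] = W_part            # explicit write-back of the three rows
```

The last line plays roughly the role that set_subtensor() has in the Theano update.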
The trick is to call T.grad with the subset and not the whole matrix to avoid unnecessary computations, because the gradient of every row except the ones referenced in W_part is zero anyway. But since we are now working with a subset, we need a different way to update just those three rows in W, which is what set_subtensor() does. First, we update W_part as usual with gradient descent; then set_subtensor() writes the result back into the referenced rows of W. We can think of the new update statement as:
W[ref] = W_part[0] - lrate * grad_W_part[0]
W[pos] = W_part[1] - lrate * grad_W_part[1]
W[neg] = W_part[2] - lrate * grad_W_part[2]
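The three assignments above can be sketched as one NumPy function. This is a hand-written illustration, not what Theano generates: sparse_sgd_step is a hypothetical helper, the gradient is derived manually from the loss in this post, and the three indices are assumed to be distinct.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_sgd_step(W, idx, lrate):
    """In-place SGD step that touches only the rows in idx = [ref, pos, neg]."""
    W_part = W[idx]                        # copy of the three rows, shape (3, dim)
    r, p, n = W_part
    g_pos = sigmoid(r @ p) - 1.0           # from -log(sigmoid(r.p))
    g_neg = sigmoid(r @ n)                 # from -log(1 - sigmoid(r.n))
    grad_W_part = np.stack([g_pos * p + g_neg * n,  # row for ref
                            g_pos * r,              # row for pos
                            g_neg * r])             # row for neg
    W[idx] = W_part - lrate * grad_W_part  # write back three rows only
    return W

rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, size=(50, 4))       # toy sizes for illustration
before = W.copy()
sparse_sgd_step(W, np.array([7, 2, 30]), lrate=0.1)

# Rows of W that differ from their values before the step.
changed = np.flatnonzero(np.any(W != before, axis=1))
print(changed)
```

The cost of a step now depends on dim but not on the vocabulary size N.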
And that’s it. It is not really rocket science, but it requires some deeper understanding of how Theano works under the hood.
The lesson is that if you only use a subset of the weights during a learning step, you should only derive gradients for this subset. The notation might look unfamiliar at first, but for a lookup matrix with several hundred thousand rows, we are talking about savings in the range of hours, and this is worth the little extra effort.
And last but not least, updating the whole matrix leads to very strange side-effects (also briefly mentioned in the FAQ). Let’s assume that we train an embedding model and we use RMSprop as the optimizer. In each step, we update the moving average:
avg_new = 0.9 * avg + 0.1 * grad**2,
where avg_new has the same shape as the embedding matrix. With a batch size of one, most of the gradients are zero, and thus the update for a row i with a zero gradient looks like this:
avg_new[i] = 0.9 * avg[i] + 0.1 * 0,
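which means the average of that row shrinks by a factor of 0.9 in every step and decays toward zero. Since RMSprop divides the gradient by the square root of this average, the row takes a much larger step than intended once it is finally referenced.
Therefore, the dynamics of the optimizer are changed for all rows, regardless of whether they were referenced in the pairs or not. This is true for optimizers that keep a decaying history of values, like momentum or Adam. (Adagrad only sums up squared gradients, so a zero gradient leaves its accumulator unchanged.)
Bottom line: for classical SGD, using only the subset means a huge gain in performance but is not strictly required; for the more sophisticated optimizers that keep such a history, it is mandatory.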
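The effect on the moving average can be reproduced with a few lines of NumPy. The sketch below uses two rows, one referenced in every step and one never referenced, and compares a dense update with one restricted to the referenced row. The constants (decay, eps, lrate, 50 steps) and the placement of eps inside the square root are illustrative assumptions, not values from a particular implementation.

```python
import numpy as np

decay, eps, lrate = 0.9, 1e-8, 0.01   # illustrative values
avg_dense = np.full(2, 1.0)           # row 0 is used every step, row 1 never
avg_sparse = avg_dense.copy()

for _ in range(50):
    grad = np.array([1.0, 0.0])       # row 1 has a zero gradient
    # Dense update: the average of every row is touched.
    avg_dense = decay * avg_dense + (1 - decay) * grad**2
    # Subset update: only the referenced row (index 0) is touched.
    avg_sparse[0] = decay * avg_sparse[0] + (1 - decay) * grad[0]**2

# Step taken by row 1 when a unit gradient finally arrives for it.
g = 1.0
step_dense = lrate * g / np.sqrt(decay * avg_dense[1] + (1 - decay) * g**2 + eps)
step_sparse = lrate * g / np.sqrt(decay * avg_sparse[1] + (1 - decay) * g**2 + eps)
print(avg_dense[1], avg_sparse[1], step_dense, step_sparse)
```

In the dense variant the average of the untouched row has decayed to 0.9**50 of its starting value, so its first real step comes out several times larger than in the subset variant, where the average was left alone.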