# Padded Word Indexes For Embeddings With Theano

We already wrote a post about how to speed up embeddings with Theano, but in that post we used a batch size of one. If you have to use mini-batches, things get a little more complicated. For instance, let’s assume that you have a network that takes the average of per-sample tags, encoded as one-hot vectors, in combination with other features.

With a batch size of one, things are easy:

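import theano

import theano.tensor as T

W = "Embedding Matrix" # placeholder for the embedding matrix, shape (number of tags, embedding size)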

i = T.ivector()

avg = T.mean(W[i], axis=0)
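To see why indexing `W` can replace the one-hot encoding, here is a small sketch in NumPy, which stands in for Theano here because the indexing syntax is the same. The matrix size and the tag IDs are made up for illustration. Averaging the selected rows gives the same result as averaging the one-hot vectors and multiplying by `W`:

```python
import numpy as np

# Hypothetical embedding matrix: 5 tags, 3 dimensions each.
rng = np.random.RandomState(0)
W = rng.rand(5, 3)

# Tag IDs of a single sample (batch size of one).
i = np.array([1, 3, 4], dtype="int32")

# Embedding lookup: pick the rows and average them.
avg_lookup = W[i].mean(axis=0)               # shape (3,)

# The same average with explicit one-hot vectors.
one_hot = np.eye(W.shape[0])[i]              # shape (3, 5)
avg_one_hot = one_hot.mean(axis=0).dot(W)    # shape (3,)

print(np.allclose(avg_lookup, avg_one_hot))
```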

But now, let’s assume that we have a mini-batch and the number of tags per sample varies.

The naive solution:

i = T.imatrix()

avg = T.mean(W[i], axis=1) # average over the tags of each sample

func = theano.function([i], avg)

won’t work with an input like “[[0], [1, 2], [1, 10, 11]]” because a matrix only supports rows of the same length.
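The restriction is easy to reproduce with NumPy, used here as a stand-in on the assumption that the `imatrix` input is fed with a rectangular integer array. A minimal sketch:

```python
import numpy as np

ragged = [[0], [1, 2], [1, 10, 11]]

try:
    np.array(ragged, dtype="int32")
    is_rectangular = True
except ValueError:
    # Rows of different lengths cannot form an int32 matrix.
    is_rectangular = False

print(is_rectangular)
```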

Thus, we need to pad all rows with a “stop token” until they have the same length: “[[#, #, 0], [1, 2, #], [1, 10, 11]]”. The most straightforward solution is to use “0” as this token and increment all IDs by one: “[[0, 0, 1], [2, 3, 0], [2, 11, 12]]”. In other words, entry “0” of the embedding is reserved for padding and won’t get any updates, so the embedding matrix needs one extra row.
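The padding step can be sketched in plain Python. The helper name `pad_indexes` and its `pad_id` parameter are hypothetical, and the sketch assumes a non-empty batch. It always pads on the right, which is fine because the position of the padding does not matter for an average:

```python
def pad_indexes(rows, pad_id=0):
    """Shift all IDs by one and right-pad each row with pad_id."""
    max_len = max(len(row) for row in rows)
    padded = []
    for row in rows:
        shifted = [idx + 1 for idx in row]  # ID 0 is reserved for padding
        padded.append(shifted + [pad_id] * (max_len - len(shifted)))
    return padded


print(pad_indexes([[0], [1, 2], [1, 10, 11]]))
# [[1, 0, 0], [2, 3, 0], [2, 11, 12]]
```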

So much for the theory, but how can we express this in Theano? Well, there are different ways and ours is very likely neither the smartest nor the fastest one, but it works! We split the calculation of the mean into a sum and a division.

Let’s assume that we have

pos_list = T.imatrix() # fed with [[0, 0, 1], [2, 3, 0], [2, 11, 12]]

Then we need a binary mask that marks the entries that are not padding tokens:

mask = (1. * (pos_list > 0))[:, :, None] #shape (n, x, 1)

Next, we “fetch” all indexed rows but we zero out the ones with padding tokens:

w = T.sum(mask * W[pos_list], axis=1) #shape W[pos_list]: (n, x, y), shape w: (n, y)

Finally, we count the non-padded indexes per row:

div = T.sum(pos_list > 0, axis=1)[:, None] # shape(n, 1)

The rest is a piece of cake:

avg_batch = w / T.maximum(1, div) #avoid div-by-zero
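The steps above can be checked numerically. The following sketch repeats them in NumPy, which stands in for Theano because the indexing and broadcasting expressions look the same. The embedding matrix and the small padded batch are made up for illustration, and the result is compared with a plain per-sample average:

```python
import numpy as np

# Hypothetical embedding matrix: 13 rows (row 0 is the padding entry), 4 dims.
rng = np.random.RandomState(0)
W = rng.rand(13, 4)

# Padded batch, IDs already shifted by one so that 0 means "padding".
pos_list = np.array([[0, 0, 1], [2, 3, 0], [2, 11, 12]], dtype="int32")

mask = (1. * (pos_list > 0))[:, :, None]      # shape (n, x, 1)
w = np.sum(mask * W[pos_list], axis=1)        # (n, x, y) summed to (n, y)
div = np.sum(pos_list > 0, axis=1)[:, None]   # shape (n, 1)
avg_batch = w / np.maximum(1, div)            # shape (n, y)

# Reference: average each sample separately, skipping the padding.
expected = np.array([W[row[row > 0]].mean(axis=0) for row in pos_list])

print(np.allclose(avg_batch, expected))
```

A row that consists only of padding ends up with an all-zero average, because its sum is zero and the `maximum` keeps the divisor at one.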

Frankly, there is no magic here and all we do is advanced indexing and reshaping. Again, we are pretty sure there are smarter ways to do this, but the performance is okay and the problem is solved, so why bother?

With this method, it is now possible to train a model with mini-batches that uses averages of embeddings as input.