
# Attention For Bag-of-Words Data

For quite some time now, attention has been a very hot topic, and it has been used very successfully for various problems, such as translation or image captioning. The basic idea is clever and simple: if we consider the input of a model, usually a sequence, some parts of it are likely to be more important for the task, which is usually a prediction of some kind. However, since in our domain we are not working with sequences but with sets, we require an attention mechanism for unordered data. Let’s start with an example.

We consider the domain of movies, and in this particular case we want to predict the genre from a bag-of-words input. Let the input be x=[“town”, “lawman”, “explosion”, “sheriff”, “brother”, “prison”, “ranch”]. So, the question is which features are most important for the decision, or stated differently: do we really need all features for a confident prediction of the genre? For this example, we only consider very basic genres, like western, horror, scifi or romance.

Since the input data is not ordered, and a prediction should therefore not depend on any ordering, a recurrent network is not straightforward to use, which is why we use a CBOW-based model. With this method, we have an embedding matrix E that has #features rows. Usually, the final representation of the input is formed by aggregating all input features, either as the sum or the mean value. However, this assumes that all features contribute equally to the final prediction:

E = np.random.uniform(-1, 1, size=(num_features, num_dim)) * scale

x = [i1, i2, i3, ..., ik]

U = E[x]

h = np.mean(U, axis=0)

Instead, we want the model to put more focus on “relevant” aspects:

x = [i1, i2, i3, ..., ik]

U = E[x]

g = tanh(np.dot(U, v) + bias)

a = softmax(g)

h = np.sum(a[:, None] * U, axis=0)

This is in the spirit of [arxiv:1512.08756], where “v” is a vector with #dim dimensions and bias is a scalar.

With such an attention mechanism, we get a vector “a” whose length equals the number of input features, with only positive entries that sum to one, like a=[0.3, 0.6, 0.1]. Then “h” is a weighted combination of all features:

`h = 0.3 * U[0] + 0.6 * U[1] + 0.1 * U[2]`.
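Putting the pieces together, here is a minimal numpy sketch of the whole mechanism; the vocabulary size, the word IDs and the random initialization are made up for illustration, and in a real model E, v and bias would be learned:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a vector
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_features, num_dim = 50, 8           # toy vocabulary and embedding size
E = rng.uniform(-1, 1, size=(num_features, num_dim))

x = np.array([3, 17, 42])               # hypothetical word IDs of one sample
U = E[x]                                # (3, num_dim) embedded features

v = rng.uniform(-1, 1, size=num_dim)    # attention vector (learned in practice)
bias = 0.0                              # scalar bias

g = np.tanh(U @ v + bias)               # one score per input feature
a = softmax(g)                          # positive weights summing to one
h = np.sum(a[:, None] * U, axis=0)      # weighted combination of the features
```

Note that a[:, None] reshapes the weights to (3, 1) so they broadcast over the embedding dimension.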

When we think of our initial example, the different weights likely reflect the importance of a word with respect to the genre to predict. For instance, “sheriff” and “ranch” are probably more relevant for the western genre than “explosion” or “brother”, assuming that the dataset contains enough classical western movies to back this up.

Bottom line, if the input data is not ordered, it is not obvious how to learn with a recurrent model. On the other hand, bag-of-words models treat all input features equally, which can hurt performance when the importance of features is conditional. With the illustrated approach, we are able to work with variable-length data and, furthermore, we use attention to re-weight portions of the input. And finally, as stated in [arxiv:1512.08756], the evaluation can be done in parallel, since a step does not depend on the previous one, unlike in RNNs.

The conclusion is that we can use a simple feed-forward network in combination with attention to handle bag-of-words data in a very efficient way. The next step is to incorporate the method into existing models and evaluate it to study the benefits, if any.

# Joint Representation Learning of Attributes and Items

Learning dense embeddings for graph-like data is still tremendously popular. For instance, there is word2vec, pin2vec, node2vec, doc2vec or tweet2vec, and there is no end in sight. The idea of capturing the semantic information of tokens in a feature representation is very versatile and, despite its simplicity, also very powerful. However, it is not obvious how to appropriately(!) convert items, for example a document which is a sequence of tokens, into a single representation. The average of all tokens, or their sum, works well, but does not consider the order of the tokens and also neglects other possible structural information. To be clear, our proposal does not address the whole issue, but it at least allows us to capture the statistics of items from the dataset.

As our domain is not text but movies, there is no clear notion of a sequence for the meta data, but we can treat the problem as a bipartite graph with the items on the “left” side and the attributes on the other. In this graph, items are not directly connected but are meshed by common attributes. In other words, the length of the shortest path from item A to item B is 2, which means A->some_node->B. A simple example: A and B are both sci-fi movies with a common theme of {spaceship, alien} along with other individual attributes, and thus they should be treated as at least latently similar.

In this setting, item nodes can be seen as anchors that are used to shape the feature space, both through the local neighborhood and through walks from a source node to an arbitrary node that is reachable from it. The power of the embedding lies in the sampling, but for now let’s just focus on the objective: min -log P(N(u)|u), where u is the source node and N(u) is the set of all neighbors of u, with

P(n_i|u) = exp(f(n_i)*f(u)) / sum(v, exp(f(v)*f(u))) for each neighbor n_i of N(u), where the sum runs over all nodes v. In plain English, we want to maximize the probability of observing the neighbors N(u) for the source node u. By using the softmax, we push all pairs (n_i, u) closer together while pulling the other nodes apart.
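As a toy illustration, this is how P(n_i|u) and the resulting negative log-likelihood can be computed with numpy; the graph size, the neighbor list and the random embeddings f are made up, since in practice f is learned:

```python
import numpy as np

rng = np.random.default_rng(1)
num_nodes, num_dim = 20, 4
f = rng.normal(0, 0.1, size=(num_nodes, num_dim))  # node embeddings f(.)

u = 0                          # source node
N_u = [3, 7, 12]               # hypothetical neighbors N(u)

scores = f @ f[u]                        # f(i)*f(u) for every node i
probs = np.exp(scores - scores.max())
probs /= probs.sum()                     # softmax over all nodes

nll = -np.sum(np.log(probs[N_u]))        # -log P(N(u)|u) to be minimized
```

Minimizing nll raises probs for the neighbors at the expense of all other nodes, which is exactly the pushing/pulling described above.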

This is closely related to the word2vec objective, with an adapted method to generate training samples. In the original setup, we select a word from a sentence and try to predict the surrounding words, while here we select a node from the graph and try to predict the sampled neighborhood. By customizing the sampling strategies for the neighborhood, we can model different aspects of the graph and thus guide the learned representation.

Bottom line, instead of learning an embedding just for the attributes, we jointly learn an embedding for movies and attributes. This combines a transductive setting, since new movies cannot be embedded without re-training, with an inductive one, since we can at least approximate the embedding of a new movie if we know its tags.

# Padded Word Indexes For Embeddings With Theano

We already wrote a post about how to speed up embeddings with Theano, but in that post we used a batch size of one. If you have to use mini-batches, things get a little more complicated. For instance, let’s assume that you have a network that takes the average of per-sample tags, encoded as one-hot vectors, in combination with other features.

With a batch size of one, things are easy:

W = "Embedding Matrix"

i = T.ivector()

avg = T.mean(W[i], axis=0)

But now, let’s assume that we have a mini-batch and the number of tags per sample varies.

The naive solution:

i = T.imatrix()

avg = T.mean(W[i], axis=1)

func = theano.function([i], avg)

won’t work with an input like “[[0], [1, 2], [1, 10, 11]]”, because a matrix only supports rows of the same length.

Thus, we need to pad all rows with a “stop token” until they have the same length: “[[#, #, 0], [1, 2, #], [1, 10, 11]]”. The most straightforward solution is to use “0” as this token and increment all IDs by one. In other words, entry “0” of the embedding won’t get any updates: “[[0, 0, 1], [2, 3, 0], [2, 11, 12]]”.

So much for the theory, but how can we express this in Theano? Well, there are different ways, and ours is very likely neither the smartest nor the fastest one, but it works! We split the calculation of the mean into a sum part and a division part.

Let’s assume that we have

pos_list = [[0, 0, 1], [2, 3, 0], [2, 11, 12]]

Then we need a binary mask that marks the entries that are not padding tokens:

mask = (1. * (pos_list > 0))[:, :, None] #shape (n, x, 1)

Next, we “fetch” all indexed rows, but zero out the ones that correspond to padding tokens:

w = T.sum(mask * W[pos_list], axis=1) #shape W[pos_list]: (n, x, y), shape w: (n, y)

Finally, we count the non-padding indexes per row:

div = T.sum(pos_list > 0, axis=1)[:, None] # shape(n, 1)

The rest is a piece of cake:

avg_batch = w / T.maximum(1, div) #avoid div-by-zero

Frankly, there is no magic here; all we do is advanced indexing and reshaping. Again, we are pretty sure there are smarter ways to do this, but the performance is okay and the problem is solved, so why bother?

With this method, it is now possible to train a model with mini-batches that uses averages of embeddings as input.
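For reference, the same masked-mean computation can be sketched in plain numpy, which makes it easy to cross-check against a per-row loop; the matrix size is arbitrary and the IDs are the toy values from above:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(0, 0.1, size=(16, 4))   # embedding matrix, row 0 = stop token
pos_list = np.array([[0, 0, 1], [2, 3, 0], [2, 11, 12]])

mask = (pos_list > 0)[:, :, None].astype(float)  # shape (n, x, 1)
w = np.sum(mask * W[pos_list], axis=1)           # masked per-sample sums
div = np.sum(pos_list > 0, axis=1)[:, None]      # non-padding tokens per row
avg_batch = w / np.maximum(1, div)               # per-sample averages

# cross-check: average only the non-padding rows, sample by sample
expected = np.stack([W[row[row > 0]].mean(axis=0) for row in pos_list])
```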

# Efficient Embeddings With Theano

Training a word embedding model is pretty old hat, and maybe this is nothing new, but we nevertheless thought it would be a good idea to summarize our experiences to avoid possible headaches in case somebody is not so familiar with Theano. To be fair, the Theano FAQ mentions the issue, so consider this post a more detailed summary.

Let’s assume that we want to train an embedding model for words with a vocabulary size of N. Then we have a lookup matrix “W”

W = theano.shared(np.random.normal(0, 0.1, size=(N, dim)))

where dim is the number of dimensions we use for the embedding.

For the sake of simplicity, a training step consists of a pair (ref, pos) that should be moved closer together and a pair (ref, neg) that is supposed to be pulled apart. In other words, in each step we only change three rows of W.

This is a possible loss function with the update:

cost = -T.log(T.nnet.sigmoid(T.dot(W[ref], W[pos]))) - T.log(1 - T.nnet.sigmoid(T.dot(W[ref], W[neg])))

grad_W = T.grad(cost, W)

updates = [(W, W - lrate * grad_W)]

So, what is the problem here? Even if we only adjust three rows of the matrix, we calculate the gradients for the whole matrix. This is very wasteful and also awfully slow.

The solution is to use advanced indexing and only calculate the gradients for the subset of weights that are actually used in the step:

idx = T.ivector() #ref, pos, neg

W_part = W[idx]

cost = -T.log(T.nnet.sigmoid(T.dot(W_part[0], W_part[1]))) - T.log(1 - T.nnet.sigmoid(T.dot(W_part[0], W_part[2])))

grad_W_part = T.grad(cost, W_part)

updates = [(W, T.set_subtensor(W_part, W_part - lrate * grad_W_part))]

For those who are not familiar with Theano, let’s see what is going on here. The variable “idx” is a vector that holds integers, in our case ref, pos and neg, which are used to retrieve particular rows of “W”. In other words, W_part contains references to these rows and is therefore something like a view of “W”.

The trick is to call T.grad with the subset and not the whole matrix to avoid unnecessary computations, because the gradients of all rows, except the ones referenced in W_part, are zero anyway. But since we are working with a view now, we need a different way to update just the three rows of W, which can be done with set_subtensor(). First, we update W_part as usual with gradient descent, and then we use advanced indexing to replace just the referenced rows. We can think of the new update statement as:

W[ref] = W_part[0] - lrate * grad_W_part[0]

W[pos] = W_part[1] - lrate * grad_W_part[1]

W[neg] = W_part[2] - lrate * grad_W_part[2]

And that’s it. It is not really rocket science, but it requires some deeper understanding of how Theano works under the hood.
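To make the idea concrete without a Theano installation, here is a numpy sketch of a single step; the gradients of the pairwise loss are written out by hand, and the matrix size and row indices are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
N, dim, lrate = 1000, 8, 0.1
W = rng.normal(0, 0.1, size=(N, dim))
W_before = W.copy()                  # only kept to check which rows change

idx = np.array([5, 42, 77])          # ref, pos, neg
W_part = W[idx].copy()               # numpy copies; in Theano this is a view

ref, pos, neg = W_part
s_pos = sigmoid(ref @ pos)
s_neg = sigmoid(ref @ neg)
# gradients of -log(s_pos) - log(1 - s_neg) w.r.t. the three rows only
grad_ref = -(1.0 - s_pos) * pos + s_neg * neg
grad_pos = -(1.0 - s_pos) * ref
grad_neg = s_neg * ref
grad_W_part = np.stack([grad_ref, grad_pos, grad_neg])

# the set_subtensor step: write back only the three touched rows
W[idx] = W_part - lrate * grad_W_part
```

All other rows of W are never read or written, which is the whole point of the optimization.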

The lesson is that if you just use a subset of the weights during learning, you should only derive gradients for this subset. The notation might look unfamiliar at the beginning, but in case of a lookup matrix with several hundred thousand rows, we are talking about savings in the range of hours, and this is worth the little extra effort.

And last but not least, updating the whole matrix leads to very strange side-effects (also briefly mentioned in the FAQ). Let’s assume that we train an embedding model and use RMSprop as the optimizer. In each step, we update the moving average:

avg_new = 0.9 * avg + 0.1 * grad**2,

where avg_new has the same dimension as the embedding matrix. In case of a batch-size of one, most of the gradients are zero and thus, the update for a row with a zero gradient looks like this:

avg_new[i] = 0.9 * avg[i] + 0.1 * 0,

which means the average is pushed toward zero.

Therefore, even if a row was not referenced in the pairs, the dynamics of the optimizer change for all rows, regardless of whether they were referenced or not. This is true for all kinds of optimizers that keep a history of values, like momentum, Adam or Adagrad.
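A tiny numpy experiment makes the effect visible; the sizes, the decay constants and the number of steps are arbitrary:

```python
import numpy as np

avg = np.full(5, 1.0)        # RMSprop moving average for five embedding rows
grad = np.zeros(5)
grad[2] = 0.5                # only row 2 received a gradient in this step

for _ in range(50):
    avg = 0.9 * avg + 0.1 * grad**2   # naive full-matrix update
```

After these steps, the averages of the untouched rows have decayed toward zero, so their effective step size on the next real gradient is distorted, even though they were never part of any pair.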

Bottom line: while using only the subset means a huge performance gain for classical SGD, where it is not strictly required, it is mandatory for all sophisticated optimizers.

# Embeddings + Experts

The popularity of (word) embeddings has not come to an end yet. Especially in information retrieval, lots of ideas are borrowed from embeddings, now under the name ‘Neural IR’. We like the idea and also conducted several experiments with word embeddings to group items into folders, or equivalently, to predict a set of arbitrary labels. Recently, a new paper [arxiv:1608.06651] described ideas very similar to our folder approach, but instead of folders they assign documents to experts. Despite the different vocabulary, the goal is the same. They say that all experts -folders- are equally important, which means they predict every expert with a probability of 1/#experts. Instead of using a noise-contrastive/negative-sampling loss, they use the NLL loss to predict all experts (see blog). The idea of re-shaping the probability of the experts to be multi-modal instead of uni-modal is clever, but has been used before.

However, for our data, we got the impression that the learning is not stable. For instance, if we assume that a movie should be assigned to two folders, horror and scifi, we get predictions of horror=0.54, scifi=0.45 in one epoch, but horror=0.85, scifi=0.13 in the next. Stated differently, there are strong forces competing for the equilibrium, but the model is not powerful enough to satisfy all constraints simultaneously. As a result, we get predictions that vary a lot, even if the model gives high probability to the correct folders.

The next step is to find out what is going on here. Maybe we just need more training steps to allow the model to settle down? Or maybe we need more dimensions for the embedding? It is also not unlikely that we need to perform adjusted sampling to address the long tail of the feature distribution.

# Neural Pondering

We have not given up on our old dream of purely unsupervised learning, but right now we seem to have neither the right data nor a powerful enough model. However, we recently stumbled upon a research paper about neural variational inference for text, and some of its ideas caught our attention.

The idea is to encode a whole document as a latent representation, and then a softmax is used to independently generate the words the document consists of. The document is encoded as a bag-of-words with an additional integer index for the non-zero words. This sounds very much like our data: very high-dimensional, very sparse and not ordered. So, we shamelessly took the ideas from the paper and turned them into our context, but of course we give credit to the authors [arxiv:1511.06038].

Since we mainly use the softmax to ‘reconstruct’ the individual words, our new model looks a lot like CBOW, continuous bag-of-words, with the exception that a context is now used to predict not a single word, but several words. In Theano notation:

x_idx = T.ivector() # the list of word IDs

W = weight matrix of |V| x embed_dim

b0 = bias with embed_dim

R = weight matrix of embed_dim x |V|

b = bias with |V|

h_lin = T.mean(W[x_idx], axis=0) # mean of all input words

h = elu(h_lin + b0)

y_hat = T.nnet.softmax(T.dot(h, R) + b)

loss = -T.sum(T.log(y_hat[0, x_idx]))

which means we average the embeddings of all words that are present in the document, optionally with a non-linearity at the end. Then we use this combined representation of the document to maximize the probability of all present words while pushing down the values for all others.
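The Theano snippet above translates almost line by line into a numpy sketch; the vocabulary size, dimensions, word IDs and random weights are toy values, since in practice W, b0, R and b would be learned:

```python
import numpy as np

def elu(z):
    return np.where(z > 0, z, np.exp(z) - 1.0)

rng = np.random.default_rng(4)
V, embed_dim = 30, 6                    # toy vocabulary size and dimension
W = rng.normal(0, 0.1, size=(V, embed_dim))
b0 = np.zeros(embed_dim)
R = rng.normal(0, 0.1, size=(embed_dim, V))
b = np.zeros(V)

x_idx = np.array([2, 5, 9])             # word IDs present in the document

h = elu(W[x_idx].mean(axis=0) + b0)     # document representation
z = h @ R + b
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()                    # softmax over the whole vocabulary

loss = -np.sum(np.log(y_hat[x_idx]))    # NLL of all words in the document
```

Minimizing loss raises the probability of the present words, which is the multi-word reconstruction described in the text.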

Again, we do not deny a strong resemblance to CBOW, and we are pretty sure the novelty is limited, but maybe this is a stepping stone to something greater. The analysis of word neighbors in the learned space at least confirms that some semantics are definitely captured, but more work needs to be done.

# Local Embeddings

Usually, embeddings for words are trained without strong supervision, which means the context is automatically derived from a sentence or from neighbors. The signal for the training is discriminative, but there is no need to explicitly derive labels for samples. This allows training embeddings for arbitrary datasets, which is definitely an advantage. However, for a very large vocabulary, the learned embedding might be too general for some tasks.

For instance, to group movies into virtual folders, the embedding consists of those folders, abstract tags, and word embeddings that are combined by averaging to represent a movie (CBOW). A naive approach would first train an embedding using co-occurrence, and then learn some classifier to model the relation between movies and folders. However, as demonstrated in [arxiv:1605.07891], a “global” embedding often lacks the context to encode the topicality of words. To quote the example of the paper, the word “cut” has a different meaning in the global context than in the local context of taxes. The same is true for movies.

With the ability to assign a movie to multiple folders, there might be ambiguities, because not every word can be clearly assigned to a single topic. To some degree, this is addressed by averaging all words to encode a movie, which also forms a local embedding driven by the assigned folders. In other words, if a word has different meanings depending on the folder, the interactions with words from specific folders lead to very different embeddings. Since all this information is encoded in the learned space, we must ensure that we have enough capacity to model all these relations.

Bottom line, instead of training a global embedding first and then the classifiers, we jointly train the embedding and the classifiers to address the issue of local contexts.