# Tagged: embedding

# The Opposite of Something

When the size of the vocabulary is very large, learning embeddings with negative log-likelihood and thus, the softmax, can be pretty slow. That’s why negative sampling has been introduced. It’s very fast and can lead to competitive results while it utilizes much fewer resources. However, there is a serious drawback. Similar to the triplet loss, the performance largely depends on the generation of proper positive and negative pairs. This is nothing new, but has been recently confirmed again by an analysis reported in the StarSpace [arxiv:1709.03856] paper.

The problem is that selecting “trivial” negative candidates might result in no loss or a low loss, since those items are likely to be already well separated from the positive item. Furthermore, there is often no clear strategy what the inverse of something is. For instance, let’s assume that we have two positive items related to “cooking” and now we need one or more negative items as a contrastive force. The question is are items from “cars” better than from those in “news”? Are they more inverse? A solution could be to perform hard negative mining by finding items that clearly violate the margin and thus lead to a higher loss which means some learning occurs. But the procedure is computationally very expensive and not feasible if we have thousand or more of candidates.

So, if we restrict the norm of each embedding and not using a L2 weightdecay scheme that always pushes down the weights, the model will eventually “converge”, but we don’t know how many steps are required. In other words, often a straightforward (linear) model might suffice, but we should instead invest more time in finding clever ways to perform the positive and negative sampling step.

It is astonishing and a little sad that the issue did not find more attention in research and often, just trivial examples are given that cannot be used in real-world problems. Without a doubt the issue challenging, but since it can often decide about the performance of a model, it should be worth the time.

# Efficient Embedding Models With PyTorch

With the ability to actually see the values of tensors at each step of the computation, PyTorch is our red-hot favorite when it comes to ML frameworks. One reason is that it makes debugging so much easier. There are still some rough edges, but there is also a pretty active community that continually improves the framework and fixes existing bugs.

We recently stumbled about a paper [arxiv:1704.08384] that uses a knowledge-based memory in combination with attention and we wanted to try a similar approach to predict types for fragments of texts that often have very few tokens. The pre-processing took the most time, while the actual training and description of the model, thanks to PyTorch, was a piece of cake. Our idea can be implemented by combining some recently introduced methods and it does not require any new layer or module.

In our first approach we ignore the order of the tokens, but we are using a mask [arxiv:1612.03969] to weight individual dimensions of the embedding:

`torch.sum(E(in_words) * M(in_words), 0)`

where E, M are both matrices with shape=(#tokens, #dims). This allows us to convert an arbitrary sequence of tokens into a fixed length representation. The mask should be initialized to 1 for all entries which can be done with:

`M.weight = torch.nn.Parameter(torch.from_numpy(np.ones((#tokens, #dims))).float())`

The problem is that even if an example only references a very small subset of all tokens, the gradient update is dense which means the whole embedding matrix is updated. This problem is not limited to PyTorch, for instance, it is also present in Theano. For the latter we already described one way to fix it[1]. In PyTorch this is of course also possible, but the approach is different.

Usually a model contains an module for the embedding

`torch.nn.Embedding(#tokens, #dims)`

which leads by default to a dense gradient update. To switch to sparse gradient updates, we only have to adjust the initialization to

`torch.nn.Embedding(#tokens, #dims, sparse=True)`

and that is all.

However, in our PyTorch version the adjustment only worked with basic optimizers like Adagrad or SGD, but it refused to work with RMSprop or Adam. It seems some functionality is missing

`torch.sparse.FloatTensor' object has no attribute 'addcmul_'`

but we strongly believe that this is fixed pretty soon.

The performance gain in terms of the sparsity is pretty huge: When everything else is equal, the processing of a block took 7000 ms without sparsity, but only 950 ms with sparsity. This is an improvement of 86%.

Without the memory, the rest of the model is straightforward: First we encode the input tokens to get a fixed length vector, then we use a linear layer in combination with a softmax to predict the type.

To address the issue of unbalanced labels, we introduce a penalty that depends on the inverse frequency of the labels: log(#total / #total(y)). For example, the penalty of an almost common label is 1.17, while it is 3.66 for a rather seldom one.

In a first test, we used ~30 K tokens and five classes and we got reasonable results in less than an hour. After we finish to analyze the results, we plan to integrate the knowledge-base into the model, but this is a completely new story.

# Attention For Bag-of-Words Data

For quite some time now, attention is a very hot topic and it has been used very successfully for various problems, like translations, or captions for images. The basic idea is clever and simple: if we consider the input of a model, usually a sequence, some parts of it are likely to be more important for the problem which is usually a prediction of some kind. However, since in our domain we are not working with sequences, but sets, we require an attention mechanism for unordered data. Let’s start with an example.

We consider the domain of movies and in this particular case, we want to predict the genre from a bag-of-words input. And let the input be x=[town”, “lawman”, “explosion”, “sheriff”, “brother”, “prison”, “ranch”]. So, the question is which features are most important for the decision, or stated differently, do we really need all features for a confident prediction of the genre? For this example, we only consider very basic genres, like western, horror, scifi or romance.

Since the input data is not ordered and a prediction should therefore not depend on it, a recurrent network is not straightforward to use, which is why we use a CBOW-based model. With this method, we have an embedding matrix E that has #features rows. Usually the final representation of the input is done by aggregating all input features, either by the sum or the mean value. However, this assumes that all features equally contribute to the final prediction:

E = np.random.uniform(-1, 1, size=(#features, #dim))*scale

x = [i1, i2, i3, ..., ik]

U = E[x]

h = np.mean(U, axis=0)

Instead, we want that the model puts more focus on “relevant” aspects:

x = [i1, i2, i3, ..., ik]

U = E[x]

g = tanh(np.dot(U, v) + bias)

a = softmax(g)

h = np.sum(a * U, axis=0)

Which is in the spirit of [arxiv:1512.08756], where “v” is a vector of #dim dimensions and bias is a scalar.

With such an attention mechanism, we get a vector “a”, with a length equal to the number of input features with only positive entries such that the sum equals one, like a=[0.3, 0.6, 0.1]. Then, “h” is a weighted combination of all features:

`h = 0.3 * U[0] + 0.6 * U[1] + 0.1 * U[2]`

.

When we think of our initial example, the different weights are likely reflect the importance of a word with respect to the genre to predict. For instance, “sheriff” and “ranch” are probably more relevant for the western genre than “explosion” or “brother”, assuming that the dataset contains enough classical western movies to back this up.

Bottom line, if the input data is not ordered, it is not obvious howto learn with a recurrent model. On the other hand, bag-of-words models treat all input features equal which can hurt the performance when the importance of features is conditional. With the illustrated approach, we are able to work with variable-length data and furthermore, we use attention to re-weight portions of the input. And finally, as stated in [arxiv:1512.08756] the evaluation can be done in parallel, since a step does not depend on the previous one, unlike RNNs.

The conclusion is that we can use a simple feed-forward network in combination with attention to handle bag-of-words data in a very efficient way. The next step is to incorporate and evaluate the method into existing models to study the benefits, if any at all.

# Joint Representation Learning of Attributes and Items

Learning dense embeddings for graph-like data is still tremendously popular. For instance there is word2vec, pin2vec, node2vec, doc2vec or tweet2vec and there is no end in sight. The idea to capture semantic information of tokens in a feature representation is very versatile and despite the simplicity also very powerful. However, it is not obvious how to appropriately(!) convert items, for example a document which is a sequence of tokens, into a single representation. The average of all tokens, also the sum, works well, but does not consider the order of the tokens and also neglects other possible structural information. To be clear, our proposal does not address the whole issue but at least allows to capture the statistics of items from the dataset.

As our domain is not text, but movies, there is no clear notion of a sequence for meta data, but we can treat the problem as a bipartite graph with the items on the “left” side and the attributes on the other side. In this graph, items are not directly connected, but by meshed by common attributes. In other words, the length of the shortest path from item A to B is 2 which means A->some_node->B. A simple example is that A,B are both sci-fi movies with a common theme of {spaceship,alien} along with other individual attributes and thus they should be treated at least latently similar.

In this setting, item nodes can be seen as anchors that are used to shape the feature space by using the local neighborhood, but also by walks from a source node to an arbitrary node that is reachable from the source. The power of the embedding lies in the sampling, but for now let’s just focus on the objective: min -log(P(N(u)|u) where u is the source node and N(u) is the set of all neighbors of u. With

P(n_i|u) = exp(f(n_i)*f(u)) / sum(i, exp(f(i)*f(u))) for each neighbor n_i of N(u) with respect to u. In plain English, we want to maximize the probability to observe the neighbors N(u) for the source node u. By using the softmax, we are pushing all pairs of (n_i, u) closer together while we are pulling the other nodes apart.

This is closely related to the word2vec objective with an adapted method to generate training samples. In the original setup, we select a word from a sentence and try to predict the surround words, while we select a node from the graph and try to predict the selected neighborhood. By customizing sampling strategies for the neighborhood, we can model different aspects of the graph and thus guide the learned representation.

Bottom line, instead of learning an embedding just for the attributes, we jointly learn an embedding for movies and attributes. This combines a transductive setting, since new movies cannot be embedded without re-training, but also an inductive one, since we can at least approximate the embedding of a new movie if we know its tags.

# Padded Word Indexes For Embeddings With Theano

We already wrote a post about how to speed-up embeddings with Theano, but in the post, we used a batch size of one. If you have to use mini-batches, things get a little more complicated. For instance, let’s assume that you have a network that takes the average of per-sample tags, encoded as one-hot vectors, in combination with other features.

With a batch size of one, things are easy:

W = "Embedding Matrix"

i = T.ivector()

avg = T.mean(W[i], axis=0)

But now, let’s assume that we have a mini-batch and the number of tags per sample varies.

The naive solution:

i = T.imatrix()

avg = T.mean(W[i], axis=0)

func = theano.function([i], avg)

won’t work with an input like “[[0], [1, 2], [1, 10, 11]]” because a matrix does only support rows with the same length.

Thus, we need to pad all rows with a “stop token” until they have the same length: “[[#, #, 0], [1, 2, #], [1, 10, 11]]”. The most straightforward solution is to use “0” as this token and increment all IDs by one. In other words, entry “0” of the embedding won’t get any updates. “[[0, 0, 1], [2, 3, 0], [2, 10, 11]]”.

So far for the theory, but how can we express this in Theano? Well, there are different ways and ours is very likely neither the smartest nor the fastest one, but it works! We split the calculation of the mean into the sum part and the dividing part.

Let’s assume that we have

pos_list = [[0, 0, 1], [2, 3, 0], [2, 10, 11]]

Then we need a binary mask to decide what are not padding tokens:

mask = (1. * (pos_list > 0))[:, :, None] #shape (n, x, 1)

Next, we “fetch” all indexed rows but we zero out the ones with padding tokens:

w = T.sum(mask * W[pos_list], axis=1) #shape W: (n, x, y), shape w: (n, y)

Finally, we determine the non-padded indexes per row:

div = T.sum(pos_list > 0, axis=1)[:, None] # shape(n, 1)

The rest is piece of cake:

avg_batch = w / T.maximum(1, div) #avoid div-by-zero

Frankly, there is no magic here and all we do is advanced indexing and reshaping. Again, we are pretty sure there are smarter ways to do this, but the performance is okay and the problem is solved, so why bother?

With this method it is now possible train a model with mini-batches that is using averages of embeddings as input.

# Efficient Embeddings With Theano

To train a word embedding model is a pretty old hat and maybe this is nothing new, but we nevertheless thought it would be a good idea to summarize our experiences to avoid possible headaches in case somebody is not so familiar with Theano. To be fair, the FAQ of Theano is mentioning the issue, so we are using the post as a more detailed summary.

Let’s assume that we want to train an embedding model for words with a

vocabulary size of N. Then we have a lookup matrix “W”

W = theano.shared(np.random.normal(0, 0.1, size=(N, dim)))

where dim is the number of dimensions we use for the embedding.

For the sake of simplicity, a training step consists of a pair (ref, pos) that should be moved together and a pair (ref, neg) that is supposed to be pulled apart. In other words, for each step, we only change three row of W.

This is a possible loss function with the update:

cost = -T.log(T.nnet.sigmoid(T.dot(W[ref], W[pos]))) + -T.log(1 - T.nnet.sigmoid(T.dot(W[ref], W[neg])))

grad_W = T.grad(cost, W)

updates = [(W, W - lrate * grad_W)]

So, what is the problem here? Even if we only adjust three rows of the matrix, we calculate the gradients for the whole matrix. This is very wasteful and also awfully slow.

The solution is to use advanced indexing and only calculate the gradients for the subset of weights that are actually used in the step:

idx = T.ivector() #ref, pos, neg

W_part = W[idx]

cost = -T.log(T.nnet.sigmoid(T.dot(W_part[0], W_part[1]))) + -T.log(1 - T.nnet.sigmoid(T.dot(W_part[0], W_part[2])))

grad_W_part = T.grad(cost, W_part)

updates = [(W, T.set_subtensor(W_part, W_part - lrate * grad_W_part))]

For those who are not familiar with Theano, let’s see what is going on here. The variable “idx” is a vector that holds integers, in our case ref, pos, neg which are used to retrieve particular rows of “W”. In other words, W_part contains references to these rows and is therefore something like a view of “W”.

The trick is to call T.grad with the subset and not the whole matrix to avoid unnecessary computations, because the gradient of all rows, except the ones referenced in W_part, are zero anyway. But since we are working with a view now, we need a different way to just update the three rows in W which can be done with set_subtensor(). First, we are updating W_part as usual with gradient descent, but then we need to use advanced indexing to just replace the referenced rows. We can think of the new update statement as:

W[ref] = W_part[0] - lrate * grad_W_part[0]

W[pos] = W_part[1] - lrate * grad_W_part[1]

W[neg] = W_part[2] - lrate * grad_W_part[2]

And that’s it. It is not really rocket science, but it requires some deeper understanding of how Theano works under the hood.

The lesson is that if you just use a subset of weights during learning, you should only derive gradients for this subset. The notation might look unfamiliar at the beginning, but in case of a lookup matrix with several hundred thousands rows, we are talking about savings in the range of hours and this is worth the little extra effort.

And last but not least, updating the whole matrix leads to very strange side-effects (also briefly mentioned in the FAQ). Let’s assume that we train an embedding model and we use RMSprop as the optimizer. In each step, we update the moving average:

avg_new = 0.9 * avg + 0.1 * grad**2,

where avg_new has the same dimension as the embedding matrix. In case of a batch-size of one, most of the gradients are zero and thus, the update for a row with a zero gradient looks like this:

avg_new[i] = 0.9 * avg[i] + 0.1 * 0,

which means the average is pushed to zero.

Therefore, even if the row was not referenced in the pairs, the dynamic of the optimizer is changed for all rows, regardless if they were referenced or not. This is true for all kind of optimizers that keep a history of values, like momentum, adam or adagrad.

Bottom line. While using only the subset means a huge gain in performance for classical SGD, but it is not strictly required, it is mandatory for all sophisticated optimizers.

# Embeddings + Experts

The popularity of (word) embeddings has not come to an end yet. Especially for information retrieval lots of ideas from embeddings are borrowed and now called ‘Neural IR’. We like the idea and also conducted several experiments with word embeddings to group items into folders, or equivalently, to predict a set of arbitrary labels. Recently a new paper [arxiv:1608.06651] described very similar ideas to our folder approach, but instead of folders they want to assign documents to experts. But despite the different vocabulary, the goal is the same. They say that all experts -folders- are equally important which means they predict every folder with a probability of 1/#experts. Instead of using a noise-contrastive/neg sampling loss, they use the NLL loss to predict all experts (see blog). The idea to re-shape the probability of the experts to be multi-modal instead of uni-modal is clever, but has been used before.

However, for our data, we got the impression that the learning is not stable. For instance, if we assume that a movie should be assigned to two folders: horror and scifi, we get predictions of horror=0.54, scifi=0.45 one epoch, but horror=0.85, scifi=0.13 the other. Stated differently, there are strong forces that compete for the equilibrium but the model is not powerful enough to satisfy all constraints simultaneously. As a result, we get predictions that vary a lot, even if the model gives high probability to the correct folders.

The next step is to find out what is going on here. Maybe we just need more training steps to allow the model to settle down? Or maybe we need more dimensions for the embedding? It is also not unlikely that we need to perform adjusted sampling to address the issue of the

long-tail for the feature distribution.