# Tagged: embedding

# PyTorch: Faster Embedding Lookups With Padding

There are quite a few helper functions when it comes to recurrent nets, but in our case we just wanted to speed up the forward step of a model that is just using Embedding layers. Maybe there is also a helper for our problem, but in any case it’s a good idea to manually implement these steps to see how it works under the hood and to learn about possible side effects. Our setup is pretty simple: We have a batch of lists that contain individual tokens and our network shall return the sum of the corresponding embeddings for each sample.

The naive implementation only works if all those token lists have the same size, otherwise we are not able to build a LongTensor:

torch.LongTensor([[0, 5, 10], [3, 33, 333]]) [okay]

torch.LongTensor([[0, 5, 10], [3, 33]]) [error]

Since this is a common problem, the nn.Embedding module of PyTorch supports padding with “padding_idx=PAD”. Whenever PAD is found in the long tensor, the output is filled with zeros:

torch.LongTensor([3, 33, PAD]):

x_3_0 ... x_3_d

x_33_0 ... x_33_d

0 ... 0

In other words, this acts like a dummy embedding that does not change the gradient because no actual parameters are used. With this approach, we are able to return the aggregated embeddings (sum) for a batch of samples with different lengths, instead of forwarding each sample separately through the network.

batch = torch.LongTensor([

[0, 5, 10, 15],

[3, 33, PAD, PAD],

[17, PAD, PAD, PAD]])

batch_emb = net(Variable(batch))

We measured the runtime for both approaches and as expected, there is a notable performance gain by using batching: naive=14863 msecs. vs. batched=8294 msecs. which is an improvement of more than 40%.

Actually there is not any magic involved and you just need to make sure that you are working with the correct axis if you perform per-sample transformations. In our case, we normalized each aggregated vector (sum) so it has a unit-norm.

As a last step, let’s go through an example: If we assume that our embed_dim is 10 and we use batch as the input to the network, we get the following output shape: (3, 4, 10) which means we have 3 samples, each with 4 embeddings and each with 10 dimensions. Now, we want to calculate the sum of the embedding for each sample in the batch: batch_emb_sum = torch.sum(batch_emb, 1) with a resulting shape of (3, 10) and finally the normalization step: batch_emb_final = batch_emb_sum / batch_emb_sum.norm(dim=-1, keepdim=True) and that’s it. Thanks to the padding, the zero vectors do not interfere with any steps, since adding zero to something does not change anything.

But we need to be careful when we use an operation that depends on the number of elements, like torch.mean since the padding changes the size of the shape. To be more concrete, if we only have one token, but three PAD entries, the shape is (4, 10) and the mean would be: torch.sum(x, 1) / 4 even if the last three entries do not hold any values. Thus, we need to re-calculate the shape if padding has been used: actual_len = #rows – #pad_rows.

# Ask Your Neighbors For Advice

Since we have a rather unusual setup, ordinary classification often delivers a performance that is not satisfying. We tried to address the issue with a large-nargin approach that uses a triplet loss to model local neighborhoods, but even this approach fails to solve our problems for some kind of data. To put it simple, some samples with a rare combination of features might not find their place in the learned embedding space and thus, probably end up at some “random” place. To guide the way of those little rascals, we introduced a memory component like in [arxiv:1703.03129].

First, the memory is filled uniformly with samples with different labels. After that for each query, we find the nearest neighbor with the same label, but also one with a different label. The idea is to ensure that memory slots with different labels are well separated in the embedding space. Since we do not backprop through the memory slots, we adjust the embedding parameters of the query to ensure the margin. This is done by a simple triplet loss:

`max(0, margin + dot(query, nn_neg) - dot(query, nn_pos))`

where all data is unit-normed.

The memory slot with the matching label is then updated:

`mem[idx_pos] = l2_norm(query + mem[idx_pos])`

where idx_pos is the position of the memory slot.

The argumentation why this helps to improve the performance is similar to the one in the paper: The additional memory helps to remember combinations of features that are rarely seen and thus is often able to infer the correct label even if the embedding has not been “settled” at all.

Furthermore, the memory can also help to improve the embedding space by concatenating embeddings of samples and slots which leads to a cleaner separation of class boundaries: h_new = hidden||nn(mem, hidden).

Still, there are quite a few questions that need to be addressed by further research:

– The more a memory slot gets updated, the more likely it is that it will be closer to an arbitrary query. This will likely lead to lots of orphaned memory slots. How can we ensure that distinct feature combinations won’t get mixed into those slots?

– Shall we introduce some some non-determinism to introduce some noise to improve the utilization of the memory (to better preserve some rare patterns)?

– Shall the memory be either a circular buffer or shall we average memory slots to convert to a stable state?

As long as there is no way to learn in an reliably, but unsupervised way from sparse input data, we believe that external memory is a promising path to pursuit. However, even with this adjustment there are still lots of challenges that need to be addressed before we can come up with a working solution.

# Challenges With Real-World Embeddings

To relate TV titles that come from the electronic program guide (EPG), we have decided to train an embedding that directly optimizes on “sentence-level” instead of just related words, like word2vec. That is in the spirit of StarSpace[arxiv:1709.03856] which is a fairly simple but approach but which is nevertheless a strong baseline.

The idea for the training is straightforward and uses the leave-out-out approach. We use either existing or manually annotated categories for a weak supervision. Then we sample N positive items from one category, and use N-1 items to predict the Nth item. Negative items are sampled from other categories. A title is encoded as the sum of all the embeddings of the words it contains: np.sum(E[word_id_list], 0) (bag-of-words). All vectors are normalized to lie on the unit-ball. Next, we combine all bag-of-words into bag-of-documents: np.mean(V, 0) where V is a matrix with #title rows and #dim columns.

The loss is the well-known triplet loss: np.maximum(0, margin – pos + neg), where pos = np.dot(bag_items, nth_item) and neg = np.dot(bag_items, neg_item) and margin is a hyper-parameter (0.2 by default). The idea is not to learn a hyperplane to separate classes, but to move related items closer and push unrelated items further away. The learning stops, when positive and negative pairs are separated by at least the given margin. The performance of such models largely depends on the sampling of positive and negative items

but this is not the concern of this post.

In contrast to titles from books, and to some degree movies, generic TV titles belonging to shows, reports or entertain, are very heterogeneous with lots of special characters and that can also include meta information. Therefore, we need a more sophisticated tokenizer to convert titles into “words”, but we also need to address the issue of rare “words”. The long-tail is always a problem for text, but in our case the domain is more like tweets with special emoticons and/or hashtags than traditional text. Throwing away

those “words” is no solution which is why we need to adjust the learning scheme.

In word2vec down-sampling of frequent words is used, but this does not really address the problem since we do not want to damp the learning signal for frequent “words”, but we want to boost the signal for rare “words”. That is why we decided to scale the gradients with the inverse frequency of the words. The procedure just requires a fixed lookup table: ID->WEIGHT, which is easy to implement.

The necessity of the procedure became obvious when we checked the result of our first model. We took an arbitrary title and used the cosine score to rank all other titles. The results looked promising, but from time to time there were outliers and we wanted to find out why. We started by removing single “words” from the offending title and repeated the ranking. We found out that the problem were often related to rare words that did not get much weight updates and thus, their position in the embedding space is something “arbitrary”. When the “word” was removed, the cosine score reduced dramatically. This also worked for other titles.

Thanks to PyTorch, the implementation of re-scaling the gradients was very easy:

def scale_grad_by_freq(parameters, lookup):

parameters = list(filter(lambda p: p.grad is not None, parameters))

for p in parameters:

g,grads = p.grad.data, g._values()

for j, i in enumerate(g._indices().view(-1)): grads[j].mul_(lookup[i])

The multiplication is done in-place and thanks to the sparse flag of the PyTorch Embedding module, we only re-scale a small subset of all embeddings. With this minor modification, a loss that involves rare “words” leads to a stronger error signal which partly compensates the fact that those “words” get fewer updates. This is not the holy grail, but a good example that a deeper understanding of a problem can minimize the level of frustration and gives you more time to enhance or tune your model.

# The Opposite of Something

When the size of the vocabulary is very large, learning embeddings with negative log-likelihood and thus, the softmax, can be pretty slow. That’s why negative sampling has been introduced. It’s very fast and can lead to competitive results while it utilizes much fewer resources. However, there is a serious drawback. Similar to the triplet loss, the performance largely depends on the generation of proper positive and negative pairs. This is nothing new, but has been recently confirmed again by an analysis reported in the StarSpace [arxiv:1709.03856] paper.

The problem is that selecting “trivial” negative candidates might result in no loss or a low loss, since those items are likely to be already well separated from the positive item. Furthermore, there is often no clear strategy what the inverse of something is. For instance, let’s assume that we have two positive items related to “cooking” and now we need one or more negative items as a contrastive force. The question is are items from “cars” better than from those in “news”? Are they more inverse? A solution could be to perform hard negative mining by finding items that clearly violate the margin and thus lead to a higher loss which means some learning occurs. But the procedure is computationally very expensive and not feasible if we have thousand or more of candidates.

So, if we restrict the norm of each embedding and not using a L2 weightdecay scheme that always pushes down the weights, the model will eventually “converge”, but we don’t know how many steps are required. In other words, often a straightforward (linear) model might suffice, but we should instead invest more time in finding clever ways to perform the positive and negative sampling step.

It is astonishing and a little sad that the issue did not find more attention in research and often, just trivial examples are given that cannot be used in real-world problems. Without a doubt the issue challenging, but since it can often decide about the performance of a model, it should be worth the time.

# Efficient Embedding Models With PyTorch

With the ability to actually see the values of tensors at each step of the computation, PyTorch is our red-hot favorite when it comes to ML frameworks. One reason is that it makes debugging so much easier. There are still some rough edges, but there is also a pretty active community that continually improves the framework and fixes existing bugs.

We recently stumbled about a paper [arxiv:1704.08384] that uses a knowledge-based memory in combination with attention and we wanted to try a similar approach to predict types for fragments of texts that often have very few tokens. The pre-processing took the most time, while the actual training and description of the model, thanks to PyTorch, was a piece of cake. Our idea can be implemented by combining some recently introduced methods and it does not require any new layer or module.

In our first approach we ignore the order of the tokens, but we are using a mask [arxiv:1612.03969] to weight individual dimensions of the embedding:

`torch.sum(E(in_words) * M(in_words), 0)`

where E, M are both matrices with shape=(#tokens, #dims). This allows us to convert an arbitrary sequence of tokens into a fixed length representation. The mask should be initialized to 1 for all entries which can be done with:

`M.weight = torch.nn.Parameter(torch.from_numpy(np.ones((#tokens, #dims))).float())`

The problem is that even if an example only references a very small subset of all tokens, the gradient update is dense which means the whole embedding matrix is updated. This problem is not limited to PyTorch, for instance, it is also present in Theano. For the latter we already described one way to fix it[1]. In PyTorch this is of course also possible, but the approach is different.

Usually a model contains an module for the embedding

`torch.nn.Embedding(#tokens, #dims)`

which leads by default to a dense gradient update. To switch to sparse gradient updates, we only have to adjust the initialization to

`torch.nn.Embedding(#tokens, #dims, sparse=True)`

and that is all.

However, in our PyTorch version the adjustment only worked with basic optimizers like Adagrad or SGD, but it refused to work with RMSprop or Adam. It seems some functionality is missing

`torch.sparse.FloatTensor' object has no attribute 'addcmul_'`

but we strongly believe that this is fixed pretty soon.

The performance gain in terms of the sparsity is pretty huge: When everything else is equal, the processing of a block took 7000 ms without sparsity, but only 950 ms with sparsity. This is an improvement of 86%.

Without the memory, the rest of the model is straightforward: First we encode the input tokens to get a fixed length vector, then we use a linear layer in combination with a softmax to predict the type.

To address the issue of unbalanced labels, we introduce a penalty that depends on the inverse frequency of the labels: log(#total / #total(y)). For example, the penalty of an almost common label is 1.17, while it is 3.66 for a rather seldom one.

In a first test, we used ~30 K tokens and five classes and we got reasonable results in less than an hour. After we finish to analyze the results, we plan to integrate the knowledge-base into the model, but this is a completely new story.

# Attention For Bag-of-Words Data

For quite some time now, attention is a very hot topic and it has been used very successfully for various problems, like translations, or captions for images. The basic idea is clever and simple: if we consider the input of a model, usually a sequence, some parts of it are likely to be more important for the problem which is usually a prediction of some kind. However, since in our domain we are not working with sequences, but sets, we require an attention mechanism for unordered data. Let’s start with an example.

We consider the domain of movies and in this particular case, we want to predict the genre from a bag-of-words input. And let the input be x=[town”, “lawman”, “explosion”, “sheriff”, “brother”, “prison”, “ranch”]. So, the question is which features are most important for the decision, or stated differently, do we really need all features for a confident prediction of the genre? For this example, we only consider very basic genres, like western, horror, scifi or romance.

Since the input data is not ordered and a prediction should therefore not depend on it, a recurrent network is not straightforward to use, which is why we use a CBOW-based model. With this method, we have an embedding matrix E that has #features rows. Usually the final representation of the input is done by aggregating all input features, either by the sum or the mean value. However, this assumes that all features equally contribute to the final prediction:

E = np.random.uniform(-1, 1, size=(#features, #dim))*scale

x = [i1, i2, i3, ..., ik]

U = E[x]

h = np.mean(U, axis=0)

Instead, we want that the model puts more focus on “relevant” aspects:

x = [i1, i2, i3, ..., ik]

U = E[x]

g = tanh(np.dot(U, v) + bias)

a = softmax(g)

h = np.sum(a * U, axis=0)

Which is in the spirit of [arxiv:1512.08756], where “v” is a vector of #dim dimensions and bias is a scalar.

With such an attention mechanism, we get a vector “a”, with a length equal to the number of input features with only positive entries such that the sum equals one, like a=[0.3, 0.6, 0.1]. Then, “h” is a weighted combination of all features:

`h = 0.3 * U[0] + 0.6 * U[1] + 0.1 * U[2]`

.

When we think of our initial example, the different weights are likely reflect the importance of a word with respect to the genre to predict. For instance, “sheriff” and “ranch” are probably more relevant for the western genre than “explosion” or “brother”, assuming that the dataset contains enough classical western movies to back this up.

Bottom line, if the input data is not ordered, it is not obvious howto learn with a recurrent model. On the other hand, bag-of-words models treat all input features equal which can hurt the performance when the importance of features is conditional. With the illustrated approach, we are able to work with variable-length data and furthermore, we use attention to re-weight portions of the input. And finally, as stated in [arxiv:1512.08756] the evaluation can be done in parallel, since a step does not depend on the previous one, unlike RNNs.

The conclusion is that we can use a simple feed-forward network in combination with attention to handle bag-of-words data in a very efficient way. The next step is to incorporate and evaluate the method into existing models to study the benefits, if any at all.

# Joint Representation Learning of Attributes and Items

Learning dense embeddings for graph-like data is still tremendously popular. For instance there is word2vec, pin2vec, node2vec, doc2vec or tweet2vec and there is no end in sight. The idea to capture semantic information of tokens in a feature representation is very versatile and despite the simplicity also very powerful. However, it is not obvious how to appropriately(!) convert items, for example a document which is a sequence of tokens, into a single representation. The average of all tokens, also the sum, works well, but does not consider the order of the tokens and also neglects other possible structural information. To be clear, our proposal does not address the whole issue but at least allows to capture the statistics of items from the dataset.

As our domain is not text, but movies, there is no clear notion of a sequence for meta data, but we can treat the problem as a bipartite graph with the items on the “left” side and the attributes on the other side. In this graph, items are not directly connected, but by meshed by common attributes. In other words, the length of the shortest path from item A to B is 2 which means A->some_node->B. A simple example is that A,B are both sci-fi movies with a common theme of {spaceship,alien} along with other individual attributes and thus they should be treated at least latently similar.

In this setting, item nodes can be seen as anchors that are used to shape the feature space by using the local neighborhood, but also by walks from a source node to an arbitrary node that is reachable from the source. The power of the embedding lies in the sampling, but for now let’s just focus on the objective: min -log(P(N(u)|u) where u is the source node and N(u) is the set of all neighbors of u. With

P(n_i|u) = exp(f(n_i)*f(u)) / sum(i, exp(f(i)*f(u))) for each neighbor n_i of N(u) with respect to u. In plain English, we want to maximize the probability to observe the neighbors N(u) for the source node u. By using the softmax, we are pushing all pairs of (n_i, u) closer together while we are pulling the other nodes apart.

This is closely related to the word2vec objective with an adapted method to generate training samples. In the original setup, we select a word from a sentence and try to predict the surround words, while we select a node from the graph and try to predict the selected neighborhood. By customizing sampling strategies for the neighborhood, we can model different aspects of the graph and thus guide the learned representation.

Bottom line, instead of learning an embedding just for the attributes, we jointly learn an embedding for movies and attributes. This combines a transductive setting, since new movies cannot be embedded without re-training, but also an inductive one, since we can at least approximate the embedding of a new movie if we know its tags.