Learning Local Neighborhoods With NCA

Despite the simplicity, Nearest Neighbors (NN for short), is still a very powerful baseline for classification. Except for the number of neighbors, there are no parameters to tune or learn, but the accuracy depends solely on the quality of the feature representation. For domains with highly entangled representations, the results might be very disappointing. Thus, it seems reasonable to combine some neural net encoder to disentangle the input data first and then to jointly optimize the NN loss on this representation.

We experimented with a soft NN loss function [1] a while and we also stumbled again about NCA [2] (Neighborhood Component Analysis). Especially the note on the last slide [2] “This does not force all points of the same class to be similar.” caught our attention for several reasons. First, in case of clustering or softmax classifiers a single template is learned for the whole class which can be burdensome if the a single class contains a lot of variance. In this case, the network needs a sufficient depth to disentangle those aspects to learn the final template prototype that allows a linear separation of classes.

However, in case of NCA which is a soft version of NN, some arbitrary sample just need to be closer to some other sample with the same class, but not all of them. And a bit further away from a sample with a different label [2]. So, instead of wasting energy to collapsing all samples from the same class into a single point, we just need to ensure that the probability mass of all samples with the same label is maximal.

k, i = torch.arange(len(batch)), 0
dists = ((batch[i] - batch)**2).sum(dim=1)
dists[i] = 1e9 # mask out 'self'
probs = torch.softmax(-dists**2, dim=1).view(-1)
probs_leave_one_out = probs[0, (labels == labels[i]) & (k != i)]
loss = -torch.log(probs_leave_one_out.sum())

In words: We select the zero-th sample from the batch, calculating the distances to all other examples, we mask the i-th out since dist(x, x) is always zero to avoid a contribution to the softmax and then we further only consider probabilities from the same class and we also leave out the i-th example.

The classical softmax would push the whole mass to a single sample, while we are happy if the whole mass belongs to samples with same label class. This makes learning much easier, since it does not matter if a single positive sample is nearby or a dozen, as long as the negative samples are a bit more unlikely neighbors according to the softmax output.

To illustrate this with an numerical example: Let us assume that we have 6 examples where 3 are positive and three are negative. Then the all the following softmax outputs would all lead to a loss of zero

#1 [0.98, 0.02, 0.02, ~0, ~0, ~0]
#2 [0.33, 0.33, 0.33, ~0, ~0, ~0]
#3 [0.50, 0.25, 0.25, ~0, ~0, ~0]

since the sum of the positive samples is always close to one.

As stated in [2] this means the manifold can be fractured since there might bedifferent regions for the same label that are not connected. But frankly we do not care, as long as any new example is mapped into a region where at least one correct example, with respect to the label, is present and any sample with a different label is just a bit more unlikely.

A drawback of NCA is that the optimal solution requires all training examples which does not scale due to the quadratic complexity to calculate all distances. The problem is that if there exist a neighbor for some input x_i that would preserve the local neighborhood of x_i, but there is no such neighbor in the (mini-)batch, the neighborhood will be either distorted or even destroyed. A solution for the problem is memory + momentum encoders, like [3], where it is possible to scan much more samples to find nearest neighbors that might preserve the local neighborhood. For those samples, the gradient is not derived which means it is only propagated through the mini-batch data and the momentum samples are considered as constants.

Bottom line, the strength of NCA that it is often able to decompose categories into sub-categories, even if the labels are already fine-grained. This works, because the network is not forced to summarize a category with a single prototype template, at the cost that an inference step now requires to determine the k nearest neighbors from memory.

[1] raberrytv.wordpress.com/2019/07/06/implementing-loss-functions/
[2] CSC2535 Lecture 11: “Non-linear Dimensionality Reduction”, Hinton – Toronto
[3] Improving Generalization via Scalable Neighborhood Component Analysis

Simple Unsupervised Learning of Contextiualized Bag-of-Word Representations


The current setup we are working with is more information retrieval than NLP, but of course we also work with textual data. But in contrast to most NLP problems, we treat input not as ordered sequences, but as sets or bag-of-words. This is no problem, since we neither generate output that needs to be meaningful in a grammatical sense, nor do we have any problems were a swapped pair of words would be problematical in any way. Why do we mention this? For the popular Transformer architecture there is a position embedding which is exactly required to turn the input that is a set into a sequence but this is exactly what we do not want. However, without the position embedding, the token can be only used once, since otherwise it would degenerate into the exact same representation.

Long story short, what we want and need is an encoder that turns a variable-length set into a fixed representation that is supposed to summarized the ‘relevant’ parts of the set. Of course the relevance is subjective and is defined by the task and or loss function we use to train the representation. In other words, what we want is a simple model to derive a ‘sentence’ representation but without taking the order of the words in consideration. Furthermore, we do not want to use any supervision or labels, since it is very hard work to keep a dataset up-to-date so it is clean, without too much errors or contradictions, not to mention the time that is required to create one!


We could use the masked language modelling (MLM) approach, but as we mentioned before, without the position embedding, we cannot use multiple tokens which is a show stopper. We also do not want to use any auto-regressive approach since those approaches usually induce a high computational complexity. So, what’s left? We could use contrastive learning but then we need (lots of) negative samples and depending on the task there might be a good chance that a negative sample is also related but nevertheless pushed away from the ‘anchor’. Plus, for some setups a lot of negative samples are required which also induces a high computational burden, even with very efficient matrix multiplications.

So, is there something else? Yes, there is a simple approach that we described in [1] and the beauty of the method is that we do _not_ need any negative samples. All we need is function that returns different views for the same sample which is easy for images, but a bit harder for text. But instead of spending too much time to come up with a (not so) clever idea, we tried a the following baseline first.

To turn a set of words into two different views, we converted the set into a list, shuffled it and split the list into two halves to get a “left” and a “right” part. The split is done at offset len(input)//2, which ensure that left + right are almost of the same length. The randomization is some kind of augmentation to prevent that the model learns a shortcut in case the two halves are always the same. With this method, the model is forced to to learn a representation that is useful for any permutation of the input. However, even with this approach the importance of ‘common’ words on each side might be overestimated.


It fllows a a quick recap how the method works. The augmentation is done outside the model, so for the input x, we get x_left and x_right, where x = x_left || x_right. Both parts are each fed into some encoder f(x) which returns a fixed length vector. All output is l2 normalized. The alignment loss is then just the squared difference of the two ‘views’:

loss_align = ((f(x_left) - f(x_right))**2).sum()

The procedure is repeated for every sample in a mini-batch. So if our mini-batch consists of 64 samples, the output H contains 2*64 representation which can be reshaped into (64, 2, dim) to efficiently ‘batch’ the loss:

loss_align = ((H[:, 0] - H[:, 1])**2).sum(dim=1).mean()

In other words, the loss forces the two views to be close in the representation space which makes sense since the are obviously related.

The uniformity loss is not so simple but still very intuitive. The aim is to spread all representations on the unit circle to avoid collapsed regions and to make use the the whole space. Thus it follows immediately that the two loss functions are competing and if one wines, the representation is useless. For example, it is trivial to collapse all samples into a simple point and then loss_align is always zero. But then, the uniform loss is maximal. On the other hand, if all points are maximally spread, the distance between x_left, x_right of any view is also maximal.

To derive the uniform loss, we need every pairwise distances of the output. PyTorch provides a dedicated function torch.pdist for exactly this task, but since we use a 2nd order optimzation [2] we had to implement our own variant. But for the simplicity, let us use the PyTorch function:

loss_uniform_l = torch.exp(-3*torch.pdist(H_left, p=2)**2).mean().log()
loss_uniform_r = torch.exp(-3*torch.pdist(H_right, p=2)**2).mean().log()
loss_uniform = loss_uniform_r * 0.5 + loss_uniform_l * 0.5

For more details, see the website [3] of the authors. But if all distances are close to zero, the term evalulates to exp(-3*0).mean().log() to zero which is the maximum and we try to minimize the term.


We have to admit that our contributions are minor but let’s take a closer look at the details. First, by avoding negative samples we do not accidentally pushing relevant content away. This is nothing new and exactly the method of [1]. But for text, simple augmentations are not trivial and we empirically verified that the randomized split learns representation that is able to group related content without supervision and those groups often are matching the corresponding ground truth labels. This means that even a very simple ‘view’ suffices to learn representations that capture both semantics but also topically aspects. This is the first contribution.

The second and last one is that we only used very simple building blocks, like hash-based word embeddings [4] and a very shallow model to encode the set into a fixed length representation. Actually, we just used kernelized attention with a learnable token plus some non-linearity which again suffices to learn a useful representation for a vocabulary of ~100K tokens and ~50K samples. We tested the generalization of the method by applying the encoder on a dataset with ~6K tokens and ~1K samples. This test set contains labels to allow a visual summary of the representation based on t-SNE which easily allows to analyze how well related content is grouped into the
feature space.


We surely did not find the holy grail, but it is satisfying that we don’t need a Transformer model with billions of parameters to learn something useful with respect to a real-world problem. In our case, the semantic retrieval of related content by providing a simple query. The clou is that the model was never trained to rank query + documents, since it just learned to relate content in a local scope, namely the sample itself. The success is likely owed to the limited model complexity that acts as a regularizer and the weight sharing that forces the model to learn a word embedding that captures lots of different regularities.

[1] raberrytv.wordpress.com/2020/06/19/1330/ [Alignment + Uniform]
[2] raberrytv.wordpress.com/2020/11/07/adapsgd-no-learning-rates-hessian-trace/
[3] ssnl.github.io/hypersphere/
[4] raberrytv.wordpress.com/2020/10/03/dynamic-word-embeddings/

It is Important to Use Your Heads

We experimented a lot with attention recently and not only self-attention, since we still think that attention and context is a key to success to build robust and versatile neural nets.

In the literature, like [1], the question is asked how many heads do we really need which is a very good question and there is no definite answer to. Especially for larger transformers with dozens of layers, it is hard to determine how much a specific head contributes to the final task and also since this might be different depending on the task. But since our resources are very limited, we focus on a very simple example to analyze the different settings.

The task is to classify some textual content into one of eight classes. The challenge is that the vocabulary is very dynamic due to spelling errors, proper nouns and slight variations to mean the same thing. This is the reason why we use dynamic word embeddings [2], so we can embed all words, also the ones that are not seen during training and there are no out-out-vocab tokens. Because it is unlikely that we need a full-fledged transformer, we started with a simple baseline. The network consists of the embedding layer and a single attention layer, plus a linear output layer to predict the classes. In contrast to most implementations, we use a kernelized attention layer [3] with a quadratic kernel as the feature map.

To avoid tuning the learning rate, we use our default optimizer AdapSGD [4]. Except for label smoothing and gradient clipping, no other tuning or regularization is done. Now, we treat the number of heads as our only hyper-parameter. Since it is burdensome and often impossible to analyze what a single attention head really looks for, the pattern, we first study the effect on the loss and the prediction errors.

The dimensions of the embedding is 100. We evaluated the number of heads: 1, 5, 10, 20, 50. But because multiple heads are implemented by reshaping tensors, a higher number does not introduce additional parameters. In PyTorch this is done by using x.view(1, n_seq, n_head, -1). However, the induced subspace gets smaller and smaller: 1 => 100, 5 => 20, 10 => 10, 20 => 5, 50 => 2.

It might be reasonable to assume that there is a peak at some point and then the performance decreases, but astonishingly this is not visible for the training(!) loss + accuracy:

 1: 2.6498 errors=868 of 1024
 5: 2.5302 errors=847 of 1024
10: 2.4477 errors=831 of 1024
20: 2.3499 errors=804 of 1024
50: 2.2531 errors=784 of 1024

For the last setting, we have 50 heads, but each head only consists of a two dimensional subspace which is finally concatenated into a 100 dimensional vector again. Still, we get the best performance.

We trained every setting for five epochs to see how it develops and the trend stays that more heads lead to better results. At the end, the number of errors is:

 1: 447 of 1024
 5: 411 of 1024
10: 371 of 1024
20: 363 of 1024
50: 355 of 1024

During training, the advantage of the bigger number of heads shrinks, but despite the tiny subspaces they still perform very well compared to the other settings.

Of course this is no proof that more heads are always better, since we did not consider the generalization of the network at all, and we also consider only a single task, but to us the results are a bit surprising. We expected that an attention subspace must be ‘sufficiently’ large to positively contribute to the performance, since for the default transformer settings, with 12 heads and 768 dims, each subspace has a size of 64.

Bottom line, for our real setting, we ended up with 100 dims and 10 heads and even if some heads learned the same pattern, or no pattern at all, the final performance, and also on the test set, was noticeable better when we doubled the number of heads from 5 to 10. What is disappointing is the fact that there is no easy way to explain what is happening inside the model that causes the advantage. Even for our small model, dissecting each head to derive a pattern seems not feasible. Maybe we could do a visual exploration, but this is also burdensome and there is no guarantee to success.

[1] NIPS 2019: Are Sixteen Heads Really Better than One?
[2] raberrytv.wordpress.com/2020/10/03/dynamic-word-embeddings/
[3] raberrytv.wordpress.com/2020/07/01/linear-attention-different-kernels-and-feature-representations/
[4] raberrytv.wordpress.com/2020/11/07/adapsgd-no-learning-rates-hessian-trace/

Query Expansion For Sets

NLP, as the name ‘natural language’ indicates, is concerned to generate sequences that are syntactically and grammatically correct. This is no piece of cake and is further complicated by the problem that one needs a lot of data to learn a representation that captures the statistical regularities of the language itself. An important issue is that a sequence of words is in correct order and thus follows a certain structure. For instance, all the Transformer models out there use a position encoding to achieve exactly this, since self attion by default is permutation invariant which means regardless of the input order, the output representation of each word does not change. Without this constraint there is no difference between ‘Jack likes Python’ and ‘Python likes Jack’. In other words, the sequence turns into a set which has no order {A,B,C} = {C,B,A}.

Since most of our problems are much closer to classical IR (information retrieval) than NLP, we don’t care much about the order and treat the input as bag-of-words, or multisets, since words might be repeated in the bag. As noted, self-attention, is a perfect match since it uses no position information by default but still encodes the context.

So, we would like to operate on sets [1, 2] and use the approach to expand a pre-defined context, set, with related words they are conceptually or topically relevant. This is a bit like the prediction of Bert, but without considering the position of each token. However, without this information, we cannot mask more than a word since it would always produce the same output. This is related to [3], but there, the position information is used and thus the token gets an ‘random offset’ to avoid the problem.

The problem we try to solve is also known as set expansion which means, giving some input set X={x1,..,xn} we would like to find new tokens T={x_i, x_i+1, …} that are ‘maximally’ related to the the content of X. For instance, if X = {pizza,hamburger,fries}, then T could be {pasta,bacon,potatoes} because those tokens match the topic of X which could be food. In case of textual input, where we only consider nouns, we could use 80% of a sentence or paragraph as X to predict the remaining 20% words as T. At the end, the idea is to encode some context X with some set encoder f(X)=H and use a decoder g(H)=T to predict relevant tokens.

We can turn this into a generative approach by repeated the following steps and for simplicity, we assume that we just predict a single token, so T is of shape (1, dim):

Y = []
for i in range(10):
 T = g(f(X))
 X = cat((X, T))

The output Y then contains all the ‘expanded’ words that are hopefully related to the context given in X. By default the function g() uses argmax to select the most likely word, but we could also sample from the scores of each token to introduce randomness.

We will elaborate on the technical details in later posts, but as usual we use PyTorch and we are thankful that the community provides more and more reference implementations for their papers, like [1] which shared their code on github [4].

[1] Deep Sets, 1703.06114
[2] Set Transformer, 1810.00825
[3] Mask-Predict, 1904.09324
[4] github.com/juho-lee/set_transformer

AdapSGD: Twenty Nine Days Later

The tuning of hyper-parameters takes often a considerable time and is also anything but energy efficient, since you train some time with different settings to decide what combination works best. However, this step does not guarantee that the final training converges, it just proved that the model starts learning something at all. And a further problem is that the chosen settings might not work for the next problem you try to solve and then, the process starts again. After all the years of research there is surely no simple solution to it, since otherwise one or more people would have been stumbled about it. But what about a 80/20 solution? A solution that works more than ‘okay’ in most cases without tuning anything would suffice. We tried to implement such a solution with an optimizer that uses the Hessian trace [1], but the idea is not ours, but a mixture of two existing proposals [1, 2, 3].

We evaluated the method with more experiments and as we expected, after the euphoria came the bleak reality. Well, that is a bit harsh, but some tuning was required to avoid instabilities in some situations. The base is still the dynamic learning rate, but since the value is not bounded, very large values are possible which is a problem. This is the first issue we. We tried different methods to bound the rate close to [0,1] and ended up with a very simple solution: f(x) = x / sqrt(1 + x**2). This helped a lot to stabilize the training. The second issue is that gradient clipping, where a default of one works reasonably well, is also required and we did not stumbled about the problem, since for all our NLP problems, we clip gradients. The last issue is that weight decay must be used and the amount needs to be rather high, like 0.1 and for this issue, a bit tuning depending on the problem might still be required.

So, in a nutshell what did we do:
(1) bound learning rate by f(x) = x / sqrt(1 + x**2)
(2) clip gradients to max-norm 1
(3) set weight_decay to 0.05-0.1

Bottom line, with these settings + AdapSGD, we were able to train models for various NLP tasks without tuning any hyper-parameters. This is surely no guarantee that it works for the next problem, but this setup worked reasonably well for all experiments we conducted. The only drawback is that even with an Hessian update every n-th steps, the procedure is noticeable slower than first-order methods and for every parameter in the network, the optimizer requires four state variables of the same shape.

[1] raberrytv.wordpress.com/2020/11/07/adapsgd-no-learning-rates-hessian-trace/
[2] No more Pesky Learning Rates, ICML 2013
[3] ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning, arxiv:2006.00719

AdapSGD: No Learning Rates + Hessian Trace

Back propagation is the key to success if you have to train a neural network. It is combined with an optimizer to miminize some loss function. Especially for images, the problem is well understood since years of experiences and the NLL loss function. In other domains, even if normalization methods and carefully chosen random parameters, the situation looks different. The training of a transformer requires a careful setup, with learning rate schedules, warmups and other tricks. And even then, the training migh get stuck in some local minima.

Despite years of research, there is no easy way out of it. Analyzing high dimensional loss functions is an active topic, but still no one came up with a method that allows to guarantee that the training converges to a state that generalizes well. At least not, with a realistic budget with respect to time and space. Methods like a naive implementation of a second order optimization requires a matrix of shape (num_params, num_params) and since num_params is usually in the millions, this is out of scope. However, there is a lot of research how to approximate the Hessian matrix efficiently and those methods only require a vector of shape (1, num_params) which is manageable.

The reason why we are care right now is that we are very much frustrated about how long our tiny network needs to learn a useful representation from a rather small dataset. To be more precise, our network has about 10K params and the dataset consists of about 1K rows. We spent quite some time to tune the hyper-parameters which includes the optimizer, but if we do not handcraft the scheduling of some parameters, the network takes much steps to converge.

Since our expertise of numerical optimization is rather standard, we did a quick literature research. We stumbled about [2] which approximates the Hessian with the Hutchinson method which is quite simple to implement but nevertheless very efficient at the same time. Thanks to the code of the authors [3], we could conduct our own experiments very quickly. The results were rather disappoint since with the default settings, compared to an optimizer like Adagrad that is also much faster to train. We again spent quite some time to understand the hyper-parameters and the best settings for our problems, was a simplified version that can be summarized as follows.

grad = gradient(loss_function)
hessian = get_trace(grad)
avg_g = avg_g * momentum + grad
avg_h = 0.99 * avg_h + hessian
std = sqrt(avg_h + eps)
param = param - lrate * grad/std

The difference to the original proposal is that we do not use a moving average for the gradient, but momentum variant without damping. But as noted before, even with this fine-tuned variant, the learning is not as fast as expected and every new dataset also needs a notable amount of time to tune hyper-parameters.

We remembered that quite a while ago, LeCun and colleagues from NYU came up with a method [1] that both combines an approximation of the Hessian and also do not require any learning rates. Sounds like a dream comes true, right? The method is both elegant and simple and can be summarized as follows.

grad = gradient(loss_function)
hessian = get_trace(grad)
t = 1/mem
avg_g = (1 - t) * avg_g + t * grad
avg_g2 = (1 - t) * avg_g2 + t * grad**2
avg_h = (1 - t) * avg_h + t * abs(h)
lrate = avg_g**2 / (avg_g2 + avg_h)
mem = (1 - g**2/avg_g2) * mem + 1
param = param - lrate * grad

Looks like RMSprop with a self-adjusting average scaling, but otherwise not very scary when you are familiar with Adam and friends. The only thing we did was to set an upper bound for the learning rate: clamp(lrate, max=1) which might not be always required, but in our case it was necessary. We initialized mem with ones. The implementation in PyTorch is also straightforward if you are a bit familiar with the Optimizer interface.

As suggested by the authors of [2], to speed up learning we determine the Hessian not every step, but we delayed the calculation for five steps. This does not hurt the final performance and often even helped to converge in fewer steps. We call this variation AdapSGD. The extra memory requirements during backprop are 4 * |params| since we need a buffer for the memory, the gradient, the squared gradient and the Hessian trace.

Bottom line, this is surely not the holy grail, but in several experiments, the method helps to avoid that our network got stuck in some local minima or did not improve at all.

[1] No more Pesky Learning Rates, ICML 2013
[2] ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning, arxiv:2006.00719
[3] github.com/amirgholami/adahessian

Simple First: When 10K Parameters is All You Need

Without a doubt we also appreciate the recent NLP progress, but we are also a bit skeptical if the ‘more-data-and-bigger-models’ is the only way to go, especially when one consider the energy consumption of training and inference. Frankly, we have no solution, but we always try to also consider simpler and more lightweight approaches, because they often deliver a astonishingly strong baseline, like [1].

An old problem we have is that we need to classify the type of items based on a short title, plus an optional sub-title. While some items contain very easy patterns, others are more vague, plus the vocabulary is very dynamic, due to typos and new combinations of characters. Thus, word embeddings do not work very well and character embeddings need a very powerful model. Again, the simple approach in [1] comes to rescue. With this a fixed number of weights, we can embed any token we want.

The next problem, also a classic, is that not all token contribute equally to the final prediction and often most of them can be ignored. So, we need attention to ignore those tokens and focus only on the relevant ones. But since the word embedding projection is very simple, some extra context can’t hurt which is why we apply self-attention after the embedding step. But that’s it and we have with about 10K weights a complete classifier that works astonishingly well.

There is no fancy stuff at all, just some dropout between the layers and label smoothing to avoid that the model gets too confident with its predictions. This step is indeed needed since even with penalizing prediction errors from rare labels, the model tends to assign the whole probability mass to one label. To avoid this, label smoothing [2] can be used. In [2] is stated that labels can be wrong and in this case, putting all weight on the wrong label is likely harmful for generalization. So instead of pursuing a perfect one-hot prediction, we also consider all other classes, but with a much smaller probability, like 0.1 and this value is uniformly spread over all other classes. This also ensures that the templates of all classes receive weight updates which does not happen when the softmax is very close to 1.

We tested the model by classifying a fresh dataset and manually evaluated the performance. The accuracy was pretty solid when we rejected all predictions with y_hat < 0.8. The model of course makes mistakes and sometimes it just guesses very good, since there is no pattern visible, not even for a human. But what's important is the fact that it learnt to generalize and provides a good baseline. Furthermore, since we use attention, we can analyze the contribution of each token for the final prediction which we did for a couple of examples. And indeed, if possible, the model just attended to the relevant tokens, while in case without clear patterns, a rather arbitrary selection was made. Like when a token just occurs in a single class, but the token itself has no further meaning.

Bottom line, for us we could say that 10K parameters is all we need to train a surprisingly strong, but simple model that can be used on all kind of devices. We do not pretend that our model understands the content, but this is also true for much larger models and in contrast to the training of such models, our model is very energy-efficient and not restricted on the device usage. We also experiment with unsupervised learning, but that is different story.

[1] raberrytv.wordpress.com/2020/10/03/dynamic-word-embeddings/
[2] Deep Learning [isbn:978-0262035613]

Self-Attention: Properly Handle Padded Input

Even if a paper is very verbose with hyper-parameters and other details, there is usually no space left about hints for efficient implementations of the required operations. Sometimes they are sketched, or even better github link is provided, but this is not always the case. Especially for attention on variable-length sequences, the softmax operation can be a bit tricky. Why? You need to consider the padding of shorter sequences and turn them into some masked value that does not contribute to the probability distribution. Let’s consider this numerical example for illustration: We have two rows, the first has 6 elements, the other 4. The padding is done with zeros, so the latter looks like this:

x = torch.FloatTensor([0.7, 1.2, -0.34, 2, 0, 0]).view(1, -1)
p = torch.softmax(a.view(1, -1), dim=1).view(-1)
tensor([0.1305, 0.2151, 0.0461, 0.4787, 0.0648, 0.0648])

It is obvious that the last two padded entries should not contribute to the probability but they do. The solution is simple, we need to fill the paded entries with an extremely large number:

p_mask = torch.softmax(x.masked_fill(x==0, -1e6), dim=1).view(-1)
tensor([0.1499, 0.2471, 0.0530, 0.5500, 0.0000, 0.0000])

The difference between ‘p’ and ‘p_mask’ is notable which is no surprise since the sum is normalized and since positive or zero values contribute more than negative values, zeros are problematic. And let us not forget that we apply exp(x_i) on each value before we normalize. The following numerical example illustrates the problem:

exp(0) = 1.00
exp(-0.34) = 0.71
exp(-99) = 1e-43

Bottom line, we need a sufficiently large negative value to avoid any contribution from padded entries, otherwise the padding acts as a constant value that always contributes to the probability distribution which is incorrect.

Dynamic Contextualized Word Embeddings

In our previous post [4], we briefly introduced a method that allows to learn word embeddings with very few parameters and the ability to project any word into the learned feature space. Despite all the advantages, the drawback is that the information of these embeddings cannot be easily captured with nearest neighbor analyses, since they do not directly capture syntactic structure. Now, this is no real problem since the idea is anyway to re-learn the projection matrix directly from the dataset. However, to further maximize the use of the trained network, it would be nice to also have the word embeddings.

We used a simpler version of the convolutional approach from [1] but since there was a tendency to also attend to non-useful words, we replaced the entire layer with some thing. And these days the answer to everything is a transformer, so we also used self-attention, but a slightly different variant [2]. Let us us do a quick recap of our net architecture: We have the projection operator to embed arbitrary tokens into a fixed-length representation. Then we feed the sequence into a multi-head attention layer and to compress this variable sequence, we use routing-by-agreement [3] which lead again to a fixed-length vector. Finally, the classification is done with an ordinary softmax. It should be noted that we intentionally ignore the order of the words!

We know that supervised learning might not lead to the best word embeddings, since they are strongly biased toward the task, but masked language learning is a lengthy process and we wanted to check if the idea works at all. So, we use the output of the self-attention layer to derive contextualized word embeddings. As a quick test, we used a dataset with 1,000 samples and five classes. For each sequence, we store each word along with the label and then we visualize a tSNE plot of those vectors. The label of each word as a color and as expected clusters are visible and reasonably well separated. We also studied the nearest neighbors of a couple of words and now a semantic is clearly visible, also if we count the number of words with the same label as the reference.

Bottom line, for a dataset with ~7K distinct words and ~1K rows, we just need ~9K parameters to learn both a full classifier and a contextualized word embedding. In this case, we can really say attention is all we need, since routing-by-agreement is a kind of attention, self-attention is clearly attention and there is nothing left. So, what comes next? BERT with hash-based word embeddings?

[1] D19-1506: Projection Attention Networks
[2] raberrytv.wordpress.com/2020/07/01/linear-attention-different-kernels-and-feature-representations/
[3] raberrytv.wordpress.com/2020/02/01/attention-via-routing-by-agreement/
[4] raberrytv.wordpress.com/2020/10/03/dynamic-word-embeddings/

Dynamic Word Embeddings

In contrast to the domain of images where the input space is entirely covered by a neural net, namely the pixel space, text documents are a completely different story. The reason is obvious, since text consists of a vocabulary that is not fixed, pretty large and it grows over time. Not to mention specific types, like proper nouns that cannot be modeled as part of other words. In [1] we revisited some aspects of word embeddings, but the bottom line is that full embeddings cover a large portion of the trainable parameters, or in case of the character-level, they require an extremely powerful model. With byte-pair encodings, BPE, the best of both world is combined, but the expressiveness for complex language is still limited and thus not optimal.

We tried to use a convolutional approach to learn a dynamic embedding, but despite the modest size of the extra parameters, the network failed to learn useful embeddings for smaller datasets. Thus, the resulting word embeddings were always inferior compared to the full word embeddings. So, the question is if there is a simpler method that can embed an arbitrary word, or better token, without retraining the whole model? Turns out there is as described in [2, 3].

The idea is both elegant and simple. They key to success is a function that takes an arbitrary sequence X of length N and transforms it into a sequence h of length K, where K is fixed: f(X) -> h. A recurrent neural net could be used for the function f but this would be computationally expensive. A much better choice is a hash function that compresses some text into a fingerprint of fixed length. The function does not need to be cryptographically secure and thus, any hash function, like CRC32, md5 or whatever can be used.

In case of [2], the output of the hash function is converted into bits and further compressed by the factor of two by merging adjacent bits. The following Python code illustrates the whole procedure:

tid = int(md5(input.encode('utf8')).hexdigest(), 16)
bits = "{0:b}".format(tid) pad = 128 - len(bits) + 1
bits = ("0" * pad) + bits
out = [reduce(bits[i], bits[i+1]) for i in range(0, len(bits)-2, 2)]

The code is written for clarity, not performance and does not contain any magic. We hash some input with MD5, convert it into an integer, then to a bit string, pad it to be exactly of length 128 and then we merge neighboring bits. The reduce function is as follows:

x, y = int(x), int(y)
if x == y and x == 1: return 1
elif x == y and x == 0: return 0
else: return -1

So, at the end, the output is a 64-“bit” string that consists of {0, -1, 1}. The advantage is that any token can be transformed with the method, for example words that are not present in the current dataset. This means that we never need to skip any tokens in new input or use an OOV embedding.

All we need to do know is just to convert the fixed-length input into a continuous feature space which is piece of cake. For this, all we need to do is to use a projection matrix of shape (D, d) where D is half of the hash length, D=128/2 and the embedding dimension d can be freely chosen, like 32.

With a numerical example it becomes obvious how much parameters are spared: Let us assume a vocabulary of size 5,000 and the embedding dim is 32.
(1) full word matrix: 500032=160K
(2) hash projection: 6432=2K
That is quite a lot, to be precise, a factor of 80, which makes such models well suited for devices with lower computational power which is also the purpose of [2]. In case of classification the experiments confirmed that the new method is indeed a very strong baseline.

Now one might ask if the embeddings that were learned with method (2) have similar properties as (1). Like reasonable nearest neighbors and visible groups when visualized with some 2D projection like tSNE. Stating the obvious, it is not reasonable to assume that 2K parameters can encode the same amount of information as 160K. As confirmed by experiments this does not limit the use of embeddings for a classification task, but our own preliminary experiments showed that the embeddings cannot be used as a stand-alone feature space. Again, this is no surprise.

Bottom line, the problem of ever growing vocabularies can be efficiently addressed with the described method with minimal overhead, but with respect to computation and the number of parameters. The next step is to try those dynamic embeddings for non-classification tasks.

[1] raberrytv.wordpress.com/2020/05/18/word-embeddings-risk-a-second-look/
[2] D19-1506: Projection Attention Networks
[3] Efficient On-Device Models using Neural Projections