Attention: Drop Heads Not Dimensions

There are different ways to discourage your network from simply memorizing the input data, or in machine learning terms, to avoid overfitting. But for some architectures, plain ‘1D’ dropout might not be the best choice. One such example is multi-head attention. There, a batch of data has the shape (batch, seq, heads, dim) and dropout is applied to the view (batch, -1) of the input; in other words, every dimension in a row has the same chance of being dropped. However, if we want to regularize whole heads individually, 1D dropout is not what we want.

PyTorch also provides a 2D dropout version that drops each channel individually, where the expected input shape is (batch, channel, x, y). The number of dimensions already matches, but we need to transpose the attention output so that ‘heads’ comes second instead of third.

import torch

x = input.transpose(1, 2)                                    # (b, s, h, d) -> (b, h, s, d)
x = torch.nn.functional.dropout2d(x, p=0.1, training=True)   # zero out whole (s, d) head maps
x = x.transpose(2, 1)                                         # -> (b, s, h, d)

With 2D dropout and p=0.1, each head, that is all of its (s * d) values, is set to zero with a probability of 10%, and the sampling is repeated independently for every head.

This is related to the method described in [1]. The idea is to prevent some heads from dominating the attention during training while the other heads merely rely on the ‘leader’ to get things right; 1D dropout, in contrast, works on the unit level to avoid co-adaptation.
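
For convenience, the transpose-and-drop trick can be wrapped into a small module. What follows is a minimal sketch, not the scheduled DropHead method from [1], just the dropout2d idea from above packaged as a reusable layer; the name HeadDropout is our own:

import torch.nn as nn
import torch.nn.functional as F

class HeadDropout(nn.Module):
    # zeroes whole attention heads with probability p (sketch, not the method from [1])
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):              # x: (batch, seq, heads, dim)
        x = x.transpose(1, 2)          # -> (batch, heads, seq, dim)
        x = F.dropout2d(x, p=self.p, training=self.training)
        return x.transpose(1, 2)       # -> (batch, seq, heads, dim)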

[1] Scheduled DropHead, acl:2020.findings-emnlp.178

N Layers is All We Need?

The other day we were joking [2] that the days of attention are numbered, and a few days later another paper [1] was published which showed that other methods can also solve the same problems without attention. As we mentioned before, we think attention is a great building block with respect to explainability, but its computational complexity is something we need to work on. Frankly, it feels a bit like walking along the edge of a circle: we have now arrived back where we started, or stated differently, the insights gathered during our walk indicate that plain feed-forward MLPs with ‘spatial mixing’ are enough to solve all the problems out there. So, we are back at ‘MLPs Are All We Need’.

However, regardless of whether attention is required or not, we need a building block that can at least partly explain the decisions of a neural net. As a consequence, at the end of the network we need some scoring with respect to the ”input tokens”. When previous layers perform non-linear spatial mixing of these tokens, the scores obviously cannot be traced back directly to the input tokens, but this problem exists for Transformer architectures in general. So, for the sake of simplicity, we always consider ‘4 layer’ networks: an embedding layer, one mixing layer, an attention layer and finally the output layer for the prediction. For the mixing layer, the only constraint is that the shape of the input sequence, usually (batch, seq, dim), is preserved. With this in mind, we could use [1,3] or any classical attention method, like [4].
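
To make this skeleton concrete, here is a minimal sketch of such a ‘4 layer’ network in PyTorch. The mixing module is deliberately left abstract, since any layer that preserves (batch, seq, dim), like the blocks from [1,3], can be plugged in; the seed attention is a simplified single-head illustration, not our exact setup:

import torch
import torch.nn as nn

class FourLayerNet(nn.Module):
    # embedding -> mixing -> seed attention -> output (sketch)
    def __init__(self, vocab_size, dim, n_classes, mixing: nn.Module):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mixing = mixing                          # must preserve (batch, seq, dim)
        self.seed = nn.Parameter(torch.randn(dim))    # learnable seed query
        self.out = nn.Linear(dim, n_classes)

    def forward(self, tokens):                        # tokens: (batch, seq)
        x = self.embed(tokens)                        # (batch, seq, dim)
        x = self.mixing(x)                            # (batch, seq, dim)
        weights = torch.softmax(x @ self.seed, dim=-1)    # (batch, seq)
        pooled = torch.einsum('bs,bsd->bd', weights, x)   # (batch, dim)
        return self.out(pooled), weights              # weights: token-level scores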

If we use a building block from a paper, it is worth thinking about it for a moment before we implement it as a PyTorch layer. Why? We are not sure what the design criteria for such a block were, but we assume that some idea was verified by applying it to some problem. And that usually means some dataset is used and the goal is to train a model that generalizes well. Even if the goal is to design something that is applicable to a broad range of problems, the optimization of the design was likely done with respect to the conducted experiments, and that means with respect to the used datasets.

But let us not be vague here. We really appreciate the efforts, and also that the authors share their results with the community, often with reference or pseudo code, but our point is that maybe the building block is too powerful for your problem. Very often a grid search is done for the hyperparameters, but it is less clear how to apply the same idea to “minimize” a layer design. Maybe we are still too foggy here, so let’s be more concrete:

This is the pseudo code from [1]:
shortcut = x
x = norm(x, axis="channel")
x = gelu(proj(x, d_ffn, axis="channel"))
x = spatial_gating_unit(x)
x = proj(x, d_model, axis="channel")
x = x + shortcut
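
To get a feeling for the block, here is how it could look as a PyTorch module. This is only our reading of the pseudo code above: the spatial gating unit splits the channels in half, normalizes one half, projects it along the sequence axis and uses the result to gate the other half. The near-identity initialization follows the description in [1], while the names and minor details are our own:

import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    # splits the channels, gates one half with a projection over the token axis
    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.zeros_(self.spatial_proj.weight)     # start close to identity
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x):                            # x: (batch, seq, d_ffn)
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v

class GMLPBlock(nn.Module):
    # norm -> channel proj -> gelu -> spatial gating -> channel proj -> residual
    def __init__(self, d_model, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):                            # x: (batch, seq, d_model)
        shortcut = x
        x = torch.nn.functional.gelu(self.proj_in(self.norm(x)))
        x = self.sgu(x)
        return self.proj_out(x) + shortcut
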
As said before, we do not imply that designs happen by accident, but as stated in [1], “it is still unclear what empowers such success” (meant is the Transformer architecture), and the same is true for new methods that provide no theoretical analysis showing that every operation is really necessary.

Furthermore, all those architectures were designed with scaling in mind, which means nobody thinks about datasets with just hundreds or thousands of examples. It is true that models should be reused as often as possible to avoid wasting energy, but often fine-tuning or distilling a larger model requires more energy than training the right model on a smaller dataset. And the right model implies a network architecture that minimizes the required FLOPs.

The point is that we believe that building blocks are still too coarse to be used as a recipe for general problems, especially if the problem is tied to a smaller dataset. That means if time and space are not a problem and the dataset is considerably large, those building blocks often already provide a very strong baseline. However, it is not clear what parts of the blocks are really required and what parts could be removed without sacrificing anything. This is very much related to questions like whether you really need 768 dims, or whether 384 are enough.

Why do we think this is so important? Most companies out there have limited resources, time and money, and if your network could deliver the same decision in half the time or space, shouldn’t that be preferred? Model distillation or quantization might be applied here, but we think it is still important to research how to optimize network architectures with respect to a budget.

[1] Pay Attention to MLPs, arxiv:2105.08050
[2] raberrytv.wordpress.com/2021/05/08/the-fall-of-attention/
[3] MLP-Mixer, arxiv:2105.01601
[4] Fast Autoregressive Transformers with Linear Attention, arxiv:2006.16236

Token Interactions: Mixing, Lambda and Attention

Yesterday we tinkered with LambdaNetworks [1,2], but we were unable to achieve the same accuracy when we simply replaced the kernel attention with the lambda layer. Furthermore, due to the seed attention, we were also unable to utilize the full potential of those layers.

To be fair, we wanted to at least measure the gain when we use full self-attention (n_tok, n_tok) instead of (n_seed, n_tok). And the gain was quite noticeable. The test is simple: we generated a random batch of shape (32, 512, 64) = (n_batch, n_toks, n_dim) and fed it to each layer:
(1) kernel attention: ~283 msecs
(2) lambda layer: ~41 msecs
which amounts to a speed-up of almost a factor of 7, which is quite a lot. And for longer sequences, like 786 tokens, the factor is almost 10.
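
For reference, the measurement is nothing more than feeding the same random batch through both modules and timing the forward pass; kernel_attention and lambda_layer below are placeholders for our own implementations, which are not shown here:

import time
import torch

def bench(layer, x, n_runs=10):
    # average forward-pass time in milliseconds
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_runs):
            layer(x)
    return (time.perf_counter() - start) / n_runs * 1000.0

x = torch.randn(32, 512, 64)   # (n_batch, n_toks, n_dim)
# bench(kernel_attention, x) vs. bench(lambda_layer, x)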

Then we stumbled upon [3], which replaces full attention with Fourier transforms for performance reasons. The trend of working on alternatives [2,3,4] to attention is noticeable, and by that we do not mean speeding up attention, but replacing it with some mixing function. But for our rather modest problems, we focus on CPU-efficient methods and smaller models, so we thought about a way to keep the advantage of explainable decisions in combination with token-to-token mixing, instead of just seed mixing.

The idea is quite simple: we keep the kernel attention at the top of the network, to be able to analyze which tokens contributed to the decision, and we use a lambda layer right after the token embedding layer for mixing. This allows the tokens to interact with each other:

batch                                    # (n_batch, n_toks, n_dim)
emb = dyn_word_embed(batch) + position()
m = lambda(emb, emb, emb)                # <-- NEW: token-to-token mixing
x = ln(emb + dropout(m))
y = att(seed, x, x)
[..]

And now, in contrast to the drop-in replacement [1], there is a gain with respect to the precision of the model. The comparison is not entirely fair, since the new model is more powerful due to the additional parameters. Nevertheless, since the overhead, both in terms of time and of extra parameters, is minimal, we keep the new architecture, as it is stronger than our baseline.

Bottom line, at least for AI, the year 2021 seems to be very interesting and productive, and it does not feel like we have reached the peak yet.

[1] raberrytv.wordpress.com/2021/05/14/lambdanetworks-a-quick-glimpse
[2] LambdaNetworks (2102.08602)
[3] Mixing Tokens with Fourier Transforms (2105.03824)
[4] MLP-Mixer (2105.01601)

LambdaNetworks: A Quick Glimpse

We still believe that attention is the way to go, since it allows us to partly explain decisions made by neural networks, and the mechanism is also biologically plausible. However, the computational complexity is a big problem when a lot of tokens are involved, due to the quadratic cost. There are alternatives, but they come with drawbacks: Linear Attention, for instance, does not allow analyzing the attention maps, since they are never explicitly calculated.

While we went through the accepted ICLR 2021 papers, we stumbled upon LambdaNetworks [1], whose aim is to model long-range interactions without attention. Since the method is simple to implement, at least in our case, we decided to compare the results with our current baseline, which is kernel attention.

The setup is straightforward: the training dataset contains 150K samples and we have seven labels to predict. We use hash-based embeddings to be able to embed all tokens, even those not seen during training. The encoder is a seed-based transformer where we learn a ‘seed’ query which is used to attend over the input tokens. The loss is the categorical cross entropy. Since the interface of the lambda layer is identical to that of the attention layer, we fix all hyperparameters and just replace the kernel attention with the lambda layer. Frankly, since we have no experience with the layer, tuning or improvements are not so easy. There are some hints in the paper [1] to normalize the query/value embeddings after the projection, but we started with the vanilla code from the paper. What follows is no detailed evaluation; we just wanted to check if the layer can be used as a drop-in replacement with respect to performance, both time and precision.

As expected, the PyTorch implementation did not take much time, thanks to the pseudo code from the paper and the simplicity of the method. Building a module with the same interface took no more than 15 minutes.
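
For completeness, here is a stripped-down sketch of the content path of a lambda layer, close to the pseudo code in [1] but without the positional lambdas and without multiple heads, so it is only the core idea, not the full method: the keys are normalized with a softmax over the context, keys and values are summarized into a single ‘lambda’ matrix, and every query is multiplied with this matrix, so no (n_tok, n_tok) similarity map is ever built:

import torch
import torch.nn as nn

class ContentLambdaLayer(nn.Module):
    # simplified, content-only lambda layer (no position lambdas, single head)
    def __init__(self, dim, dim_k):
        super().__init__()
        self.to_q = nn.Linear(dim, dim_k, bias=False)
        self.to_k = nn.Linear(dim, dim_k, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x, context):                    # x: (b, n, dim), context: (b, m, dim)
        q = self.to_q(x)                              # (b, n, k)
        k = torch.softmax(self.to_k(context), dim=1)  # normalized over the m positions
        v = self.to_v(context)                        # (b, m, dim)
        lam = torch.einsum('bmk,bmd->bkd', k, v)      # content lambda: (b, k, dim)
        return torch.einsum('bnk,bkd->bnd', q, lam)   # (b, n, dim), no (n, m) map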


First, it should be noted that the advantage of the lambda layer is more visible when it is used for full self-attention: in our case, the similarity matrix is not of shape (num_toks, num_toks) but (num_seeds, num_toks), with num_seeds equal to one. Thus, we have no quadratic complexity, but a linear one. And second, since both methods use the same number of parameters, there was no additional challenge of having to learn more parameters.

The runtime per epoch was about 5% slower with the lambda layer, but the implementation can surely be optimized. What caught our attention was the very slow learning with the vanilla version. For the kernel attention the loss was ~1.35 at epoch 6, while it was ~1.72 for the lambda layer. Since we noticed before that the softmax is not always the best choice for the similarity, which is the reason we switched to kernelized attention in the first place, we replaced the softmax function with a kernel. This helped a lot: the loss is now ~1.42 and the learning is also much faster. But then we noticed that after a number of steps the progress stops and the model does not improve any further. We tried different settings, but the problem remains, compared to the kernel attention where progress is visible for much longer before it converges. The normalization mentioned in the paper did not help either. But as mentioned before, our goal is mainly to explore new directions and not to do a fair comparison of the methods.
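
To be explicit about what we mean by replacing the softmax with a kernel: instead of exponentiating and normalizing the raw scores, a non-negative feature map is applied to queries and keys and the resulting similarities are normalized to sum to one. The elu+1 feature map below, popularized by the linear attention line of work, is only an illustrative choice, not necessarily the kernel we use:

import torch
import torch.nn.functional as F

def kernel_attention_weights(q, k):
    # q: (batch, n_seed, dim), k: (batch, n_tok, dim)
    phi_q = F.elu(q) + 1                        # non-negative feature map
    phi_k = F.elu(k) + 1
    sim = phi_q @ phi_k.transpose(-2, -1)       # (batch, n_seed, n_tok), all >= 0
    return sim / sim.sum(dim=-1, keepdim=True)  # rows sum to one, like a softmax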

So, in our special case, there is no gain in using the vanilla lambda layer, since it is a bit slower and the final accuracy is lower compared to the kernel attention layer. However, the method proved to be very powerful for other problems, so we are curious why it does not work in our setting. But since our baseline is already very strong, there is no point in diving deeper into the problem, also because the method does not let us analyze attention maps, which we currently use for debugging.

[1] LambdaNetworks (2102.08602)

The Fall of Attention?

The ‘X is All You Need’ style should not be taken too literally, and self-attention is also surely not the end of the neural road, which means it makes sense to follow other paths to explore what else is out there, like MLP-Mixer [1]. But let us take a moment to think about the status quo. First, the concept of attention is biologically plausible, and second, it makes it easier to explain what a model has learnt and how it arrived at its final decision. Not to forget that those networks do a very good job at solving various NLP tasks.

For example, take a very simple attention model where a single learnable seed is used to attend over a sequence of tokens, with just one head. Then we can easily analyze which tokens are used for the final decision. With more heads, things get complicated; surely not all heads can be easily summarized and some might seem totally chaotic, but very often there are heads that act as an understandable pattern detector. To be frank, even with attention layers, large neural nets remain black boxes, and it does not seem realistic to understand all the internals of a model with billions of parameters. So, in our humble opinion, attention at least helps to shed a bit of light into the black box.

The point is that we have just started to use models that at least partly explain their decisions and reveal some of the patterns they have learned, which helps a lot to ‘prove’ that a model did not just learn bogus patterns from the dataset. However, despite all the success stories, we are still at the beginning of our journey, which means we need more research to better understand this powerful tool. We greatly welcome other lines of research, but the question is: should we replace attention with just another black box because the new approach is competitive with respect to some benchmarks? For example, in [1] it is stated that “[..] neither of them are necessary” (meant are convolution and attention) and a new architecture is introduced that just uses vanilla MLPs.

So, when a new approach needs roughly the same amount of training time and data, is not more efficient with respect to the required computation, and we further sacrifice transparency and explainability, is it worth pursuing? This might be a bit too philosophical, but we should keep in mind that training very large models consumes an immense amount of energy, and thus our question is: how do we conduct responsible AI research with respect to the environment, transparency of decisions, alternative lines of research and long-term goals?

[1] MLP-Mixer, arxiv:2105.01601

Word Embeddings: Why Hashing Helps

Regardless of the underlying model, word embeddings are required to learn representations for NLP problems. At the beginning of the NLP era, a full embedding matrix was used, which means one vector per word. The drawback is that a large vocabulary was responsible for most of the trainable parameters of a neural network. Subword information, at least in combination with a full embedding for the known vocabulary, does not help with respect to the parameters, but it allows embedding unknown words as a sum of n-grams. Then came byte pair encoding (BPE) and friends to reduce the number of required embedding parameters, which helped a lot. But especially for morphologically rich languages those methods are still not optimal, because BPE cannot embed tokens when all of their segments are unknown.

To be frank, we have no solution for the problem, but for some tasks it helps to use a dynamic embedding based on hashing the input token [1]. The approach introduces a fixed number of trainable parameters that do not depend on the vocabulary size. The advantage is that any word can now be embedded, but since the parameters are shared across all words, the expressive power of this embedding is limited. For comparison: for 10K tokens and 100 dims, one needs 1M parameters, while the hash-based approach just needs num_bits * 100 parameters, where num_bits might be 200, which equals 20K.
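
To illustrate the idea, here is a minimal sketch of how such a dynamic embedding could be realized; it is not the exact method from [1], and the hashing scheme below is only an assumption for illustration: a token is hashed to a handful of bit positions out of num_bits, and its embedding is the mean of the vectors belonging to the active bits, so the parameter count is num_bits * dim regardless of the vocabulary.

import hashlib
import torch
import torch.nn as nn

class HashEmbedding(nn.Module):
    # dynamic embedding: num_bits * dim parameters, independent of the vocabulary size
    def __init__(self, num_bits=200, dim=100, n_active=10):
        super().__init__()
        self.table = nn.Parameter(torch.randn(num_bits, dim) * 0.02)
        self.num_bits, self.n_active = num_bits, n_active

    def bit_positions(self, token: str):
        # n_active pseudo-random bit positions derived from the token (illustrative hashing)
        return [int(hashlib.md5(f'{i}:{token}'.encode()).hexdigest(), 16) % self.num_bits
                for i in range(self.n_active)]

    def forward(self, tokens):                                       # tokens: list of strings
        idx = torch.tensor([self.bit_positions(t) for t in tokens])  # (n_tokens, n_active)
        return self.table[idx].mean(dim=1)                           # (n_tokens, dim)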

The hash-based approach also has the advantage that rare tokens have no(!) separate embedding, which is a problem for full embeddings: during training those tokens are visited rarely and thus their embeddings do not get many updates. So it is a trade-off between being universal and being expressive, but at least for classification tasks the dynamic embedding often delivers very strong results, since not every token needs a meaningful embedding.

For example, one problem we had to tackle was classifying a sequence of free-form tokens into a type. The tokens can be practically anything: words, single characters or a combination of both. Both full embeddings and BPE would likely fail, since the frequency of pairings is not sufficient to derive splits, and with full embeddings there would be dozens of out-of-vocab tokens. The hash-based method allows embedding literally anything, and with the weight sharing, tokens with lower frequencies also do not suffer from the ‘low vector norm’ problem of full embeddings.

Bottom line, whenever the word embedding itself is not directly used, dynamic embeddings are a very efficient way to handle out-of-vocab problems and to reduce the model size, often with no drawbacks at all. Especially for problems on the character level, they are very efficient compared to RNNs, which need more parameters and are also sequential in nature, which means more time and more space is required.

[1] raberrytv.wordpress.com/2020/10/03/dynamic-word-embeddings/

HowTo: Load H.264 Videos With dvbcut

Almost a decade ago, in September 2012, we began our work on a personalized TV app. The project is a bit like IPv6, which means a rather long-term project. Still, we learned a lot about Linux + DVB-S, which was very useful back then. Nowadays things are a little easier, but when it comes to editing movies recorded via DVB-S2, not that much has changed. There are video editors, but if all you want is a cutter, those apps feel a bit overloaded for the task.

A little elaboration first: We modified gnutv to support H.264 streams, which still works like a charm. The output of the gnutv program is an MPEG-TS container that can be viewed with mplayer/mpv without any problems. And for ordinary MPEG-2 videos in the container, dvbcut is the first choice if you just want to quickly edit a movie to remove some parts or to adjust the beginning and/or end.

So far everything is fine, but since nowadays people record stuff in high definition, namely H.264 and higher, such media cannot be loaded into dvbcut. We did some quick research on the status quo, but found no real solution. So we use ffmpeg, the Swiss army knife for media, to convert the file instead.

The number of options for ffmpeg can be intimidating, but all you have to do is:

ffmpeg -i your_h264_media_file.mpeg -c:v mpeg2video -qscale:v 2 -c:a mp2 -b:a 192k /tmp/output.ts

The last part is the audio: MP2 at 192 kbit/s; the first part re-encodes the H.264 video into the good old MPEG-2 format. But since the operation is not a simple copy, it takes a lot of time for longer videos and thus does not come for free.

Bottom line, the proper solution would still be to implement the feature directly in dvbcut, but since we have no time for this, the only option we are aware of is this lengthy kludge.

Contrastive Learning By Maximizing Mutual Information

Learning useful representations without labels has gained a lot of attention recently, for NLP with Transformers, but also in the image domain. However, as pointed out in previous posts, like [3], most of the methods require a large batch size to ensure a sufficient number of negative examples. The approach in [2] uses latent cluster assignments combined with optimal transport to avoid negative samples altogether. However, despite its simplicity, the training is not always straightforward and one needs to fiddle with hyperparameters and likely also with the network architecture.

But since the idea of the method can be expressed in many ways, we do not need to explicitly use the optimal transport + swapped assignments design. For no particular reason, we had to think about mutual information, which has also been used in connection with contrastive learning. For instance, in [1] the authors present an approach to train a Part-of-Speech (PoS) tagger in an unsupervised way. The idea is to relate ‘past’ and ‘future’ information via mutual information. Details can be found in [1] or in the referenced GitHub code.

The adaptation to our problem is then straightforward. All we need to do is create two views of a sample and assign the first one to ‘past’ and the other one to ‘future’. The loss derived in the paper is semantically similar to the one in [2], with the exception that we do not need to swap anything. The advantage of this objective is that we do not need a separate clustering step and thus no extra hyperparameters. All we need is the softmax temperature, which can be kept from the other approach. And last but not least, we can still use the procedure to derive soft cluster assignments for new data, because the approach from [1] also learns a set of ‘clusters’.
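
To make the flavor of such an objective concrete, here is a generic two-view sketch. It is not the exact loss from [1] or [2]; it only shows how two soft cluster assignments over a shared codebook can be related by maximizing the mutual information of their joint assignment distribution, with the softmax temperature as the only extra knob:

import torch
import torch.nn.functional as F

def soft_assign(z, codebook, temp=0.1):
    # z: (batch, dim), codebook: (n_clusters, dim); cosine similarity + softmax
    z = F.normalize(z, dim=-1)
    c = F.normalize(codebook, dim=-1)
    return torch.softmax(z @ c.t() / temp, dim=-1)    # (batch, n_clusters)

def mutual_info_loss(p_a, p_b, eps=1e-8):
    # p_a, p_b: soft assignments of two views ('past' and 'future') of the same samples
    joint = (p_a.unsqueeze(2) * p_b.unsqueeze(1)).mean(dim=0)   # (k, k) joint distribution
    joint = 0.5 * (joint + joint.t())                           # symmetrize
    marg_a = joint.sum(dim=1, keepdim=True)
    marg_b = joint.sum(dim=0, keepdim=True)
    mi = (joint * (joint.add(eps).log() - marg_a.add(eps).log() - marg_b.add(eps).log())).sum()
    return -mi                                                  # minimize the negative MI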

For our smaller test dataset, the approach from [1] was able to perfectly recover the labels by using only pairwise similarities. A t-SNE plot also confirmed that the learned representation both groups related content and separates samples with different concepts.

Bottom line, it is not reasonable to assume that we can solve all our problems just by finding the right loss function, but thanks to all the available papers we are slowly gaining more and more insight into how we can tackle the remaining hurdles, one by one.

[1] “Mutual Information Maximization for Simple and Accurate Part-Of-Speech Induction”
[2] “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”
[3] raberrytv.wordpress.com/2021/03/20/findining-similarities-not-differences/

Finding Similarities, Not Differences

Despite all the labeled data out there, it is illusory to assume that every problem can be rephrased as a classification problem. First, annotations are costly, and second, labeling cannot be done by everybody; it often requires certain knowledge and thus time. And finally, a label does not carry much information. For instance, a picture of a cat on a tree might be tagged with ‘cat’, but it contains so much more information that is useful to understand the content as a whole and its concepts. So, even if everything were labeled, a classical softmax model just learns the features necessary to differentiate between classes, and there is no constraint to understand concepts as a whole to do this. Whatever helps to drive the loss to zero is used, and this information does not always make sense. For instance, a copyright notice could be used for discrimination, or any other watermark. Bottom line, learning with labels feels wrong if your model should truly understand the underlying data and not just learn prototypes to summarize each label.

Unsupervised learning is still the holy grail of AI, but despite all the progress, especially in the image domain, there is no bulletproof method yet. Learning the underlying factors of data by relating different views is very successful, as shown in [2], but the use of negative examples has always worried us. Why? Because there is no way to decide whether the query instance and the rest actually have nothing in common. Whenever there is a latent aspect present both in the query and in some negative sample, it is pushed away, or at least it ends up less plausible than the query and the positive example: score(query, pos) > score(query, neg_similar) after the gradient step. A possible solution was proposed by [3], which does not use negative examples at all but spreads all embeddings on the hypersphere.

And this brings us to our actual goal, namely to explain the preferences of a user by using only the notion of pairwise similarity. In the classical setting, we learn a utility function that assigns a score to every sample, and the higher the score, the better the sample fits the preferences. With a linear SVM, we require positive and negative instances to learn such a function, and we also use labels, which is not optimal for the reasons stated above, like the fact that a label does not reveal which aspects of a sample a user liked or disliked.

Such models often provide a very strong baseline, but expressing the preference for an instance just as the sum of feature weights is not very satisfying. The problem is that the total set of preferences is encoded in the positive features of the weight vector, but there are no fine-grained topics visible that are expressed by the user and hopefully learned by the model.

To be clear, we consider the domain of text ‘documents’ of arbitrary length. In our case, the docs contain a brief summary of TV content. So, we could create views by splitting the content into halves, or by drawing two descriptions from the same TV item. But it is also possible to use tags, if present, or meta information like directors or annotated themes. The aim is now to relate two samples in terms of latent topics, cluster prototypes, with the clever trick of swapping the assignments for prediction. In other words, if two samples have something in common, it should be possible to predict this particular aspect of x1 by using the features, the topic assignments, of x2 and vice versa.

Bottom line, it feels much more natural to understand preferences by directly relating content than by contrasting samples against a large set of negative samples for which we cannot guarantee that there are no similarities with the query sample. And not to forget the computational complexity, because contrastive learning often needs a very large batch size for good results, while this method just needs ‘enough’ samples with respect to the chosen number of clusters.

[1] “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”
[2] “Big Self-Supervised Models are Strong Semi-Supervised Learners”
[3] “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere”

Learning Topics By Contrasting Cluster Assignments

Frankly, we are a bit excited: we have tried for quite a while to find the missing puzzle piece, nothing seemed to work, and now it finally seems that the picture is complete. The goal is to learn a representation of a dataset that disentangles the concepts and topics in it. A further goal is to be as green as possible with respect to resources: no large machinery for training, no runtime of weeks, and the model should have a low memory footprint. But let us come back to part one.

The recently introduced contrastive learning methods caught our attention, especially the uniformity + alignment method [1]. But despite its simplicity, and although the method does not need negative samples at all, the final representation was never powerful enough to perform robust semantic retrieval on unseen data. This is likely not a problem of the method; it just did not work for our problem.

The soft nearest neighbor loss [2] also looked promising, but a partitioning of a dataset based on labels is only as good as the labels. In other words, the loss worked as expected, but since our larger dataset is only weakly annotated, and the topics of the labels are not disjoint, the generalization was also limited. So, again, the problem is that labels do not carry enough information to learn an embedding with respect to the latent concepts of the data, or the labels would need to be very fine-grained, which is intractable for many reasons.

For the task we are trying to solve, negative samples likely hurt the performance, since we never know if the reference and an arbitrary negative sample have something conceptually in common. This is why we favored the uniformity + alignment loss, since it does not require any negatives. However, we also wish to summarize the dataset with respect to latent concepts, and the loss does not enforce this. We could incorporate a neural topic model into our network, but as argued in [3], those layers easily degenerate and often either only perform averaging or end up with lots of ‘dead’ concepts that are never activated.

Frankly, we missed a paper [4] from last year, June 2020, that contains the last piece of our puzzle. The key element [5] was presented much earlier, in 2013 at NIPS. The paper combines contrastive learning and clustering with [5] in a way that solves our problem. First, the method does not require any negative samples, which fits perfectly into our setup, and second, and this is even more important, the contrasting is done via learned codebooks. Those clusters are forced to capture the concepts of the dataset in order to classify a view of a sample from the cluster assignment of a different view of the same sample. For more details, see [4].
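
For readers who want the gist in code, below is a condensed sketch of the swapped prediction idea as we understand it from [4,5]; it is a simplification, not a faithful reproduction of the reference implementation. The Sinkhorn step [5] turns the scores of one view into a balanced soft cluster assignment, which is then predicted from the other view, and vice versa:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iter=3):
    # balanced soft assignments via a few Sinkhorn iterations [5]
    q = torch.exp(scores / eps).t()               # (n_clusters, batch)
    q /= q.sum()
    k, b = q.shape
    for _ in range(n_iter):
        q /= q.sum(dim=1, keepdim=True); q /= k   # equal mass per cluster
        q /= q.sum(dim=0, keepdim=True); q /= b   # one unit of mass per sample
    return (q * b).t()                            # (batch, n_clusters)

def swapped_loss(z1, z2, codebook, temp=0.1):
    # z1, z2: (batch, dim) embeddings of two views; codebook: (n_clusters, dim)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    c = F.normalize(codebook, dim=-1)
    s1, s2 = z1 @ c.t(), z2 @ c.t()               # scores against the learned codebook
    q1, q2 = sinkhorn(s1), sinkhorn(s2)           # targets, computed without gradients
    p1 = F.log_softmax(s1 / temp, dim=-1)
    p2 = F.log_softmax(s2 / temp, dim=-1)
    # swapped prediction: the assignment of one view is predicted from the other view
    return -0.5 * ((q2 * p1).sum(dim=-1) + (q1 * p2).sum(dim=-1)).mean()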

So, it is a bit like AI Christmas for us, because the community was so kind to find the needle in our haystack for us. Furthermore, a by-product of the method is that it provides a lot of useful extra functionality:
– We can use the learned embedding for semantic retrieval [main goal]
– The concepts in the dataset are summarized in codebooks
– The soft assignment can be determined for unseen data and returns a topic distribution
– The cluster distributions can be used for semantic retrieval (Kullback-Leibler divergence)
– Individual codebooks can also be used as ‘classifiers’ to detect a particular concept
– And finally we can cluster the data topically with the soft assignments

Bottom line, this is a perfect example that clever ideas do not need to be complex at all. You just need a creative mind, or two, to combine existing approaches with a new ingredient to invent something new. We are thankful that the authors of [4,5] shared their work with the community, which allows us to stop tinkering with the problem and to continue advancing our idea.

[1] 2005.10242: “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere”
[2] 1902.01889: “Analyzing and Improving Representations with the Soft Nearest Neighbor Loss”
[3] raberrytv.wordpress.com/2021/03/13/optimal-transport-the-silver-bullet
[4] 2006.09882: “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”
[5] Sinkhorn distances: Lightspeed computation of optimal transport, NIPS 2013