[MASK] is All You Need

After all these years, dozens of trained (or maybe just tried) models, and millions of gradient steps, neural networks still seem like mythical creatures to us: very powerful on the one hand, but very mysterious with respect to their insides on the other. Classifiers are the best example: we assume that a model correctly learns to understand the data, but all it does is find the simplest patterns that explain the labels, and very often the simplest way is not to learn the real concepts but some shortcut.

Plus, there are so many pitfalls you can fall into, like, for instance, a recent tweet by Karpathy [1] regarding the learning rate. So, stating the obvious, if we had to predict [MASK], it would not be a single concept “token”, but a set of concepts:

  • Find the right optimizer, plus the hyper-parameters
  • Carefully evaluate how to schedule / adjust the learning rate
  • Decide about a stop condition and/or the number of epochs / steps
  • Choose a proper network architecture, but don’t be a hero
  • Come up with a loss function that really learns a useful representation
  • Create a dataset that is as clean as possible and know almost every detail about it
  • Hope that you have enough patience to jointly optimize all the points

We stress this again and again since way too often papers are missing a pitfall section that would help to apply the method to similar problems but different datasets. From our experience this is often a very time-consuming and painful process, even if you have the GitHub source repository.

Bottom line, no, [MASK] is not all you need, except maybe when it is a Swiss Army knife; otherwise it is very unlikely that there is a single silver bullet that solves all our neural problems. The question is whether research is focused too much on finding the holy AI grail instead of improving the pipeline by better understanding the individual steps.

[1] twitter:karpathy:1431380525759885313

Tales From The Attention World

The recent shift from attention to other methods might or might not be justified, but we still believe that attention is a very useful concept, also as an ingredient to explain what a network does. For our small-scale problems, the quadratic complexity of self-attention is no problem, also because we do not use the mha(x, x, x) mode, but the mha(q, x, x) mode which has a much lower complexity of |q| * |x|.

But let’s be more specific: We tested a slight variation of prototypical networks [1] where we replaced the averaging of the support set with attention. Let q be the query of shape (1, dim) and s be the support set of shape (K, dim). The vanilla approach derives the center of the support set by ‘center = s.mean(dim=0)’, which is known to work, but if classes are coarse, like ‘food’, the support set might contain different topics such as baking, barbecue and cooking. In this case it is not optimal to just average all samples; the center should instead be conditioned on the topic of the query. With attention the solution is as follows:


query = W_query(query)                     # (1, dim): project the query
support = W_key(support)                   # (K, dim): project the support samples
attn = softmax(query @ support.T, dim=1)   # (1, K): relevance of each support sample
centers = attn @ support                   # (1, dim): query-conditioned center


It resembles the classical multi-head attention, but we only use a single head and we do not project the output with a weight matrix. In other words, we determine the similarity of the query and the support set in order to keep only those samples that are relevant for the query. However, it might also happen that the support samples and the query do not share any fine-grained aspects besides being in the same class. But this is also no problem: if all scores are very similar, the softmax output degenerates into a uniform distribution, which is exactly like averaging the support samples.

So far for the theory, but the reality is not that rosy. It is known that lower layers in a big Transformer network might just perform averaging in the attention layer. Since our test network is shallow, we checked the output, and indeed the scaled dot product does exactly that: all the scores are small and very similar, which means the output is an almost uniform distribution, regardless of whether the query and the support are related or not.

To sharpen the distribution of the softmax, we L2-normalized the query and the support, which means we use the cosine score, and since the output range is [-1, +1], we further scaled the scores by a constant factor to transform the range into [-N, +N]. We verified that the modification worked by analyzing some query and support pairs, and indeed the softmax output was often peaked, or at least not uniform.
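
For reference, a minimal sketch of this cosine variant; the scale factor (here 10.0) and the choice to average the raw support vectors instead of the projected ones are our assumptions:

import torch
import torch.nn.functional as F

def support_center(query, support, W_query, W_key, scale=10.0):
    """query: (1, dim), support: (K, dim); W_query/W_key are nn.Linear(dim, dim)."""
    q = F.normalize(W_query(query), dim=-1)   # L2 normalization -> cosine scores
    k = F.normalize(W_key(support), dim=-1)
    scores = scale * (q @ k.T)                # (1, K), range [-scale, +scale]
    attn = torch.softmax(scores, dim=-1)      # peaked instead of near-uniform
    return attn @ support                     # (1, dim): weighted average of the raw support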

Bottom line, the MHA layer is pretty standard and an essential building block, but especially in customized and smaller networks, the reference implementation likely needs some tweaking. For example, the dropout step after the softmax is harmful in our case, since the support set is usually small and the chance that an important sample is dropped is therefore high. Also, to resemble the averaging step from the ProtoNet, the output is not projected with a W_value matrix, which ensures that query and center stay in the same embedding space. And last but not least, the scaled dot product is pretty useless for shallow networks and often needs to be adjusted.

[1] NIPS 2017, Prototypical Networks for Few-shot Learning

Attention: Drop Heads Not Dimensions

There are different ways to discourage your network from just memorizing the input data, or in machine-learning terms, to avoid overfitting. But for some architectures ‘1D’ dropout might not be the best solution. One such example is multi-head attention. There, a batch of data has the shape (batch, seq, heads, dim) and dropout is applied to the view (batch, -1) of the input, or in other words, every dimension in a row has the same chance to be dropped. However, if we want to regularize full heads individually, 1D dropout is not what we want.

In PyTorch there is also a 2D dropout version that applies dropout to each channel individually, where the expected input shape is (batch, channel, x, y). The length of the shape already matches, but we need to transpose the attention output to ensure that ‘heads’ comes second and not third.

x = input.transpose(1, 2)                    # (b, s, h, d) -> (b, h, s, d): heads become channels
x = torch.nn.functional.dropout2d(x, p=0.1)  # drops whole heads at once
x = x.transpose(2, 1)                        # -> (b, s, h, d)

If we assume 2D dropout with p=0.1, this means that with a 10% chance all (s * d) dimensions of a head are set to zero, and the sampling is repeated independently for each head.
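
Wrapped as a small module, this could look as follows (a sketch; the class name and the default p are ours):

import torch

class HeadDropout(torch.nn.Module):
    """Drops entire attention heads; expected input shape (batch, seq, heads, dim)."""
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        x = x.transpose(1, 2)  # (b, s, h, d) -> (b, h, s, d): heads act as channels
        x = torch.nn.functional.dropout2d(x, p=self.p, training=self.training)
        return x.transpose(1, 2)  # back to (b, s, h, d)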

This is related to the method described in [1]. The idea is to avoid that some heads dominate the attention during training while the other heads just rely on the ‘leader’ to get things right, whereas 1D dropout works on the unit level to avoid co-adaptation.

[1] Scheduled DropHead, acl:2020.findings-emnlp.178

N Layers is All We Need?

The other day we were joking [2] that the days of attention are numbered, and a few days later another paper [1] was published that showed that other methods can also solve the same problems without attention. As we mentioned before, we think attention is a great building block with respect to explainability, but the computational complexity is something we need to work on. Frankly, it feels a bit like walking along the edge of a circle where we have now arrived where we started, or stated differently, the insights gathered during our walk indicate that plain feed-forward MLPs with ‘spatial mixing’ are enough to solve all the problems out there. So, we are back at ‘MLPs Are All We Need’.

However, regardless of whether attention is required or not, we need a building block that can at least partly explain the decisions of a neural net. As a consequence, at the end of the network we need some scoring with respect to the ”input tokens”. When previous layers perform non-linear spatial mixing of these tokens, the scores obviously cannot be traced back directly to the input tokens, but this problem exists for Transformer architectures in general. So, for the sake of simplicity, we always consider ‘4 layer’ networks: an embedding layer, one mixing layer, an attention layer and finally the output layer for the prediction. For the mixing layer, the only constraint is that the shape of the input sequence is preserved, which is usually (batch, seq, dim). With this in mind, we could use [1,3] or any classical attention method, like [4].
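
To make the ‘4 layer’ idea concrete, here is a rough sketch of such a network; the names and the seed-attention details are our assumptions, and ‘mixing_layer’ stands for any module that maps (batch, seq, dim) to (batch, seq, dim):

import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self, vocab_size, dim, n_classes, mixing_layer):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)         # 1) embedding layer
        self.mix = mixing_layer                            # 2) any (b, s, d) -> (b, s, d) mixer
        self.seed = nn.Parameter(torch.randn(1, 1, dim))   # learnable query for 3) attention
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.out = nn.Linear(dim, n_classes)               # 4) output layer

    def forward(self, tokens):                             # tokens: (batch, seq)
        x = self.mix(self.embed(tokens))                   # (batch, seq, dim)
        seed = self.seed.expand(x.size(0), -1, -1)         # (batch, 1, dim)
        pooled, scores = self.attn(seed, x, x)             # scores: (batch, 1, seq), used for explanation
        return self.out(pooled.squeeze(1)), scores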

If we use a building block from a paper, it is worth thinking about it for a moment before we implement it as a PyTorch layer. Why? We are not sure what the design criteria for this block were, but we assume that some idea was verified by applying it to some problem. And that usually means some dataset is used and the goal is to train a model that generalizes well. Even if the goal is to design something that is applicable to a broad range of problems, the optimization of the design was likely done with respect to the conducted experiments, and that means with respect to the used datasets.

But let us not be vague here. We really appreciate the efforts and also that the authors share their results with the community, often with reference or pseudo code, but our point is that maybe the building block is too powerful for your problem. Very often a grid search is done for the hyper-parameters, but it is less clear how to “minimize” a layer design. Maybe we are still too foggy here, so let’s be more concrete:

This is the pseudo code from [1]:

shortcut = x
x = norm(x, axis="channel")
x = gelu(proj(x, d_ffn, axis="channel"))
x = spatial_gating_unit(x)
x = proj(x, d_model, axis="channel")
x = x + shortcut
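
For illustration, this is how we would translate the block to PyTorch; the details of spatial_gating_unit (split the channels, normalize one half, project it along the sequence axis and use it as a gate, with the spatial weights initialized near zero and the bias near one) follow our reading of [1], so treat this as a sketch rather than the reference implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial = nn.Linear(seq_len, seq_len)   # mixes along the sequence axis
        nn.init.zeros_(self.spatial.weight)          # start close to an identity gate
        nn.init.ones_(self.spatial.bias)

    def forward(self, x):                            # (batch, seq, d_ffn)
        u, v = x.chunk(2, dim=-1)                    # split the channels into two halves
        v = self.norm(v)
        v = self.spatial(v.transpose(1, 2)).transpose(1, 2)
        return u * v                                 # (batch, seq, d_ffn // 2)

class gMLPBlock(nn.Module):
    def __init__(self, d_model, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):                            # (batch, seq, d_model)
        shortcut = x
        x = F.gelu(self.proj_in(self.norm(x)))
        x = self.sgu(x)
        return self.proj_out(x) + shortcut
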
As said before, we do not imply that designs happen by accident, but as stated in [1], “it is still unclear what empowers such success” (meaning the Transformer architecture), and the same is true for new methods that provide no theoretical analysis to show that every operation is really necessary.

Furthermore, all those architectures were designed with scaling in mind, which means nobody thinks about datasets with just hundreds or thousands of examples. It is true that models should be reused as often as possible to avoid wasting energy, but often fine-tuning or distilling a larger model requires more energy than training the right model on a smaller dataset. And the right model implies a network architecture that minimizes the required FLOPs.

The point is that we believe that building blocks are still too coarse to be used as a recipe for general problems, especially if the problem is tied to a non-large dataset. That means if time and space are not a problem and the dataset is considerably large, those building blocks often already provide a very strong baseline. However, it is not clear which parts of the blocks are really required and which parts could be removed without sacrificing anything. This is very much related to questions like: do you really need 768 dims, or are 384 enough?

Why do we think this is so important? Most of the companies out there have limited resources, time and money, and if your network could deliver the same decision in half the time or space, shouldn’t this be preferred? Model distillation or quantization might be applied here, but we think it is still important to research how to optimize network architectures with respect to a budget.

[1] Pay Attention to MLPs, arxiv:2105.08050
[2] raberrytv.wordpress.com/2021/05/08/the-fall-of-attention/
[3] MLP-Mixer, arxiv:2105.01601
[4] Fast Autoregressive Transformers with Linear Attention, arxiv:2006.16236

Token Interactions: Mixing, Lambda and Attention

Yesterday we tinkered with LambdaNetworks [1,2], but we were unable to achieve the same accuracy when we just replaced the kernel attention with the lambda layer. Furthermore, due to the seed attention, we were also unable to utilize the full potential of those layers.

To be fair, we wanted to at least measure the gain when we use full self-attention (n_tok, n_tok) instead of (n_seed, n_tok). And the gain was quite noticeable. The test is simple: we generated a random batch of shape (32, 512, 64) = (n_batch, n_toks, n_dim) and fed it to each layer:
(1) kernel attention: ~283 msecs
(2) lambda layer: ~41 msecs
which is a speed-up of almost a factor of 7, quite a lot. And for longer sequences, like 786, the factor is almost 10.
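
The timing itself was done roughly like this (a sketch; ‘kernel_attention’ and ‘lambda_layer’ stand for the two module instances and are not defined here):

import time
import torch

x = torch.randn(32, 512, 64)               # (n_batch, n_toks, n_dim)

def timeit(layer, x, repeats=10):
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(repeats):
            layer(x, x, x)                 # self-attention mode: (n_tok, n_tok) scores
        return (time.perf_counter() - start) / repeats * 1000  # msecs per call

# print(timeit(kernel_attention, x), timeit(lambda_layer, x))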

Then we stumbled upon [3], which replaced full attention with Fourier transforms for performance reasons. The trend to work on alternatives [2,3,4] to attention is noticeable, and we do not mean speeding up attention, but replacing it with some mixing function. For our rather modest problems, we focus on CPU-efficient methods and smaller models, so we thought about a way to keep the advantage of explainable decisions in combination with token-to-token mixing, instead of just seed mixing.

The idea is quite simple. We keep the kernel attention at the top of the network, to be able to analyze what tokens contributed to the decision, and we use a lambda layer after the token embedding layer for mixing. This allows the tokens to interact with each other.

batch                                    # (n_batch, n_toks, n_dim)
emb = dyn_word_embed(batch) + position()
m = lambda(emb, emb, emb)                # <-- NEW: token-to-token mixing
x = ln(emb + dropout(m))
y = att(seed, x, x)
[..]

And now, in contrast to the drop-in replacement [1], there is a gain with respect to the precision of the model. But the comparison is not fair, since the new model is more powerful due to the additional parameters. Nevertheless, since the overhead, both in terms of time and space (number of extra parameters), is minimal, we keep the new architecture, since it is stronger than our baseline.

Bottom line, at least for AI, the year 2021 seems to be very interesting and productive and it does not feel that we reached the peak yet.

[1] raberrytv.wordpress.com/2021/05/14/lambdanetworks-a-quick-glimpse
[2] LambdaNetworks, arxiv:2102.08602
[3] Mixing Tokens with Fourier Transforms, arxiv:2105.03824
[4] MLP-Mixer, arxiv:2105.01601

LambdaNetworks: A Quick Glimpse

We still believe that attention is the way to go, since it allows us to partly explain decisions made by neural networks and the mechanism is also biologically plausible. However, the quadratic computational complexity is a big problem when a lot of tokens are involved. There are alternatives, but they come with drawbacks; for instance, linear attention does not allow us to analyze the attention maps since they are never explicitly calculated.

While we went through the accepted ICLR 2021 papers, we stumbled upon LambdaNetworks [1], whose aim is to model interactions over long distances, but without attention. Since the method is simple to implement, at least in our case, we decided to compare the results with our current baseline, which is kernel attention.

The setup is straightforward: The training dataset contains 150K samples and we have seven labels to predict. We use hash-based embeddings to be able to embed all tokens, even those not seen during training. The encoder is a seed-based Transformer where we learn a ‘seed’ query which is used to attend over the input tokens. The loss is the categorical cross-entropy. Since the interface of lambda layers is identical to attention layers, we fixed all hyper-parameters and just replaced the kernel attention with the lambda layer. Frankly, since we have no experience with the layer, tuning or improvements are not so easy. There are some hints in the paper [1] to normalize query/value embeddings after the projection, but we started with the vanilla code from the paper. What follows is no detailed evaluation; we just wanted to check if the layer can be used as a drop-in replacement with respect to performance, both time and precision.

As expected, the PyTorch implementation did not take much time, thanks to the pseudo code from the paper and the simplicity of the method. Building a module with the same interface took no more than 15 minutes.
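
To give an idea of the shapes involved, here is a stripped-down, content-only variant without position lambdas and without multi-query heads, so it is not the complete layer from [1]:

import torch
import torch.nn as nn

class ContentLambdaLayer(nn.Module):
    def __init__(self, dim, dim_k, dim_v):
        super().__init__()
        self.to_q = nn.Linear(dim, dim_k, bias=False)
        self.to_k = nn.Linear(dim, dim_k, bias=False)
        self.to_v = nn.Linear(dim, dim_v, bias=False)

    def forward(self, query, context, _value=None):       # same (q, x, x) call style as attention
        q = self.to_q(query)                               # (b, n, k)
        k = torch.softmax(self.to_k(context), dim=1)       # normalize over the context positions
        v = self.to_v(context)                             # (b, m, v)
        lam = torch.einsum('bmk,bmv->bkv', k, v)           # content lambda, no (n x m) matrix
        return torch.einsum('bnk,bkv->bnv', q, lam)        # (b, n, v)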


First, it should be noted that the advantage of the lambda layer is more visible in the case of full self-attention, since in our case the similarity matrix is not of shape (num_toks, num_toks) but (num_seeds, num_toks) and num_seeds equals one. Thus, we have no quadratic complexity, but a linear one. And second, since both methods use the same number of parameters, there was no additional challenge of having to learn more parameters.

The runtime per epoch was about 5% slower with the lambda layer, but the implementation can surely be optimized. What caught our attention was the very slow learning with the vanilla version. For the kernel attention the loss was ~1.35 at epoch 6, while it was ~1.72 for the lambda layer. Since we had noticed before that the softmax is not always the best choice for similarity, which is the reason we switched to kernelized attention, we replaced the softmax function with a kernel. This helped a lot: the loss is now ~1.42 and the learning is also much faster. But then we noticed that after a number of steps the progress stopped and the model did not improve any further. We tried different settings, but the problem remained, compared to the kernel attention where the progress is visible much longer before it converges and stops. The normalization mentioned in the paper did not help either. But as mentioned before, our goal is mainly to explore new directions and not to do a fair comparison of the methods.

So, in our special case, there is no gain in using the vanilla lambda layer, since it is a bit slower and the final accuracy is lower compared to the kernel attention layer. However, the method has proved to be very powerful for other problems, so we are curious why it does not work in our setting. But since our baseline is already very strong, there is no point in diving deeper into the problem, also since the method does not let us analyze attention maps, which we currently use for debugging.

[1] LambdaNetworks, arxiv:2102.08602

The Fall of Attention?

The ‘X is All You Need’ style should not be taken too literally, and self-attention is also surely not the end of the neural road, which means it makes sense to follow other paths to explore what else is out there, like MLP-Mixer [1]. But let us take a moment to think about the status quo. First, the concept of attention is biologically plausible, and second, it more easily allows us to explain what a model has learnt and how it arrived at its final decision. Not to forget that those networks do a very good job of solving various NLP tasks.

For example, take a very simple attention model where a single learnable seed is used to attend over a sequence of tokens and we just use one head. Then we can easily analyze what tokens are used for the final decision. With more heads, things get complicated, and surely not all heads can be easily summarized; some might seem totally chaotic, but very often there are heads that act as understandable pattern detectors. But to be frank, even with attention layers, large neural nets remain black boxes and it does not seem realistic to understand all the internals of a model with billions of parameters. So, in our humble opinion, attention helps at least to shed a bit of light into the black box.
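
To illustrate only the analysis step (with untrained, random weights; the token list and shapes are made up, a real model would of course be trained first), reading off the explanation is just a matter of inspecting the attention weights:

import torch

dim, seq_len = 64, 12
tokens = ["the", "cake", "recipe", "needs", "more", "flour"] + ["<pad>"] * 6  # toy input
x = torch.randn(1, seq_len, dim)          # (batch, seq, dim): token embeddings
seed = torch.randn(1, 1, dim)             # (batch, 1, dim): the learnable query
attn = torch.nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

_, weights = attn(seed, x, x)             # weights: (batch, 1, seq), sums to 1 over the tokens
top = weights[0, 0].topk(3)               # the tokens the decision relies on most
print([(tokens[int(i)], round(float(w), 3)) for w, i in zip(top.values, top.indices)])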

The point is that we have just started to use models that allow us to at least partly explain their decisions and reveal some of the patterns they have learned, which helps a lot to ‘prove’ that a model did not learn only bogus patterns from the dataset. However, despite all the success stories, we are still at the beginning of our journey, which means we need more research to better understand this powerful tool. We greatly welcome other lines of research, but the question is: should we replace attention with just another black box, because the new approach is competitive with respect to some benchmarks? For example, in [1] it is stated that “[..] neither of them are necessary” (meaning convolution and attention) and a new architecture is introduced that just uses vanilla MLPs.

So, when a new approach needs roughly the same amount of training time and data, is not more efficient with respect to the required computation, and we further sacrifice transparency and explainability, is it worth pursuing? This might be a bit too philosophical, but we should keep in mind that training very large models consumes an immense amount of energy, and thus our question is: how do we conduct responsible AI research with respect to the environment, transparency of decisions, alternative lines of research and long-term goals?

[1] MLP-Mixer, arxiv:2105.01601

Word Embeddings: Why Hashing Helps

Regardless of the underlying model, word embeddings are required to learn representations for NLP problems. At the beginning of the NLP era, a full embedding matrix was used, which means a vector for each word. The drawback is that a large vocabulary was responsible for most of the trainable parameters of a neural network. Subword information, at least in combination with a full embedding for the known vocabulary, does not help with respect to the parameters, but it allows unknown words to be embedded as a sum of n-grams. Then came byte pair encoding (BPE) and friends to reduce the number of required embedding parameters, which helped a lot. But especially for morphologically rich languages those methods are still not optimal, because BPE cannot embed tokens when all their segments are unknown.

To be frank, we have no solution for the problem, but for some tasks it helps to use a dynamic embedding based on hashing the input token [1]. The approach introduces a fixed number of trainable parameters that do not depend on the vocabulary size. The advantage is that any word can now be embedded, but since the parameters are shared across all words, the expressive power of this embedding is limited. For comparison, for 10K tokens and 100 dims, one needs 1M parameters, while the hash-based approach just needs num_bits * 100 parameters, where num_bits might be 200, which equals 20K.
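
One possible variant that matches the parameter count above (num_bits * dim); the specific hashing scheme here, k hash functions each activating one of num_bits rows, is our assumption and not necessarily the exact method from [1]:

import hashlib
import torch
import torch.nn as nn

class HashEmbedding(nn.Module):
    """Vocabulary-free embedding: num_bits * dim parameters, any token can be embedded."""
    def __init__(self, num_bits=200, dim=100, k=8):
        super().__init__()
        self.num_bits, self.k = num_bits, k
        self.bits = nn.Parameter(torch.randn(num_bits, dim) * 0.02)  # shared for all tokens

    def token_to_bits(self, token):
        # k hash functions -> k (not necessarily distinct) active bits per token
        return [int(hashlib.md5(f"{i}:{token}".encode()).hexdigest(), 16) % self.num_bits
                for i in range(self.k)]

    def forward(self, tokens):                    # tokens: list of strings
        idx = torch.tensor([self.token_to_bits(t) for t in tokens])  # (n, k)
        return self.bits[idx].mean(dim=1)         # (n, dim): mean of the active bit vectors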

The hash-based approach also has the advantage that rare tokens have no(!) separate embedding, which is a problem for full embeddings, since during training those tokens are not visited very often and thus their embeddings do not get many updates. So it is a trade-off between being universal and being expressive, but at least for classification tasks the dynamic embedding often delivers very strong results, since not every token needs a meaningful embedding.

For example, one problem we had to tackle is to classify a sequence of free-form tokens into a type. The tokens can be practically anything, either words or just single characters or a combination of both. Both full embeddings and BPE would likely fail, since the frequency of pairings is not sufficient to derive splits, and with full embeddings there would be dozens of out-of-vocab tokens. The hash-based method allows us to embed literally anything, and with the weight sharing, tokens with lower frequencies also do not suffer from the ‘low vector norm’ problem of full embeddings.

Bottom line, whenever the word embedding itself is not directly used, dynamic embeddings are a very efficient way to handle out-of-vocab problems and to reduce the model size, often with no drawbacks at all. Especially for problems on the character level, they are very efficient compared to RNNs, which need more parameters and are also sequential in nature, which means both more time and space is required.

[1] raberrytv.wordpress.com/2020/10/03/dynamic-word-embeddings/

HowTo: Load H.264 Videos With dvbcut

Almost a decade ago, in September 2012, we began our work on a personalized TV app. The project is like IPv6, which means a rather long-term project. However, we still learned a lot about Linux + DVB-S which was very useful back then. Nowadays things are a little easier, but when it comes to editing movies recorded via DVB-S2, not that much has changed. There are video editors, but if all you want is a cutter, those apps feel a bit overloaded for the task.

A little elaboration first: We modified gnutv to support H.264 streams, which still works like a charm. The output of the gnutv program is an MPEG-TS container that can be viewed with mplayer/mpv without any problems. And for ordinary MPEG-2 videos in the container, dvbcut is the first choice if you just want to quickly edit a movie to remove some parts or to adjust the beginning and/or end.

So far everything is fine, but since nowadays people record stuff in high definition, namely H.264 and higher, those media cannot be loaded into dvbcut. We did some quick research about the status quo, but found no real solution. Instead, we use ffmpeg, the Swiss Army knife for media, to convert the file.

The number of options for ffmpeg can be intimidating, but all you have to do is:

ffmpeg -i your_h264_media_file.mpeg -c:v mpeg2video -qscale:v 2 -c:a mp2 -b:a 192k /tmp/output.ts

The last part is the audio: MP2 at 192 kbit/s, and the first part re-encodes the H.264 video into the good old MPEG-2 format. But since the operation is no simple copy, it takes a lot of time for longer videos and thus does not come for free.

Bottom line, the proper solution would still be to implement the feature directly in dvbcut, but since we have no time for this, the only option we are aware of is this lengthy kludge.

Contrastive Learning By Maximizing Mutual Information

Learning useful representations without labels has gained a lot of attention recently, for NLP with Transformers, but also in the image domain. However, as pointed out in previous posts, like [3], most of the methods require a large batch size to ensure a sufficient number of negative examples. The approach in [2] uses latent cluster assignments combined with optimal transport to avoid negative samples altogether. However, despite its simplicity, the training is not always straightforward and one needs to fiddle with hyper-parameters and likely also with the network architecture.

But since the idea of the method can be expressed in many ways, we do not need to explicitly use the optimal transport + swapped assignments design. This made us think about mutual information, which has also been used in relation to contrastive learning. For instance, in [1] the authors presented an approach to train a Part-of-Speech (PoS) tagger in an unsupervised way. The idea is to relate ‘past’ and ‘future’ information via mutual information. Details can be found in [1] or in the referenced GitHub code.

The adaptation to our problem is then straightforward. All we need to do is create two views of a sample and assign the first one to ‘past’ and the other one to ‘future’. The loss derived in the paper is semantically similar to the one in [2], with the exception that we do not need to swap anything. The advantage of this objective is that we do not need a separate clustering step and thus no extra hyper-parameters. All we need is the softmax temperature, which can be kept from the other approach. And last but not least, we can still use the procedure to derive soft cluster assignments for new data, because the approach from [1] also learns a set of ‘clusters’.
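
This is not the exact objective from [1], but an illustrative stand-in that captures the setup described above: both views are softly assigned to a shared set of learned ‘clusters’ and the loss encourages the two distributions to agree (the names and the temperature value are ours):

import torch
import torch.nn.functional as F

def two_view_agreement_loss(z1, z2, clusters, tau=0.1):
    """z1, z2: (batch, dim) embeddings of the two views; clusters: (n_clusters, dim), learnable."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    c = F.normalize(clusters, dim=-1)
    log_p1 = F.log_softmax(z1 @ c.T / tau, dim=-1)   # soft cluster assignment of view 1
    log_p2 = F.log_softmax(z2 @ c.T / tau, dim=-1)   # soft cluster assignment of view 2
    # symmetric cross-entropy: both views should agree on the cluster assignment;
    # note: MI-style objectives like [1] add an entropy term over the cluster usage
    # to avoid collapsing onto a single cluster.
    return -0.5 * ((log_p2.exp() * log_p1).sum(-1).mean()
                   + (log_p1.exp() * log_p2).sum(-1).mean())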

For our smaller test dataset, the approach from [1] was able to perfectly recover the labels by using only pairwise similarities. A t-SNE plot also confirmed that the learned representation is able to both group related content and separate samples with different concepts.

Bottom line, it is not reasonable to assume that we can solve all our problems by just finding the right loss function, but thanks to all the available papers, we are slowly getting more and more insights into how we can tackle the remaining hurdles, one by one.

[1] “Mutual Information Maximization for Simple and Accurate Part-Of-Speech Induction”
[2] “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”
[3] raberrytv.wordpress.com/2021/03/20/findining-similarities-not-differences/