Oldies But Goldies: Part 1, SCP

This a non-ML post that might be categorized as ‘silly mistakes’ or just bad luck. The Secure Shell Protocol was introduced in 1995 and is now the standard to work on remote systems. For decades we never had any trouble with it until a recent update of a server that is using Debian 11 (server B).

The problem: We want to copy data from server A to server B with scp.
“subsystem request failed on channel 0”
However, auto-completion still worked, so it is clearly no general SSH problem.

The problem is that after an apt update, scp used the SFTP subsystem by default and not the older SCP protocol any longer. But since SFTP is disabled on server A, we get the error. The manpage of scp does not explicitly mention that SFTP is now the default which caused the problem.

The solution is simple the ‘-O’ option of scp forces to use the older SCP protocol. With this option copying files works again without any problems.

Bottom line, it would have been nice in general that new defaults are also documented in the manpage, since users hardly read change logs of all updated packages. But also that it is a good idea to thoroughly read the documentation instead of just stack-overflow-ing the problem.

Contextual Prototype Memory

The ability of humans to recognize concepts after just seeing a couple of exemplars is remarkable. In case of neural nets, most architectures are still very data hungry and need hundreds of examples to reliably recognize a concept. There surely is lots of progress, but one tendency is to just improve a model by increasing the size of it and to feed-in more data. The idea of using prototypes in neural is nothing new. The approach in [1] is surely not the first one, but very simple and nevertheless still very robust approach. The idea of [1] is to define a prototype by averaging feature representations. The the prototype can be seen as a kind of memory to classify unseen exemplars according to the distance to the prototype. The key to do this is metric learning to separate unrelated content in the representation space. However, averaging is not always the best choice due to possible in-class variance.

In [2] a prototype memory is created by using a recurrent neural net. The idea to use an RNN allows us to create a stateful memory that is not just averaging, but is a non-linear function of the timesteps. But the problem is that the context vector generated by the RNN is depending on the order of the input.

What do we mean by this: f(x1, x2, x3) != f(x1, x3, x2).

This does not have to be a disadvantage in case that the order of the input important, but if we just want tosummarize exemplars by a prototype (vector), the order should not matter. For example, if we want to recognize the concept of “mugs”. We are feeding in dozens of feature representations of mugs into our memory function M. Then the final output of the M should not depend on the order of the input, it should just summarize the concept as a single vector. This is obviously not possible with any RNN and to train the network to ignore the order of the input by permuting it to generate the same output is very prohibitive and also not reliable. Not surprisingly, the answer is the Transformer, or to be more precise any permutation equivariant network model.

Intuitively, to ignore the order of the input makes sense, since if we have for instance three images of mugs, for a human it does not matter what mug we look at first, as long as have a chance to take a look at all of them. What we actually store as a pattern to recognize a mug is a bit objective, or stated differently, what we attend to, but we surely look forth and back to generate pattern. In terms of RNNs, it only allows to look forth and in case of a bidirectional RNN, also back, but even then the output still depends on the order of the input.

So, what we need is a set-based function to summarize variable-length input with a fixed-length output. In case of Transformers, we just need to get rid of the positional encoding to ignore the order of the input. What is missing is the ability to summarize the input, since with a vanilla transformer the output size equals the input size. One solution is to use latents L and cross attention to summarize the input X: Y = cross_attention(L, X). The idea is to select (different) tokens of X per latent, or stated differently, each latent should focus on different aspects of the input. Then the output Y of shape (num_latents, dim) can be either flattened into a vector of shape (1, num_latents*dim) or averaged like in [3].

Bottom line, instead of just averaging set a of samples to generate a prototype that gives equal importance to each sample, an attention-based pooling method can be used to allow the network to select relevant parts of the input to generate the prototype. The method is agnostic to the order of the input and also to the type of input. For instance, it can be used for images or text as long as the input can be tokenized.

[1] NIPS 2017: Prototypical Networks for Few-shot Learning
[2] ICLR 2021: Wandering within a World: Online Contextualized Few-Shot Learning
[3] ICML 2021: Perceiver: General Perception with Iterative Attention

Applying Weight Decay On Model Parameters

For PyTorch, a setting like this:
opt = torch.optim.RMSprop(model.parameters(), lr=1e-4, weight_decay=0.1)
applies the weight decay on all parameters, including biases.

First, since biases have much fewer parameters, they are less prone to overfit. But second, if the weight decay is very high, it might drive those parameters to zero, which is especially a problem for nn.LayerNorm, since then layer normalization:
ln_x = scale*x + bias
also drives the input x to zero.

For this reason, the optimizer usually has two groups:
(1) parameters that should be decayed
(2) parameters without any decaying

It is not hard to come up with some code, but depending on the shape, some bias values might still be decay. Let us assume the if statement is like that
'bias' in parameter_name or len(param.shape) == 1
Seem fine, right? Yes, but it does not match for parameters with this setting.
cls_token = (1, dim)
since the shape is a dummy axis and no bias in it.

Maybe the decay was on purpose, but if ever a bias is of shape (1, dim) and has no bias in its name, the check will fail.

With a slight variation, those parameters will be added to the no decay group:
if len(param.squeeze().shape) == 1

Bottom line, tensor shapes can be a real pain in the neck and even such a trivial function to split the parameter into groups can fail because of shape variations. In our opinion PyTorch should at least provide an optional to exclude bias parameters from weight decay, so the extra function is not required. But still, this does not help if a parameter is still a bias but with a different name and/or shape. So, one learnt lesson is definitely to check your parameter shapes twice.

Directly Optimize IR Metrics

It would be helpful to have an AI system that at least allows to retrieve information form a large corpus of data with a simple query. In other words, information retrieval (IR), a classic ML topic. And even without training, methods TF-IDF or BM25 provide a solid baseline.

But we would like to directly optimize some IR metric. A short recap: we want to learn a ranking function that orders a set of documents X given some arbitrary query Q. We decided to take DCG [1] because it measures the quality of a ranking.

In our case, the relevance is just a binary vector {0,1} where 0 means unrelated
and 1 means related to the query:
DCG = sum(i=1, N) relevance[i] / log_2(i + 1)
The idea is simple: when the position of only related content comes first, the gain is maximized.

Now the function does not depend on some input and is thus hardly differentiable. So, let us redefine it:
DCG = sum(i=1,N) relevant[i] / log_2(rank(x[i]) + 1)
where rank is the position of document x in the result. This still does not sound very continuous, so let’s try again. The ideas come from [2].

We have a query Q of shape (1, dim) and some documents X of shape (N, dim). First we determine the scores:
scores = Q @ X.T (of shape (1, N)
To determine the rank of the first element, we do:
s_0 = scores - scores[0] rank_0 = 1 + sum(s_0 < 0)
But this is still not differentiable, darn it. So, we need to approximate the step
function with a differentiable function, like the sigmoid that is S-shaped.
scale = 1/0.05
approx_rank_0 = 1 + sigmoid(s_0 * scale)

With proper scaling the approximated ranking function returns a ranking [1, …, N] based on the similarity of each document with the query.

And now, we can differentiate the DCG metric which means, we can directly optimize it. To make things a bit easier, we normalize the metric by just using the relevant entries in the optimal positions (maximum).

norm = sum(j=0,|POS|, i in X[pos]) relevant[i] / log_2(1 + j)
DCG_n = DCG / norm

Bottom line, whenever we replace some function with a continuous approximation for a IR metric that depends on the data, we can directly optimize it. Like Smooth AP [3]. In contrast to other loss functions, this objective does not really care for similarity values as long as the ordering of the results is correct. Of course in real-world applications we would need some threshold for filtering, but in most cases, the ‘search engine’ would anyway just present the top-k results and other results just on demand.

A closing remark: we are wondering why it took 12 years for the topic to get some more attention. Probably because neural nets needed an AlexNet to get more attention ;-).

[1] en.wikipedia.org/wiki/Discounted_cumulative_gain
[2] A General Approximation Framework for Direct Optimization of Information Retrieval Measures, 2008
[3] Smooth AP, arix:2007.12163

Let’s Make AI Green (Again)

With our very modest computational machinery, we search for an AI architecture that is both lightweight and versatile. That is why the tendency of AI to improve models just by using bigger networks and more data is a bit concerning. The energy required to train and run those models is quite a lot, not to talk about the problem that only very big companies can afford to train those models and therefore decide how others can use them.

It is also a problem that the Transformer architecture that is now used almost everywhere has a quadratic complexity with respect to the size of the input. A lot of variants have been proposed, but the real-world models out there, are mostly using the vanilla Transformer architecture. There are lots of variations that address the issue, like [1], but since the method does not generate the similarity matrix, it is also hard to ‘debug’ possible attention patterns.

Even if a method is working very well, we advocate to follow also other, maybe even, non-promising research directions, since it is unlikely that the Transformer is the end of the AI road. The fact that papers like [2] exist, confirms that other researcher have similar thoughts. To stress this point, we do not favor any architecture, we just want a neural that is “getting things done” efficiently and it does not matter if it is an FFN, RNN, Transformer or ConvNet. But since [2] aims for fixed-size input which is problematic for NLP, which is our domain, we did not pursuit this line of research. But not a year later, [3] introduced a method to work with variable-size input and the method is also called “green”. So, good news everyone, we have an green AI architecture to deliver.

We implemented the method in PyTorch which is very simple if you get the axis right. Instead of using a fixed weight matrix of shape (N, dim), where N is the sequence length, we generate the matrix by a ‘hyper network’. This allows us to process inputs of any length. And the complexity is linear, O(n), with respect to the input and not quadratic, like in Transformers. So, problem solved and we are done here?
We created a modular neural net with PyTorch that allows easily to replace the ‘mixing’ layer by just following a defined interface. In other words, we can test vanilla Self Attention, Linear Attention and Hyper Mixer by just replacing one line. It should be noted that we did not optimize any particular method. We used the reference implementation of [1], [4] for Self Attention and our own implementation of [3] which just consists of matrix multiplications and transposing operations.

We discarded vanilla Transformers from the evaluation, since preliminary tests with respect to the runtime for longer sequences, on commodity hardware(!), showed that it did not scale for the use-case we are currently working on. It should be noted that our implementation of [3] might not be efficient. But for all experiments we conducted, [1] clearly outperformed [3] with respect to the loss function and also the CPU runtime. We did a lot of tests to check if our implementation of [3] is buggy, but since learning is clearly visible, just slower in terms of CPU-time and the loss function, we concluded that the code is correct, just not proper for the problem at hand.

So, let’s get back to green AI. We did not formerly analyze the FLOPS required of [1] and [3], we just measured the CPU-time of both methods to train a model. Both per epoch and the total time until the model “converged”. And both Linear Attention and Hyper Mixer seem to have a very similar complexity class, with respect to the timing information. This might change if you train with ‘thousands of dimensions’, but in our case Hyper Mixer is just as green as the Linear Transformer [1].

Bottom line, since Transformers are pretty dominant right now, the quadratic complexity should be definitely addressed to speed-up computations which saves energy and reduces the carbon footprint. However, even if our experiments are very modest, for our kind of NLP problems, the Transformer still performs best. But as we mentioned earlier, we do not believe that the Transformer is the end of the AI road which is why we encourage researches to explore other lines of research and welcome alternatives like [2,3].

[1] Fast Autoregressive Transformers with Linear Attention, arxiv:2006.16236
[2] MLP-Mixer: An all-MLP Architecture for Vision, arxiv:2105.01601
[3] HyperMixer: An MLP-based Green AI Alternative to Transformers, arix:2203.03691
[4] github.com/karpathy/minGPT

Avoid Collapsed Token Representations

When neural networks are trained supervised, the actual feature representation is not important as long as the network is able to correctly predict the correct class with the ability to generalize to unseen samples. Then, the shape of the learned embedding space by the network is not interesting, if the network is not used for other tasks and later fine-tuned.

But the situation changes, for instance in case of seq2seq learning where each state of the encoder is used. As demonstrated in [1], when the encoder is some RNN, each hidden state should represent a token and should thus be discriminative with respect to other tokens. However, it is possible that the states are highly correlated which means they are all very close to their mean. The name of this metric is conicity and can be understood as a the “spread” of the vectors. A conicity of 1 means they are perfectly aligned.

Why is this a problem? In case of attention, for example, it means that the scores with respect to some context vector are all very similar which is not useful to pick only the most relevant tokens. A second example is when a recurrent net is used to “contextualize” a word embedding, since it means that the tokens (states) are all very similar and not very discriminative.

Bottom line, whenever the individual token representation is used for a task, it should be a distinct entity that can be clearly discriminated against other tokens. This is also relevant for generative models since if tokens are anisotropic [2], which means a very high conicity, diversity is hurt and the output might even degenerate.

To force a model to learn distinct token representation, either the conicity-based loss from [1] could be used, or the contrastive one proposed in [2].

For the contrastive version, the loss is formulated to enforce that each pair of token has at least some distance, given by a margin. Example in PyTorch notation:
dists = torch.pdist(states)**2
loss = torch.clamp(margin - dists, min=0).mean()
In other words, if the distance is larger than the margin, the loss is zero, otherwise a loss of (margin – dist[i]) is propagated for the state.

[1] Towards Transparent and Explainable Attention Models, 2020.acl-main.387
[2] A Contrastive Framework for Neural Text Generation, arxiv:2202.06417

Causal Convolution (1d)

With the rise of Transformers, Attention is omnipresent and in case of language modelling, causal attention is required to forbid that the current token can take a look into the feature. In other words, if X is a sequence X = [x1, x2, x3, x4] of tokens, the token x3 should have only access to x3, x2 and x1 but not x4. This is done by masking dot products for the softmax which drives those scores to zero.

This is not new, but we recently tried alternatives since Transformers are notoriously data-hungry and not trivial to train. We stumbled, again, across QRNNs [1]. The concept is not new, but it gained attention again, pun intended, with the introduction of pQRNN [2]. The implementation of Quasi-Recurrent Nets is straightforward, except for the ‘causal’ mode which requires an operation also named masked convolution [4]. The idea is the same as for transformers, a convolutional filter should only have access to the left, but not the right window which is equivalent to the future.

Despite the pattern in NLP, there is no ready-to-use PyTorch layer for it, but thanks to the community, like [3], the implementation is actually very easy. All you have to do is to understand the concept of dilated convolutions and padding. Then one possible implementation could look like:

class MaskedConv1d(nn.Module):

def __init__(self, in_chann: int, out_chann: int, kernel_size: int):


self.padding = dilation * (kernel_size - 1)

self.conv = nn.Conv1d(in_chann, out_chann, kernel_size, padding=self.padding, dilation=kernel_size - 1)

def forward(self, x):

return self.conv(x)[:, :, :-self.padding]

It should be noted that the authors of [1] also gave hints how to perform this operation: “This concept, known as a masked convolution[…], is implemented by padding the input to the left by the convolution’s filter size minus one.” in [1], Section 2, “Model”, page two. Other variants are possible, but we used the ideas from [3], since the implementation is simple and the overhead is negligible.

All other required tensor operations for QRNNs are provided by PyTorch and the implementation is also very simple, but that is a different post.

[1] Quasi-Recurrent Neural Networks, arxiv:1611.01576
[2] ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html
[3] discuss.pytorch.org/t/causal-convolution/3456
[4] Pixel recurrent neural networks, arxiv:1601.06759

Vector Quantization: Avoiding Codebook Collapse

The idea to learn codebooks that capture semantic aspects of the input data, like local concepts in images, or to group tokens in case of text is not new, but still challenging. Why? The issue is that too often the codebooks collapse which means that all tokens are assigned to only one or very few codebooks and the other ones are ‘dead’ and do not receive any updates. With ongoing training the there is a reinforcement effect, since the active code books receive gradient updates, the norm is growing and the dead ones are unaltered which lowers the chance that a token selects it.

A similar phenomenon can be see for mixture of experts where all the samples are assigned to very few or in the worst case, just one, expert. The solutions for the problem is usually to introduce some random noise and to introduce a loss that encourages a more uniform distribution across experts / codebooks. However, this requires careful tuning of the scaling parameter and the amount of noise to add and sometimes the training can still become unstable.

We experimented a lot, since a new dataset means to tune again from scratch, which is time consuming and usually also very painful. So, are there other methods that are more reliable and maybe easier? After some research, we found the paper [1] which addresses the issue by random restarts of codebooks. This helped a lot and with ongoing training, the frequency of restarts got lower and lower as a sign that training stabilized. More Details can be found in [1, Section 3.1 “Random restarts for embeddings”].

[1] arxiv:2005.00341 [Jukebox: A Generative Model for Music]

Thunderbird – Problem With POP3 Accounts

This is not AI-related, but since it took quite some time to fix this issue, we thought that it would be nice share the knowledge to avoid that other users are also spending hours with this ‘bug’.

The upgrade of Thunderbird from 7x to 9x introduced a lot of new features, but there were also lots of forum posts saying that POP3 accounts were unable to fetch mails any longer. There is just a message in the status line that it connects to the server but it does nothing and “loops” forever.

The support linked the problem to the TLS version, but this fix did not solve the problem in our case. The adjustment of the minimal TLS version with the config editor to 1/2 also did not solve the problem.

The idea was to create a new profile and see if it works with the default settings and indeed it does. So it must be somehow related to the configuration.

Long story short, the following option solved the problem:
[x] ‘Check for new messages at startup’

It is a bit crude that the option itself does not work, since when you start Thunderbird it does NOT fetch any mails, but now when you trigger manually ‘Get messages’ it works.To verify that this is indeed the problem, the option was unchecked, Thunderbird was restarted and it did not work. Check it again, restart, it works.

Since the Thunderbird version is no vanilla version the issue might be fixed already, but we thought
it would be at least nice to share the knowledge in case some users still have the some problem.

[MASK] is All You Need

After all the years and dozens of trained, or maybe tried models, and million of gradient steps, neural networks still seem like mythical creatures to us. They are very powerful on the one hand, but also very mysterious with respect to their insides on the other hand. The best example are classifiers where we assume that it correctly learns to understand the data, but all it does is to find the simplest patterns to explain all labels and very often, the simplest way is not to learn the real concepts, but some shortcut.

Plus, there are so many pitfalls you can trap in, like for instance, a recent tweet of Karpathy [1] regarding the learning rate. So, stating the obvious, if we would have to predict [MASK], it would be not a single concept “token”, but a set of concepts:

  • Find the right optimizer, plus the hyper-parameters
  • Carefully evaluate how to schedule / adjust the learning rate
  • Decide about a stop condition and/or the number of epochs / steps
  • Choose a proper network architecture, but don’t be a hero
  • Come up with a loss function that really learns a useful representation
  • Create a dataset that is as clean as possible and almost every detail about it
  • Hope that you have enough patience to jointly optimize all the points

We stress that again and again since way too often, papers missing a pitfall section that helps to apply the paper method on similar problems, but different datasets. From our experience it is often a very time consuming and painful process to do this, even if you have the github source repository.

Bottom line, no [MASK] is not all you need, maybe in case it is a swiss-army knife, but otherwise it is very unlikely that there is a single silver bullet that solves all our neural problems. A question is if research is focused too much on finding the holy AI grail instead of improving the pipeline by better understanding individual steps.

[1] twitter:karpathy:1431380525759885313