In-Context Learning: Learning To Add

The ability of LLMs to perform in-context learning (ICL), solving a broad range of tasks without any weight updates, is quite impressive. But why and when ICL emerges is not well understood. In [1], experiments are conducted to shed some light on these questions. In contrast to ICL, in-weights learning (IWL) means that a neural net relies solely on its weights to make a correct prediction, which is the ‘default’ training mode.

The goal of our little experiment is to compare the scores of both methods on a toy task, in our case the addition of two non-negative numbers. The setup we use is very similar to [2], but we just train a single function. The ‘x’ input is a concatenation of two one-hot vectors representing x1 and x2, and the ‘y’ output is a one-hot vector of the result x1+x2. The input and output dimension is the same: 2*range(x). Causal attention is used to avoid cheating.
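
To make the encoding concrete, here is a minimal sketch; the toy range R and the helper names are our own choices, not taken from [2]:
import torch

R = 5  # numbers 0..R-1, our own toy choice

def encode_x(x1, x2):
    # input: concatenation of two one-hot vectors -> dimension 2*R
    v = torch.zeros(2 * R)
    v[x1] = 1.0
    v[R + x2] = 1.0
    return v

def encode_y(x1, x2):
    # output: one-hot vector of the sum x1+x2 -> also fits into dimension 2*R
    v = torch.zeros(2 * R)
    v[x1 + x2] = 1.0
    return v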

A training sequence consists of (x, y) example pairs and a final (x,) for which the correct y output has to be predicted. The number of pairs can vary, and if no pairs are given, ICL turns into IWL, since no example is provided and only the weights can be used for the prediction.

Note: for such a toy problem, a model with enough capacity can memorize all input to output mappings which means 100% training score, but likely 0% test. We use weight decay to limit the model capacity.

The baseline model is a very small Transformer with just one layer and 32 dims. We provide three examples and one ‘query’ per training step, and the model is trained with RMSprop. The dataset consists of all possible pairs of two numbers to add, and 20% of it is held out for testing the model on unknown data. The test score is periodically checked during training to avoid memorization / overfitting issues.
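
As a rough sketch of the data pipeline (again with our own toy range and helper names; the exact model and optimizer settings are not reproduced here):
import itertools, random

R = 5                                                  # toy range 0..R-1
pairs = list(itertools.product(range(R), repeat=2))    # all possible (x1, x2) pairs
random.shuffle(pairs)
split = int(0.8 * len(pairs))
train_pairs, test_pairs = pairs[:split], pairs[split:]

def make_sequence(query, examples, k=3):
    # k in-context example pairs followed by the query; k = 0 turns ICL into IWL
    return random.sample(examples, k) + [query]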

As expected, even with regularization the model quickly learns to solve the task. With some tuning, the test score on the unknown data eventually reaches ~85%, which clearly confirms that the model learned something beyond memorizing. For the test score, the training set is shuffled for every test input and three pairs are used as in-context examples. Due to this sampling, the test score might vary a bit. For the IWL test score, no examples were provided and the prediction relies solely on the input x.

First, we analyzed the errors of the ICL setup. We repeated the training with different seeds, but two examples were always predicted incorrectly whenever they ended up in the test set:
x = (min, min) -> y = min + min
x = (max, max) -> y = max + max
For example 0+0=0 and 4+4=8.

What’s so special about those numbers? Well, actually it is not about the numbers themselves, but about the frequencies of the pairs.
Too vague? Okay, let’s do some counting. If we enumerate all additions of two numbers from 0 to 2
0 + 0 = 0
0 + 1 = 1
0 + 2 = 2
1 + 0 = 1
1 + 1 = 2
1 + 2 = 3
2 + 0 = 2
2 + 1 = 3
2 + 2 = 4
it is obvious that the smallest and the largest sum are each produced by exactly one pair, while the other sums occur more frequently. In terms of training, this means the network sees those min/max combinations only once per iteration over the training set. So, it’s not a training issue but a data distribution issue.

With a simple script (see below), a plot can be created to visually inspect the frequencies, but the result is not really surprising. The plot forms a ‘triangle’ with the maximum in the middle of the sum range, and it confirms the intuition that some pairs are very rare and thus harder to predict correctly, since the network didn’t “see” them very often. A possible solution is to scale the loss according to the frequency, which is a common pattern for other problems, too.
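
A possible version of such a counting script could look like this; the toy range and the plotting details are our own choices:
from collections import Counter
import itertools
import matplotlib.pyplot as plt

R = 5  # toy range 0..R-1
sums = Counter(x1 + x2 for x1, x2 in itertools.product(range(R), repeat=2))

xs = sorted(sums)                       # sums 0 .. 2*(R-1)
plt.bar(xs, [sums[s] for s in xs])      # the 'triangle': rare sums at both ends
plt.xlabel("x1 + x2")
plt.ylabel("frequency")
plt.show()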

What about IWL? We set the number of examples to zero and repeated the test. The result was a bit disappointing, since the IWL score was sometimes a bit higher and never below the ICL score. In other words, for such a toy problem, the extra in-context examples do not seem to improve the score and sometimes even hurt it.

Drawing conclusions is a bit hard, since the task is likely too simple, which means IWL already works well, ICL is not needed at all and thus does not ’emerge’. The long-tail issue of the input/output pairs is unlikely to prevent ICL from emerging, but it of course affects the learning dynamics of the model.

References
[1] arxiv:2311.08360 “The Transient Nature of Emergent In-Context Learning in Transformers”
[2] arxiv:2310.03016 “Understanding In-Context Learning in Transformers And LLMs By Learning To Learn Discrete Functions”

Linear RNNs

With the rise of LLMs, the transformer architecture is all over the place and the good old recurrent net does not get much attention any longer. However, not every task or problem needs a data-hungry transformer with billions of parameters. In our own experiments, inspired by [3], we use a simple GRU to derive the task vectors, and the capacity of this simple RNN is sufficient. Now the question is whether we can further simplify the design, since the non-linearity inside the recurrence can be a bit problematic, even with gradient clipping and gating units.

The idea of linear RNNs is described in [1] and as a blog post again in [2]. They describe more sophisticated versions, but we focus on the basic version to present the idea.

The vanilla RNN update is defined as follows (biases are omitted for brevity):
h[t] = tanh(Wx * x[t] + Wh * h[t-1])
Embed the input, transform the previous state, add them up and apply a non-linearity, that’s it.

Without the non-linearity, the ability of the net to solve complex tasks is limited, but what if we apply the non-linear step after all the recurrent steps? In principle, this non-linear step can be any MLP; we use a GLU:

glu(x) = sigmoid(Wg * x) * (Wv * x)

We implemented our idea in PyTorch. We have an input X of shape (seqlen, batch, dim) and an initial hidden state H of shape (batch, dim). Since the embedding of the input does not depend on the hidden state, it can be pre-computed.

B = Wx * X                        # shape (seqlen, batch, dim), pre-computed for the whole sequence
ht, out = H, []
for i in range(X.shape[0]):
    ht = layer_norm(Wh * ht + B[i])   # linear recurrence, no tanh
    out.append(ht)
out = glu(torch.stack(out))       # point-wise non-linearity at the very end

The design is extremely simple. First, embed the whole input X with a single matrix multiplication, which is very fast. Then perform all the recurrent steps, with layer normalization to keep the norms of the states consistent. Finally, the outputs are stacked into a matrix with the shape of X and the point-wise non-linearity is applied, which again boils down to a couple of batched matrix multiplications.
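
Putting the pieces together, a self-contained module could look like the following sketch; the use of nn.Linear layers, the layer sizes and the GLU details are our own choices and not the exact code from [1]:
import torch
import torch.nn as nn

class LinearRNN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.wx = nn.Linear(dim, dim, bias=False)   # input embedding (Wx)
        self.wh = nn.Linear(dim, dim, bias=False)   # recurrence (Wh)
        self.norm = nn.LayerNorm(dim)
        self.wg = nn.Linear(dim, dim)               # GLU gate (Wg)
        self.wv = nn.Linear(dim, dim)               # GLU value (Wv)

    def forward(self, x, h):
        # x: (seqlen, batch, dim), h: (batch, dim)
        b = self.wx(x)                              # pre-compute all input embeddings at once
        out = []
        for i in range(x.shape[0]):
            h = self.norm(self.wh(h) + b[i])        # linear recurrence + layer norm
            out.append(h)
        y = torch.stack(out)                        # (seqlen, batch, dim)
        return torch.sigmoid(self.wg(y)) * self.wv(y)   # GLU applied point-wise

rnn = LinearRNN(32)
y = rnn(torch.randn(10, 4, 32), torch.zeros(4, 32))    # output shape (10, 4, 32)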

Except for the recurrent step, all operations can be batched and thus are very efficient. We did some timings, and depending on the dimensions the speed-up over the GRU was 15% to 30%, despite the fact that the GRU code is highly optimized while the linear RNN was just plain Python/PyTorch code. All tests were done with no_grad enabled.

And since the non-linearity inside the recurrence is gone, the derivative of the recurrent step is simpler, so backprop takes less time, which speeds up training.

What is missing is a qualitative comparison of the trained models, but preliminary tests indeed indicate that the expressive power and generalization ability of the simplified model are solid.

References

[1] arxiv:2307.11888 “On the Universality of Linear Recurrences Followed by Nonlinear Projections”
[2] adrian-valente.github.io/2023/10/03/linear-rnns.html
[3] arxiv:2310.15916 “In-Context Learning Creates Task Vectors”

LLMs – A Task To Remember?

University lectures can be both entertaining and instructive. The one in [1] gives an overview of self-supervised learning for NLP. It covers the topic from the basics to recent methods. Lecture 15 is about “In-context learning”.

On slide 50, an interesting hypothesis is made, namely that input-output examples might not teach the model a new task, but rather make it “remember” the task from the training set. It has been demonstrated that the impressive few-shot capabilities of LLMs require at least a certain model size, in the billions of parameters, and a sufficiently large and diverse training dataset.

To illustrate the model sizes with some numbers: the embedding dimension of the largest LLMs can be huge, e.g. 12,288 for GPT-3. And since projection matrices are used all over the place, a single matrix already has dim * dim => 12,288 * 12,288 ~ 151M parameters.
To compare this with other models, BERT-Large has 340M parameters in total, which means that just two(!) weight matrices of a large LLM almost match that number of parameters: ~302M vs. 340M.

To do a quick recap of the Transformer architecture: we have attention layers and feed-forward layers, plus layer normalization layers whose number of parameters is negligible. We also omit biases for the same reason.
FFN layer: dim * 4*dim + 4*dim * dim = 2 * 12,288 * 4 * 12,288 ~ 1.2B parameters per layer
Attention layer: (1+1+1+1) * dim * dim ~ 604M parameters per layer (Q, K, V and output projection)
A concrete example of such an LLM is GPT-3, which stacks 96 of those layers and has 175B parameters in total.
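
As a back-of-the-envelope check in code (layer count, embedding dimension and vocabulary size are the published GPT-3 settings; biases, layer norms and positional embeddings are ignored as above):
dim, n_layers, vocab = 12288, 96, 50257        # GPT-3 175B settings

ffn = 2 * dim * 4 * dim                        # two FFN matrices per layer, ~1.2B
attn = 4 * dim * dim                           # Q, K, V and output projection, ~604M
total = n_layers * (ffn + attn) + vocab * dim  # transformer layers + token embedding
print(f"{total / 1e9:.0f}B parameters")        # ~175B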

Since LLMs have no external memory, they must encode the knowledge from the training set into the model parameters. This knowledge can be easily probed with dedicated questions like:
"What is the highest mountain on earth?" -> Mount Everest.
In [3] it is shown that more popular concepts and terms lead to better accuracy when providing questions or input-output examples in the prompt, which is not surprising, since word distributions in general follow Zipf’s law.

For few-shot examples, [2] calls this “task location”, and for zero-shot prompts it is a bit like “memory location”. But since the answer is not stored in a single parameter, the prompt rather induces a state that [4] calls a “task vector”, which just means the model adapts itself to maximize the generation probability of the right tokens to form the answer.

Coming back to slide 50 again. The datasets used to train existing LLMs definitely introduce a lot of diversity, but it is also hard to believe that every possible task has been seen during training. One such example is the ability to translate text, which is clearly not an explicit task in the training set. It surely is impossible to really understand the mechanics of a 175B model and, at least in our opinion, there is a tendency to overestimate the powers of existing LLMs, which is why we welcome researchers who risk a second look and try to shed some light on those mysterious capabilities with structured experiments.

Since operating LLMs is very challenging, it would be interesting to better understand what is needed to train smaller expert models with a limited set of tasks to solve: what model size is required, and what dataset (size, diversity, etc.) is needed for training.

[1] self-supervised.cs.jhu.edu/sp2023, NLP: Self-supervised Models
[2] arxiv:2102.07350, Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
[3] arxiv:2202.07206, Impact of Pretraining Term Frequencies on Few-Shot Reasoning
[4] arxiv:2310.15916, In-Context Learning Creates Task Vectors

Different Normalization For Your [CLS] Token

Since the paper that introduced Transformers [1], it has become common to add a special token that summarizes the given input sequence as a fixed-length vector. This so-called CLS token is usually used by downstream tasks, like classification or retrieval. Thus, its embedding is a bit special, since its goal is to capture the “global” information of the input. However, the token is treated like any other token and therefore shares the same trainable parameters with the other tokens. In [2] it is shown that the learned representation of the CLS token is not optimal in the sense that it does not preserve maximal information, which can be determined by analyzing the uniformity, the spread, of the embeddings [2, Equation 4].

But even without those results, it seems wrong to treat a non-word token with a very specific purpose like any other token with respect to the transformations, which includes parameter sharing. In [2], the layer norm (LN) step is replaced with a batch norm (BN) step, but just for the CLS token, and as can be seen in [2, Figure 3], using BN for the CLS token improves the uniformity of the embeddings significantly. Depending on the downstream task, a low uniformity might be no problem and the representation still delivers a good performance, but for retrieval, well-separated embeddings are often a key to success, since topics might be overlapping and hard to separate.

With respect to the implementation, extra care needs to be taken to stick to the highly optimized tensor
operations provided by libraries like PyTorch to get optimal performance. But thanks to the simple architecture of Transformers, it is not hard to write a layer that uses the BN just for the CLS token which is assumed to be always at the first position.

The pseudo code for PyTorch looks like this:
cls = bn(batch[:, 0, :])
x_ln = ln(batch[:, 1:, :])
x = torch.cat((cls.unsqueeze(1), x_ln), dim=1)
y = mha(x, x, x)[0] + x
x = dropout(ffn(y) + y)

The input batch is of shape (batch, seqlen, dim) and the [CLS] token is always located at position zero. The code does not need much explanation: nn.BatchNorm1d is used, which expects an input of shape (batch, dim) that contains all [CLS] token embeddings. The mha is simply nn.MultiheadAttention (with batch_first=True, since the batch dimension comes first here) and ffn is just a sequence of two nn.Linear layers with a non-linearity in between. The unsqueeze is required to turn the shape (batch, dim) into (batch, 1, dim), which allows concatenating the cls with the rest of the input sequence of shape (batch, seqlen-1, dim) into the full shape (batch, seqlen, dim) again. The code can be combined to form a SepNormTransformerEncoderLayer which can then be stacked further to increase the representational power.
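
A possible, untested version of such a layer is sketched below; the sub-layer ordering, the GELU and the hyperparameters are our own simplifications and not necessarily the exact architecture used in [2]:
import torch
import torch.nn as nn

class SepNormTransformerEncoderLayer(nn.Module):
    def __init__(self, dim, heads, dropout=0.1):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)        # normalization for the [CLS] token only
        self.ln = nn.LayerNorm(dim)          # normalization for all other tokens
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.dropout = nn.Dropout(dropout)

    def forward(self, batch):
        # batch: (batch, seqlen, dim), the [CLS] token sits at position 0
        cls = self.bn(batch[:, 0, :])        # (batch, dim)
        x_ln = self.ln(batch[:, 1:, :])      # (batch, seqlen-1, dim)
        x = torch.cat((cls.unsqueeze(1), x_ln), dim=1)
        y = self.mha(x, x, x)[0] + x         # attention + residual
        return self.dropout(self.ffn(y) + y) # feed-forward + residual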

The additional complexity, at least for the code, is minimal, but of course the chunk-and-concat step is not for free, and what is noticeable is the extra batch normalization step, which computes mean and variance statistics over the batch dimension for each feature. And despite the experiments in [2], the architectural change might introduce training instabilities and other side effects, or worse, it might not improve the accuracy metric at all.

Regardless of whether the modification helps or not, we welcome this direction of research to improve the representational power of the “summary token”, since lots of real-world problems out there require a fixed-length summary of the input.

[1] “Attention Is All You Need” arxiv:1706.03762
[2] “On Separate Normalization in Self-supervised Transformers” arxiv:2309.12931

Retrieval: kNN versus Exemplar SVMs

While browsing the usual suspects on twitter, we found a post by Andrej Karpathy [1]. The idea we review in this post came up in the context of building a very simple movie recommender system using embeddings of the plot + summary. The details are not important, just that the text is embedded somehow and a ranking is done by converting the query into the same embedding space and then performing a nearest neighbor (kNN) search as a baseline.

The interesting part is the note that kNN can be replaced with an exemplar SVM [2], which really just uses one positive example, the query, and treats all other examples as negatives. This might feel strange for several reasons. First, is one example really sufficient to derive a useful decision boundary? But the more important question is, does it work at all, since we don’t know if there are movies with the same “label” as the query? To be clear, there are no real labels at all, we just set “1” for the query and “0” for the rest.

Let’s do a step by step evaluation.

First, linear SVMs, like the exemplar SVM, work surprisingly well in high-dimensional spaces. In the example, each vector has 1536 dims, which is “quite a lot”. Furthermore, the training uses a balanced loss function, because otherwise the model could reach ~99% accuracy by always predicting “0”. The “C” hyperparameter acts like the inverse of the weight decay in neural nets, so the lower it is, the stronger the regularization. The default setting is 1.0 and in the example it is 0.1. So, we shouldn’t be worried that there is only one positive example.

What is equally important is to think of the trained model not as a classifier, but as a “ranking” function. In scikit-learn, ‘decision_function’ returns a value that can be interpreted as the “alignment” with the decision boundary for a given example. Values close to zero indicate an example lies almost on the boundary, while large positive values indicate a positive alignment with the model and large negative ones a negative alignment.

Since the training step of the linear SVM only cares about minimizing the loss function with respect to the given labels, the following “score” can be used to induce a ranking:
score = X @ W.T # shape (X.shape[0],1)
It should be noted that we do not really care about the actual scores, just about the ordering, as an indicator of which of the “training examples” is most similar to the query. So, neither the sign nor the magnitude of the result will be interpreted any further, since it might confuse the user.
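
For completeness, a rough scikit-learn sketch of the whole procedure; the embedding matrix is a random placeholder and the hyperparameters follow the values discussed above, so the details may differ from the original snippet:
import numpy as np
from sklearn import svm

embs = np.random.randn(1000, 1536).astype(np.float32)   # placeholder document embeddings
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
query_idx = 0                                            # the single positive "exemplar"

y = np.zeros(len(embs))
y[query_idx] = 1
clf = svm.LinearSVC(class_weight="balanced", C=0.1, max_iter=10000)
clf.fit(embs, y)

scores = clf.decision_function(embs)                     # alignment with the boundary
ranking = np.argsort(-scores)                            # most similar documents first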

Bottom line, sometimes ideas might be counterintuitive, but this does not mean they won’t work. To quote Hinton [3]: “My favorite philosopher of science was Paul Feyerabend who said that the correct method in science is whatever worked”. So sometimes, despite the excellent advice of Karpathy [4] himself, “Don’t be a hero”, you might want to hack away and evaluate your crazy ideas, if it can be done in no time ;-).

References:
[1] twitter.com/karpathy/status/1647054838658924546
[2] cs.cmu.edu/~tmalisie/projects/iccv11/
[3] cs.toronto.edu/~hinton/HintonMumbai.pdf
[4] karpathy.github.io/2019/04/25/recipe/

Autoregressive: What Is Next?

It might be a naive point of view, but right now it seems that most of the ‘AI’ progress comes just from scaling a single architecture with respect to parameters and feeding it more and more data. The issue that only big tech companies can operate such huge models, and that the energy and money required to train them is enormous, does not seem to be considered too important. The stored knowledge and the abilities of these models are impressive, but that ‘real intelligence’ arises just from the predict-next-item objective seems absurd. A human who read millions of documents and was able to remember the whole content could surely do impressive things, but without interacting with the world, their skills would remain theoretical and could not evolve.

The trend to learn from multiple modalities surely helps, because videos often contain interactions with the real world, and they allow a model to see what consequences actions have and how the world changes when objects interact with it. But again, a model would need to ‘watch and learn’ from a lot of videos with the right objective to capture all the required aspects, and this is very expensive in terms of money and time. Not to mention the complexity of creating a video dataset that covers a diverse set of topics, since bias in data is a huge problem for AI models.

And this brings us to the point: Why is it so hard to come up with a good objective function? If we take a look at some very popular models, the used loss functions are:

  • contrastive loss, like in CLIP
  • negative log-likelihood, like in GPT

A lot of other loss functions have been developed, evaluated and tested, but for most huge models those two seem to dominate. One reason is that no labels are required. So, all we need is a huge corpus, or in the case of CLIP two modalities, images and (noisy) descriptions, but that is it.
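
To make the two objectives a bit more concrete, here is a hedged sketch in PyTorch; tensor names, shapes and the temperature are our own placeholders, and real implementations differ in many details:
import torch
import torch.nn.functional as F

# next-token prediction (GPT-style negative log-likelihood)
logits = torch.randn(8, 128, 50000)              # (batch, seqlen, vocab), placeholder
targets = torch.randint(0, 50000, (8, 128))      # the input sequence shifted by one
nll = F.cross_entropy(logits.reshape(-1, 50000), targets.reshape(-1))

# CLIP-style contrastive loss over paired image/text embeddings
img = F.normalize(torch.randn(8, 512), dim=-1)   # (batch, dim), placeholder
txt = F.normalize(torch.randn(8, 512), dim=-1)
sim = img @ txt.T / 0.07                         # temperature-scaled similarities
labels = torch.arange(8)                         # the i-th image belongs to the i-th text
contrastive = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2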

The downside is that both loss functions need a huge amount of data to learn the statistics of the modalities, and they cannot simply be applied to smaller datasets. And fine-tuning larger models is still very challenging with normal hardware. So, one question is: does it make sense to spend so much money on training and re-training huge models just to be SOTA for one more dataset or task?

A similar line of thought can be found in the thread [1]. We agree that it feels wrong to “advance” AI by just using bigger and bigger models that nobody except a few tech giants can operate. Even if the access were “free”, the model would still be controlled by a single entity.

In the end, our question is whether we should put more energy into alternative lines of research instead of just watching and waiting for the tech giants to come up with bigger and “better” models that can be app-ified via some API to include some AI functionality in a product.

Bottom line, despite hundreds of researchers working on “AI” problems, the diversity seems to have shrunk since the introduction of Transformers. Now it feels like a large portion of papers just adapts or “advances” the architecture to solve existing problems, while very little work is done on new objectives for how and what a model should learn. Maybe, as stated in [1], it would be nice if a model could come up with its own goals, but in any case there needs to be a driving force for the model to optimize something, like a reward.

[1] mobile.twitter.com/RichardSocher/status/1562359669468717056

Oldies But Goldies: Part 1, SCP

This is a non-ML post that might be categorized as ‘silly mistakes’ or just bad luck. The Secure Shell Protocol (SSH) was introduced in 1995 and is now the standard way to work on remote systems. For decades we never had any trouble with it, until a recent update of a server running Debian 11 (server B).

The problem: we want to copy data from server A to server B with scp and get the error
“subsystem request failed on channel 0”
However, auto-completion still worked, so it is clearly not a general SSH problem.

The reason is that after the apt update, scp uses the SFTP subsystem by default and no longer the older SCP protocol. But since SFTP is disabled on server A, we get the error. The manpage of scp does not explicitly mention that SFTP is now the default, which caused the confusion.

The solution is simple: the ‘-O’ option of scp forces the use of the older SCP protocol. With this option, copying files works again without any problems.

Bottom line, it would have been nice if new defaults were also documented in the manpage, since users hardly read the change logs of all updated packages. But it is also a good idea to thoroughly read the documentation instead of just stack-overflow-ing the problem.

Contextual Prototype Memory

The ability of humans to recognize concepts after seeing just a couple of exemplars is remarkable. In the case of neural nets, most architectures are still very data hungry and need hundreds of examples to reliably recognize a concept. There surely is lots of progress, but one tendency is to just improve a model by increasing its size and feeding in more data. The idea of using prototypes in neural nets is nothing new. The approach in [1] is surely not the first one, but it is a very simple and nevertheless robust one: a prototype is defined by averaging feature representations. The prototype can then be seen as a kind of memory to classify unseen exemplars according to their distance to the prototype. The key to making this work is metric learning, to separate unrelated content in the representation space. However, averaging is not always the best choice due to possible in-class variance.

In [2] a prototype memory is created by using a recurrent neural net. Using an RNN allows us to create a stateful memory that is not just an average, but a non-linear function of the timesteps. The problem, however, is that the context vector generated by the RNN depends on the order of the input.

What do we mean by this: f(x1, x2, x3) != f(x1, x3, x2).


This does not have to be a disadvantage in case the order of the input is important, but if we just want to summarize exemplars by a prototype (vector), the order should not matter. For example, say we want to recognize the concept of “mugs” and we feed dozens of feature representations of mugs into our memory function M. Then the final output of M should not depend on the order of the input; it should just summarize the concept as a single vector. This is obviously not possible with any RNN, and training the network to ignore the order of the input by permuting it to generate the same output is prohibitively expensive and also not reliable. Not surprisingly, the answer is the Transformer, or to be more precise, any permutation-equivariant network model.

Intuitively, ignoring the order of the input makes sense: if we have, for instance, three images of mugs, for a human it does not matter which mug we look at first, as long as we have a chance to take a look at all of them. What we actually store as a pattern to recognize a mug is a bit subjective, or stated differently, it depends on what we attend to, but we surely look back and forth to generate that pattern. An RNN only allows looking forward, and in the case of a bidirectional RNN also backward, but even then the output still depends on the order of the input.

So, what we need is a set-based function that summarizes a variable-length input with a fixed-length output. In the case of Transformers, we just need to get rid of the positional encoding to ignore the order of the input. What is missing is the ability to summarize the input, since with a vanilla transformer the output size equals the input size. One solution is to use latents L and cross attention to summarize the input X: Y = cross_attention(L, X). The idea is that each latent selects (different) tokens of X, or stated differently, each latent should focus on different aspects of the input. The output Y of shape (num_latents, dim) can then either be flattened into a vector of shape (1, num_latents*dim) or averaged like in [3].
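
A minimal sketch of such an attention-based pooling is given below; the latent count, dimensions and the use of nn.MultiheadAttention are our own choices and not the exact Perceiver [3] implementation:
import torch
import torch.nn as nn

class LatentPool(nn.Module):
    def __init__(self, dim, num_latents=4, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # learned queries L
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, num_exemplars, dim); no positional encoding, so the order does not matter
        q = self.latents.unsqueeze(0).expand(x.shape[0], -1, -1)
        y, _ = self.attn(q, x, x)        # cross attention: latents attend to the whole set
        return y.mean(dim=1)             # (batch, dim) prototype; flattening works as well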

Bottom line, instead of just averaging a set of samples to generate a prototype, which gives equal importance to each sample, an attention-based pooling method can be used that allows the network to select the relevant parts of the input to generate the prototype. The method is agnostic to the order of the input and also to the type of input. For instance, it can be used for images or text, as long as the input can be tokenized.

[1] NIPS 2017: Prototypical Networks for Few-shot Learning
[2] ICLR 2021: Wandering within a World: Online Contextualized Few-Shot Learning
[3] ICML 2021: Perceiver: General Perception with Iterative Attention

Applying Weight Decay On Model Parameters

For PyTorch, a setting like this:
opt = torch.optim.RMSprop(model.parameters(), lr=1e-4, weight_decay=0.1)
applies the weight decay on all parameters, including biases.

First, since biases account for far fewer parameters, they are less prone to overfitting. But second, if the weight decay is very high, it drives those parameters towards zero, which is especially a problem for nn.LayerNorm, since its affine step
ln_x = scale * norm(x) + bias
then also drives the output of the normalization towards zero.

For this reason, the optimizer usually has two groups:
(1) parameters that should be decayed
(2) parameters without any decaying

It is not hard to come up with some code, but depending on the shape, some bias-like parameters might still be decayed. Let us assume the check for the no-decay group looks like this:
'bias' in parameter_name or len(param.shape) == 1
Seems fine, right? Yes, but it does not match a parameter defined like this:
cls_token with shape (1, dim)
since the shape contains a dummy axis (len(param.shape) == 2) and there is no ‘bias’ in the name.

Maybe the decay was on purpose, but whenever a bias-like parameter has shape (1, dim) and no ‘bias’ in its name, the check will miss it.

With a slight variation, those parameters will be added to the no decay group:
if len(param.squeeze().shape) == 1
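
A possible split function along those lines; the name matching and the squeeze-based shape check are just the heuristics discussed above, not an official PyTorch utility:
import torch

def param_groups(model, weight_decay=0.1):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # biases, norm scales and (1, dim) parameters like a cls_token: no decay
        if 'bias' in name or len(param.squeeze().shape) == 1:
            no_decay.append(param)
        else:
            decay.append(param)
    return [{'params': decay, 'weight_decay': weight_decay},
            {'params': no_decay, 'weight_decay': 0.0}]

# usage: opt = torch.optim.RMSprop(param_groups(model), lr=1e-4)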

Bottom line, tensor shapes can be a real pain in the neck, and even such a trivial function to split the parameters into groups can fail because of shape variations. In our opinion, PyTorch should at least provide an option to exclude bias parameters from weight decay, so the extra function is not required. But even that would not help if a parameter acts as a bias but has a different name and/or shape. So, one lesson learnt is definitely to check your parameter shapes twice.

Directly Optimize IR Metrics

It would be helpful to have an AI system that at least allows us to retrieve information from a large corpus of data with a simple query. In other words, information retrieval (IR), a classic ML topic. And even without training, methods like TF-IDF or BM25 provide a solid baseline.

But we would like to directly optimize some IR metric. A short recap: we want to learn a ranking function that orders a set of documents X given some arbitrary query Q. We decided to take DCG [1] because it measures the quality of a ranking.

In our case, the relevance is just a binary vector {0,1} where 0 means unrelated
and 1 means related to the query:
DCG = sum(i=1, N) relevance[i] / log_2(i + 1)
The idea is simple: the gain is maximized when the related content is ranked first.

As written, the metric depends only on the fixed positions i and not on any model output, so there is nothing to differentiate. So, let us redefine it:
DCG = sum(i=1, N) relevance[i] / log_2(rank(x[i]) + 1)
where rank(x[i]) is the position of document x[i] in the ranked result. This still does not sound very continuous, so let’s try again. The following ideas come from [2].

We have a query Q of shape (1, dim) and some documents X of shape (N, dim). First we determine the scores:
scores = (Q @ X.T).squeeze() # shape (N,)
To determine the rank of the first element, we count how many documents score higher:
s_0 = scores - scores[0]
rank_0 = 1 + sum(s_0 > 0)
But this is still not differentiable, darn it. So, we need to approximate the step function with a differentiable, S-shaped function like the sigmoid:
scale = 1/0.05
approx_rank_0 = 1 + sum(sigmoid(s_0 * scale))
(The j = 0 term only adds a constant 0.5 and can be dropped.)

With proper scaling, the approximated ranking function returns a value close to the true rank in [1, …, N], based on the similarity of each document to the query.

And now we can differentiate the DCG metric, which means we can directly optimize it. To make the values comparable, we normalize the metric by the ideal DCG, i.e. the value obtained when all relevant entries are placed at the optimal (top) positions.

norm = sum(j=1, |POS|) 1 / log_2(j + 1)
DCG_n = DCG / norm
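
Putting the pieces together, a differentiable version of the normalized DCG could look like the following sketch; the scale value follows the text, while the tensor shapes and broadcasting details are our own choices:
import torch

def approx_ndcg(q, x, relevance, scale=1 / 0.05):
    # q: (1, dim), x: (N, dim), relevance: (N,) with binary entries
    scores = (q @ x.T).squeeze(0)                        # (N,)
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)     # diff[i, j] = scores[j] - scores[i]
    pairwise = torch.sigmoid(diff * scale)               # ~1 if document j outranks document i
    pairwise = pairwise * (1 - torch.eye(len(scores)))   # drop the j == i terms
    ranks = 1 + pairwise.sum(dim=1)                      # approximated rank of each document
    dcg = (relevance / torch.log2(ranks + 1)).sum()
    ideal = torch.arange(1, int(relevance.sum()) + 1).float()   # relevant docs at the top
    norm = (1 / torch.log2(ideal + 1)).sum()
    return dcg / norm                                    # differentiable w.r.t. q and x

# usage: loss = -approx_ndcg(q, x, relevance) to maximize the metric during training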

Bottom line, whenever we can replace the discrete parts of an IR metric with a continuous approximation that depends on the data, we can directly optimize it, like Smooth-AP [3] does. In contrast to other loss functions, this objective does not really care about the actual similarity values as long as the ordering of the results is correct. Of course, in real-world applications we would need some threshold for filtering, but in most cases the ‘search engine’ would anyway just present the top-k results and show further results only on demand.

A closing remark: we are wondering why it took 12 years for the topic to get some more attention. Probably because neural nets needed an AlexNet to get more attention ;-).

[1] en.wikipedia.org/wiki/Discounted_cumulative_gain
[2] A General Approximation Framework for Direct Optimization of Information Retrieval Measures, 2008
[3] Smooth AP, arxiv:2007.12163