Missing Cookbook Recipes to Train Neural Nets

Training a neural net in the old days, without modern normalization layers and with weight initializations that were often far from optimal, was a real pain. Even for vanilla classification networks it sometimes took several runs with different hyper-parameters before the network started to learn at all. Nowadays, at least for standard architectures like a ResNet, it is fairly simple to train a classifier on an arbitrary dataset.

However, despite smart grid searches and good default values to start from, we are still far away from a cookbook that helps to solve a problem with neural nets in a more disciplined way. The blog post [1] contains a lot of very useful hints, but due to the complexity of modern nets, it is impossible to put every hint into a single document. For example, training a large Transformer is still not trivial and requires expertise and probably also patience.

And as mentioned several times before, there are lots of loss functions, and combinations of them, that lead to very complicated loss landscapes. In this case it is imperative to start simple: use a conservative learning rate and optimizer and try to overfit a small batch to verify that the loss goes to zero. This tip, also mentioned in [1], is one of the most helpful ones we know, since without this assurance errors can be lurking anywhere. Truth be told, even when the loss goes to zero, errors are still possible and even likely, but at least we know that the network is able to solve a simplified version of our problem.
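
A minimal sketch of this sanity check in PyTorch; the model, the random data and the hyper-parameters are just placeholders and not from any specific experiment:

import torch
import torch.nn.functional as F

# hypothetical model and a single, fixed batch
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 5))
x = torch.randn(16, 32)
y = torch.randint(0, 5, (16,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # conservative defaults

for step in range(500):
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(step, float(loss))  # the loss should approach zero on this single batch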

For example, in the case of recurrent nets, gradient clipping and gating are now fairly standard, maybe along with layer normalization. But does using these means guarantee success? Definitely not for all kinds of nets and problems. The choice of LSTM vs. GRU can be evaluated easily by comparing resources vs. accuracy, but what about gradient clipping? What norm is useful? In papers these values range from 0.01 to 10, and the right value surely depends on the actual norm of the gradient during training with a specific net and loss. And even for a method as simple as layer norm there are at least two flavours, namely pre: f(ln(x*W)) and post: ln(f(x*W)). Depending on the paper, and therefore the problem / method, there is no clear winner.
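
To make the clipping step concrete, here is a minimal sketch with PyTorch's clip_grad_norm_; the GRU, the dummy data and the max_norm value are placeholders that have to be adapted:

import torch
import torch.nn.functional as F

rnn = torch.nn.GRU(input_size=16, hidden_size=32, batch_first=True)
head = torch.nn.Linear(32, 2)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 20, 16)   # (batch, seqlen, dim), dummy data
y = torch.randint(0, 2, (8,))

out, _ = rnn(x)
loss = F.cross_entropy(head(out[:, -1]), y)
opt.zero_grad()
loss.backward()
# returns the total norm *before* clipping, which is useful to tune max_norm
total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
opt.step()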

Recap: For RNNs we have to choose the number of units / layers, the type (GRU or LSTM), whether to use layer norm, and the norm for clipping the gradients.

Starting simple is definitely a good idea, but especially for RNNs good default values for the number of units can be challenging. We once had a classification problem where a small number of units led to incredibly slow learning, while a large number of units led to unstable learning despite gated units, gradient clipping and layer norm. In the end, we had to do a grid search on a smaller dataset to find a good trade-off for the number of units.

When RNNs became popular, it took some time until researchers without profound expertise in this area were able to use them for ‘everyday tasks’. This was partly owed to the introduction of high-level frameworks like Lasagne or Keras, both introduced in 2015 and both using Theano as a back-end at the time. However, in case of problems, high-level APIs hide most information, as they are supposed to, which makes debugging very hard, and in the case of static computational graphs (as used by Theano) even more painful.

So what to do if the loss does not go down as expected, at least on the training data? Switch to a different optimizer? Lower the learning rate, or increase it? More hidden units? A different kind of unit?

A good tip is definitely to monitor as much as possible. Besides the loss, there are many other values worth tracking (a small monitoring sketch follows the list):
– The norm of the gradient / max & min values of the gradient, max(abs) values, etc.
– The total norm of the network / norm of each layer
– Max / min of activation functions (think of -1/+1 for tanh, 0/1 for sigmoid)
– Number of ‘dead’ units in case of ReLU
– In case of NLL it is useful to compare early loss values with -log(1/N_CLASSES)
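
A minimal sketch of such monitoring in PyTorch; the selection of metrics is just an example and the helper name is ours:

import torch

def monitor(model, relu_acts=None):
    # gradient statistics across all parameters
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    if grads:
        g = torch.cat(grads)
        print('grad_norm=%.4f grad_max_abs=%.4f' % (float(g.norm()), float(g.abs().max())))
    # total weight norm of the network
    w = torch.cat([p.detach().flatten() for p in model.parameters()])
    print('weight_norm=%.4f' % float(w.norm()))
    # fraction of 'dead' units for a given ReLU activation tensor of shape (batch, units)
    if relu_acts is not None:
        dead = (relu_acts <= 0).all(dim=0).float().mean()
        print('dead_relu_fraction=%.2f' % float(dead))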

The idea is to get a feeling for the flow of the data through the network. First, in the forward direction, where badly initialized weights might lead to abnormal behavior like dead units, very large or small values, or saturated activations; wrongly encoded data can lead to similar symptoms. Second, in the backward direction, which is the flow of the gradient. There, large gradients can lead to NaN values or oscillating behavior, while very small gradients vanish and therefore make no progress.

Since it is very hard to keep track of all these values, it is a good idea to visualize a condensed summary in a plot, for example the gradient per layer as a bar graph [2]. It is also useful to dump a subset of those metrics during training to get a feeling whether the training is healthy. We usually dump at least the train/test loss, the total norm of the network and an average of the gradient norm.

With all these tips we have barely scratched the surface, and we are still far away from a manual, or even a reference, that provides pointers for specific symptoms. A further problem is that researchers who work a lot with different nets usually publish only the positive aspects of their work and rarely mention drawbacks and negative results, which, at least in condensed form, might be equally useful.

[1] karpathy.github.io/2019/04/25/recipe/
[2] raberrytv.wordpress.com/2019/09/22/flow-gradient-please-flow-darn-it/

Classifier: Softmax vs. Nearest Neighbors

In a fairly standard setup, which means a neural network and a loss to correctly classify arbitrary data points, the network learns a representation that allows the classes in the dataset to be linearly separated [1], at least as well as possible with the chosen architecture. Finally, the output of the penultimate layer is multiplied (dot product) with all ‘template’ weight vectors, one for each class, and the score is normalized to get a probability distribution y_p:

scores = [3.14, -1.2, 0.05]
y_p = exp(scores) / exp(scores).sum()
y_p = softmax(scores) = [0.945, 0.012, 0.043]

In other words, if the inner product is sufficiently large and positive, which means the hidden state and the class template agree, and the other products are lower by a margin, the corresponding class y’ gets most of the probability mass and y_p[y’] is close to 1. And if y_p[y’] is close to one, the negative log-likelihood -log(y_p[y’]) is close to zero and the loss is thus minimal. This is the goal.

This kind of loss is very well studied and also pretty stable if the framework of choice handles possible overflows; in PyTorch, for instance, one should use log_softmax instead of softmax. A further advantage is that this approach allows unseen data points to be classified directly with a single matrix multiplication h*Wy, where h is the output of the last hidden layer and Wy is the ‘template’ matrix of shape (h_dim, n_classes).
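
A minimal sketch of this numerically stable variant in PyTorch; h_dim, n_classes and the random data are placeholders:

import torch
import torch.nn.functional as F

h_dim, n_classes = 64, 10
Wy = torch.randn(h_dim, n_classes, requires_grad=True)   # class 'templates'
h = torch.randn(32, h_dim)                                # output of the last hidden layer
y = torch.randint(0, n_classes, (32,))

scores = h.matmul(Wy)                     # (32, n_classes)
log_probs = F.log_softmax(scores, dim=1)  # stable, avoids overflow of exp()
loss = F.nll_loss(log_probs, y)           # equivalent to F.cross_entropy(scores, y)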

So, what is the drawback? As stated in [1], depending on the goal it might not be required to learn a linearly separable representation. For instance, if it suffices for a ‘classification’ to learn local neighborhoods where the class can be induced by a vote over the labels in the neighborhood, the network does not need to waste capacity on learning global class boundaries. Instead it can focus on shifting data points around until stable neighborhoods are learned.

Now one might ask whether this task is easier than learning class boundaries, also because without templates a “classification” requires a subset of the training / test set during the inference step.

Well, this can be a problem, but if the ultimate goal is not to predict labels but to learn a representation where the space itself allows new content to be ranked or clustered, this issue is secondary. Of course, for larger datasets a nearest neighbor calculation is challenging, but there are GPU-powered methods (GPU: at least during training) that work fairly well.

Despite the success and simplicity of the (categorical) cross entropy method, it is a bit unsatisfying. Why? Each prediction only considers a single data point. If, for an input (x, y=1), the inner product is large for two template weight vectors, the learning step shifts the hyperplanes to move x to the correct side, which means nearby points with a different label might also need to be shifted to stay on the correct side of the hyperplane.

Now, instead of shifting global template vectors, i.e. class boundaries, it would suffice to have a local neighborhood for the input (x, y=1) where the majority of neighbors has the same label. In other words, for x we search for the nearest neighbors, and if there are violators with y’ != 1, we need to push them a bit away from x, and at the same time, if there are supporters x_ with the same label, we need to pull them a little closer to x.

How such neighborhoods can be determined and how to choose those neighbors is explained, for example, in [2]. The idea is, for an input x, to derive a radius by considering the k nearest neighbors and to push the violator that is farthest away outside the circle while pulling a supporter inside it. There are other methods, but it is always a good idea to give ideas a shot that are a bit out-of-the-box.

However, the point is that it is important to use as much context as possible to make learning more efficient and easier for the network. This can be done by using all samples in a batch, instead of just one sample and all template vectors, to derive the final ‘prediction’.

To be more concrete: in [3] we discussed how to turn a batch with different labels into a classification problem that uses the whole input (X, Y) instead of just a specific pair (x, y). How is this related to [2]? We consider the distances of samples to assign a weight to each x with respect to a query q.

Let B=(X,Y) be a batch of n items, let q be X[i] and let X_i be the samples without entry i. For Y this is similar: Y_i and q_i. Each row in Y is a one-hot vector encoding the integer label y (0, …, K-1).


temperature = 0.1
dists = ((q - X_i)**2).sum(dim=1).view(1, -1) # squared euc distance to query
probs = softmax(-dists/temperature, dim=1)
yhat = matmul(probs, Y_i) # shape=(1, K)

In words: for a query q, row i of the batch B, we calculate the distances to all other rows. The goal is to minimize the distances for rows with the same label and to push rows with other labels ‘sufficiently’ far away. The temperature controls how strongly larger distances are discounted. Numerical examples: exp(-0)=1, exp(-3)~0.05.

The matrix multiplication of probs with the one-hot vectors in Y_i ‘distributes’ the weight to the corresponding classes. For an extreme case, assume K=2 and that the remaining batch Y_i contains three rows with labels (0, 0, 1), i.e. Y_i = [[1, 0], [1, 0], [0, 1]]. With probs = [0.45, 0.45, 0.1] we get yhat = (0.9, 0.1); with probs = [0.05, 0.05, 0.9] we get yhat = (0.1, 0.9). Let the label of the query be 0. In the first example the distances agree with the label, and we just need to nudge the input with label 1 a little outside the ‘radius’ of the query. In the second, the distances to the two rows with the same label are larger, while the row with the wrong label is very close. Thus, we need to adjust the representation so that both rows with the same label move closer to the query while the violator is pushed away. In other words, we tidy up the neighborhood.
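
Putting the pieces together, here is a minimal, hedged PyTorch sketch of this batch loss; the embedding values, batch size, temperature and function name are ours, and Y is assumed to be a one-hot label matrix:

import torch
import torch.nn.functional as F

def batch_neighbor_loss(H, Y, temperature=0.1):
    # H: (n, dim) embeddings of the batch, Y: (n, K) one-hot labels
    n = H.shape[0]
    loss = 0.0
    for i in range(n):
        mask = torch.arange(n) != i
        q, X_i, Y_i = H[i:i+1], H[mask], Y[mask]
        dists = ((q - X_i) ** 2).sum(dim=1).view(1, -1)  # squared euclidean distances to q
        probs = F.softmax(-dists / temperature, dim=1)   # (1, n-1)
        yhat = probs.matmul(Y_i)                         # (1, K) soft label distribution
        loss = loss - torch.log(yhat[0, Y[i].argmax()] + 1e-8)
    return loss / n

# usage with dummy data
H = torch.randn(8, 16, requires_grad=True)
Y = F.one_hot(torch.randint(0, 3, (8,)), num_classes=3).float()
loss = batch_neighbor_loss(H, Y)
loss.backward()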

It is easy to see that for larger batches the network is forced to adjust the weights to fulfill multiple constraints instead of just one. This is a bit like looking ahead to see if it makes sense to move multiple rows at once, instead of moving just one row while shifting all template vectors. It is also related to triplet learning, where a learning step only “sees” the anchor, the positive and the negative item to adjust the weights. This issue is addressed in [4], where centroids of classes are used to separate several classes at once. The paper uses a loss similar to ours to achieve this goal:
Let q be the query of shape (1, dim) and let C of shape (K, dim) hold the centers of the K classes.

dists = ((q - C)**2).sum(dim=1).view(1, -1) # (1, K) squared distances to all centers
probs = softmax(-dists/temperature, dim=1) # (1, K)
loss = -log(probs[0, q_i])
In words: we want to minimize the distance between the query and the center of the class with label q_i, the label of the query q, and at the same time “maximize” it for all other classes. To transfer our method to theirs, all we have to do is to average all rows with the same label.
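
A minimal sketch of this centroid variant under the same assumptions as above; the names and shapes are ours and it assumes that every class 0..K-1 occurs in the batch:

import torch
import torch.nn.functional as F

def centroid_loss(H, y, temperature=0.1):
    # H: (n, dim) embeddings, y: (n,) integer labels in 0..K-1
    K = int(y.max()) + 1
    # average all rows with the same label to get one center per class
    C = torch.stack([H[y == k].mean(dim=0) for k in range(K)])  # (K, dim)
    loss = 0.0
    for i in range(H.shape[0]):
        dists = ((H[i:i+1] - C) ** 2).sum(dim=1).view(1, -1)    # (1, K)
        probs = F.softmax(-dists / temperature, dim=1)
        loss = loss - torch.log(probs[0, y[i]] + 1e-8)
    return loss / H.shape[0]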

Recap: it all comes down to context, and we do not want to make a decision based on a single example. Furthermore, we are more interested in learning a good representation that can later be used to model the similarity of samples via distances than in enforcing linear separability. To see if the learned representation is useful, we use t-SNE on a subset of the data to study clusters, their spread and how well they are separated, both during training and at the end.

Now one might ask why we do not optimize the kNN objective directly. The loss functions discussed so far already optimize “label neighborhoods”, but we can also do it directly with the loss function from [5]:

The notation of query and batch stays the same, only the loss is a bit different: let X_l be the rows of X with label l and let X_i be X without row i. The idea is to measure how likely it is to sample a data point with label l in the neighborhood of q:


dists_l = ((q - X_l)**2).sum(dim=1)
dists_i = ((q - X_i)**2).sum(dim=1)
prob = exp(-dists_l/temperature).sum() / exp(-dists_i/temperature).sum()
loss = -log(prob)

Again, it is not hard to see that all these loss functions are related, because they all consider a query and the distances to a specific subset of X. What differs is how they use this information to derive the loss and how the batch is arranged.

In the end, it all comes down to learning a neighborhood for a query that is dominated by points with the same label. The other points are pushed away, but not in an arbitrary manner, since we iterate over the whole batch to select the query. Thus, a violator is simultaneously pushed away from a supporter of a different class and pulled in by a supporter of its own class. One could say that we try to learn an equilibrium that aims at stable neighborhoods for all queries. Whether this actually works depends on the data, the network and the learning algorithm.

Bottom line, one issue of softmax networks is that they are sometimes too confident even when they are wrong. There are ways to fight this, like label smoothing or using [5] as a regularizer, but those solutions are not fully satisfying. Furthermore, such a network expresses all decisions in terms of the learned classes, or stated differently, by the affinity of an input to each template vector. This means that even inputs from unknown classes can lead to ridiculously high confidence scores. This is no general weakness, since the representation of the network is learned by linearly separating classes with the template vectors. But this behavior might not be wanted if the ultimate goal is to work with the feature representation, for instance to search, rank or cluster new data, rather than with class predictions. Then it might be a better idea to use a neighborhood-inspired loss function to learn features.

After all these years, we still feel like we are at the very beginning of our journey, but this is probably how lots of researchers feel. This reminds us of what Plato said: “never discourage anyone who continually makes progress, no matter how slow”.

[1] colah.github.io/posts/2014-03-NN-Manifolds-Topology/
[2] ttic.uchicago.edu/~gregory/papers/nips2014_WithSupp.pdf
[3] raberrytv.wordpress.com/2020/03/23/you-are-not-alone-ask-your-neighbors-for-help/
[4] papers.nips.cc/paper/6996-prototypical-networks-for-few-shot-learning.pdf
[5] raberrytv.wordpress.com/2019/07/06/implementing-loss-functions/

PyTorch: CPU / GPU Model Sharing

In earlier phases of research we focus on the clarity of the model code and not on performance. As Knuth stated, premature optimization is the root of all evil: first the code should work and be clear, then it should be thoroughly analyzed for possible bottlenecks. In PyTorch, thanks to dynamic graphs, you can express your crazy ideas for neural nets in a very pythonic way, which helps a lot to express your thoughts quickly.

We recently tested an approach that contains different kinds of modules: standard ones like word embeddings based on 1D convolutions and fully connected layers, but also a new kind of routing that we implemented as a loop. It is not hard to see that the model as a whole is not well suited to run on a GPU, which can be quickly confirmed with an actual test. What to do? PyTorch offers mechanisms to use multiple GPUs, but the straightforward way does not work in our case.

The naive approach is to distribute different parts of the model to multiple GPUs. Since we often use a standard laptop, and thus only have one GPU, we tried the simple approach of moving the convolution part to the GPU and transferring the result back to the CPU:


net = Network()
net.word_emb = net.word_emb.to('cuda:0')
[..]
emb = net.word_emb(batch).cpu()

Even though this is clearly not optimal, the time saving was up to 30%. Stated differently, the idea is to take the largest coherent block of your network and transfer this part to the GPU. The tricky parts, like loops and parts that do not rely solely on matrix operations, stay on the CPU. The benefits, if any, can be easily measured by comparing the required time per step on CPU only vs. CPU/GPU.
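
A minimal sketch of such a measurement; step_fn and batch are hypothetical placeholders for your own forward step and data:

import time
import torch

def time_per_step(step_fn, batch, n=20):
    # average wall-clock time of a single step
    start = time.time()
    for _ in range(n):
        step_fn(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for pending GPU work before stopping the clock
    return (time.time() - start) / n

# usage (hypothetical): time_per_step(lambda b: net(b), batch)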

Bottom line: clearly, this is not rocket science, but since optimizing the batch flow can be challenging and time consuming, using all available resources is especially helpful in early stages of research to see if your model works at all in reasonable time.

PyTorch – Free Your Memory

The problem is that with more memory in your GPU you want to fit bigger models, or at least train faster with a larger batch size. This is not limited to the GPU, but there the memory handling is more delicate. In other words, you don’t want to waste your precious memory on data that is no longer used. This sounds trivial, but because of the reference handling in Python there are sometimes side effects.

The FAQ [1] of PyTorch already contains some good hints. For instance, if you just want to track the loss of your network, use either float(loss) or loss.item(). Both statements convert the differentiable variable into a native Python float and avoid references that prevent the object from being marked as ‘to-be-freed’.
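
A minimal sketch of the anti-pattern and its fix; the model and data are placeholders:

import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

history_bad, history_ok = [], []
for _ in range(10):
    loss = F.cross_entropy(model(x), y)
    history_bad.append(loss)        # keeps the whole graph alive -> memory grows
    history_ok.append(loss.item())  # plain Python float, the graph can be freed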

The symptom of this problem is easy to spot, since the memory usage of your training process increases continually and often dramatically. Surely there are dynamics in the memory usage, but when larger chunks are never released, something might be wrong. This is reasonable, since usually only the required data is kept in memory, not everything. In case a cache is built up gradually the situation might be different, but even then the usage should not grow too fast.

The accumulation problem probably happens quite easily, but it is also easy to fix, since we only need to analyze the use of loss variables. However, sometimes problems are more sneaky and need more investigation. It would be great to have some kind of cookbook for such problems, or at least best practices compiled somewhere, to make the life of engineers easier.

A tip that is also noted in the FAQ is to explicitly use the ‘del var’ statement to emphasize that you do not need the object any longer.

Update 2020-03-27: Since in the context of NLP we use Conv1d a lot for dynamic word embeddings, it should be noted that there is a problem in recent versions of PyTorch. In the default setting the memory continually increases, which leads to an out-of-memory problem sooner or later. The issue is summarized in ticket #27971. The workaround is to set an environment variable: ‘export LRU_CACHE_CAPACITY=1’. We tested this kludge with PyTorch 1.4 and the nightly build; it works and the memory does not grow any longer.

It would be interesting to know what others do to keep the (PyTorch) memory free and tidy.

[1] pytorch.org/docs/stable/notes/faq.html

Self Attention & Factorization

We stumbled upon an interesting blog post [1]; the referenced paper is [arxiv:2003.08165], and besides a lot of interesting ideas, the authors describe a simpler variant of self attention.

As a reminder, self attention is defined as follows:

queries, keys, values = X*Wq, X*Wk, X*Wv
probs = softmax(queries * keys^t / sqrt(dim))
out = probs * values

where X is the input sequence of shape (batch x seqlen x dim). Because of the residual connections, we assume W matrices of shape (dim, dim). Therefore, the total number of parameters is 3*(dim*dim); for dim=50 this equals 7,500.

The paper simplified the approach by omitting the embedding of the values:

out = probs * X

Because probs is of shape (seqlen x seqlen), the shape of X is not altered and residual connections are still possible. Furthermore, the W matrices have a different shape (dim, d) where d << dim. If we assume d=10, the total number of parameters now equals 2*(dim*d) = 1,000.

Even if our networks are very modest compared to a full transformer, we always aim to optimize the space and time complexity if it does not hurt the performance. Thus, we tried to replace the self attention in an existing network with this variant, but found that the performance degraded notably.

In case of residual connections, the shape of the output must be the same as the input, and thus the shape of the Wv matrix is fixed. However, we can factorize the Wv matrix into two lower-rank matrices: Wv = U * V, where U = (dim, d) and V = (d, dim).
With this modification, the total number of parameters is 2*(dim*d) + 2*(dim*d) = 4*dim*d = 2,000. This is slightly more than in the simplified version, but still less than in the original method. With this slight modification, the performance of the network actually improved compared to the unaltered baseline.
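
A minimal PyTorch sketch of this factorized variant as we understand it; the module and parameter names are ours, dim and d are placeholders, and the scaling factor is an assumption:

import math
import torch

class FactorizedSelfAttention(torch.nn.Module):
    def __init__(self, dim=50, d=10):
        super().__init__()
        self.Wq = torch.nn.Linear(dim, d, bias=False)  # low-dim queries
        self.Wk = torch.nn.Linear(dim, d, bias=False)  # low-dim keys
        self.U = torch.nn.Linear(dim, d, bias=False)   # Wv = U * V, factorized values
        self.V = torch.nn.Linear(d, dim, bias=False)
        self.d = d

    def forward(self, X):
        # X: (batch, seqlen, dim)
        q, k = self.Wq(X), self.Wk(X)
        probs = torch.softmax(q.matmul(k.transpose(1, 2)) / math.sqrt(self.d), dim=-1)
        values = self.V(self.U(X))       # (batch, seqlen, dim), rank at most d
        return X + probs.matmul(values)  # residual connection

# usage with dummy data
out = FactorizedSelfAttention()(torch.randn(2, 7, 50))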

So, the conclusion is that the embedding of queries, keys and values is an essential part of self attention, but also that we do not always need full-rank matrices. Factorized weights are not new, probably not even for transformer architectures, but they can have a noticeable impact.

[1] attentionagent.github.io

You Are Not Alone – Ask Your Neighbors For Help

In these times you hardly get any synergy at the coffee machine in the office kitchen. Nevertheless it is still important to draw inspiration from somewhere.

We are still searching for an approach that combines long-range context and attention for learning with small-scale datasets. We like the ELECTRA model, since there every token gets a learning signal, instead of just the masked ones. However, Transformer-based models are very data hungry and thus not really suited for problems with little data.

The experiments we conducted with ‘dynamic attention’ [2] led to very promising results, but the fact that a classification does not consider relevant neighbors (let’s call it context) for the final decision sometimes led to silly mistakes. In [1] a method is proposed to improve the classification of trained models by using a cache. There are other papers, but this one has the advantage of being very simple and elegant. We will give a brief description and then sketch how we used the idea during training.

Let’s assume that we have a trained model f(X)->Y, where f() is a function that embeds arbitrary data X into some space to derive the final prediction Y, which is the softmax output over K classes. The cache memory C is filled with the output of, say, the penultimate layer of f(), or stated differently, the output of this hidden layer g. All memory content is L2 normalized. Each slot also stores the corresponding label as a one-hot vector (K dims): C = [(g(x1), y1), …, (g(xn), yn)].

To determine a label distribution from the cache, the cosine score between a query g(xquery) and all slots in the cache is calculated, followed by a softmax. The scale parameter defines the ‘peakiness’ of the distribution.
scores = matmul(g(xquery), (g(x1), ..., g(xn)))
probs = softmax(scores*scale, dim=1)

The output is a similarity distribution of the cache with respect to the query. Stated differently, slots that are more similar contribute more to the final decision, while slots that are ‘sufficiently far away’ contribute very little.

Now we need to combine this information with the labels of each slot to get a soft distribution over the labels:

yhat = matmul(probs, stack((y1, ..., yn)))

The shape of yhat is (1, K) and its content sums to one. Thus, it can be directly used as a proxy for classification.

Intuitively, the method considers the neighborhood of the query embedding and derives the final decision according to the angle (cosine score) between the query and the cache slots. It is like a k-NN classification, with the difference that every slot gets to vote, not only the top-k, even if its contribution is tiny.
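
A minimal sketch of this cache lookup; the embedding dimension, the scale and the random cache contents are placeholders:

import torch
import torch.nn.functional as F

dim, K, n, scale = 32, 5, 100, 10.0
cache_emb = F.normalize(torch.randn(n, dim), dim=1)              # L2-normalized slots g(x1..xn)
cache_labels = F.one_hot(torch.randint(0, K, (n,)), K).float()   # one one-hot label per slot

query = F.normalize(torch.randn(1, dim), dim=1)                  # L2-normalized g(xquery)
scores = query.matmul(cache_emb.t())                             # cosine scores, shape (1, n)
probs = F.softmax(scores * scale, dim=1)
yhat = probs.matmul(cache_labels)                                # (1, K), sums to one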

In a nutshell: the method can be used with any pre-trained network from any domain (nlp, vision, video) to get a second opinion for a prediction.

Now, our goal is still to use more context during learning, and thus we decided to use this method during training. It is similar to [3], but we opted for a simpler approach where we use a circular buffer to fill the cache. In our case, we use the method as a kind of regularizer to learn a representation that disentangles and separates the individual classes. Thus, we only use the labels during training and also have no softmax layer.

At the beginning, the cache is empty and we need to fill it with output from the net. Until the cache is sufficiently filled, there is no learning signal. A batch consists of K samples, each with a different label. Every row of the batch, query_i, acts as a query which is used to get a soft label distribution from the cache. The loss is -log(yhat[0, y_i]) for query_i with label y_i. Then the whole batch is added to the cache; in PyTorch we use query_i.detach() to ensure that slots in the cache do not require a gradient. If the cache is full, the oldest item is overwritten (circular buffer).

Why do we detach the node from the graph? Without this operation, further gradients would still depend on older cache entries, which is not scalable. Thus, the gradient is derived only for query_i and the cache contents are considered “fixed”. In other words, the gradient is not backpropagated through the cache, only through the neural net.
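
A minimal sketch of this training scheme; the embedding net, the cache size and the hyper-parameters are placeholders and not taken from the paper:

import torch
import torch.nn.functional as F

dim, K, cache_size, scale = 32, 5, 256, 10.0
net = torch.nn.Linear(16, dim)                       # placeholder embedding net g()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
cache_emb = torch.zeros(cache_size, dim)
cache_lab = torch.zeros(cache_size, K)
filled, pos = 0, 0

for step in range(100):
    x = torch.randn(K, 16)                           # one sample per class (dummy data)
    y = torch.arange(K)
    h = F.normalize(net(x), dim=1)
    if filled >= cache_size:                         # no learning signal until the cache is full
        probs = F.softmax(h.matmul(cache_emb.t()) * scale, dim=1)
        yhat = probs.matmul(cache_lab)               # (K, K) soft label distributions
        loss = -torch.log(yhat[torch.arange(K), y] + 1e-8).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # add the batch to the circular buffer; detach() keeps the cache out of the graph
    for j in range(K):
        cache_emb[pos] = h[j].detach()
        cache_lab[pos] = F.one_hot(y[j], K).float()
        pos = (pos + 1) % cache_size
        filled += 1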

We compared the representations of a standard softmax network and this approach over several runs with 2D t-SNE plots. The cache-based results were more tightly clustered and better separated with respect to class boundaries.

Bottom line: this method shows that a good regularizer does not need to be very complex to have a huge impact. There is a tendency to use fancy models and customized loss functions to solve problems, but we should never forget that sometimes all we need for a rock solid baseline is a standard net and a well-known loss function, maybe with a little tweak like a cache.

[1] “A Simple Cache Model for Image Recognition” [NIPS 2018]
[2] raberrytv.wordpress.com/2020/02/01/attention-via-routing-by-agreement/
[3] [arxiv:1703.03129] “Learning to Remember Rare Events”

Dropout was Yesterday – Here Comes Dropin

Inspired by the new ELECTRA method that was also mentioned in the blog post [1], we thought about similar ways to speed up learning by getting a signal from every token in the sequence. The idea of ELECTRA [2] is simple: some tokens are replaced with reasonable substitutes and then, for every token, there is a binary prediction of whether the token is unaltered or modified. This way there is a prediction for every token and not only for the masked ones.

In masked language modeling, a token is either replaced with the mask token, kept as-is, or replaced with a noise token sampled from the vocabulary. But what happens if we instead insert new “intruder” tokens that are randomly sampled from the vocabulary, considering only tokens that are not present in the chosen sequence? Then we classify each token, not as original vs. replaced, but as original vs. noise.

We even went a step further and reformulated the task as supervised source separation. The idea is to use a small transformer net and to concatenate two sequences S=A||B without any separator token. The input S is fed into the net to get a contextualized embedding F(S). Again, we use a binary classification to predict every token of A as ‘1’ and every token of B as ‘0’. It is reasonable to assume that some words appear in both A and B, but since they are contextualized, it should be possible to predict the source label from them.
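
A minimal sketch of how such a batch and the per-token loss could be set up; the token IDs are dummy data and the contextualizer is a simple stand-in (a GRU here) for the small transformer mentioned above:

import torch
import torch.nn.functional as F

vocab, dim = 1000, 64
emb = torch.nn.Embedding(vocab, dim)
enc = torch.nn.GRU(dim, dim, batch_first=True)      # placeholder contextualizer F()
head = torch.nn.Linear(dim, 1)                      # per-token binary prediction

A = torch.randint(0, vocab, (1, 12))                # dummy token IDs of sequence A
B = torch.randint(0, vocab, (1, 9))                 # dummy token IDs of sequence B
S = torch.cat([A, B], dim=1)                        # S = A||B, no separator token
labels = torch.cat([torch.ones(1, 12), torch.zeros(1, 9)], dim=1)  # 1 for tokens of A, 0 for B

ctx, _ = enc(emb(S))                                # (1, 21, dim) contextualized embeddings
logits = head(ctx).squeeze(-1)                      # (1, 21)
loss = F.binary_cross_entropy_with_logits(logits, labels)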

We do not have any decisive conclusions yet, but we think it is extremely important to think outside the box and to test new ideas, as crazy or stupid as they may seem.

[1] ai.googleblog.com/2020/03/more-efficient-nlp-model-pre-training.html
[2] openreview.net/forum?id=r1xMH1BtvB