Learn, Network, Learn: SGD with Restarts

Anyone who has tried to train a bigger model from scratch probably knows how frustrating this can be, because papers are often very brief about failures and about the heuristics that are necessary to make the model converge. And even then, it often takes some patience and experience to get your network off the ground. With normalizing layers and more sophisticated optimizers like Adam, training is much easier than in the early days of AlexNet. However, even with all the fancy stuff, it is not unusual that a lot of hyper-parameter tuning is required before training starts to converge, and the sad part is that this tuning often depends on the dataset and the model architecture and thus cannot be reused automatically.

During this year, we read a lot of papers and tried a lot of models, including unusual ones, and then we thought about what helped most with the models that gave us the most trouble to converge. It is sad but true that many of these phenomena cannot be explained precisely, but it is at least helpful to know the following tricks to let your model converge to some useful state:

– Despite Adam's popularity, it was and still is not our first choice; instead we use Adagrad. Maybe with adjusted hyper-parameters (betas, eps) Adam would perform just as well, but Adagrad requires less tuning and converged faster most of the time. Since we use a broad range of datasets and models, a bias towards a particular dataset or architecture cannot be the only reason. So starting with Adagrad and a learning rate of ~0.05 was a pretty good baseline.

– Gradient clipping is used to avoid exploding gradients, especially for RNNs, but empirically it also helps for loss functions whose landscape can lead to very large gradients. In general it is often useful to decouple the direction of the gradient from its length. Clipping gradients never made a non-converging model converge, but it often helped to stabilize the training (a clipping call is included in the training-step sketch below).

– But by far the most successful method was to use cyclical learning rates. There are some hints in the literature why this helps, like "avoiding spurious minima", but it is not fully understood yet. In contrast to a plain learning rate decay, the rate goes up and down according to some schedule; popular choices are sine or cosine schedules. The idea is elegant and simple: first the learning rate increases up to LR_MAX and then it decreases to LR_MIN (=0). If we set RESTART to 10 steps, we go from 0 to LR_MAX in RESTART/2 steps and then from LR_MAX down to LR_MIN in RESTART/2 steps, at least for a sine schedule. The formula is also straightforward: lr_next = sin((STEP % RESTART) / RESTART * PI) * LR_MAX.

Let us consider the extreme cases: sin(0*pi) = sin(1*pi) = 0 and sin(0.5*pi) = 1, which means we start at zero, halfway to RESTART the rate reaches its maximum, and at the end it is zero again. The step counter STEP is incremented at each backprop step and therefore has to be mapped into the range [0, RESTART) via the modulo operation.
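
To make this concrete, here is a minimal sketch of such a training step in PyTorch, combining the Adagrad baseline, gradient clipping and the sine schedule from above. The names net, loader and loss_fn, as well as the clipping threshold, are our own illustrative placeholders, not prescriptions.

import math
import torch

LR_MAX, RESTART, CLIP_NORM = 0.05, 10, 1.0
opt = torch.optim.Adagrad(net.parameters(), lr=LR_MAX)

step = 0
for x, y in loader:
  # cyclical learning rate: 0 -> LR_MAX -> 0 every RESTART steps
  lr = math.sin((step % RESTART) / RESTART * math.pi) * LR_MAX
  for group in opt.param_groups:
    group['lr'] = lr
  loss = loss_fn(net(x), y)
  opt.zero_grad()
  loss.backward()
  # decouple direction from length: rescale the gradient if its norm exceeds CLIP_NORM
  torch.nn.utils.clip_grad_norm_(net.parameters(), CLIP_NORM)
  opt.step()
  step += 1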

So why does this make the difference between no learning at all and gaining momentum after a few dozen steps?
For one model that did not converge, we at least tried to understand what happens to the gradient norm over time. For the test, all parameters were the same; only the learning rate was either cycled (1) or kept fixed (2). In case (2) the gradient norm quickly dropped to almost zero and never recovered, which indicates that the model is trapped inside an unfavorable region of the loss landscape and requires smaller steps to escape it and to explore more favorable regions. In case (1) the model starts with a very small learning rate, which reduces the chance to overshoot more favorable paths, but at the same time it can also escape spurious minima by using larger step sizes later in the cycle. After some initial very small gradient norms, like in (2), it gains 'momentum' and the gradient norm continually increases.
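
For reference, the quantity we tracked is simply the global L2 norm over all gradients after each backward pass; a minimal sketch (net and step are again placeholders):

# global L2 norm over all gradients, logged once per backprop step
grads = [p.grad.detach() for p in net.parameters() if p.grad is not None]
total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
print("step=%d grad_norm=%.6f" % (step, float(total_norm)))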

We have to admit that this explanation is not satisfying at all, but at least it allows us to solve the problems at hand, and with the insights we gain from the model and its learned representation, we hope to eventually put together the puzzle pieces that explain why this step is so useful.

In a nutshell: also consider less popular optimizers, clip your gradients and treat your learning rate as a cycle.

Flow Gradient, Please Flow, Darn It!

With residual networks, batch normalization and sophisticated recurrent nets like LSTMs, it seems that getting a network to "fly" is just a matter of data and time. It probably feels like that because most papers do not mention the effort the authors had to invest before the network actually started learning. We have no evidence for this, but we have a hunch that a successful setup can take a lot of time, mostly to figure out the hyper-parameters and to tune the network architecture. But who knows, we could also be totally wrong.

What is definitely useful is to visualize as many aspects of the learning as possible. This includes keeping track of statistics like the gradient norm, the magnitudes of activations, the accuracy and the loss, among many other things. One excellent example is TensorBoard, which lets you visualize metrics during learning. But sometimes it simply helps a lot to visualize the "flow" of the gradient through the network, to see whether the training is healthy or not.

In our case, we tried to train a network that contains two LSTMs, but despite the mitigation of vanishing gradients that LSTMs should provide, there was no real flow in the network and thus the loss did not decrease as expected. For each learning step we dumped the norm of the whole gradient, but this does not tell us anything about the flow through the network per layer. So we decided to plot the absolute mean of the gradient for each layer, excluding bias values, to debug which layer is responsible for "blocking" the backward flow of the error signal.

We used the following script:

import matplotlib.pyplot as plt

def plot_grad_flow(net):
  # collect the mean absolute gradient per weight layer; call after loss.backward()
  lay_grads, lay_names = [], []
  for name, param in net.named_parameters():
    if param.requires_grad and param.grad is not None and 'bias' not in name:
      name = name.replace('.weight', '')
      lay_names.append(name)
      mean = float(param.grad.abs().mean())
      print(" ++ %s=%.6f" % (name, mean))
      lay_grads.append(mean)

  # one bar per layer, labeled with the layer name
  plt.bar(range(len(lay_grads)), lay_grads, align='center', alpha=0.3)
  plt.hlines(0, 0, len(lay_grads) + 1, linewidth=1, color="k")
  plt.xticks(range(0, len(lay_grads), 1), lay_names, rotation=90)
  plt.xlim(left=-1, right=len(lay_grads) + 1)
  plt.ylabel("avg grad")
  plt.title("gradient flow")
  plt.grid(True)
  plt.tight_layout()
  plt.show()

We borrowed ideas from a post in the PyTorch discussion forum[1] with some adjustments: for instance, without the tight-layout step not all labels were visible on the x axis, and we used bars instead of a line plot.
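
For completeness, this is how we would typically call it inside the training loop; net, loader, loss_fn and opt are placeholders, and plotting every 100 steps is an arbitrary choice. The plot only makes sense after the backward pass has populated the gradients:

for step, (x, y) in enumerate(loader):
  loss = loss_fn(net(x), y)
  opt.zero_grad()
  loss.backward()
  if step % 100 == 0:  # plotting every step would slow down training considerably
    plot_grad_flow(net)
  opt.step()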

It is hard to come up with heuristics for absolute values, but if a bar in the plot is almost zero for several iterations, that is a good indicator that something is wrong: if the mean is close to zero, the weights of this particular layer will not get updated much, and thus the loss will likely not go down because the gradient is "blocked".

Even though the visualization does not offer any concrete solution, you at least get a clue where to start debugging, which can help a lot if you have a 24-layer network.

[1] discuss.pytorch.org/t/check-gradient-flow-in-network/15063

Implementing Loss Functions

With PyTorch it is pretty easy to implement arbitrary loss functions thanks to the dynamic computational graph. You can use loops and other Python flow control, which is extremely useful when you start to implement a more complex loss function. We follow the principle "readable first, performance later", which means we do not care whether the code is efficient, as long as it is easy to read and, of course, easy to debug. Only when we are sure, after extensive tests, that we implemented the loss function correctly do we think about optimization. The recipe from Andrej Karpathy[1] gives the good hint to overfit on a small dataset before training on the whole dataset. This is very important, since if we cannot get the loss to go down on a tiny dataset, something is wrong and we need to debug our code.

We recently stumbled upon a paper by Hinton and colleagues[arxiv:1902.01889] with the note that a batched version of the loss is provided for TensorFlow (in the CleverHans package). However, since we are using PyTorch, we started from scratch to learn something about the challenges.

Let's start with some notation. We have a batch X with the corresponding labels Y, where X has shape (n, dim) and Y has shape (n) and contains integer labels, while X contains the features from some hidden layer.

For a sample (x, y), the distance between x and some other sample x' is used to define the probability of sampling x' as a neighbor of x, see chapter 2 in [arxiv:1902.01889]. To turn distances into 'similarity' values, a Gaussian kernel is used: exp(-dist(x, x')**2/T), where T is the temperature, a tunable parameter that gives more weight to larger distances if T is chosen larger, and dist is the Euclidean distance between x and x'.

For a sample x_i of label y_i, the loss is calculated as:

same_i = sum(j=1 to n, j != i and y_i == y_j) exp(-dist(x_i, x_j)**2/T)
rest_i = sum(k=1 to n, k != i) exp(-dist(x_i, x_k)**2/T)
prob_i = same_i/rest_i
loss_i = -log(prob_i)

The (j != i) and (k != i) parts are also known as "leave one out", which means we skip the sample itself to avoid a self-reference, where exp(-dist(x_i, x_i)**2/T) would be 1 because the distance is zero.

The idea is pretty simple: if two samples share the same label, the distance between the pair should be lower than the distance to any sample with a different label. This is the (y_i == y_j) part in the same_i formula, which selects all samples with the same label. The negative log step should be familiar from the softmax loss.

Let’s consider two cases before we start coding:
(1) A pair (x_i, x_j) shares the same label and is actually pretty close. Thus exp(-dist(x_i, x_j)**2/T) is close to 1, since dist(x_i, x_j) is close to zero. If the other distances (x_i, x_k) are reasonably large, the term exp(-large**2/T) is close to zero. So the terms of same_i look like [0.99, 0.99], because all entries with the same label are pretty close. And if the other entries are well separated, the terms of rest_i look like [0.99, 0.01, 0.99, 0.01, 0.01], where the 0.99 values belong to entries with the same label and all other entries have different labels. If we sum these up and divide, we get a probability of ~0.98 and thus a loss of ~0.015, as expected.

(2) Now, if the pair (x_i, x_j) is further apart and a sample x_k with a different label is closer to x_i, the terms of same_i look like [0.01, 0.99], since one entry now has a larger distance. And since an impostor now has a smaller distance, rest_i looks like [0.99, 0.01, 0.01, 0.99, 0.01]. The values of rest_i are the same because only two entries switched places, but the sum of same_i is different now. This means the probability is only ~0.5 and the loss is ~0.70.

Bottom line, the loss is supposed to be low if all samples with the same label are densely located in the feature space, or at least if each sample is closer to the samples with its own label than to any sample with a different label. This is the soft version of the classical k-nearest-neighbor method.

What follows is a naive implementation of the loss:

idx = torch.arange(batch_x.shape[0])  # sample indices, used for "leave one out"
temp = 10.  # temperature T (illustrative value)
loss = 0.
for i in range(batch_x.shape[0]):
 label = batch_y[i]
 batch_same = batch_x[(batch_y == label) & (idx != i)]  # same label, but not i itself
 batch_rest = batch_x[idx != i]                         # everything except i itself

 dists_1 = ((batch_x[i] - batch_same)**2).sum(dim=1)  # squared distances to same-label samples
 dists_2 = ((batch_x[i] - batch_rest)**2).sum(dim=1)  # squared distances to all other samples

 prob = torch.exp(-dists_1/temp).sum() / torch.exp(-dists_2/temp).sum()
 assert float(prob.data) <= 1
 loss = loss - torch.log(prob)  # accumulate the per-sample loss
loss /= batch_x.shape[0]

That is not scary at all thanks to some nice features in PyTorch. First, with masks it is straightforward to filter entries in the batch, and since we can also use the logical AND operation, it is possible to combine selection filters. For batch_same we select all entries with the same label, except entry i, while for batch_rest everything except entry i is selected. The calculation of the squared distances is also easy. Then we turn the distances into 'similarities', sum them up as in the example, and divide them to get the final probability of sampling an entry with the same label. At the end, we average the loss and we are done.

No doubt the code can be optimized, but what we wanted to show is that with PyTorch the implementation of complex loss functions, ones that use advanced indexing for example, is straightforward. How to fully utilize GPU devices or implement efficient batching is a different story that we will tell later.

[1] karpathy.github.io/2019/04/25/recipe/

Sharing Useful Hints How To Train Neural Nets

These days it is pretty easy to find useful material about neural nets with respect to architectures, frameworks and methods. Without a doubt there is also material that gives you hints on how to avoid common pitfalls, but a lot of the text out there is about illustrating more complex ideas like BERT, giving an overview of an area, or introducing a framework.

This is already helpful, but sometimes the things that were *not* said are even more interesting. That is why we welcome the blog post by Andrej Karpathy[1] so much, since it is a cookbook on how to avoid common mistakes. This kind of assembled knowledge is rather rare, but very much needed to lower the level of frustration for newcomers when they start to dive into neural nets.

By chance we stumbled upon a homework assignment from a university[2]. It is about implementing recurrent nets and gives exactly those hints that are too often missing. Like "As always, make sure you can overfit a very small training set as an initial test"[2], which is also present in the recipe[1]. Or "and small hidden sizes for the RNN may work better than you think"[2], which hints that you should start small and only increase the capacity if needed. Or that any net with a softmax should start with a loss of -log(1/#number_of_classes) [2, "Getting started"].
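
The last hint is a cheap but effective sanity check; for example, with 10 balanced classes an untrained softmax classifier should report a cross-entropy loss of roughly this value, and anything far off hints at a bug:

import math
n_classes = 10  # example: a 10-class problem
expected_initial_loss = -math.log(1.0 / n_classes)  # ~2.303 for 10 classes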

For recurrent nets there are already some well-known hints, like gradient clipping, but the proper value likely depends on the data plus the architecture, so it might be a good idea to state the actual value instead of just writing "gradients were clipped". Or initializing the gates of a highway network with a negative value to bias the first evaluations towards a particular state. This is definitely not rocket science, but it might help to focus on the actual problem instead of on getting the initial net up and running.

For neural net veterans or skilled researchers this might seem trivial, but successfully training a neural net can be challenging for newbies, since there is a lot of stuff you have to keep in mind. Plus, lots of hyper-parameters affect each other, and even if you do almost everything right it might still not work.

Bottom line, maybe we can make it a rule to also talk about failures and how we fixed them, instead of telling only the good parts of the story. If the code is available, those hints are sometimes given in the code, but scanning a whole code base for hints can also be time consuming.

[1] karpathy.github.io/2019/04/25/recipe/
[2] www.cs.utexas.edu/~gdurrett/courses/sp2019/a4.pdf

Character-Based Neural Language Models

With enough data and computational power, it is quite easy these days to train a high-capacity language model. In this case you can start from scratch, since there is enough data, like Wikipedia, to learn all model parts jointly, which includes the word embedding. However, not all of us have access to large-scale machinery and thus it is still a good idea to think about solutions with a low memory footprint.

Of course we are aware that there are plenty of pre-trained word embeddings, also for non-English languages, and some of them include subword information. However, with 1M tokens and 1M n-grams, each with 300 dimensions, such a model alone needs ~1 GB of memory, which is quite a lot for smaller projects.

So, are there any other options to train an embedding with fewer resources but similar expressive power? Well, we did not perform rigorous comparisons, but if we go down to the character level, the required resources for embeddings are much lower. The price is that we now require more computational power. For example, a sequence of N words needs N steps with a recurrent net. In the case of characters, we have #(word_1) + #(word_2) + … + #(word_N) steps, or stated differently, the sum of the characters per word, which is much larger than N. For a word-based approach, the embedding matrix is V x dim, where V is the number of distinct words, whereas for character-based methods it is M x dim, where M is the number of distinct characters, which is usually much lower. Depending on the data, M is maybe 100, whereas V is often in the range of 25,000 or more.

The trade-off between time and space is a necessary evil, but we can "cheat" a little to improve our situation. What if we use the character level for the embedding, to restrict the number of parameters, but at the same time let the recurrent net, or whatever is used in the architecture, operate on words? How is that possible? We borrow ideas from conv nets for text. The idea is to treat each word separately and learn a dynamic embedding. Too abstract? Okay, let's discuss some details.

Let's consider the sequence "beer is awesome". The set of characters is {'a', 'b', 'e', 'i', 'm', 'o', 'r', 's', 'w'} and we have three words: [beer], [is], [awesome]. Now we need to convert each sequence of characters, like [b, e, e, r], into a fixed representation, and of course also [i, s] and [a, w, e, s, o, m, e]. This can easily be done with a 2d convolution and max-pooling over time. The nice thing is that such a CNN encoder can convert words of arbitrary length into a fixed-size representation. With such a function, we can build the vocabulary dynamically per batch and use it as the input to the recurrent net, which now operates on words again. In addition to the computational advantage, we are also able to encode variations of words or even new words, as long as they use the same character set as in training. For example, [beers] or [awesomeness] can also be encoded, and hopefully each embedding is close to its "parent" word.

In PyTorch this is pretty easy to code. We set char_dim=15 and n_filters=10:

import torch
import torch.nn as nn

char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
conv1 = nn.Conv2d(1, n_filters, (1, char_dim))

# sentence: [n_batch, max_wordlen], one row of char ids per word
emb = char_emb(sentence).unsqueeze(1)  # [n_batch, 1, max_wordlen, char_dim]
c1 = conv1(emb)                        # [n_batch, n_filters, max_wordlen, 1]
# maximum for each filter, across all chars(!)
h1, _ = c1.squeeze(3).max(2)           # [n_batch, n_filters]

This toy code shows the basic idea, but is limited in its capacity. This can be addressed by using wider filters and more of them. It should be noted that this is not our work; we borrowed ideas from [arxiv:1508.06615] and from other excellent publicly available research.
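
To illustrate what "wider filters and more of them" could look like, here is a hedged sketch with several filter widths whose pooled outputs are concatenated; the widths and filter counts are arbitrary illustrative choices, not the settings from [arxiv:1508.06615]:

# assumes max_wordlen >= max(widths), i.e. words are padded to a minimal length
widths, n_per_width = [1, 2, 3, 4], 25
convs = nn.ModuleList([nn.Conv2d(1, n_per_width, (w, char_dim)) for w in widths])

emb = char_emb(sentence).unsqueeze(1)                       # [n_batch, 1, max_wordlen, char_dim]
feats = [conv(emb).squeeze(3).max(2)[0] for conv in convs]  # each [n_batch, n_per_width]
word_vec = torch.cat(feats, dim=1)                          # [n_batch, len(widths) * n_per_width]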

Bottom line, with this conv-based word encoder that works on the character level, we are able to generate a dynamic embedding for arbitrary words. The embedding can then be used as input for other model components, like transformer networks or recurrent nets. Furthermore, since most language models are trained in an unsupervised way, we can also use the character-embedding function to build embeddings for other tasks in a very compact way, since there is no embedding matrix, but a function that does the embedding with far fewer parameters.

PyTorch – Combining Dense And Sparse Gradients

If you train a vanilla neural network, gradients are usually dense. However, in PyTorch the embedding layer supports the "sparse=True" option to speed up learning in the case of larger vocabularies; we wrote about it before[1]. The only optimizers that can handle both dense and sparse gradients are SGD and, not to forget, Adagrad. But it should be noted that with momentum, for SGD, at least for versions <1.0.0, there seemed to be a slowdown after some epochs because of the accumulator. Thus we usually stayed with Adagrad, which was a good choice.

Nevertheless, it is not satisfying to be tied to one optimizer for all our problems. But what happens if you mix both types of gradients? Adam complains that it cannot handle sparse gradients and SparseAdam complains that it cannot handle dense ones. Maybe the solution is obvious, but it is probably not as straightforward as it should be. In a discussion[2] a quite elegant solution is proposed to avoid calling multiple optimizers manually.

But first, let's describe the problem precisely. Let's say we have a simple network with an Embedding layer (sparse) and a Linear layer (dense), and the forward() function combines both. If we want to use Adam as the optimizer, we need one optimizer for the sparse layers (Embedding) and one for the rest. We omit bias values for brevity.
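
For reference, a minimal sketch of what such a Network class could look like; the layer sizes and the mean pooling are just illustrative assumptions to make the snippets below runnable:

import torch
import torch.nn as nn

class Network(nn.Module):
 def __init__(self, n_tokens=100, dim=16):
  super(Network, self).__init__()
  self.emb = nn.Embedding(n_tokens, dim, sparse=True)  # produces sparse gradients
  self.out = nn.Linear(dim, 1, bias=False)             # produces dense gradients

 def forward(self, tokens):
  return self.out(self.emb(tokens).mean(dim=0))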

The learning step looks like that:

net = Network()
opt_sparse = torch.optim.SparseAdam([net.emb.weight], lr=3e-4)
opt_dense = torch.optim.Adam([net.out.weight], lr=3e-4)

y = torch.randn(1)  # dummy target, just for illustration
y_hat = net(torch.LongTensor([1, 2, 3]))
loss = ((y - y_hat)**2).mean()
opt_sparse.zero_grad()
opt_dense.zero_grad()
loss.backward()
opt_sparse.step()
opt_dense.step()

It is not hard to follow, but it could probably be a bit more intuitive. The thing is that we cannot use net.parameters(), at least not without filtering out the parameters that require a sparse gradient. In the end, we need two parameter lists and a separate optimizer for each type: sparse and dense.

At least from a software engineering point of view, as also noted in [2], we can improve the situation a little.


class MultipleOptimizer:
 def __init__(self, *op):
  self.optimizers = op

 def zero_grad(self):
  for op in self.optimizers:
   op.zero_grad()

 def step(self):
  for op in self.optimizers:
   op.step()

With this helper, the new training code becomes:

net = Network()
opt_sparse = torch.optim.SparseAdam([net.emb.weight], lr=3e-4)
opt_dense = torch.optim.Adam([net.out.weight], lr=3e-4)
opt = MultipleOptimizer(opt_sparse, opt_dense)

y = torch.randn(1)  # dummy target, just for illustration
y_hat = net(torch.LongTensor([1, 2, 3]))
loss = ((y - y_hat)**2).mean()
opt.zero_grad()
loss.backward()
opt.step()

And we could take this approach a little further by letting it figure out automatically which layers lead to sparse gradients and which lead to dense ones.
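
A possible sketch of that idea: collect the parameters of every Embedding created with sparse=True, treat everything else as dense, and wrap the two resulting optimizers in MultipleOptimizer. The function name and the choice of Adam/SparseAdam are our own illustrative assumptions:

def build_optimizer(net, lr=3e-4):
 # parameters of embeddings flagged as sparse produce sparse gradients
 sparse_params = [p for m in net.modules()
                  if isinstance(m, nn.Embedding) and m.sparse
                  for p in m.parameters()]
 sparse_ids = set(id(p) for p in sparse_params)
 # everything else produces dense gradients
 dense_params = [p for p in net.parameters() if id(p) not in sparse_ids]
 return MultipleOptimizer(torch.optim.SparseAdam(sparse_params, lr=lr),
                          torch.optim.Adam(dense_params, lr=lr))

opt = build_optimizer(Network())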

Surely there is more work to do, but our aim was to give a simple recipe for handling sparse and dense gradients without the hassle of calling multiple optimizers manually.

[1] raberrytv.wordpress.com/2017/06/28/efficient-embedding-models-with-pytorch/
[2] discuss.pytorch.org/t/two-optimizers-for-one-model/11085

Implementing The SMORMS3 Optimizer in PyTorch

Even though the Adam optimizer is a pretty solid choice when you begin to train your neural network, it is possible that learning is still slow at the beginning. Furthermore, for more complex loss functions you might need to tweak the betas and find the optimal learning rate to get the network to start learning reasonably fast. The same is true for almost any other optimizer, which can be a bit frustrating. So the question is whether there is an optimizer that requires no tweaking, or at least very little. The answer is: well, there is a variant that at least has no beta values and thus lets you focus solely on the learning rate.

Yes, we are talking about SMORMS3[1], a variant of RMSprop that was conceived by Simon Funk in 2015. To the best of our knowledge it did not get much attention back then, even though our earlier experiments confirmed that it improved learning for some tasks and often delivered the same performance as RMSprop & friends.

The reason we thought about it again is that for our current task, the designed network learns very slowly and needs a lot of time until it begins to converge. We tried quite a lot of optimizers and hyper-parameters, but as the blog post by Andrej Karpathy[2] impressively emphasizes, network training often fails silently and it might require quite a lot of debugging to find out why (axes, weights, architecture, data, etc.).

So we checked if somebody had already implemented SMORMS3 in PyTorch, but found no code and thus decided to implement it from scratch. That was also a valuable experience, since we usually have not implemented optimizers ourselves since the days of Theano. Except for the in-place operations it is straightforward; you only need to follow the "Optimizer" interface of PyTorch.

The code that follows is free, hopefully also free of serious bugs, and mainly for illustration. The purpose was not high-speed performance, but the runtime already seems satisfying.

import torch
from torch.optim import Optimizer

class SMORMS3(Optimizer):
 """
 SMORMS3 optimizer as described by Simon Funk[1].
 """
 def __init__(self, params, lr=0.0025, eps=1e-16):
  """
  Setup optimizer with parameters.
  lr: default learning rate
  eps: default epsilon
  """
  defaults = dict(lr=lr)
  super(SMORMS3, self).__init__(params, defaults)
  self.eps = eps

 def step(self, closure=None):
  """
  Perform a single gradient step for all parameters.
  """
  loss = closure() if closure is not None else None
  for group in self.param_groups:
   lr = group['lr']
   for p in group['params']:
    if p.grad is None: # skip if param has no gradient
     continue
    grad = p.grad.data
    param_state = self.state[p]
    if 'mem' not in param_state: # setup accumulators once
     param_state['mem'] = torch.full_like(p.data, 1.)
     param_state['g'] = torch.full_like(p.data, 0.)
     param_state['g2'] = torch.full_like(p.data, 0.)
    mem = param_state['mem']
    g, g2 = param_state['g'], param_state['g2']
# g = (1-r)*g + r*grad, g2 = (1-r)*g2 + r*grad**2
    r = 1. / (mem + 1.)
    g.mul_(1 - r).addcmul_(r, grad)
    g2.mul_(1 - r).addcmul_(r, grad**2)
    # mem = mem * (1 - g*g/(g2 + eps)) + 1
    div = g2 + self.eps
    mem.mul_(1 - (g**2 / div)).add_(1.)
    # lrate = min(lr, g*g/(g2 + eps))
    lrate = torch.clamp(g**2 / div, max=lr)
    # p = p - lrate*grad/(sqrt(g2) + eps)
    new_grad = lrate*grad / (g2.sqrt() + self.eps)
     p.data.sub_(new_grad)
  return loss
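
Usage is the same as for any other PyTorch optimizer; a minimal sketch, where net, loader and loss_fn are placeholders:

opt = SMORMS3(net.parameters(), lr=0.0025)
for x, y in loader:
 opt.zero_grad()
 loss = loss_fn(net(x), y)
 loss.backward()
 opt.step()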

[1] sifter.org/~simon/journal/20150420.html
[2] karpathy.github.io/2019/04/25/recipe/