Those who have tried to train a bigger model from scratch probably know how frustrating it can be, because papers are often very brief about failures and about the heuristics necessary to make the model converge. And even then, it often takes patience and experience to get your network off the ground. With normalization layers and more sophisticated optimizers like Adam, training is much easier than in the early days of AlexNet. However, even with all the fancy stuff, it is not unusual that a lot of hyperparameter tuning is required before training starts to converge, and the sad part is that this tuning often depends on the dataset and the model architecture and thus cannot be reused automatically.
Over the past year, we read a lot of papers and tried a lot of models, including unusual ones, and then we thought about what helped most to train the models we had the most trouble getting to converge. It is sad but true that many of these phenomena cannot be explained precisely, but it is at least helpful to know these tricks to get your model to converge to some useful state:
– Despite Adam's popularity, it was and still is not our first choice; instead we use AdaGrad. Maybe with adjusted hyperparameters (betas, eps) Adam would perform just as well, but AdaGrad requires less tuning and converged faster most of the time. Since we use a broad range of datasets and models, this cannot simply be a bias towards a particular dataset or architecture. So starting with AdaGrad and a learning rate of ~0.05 was a pretty good baseline.
– Gradient clipping is used to avoid exploding gradients, especially in RNNs, but empirically it also helps with loss functions whose landscape can produce large gradients. In general it is often useful to decouple the direction of the gradient from its magnitude. Clipping gradients never helped when a model did not converge at all, but it often helped to stabilize training.
– But by far the most successful method was to use cyclical learning rates. There are some hints in the literature as to why this helps, such as "avoiding spurious minima", but it is not fully understood yet. In contrast to learning rate decay, the rate goes up and down according to some schedule; one popular schedule is based on sine or cosine. The idea is elegant and simple: first the learning rate increases up to LR_MAX and then it decreases to LR_MIN (= 0). If we set RESTART to 10 steps, we go from LR_MIN to LR_MAX in RESTART/2 steps and then back down from LR_MAX to LR_MIN in the remaining RESTART/2 steps, at least for a sine schedule. The formula is also straightforward: lr_next = sin((STEP % RESTART) / RESTART * PI) * LR_MAX.
Let us consider the extreme cases: sin(0*pi) = sin(1*pi) = 0 and sin(0.5*pi) = 1, which means we start at zero, reach the maximum halfway through the cycle, and end at zero again. The step counter STEP is incremented after each backprop step and is therefore reduced modulo RESTART to the range of one cycle.
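To make the first trick concrete, here is a minimal, framework-agnostic sketch of one AdaGrad update (the function and variable names are illustrative, not from any particular library): squared gradients are accumulated per parameter, and the effective step size shrinks accordingly.

```python
import math

def adagrad_step(params, grads, accum, lr=0.05, eps=1e-10):
    """One AdaGrad update on flat lists of parameters and gradients.

    accum holds the running sum of squared gradients per parameter;
    dividing by its square root shrinks the step for frequently
    updated parameters. lr=0.05 is the baseline mentioned above.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum
```

With a single parameter and the accumulator starting at zero, the first step moves the parameter by roughly lr, regardless of the gradient's scale.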
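The decoupling of gradient direction and magnitude mentioned in the second trick amounts to clipping by the global norm; a small sketch (the helper name is ours):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients jointly if their L2 norm exceeds max_norm.

    The direction of the update is preserved; only its length is capped,
    which is the decoupling of direction and magnitude described above.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads]
    return grads
```

Gradients with a norm below the threshold pass through unchanged, so clipping only acts in the occasional step where the gradient spikes.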
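The sine schedule from the formula above can be written directly; LR_MAX and RESTART below are the example values from the text:

```python
import math

LR_MAX = 0.05   # peak learning rate; example value
RESTART = 10    # cycle length in optimizer steps, as in the text

def cyclical_lr(step, lr_max=LR_MAX, restart=RESTART):
    """lr_next = sin((STEP % RESTART) / RESTART * PI) * LR_MAX:
    zero at the start of each cycle, lr_max halfway through,
    and back to zero when the cycle restarts."""
    return math.sin((step % restart) / restart * math.pi) * lr_max
```

For RESTART = 10 this yields 0 at step 0, LR_MAX at step 5, and (numerically almost) 0 again at step 10, after which the cycle repeats.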
So why does this make the difference between no learning at all and gaining momentum after a few dozen steps?
For one model that did not converge, we at least tried to understand what happens to the gradient norm over time. For the test, all parameters were kept the same; only the learning rate was either cycled (1) or kept fixed (2). In case (2) the gradient norm quickly dropped to almost zero and never recovered, which indicates that the model is trapped in an unfavorable region of the loss landscape and needs a different step size to escape it and explore more favorable regions. In case (1) the model starts with a very small learning rate, which reduces the chance of overshooting more favorable paths, but at the same time it can escape spurious minima by later using larger step sizes. After some initial very small gradient norms, as in (2), it gains 'momentum' and the gradient norm continually increases.
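The diagnostic used above is nothing more than logging the global gradient norm once per step; a sketch (the helper name is ours):

```python
import math

def track_grad_norm(history, grads):
    """Append the current global L2 gradient norm to history.

    Plotting this curve for a cyclical vs. a fixed learning rate is
    exactly the comparison between cases (1) and (2) described above."""
    history.append(math.sqrt(sum(g * g for g in grads)))
    return history
```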
We have to admit that this explanation is not fully satisfying, but it at least allows us to solve the problems at hand, and with the insights we gain from the model and its learned representation, we hope to eventually put the puzzle pieces together and explain why this step is so useful.
In a nutshell: also consider less popular optimizers, clip your gradients, and treat your learning rate as a cycle.