Training a neural net in the old days, without fancy normalization layers and often with randomly chosen initial weights that were far from optimal, was a real pain. Even for vanilla classification networks, it sometimes took several runs with different hyper-parameters before the network started to learn. Nowadays, at least for standard architectures like a ResNet, it is fairly simple to train a classifier for an arbitrary dataset.
However, despite smart grid searches and good default values to start with, we are still far away from a cookbook that helps to solve a problem with neural nets in a more disciplined way. The blog post  has a lot of very useful hints, but due to the complexity of modern nets, it is impossible to put every hint into a single document. For example, training a large transformer is still not trivial and requires expertise and probably also patience.
And as mentioned several times before, there are lots of loss functions and combinations of them that lead to very complicated loss landscapes. In this case, it is imperative to start simple, for example by using a very conservative learning rate and optimizer and by trying to overfit a small batch to verify that the loss goes to zero. This tip, also mentioned in , is one of the most helpful tips we know, since without this assurance, errors can be lurking anywhere. Truth be told, when the loss goes to zero, errors are still possible and likely, but at least we know that the network is able to solve a simplified version of our problem.
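To make the "overfit a small batch" check concrete, here is a minimal sketch in plain Python. It is not a real network: a tiny logistic regression stands in for the model, and the data, the learning rate and the number of steps are made-up assumptions. The point is only the procedure: take a handful of samples, train on nothing else, and check that the loss heads towards zero.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def batch_loss_and_grads(w, b, xs, ys):
    """Mean binary cross-entropy over the batch and its gradients."""
    n = len(xs)
    loss, gw, gb = 0.0, [0.0] * len(w), 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        loss -= (y * math.log(p) + (1 - y) * math.log(1.0 - p)) / n
        for i, xi in enumerate(x):
            gw[i] += (p - y) * xi / n
        gb += (p - y) / n
    return loss, gw, gb


def overfit_small_batch(xs, ys, lr=0.1, steps=5000, seed=0):
    """Plain gradient descent on one fixed batch; returns the loss history."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in xs[0]]
    b = 0.0
    history = []
    for _ in range(steps):
        loss, gw, gb = batch_loss_and_grads(w, b, xs, ys)
        history.append(loss)
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return history


# Hypothetical batch of four samples; the label equals the first feature,
# so even this tiny model can fit it perfectly.
xs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ys = [0, 0, 1, 1]
history = overfit_small_batch(xs, ys)
print(f"first loss {history[0]:.4f}, last loss {history[-1]:.4f}")
```

If the last loss is not much smaller than the first one on such a tiny batch, it is worth checking the data pipeline, the loss and the update step before touching any hyper-parameter.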
For example, in the case of recurrent nets, gradient clipping and gating are now fairly standard, maybe along with layer normalization. But does that mean that success is guaranteed when we use them? Definitely not for all kinds of nets and problems. The choice of LSTM vs. GRU might be easily evaluated by comparing resources vs. accuracy, but what about gradient clipping? Which norm is useful? In papers, those values range from 0.01 to 10, and the right value surely depends on the actual norm of the gradient during training with a specific net and loss. And even for a method as simple as layer norm, there are at least two flavours, namely pre: f(ln(x*W)) and post: ln(f(x*W)). Across papers, and therefore across problems and methods, there is no clear winner.
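Since the useful clipping threshold depends on the gradient norms that actually occur, it helps to look at the norm before clipping. The following is a minimal sketch of clipping by the global L2 norm in plain Python, with the gradients assumed to be a list of per-layer lists of floats. The function names and example values are made up for illustration and do not belong to any framework.

```python
import math


def global_norm(grads):
    """L2 norm over all gradient entries of all layers."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so that their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping,
    so that the unclipped norm can be logged.
    """
    norm = global_norm(grads)
    if norm <= max_norm or norm == 0.0:
        return grads, norm
    scale = max_norm / norm
    return [[g * scale for g in layer] for layer in grads], norm


# Hypothetical per-layer gradients with a global norm of 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, global_norm(clipped))
```

Logging `norm_before` for a few hundred steps gives a rough idea of whether a threshold like 0.01 or 10 would ever be active for a given net and loss.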
Recap: For RNNs we have to choose the number of units / layers, the type (GRU or LSTM), with or without LN, and the norm for clipping the gradients.
Starting simple is definitely a good idea, but especially for RNNs, finding good default values for the number of units can be challenging. We once had a classification problem where a small number of units led to incredibly slow learning, while a large number of units led to unstable learning despite using gated units, gradient clipping and layer norm. In the end, we had to do a grid search on a smaller dataset to find a good trade-off for the number of units.
When RNNs became popular, it took some time until researchers without profound expertise in this area were also able to use them for ‘everyday tasks’. This was also owed to the introduction of high-level frameworks like Lasagne or Keras, both introduced in 2015 and both using Theano as a back-end at the time. However, high-level APIs hide most of the details, as they are supposed to, and in case of problems this makes debugging very hard. With static computational graphs (as used in Theano), it is even more horrible.
So what to do if the loss does not go down as expected, at least for the training data? Switch to a different optimizer? Lower the learning rate, or increase it? Use more hidden units? A different kind of units?
A good tip is definitely to monitor as much as possible. Besides the loss, there are many other possible values:
– The norm of the gradient / max & min values of the gradient, max(abs) values, etc.
– The total norm of the network / norm of each layer
– Max / min of activation functions (think of -1/+1 for tanh, 0/1 for sigmoid)
– Number of ‘dead’ units in case of ReLU
– In case of NLL it is useful to compare early loss values with -log(1/N_CLASSES)
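A few of the quantities from this list can be computed with very little code. The sketch below uses plain Python and assumes that the activations of one layer are available as a list of samples, each a list of per-unit outputs. All function names and the tolerance `tol` are made up for illustration.

```python
import math


def expected_initial_nll(n_classes):
    """NLL of a uniform prediction over n_classes: -log(1/n_classes)."""
    return -math.log(1.0 / n_classes)


def dead_relu_fraction(activations):
    """Fraction of units whose ReLU output is zero for every sample in the batch."""
    n_units = len(activations[0])
    dead = sum(
        1
        for u in range(n_units)
        if all(sample[u] <= 0.0 for sample in activations)
    )
    return dead / n_units


def saturated_fraction(activations, low=-1.0, high=1.0, tol=0.01):
    """Fraction of outputs within tol of the limits (defaults: -1/+1 for tanh)."""
    values = [v for sample in activations for v in sample]
    saturated = sum(1 for v in values if v <= low + tol or v >= high - tol)
    return saturated / len(values)


# Hypothetical values: 10 classes, a ReLU layer with 3 units, a tanh layer with 2.
baseline = expected_initial_nll(10)
dead = dead_relu_fraction([[0.0, 1.2, 0.0], [0.0, 0.0, 0.3]])
saturated = saturated_fraction([[0.999, 0.2], [-0.995, -0.5]])
print(baseline, dead, saturated)
```

An early loss far above the baseline hints at a bad initialization or wrongly encoded targets, while a loss that stays at the baseline suggests that the net has not learned anything beyond a uniform guess. For a sigmoid layer, `low=0.0` and `high=1.0` would be the matching limits.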
The idea is to get a feeling for the flow of the data through the network. First, in the forward direction, where badly sampled weights might lead to abnormal behavior like dead units, very large or very small values, or even saturated values. Wrongly encoded data might lead to similar behavior. Second, in the backward direction, which is the flow of the gradient. There, large gradients can lead to NaN values or oscillating behavior, while very small gradients can vanish and therefore lead to no progress.
Since it is very hard to keep track of all those values, it is a good idea to visualize a condensed summary in a plot, maybe the gradient per layer as a bar graph . But it is also useful to dump a subset of those metrics during training to get a feeling for whether the training is healthy. We usually dump at least the train/test loss, the total norm of the network and an average of the gradient norm.
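Such a dump could look roughly like the following sketch in plain Python. The class `TrainingMonitor`, the window size and the metric names are assumptions made for illustration. The weights are assumed to be a list of per-layer lists of floats, and the gradient norm is averaged over the most recent steps.

```python
import math
from collections import deque


class TrainingMonitor:
    """Collects gradient norms and produces a condensed summary per dump."""

    def __init__(self, window=100):
        # Only the most recent `window` gradient norms are kept.
        self.grad_norms = deque(maxlen=window)

    def update(self, grad_norm):
        self.grad_norms.append(grad_norm)

    def summary(self, step, train_loss, test_loss, weights):
        weight_norm = math.sqrt(sum(w * w for layer in weights for w in layer))
        if self.grad_norms:
            avg_grad_norm = sum(self.grad_norms) / len(self.grad_norms)
        else:
            avg_grad_norm = float("nan")
        return {
            "step": step,
            "train_loss": train_loss,
            "test_loss": test_loss,
            "weight_norm": weight_norm,
            "avg_grad_norm": avg_grad_norm,
        }


# Hypothetical usage with made-up numbers.
monitor = TrainingMonitor(window=2)
for grad_norm in (1.0, 2.0, 4.0):
    monitor.update(grad_norm)
stats = monitor.summary(step=3, train_loss=0.7, test_loss=0.9,
                        weights=[[3.0], [4.0]])
print(" ".join(f"{key}={value}" for key, value in stats.items()))
```

One line like this every few hundred steps is usually enough to spot a growing weight norm or a collapsing gradient norm long before the loss curve makes the problem obvious.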
With all these tips, we have not even scratched the surface, and we are far away from a manual or even a reference that provides pointers for specific symptoms. A further problem is that researchers who work a lot with different nets usually publish only the positive aspects of their work and rarely mention drawbacks and negative results, which might be equally useful, at least in a condensed form.