In the previous post, we talked about the importance of properly initializing the weights before training. This is a well-known problem, and even though some recent heuristics have helped a lot to find good values, it is far from solved. Although it is more severe for deeper models, it also affects the quality of shallow ones. But now to something different.
If we have an unbalanced data set, with two labels for simplicity, the model will focus on the class with more samples. That means the model mostly learns patterns from this class, because doing so is sufficient to drive the loss down. For an 80/20 distribution of the labels, a model can reach an accuracy of 80% without correctly predicting a single sample of the minority class, which has a huge impact on the generalization ability of the model and its overall quality. To solve the issue, we could sample each class according to its inverse frequency, or we could use a simple round-robin method that alternates between the classes. However, because the data is not balanced, it then takes much longer until the model has seen all samples from the majority class, and it sees the same samples from the minority class more often because they are repeated. To some degree, the latter issue can be fought with drop-out, which makes it unlikely that the network sees exactly the same sample twice.
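To make the inverse-frequency idea concrete, here is a minimal NumPy sketch. The function name inverse_frequency_probs and the toy 80/20 label array are made up for illustration; this is one possible way to do it, not a reference implementation.

```python
import numpy as np

def inverse_frequency_probs(labels):
    """Per-sample sampling probabilities proportional to 1 / (size of the sample's class).

    Hypothetical helper for illustration only.
    """
    labels = np.asarray(labels)
    # counts[inverse[i]] is the number of samples that share the label of sample i.
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse]
    return weights / weights.sum()

rng = np.random.default_rng(0)
labels = np.array([0] * 80 + [1] * 20)  # toy data with the 80/20 split from the text
probs = inverse_frequency_probs(labels)

# Draw sample indices with replacement; both classes are now about equally likely.
batch_idx = rng.choice(len(labels), size=10_000, replace=True, p=probs)
minority_share = labels[batch_idx].mean()  # fraction of drawn samples with label 1
```

Because the draw is with replacement, each of the 20 minority samples is picked about four times as often as each majority sample, which is exactly the repetition effect described above.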
So far, we have the weight issue that we address with the common heuristic:
randn(n_in, n_out) / sqrt(n_in)
and the sampling issue that we fight with round-robin sampling in combination with drop-out. Thus, the last remaining issue is regularization.
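Before turning to regularization, the initialization heuristic above can be written out in NumPy. This is only a sketch: the helper name init_weights, the layer sizes and the random inputs are assumptions chosen for illustration.

```python
import numpy as np

def init_weights(n_in, n_out, rng):
    """Standard normal entries scaled by 1 / sqrt(n_in) (hypothetical helper)."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

rng = np.random.default_rng(0)
W = init_weights(512, 256, rng)  # layer sizes picked arbitrarily

# With unit-variance inputs, the scaling keeps the pre-activations
# at roughly unit variance, regardless of n_in.
x = rng.standard_normal((1000, 512))
pre_activation = x @ W
```

Without the division by sqrt(n_in), the standard deviation of the pre-activations would grow with the square root of the number of inputs.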
The most common regularization method is L2 weight decay, which penalizes large weights W by adding the term coefficient * sum(W**2) to the loss. The coefficient needs to be selected carefully, but a common choice is 0.0005. Why does this help to avoid overfitting? Intuitively, the term is big if the weights are large. For instance, with coefficient=1, W=[0.98, 0.01, 0.01] has a penalty of 0.9606, while W=[0.5, 0.25, 0.25] has a penalty of only 0.375. That means smaller weights that are more “diverse” are preferred because they have a lower penalty. The weights are still allowed to grow, but only if they contribute substantially to lowering the actual loss of the model. In other words, the term discourages the model from putting too much confidence in single features and pushes it to spread the weight over several features to model patterns in the data. It should also be noted that in models with saturating units, large weights can drive neurons to extreme values, which decreases the non-linearity of the model. Thus, we can say that smaller weights often lead to “simpler” solutions that also help to fight overfitting, because they give “smoother” mappings to the feature space that are more robust to small changes in the input space. For example, slight variations of an input sample should not result in a different label prediction.
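The two penalty values from the example can be checked with a few lines of NumPy. The function names l2_penalty and l2_penalty_grad are made up for this sketch; the gradient is included only to show why the method is called weight decay.

```python
import numpy as np

def l2_penalty(W, coefficient=0.0005):
    """L2 weight decay term: coefficient * sum(W**2)."""
    return coefficient * np.sum(np.square(W))

def l2_penalty_grad(W, coefficient=0.0005):
    """Gradient of the penalty with respect to W: 2 * coefficient * W."""
    return 2.0 * coefficient * W

peaked = np.array([0.98, 0.01, 0.01])  # almost all weight on one feature
spread = np.array([0.5, 0.25, 0.25])   # weight spread over several features

print(l2_penalty(peaked, coefficient=1.0))  # about 0.9606
print(l2_penalty(spread, coefficient=1.0))  # about 0.375
```

Since the gradient is proportional to W itself, every gradient step pulls each weight a little towards zero, and the largest weights are pulled the hardest.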
Bottom line: to train a good model, it is extremely important to pick reasonable initial weights. At the same time, it is essential to use a regularizer so that the model generalizes to unseen data, and to combine it with a sampling scheme so that the model learns patterns from all classes and not only from the majority one.