The question of why L2 weight decay helps to regularize (linear) models has been answered several times, and often the explanation is given graphically by showing the effect on the decision boundary, which is useful for understanding the impact of different values. Here, we want to illustrate the impact by analyzing the feature weights of a linear SVM model.
The objective of a linear SVM is:
alpha * T.sum(w**2) * 0.5 + T.maximum(0, 1 - y * T.dot(x, w)).
where x is the input vector, w the weight vector, y the label as a scalar in {-1, +1}, and alpha the regularization strength. We are omitting the bias for brevity.
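As a minimal sketch, the per-sample objective can be written in NumPy (the names w, x, y, and alpha mirror the formula above; the example values are made up for illustration):

```python
import numpy as np

def svm_objective(w, x, y, alpha):
    """Per-sample linear SVM objective: L2 penalty plus hinge loss."""
    l2_penalty = alpha * np.sum(w ** 2) * 0.5
    hinge = max(0.0, 1.0 - y * np.dot(x, w))
    return l2_penalty + hinge

w = np.array([0.5, -0.2, 0.3])
x = np.array([1.0, 2.0, -1.0])
y = 1  # labels are +1 or -1
print(svm_objective(w, x, y, alpha=0.01))
```

A correctly classified sample with a margin of at least 1 contributes zero hinge loss, so only the L2 penalty remains.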
Let’s start with alpha=0, which means learning is driven solely by the hinge loss. If a small group of features is present in all positive samples, the model assigns higher values to those weights. So far, this is nothing special and hopefully drives the error towards zero. With a held-out set, we can define a stopping condition and select the best model.
But what happens if new data follows a slightly different pattern and the features from that group are not (always) present? Because learning assigned large weights to those particular features, the only hope is that the rest of the features are sufficient for a positive prediction: T.dot(x_new, w) > 0. However, if the learned model mostly relies on a few high-magnitude feature weights for its predictions, new data might get a lower confidence score or even a negative value.
In other words, without a constraint on the weights, the model may use very few features to determine the class and thus assign high values to those weights. In the most extreme case, the prediction relies on a single feature to classify the training data correctly. Other features are then mostly ignored, since a small subset of features explains all the regularities in the _training_ data. How can this problem be fixed with weight decay?
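A toy example makes the failure mode concrete. Both weight vectors below are hypothetical; they produce the same score on a training-like input where all features are present, but the peaked one collapses as soon as its dominant feature is missing:

```python
import numpy as np

# Hypothetical weight vectors with the same score on the training pattern.
w_peaked = np.array([5.0, 0.2, 0.2, 0.2])  # relies almost entirely on feature 0
w_spread = np.array([1.4, 1.4, 1.4, 1.4])  # distributes the weight evenly

x_train = np.array([1.0, 1.0, 1.0, 1.0])   # all features present
x_new   = np.array([0.0, 1.0, 1.0, 1.0])   # feature 0 missing in new data

print(np.dot(x_train, w_peaked), np.dot(x_train, w_spread))  # both ~5.6
print(np.dot(x_new, w_peaked), np.dot(x_new, w_spread))      # ~0.6 vs ~4.2
```

The spread weights keep a comfortable margin on the new sample, while the peaked weights barely stay positive.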
Let’s ignore alpha for the moment and focus on the sum T.sum(w**2). This penalty is lower if all weights are close to zero or, in general, “smaller”. For instance, w = [0.89, 0.01, 0.05, 0.63] has a penalty of ~1.19, while w = [0.445, 0.05, 0.1, 0.315] only has a penalty of ~0.31. As a result, a model with L2 decay is forced to utilize more features for a correct prediction instead of relying on just a very few. This also has a positive effect on regularization, because the weight is “distributed” among more features, which increases the chance that unseen, but more challenging, examples will be correctly classified.
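The two penalty values from the example can be checked directly:

```python
import numpy as np

w_peaked = np.array([0.89, 0.01, 0.05, 0.63])
w_spread = np.array([0.445, 0.05, 0.1, 0.315])

# T.sum(w**2) from the objective, i.e. the sum of squared weights.
print(np.sum(w_peaked ** 2))  # ~1.19
print(np.sum(w_spread ** 2))  # ~0.31
```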
Bottom line: weight decay is a very easy, but also powerful method to improve the generalization of models. The alpha is a trade-off between classification accuracy and model capacity and needs to be chosen carefully. On the one hand, if alpha is too large, no correct predictions are possible because the weights cannot grow enough to push examples onto the correct side of the hyperplane. On the other hand, if alpha is too small, the model will overfit and will not sufficiently generalize to new data.
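To make the trade-off tangible, here is a small self-contained sketch that trains the objective above with plain subgradient descent on synthetic data, once without and once with decay (the data, learning rate, epoch count, and alpha values are arbitrary choices for illustration, not a recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: only the first 2 of 5 features carry the signal.
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200))
y[y == 0] = 1

def train_svm(X, y, alpha, lr=0.01, epochs=200):
    """Subgradient descent on the L2-regularized mean hinge loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        mask = margins < 1  # samples violating the margin
        grad = alpha * w - (y[mask, None] * X[mask]).sum(axis=0) / len(y)
        w -= lr * grad
    return w

w_plain = train_svm(X, y, alpha=0.0)
w_decay = train_svm(X, y, alpha=1.0)

print("no decay:  ", np.sum(w_plain ** 2))
print("with decay:", np.sum(w_decay ** 2))
```

Without decay the weights keep growing as long as any sample violates the margin; with decay the squared norm settles at a much smaller value while the model is still pushed to classify the training data correctly.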