# Label Schmoozing

Even though there are millions of items out there, for instance images, tweets, and documents, labels are still very scarce. Yes, there are large labeled datasets, but they might not fit the problem at hand, or the labels might be too noisy. So, in the end, we often have a rather small dataset to train our model, and with such a dataset come the old problems. If we do not regularize the model, the network is likely to memorize patterns and learn the noise in the training data. In other words, the model overfits and does not generalize to new examples. A good sign that this is happening is that the network is overconfident when it predicts the label distribution: even when the network is totally wrong, it puts all its confidence into a single class. Of course, the best remedy is to get more data, but if that is not possible, there are astonishingly simple regularizers that can help to improve the situation.

(1) Label Smoothing[1]

We start with label smoothing, which became popular again with the Inception network architecture [1], though it has actually been around for much longer and was probably forgotten or simply ignored. The idea is pretty simple: the network is discouraged from putting all its confidence into a single label class. Informally speaking, even if the network is sure that the image contains a frog, it stays open to other suggestions, even if they are unlikely. This leads to the following regularized loss:

```python
import theano.tensor as T

num_labels = 4                  # number of classes
y = T.vector()                  # one-hot target vector
y_hat = T.nnet.softmax(..)      # predicted label distribution
# Blend the one-hot target with a uniform distribution over all labels.
y_smooth = 0.9 * y + 0.1 / num_labels
loss = -T.sum(y_smooth * T.log(y_hat))
```

If we have 4 classes, and the correct class is 0, y_smooth looks like this: `y_smooth = [0.925, 0.025, 0.025, 0.025]`.
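As a sanity check, the smoothing step can be reproduced with a few lines of plain NumPy. This is just a sketch; the helper name `smooth_labels` is ours, not from [1]:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    # Blend the one-hot target with a uniform distribution over all labels.
    y_onehot = np.asarray(y_onehot)
    num_labels = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / num_labels

print(smooth_labels([1.0, 0.0, 0.0, 0.0]))  # -> [0.925 0.025 0.025 0.025]
```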

Stated differently, for each label y, we keep the original label with probability 0.9 and, with probability 0.1, replace it with a label drawn from the uniform distribution over all classes, 1/#labels, independent of the actual label.

Compared to other regularizers, there are only very few papers with quantitative results for neural networks. However, since the implementation is straightforward, it is easy to compare vanilla and smoothed models for the problem at hand, at the cost of training the model twice.

(2) Penalizing Confident Predictions[2]

This approach is newer, but related to the idea of method (1). It is based on the idea of penalizing network outputs with low entropy, where the entropy is defined as follows:

`entropy = -T.sum(y_hat * T.log(y_hat))`

This seems familiar, right? Now, let’s fill in some numbers to learn something about the range. Since y_hat is the output of a softmax, all values are in [0, 1].

Let’s calculate the entropy for some examples with 4 classes:

y_hat = [0.95, 0.04, 0.005, 0.005] => entropy ≈ 0.23

y_hat = [0.75, 0.1, 0.14, 0.01] => entropy ≈ 0.77

y_hat = [0.25, 0.25, 0.25, 0.25] => entropy ≈ 1.39
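These values are easy to verify with a short NumPy sketch:

```python
import numpy as np

def entropy(y_hat):
    # Shannon entropy (natural log) of a predicted distribution.
    y_hat = np.asarray(y_hat)
    return -np.sum(y_hat * np.log(y_hat))

for p in ([0.95, 0.04, 0.005, 0.005],
          [0.75, 0.1, 0.14, 0.01],
          [0.25, 0.25, 0.25, 0.25]):
    print(round(entropy(p), 2))
```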

The aim is thus to discourage peaked output distributions, which have low entropy; entropy is maximal when all labels are equally probable. Pushing toward the uniform distribution alone is of course not useful for classification, which means we need an extra hyper-parameter 'beta' to control the strength of the regularization.

This leads to the following regularized loss:

```python
import theano.tensor as T

y_hat = T.nnet.softmax(..)      # predicted label distribution
# Entropy is high when the prediction is flat, low when it is peaked.
entropy = -T.sum(y_hat * T.log(y_hat))
loss = -T.sum(T.log(y_hat) * y) - beta * entropy
```
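To get a feel for the effect, here is a NumPy sketch of the penalized loss. The function name and `beta = 0.1` are illustrative choices of ours, not values recommended in [2]:

```python
import numpy as np

def penalized_loss(y, y_hat, beta=0.1):
    # Cross-entropy minus beta times the entropy of the prediction.
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    cross_entropy = -np.sum(y * np.log(y_hat))
    entropy = -np.sum(y_hat * np.log(y_hat))
    return cross_entropy - beta * entropy

y = [1.0, 0.0, 0.0, 0.0]
p = [0.85, 0.05, 0.05, 0.05]
# The entropy bonus lowers the loss relative to plain cross-entropy.
print(penalized_loss(y, p, beta=0.0))
print(penalized_loss(y, p, beta=0.1))
```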

As noted in the discussion of the paper, the idea has actually been around for a while, e.g. in [3], just not in the domain of neural networks. The situation is therefore the same as for (1): quantitative evidence is limited, but it is easy to assess the benefit by comparing a model trained with the regularizer against one trained without it.

Summary

The focus of this post is not whether these methods are actually new, but how to fight overfitting in the case of smaller datasets and/or challenging learning tasks. Especially for method (2), the experiments reported in [2] confirm that this regularizer can actually help to learn better neural network models, which is notable since these methods had not yet been broadly applied in this domain.

Links

[1] arxiv.org/abs/1512.00567

[2] openreview.net/pdf?id=HkCjNI5ex

[3] jmlr.org/papers/volume11/mann10a/mann10a.pdf