Before the advent of batch normalization (BN), dropout was the silver bullet for avoiding overfitting and regularizing a network. Now that BN belongs to the standard toolbox for training networks, dropout is often omitted, or at least its rate is drastically reduced. However, for some special ‘networks’, like factorization machines (FM), neither dropout nor BN can be used (efficiently), and thus we are looking for ways to regularize such a model with something stronger than weight decay.
First, let’s see why dropout does not make sense here. The input data for FMs is usually very sparse, and if we apply dropout to the input, it can happen that all elements become zero, which does not make any sense. So, for the linear part T.dot(x, W), dropout is not an option. What about the non-linear part T.sum(T.dot(x, V)**2 - T.dot(x**2, V**2))? It is easy to see that we have the same problem here.
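To make the sparsity argument concrete, here is a NumPy sketch of the FM forward pass (including the 0.5 factor from the usual FM formulation) and what input dropout can do to a sparse example. All names, sizes, and values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_factors = 6, 3
w0 = 0.0                                       # global bias
W = rng.normal(size=n_features)                # linear weights
V = rng.normal(size=(n_features, n_factors))   # factor matrix

# A typical sparse FM input: only two active features.
x = np.zeros(n_features)
x[1] = 1.0
x[4] = 1.0

# Linear part: x . W
linear = x @ W

# Second-order part: 0.5 * sum((xV)^2 - (x^2)(V^2))
pairwise = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))

y = w0 + linear + pairwise

# Dropout with rate 0.5 on such a sparse x can easily zero out
# *both* active features, leaving no signal at all.
mask = rng.random(n_features) > 0.5
x_dropped = x * mask  # may be the all-zero vector
```

Both the linear and the pairwise term depend only on the few non-zero entries of x, so dropping them removes the entire input, not just a fraction of it.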
As noted earlier, weight decay is always a good idea to prevent a model's predictions from being driven by a few very large weights. For example, with two weights, the model w=[0.3, 2.1] has a penalty of
l=0.5 * (0.3**2 + 2.1**2)=2.25,
while the weights w'=[0.1, 0.7] have a penalty of
l=0.5 * (0.1**2 + 0.7**2)=0.25.
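The penalty is just the scaled squared L2 norm of the weights; a minimal sketch, assuming a regularization strength lam=1 (the function name is made up):

```python
import numpy as np

def l2_penalty(w, lam=1.0):
    """Weight-decay term: lam/2 * ||w||^2."""
    w = np.asarray(w, dtype=float)
    return 0.5 * lam * np.sum(w ** 2)

big = l2_penalty([0.3, 2.1])    # dominated by the single large weight
small = l2_penalty([0.1, 0.7])  # much cheaper under the same penalty
```

Because the penalty grows quadratically, a single large weight dominates the term, which is exactly what pushes the optimizer toward more balanced solutions.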
The problem with weight decay is that it does not help if the model gets stuck in a poor local minimum or on a plateau.
A paper about SGD variants mentions a simple method that adds noise to the gradient [arxiv:1511.06807]. The approach is very simple: it just adds Gaussian random noise, with a standard deviation that is decayed over time, to the gradient. The annealing ensures that at later stages of training the noise has less impact. This is beneficial because by then we have likely found a good minimum, and too much noise increases the chance that we “jump” out of it.
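A minimal sketch of such a noisy update, using the annealing schedule σ²_t = η/(1+t)^γ from the paper (the function name, learning rate, and toy objective below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_sgd_step(w, grad, lr, t, eta=0.3, gamma=0.55):
    """One SGD step with annealed Gaussian gradient noise.

    The noise variance sigma_t^2 = eta / (1 + t)**gamma decays with
    the step counter t, as proposed in arXiv:1511.06807.
    """
    sigma = np.sqrt(eta / (1.0 + t) ** gamma)
    noise = rng.normal(0.0, sigma, size=np.shape(w))
    return w - lr * (grad + noise)

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([2.0, -1.5])
for t in range(200):
    w = noisy_sgd_step(w, 2.0 * w, lr=0.05, t=t)
```

Early in training the noise is large enough to kick the parameters around; as t grows, the steps become nearly deterministic again.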
The intention of the paper is to make the optimization of _deep_ networks easier, but the method itself is not limited to deep networks and can be used for all kinds of models. So, we used gradient noise to train an FM model, to see if we can improve the generalization performance (by finding a better local minimum).
To compare the results, we fixed all random seeds and trained the model for ten epochs. The data is a two-class problem: predict whether a user likes a movie or not. Admittedly, the problem is already simple, but with gradient noise we achieved a precision of 100% within 3 epochs, while without the noise it took 9 epochs. We further fixed a set of movies to study the predictions of both models. However, there is no clear trend, except a possible tendency that the model trained with noise gives lower scores to movies where the preference is “vague”, but also lower scores to movies that are rated negatively.
Bottom line: we could not clearly verify that gradient noise improves the accuracy of the model, but the method definitely helps to fight overfitting in settings where dropout cannot be used, and it led to faster convergence.