Alternatives to SGD: L-BFGS

The training of a sophisticated neural networks feels a little like finding a needle in the haystack, with a haystack of the size of a whole continent. The reason for that is that cost functions of those networks are highly non-convex which means that there exists lots of local minima and very likely lots of “flat” regions which needs to be traversed.

Stated differently, weight initialization is very important because it tells the network where to start its journey which is vital to find a good local minima and to assure that the network learns anything at all. This is related to the vanishing gradient problem where small gradients propagated through many network layers vanish at the end and therefore, there is no learning -weight updates- at all. The problem is also related to saturating non-linear hidden units because if the neuron is at its maximal value, no error signal is propagated and the model does not improve at all.

For large-scale datasets, a variation of stochastic gradient descent (SGD) is often the only choice to fight the massive amount of required resources for successfully training a model. However, not every problem requires a huge dataset and in this case, it is often very beneficial to use a more sophisticated optimization method.

BFGS is such a method, named after its inventors, but the drawback is that the (inverse) hessian matrix H is required during training. For smaller models, the memory required to store H is no p problem, but even for a moderate models with about N=50K parameters, the overhead of N*N*sizeof(float32) bytes is noticeable. For that reason, L-BFGS is usually used because, it approximates the matrix H and thus requires much less memory. Therefore the name: (L)imited-memory-BFGS.

As a marginal note, often the algorithm is used in a black-box mode which means we provide the gradient for a problem, a cost function, and we get updated model parameters. In an earlier post, we gave a brief example how this can be combined with Theano to train a model.

So, in a nutshell, we still need a good initial guess (weight initialization) of the parameters, but we do not need to worry about the learning rate at all. And since there is no schedule or adaption
of the learning rate required, the convergence to a (good) local minima is usually much faster in terms of iterations or steps. In case you are solving problems in the Python world, there is also no need to fiddle with the algorithm yourself, because there is a good implementation of L-BFGS available in scipy.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s