Changing The Learning Landscape

A huge advantage of learning one sample at a time is that there is immediately some progress even if a single step is more noisier than using a mini-batch or even a full batch to estimate the direction of the gradient. Plus, the method is very memory efficient since it does not depend on the size of the data set. However, the most crucial hyper-parameter is still the learning rate and related to it, the optimizer for the weight updates. Vanilla stochastic gradient descent should be avoid at all costs, since it might take an eternity to converge.

Especially in the last few years there has been a lot of improved methods, like Adam or RMSprop (+ momentum) or SMORMS3, but even if they work with default values, a lot of tuning might be required to find the best hyper-parameters for your problem at hand. The alternative is to use a (quasi) 2nd order method for optimization but depending on the algorithm this is not always feasible due to memory consumption or the computational complexity.

For instance L-BFGS is working on large batch sizes and even if it is possible to use it with smaller batches, the size of them is often limited especially for GPU-based models. In any case, the estimation of the step size requires a minimal amount of samples and in on-line mode, with just a single sample, a reliable estimation of the curvature seems to not very likely.

The reason why we bring this up is that for a lot of sophisticated learning problems, those beyond simple classifications, the choice of the optimizer + hyper-parameters can make the difference between learning a useful model and learning a crappy model or no learning at all.

In other words, despite the fact that there are lots of tutorials that suggest to use method A with hyper-param B and it will work, the reality is different when it comes to more advanced loss functions and/or constraints like that you have to feed one sample at a time. With minimal knowledge of the underlying algorithmic method, all you can do is to tune the learning rate, maybe the weight decay, but if this does not work you have to switch to a different method in the hope it’ll fix the problem.

All in all this seems to be rather unsatisfying since it feels like trial-and-error instead of finding the cause for the real problem. In general, despite the huge success of so called Deep Learning, some (theoretical) aspects of it still feel like we are living in the medieval times and too often, at least it feels this way, people try to beat state-of-the-art results instead of advancing the theory to improve the learning landscape.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s