The Circus Of Optimization

Opinions about kernel machines are mixed, but at least they have a convex loss function, so any minimum the optimizer finds is the global one. Of course, this property does not help much if, in the worst case, you need exponentially many training examples to build the support vectors. Still, the optimization part of SVMs is fun, while it is often a pain for neural networks.

For instance, over the last decades a lot of gradient-based update rules have been proposed: AdaGrad, AdaDelta, RMSprop, Rprop, Nesterov momentum, Adam and all their variants. Not to forget the methods to adjust the learning rate. And of course there are the methods to initialize the weights so that the symmetry is broken: uniform, Gaussian, Glorot-style or orthogonal initialization.
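To make the difference between some of these update rules concrete, here is a minimal sketch in plain Python. It assumes a toy one-dimensional objective f(w) = w², and the function names, step counts and hyper-parameter values are illustrative choices, not recommendations. The rules are written in their commonly used textbook form; real libraries differ in details.

```python
import math


def grad(w):
    # Gradient of the toy objective f(w) = w**2 (assumed for illustration only).
    return 2.0 * w


def sgd_momentum(w, steps=500, lr=0.05, mu=0.9):
    # Classical momentum: the velocity accumulates past gradients.
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(w)
        w = w + v
    return w


def rmsprop(w, steps=500, lr=0.05, rho=0.9, eps=1e-8):
    # RMSprop: scale each step by a running average of squared gradients.
    s = 0.0
    for _ in range(steps):
        g = grad(w)
        s = rho * s + (1.0 - rho) * g * g
        w = w - lr * g / (math.sqrt(s) + eps)
    return w


def adam(w, steps=500, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: momentum on the gradient plus RMSprop-style scaling,
    # both with bias correction for the zero-initialized averages.
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1.0 - b1) * g
        v = b2 * v + (1.0 - b2) * g * g
        m_hat = m / (1.0 - b1 ** t)
        v_hat = v / (1.0 - b2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


if __name__ == "__main__":
    for rule in (sgd_momentum, rmsprop, adam):
        print(rule.__name__, rule(5.0))
```

All three rules share the same gradient and differ only in how they turn it into a step, which is exactly why each of them brings its own hyper-parameters to tune.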

Therefore, even for simple networks, a lot of decisions have to be made:
– what neurons will be used (relu, tanh, sigmoid, …)?
– how many hidden layers?
– how many hidden nodes?
– hyper-parameter for weight decay (or early stopping)?
– use drop-out or not? (plus the probability value)
– what kind of loss function?
– how to update parameters? (rmsprop, adagrad, …)
– how to init parameters? (uniform, orthogonalization)
– what size of mini-batch?
– use a constant learning rate or decay it somehow?
– which initial learning rate?
– is data pre-processing useful/required?

Thanks to the community, there are libraries that provide a convenient interface to easily build models with standard parameters that have proven to be efficient. However, if the network does not “get off the ground” or the problem is rather exotic, the user has to adjust the default parameters or, even worse, deal with low-level details.

In any case, it is essential that users have a basic understanding of numerical optimization and neural networks. Even then, working with large networks can be very frustrating, and the frustration is not limited to novice users. One reason is that there is no standard procedure to select the best hyper-parameters in a straightforward and efficient way. There is grid search with cross-validation, of course, but it often takes too much time and is therefore not feasible.
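A rough back-of-the-envelope sketch shows why an exhaustive grid quickly becomes infeasible. The search space below is hypothetical: the candidate values, the five folds and the ten minutes per training run are assumptions picked for illustration only.

```python
from itertools import product
from math import prod

# Hypothetical search space; every candidate value here is an assumption.
search_space = {
    "activation": ["relu", "tanh", "sigmoid"],
    "hidden_layers": [1, 2, 3],
    "hidden_units": [64, 128, 256],
    "weight_decay": [0.0, 1e-4, 1e-3],
    "dropout": [0.0, 0.25, 0.5],
    "optimizer": ["rmsprop", "adagrad", "adam"],
    "init": ["uniform", "orthogonal"],
    "batch_size": [32, 128],
    "learning_rate": [1e-1, 1e-2, 1e-3],
}


def grid_size(space):
    # Number of distinct configurations in the full grid.
    return prod(len(values) for values in space.values())


def grid_cost_hours(space, folds, minutes_per_run):
    # Every configuration is trained once per cross-validation fold.
    return grid_size(space) * folds * minutes_per_run / 60.0


def iter_grid(space):
    # Lazily enumerate all configurations as dictionaries.
    keys = list(space)
    for values in product(*space.values()):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    print("configurations:", grid_size(search_space))
    print("hours:", grid_cost_hours(search_space, folds=5, minutes_per_run=10))
```

With only two or three candidates per decision, this grid already contains 8748 configurations. At five folds and ten minutes per run that adds up to 7290 hours, roughly 300 days on a single machine.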

Sometimes neural networks feel a little like a clown car: at first the problem (the car) seems pretty small and not very hard to solve, but then an implausible number of challenges (the clowns) emerges from it, and soon after you feel like Sisyphus. At least until you have found a proper learning rate and a good initialization of the weights.

