The learning rate is the most annoying hyperparameter in large-scale optimization, because second-order methods are ruled out by the immense amount of memory they require for larger models. That leaves us tuning the parameter ourselves, either with helpers or with known heuristics. The good news is that our models are rather small compared to the big deep nets out there. A softmax classifier with 1,000 input keywords and 20 genres to predict, with two hidden layers of 100 units each, has about 112K parameters. This number is tiny compared to real-world (conv)nets with millions of parameters.
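As a quick sanity check, the parameter count above can be computed layer by layer; this is a minimal sketch, assuming each layer contributes a dense weight matrix plus a bias vector:

```python
# Layer sizes for the classifier described above:
# 1,000 inputs -> 100 hidden -> 100 hidden -> 20 softmax outputs.
layers = [1000, 100, 100, 20]

# Each layer pair contributes n_in * n_out weights plus n_out biases.
total = sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))
print(total)  # 112220
```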
Our library of choice for the optimization is Theano, because it is fast, flexible, and lets us optimize any cost function that can be expressed with the supported tensor operations. But in contrast to situations where we are forced to use (stochastic) gradient descent, we sometimes want a different optimization method to speed up convergence. Since Theano can derive the gradients of arbitrary cost functions, we can combine the power of automatic differentiation with the black-box approaches used by popular optimization libraries. We illustrate this with an example that couples Theano with scipy.optimize. The method we use is conjugate gradient, "cg" for short, which works best in batch mode.
Because scipy.optimize only supports flat parameter vectors, we need a slightly different representation for the model. Our example is an RBM with 256 visible nodes and 50 hidden nodes, trained with Minimum Probability Flow. The parameters of an RBM consist of a weight matrix "W", a hidden bias "h" and a visible bias "v", so the flat parameter vector has length 256*50 + 50 + 256 = 13,106. The individual parameters are accessed by indexing the flat vector "theta".
W = theta[0:256*50].reshape((256, 50))
h = theta[256*50:256*50+50]
v = theta[256*50+50:]
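The same slicing can be tried in plain numpy; this sketch mirrors the indexing above and shows that the slices are views into the flat vector, so no data is copied:

```python
import numpy as np

n_vis, n_hid = 256, 50
theta = np.zeros(n_vis * n_hid + n_hid + n_vis)  # 13,106 entries

# Slice the flat vector into the three RBM parameter blocks.
W = theta[:n_vis * n_hid].reshape((n_vis, n_hid))  # weight matrix
h = theta[n_vis * n_hid:n_vis * n_hid + n_hid]     # hidden bias
v = theta[n_vis * n_hid + n_hid:]                  # visible bias

# The slices are views: writing through W also changes theta.
W[0, 0] = 1.0
```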
theta is initialized as a numpy array where the first 256*50 entries are drawn at random and the rest are zero; it is then wrapped in a shared variable.
theta = np.zeros(13106)
theta[0:256*50] = np.random.normal(scale=0.01, size=256*50)
theta = theano.shared(theta, borrow=True)
The cost function can then be defined as usual because with the indexing, there is no difference to an ordinary Theano model:
energy_1 = -T.dot(x1, v) - T.sum(T.log(1. + T.exp(T.dot(x1, W) + h)), axis=-1)
energy_2 = -T.dot(x2, v) - T.sum(T.log(1. + T.exp(T.dot(x2, W) + h)), axis=-1)
cost = T.mean(T.exp(0.5*(energy_1 - energy_2)))
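To make the cost concrete, here is a numpy sketch of the same Minimum Probability Flow objective (the function names `free_energy` and `mpf_cost` are my own, not part of any library); for identical pairs x1 == x2 the energy difference is zero and the cost is exactly 1:

```python
import numpy as np

def free_energy(x, W, h, v):
    # F(x) = -x.v - sum_j log(1 + exp(x.W + h)_j)
    return -x @ v - np.sum(np.log1p(np.exp(x @ W + h)), axis=-1)

def mpf_cost(x1, x2, W, h, v):
    # MPF objective over data / bit-flipped pairs:
    # mean over pairs of exp((F(x1) - F(x2)) / 2)
    return np.mean(np.exp(0.5 * (free_energy(x1, W, h, v)
                                 - free_energy(x2, W, h, v))))

# Tiny demo with random binary data and small random weights.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(8, 256)).astype(float)
W = rng.normal(scale=0.01, size=(256, 50))
h = np.zeros(50)
v = np.zeros(256)
c = mpf_cost(x, x, W, h, v)  # identical pairs -> cost is 1.0
```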
However, the next step is different. We only need the gradients, so we do not define an update function. And since all parameters live in a single flat vector, there is only one gradient expression:
grads = T.grad(cost, theta)
To evaluate the cost and the gradient on data, we compile one function for each:
loss = theano.function([x1, x2], cost)
loss_grad = theano.function([x1, x2], grads)
x1 is for the training data and x2 is for the training data with one bit flipped in each training instance.
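Generating the bit-flipped copies is straightforward in numpy; this is a minimal sketch, assuming one randomly chosen bit is flipped per training instance:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(100, 256)).astype(float)

# Flip exactly one randomly chosen bit in each training instance.
X_train_flip = X_train.copy()
rows = np.arange(len(X_train))
cols = rng.integers(0, X_train.shape[1], size=len(X_train))
X_train_flip[rows, cols] = 1.0 - X_train_flip[rows, cols]
```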
The next step is to use the gradients provided by Theano to guide the scipy optimization procedure. To optimize an arbitrary function, scipy needs the function to minimize, the gradient at a specific parameter vector "theta", and optionally a callback.
We call these functions "train_fn" and "train_fn_grad". Both take a single argument, the flat vector theta, and the Theano model parameters must be set to this value before either the cost or the gradient is evaluated:
The task of train_fn is simply to return the cost of the function to minimize at the given parameters theta:
“return loss(X_train, X_train_flip)”
The other function returns the gradient at the parameters theta, via a compiled gradient function loss_grad = theano.function([x1, x2], grads):
“return loss_grad(X_train, X_train_flip)”
Finally, we can call the actual optimization procedure:
fmin_cg(f=train_fn, x0=theta.get_value(), fprime=train_fn_grad)
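To see the f/fprime pattern end to end without a compiled Theano model, here is a self-contained toy example; the convex quadratic stands in for the RBM cost, and the analytic gradient plays the role that T.grad plays above:

```python
import numpy as np
from scipy.optimize import fmin_cg

# Toy stand-in for the RBM cost: a convex quadratic with a known minimum.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def train_fn(theta):
    # Cost at the current flat parameter vector.
    return 0.5 * theta @ A @ theta - b @ theta

def train_fn_grad(theta):
    # Analytic gradient of the cost.
    return A @ theta - b

# Same calling convention as in the text: f, x0, fprime.
theta_opt = fmin_cg(f=train_fn, x0=np.zeros(2),
                    fprime=train_fn_grad, disp=False)
# The minimizer solves A @ theta = b.
```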
The whole procedure can be seen as a black box. We initialize the box randomly, see what parameters it magically spills out, provide the gradient at these specific values, and wait for the next iteration. With this approach, there is no learning rate to tune; all the magic happens inside the black box. Furthermore, the box also takes over many other decisions, for instance when the solution has converged or when a hyperparameter needs to be adjusted. At the end, we get the "best" possible solution for the given initial parameters.
There are more elegant ways to interface Theano with scipy, but regardless of the method, it feels a little strange that all parameters have to be flattened, and especially for larger models, extra care is needed to avoid mixed representations or overlapping parameter slices. But once this hurdle is cleared, the benefits of advanced optimization can be huge, especially for convex cost functions.