A Closer Look at Pre-training

Here is the short version of a long story: to train a bigger model, you usually start by pre-training each layer separately with stochastic gradient descent and momentum. The aim of this step is not to find the best minimum, but to move the weights into a region of parameter space that is close to one. This is repeated for every layer.
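As a reminder, one such pre-training update is nothing more than the classic momentum rule. The function below is only a sketch; the name and the default hyperparameter values are placeholders, not taken from any particular setup:

def sgd_momentum_step(weights, velocity, grad, learning_rate=0.1, momentum=0.9):
    # velocity accumulates a decaying average of past gradients
    velocity = momentum * velocity - learning_rate * grad
    # the weights then move along the accumulated direction
    return weights + velocity, velocity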

The result is not very useful yet, since there was no joint effort to optimize all layers for a specific task, such as reconstructing the input data. Stated differently, the idea is to give each layer an educated guess instead of starting the parameter search from scratch. Once all layers are roughly initialized, they are unfolded into a single, deep auto-encoder that is then optimized jointly.

While the pre-training is usually done with good old gradient descent, the fine tuning often uses a more sophisticated approach, such as Conjugate Gradient. The idea is to get rid of the manual tuning of the learning rate and let the method itself decide on a suitable step size.

Our favorite library, Theano, comes with gradient descent out of the box, but needs some extra effort to work with external optimization routines. Usually, the optimization function is used as a black box: we provide an initial guess x together with functions that evaluate the cost function F and its gradient at any given x.

To combine this with Theano, we define our cost function as usual and then create two functions: first, a function that evaluates the cost F on our data and returns its value, and second, a function that returns the gradient at x. It should be noted that x stands for all parameters of the model; since most optimization APIs expect a 1D vector, we have to flatten the model parameters into a single vector that represents the whole model.
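A minimal sketch of this setup could look as follows. It assumes a tied-weight auto-encoder with sigmoid units and squared error; the layer sizes, the placeholder data and all variable names (W, b_h, b_v, train_data and so on) are our own assumptions, not taken from the post. The helper set_model_params writes x back into the shared variables and is sketched together with the fmin_cg call further below:

import numpy as np
import theano
import theano.tensor as T

num_visible, num_hidden = 784, 256          # placeholder layer sizes
rng = np.random.RandomState(0)

# model parameters as shared variables; in practice they would hold
# the pre-trained values instead of this random initialization
W_init = rng.normal(0, 0.01, (num_hidden, num_visible)).astype(theano.config.floatX)
W = theano.shared(W_init, name='W')
b_h = theano.shared(np.zeros(num_hidden, dtype=theano.config.floatX), name='b_h')
b_v = theano.shared(np.zeros(num_visible, dtype=theano.config.floatX), name='b_v')
params = [W, b_h, b_v]

# symbolic reconstruction cost of a tied-weight auto-encoder
data = T.matrix('data')
hidden = T.nnet.sigmoid(T.dot(data, W.T) + b_h)
recon = T.nnet.sigmoid(T.dot(hidden, W) + b_v)
cost = T.mean(T.sum((data - recon) ** 2, axis=1))
grads = T.grad(cost, params)

# compiled Theano functions: one returns the cost, the other the gradients
theano_cost = theano.function([data], cost)
theano_grad = theano.function([data], grads)

train_data = rng.rand(500, num_visible).astype(theano.config.floatX)  # placeholder data

def cost_function(x):
    set_model_params(x)           # write x back into the shared variables (see below)
    return float(theano_cost(train_data))

def grad_function(x):
    set_model_params(x)
    # flatten the per-parameter gradients into one 1D vector, matching x
    return np.concatenate([np.asarray(g).ravel() for g in theano_grad(train_data)])

Both wrappers first synchronize the Theano model, since the optimizer evaluates cost and gradient at many different values of x during its line searches.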

We illustrate the procedure with a simple auto-encoder. The parameters of this model are (weights, bias_hidden, bias_visible), so x is the concatenation weights||bias_hidden||bias_visible. Finally, a last step is required: updating the Theano model with the output of the optimization function. To do this, we map each individual parameter to a range of x:

# the first num_hidden * num_visible entries of x hold the weight matrix
weights = x[0:num_hidden * num_visible].reshape(num_hidden, num_visible)
off = num_hidden * num_visible
# the remaining entries hold the two bias vectors
bias_hidden = x[off:off + num_hidden]
bias_visible = x[off + num_hidden:off + num_hidden + num_visible]

Since these slices are NumPy views into x, an update of x automatically updates the parameters of the model. Now we are ready to call the optimization function. In pseudo code, this looks something like:

fmin_cg(f=cost_function, fprime=grad_function, x0=x)

where “f” is the function to minimize, “fprime” returns the gradient at x, and “x0” is our initial guess, which comes from the pre-training.
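Putting the pieces together, a hedged end-to-end sketch might look like the following. set_model_params is the helper used by cost_function and grad_function above; it simply applies the mapping from the previous snippet to the shared variables W, b_h and b_v, and the maxiter value is an arbitrary choice:

from scipy.optimize import fmin_cg

def set_model_params(x):
    # cast once so that float32 and float64 setups both work
    x = np.asarray(x, dtype=theano.config.floatX)
    off = num_hidden * num_visible
    # write the slices of x back into the shared variables of the model
    W.set_value(x[0:off].reshape(num_hidden, num_visible))
    b_h.set_value(x[off:off + num_hidden])
    b_v.set_value(x[off + num_hidden:off + num_hidden + num_visible])

# initial guess: the flattened, pre-trained model parameters
x0 = np.concatenate([W.get_value().ravel(),
                     b_h.get_value().ravel(),
                     b_v.get_value().ravel()])

# conjugate gradient fine tuning; the internal line search replaces
# the manual choice of a learning rate
x_opt = fmin_cg(f=cost_function, fprime=grad_function, x0=x0, maxiter=200)

# finally, write the optimized vector back into the Theano model
set_model_params(x_opt)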

To sum it up, in this post we considered a very simple example with just a single layer. With more layers the procedure stays the same; only the mapping of the model parameters to x becomes a little more tedious, the cost function gets more complex, and so does the setup of the initial model parameters. In a nutshell, we use Theano as a black box to calculate gradients and an optimization function as a black box to choose suitable step sizes that minimize the cost function. The result is one big, fine-tuned auto-encoder model for the data.
