# Batch Normalization: Warm-Up

Like the introduction of the ReLU activation unit, batch normalization (BN for short) has changed the learning landscape a lot. Despite some reports that it does not always improve learning, it is still a very powerful tool that has gained a lot of acceptance recently. No doubt BN has already been used for autoencoders (AEs) and friends, but most of the literature focuses on supervised learning. Thus, we would like to summarize our results for the domain of textual sparse input data, starting with a warm-up that will soon be followed by more details.

Because BN is applied before the (non-linear) activation, we introduce some notation to illustrate the procedure. In the standard case, we have a projection layer (W, b) for an input x

g(x) -> W*x + b

and then we apply the activation function

f(x) -> maximum(0, g(x))

which is usually non-linear. For BN, it looks like this

g(x) -> W*x

h(x) -> (g(x) - mean(g(x))) / std(g(x))

f(x) -> maximum(0, a * h(x) + b)

The difference is that the projection “g” no longer has a bias; “h” then normalizes the pre-activation values with the statistics (mean and standard deviation) of a mini-batch. Finally, “f” is applied to the standardized output, which is scaled with “a” and shifted with the bias “b”.
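A minimal NumPy sketch of the “h” step above (the function name and `eps` guard are my own, not from any framework):

```python
import numpy as np

def batch_norm(g, a, b, eps=1e-5):
    """h(x) = (g - mean) / std, then scaled by 'a' and shifted by 'b'."""
    mean = g.mean(axis=0)          # per-unit mean over the mini-batch
    std = g.std(axis=0)            # per-unit standard deviation
    h = (g - mean) / (std + eps)   # eps avoids division by zero
    return a * h + b
```

Since the batch mean is subtracted, any constant bias added before normalization cancels out, which is exactly why the projection “g” can drop its bias term.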

With this in mind, a ReLU layer can be expressed as:

bn = BatchNormalization(Projection(x))

out = ReLU(bn)

The difference is that the activation function now needs to be a separate “layer” which does not have any parameters.
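To make this separation concrete, here is a sketch with hypothetical minimal layer objects (the class names mirror the pseudo-code above; they are my own, not from Theano or any framework):

```python
import numpy as np

class Projection:
    def __init__(self, W):
        self.params = [W]            # weight only, no bias
    def __call__(self, x):
        return x @ self.params[0]

class BatchNormalization:
    def __init__(self, a, b):
        self.params = [a, b]         # scale and shift are learned here
    def __call__(self, g, eps=1e-5):
        h = (g - g.mean(axis=0)) / (g.std(axis=0) + eps)
        return self.params[0] * h + self.params[1]

class ReLU:
    params = []                      # the activation "layer" is parameter-free
    def __call__(self, z):
        return np.maximum(0.0, z)
```

The scale and shift now live in the BN layer, so the activation layer itself carries no parameters.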

The use of sigmoid units has a bitter taste, but only because they saturate, which slows down learning or can even stop it entirely. With batch normalization, however, this can be avoided, which is very beneficial when the data is binary and sigmoids are thus a natural choice.
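A tiny numeric illustration of the saturation argument (the numbers are illustrative, not from any experiment): the sigmoid gradient s(z)·(1 − s(z)) nearly vanishes for large |z|, while standardized pre-activations sit near zero, where the gradient is close to its maximum of 0.25.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# A saturated pre-activation lets almost no gradient flow back.
# sigmoid_grad(10.0) is about 4.5e-5.

# Badly scaled pre-activations, and what per-unit standardization does:
z = np.array([-8.0, 12.0, 25.0, -30.0])
z_norm = (z - z.mean()) / z.std()   # values now lie near 0
```

After standardization, all units stay in the regime where the sigmoid still passes a useful gradient.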

Two issues we want to shed some light on are:

1) With BN the need for dropout is reduced, or dropout is even discouraged, but is this also true in the AE setting?

2) Usually BN is not applied to the top layer, but at least one paper about AEs and BN mentions applying BN to all layers. Thus, we are interested in analyzing the situation for the AE setting.

As usual, we use Theano for all experiments, but no high-level frameworks, to stay in full control of all parameters and to make sure we really understand what we are doing ;-).

A few comments:

1. The original BN paper mixed together two independent techniques: variance normalization, and “scale and shift per channel”. We can use each of them separately. For example, we can place the BN layer just before a conv layer, and the original motivation for adding a bias and scale disappears.

2. Why does BN speed up training? The obvious answer is that when you back-propagate the gradient, you scale it by dividing by the std of the input, which works as a local learning-rate adaptation. A more interesting observation is that BN appears to change the curvature of the loss function in such a way that it reduces overfitting.
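The first observation can be sketched as follows. If we treat the batch statistics as constants (the full BN backward pass adds correction terms for their dependence on each sample), the gradient flowing through h is simply divided by the batch std:

```python
import numpy as np

def bn_backward_approx(dout, g, eps=1e-5):
    """Simplified gradient of h = (g - mean) / std with respect to g.

    Treats mean and std as constants; the exact BN backward pass has
    two extra terms from their dependence on the mini-batch.
    """
    return dout / (g.std(axis=0) + eps)

# Units with large pre-activation variance get their gradient shrunk,
# units with small variance get it amplified: a per-unit learning rate.
```

This is the “local learning rate adaptation”: each unit's effective step size is rescaled by the spread of its own pre-activations.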