Like the introduction of the ReLU activation unit, batch normalization (BN for short) has changed the learning landscape considerably. Despite occasional reports that it does not always improve learning, it remains a very powerful tool that has gained wide acceptance recently. No doubt BN has already been used for autoencoders (AE) and friends, but most of the literature focuses on supervised learning. Thus, we would like to summarize our results for the domain of sparse textual input data, starting with a warm-up that is soon followed by more details.
Because BN is applied before the (non-linear) activation, we introduce some notation to illustrate the procedure. In the standard case, we have a projection layer (W, b) for an input x
g(x) -> W*x + b
and then we apply the activation function
f(x) -> maximum(0, g(x))
which is usually non-linear. For BN, it looks like this
g(x) -> W*x
h(x) -> (g(x) - mean(g(x))) / std(g(x))
f(x) -> maximum(0, a * h(x) + b)
The difference is that the projection “g” no longer has a bias; “h” then normalizes the pre-activation values with the statistics (mean and standard deviation) of the current mini-batch. Finally, “f” is applied to the standardized output, which is scaled with “a” and shifted with the bias “b”.
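To make the three steps concrete, here is a minimal numpy sketch of one such layer. The names (bn_relu_layer, gamma for “a”, beta for “b”) are our own, and we add a small epsilon to the standard deviation for numerical stability, which the formulas above omit:

```python
import numpy as np

def bn_relu_layer(X, W, gamma, beta, eps=1e-5):
    """One BN + ReLU layer for a mini-batch X of shape (batch, in_dim)."""
    g = X @ W                                  # g(x) = W*x, projection without bias
    mean = g.mean(axis=0)                      # per-feature mini-batch mean
    std = g.std(axis=0)                        # per-feature mini-batch std
    h = (g - mean) / (std + eps)               # h(x): standardized pre-activation
    return np.maximum(0.0, gamma * h + beta)   # f(x) = maximum(0, a*h(x) + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))                    # mini-batch of 8 inputs
W = rng.normal(size=(5, 3))                    # projection to 3 units
out = bn_relu_layer(X, W, np.ones(3), np.zeros(3))
```

With gamma = 1 and beta = 0 the layer simply rectifies the standardized pre-activations; during training both are learned like any other parameter.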
With this in mind, a ReLU layer can be expressed as:
bn = BatchNormalization(Projection(x))
out = ReLU(bn)
The difference is that the activation function now needs to be a separate “layer”, one that does not have any parameters.
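The composition above can be sketched as three stand-alone functions, where only the projection and the batch normalization carry parameters and the ReLU “layer” carries none (the function names are ours, chosen to mirror the pseudocode):

```python
import numpy as np

def projection(X, W):
    # linear projection without a bias; BN's shift absorbs it
    return X @ W

def batch_norm(Z, gamma, beta, eps=1e-5):
    # standardize per feature over the mini-batch, then scale and shift
    return gamma * (Z - Z.mean(axis=0)) / (Z.std(axis=0) + eps) + beta

def relu(Z):
    # parameter-free activation "layer"
    return np.maximum(0.0, Z)

rng = np.random.default_rng(1)
X, W = rng.normal(size=(16, 10)), rng.normal(size=(10, 4))
out = relu(batch_norm(projection(X, W), np.ones(4), np.zeros(4)))
```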
The use of sigmoid units has a bitter taste, but only because they saturate, which slows down learning or even stops it entirely. However, with batch normalization this can be avoided, which is very beneficial when the data is binary and sigmoids are thus a natural choice.
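A small numpy experiment illustrates the point. The offset of 8.0 is an arbitrary value we chose to simulate badly scaled pre-activations; standardizing them pulls the sigmoid back into its non-saturated region where gradients are usable:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
# pre-activations with a large offset, as can happen without BN
z = rng.normal(loc=8.0, scale=1.0, size=(256, 1))
saturated = sigmoid(z)          # almost all outputs pinned near 1

# BN re-centers and re-scales the pre-activations,
# moving them back to the steep part of the sigmoid
z_bn = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)
healthy = sigmoid(z_bn)
```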
Two issues we want to shed some light on are:
1) With BN the need for dropout is reduced, or dropout is even discouraged; does this also hold in the AE setting?
2) Usually BN is not applied to the top layer, but at least one paper about AEs and BN mentions applying BN to all layers. Thus, we are interested in analyzing this question for the AE setting.
As usual, we use Theano for all experiments, but no frameworks on top of it, to stay in full control of all parameters and to make sure we really understand what we are doing ;-).