For systems where training examples are delivered as a stream, it is not straightforward to include batch normalization (BN) in the network architecture. Why? BN needs a reasonable number of training samples to estimate the mean and variance, and with a single instance this is not possible. That is unfortunate, because BN has turned out to be a good companion for training arbitrary models faster and often more robustly. So, is there anything we can do? Not really, since it lies in the nature of BN to estimate the statistics from batches, but we can try to normalize something else…
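To make the problem concrete, here is a minimal NumPy sketch (not the code used for our experiments). The helper `batch_norm`, the shapes, and the `eps` value are assumptions chosen for illustration, and the gain/bias parameters of BN are left out. With a single sample, the batch mean equals the sample itself, so the normalized output collapses to zeros and carries no information:

```python
import numpy as np

def batch_norm(linear, eps=1e-5):
    """Normalize each hidden unit over the batch axis (axis=0), no gain/bias."""
    mean = linear.mean(axis=0, keepdims=True)
    var = linear.var(axis=0, keepdims=True)
    return (linear - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 4))   # 32 samples, 4 hidden units
single = batch[:1]                 # a stream delivers one sample at a time

out_batch = batch_norm(batch)      # sensible: per-unit statistics over 32 samples
out_single = batch_norm(single)    # degenerate: mean == sample, variance == 0

print(out_batch.std(axis=0))       # roughly 1 for every unit
print(out_single)                  # all zeros
```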
Recently, a paper [arxiv:1607.06450] proposed layer normalization (LN), a method that also works with a batch size of 1 and does not need the extra overhead of storing fixed mean/variance estimates of the training data for test time. Note that LN is not a drop-in replacement for BN, since it changes the dynamics of ConvNets, but this does not concern us because we only use fully connected layers.
The method can be explained in a few words:
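I) compute the mean/std of the pre-activations (before the non-linearity) over the hidden units, separately for each sample in the batch:
import theano.tensor as T
linear = T.dot(x, W)  # shape: (batch, hidden)
# keepdims=True keeps shape (batch, 1) so the statistics broadcast against linear
mean = T.mean(linear, axis=1, keepdims=True)
std = T.std(linear, axis=1, keepdims=True)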
II) normalize the values and apply a gain and a bias (the small constant is our addition to guard against division by zero):
output_ln = gain * ((linear - mean) / (std + 1e-5)) + bias
III) apply the non-linearity if required:
output = T.maximum(0, output_ln)
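The three steps can also be sketched in plain NumPy, which makes them easy to check without compiling a Theano graph. This is only an illustration under assumptions: the function name `layer_norm_relu`, the toy shapes, the initial values of `gain` and `bias`, and the `eps` constant are ours and not taken from the paper.

```python
import numpy as np

def layer_norm_relu(x, W, gain, bias, eps=1e-5):
    """Layer-normalized fully connected layer followed by a ReLU."""
    linear = x @ W                                   # (batch, hidden)
    # I) per-sample statistics over the hidden units
    mean = linear.mean(axis=1, keepdims=True)        # (batch, 1)
    std = linear.std(axis=1, keepdims=True)          # (batch, 1)
    # II) normalize, then scale and shift
    output_ln = gain * ((linear - mean) / (std + eps)) + bias
    # III) non-linearity
    return np.maximum(0, output_ln), output_ln

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 10))             # 3 samples, 10 inputs (toy sizes)
W = rng.normal(scale=0.1, size=(10, 6))  # 6 hidden units
gain = np.ones(6)
bias = np.zeros(6)

output, output_ln = layer_norm_relu(x, W, gain, bias)
print(output_ln.mean(axis=1))  # close to 0 for every sample
print(output_ln.std(axis=1))   # close to 1 for every sample
```

With `gain` set to ones and `bias` to zeros, each row of `output_ln` has roughly zero mean and unit standard deviation; during training both parameters would be learned together with `W`.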
As explained in the paper, each sample in the batch gets its own values for mean/std, but for a specific sample, the same values are used for all neurons of the layer. Thus, we do not need a running average for the statistics of each layer, and the procedure during training is the same as in test mode.
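A quick way to convince oneself of this is to normalize a sample once as part of a batch and once on its own. The following NumPy sketch uses a hypothetical helper `layer_norm` without gain and bias, and random values in place of real pre-activations:

```python
import numpy as np

def layer_norm(linear, eps=1e-5):
    """Normalize every row (sample) with its own mean/std over the hidden units."""
    mean = linear.mean(axis=1, keepdims=True)
    std = linear.std(axis=1, keepdims=True)
    return (linear - mean) / (std + eps)

rng = np.random.default_rng(1)
linear = rng.normal(size=(8, 16))     # 8 samples, 16 hidden units

in_batch = layer_norm(linear)[0]      # first sample, normalized inside the batch
alone = layer_norm(linear[:1])[0]     # same sample, normalized with batch size 1

same = np.allclose(in_batch, alone)
print(same)
```

Because the statistics never mix information across samples, the result does not depend on the other samples in the batch, which is why no running averages are required.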
Now, one might ask whether there is still a benefit when, for most models, we can use BN. Yes, there is: especially with unbounded ReLU units, changes in the output of one layer tend to cause highly correlated changes in the summed inputs of the next layer, which can hurt training, and for streaming systems BN is not an option to fix this.
To see if LN is really beneficial for our setup, we trained a simple ReLU auto-encoder with a fixed random seed for comparison. We used 1,992 input units and 128 hidden units. The learning rate was constant and we used the cross-entropy to measure the error. The training was done for only two epochs.
plain cost: 2.6762, 1.5721
plain + LN cost: 1.4755, 0.8076
At least for the loss, the benefit is clearly visible. Furthermore, the growth of the weights is also very different with LN enabled:
plain norm of W: 20.89, 21.36
plain + LN norm of W: 27.84, 31.24
A quick glimpse at the nearest neighbors of some selected samples also looks promising, so we will continue our LN journey, hopefully tomorrow.