Batch Normalization: The Untold Story

With all the success of BN, it is amazing and disappointing at the same time that there are so many fantastic results but so little practical advice on how to actually implement the whole pipeline. No doubt, BN can be implemented pretty easily in the training part of the network, but that is not the whole story. Furthermore, there are at least two ways to use BN during training. First, keep a running average of the mean/std values per layer, which can later be used for unseen data. Second, calculate the mean/std values for each mini-batch and run a separate step at the end of training to fix the statistics.

Method ONE is self-contained, since it does not involve a second step to finalize the statistics, but it introduces additional hyper-parameters, like the decay value and an epsilon to ensure numerical stability. Furthermore, for some architectures we encountered numerical problems which led to NaN values.
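
To make Method ONE a bit more concrete, here is a minimal numpy sketch of what a single training step of such a BN layer could look like. The function and parameter names are ours and the layer is assumed to be fully connected; it is meant as an illustration, not as a reference implementation.

import numpy as np

def bn_train_step(x, gamma, beta, running_mean, running_std, decay=0.9, eps=1e-5):
    # x has shape (batch_size, num_features); gamma/beta are the learned scale/shift
    mean_batch = x.mean(axis=0)
    std_batch = x.std(axis=0)
    # update the running estimates in place; they are reused later for unseen data
    running_mean[:] = decay * running_mean + (1 - decay) * mean_batch
    running_std[:] = decay * running_std + (1 - decay) * std_batch
    # the current mini-batch is normalized with its own statistics;
    # eps keeps the division numerically stable
    x_hat = (x - mean_batch) / (std_batch + eps)
    return gamma * x_hat + beta

# hypothetical usage for one layer with 8 features and a mini-batch of 32 samples
x = np.random.randn(32, 8)
gamma, beta = np.ones(8), np.zeros(8)
running_mean, running_std = np.zeros(8), np.ones(8)
out = bn_train_step(x, gamma, beta, running_mean, running_std)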

Method TWO is more straightforward because the mean/std are determined for each mini-batch and no parameter updates are involved. However, it also forces us to run an extra pass over the whole data to fix the statistics, per layer, for the mean and the standard deviation.
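
For Method TWO the training-time forward pass is even simpler, because nothing is accumulated; the statistics of the current mini-batch are all that is needed. Again a minimal sketch with names chosen by us:

def bn_forward_batch(x, gamma, beta, eps=1e-5):
    # Method TWO during training: statistics come from the current mini-batch only;
    # x, gamma and beta are numpy arrays shaped as in the sketch above
    mean_batch = x.mean(axis=0)
    std_batch = x.std(axis=0)
    x_hat = (x - mean_batch) / (std_batch + eps)
    return gamma * x_hat + beta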

We are stressing all this because posts regarding the use of BN for unseen/test data have been pretty frequent in the recent past. All in all, despite the fact that BN is not very complex, a guideline for beginners would still make sense to avoid common pitfalls and frustration. And since BN is becoming standard in the repertoire of training neural networks, researchers need a solid understanding of it. Like the detail that with BN a batch size of one is no longer possible, because a single sample does not suffice to estimate the required statistics. The same might be true for smaller batch sizes like 4 or 8 when the original batch size was much larger.
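
To see why a batch size of one breaks down, it is enough to look at the statistics of a single sample (a tiny numpy example, shapes chosen by us):

import numpy as np

x = np.random.randn(1, 8)    # a "mini-batch" consisting of a single sample
print(x.mean(axis=0))        # just the sample itself
print(x.std(axis=0))         # all zeros, so (x - mean) / (std + eps) collapses to zero

In other words, the normalized output degenerates to the beta values of the layer and no longer depends on the input at all.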

With all this in mind, a possible pipeline could look like this:

= Training =
– set up a network with separate BN layers, as described in the previous post, each followed by an activation layer
– sample random mini-batches and feed them to the network, which will use per-batch values for mean/std
– continue until the network has converged

= Postproc =
– disable all weight updates and switch all BN layers to accumulate mean/std values by using a moving average with a decay of, say, 0.9
mean = decay * mean + (1 - decay) * mean_batch # initialize mean with 0
std = decay * std + (1 - decay) * std_batch # initialize std with 1
– feed the whole training data in mini-batches to the network to update the statistics; the output of the network can be ignored (a sketch follows below)
– freeze the mean/std values for each BN layer and store them along with the network parameters
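
Expressed as code, the post-processing pass for a single BN layer could look like the following sketch (again numpy, again our own names); layer_batches stands for the mini-batch activations that arrive at this layer while the training data is fed through the frozen network:

import numpy as np

def accumulate_bn_stats(layer_batches, decay=0.9):
    # layer_batches: iterable of mini-batch activations (numpy arrays) for one BN layer
    mean, std = None, None
    for x in layer_batches:
        mean_batch = x.mean(axis=0)
        std_batch = x.std(axis=0)
        if mean is None:
            mean = np.zeros_like(mean_batch)   # initialize mean with 0
            std = np.ones_like(std_batch)      # initialize std with 1
        mean = decay * mean + (1 - decay) * mean_batch
        std = decay * std + (1 - decay) * std_batch
    return mean, std   # frozen and stored along with the network parameters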

= Test =
– load the network weights from disk and set all BN layers to test mode, which uses the fixed mean/std values from the Postproc step and performs no further updates of these values
– feed new data to the network and do something with the output
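
At test time a BN layer is nothing more than a fixed affine transformation, since mean/std are no longer functions of the data. A sketch, with the same assumed names as above:

def bn_forward_test(x, gamma, beta, mean_fixed, std_fixed, eps=1e-5):
    # test mode: normalize with the frozen statistics from the Postproc step;
    # no statistics are updated here
    x_hat = (x - mean_fixed) / (std_fixed + eps)
    return gamma * x_hat + beta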

The annoying part is the additional pass of the data through the network, but once it is done, the fprop step for unseen data is business as usual. Compared to non-BN networks the processing overhead is noticeable, but it is usually worth it because of the benefits BN comes with.

Bottom line, BN is definitely a very powerful tool to tackle existing problems, but it should be noted that it does not come for free. First, BN introduces some overhead during training in terms of space, extra parameters, and time for the data normalization. Second, the layers of the network need to be adjusted because BN is applied before the activation, and third, a post-processing step is often required.
