My Understanding of Normalization

ifeelfree
May 4, 2021

Table of Contents

· Part 1: Why should we normalize neural network inputs?
· Part 2: Why do we need batch normalization?

Part 1: Why should we normalize neural network inputs?

To keep the learning algorithm from spending too much time oscillating on a plateau, we normalize the input features so that they are all on the same scale. Since the inputs are on the same scale, the weights associated with them also end up on a similar scale, which helps the network train faster.

Question 1: what types of normalization are available?

(1) min-max scaling (2) z-score normalization
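As a rough illustration (my own sketch, not from the original article), both scalings can be written in a few lines of NumPy; the feature matrix `X` below is made-up example data.

```python
import numpy as np

# Toy feature matrix: rows are samples, columns are features (hypothetical data).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# (1) Min-max scaling: rescale each feature to the [0, 1] range.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# (2) Z-score normalization: zero mean and unit standard deviation per feature.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_zscore)
```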

Question 2: do all machine learning algorithms need normalization?

No. For example, the C4.5 decision-tree algorithm does not need normalization, because its splits depend only on the ordering of feature values, not on their scale.

See Understand Data Normalization in Machine Learning for more details.

Part 2: Why do we need batch normalization?

2.1 What is batch normalization?

Batch normalization brings the activation values of a layer onto a common scale, so that the hidden representations do not vary drastically across mini-batches; this also improves training speed.

The mean and standard deviation are computed from a single mini-batch rather than from the entire dataset, which is where the name batch normalization comes from. The normalization is applied individually at each hidden neuron in the network.

Since we are normalizing all the activations in the network, are we enforcing constraints that could degrade the performance of the network?

In order to maintain the representational power of the network, batch normalization introduces two extra learnable parameters, gamma (γ) and beta (β). Once we normalize the activation, we perform one more step, h_final = γ · h_norm + β, to get the final activation value that can be fed as input to the next layer.

The parameters γ and β are learned along with the other parameters of the network. If γ is equal to the standard deviation (σ) of the batch and β is equal to its mean (μ), then h_final recovers the original, un-normalized activation, so the layer can learn to undo the normalization and the representational power of the network is preserved.
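A minimal NumPy sketch of this forward pass (my own illustration, with made-up activation values and hyperparameters) looks roughly like this:

```python
import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-5):
    """Batch-normalize activations h of shape (batch_size, num_neurons)."""
    mu = h.mean(axis=0)                          # per-neuron mean over the mini-batch
    sigma = h.std(axis=0)                        # per-neuron standard deviation
    h_norm = (h - mu) / np.sqrt(sigma**2 + eps)  # normalized activations
    h_final = gamma * h_norm + beta              # scale and shift with learnable gamma, beta
    return h_final

# Toy mini-batch of pre-activations: 4 samples, 3 hidden neurons (hypothetical values).
h = np.random.randn(4, 3) * 10 + 5

# If gamma = sigma and beta = mu, the original activations are (approximately) recovered.
mu, sigma = h.mean(axis=0), h.std(axis=0)
h_recovered = batch_norm_forward(h, gamma=sigma, beta=mu)
print(np.allclose(h_recovered, h, atol=1e-3))  # True (up to the eps term)
```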

Here you can find an implementation of batch normalization in a Jupyter notebook.

2.2 Should batch normalization be applied before the nonlinear activation?

In Ioffe and Szegedy (2015), the authors state that “we would like to ensure that for any parameter values, the network always produces activations with the desired distribution”. So the batch normalization layer is inserted right after a convolutional or fully connected layer, but before feeding into the ReLU (or any other) activation. See this video at around the 53-minute mark for more details.
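In a framework such as PyTorch (used here only as an illustrative sketch, not code from the paper), this placement looks like the following:

```python
import torch.nn as nn

# Convolutional block with batch normalization placed after the convolution
# and before the nonlinearity, following Ioffe and Szegedy (2015).
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(num_features=16),   # normalize the conv output
    nn.ReLU(),                         # nonlinearity applied afterwards
)
```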

2.3 What is the proper ordering of layers with respect to batch normalization and dropout?

There is a very interesting discussion, Ordering of batch normalization and dropout?, on Stack Overflow, and many people suggest there are two schemes to follow:

Scheme 1

CONV/FC -> ReLU (or other activation) -> Dropout -> BatchNorm -> CONV/FC

Scheme 2

CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC (the ordering recommended in the accepted answer; see the sketch below)
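As a rough illustration of Scheme 2 (my own sketch, not code from the Stack Overflow discussion), a fully connected stack in PyTorch could be ordered like this:

```python
import torch.nn as nn

# Scheme 2 for a fully connected stack:
# FC -> BatchNorm -> ReLU -> Dropout -> FC
mlp = nn.Sequential(
    nn.Linear(in_features=128, out_features=64),
    nn.BatchNorm1d(num_features=64),  # normalize the linear layer's output
    nn.ReLU(),                        # nonlinearity after batch norm
    nn.Dropout(p=0.5),                # dropout applied after the activation
    nn.Linear(in_features=64, out_features=10),
)
```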
