- The importance of effective initialization
- The problem of exploding or vanishing gradients
- What is proper initialization?
- Mathematical justification for Xavier initialization

To build a machine learning algorithm, usually you'd define an architecture (e.g. Logistic regression, Support Vector Machine, Neural Network) and train it to learn parameters. Here is a common training process for neural networks:

- Initialize the parameters
- Choose an optimization algorithm
- Repeat these steps:
  - Forward propagate an input
  - Compute the cost function
  - Compute the gradients of the cost with respect to the parameters using backpropagation
  - Update each parameter using the gradients, according to the optimization algorithm
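
The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the one-hidden-layer architecture, the toy dataset, the learning rate, and the names `train` and `predict` are all assumptions made for the example.

```python
import numpy as np

def train(X, y, n_hidden=4, lr=0.5, n_iters=500, seed=0):
    """Train a one-hidden-layer network with batch gradient descent.

    X has shape (n_features, m); y has shape (1, m) with labels in {0, 1}.
    Returns the learned parameters and the cost recorded at each iteration.
    """
    rng = np.random.default_rng(seed)
    n_in, m = X.shape

    # Initialize the parameters: random weights, zero biases.
    W1 = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_hidden, n_in))
    b1 = np.zeros((n_hidden, 1))
    W2 = rng.normal(0.0, np.sqrt(1.0 / n_hidden), size=(1, n_hidden))
    b2 = np.zeros((1, 1))

    costs = []
    for _ in range(n_iters):
        # Forward propagate the inputs.
        a1 = np.tanh(W1 @ X + b1)
        a2 = 1.0 / (1.0 + np.exp(-(W2 @ a1 + b2)))  # sigmoid output

        # Compute the cost (binary cross-entropy, clipped for numerical safety).
        p = np.clip(a2, 1e-12, 1.0 - 1e-12)
        costs.append(float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))))

        # Backpropagate: gradients of the cost with respect to each parameter.
        dz2 = (a2 - y) / m
        dW2 = dz2 @ a1.T
        db2 = dz2.sum(axis=1, keepdims=True)
        dz1 = (W2.T @ dz2) * (1.0 - a1 ** 2)
        dW1 = dz1 @ X.T
        db1 = dz1.sum(axis=1, keepdims=True)

        # Update each parameter (optimization algorithm: plain gradient descent).
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}, costs


def predict(params, X):
    """Predict the class (0 or 1) of each column of X."""
    a1 = np.tanh(params["W1"] @ X + params["b1"])
    a2 = 1.0 / (1.0 + np.exp(-(params["W2"] @ a1 + params["b2"])))
    return (a2 > 0.5).astype(int)


# Toy data (an assumption): two features, label is 1 when their sum is positive.
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 200))
y = (X.sum(axis=0, keepdims=True) > 0).astype(float)

params, costs = train(X, y)
accuracy = float(np.mean(predict(params, X) == y))
```

Here `costs` records the cost at every iteration, so plotting it is one way to see how the initialization step influences learning.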

Then, given a new data point, you can use the model to predict its class.

The initialization step can be critical to the model's ultimate performance, and it requires the right method. To illustrate this, consider the three-layer neural network below. You can try initializing this network with different methods and observe the impact on the learning.

What do you notice about the gradients and weights when the initialization method is zero?

In fact, any constant initialization scheme will perform very poorly. Consider a neural network with two hidden units, and assume we initialize all the biases to 0 and all the weights to some constant $\alpha$. If we forward propagate an input $x = (x_1, \dots, x_n)$ through this network, both hidden units output the same value, $g(\alpha x_1 + \dots + \alpha x_n)$, where $g$ is the hidden activation function. Both hidden units therefore have identical influence on the cost and receive identical gradients. As a result, the two neurons evolve symmetrically throughout training, which prevents them from learning different things.
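
The symmetry argument can be checked numerically. The sketch below is illustrative only: the constant `alpha = 0.3`, the tanh hidden activation, the random data, and the learning rate are assumptions chosen for the example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 20))                     # 3 features, 20 examples
y = rng.integers(0, 2, size=(1, 20)).astype(float)   # random binary labels
m = X.shape[1]

alpha = 0.3                      # illustrative constant (an assumption)
W1 = np.full((2, 3), alpha)      # two hidden units with identical weights
b1 = np.zeros((2, 1))
W2 = np.full((1, 2), alpha)
b2 = np.zeros((1, 1))

for _ in range(100):
    # Forward pass: tanh hidden layer, sigmoid output.
    a1 = np.tanh(W1 @ X + b1)
    a2 = 1.0 / (1.0 + np.exp(-(W2 @ a1 + b2)))

    # Backward pass for the binary cross-entropy cost.
    dz2 = (a2 - y) / m
    dW2 = dz2 @ a1.T
    db2 = dz2.sum(axis=1, keepdims=True)
    dz1 = (W2.T @ dz2) * (1.0 - a1 ** 2)
    dW1 = dz1 @ X.T
    db1 = dz1.sum(axis=1, keepdims=True)

    # Gradient descent update.
    W1 -= 0.5 * dW1
    b1 -= 0.5 * db1
    W2 -= 0.5 * dW2
    b2 -= 0.5 * db2

# The weights have moved away from alpha, but both hidden units still share them.
rows_identical = bool(np.allclose(W1[0], W1[1]))
weights_moved = not np.allclose(W1, alpha)
```

After training, the two rows of `W1` remain equal to each other even though they no longer equal `alpha`: the network behaves as if it had a single hidden unit.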

What do you notice about the cost plot when you initialize weights with values too small or too large?

Choosing proper values for initialization is necessary for efficient training. We will investigate this further in the next section.

Consider this 9-layer neural network.

At every iteration of the optimization loop (forward, cost, backward, update), we observe that backpropagated gradients are either amplified or attenuated as we move from the output layer towards the input layer. This result makes sense if you consider the following example.

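Assume all the activation functions are linear (identity function) and, for simplicity, that the biases are zero. Then the output activation is:

$$\hat{y} = a^{[L]} = W^{[L]} W^{[L-1]} W^{[L-2]} \dots W^{[3]} W^{[2]} W^{[1]} x$$

where $L$ is the index of the output layer and $W^{[1]}, W^{[2]}, \dots, W^{[L-1]}$ are all matrices of size $(2, 2)$ because layers [1] to [L-1] have 2 neurons and receive 2 inputs. With this in mind, and for illustrative purposes, if we assume $W^{[1]} = W^{[2]} = \dots = W^{[L-1]} = W$, the output prediction is $\hat{y} = W^{[L]} W^{L-1} x$ (where $W^{L-1}$ takes the matrix $W$ to the power of $L-1$, while $W^{[L]}$ denotes the $L$th matrix).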

What would be the outcome of initialization values that were too small, too large or appropriate?

Consider the case where every weight matrix is initialized slightly larger than the identity matrix:

$$W^{[1]} = W^{[2]} = \dots = W^{[L-1]} = \begin{bmatrix} c & 0 \\ 0 & c \end{bmatrix}, \quad c > 1$$

This simplifies to $\hat{y} = W^{[L]} c^{L-1} x$, and the values of $a^{[l]}$ increase exponentially with $l$. When these activations are used in backward propagation, this leads to the exploding gradient problem. That is, the gradients of the cost with respect to the parameters are too big. This leads the cost to oscillate around its minimum value.

Similarly, consider the case where every weight matrix is initialized slightly smaller than the identity matrix:

$$W^{[1]} = W^{[2]} = \dots = W^{[L-1]} = \begin{bmatrix} c & 0 \\ 0 & c \end{bmatrix}, \quad 0 < c < 1$$

This simplifies to $\hat{y} = W^{[L]} c^{L-1} x$, and the values of the activation $a^{[l]}$ decrease exponentially with $l$. When these activations are used in backward propagation, this leads to the vanishing gradient problem. The gradients of the cost with respect to the parameters are too small, leading to convergence of the cost before it has reached the minimum value.

All in all, initializing weights with inappropriate values will lead to divergence or a slow-down in the training of your neural network. Although we illustrated the exploding/vanishing gradient problem with simple symmetrical weight matrices, the observation generalizes to any initialization values that are too small or too large.
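
The two cases above are easy to reproduce. The following sketch assumes a linear network whose nine hidden layers all share the weight matrix $c \cdot I$; the scales 1.5 and 0.5, the input $(1, 1)$, and the helper name `output_norm` are illustrative assumptions, and the final output layer is left out.

```python
import numpy as np

def output_norm(c, n_layers=9, x=(1.0, 1.0)):
    """Norm of the activation after n_layers identity-activation layers with W = c * I."""
    a = np.array(x, dtype=float).reshape(2, 1)
    W = c * np.eye(2)
    for _ in range(n_layers):
        a = W @ a  # linear layer: no bias, identity activation
    return float(np.linalg.norm(a))

exploding = output_norm(1.5)  # scales like 1.5 ** 9
vanishing = output_norm(0.5)  # scales like 0.5 ** 9
```

Because the same factors multiply the backpropagated signal, the gradients grow or shrink at the same exponential rate as the activations.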

To prevent the gradients of the network's activations from vanishing or exploding, we will stick to the following rules of thumb:

- The mean of the activations should be zero.
- The variance of the activations should stay the same across every layer.

Under these two assumptions, the backpropagated gradient signal should not be multiplied by values too small or too large in any layer. It should travel to the input layer without exploding or vanishing.

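More concretely, consider a layer $l$. Its forward propagation is:

$$a^{[l-1]} = g^{[l-1]}(z^{[l-1]}), \quad z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \quad a^{[l]} = g^{[l]}(z^{[l]})$$

We would like the following to hold:

$$E[a^{[l-1]}] = E[a^{[l]}], \quad \mathrm{Var}(a^{[l-1]}) = \mathrm{Var}(a^{[l]})$$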

Ensuring zero-mean and maintaining the value of the variance of the input of every layer guarantees no exploding/vanishing signal, as we'll explain in a moment. This method applies both to the forward propagation (for activations) and backward propagation (for gradients of the cost with respect to activations). The recommended initialization is Xavier initialization (or one of its derived methods), for every layer $l$:

$$W^{[l]} \sim \mathcal{N}\left(\mu = 0,\ \sigma^2 = \frac{1}{n^{[l-1]}}\right), \quad b^{[l]} = 0$$

In other words, all the weights of layer $l$ are picked randomly from a normal distribution with mean $\mu = 0$ and variance $\sigma^2 = \frac{1}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of neurons in layer $l-1$. Biases are initialized with zeros.
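
As a minimal sketch, this initialization could be written as follows in NumPy, assuming the fan-in variance $1/n^{[l-1]}$ for the weights of layer $l$ and zero biases. The function name `xavier_init`, the dictionary layout, and the layer sizes are assumptions made for the example.

```python
import numpy as np

def xavier_init(layer_sizes, seed=0):
    """Return Xavier-initialized parameters for a fully connected network.

    layer_sizes[0] is the input dimension; layer_sizes[l] is the number of
    neurons in layer l. Weights of layer l are drawn from N(0, 1 / n_prev).
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        n_prev, n_curr = layer_sizes[l - 1], layer_sizes[l]
        std = np.sqrt(1.0 / n_prev)  # standard deviation = sqrt(variance)
        params[f"W{l}"] = rng.normal(0.0, std, size=(n_curr, n_prev))
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

params = xavier_init([500, 400, 300])
```

Note that `rng.normal` takes a standard deviation, not a variance, hence the square root.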

The visualization below illustrates the influence of the Xavier initialization on each layer’s activations for a five-layer fully-connected neural network.

You can find the theory behind this visualization in Glorot et al. (2010). The next section presents the mathematical justification for Xavier initialization and explains more precisely why it is an effective initialization.
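
A rough numerical counterpart to this visualization is sketched below. It pushes random inputs through a five-layer tanh network and records the standard deviation of each layer's activations for three weight scales. The layer width of 256, the scales 0.01 and 1.0, and the helper name `activation_stds` are assumptions made for the example.

```python
import numpy as np

def activation_stds(weight_std_fn, layer_sizes, n_samples=1000, seed=0):
    """Standard deviation of the tanh activations of each layer.

    weight_std_fn maps the fan-in n_prev to the standard deviation used
    to draw that layer's weights.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((layer_sizes[0], n_samples))  # normalized inputs
    stds = []
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, weight_std_fn(n_prev), size=(n_curr, n_prev))
        a = np.tanh(W @ a)  # biases are zero
        stds.append(float(a.std()))
    return stds

sizes = [256] * 6  # input layer followed by five layers of 256 units

too_small = activation_stds(lambda n: 0.01, sizes)
xavier = activation_stds(lambda n: np.sqrt(1.0 / n), sizes)
too_large = activation_stds(lambda n: 1.0, sizes)
```

With weights that are too small the activations shrink towards zero layer after layer, and with weights that are too large they saturate near ±1. With the Xavier scale they stay in a moderate range, although the spread is not exactly constant because tanh is only approximately linear around zero.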

In this section, we will show that Xavier Initialization keeps the variance the same across every layer. We will assume that our layer’s activations are normally distributed around zero. Sometimes it helps to understand the mathematical justification to grasp the concept, but you can understand the fundamental idea without the math.

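Let’s work on the layer $l$ described in part III (What is proper initialization?) and assume the activation function is $\tanh$. The forward propagation is:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \quad a^{[l]} = \tanh(z^{[l]})$$

The goal is to derive a relationship between $\mathrm{Var}(a^{[l-1]})$ and $\mathrm{Var}(a^{[l]})$. We will then understand how we should initialize our weights such that $\mathrm{Var}(a^{[l-1]}) = \mathrm{Var}(a^{[l]})$.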

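Assume we initialized our network with appropriate values and the input is normalized. Early on in the training, we are in the linear regime of $\tanh$. Values are small enough that $\tanh(z^{[l]}) \approx z^{[l]}$, meaning that:

$$\mathrm{Var}(a^{[l]}) = \mathrm{Var}(z^{[l]})$$

Moreover,

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} = \left(z_1^{[l]}, z_2^{[l]}, \dots, z_{n^{[l]}}^{[l]}\right)$$

where $z_k^{[l]} = \sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]} a_j^{[l-1]} + b_k^{[l]}$. For simplicity, let’s assume that $b^{[l]} = 0$ (this will end up being true given the initialization we choose). Thus, looking element-wise at the previous equation $\mathrm{Var}(a^{[l]}) = \mathrm{Var}(z^{[l]})$ now gives:

$$\mathrm{Var}(a_k^{[l]}) = \mathrm{Var}(z_k^{[l]}) = \mathrm{Var}\left(\sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]} a_j^{[l-1]}\right)$$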

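A common math trick is to extract the summation outside the variance. To do this, we must make the following three assumptions:

1. Weights are independent and identically distributed.
2. Inputs are independent and identically distributed.
3. Weights and inputs are mutually independent.

Thus, now we have:

$$\mathrm{Var}(a_k^{[l]}) = \mathrm{Var}(z_k^{[l]}) = \mathrm{Var}\left(\sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]} a_j^{[l-1]}\right) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}\left(w_{kj}^{[l]} a_j^{[l-1]}\right)$$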

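Another common math trick is to convert the variance of a product into a product of variances. For independent random variables $X$ and $Y$, the formula is:

$$\mathrm{Var}(XY) = E[X]^2 \mathrm{Var}(Y) + \mathrm{Var}(X) E[Y]^2 + \mathrm{Var}(X) \mathrm{Var}(Y)$$

Using this formula with $X = w_{kj}^{[l]}$ and $Y = a_j^{[l-1]}$, we get:

$$\mathrm{Var}\left(w_{kj}^{[l]} a_j^{[l-1]}\right) = E[w_{kj}^{[l]}]^2 \mathrm{Var}(a_j^{[l-1]}) + \mathrm{Var}(w_{kj}^{[l]}) E[a_j^{[l-1]}]^2 + \mathrm{Var}(w_{kj}^{[l]}) \mathrm{Var}(a_j^{[l-1]})$$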
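
The product formula used here is the standard identity for independent random variables, $\mathrm{Var}(XY) = E[X]^2 \mathrm{Var}(Y) + \mathrm{Var}(X) E[Y]^2 + \mathrm{Var}(X) \mathrm{Var}(Y)$, and it can be sanity-checked by simulation. The means and standard deviations below are arbitrary illustrative choices, and `var_of_product` is a hypothetical helper name.

```python
import numpy as np

def var_of_product(mean_x, var_x, mean_y, var_y):
    """Variance of X * Y for independent X and Y, from their means and variances."""
    return mean_x ** 2 * var_y + var_x * mean_y ** 2 + var_x * var_y

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(0.5, 2.0, size=n)   # mean 0.5, standard deviation 2.0
Y = rng.normal(-1.0, 0.7, size=n)  # mean -1.0, standard deviation 0.7, independent of X

empirical = float(np.var(X * Y))
predicted = var_of_product(0.5, 2.0 ** 2, -1.0, 0.7 ** 2)
relative_error = abs(empirical - predicted) / predicted
```

When both means are zero, only the last term survives, which is the simplification the derivation relies on.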

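We’re almost done! The first assumption leads to $E[w_{kj}^{[l]}]^2 = 0$ and the second assumption leads to $E[a_j^{[l-1]}]^2 = 0$ because weights are initialized with zero mean, and inputs are normalized. Thus:

$$\mathrm{Var}(z_k^{[l]}) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}(w_{kj}^{[l]}) \mathrm{Var}(a_j^{[l-1]}) = n^{[l-1]} \mathrm{Var}(W^{[l]}) \mathrm{Var}(a^{[l-1]})$$

The equality above results from our first assumption stating that:

$$\mathrm{Var}(w_{kj}^{[l]}) = \mathrm{Var}(w_{11}^{[l]}) = \mathrm{Var}(w_{12}^{[l]}) = \dots = \mathrm{Var}(W^{[l]})$$

Similarly the second assumption leads to:

$$\mathrm{Var}(a_j^{[l-1]}) = \mathrm{Var}(a_1^{[l-1]}) = \mathrm{Var}(a_2^{[l-1]}) = \dots = \mathrm{Var}(a^{[l-1]})$$

With the same idea:

$$\mathrm{Var}(z^{[l]}) = \mathrm{Var}(z_k^{[l]})$$

Wrapping up everything, we have:

$$\mathrm{Var}(a^{[l]}) = n^{[l-1]} \mathrm{Var}(W^{[l]}) \mathrm{Var}(a^{[l-1]})$$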

Voilà! If we want the variance to stay the same across layers ($\mathrm{Var}(a^{[l]}) = \mathrm{Var}(a^{[l-1]})$), we need $\mathrm{Var}(W^{[l]}) = \frac{1}{n^{[l-1]}}$. This justifies the choice of variance for Xavier initialization.

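Notice that in the previous steps we did not choose a specific layer $l$. Thus, we have shown that this expression holds for every layer of our network. Let $L$ be the output layer of our network. Using this expression at every layer, we can link the output layer's variance to the input layer's variance:

$$\mathrm{Var}(a^{[L]}) = n^{[L-1]} \mathrm{Var}(W^{[L]}) \mathrm{Var}(a^{[L-1]}) = \dots = \left[\prod_{l=1}^{L} n^{[l-1]} \mathrm{Var}(W^{[l]})\right] \mathrm{Var}(x)$$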

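Depending on how we initialize our weights, the relationship between the variance of our output and input will vary dramatically. Notice the following three cases:

$$n^{[l-1]} \mathrm{Var}(W^{[l]}) \begin{cases} < 1 & \implies \text{vanishing signal} \\ = 1 & \implies \mathrm{Var}(a^{[L]}) = \mathrm{Var}(x) \\ > 1 & \implies \text{exploding signal} \end{cases}$$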

Thus, in order to avoid the vanishing or exploding of the forward propagated signal, we must set $n^{[l-1]} \mathrm{Var}(W^{[l]}) = 1$ by initializing $\mathrm{Var}(W^{[l]}) = \frac{1}{n^{[l-1]}}$.
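
The three regimes can be observed in simulation. The sketch below assumes a purely linear network (the regime used in the derivation) with zero biases, where every layer has $n$ units and weight variance `gain / n`, so that $n \cdot \mathrm{Var}(W)$ equals `gain` in each layer. The width, depth, gains, and the helper name `output_variance` are assumptions made for the example.

```python
import numpy as np

def output_variance(gain, n=256, n_layers=10, n_samples=2000, seed=0):
    """Variance of the last layer's activations in a linear network.

    Every layer has n units and weights drawn from N(0, gain / n), so the
    per-layer factor n * Var(W) is equal to gain. The input has variance 1.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n_samples))  # Var(x) = 1
    for _ in range(n_layers):
        W = rng.normal(0.0, np.sqrt(gain / n), size=(n, n))
        a = W @ a  # linear regime: a = z, with zero biases
    return float(a.var())

vanishing = output_variance(0.5)  # theory: about 0.5 ** 10 times Var(x)
stable = output_variance(1.0)     # theory: about Var(x)
exploding = output_variance(2.0)  # theory: about 2 ** 10 times Var(x)
```

The measured values only approximate the theoretical ones, since the derivation holds in expectation and a single random draw of the weights fluctuates around it.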

Throughout the justification, we worked on activations computed during the forward propagation. The same result can be derived for the backpropagated gradients. Doing so, you will see that in order to avoid the vanishing or exploding gradient problem, we must set $n^{[l]} \mathrm{Var}(W^{[l]}) = 1$ by initializing $\mathrm{Var}(W^{[l]}) = \frac{1}{n^{[l]}}$.

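In practice, machine learning engineers using Xavier initialization would either initialize the weights as $\mathcal{N}\left(0, \frac{1}{n^{[l-1]}}\right)$ or as $\mathcal{N}\left(0, \frac{2}{n^{[l-1]} + n^{[l]}}\right)$. The variance term of the latter distribution is the harmonic mean of $\frac{1}{n^{[l-1]}}$ and $\frac{1}{n^{[l]}}$.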
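
As a quick check of the harmonic-mean statement, recall that the harmonic mean of two positive numbers $u$ and $v$ is $H(u, v) = \frac{2}{1/u + 1/v}$. Taking $u = \frac{1}{n^{[l-1]}}$ (the forward-pass condition) and $v = \frac{1}{n^{[l]}}$ (the backward-pass condition) gives:

$$H\left(\frac{1}{n^{[l-1]}}, \frac{1}{n^{[l]}}\right) = \frac{2}{n^{[l-1]} + n^{[l]}}$$

For instance, with the illustrative sizes $n^{[l-1]} = 300$ and $n^{[l]} = 100$, the compromise variance is $\frac{2}{400} = 0.005$, which lies between the forward-pass value $\frac{1}{300} \approx 0.0033$ and the backward-pass value $\frac{1}{100} = 0.01$.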

This is a theoretical justification for Xavier initialization, which works well with tanh activations. Many other initialization methods exist. If you are using ReLU, for example, a common choice is He initialization (He et al., Delving Deep into Rectifiers), in which the weights are drawn with twice the variance used by Xavier initialization. While the justification for this initialization is slightly more complicated, it follows the same thought process as the one for tanh.
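
A minimal sketch of He initialization follows, assuming the fan-in form with variance $2 / n^{[l-1]}$, which is twice the Xavier variance $1 / n^{[l-1]}$. The helper names `he_init` and `relu_mean_squares`, the layer width, and the depth are assumptions made for the example. The second helper tracks the mean squared ReLU activation from layer to layer, the quantity this scaling is meant to keep roughly stable.

```python
import numpy as np

def he_init(n_prev, n_curr, rng):
    """Draw a weight matrix of shape (n_curr, n_prev) from N(0, 2 / n_prev)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_prev), size=(n_curr, n_prev))

def relu_mean_squares(n=512, n_layers=5, n_samples=500, seed=0):
    """Mean squared activation of each layer in a ReLU network with He weights."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n_samples))  # inputs with mean square close to 1
    history = []
    for _ in range(n_layers):
        a = np.maximum(he_init(n, n, rng) @ a, 0.0)  # ReLU, zero biases
        history.append(float(np.mean(a ** 2)))
    return history

W = he_init(1000, 500, np.random.default_rng(0))
mean_squares = relu_mean_squares()
```

The factor of 2 compensates for ReLU zeroing out roughly half of its inputs, so the mean square of the activations neither collapses nor blows up with depth.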
