Solving Gradient Problems in Neural Networks

Infographic: techniques that can be used to solve gradient problems in neural networks.

Backpropagation calculates the gradient of the loss function with respect to the weights and updates the weights to reduce the error. The main mathematical principle at work is the chain rule. However, the repeated multiplication it entails across many layers can lead to either the vanishing or the exploding gradient problem.
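To make the effect concrete, here is a tiny, illustrative NumPy calculation (not from the original article): a product of fifty per-layer derivative factors either collapses toward zero or blows up depending on whether each factor is below or above one.

```python
import numpy as np

# Toy illustration: the backpropagated gradient is a product of per-layer
# terms, so its magnitude behaves roughly like factor ** depth.
depth = 50
for factor in (0.5, 1.5):
    grad = np.prod(np.full(depth, factor))
    print(f"per-layer factor {factor}: gradient magnitude ~ {grad:.3e}")
# factor 0.5 -> ~8.9e-16 (vanishing); factor 1.5 -> ~6.4e+08 (exploding)
```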

Solutions for Both Issues

Intelligent Weight Initialization

There are several methods that can be used to address these issues. Intelligent weight initialization schemes such as Xavier/Glorot and He initialization help maintain a stable variance of activations and gradients as they propagate through the network.
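As a minimal, illustrative PyTorch sketch (the layer sizes are arbitrary assumptions), Xavier/Glorot initialization suits tanh/sigmoid layers, while He (Kaiming) initialization suits ReLU layers:

```python
import torch.nn as nn

# Xavier/Glorot init for a tanh/sigmoid layer: variance ~ 2 / (fan_in + fan_out)
tanh_layer = nn.Linear(256, 256)
nn.init.xavier_uniform_(tanh_layer.weight)
nn.init.zeros_(tanh_layer.bias)

# He (Kaiming) init for a ReLU layer: variance ~ 2 / fan_in
relu_layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')
nn.init.zeros_(relu_layer.bias)
```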

Batch Normalization

Batch normalization stabilizes the distribution of activations in each layer by normalizing its inputs to zero mean and unit standard deviation. This prevents activations from saturating (addressing vanishing gradients) and keeps their scale under control (addressing exploding gradients).
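A minimal PyTorch sketch of where batch normalization typically sits, between the linear transform and the nonlinearity (the layer sizes and batch size are assumptions for illustration):

```python
import torch
import torch.nn as nn

# BatchNorm normalizes across the mini-batch, then applies a learned
# scale (gamma) and shift (beta).
block = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
)
x = torch.randn(32, 128)   # batch of 32 samples
print(block(x).shape)      # torch.Size([32, 128])
```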

Solutions for Vanishing Gradient

ReLU Activation

The ReLU activation function has a constant derivative of 1 for positive inputs. This means that for a significant portion of the input range, the ReLU function allows the gradient signal to pass through unimpeded.
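A quick, illustrative check of this property with PyTorch autograd (not from the original article):

```python
import torch

# For positive inputs the derivative of ReLU is exactly 1, so the gradient
# passes through unchanged; for negative inputs it is 0.
x = torch.tensor([-2.0, 3.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 1.])
```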

Residual Networks (ResNets)

Making changes to the architecture of the network can also help solve the vanishing gradient problem. Residual networks (ResNets) introduce a ‘skip connection’ that adds the input of a network block directly to its output, bypassing a series of layers.
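A minimal sketch of a residual block in PyTorch (the layer sizes are assumptions); the key line is the addition of the block's input to its output, which gives the gradient an additive identity path back to earlier layers:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # Skip connection: output = activation(x + F(x))
        return self.act(x + self.body(x))

print(ResidualBlock()(torch.randn(8, 64)).shape)   # torch.Size([8, 64])
```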

Solution for Exploding Gradient

Gradient Clipping

Gradient clipping helps resolve the exploding gradient problem. It acts as a safety mechanism that prevents training from becoming unstable or diverging. After the gradients are computed during backpropagation, their L2 norm is calculated. If this norm exceeds a predefined threshold, the entire gradient vector is scaled down so that its norm equals the threshold value.
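A minimal PyTorch sketch of norm-based gradient clipping applied just before the optimizer step (the model and the max_norm threshold of 1.0 are assumptions for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
# Rescale the whole gradient vector if its L2 norm exceeds 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```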

| Solution | Gradient Problem | Mathematical Principle |
| --- | --- | --- |
| ReLU | Vanishing | Derivative is a constant for positive inputs, preventing multiplicative decay. |
| Residual Networks | Vanishing | Provides an additive identity term in the gradient path, ensuring the signal does not vanish. |
| Batch Normalization | Both | Normalizes activations to prevent them from entering saturated regions (vanishing) and to control their magnitude (exploding). |
| Intelligent Initialization | Both | Initializes weights to a scale that maintains a stable variance of activations and gradients from the start. |
| Gradient Clipping | Exploding | Imposes a threshold on the L2 norm of the gradient vector, preventing excessively large updates. |
