What are the different ways of solving gradient issues in RNNs?
Ans: During backpropagation through time, the gradient is multiplied by the recurrent weight wrec once for every time step it travels back. If wrec is small (roughly, below 1 in magnitude), the gradient shrinks toward zero and you have the vanishing gradient problem; if wrec is large (above 1 in magnitude), the gradient grows without bound and you have the exploding gradient problem. The smaller the gradient is, the smaller the weight updates are and the longer it takes to reach the final result. Because the output of earlier layers is used as the input for later layers, training at time point t ends up relying on inputs that come from poorly trained layers, so the whole network is not trained properly.
With a vanishing gradient, the further back through the network the error is propagated, the smaller the gradient becomes and the harder it is to train the weights there, which has a domino effect on everything later in the network that depends on them.
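To make the role of wrec concrete, here is a minimal sketch. It assumes a scalar, linear recurrence h_t = w_rec * h_{t-1}, which is a simplification made for illustration; the function name `gradient_factor` is hypothetical. Under that assumption, the gradient flowing back over a number of time steps is scaled by w_rec raised to that number.

```python
def gradient_factor(w_rec, steps):
    """Scale applied to a gradient after flowing back `steps` time steps
    through the scalar linear recurrence h_t = w_rec * h_{t-1} (assumed model)."""
    factor = 1.0
    for _ in range(steps):
        factor *= w_rec  # one multiplication by w_rec per time step
    return factor


# Compare a small, a neutral, and a large recurrent weight over 50 steps.
for w_rec in (0.5, 1.0, 1.5):
    print(w_rec, gradient_factor(w_rec, steps=50))
```

With w_rec = 0.5 the factor collapses toward zero, with w_rec = 1.5 it blows up, and only values close to 1 keep the gradient usable over long sequences.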
In the case of exploding gradients, you can:
- stop backpropagating after a certain number of time steps (truncated backpropagation), which is usually not optimal because not all of the weights get updated;
- penalize or artificially reduce the gradient;
- put a maximum limit on the gradient (gradient clipping).
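Putting a maximum limit on the gradient is commonly done by rescaling it whenever its norm exceeds a threshold. The following is a minimal sketch of that idea in plain Python; the function name `clip_by_global_norm`, the list-of-lists gradient layout, and the threshold value are assumptions chosen for illustration, not a specific library's API.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their joint L2 norm is at most max_norm.

    grads: list of lists of floats, one inner list per parameter
           (hypothetical layout for this sketch).
    Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    # Shrink every component by the same factor, preserving the direction.
    return [[g * scale for g in vec] for vec in grads], total_norm


clipped, norm = clip_by_global_norm([[3.0, 4.0]], max_norm=1.0)
print(norm, clipped)
```

Scaling all components by the same factor keeps the update direction and only limits the step size, which is why this is often preferred over clamping each component separately.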
In the case of vanishing gradients, you can:
- initialize the weights so that the potential for a vanishing gradient is minimized;
- use Echo State Networks, which sidestep the problem by keeping the recurrent weights fixed and training only the output layer;
- use Long Short-Term Memory networks (LSTMs), whose gated cell state lets gradients flow across many time steps.
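As a rough illustration of why LSTMs and careful initialization help, the sketch below runs a scalar LSTM cell and tracks the product of its forget-gate activations. Along the direct cell-state path c_{t-1} -> c_t, the gradient at each step is scaled by the forget gate (this ignores the indirect paths through the hidden state). The scalar formulation, the `params` layout, and all parameter values are assumptions for illustration; the forget-gate bias of 5.0 is an example of an initialization that keeps the gate close to 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps a gate name to (w_x, w_h, b); this layout is hypothetical.
    Returns the new hidden state, the new cell state, and the forget gate value.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # additive cell-state update
    h = o * math.tanh(c)
    return h, c, f


def cell_state_gradient_factor(inputs, params):
    """Product of forget gates: the scale on the direct c_0 -> c_T gradient path."""
    h, c, factor = 0.0, 0.0, 1.0
    for x in inputs:
        h, c, f = lstm_step(x, h, c, params)
        factor *= f
    return factor


# Illustrative values only; the large forget bias keeps f close to 1.
params = {
    "forget": (0.0, 0.0, 5.0),
    "input": (0.5, 0.5, 0.0),
    "output": (0.5, 0.5, 0.0),
    "candidate": (0.5, 0.5, 0.0),
}

print(cell_state_gradient_factor([1.0] * 50, params))
```

Because the forget gate stays close to 1 here, the factor after 50 steps remains of order 1, whereas repeatedly multiplying by a recurrent weight of 0.5 would shrink it to almost nothing.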