Derivation of the Backpropagation Rule
The purpose of backpropagation is to find the partial derivative of the cost function C with respect to every weight w and bias b in the network. It is a supervised learning algorithm used to train Multilayer Perceptrons (Artificial Neural Networks).
In this blog, we’ll have a look at the Backpropagation rule and its derivation.
Once we have these partial derivatives, we update each weight and bias using the product of a small constant (the learning rate) and the partial derivative of the cost function with respect to that weight or bias. This is the well-known gradient descent method.
The partial derivatives give the direction of steepest ascent. We therefore take a small step in the opposite direction, the direction of steepest descent, which leads us toward a local minimum of the cost function.
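The update step can be sketched in a few lines of Python. The one-parameter cost C(w) = (w − 2)² and the learning rate used here are illustrative stand-ins, not from the article:

```python
# A minimal sketch of gradient descent on a toy cost C(w) = (w - 2)**2,
# whose derivative we can write by hand.

def dC_dw(w):
    # derivative of C(w) = (w - 2)**2 with respect to w
    return 2 * (w - 2)

def gradient_descent(w, alpha=0.1, steps=100):
    # repeatedly step opposite the gradient (the direction of steepest descent)
    for _ in range(steps):
        w = w - alpha * dC_dw(w)
    return w

w_final = gradient_descent(w=5.0)
print(w_final)  # converges near the minimum at w = 2
```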
What is Backpropagation, and how does it work?
Using a technique known as the delta rule, or gradient descent, the backpropagation algorithm searches for the minimum of the error function in weight space. The weights that minimize the error function are then regarded as a solution to the learning problem.
Let’s look at an example. Suppose you have a labeled data set:
| Input | Desired Output |
| --- | --- |
| 0 | 0 |
| 1 | 2 |
| 2 | 4 |
Suppose the model computes output = W × input. When W = 3, the model’s output is:
| Input | Desired Output | Model Output (W = 3) |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 2 | 3 |
| 2 | 4 | 6 |
The absolute errors for the three examples are 0, 1, and 2, and the squared errors are 0, 1, and 4. If we increase W further, the error grows; if we decrease W, the error shrinks.
Steps:
- We started by setting ‘W’ to a random value and then propagated forward.
- Then we noticed there was an error, so we propagated backward and increased the value of ‘W’ to reduce it.
- We then found that the error had increased, which told us that we should not increase ‘W’.
- So we propagated backward again and decreased the value of ‘W’.
- We have now found that the error has decreased.
As a result, we are trying to find a weight value that minimizes the error. Essentially, we must determine whether the weight value should be increased or decreased.
Once we know that, we keep updating the weight in that direction until the error is as small as possible. At some point, updating the weight further starts to increase the error again. You must stop there, and that is your final weight value.
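The search described above can be sketched as one-dimensional gradient descent on the example data set. The linear model output = W × input and the starting value W = 3 come from the example; the learning rate is an illustrative choice:

```python
# Trial-and-error weight search, written as 1-D gradient descent on the
# squared error over the example data set.

data = [(0, 0), (1, 2), (2, 4)]  # (input, desired output)

def squared_error(W):
    return sum((W * x - t) ** 2 for x, t in data)

def grad(W):
    # d/dW of sum (W*x - t)^2 is sum 2*(W*x - t)*x
    return sum(2 * (W * x - t) * x for x, t in data)

W = 3.0       # initial guess, as in the example
alpha = 0.05  # illustrative learning rate
for _ in range(200):
    W -= alpha * grad(W)

print(round(W, 3))  # approaches 2, where the error is zero
```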
Consider a graph of the loss plotted against the weights: such a curve can contain several local minima, but we want to reach the ‘Global Loss Minimum.’ This process of propagating the error backward and adjusting the weights is called backpropagation.
Derivation of the Backpropagation Rule:
For each training example d, we descend the gradient of the error Ed with respect to that single example. In other words, for each training example d, every weight is updated by adding to it

$$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} \tag{4.21}$$

where η is the learning rate and Ed is the summed error, over all of the network’s output units, on training example d:

$$E_d = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$
Where,
- x_ji = the ith input to unit j
- w_ji = the weight associated with the ith input to unit j
- net_j = $\sum_i w_{ji} x_{ji}$ (the weighted sum of inputs for unit j)
- o_j = the output computed by unit j
- t_j = the target output for unit j
- σ = the sigmoid function
- outputs = the set of units in the final layer of the network
- Downstream(j) = the set of units whose immediate inputs include the output of unit j
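To make the notation concrete, here is a small sketch computing net_j and o_j for a single unit; the input and weight values are made up for illustration:

```python
# Illustration of the notation: net_j is the weighted sum of unit j's
# inputs, and o_j = sigmoid(net_j). The numbers here are made up.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x_j = [1.0, 0.5, -1.0]   # x_ji: the inputs to unit j
w_j = [0.2, -0.4, 0.1]   # w_ji: the corresponding weights

net_j = sum(w * x for w, x in zip(w_j, x_j))  # net_j = sum_i w_ji * x_ji
o_j = sigmoid(net_j)                          # o_j = sigma(net_j)
print(net_j, o_j)
```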
To implement the stochastic gradient descent rule, we need an expression for the gradient ∂Ed/∂w_ji. To begin, notice that weight w_ji can influence the rest of the network only through net_j. Applying the chain rule, we may write

$$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji} \tag{4.22}$$

Our remaining task is therefore to derive a convenient expression for ∂Ed/∂net_j.
We investigate two scenarios in turn: one in which unit j is a network output unit, and another in which unit j is an internal unit.
Case 1: Output Unit Weights Training Rule:
w_ji can influence the rest of the network only through net_j, and net_j can influence it only through the unit’s output o_j. Invoking the chain rule once more, we may write

$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \tag{4.23}$$

Consider first the ∂Ed/∂o_j term. The derivative ∂(t_k − o_k)²/∂o_j is zero for every output unit k except k = j, so

$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = -(t_j - o_j) \tag{4.24}$$

Next, consider the second term in Equation (4.23). Since o_j = σ(net_j), and the derivative of the sigmoid is σ(net_j)(1 − σ(net_j)), we have

$$\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j) \tag{4.25}$$

Now substitute expressions (4.24) and (4.25) into (4.23) to obtain

$$\frac{\partial E_d}{\partial net_j} = -(t_j - o_j)\, o_j (1 - o_j)$$

Combining this with Equations (4.21) and (4.22) gives the stochastic gradient descent rule for output units:

$$\Delta w_{ji} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji}$$
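As a sanity check on the output-unit rule, here is a small sketch that compares the analytic gradient −(t − o)·o·(1 − o)·x_ji with a finite-difference estimate of ∂Ed/∂w_ji; the inputs, weights, target, and learning rate are illustrative:

```python
# Checking the output-unit gradient numerically against
# E_d = 0.5*(t - o)^2 for a single sigmoid output unit.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    net = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(net)

def E_d(w, x, t):
    o = forward(w, x)
    return 0.5 * (t - o) ** 2

x = [1.0, -0.5]  # illustrative inputs x_ji
w = [0.3, 0.8]   # illustrative weights w_ji
t = 1.0          # target output t_j
eta = 0.5        # illustrative learning rate

o = forward(w, x)
# analytic gradient from the derivation: dE_d/dw_ji = -(t - o)*o*(1 - o)*x_ji
analytic = [-(t - o) * o * (1 - o) * xi for xi in x]

# finite-difference estimate of the same gradient
eps = 1e-6
numeric = [(E_d([wi + eps if i == j else wi for j, wi in enumerate(w)], x, t)
            - E_d(w, x, t)) / eps for i in range(len(w))]

print(analytic, numeric)  # the two gradients agree closely

# the update Delta w_ji = eta*(t - o)*o*(1 - o)*x_ji, i.e. minus eta times the gradient
w = [wi - eta * g for wi, g in zip(w, analytic)]
```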
Case 2: Hidden Unit Weights Training Rule
When j is an internal, or hidden, unit of the network, the training rule for w_ji must account for the indirect ways in which w_ji can influence the network outputs, and hence Ed. Let Downstream(j) denote the set of all units whose immediate inputs include the output of unit j. net_j can influence the network outputs (and hence Ed) only through the units in Downstream(j). Therefore, writing δ_k for −∂Ed/∂net_k, we may write

$$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j)$$

Defining δ_j = −∂Ed/∂net_j, this yields

$$\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$$

and, as in the output-unit case, the weight update for hidden units is

$$\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$$
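The two cases together can be sketched as one backpropagation step for a tiny 2-input, 3-hidden-unit, 1-output network. The layer sizes, weights, training example, and learning rate are all illustrative choices, not from the article:

```python
# One backpropagation step for a tiny 2-3-1 sigmoid network, using
# delta_k = (t_k - o_k)*o_k*(1 - o_k) for the output unit and
# delta_j = o_j*(1 - o_j)*sum_k delta_k*w_kj for hidden units.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# W_hidden[j][i] feeds input i into hidden unit j;
# w_out[j] feeds hidden unit j into the single output unit
W_hidden = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w_out = [0.3, -0.1, 0.2]
eta = 0.5

x = [1.0, 0.5]  # one training example
t = 1.0         # its target output

# forward pass
o_hidden = [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in W_hidden]
o_out = sigmoid(sum(w * o for w, o in zip(w_out, o_hidden)))

# backward pass: deltas (hidden deltas use the pre-update output weights)
delta_out = (t - o_out) * o_out * (1 - o_out)
delta_hidden = [o * (1 - o) * delta_out * w_out[j]
                for j, o in enumerate(o_hidden)]

# weight updates: Delta w_ji = eta * delta_j * x_ji
w_out = [w + eta * delta_out * o for w, o in zip(w_out, o_hidden)]
W_hidden = [[w + eta * delta_hidden[j] * xi for w, xi in zip(row, x)]
            for j, row in enumerate(W_hidden)]
```

One step of these updates moves the output toward the target, so the squared error on this example decreases.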