Derivation of the Backpropagation Rule
The purpose of backpropagation is to find the partial derivative of the cost function C with respect to every weight w and bias b in the network. It is a supervised learning algorithm used to train Multilayer Perceptrons (Artificial Neural Networks).
In this blog, we’ll have a look at the Backpropagation rule and its derivation.
Once we have these partial derivatives, we update each weight and bias using the product of a small constant (the learning rate) and the partial derivative of the cost function with respect to that weight or bias. This is the well-known gradient descent method.
The partial derivatives give the direction of steepest ascent. We therefore take a small step in the opposite direction, the direction of steepest descent, which leads us toward a local minimum of the cost function.
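The update step can be sketched in a few lines of Python. The one-parameter cost C(w) = (w − 2)² and the learning rate used here are illustrative stand-ins, not from the article:

```python
# A minimal sketch of gradient descent on a toy cost C(w) = (w - 2)**2,
# whose derivative we can write by hand.

def dC_dw(w):
    # derivative of C(w) = (w - 2)**2 with respect to w
    return 2 * (w - 2)

def gradient_descent(w, alpha=0.1, steps=100):
    # repeatedly step opposite the gradient (the direction of steepest descent)
    for _ in range(steps):
        w = w - alpha * dC_dw(w)
    return w

w_final = gradient_descent(w=5.0)
print(w_final)  # converges near the minimum at w = 2
```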
What is Backpropagation, and how does it work?
Using a technique known as the delta rule, or gradient descent, the backpropagation algorithm searches for the minimum of the error function in weight space. The weights that minimize the error function are then regarded as a solution to the learning problem.
Let’s look at an example. Suppose you have a labeled data set:
| Input | Desired Output |
| --- | --- |
| 0 | 0 |
| 1 | 2 |
| 2 | 4 |
Suppose the model computes output = W × input. When W = 3, the model’s output is:
| Input | Desired Output | Model Output (W = 3) |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 2 | 3 |
| 2 | 4 | 6 |
The absolute errors for the three examples are 0, 1, and 2, and the squared errors are 0, 1, and 4. If we increase W further, the error grows; if we decrease W, the error shrinks.
Steps:
- We started by setting ‘W’ to a random value and then propagated forward.
- Then we noticed there was an error, so we propagated backward and increased the value of ‘W’ to reduce it.
- We then found that the error had increased, which told us that we should not increase ‘W’.
- So we propagated backward again and decreased the value of ‘W’.
- We have now found that the error has decreased.
As a result, we are trying to find a weight value that minimizes the error. Essentially, we must determine whether the weight value should be increased or decreased.
Once we know that, we keep updating the weight in that direction until the error is as small as possible. At some point, updating the weight further starts to increase the error again. You must stop there, and that is your final weight value.
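The search described above can be sketched as one-dimensional gradient descent on the example data set. The linear model output = W × input and the starting value W = 3 come from the example; the learning rate is an illustrative choice:

```python
# Trial-and-error weight search, written as 1-D gradient descent on the
# squared error over the example data set.

data = [(0, 0), (1, 2), (2, 4)]  # (input, desired output)

def squared_error(W):
    return sum((W * x - t) ** 2 for x, t in data)

def grad(W):
    # d/dW of sum (W*x - t)^2 is sum 2*(W*x - t)*x
    return sum(2 * (W * x - t) * x for x, t in data)

W = 3.0       # initial guess, as in the example
alpha = 0.05  # illustrative learning rate
for _ in range(200):
    W -= alpha * grad(W)

print(round(W, 3))  # approaches 2, where the error is zero
```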
Consider a graph of the loss plotted against the weights: such a curve can contain several local minima, but we want to reach the ‘Global Loss Minimum.’ This process of propagating the error backward and adjusting the weights is called backpropagation.
Derivation of the Backpropagation Rule:
For each training example d, we descend the gradient of the error Ed with respect to that single example. In other words, for each training example d, every weight is updated by adding to it

$$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} \tag{4.21}$$

where η is the learning rate and Ed is the summed error, over all of the network’s output units, on training example d:

$$E_d = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$
Where,
- x_ji = the ith input to unit j
- w_ji = the weight associated with the ith input to unit j
- net_j = $\sum_i w_{ji} x_{ji}$ (the weighted sum of inputs for unit j)
- o_j = the output computed by unit j
- t_j = the target output for unit j
- σ = the sigmoid function
- outputs = the set of units in the final layer of the network
- Downstream(j) = the set of units whose immediate inputs include the output of unit j
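To make the notation concrete, here is a small sketch computing net_j and o_j for a single unit; the input and weight values are made up for illustration:

```python
# Illustration of the notation: net_j is the weighted sum of unit j's
# inputs, and o_j = sigmoid(net_j). The numbers here are made up.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x_j = [1.0, 0.5, -1.0]   # x_ji: the inputs to unit j
w_j = [0.2, -0.4, 0.1]   # w_ji: the corresponding weights

net_j = sum(w * x for w, x in zip(w_j, x_j))  # net_j = sum_i w_ji * x_ji
o_j = sigmoid(net_j)                          # o_j = sigma(net_j)
print(net_j, o_j)
```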
To implement the stochastic gradient descent rule, we need an expression for the gradient ∂Ed/∂w_ji. To begin, notice that weight w_ji can influence the rest of the network only through net_j. Applying the chain rule, we may write

$$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji} \tag{4.22}$$

Our remaining task is therefore to derive a convenient expression for ∂Ed/∂net_j.
We investigate two scenarios in turn: one in which unit j is a network output unit, and another in which unit j is an internal unit.
Case 1: Output Unit Weights Training Rule:
w_ji can influence the rest of the network only through net_j, and net_j can influence it only through the unit’s output o_j. Invoking the chain rule once more, we may write

$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \tag{4.23}$$

Consider first the ∂Ed/∂o_j term. The derivative ∂(t_k − o_k)²/∂o_j is zero for every output unit k except k = j, so

$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = -(t_j - o_j) \tag{4.24}$$

Next, consider the second term in Equation (4.23). Since o_j = σ(net_j), and the derivative of the sigmoid is σ(net_j)(1 − σ(net_j)), we have

$$\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j) \tag{4.25}$$

Now substitute expressions (4.24) and (4.25) into (4.23) to obtain

$$\frac{\partial E_d}{\partial net_j} = -(t_j - o_j)\, o_j (1 - o_j)$$

Combining this with Equations (4.21) and (4.22) gives the stochastic gradient descent rule for output units:

$$\Delta w_{ji} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji}$$
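As a sanity check on the output-unit rule, here is a small sketch that compares the analytic gradient −(t − o)·o·(1 − o)·x_ji with a finite-difference estimate of ∂Ed/∂w_ji; the inputs, weights, target, and learning rate are illustrative:

```python
# Checking the output-unit gradient numerically against
# E_d = 0.5*(t - o)^2 for a single sigmoid output unit.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    net = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(net)

def E_d(w, x, t):
    o = forward(w, x)
    return 0.5 * (t - o) ** 2

x = [1.0, -0.5]  # illustrative inputs x_ji
w = [0.3, 0.8]   # illustrative weights w_ji
t = 1.0          # target output t_j
eta = 0.5        # illustrative learning rate

o = forward(w, x)
# analytic gradient from the derivation: dE_d/dw_ji = -(t - o)*o*(1 - o)*x_ji
analytic = [-(t - o) * o * (1 - o) * xi for xi in x]

# finite-difference estimate of the same gradient
eps = 1e-6
numeric = [(E_d([wi + eps if i == j else wi for j, wi in enumerate(w)], x, t)
            - E_d(w, x, t)) / eps for i in range(len(w))]

print(analytic, numeric)  # the two gradients agree closely

# the update Delta w_ji = eta*(t - o)*o*(1 - o)*x_ji, i.e. minus eta times the gradient
w = [wi - eta * g for wi, g in zip(w, analytic)]
```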
Case 2: Hidden Unit Weights Training Rule
When j is an internal, or hidden, unit of the network, the training rule for w_ji must account for the indirect ways in which w_ji can influence the network outputs, and hence Ed. Let Downstream(j) denote the set of all units whose immediate inputs include the output of unit j. net_j can influence the network outputs (and hence Ed) only through the units in Downstream(j). Therefore, writing δ_k for −∂Ed/∂net_k, we may write

$$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j)$$

Defining δ_j = −∂Ed/∂net_j, this yields

$$\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$$

and, as in the output-unit case, the weight update for hidden units is

$$\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$$
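The two cases together can be sketched as one backpropagation step for a tiny 2-input, 3-hidden-unit, 1-output network. The layer sizes, weights, training example, and learning rate are all illustrative choices, not from the article:

```python
# One backpropagation step for a tiny 2-3-1 sigmoid network, using
# delta_k = (t_k - o_k)*o_k*(1 - o_k) for the output unit and
# delta_j = o_j*(1 - o_j)*sum_k delta_k*w_kj for hidden units.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# W_hidden[j][i] feeds input i into hidden unit j;
# w_out[j] feeds hidden unit j into the single output unit
W_hidden = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w_out = [0.3, -0.1, 0.2]
eta = 0.5

x = [1.0, 0.5]  # one training example
t = 1.0         # its target output

# forward pass
o_hidden = [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in W_hidden]
o_out = sigmoid(sum(w * o for w, o in zip(w_out, o_hidden)))

# backward pass: deltas (hidden deltas use the pre-update output weights)
delta_out = (t - o_out) * o_out * (1 - o_out)
delta_hidden = [o * (1 - o) * delta_out * w_out[j]
                for j, o in enumerate(o_hidden)]

# weight updates: Delta w_ji = eta * delta_j * x_ji
w_out = [w + eta * delta_out * o for w, o in zip(w_out, o_hidden)]
W_hidden = [[w + eta * delta_hidden[j] * xi for w, xi in zip(row, x)]
            for j, row in enumerate(W_hidden)]
```

One step of these updates moves the output toward the target, so the squared error on this example decreases.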