
Backpropagation Algorithm and Convergence

 

Using gradient descent (also known as the delta rule), the Backpropagation algorithm searches weight space for the minimum of the error function. The weights that minimize the error function are therefore regarded as the solution to the learning problem.
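As a rough illustration of this idea (using a single weight and an invented toy data set rather than a full multilayer network), a gradient descent search of weight space might look like the following sketch:

```python
# Minimal sketch: gradient descent searching weight space for the
# minimum of a squared-error function E(w) for a single linear unit.
# The training pairs and learning rate eta are made-up values.

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (input x, target t)
eta = 0.05                                     # learning rate
w = 0.0                                        # start somewhere in weight space

for _ in range(200):
    # dE/dw for E(w) = 1/2 * sum (t - w*x)^2  is  -sum (t - w*x) * x
    grad = -sum((t - w * x) * x for x, t in data)
    w -= eta * grad                            # step downhill on the error surface

print(w)   # w settles near 2.0, the weight that minimizes E for this toy data
```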

 

In this blog, we’ll have a look at the Backpropagation algorithm and the concept of convergence.

 

Algorithm For Backpropagation: 

The approach starts by constructing a network with the required number of hidden and output units and initializing all network weights to small random values.

 

The main loop of the algorithm then iterates over the training instances using this fixed network topology.

 

It applies the network to each training example, computes the network’s output error for that example, computes the gradient of this error with respect to the weights, and then updates all network weights.

 

This gradient descent phase is repeated until the network performs satisfactorily (sometimes thousands of times, using the same training samples each time).

 

The Backpropagation weight-update rule is similar to the delta training rule. Like the delta rule, it updates each weight in proportion to the learning rate η, the input value xji to which the weight is applied, and the error in the unit’s output.

 

The main difference is that the error (t – o) in the delta rule is replaced by a more complex error term δj, so that each weight is updated by Δwji = η δj xji. The exact form of δj follows from the derivation of the weight-tuning rule.

 

To get a sense of how it works, consider how δk is computed for each network output unit k.

 

δk is simply the error (tk – ok) from the delta rule, multiplied by the quantity ok(1 – ok), which is the derivative of the sigmoid squashing function: δk = ok(1 – ok)(tk – ok).

 

The δh value for each hidden unit h has a similar form. However, because target values tk are provided only for the network outputs in the training examples, no target values are directly available to indicate the error of the hidden unit values.

 

Instead, the error term for hidden unit h is computed by summing the error terms δk of each output unit influenced by h, weighting each by Wkh, the weight from hidden unit h to output unit k. This weight characterizes the degree to which hidden unit h is “responsible” for the error in output unit k, giving δh = oh(1 – oh) Σk Wkh δk.
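Putting these pieces together, a minimal Python sketch of the full update for a single-hidden-layer sigmoid network might look like the following. The function names, network sizes, learning rate, and toy data are illustrative assumptions, not part of the original post:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, n_in, n_hidden, n_out, eta=0.5, epochs=5000):
    """Sketch of Backpropagation for one hidden layer of sigmoid units.
    `examples` is a list of (input vector, target vector) pairs."""
    # Set all network weights to small random values.
    w_h = [[random.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_hidden)]
    w_o = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden)] for _ in range(n_out)]

    for _ in range(epochs):                 # repeat many times over the same data
        for x, t in examples:               # main loop over training examples
            # Forward pass: apply the network to the example.
            o_h = [sigmoid(sum(w * xi for w, xi in zip(w_h[h], x)))
                   for h in range(n_hidden)]
            o_k = [sigmoid(sum(w * oh for w, oh in zip(w_o[k], o_h)))
                   for k in range(n_out)]

            # Output-unit error terms: delta_k = ok(1 - ok)(tk - ok)
            d_k = [o_k[k] * (1 - o_k[k]) * (t[k] - o_k[k]) for k in range(n_out)]

            # Hidden-unit error terms: delta_h = oh(1 - oh) * sum_k Wkh * delta_k
            d_h = [o_h[h] * (1 - o_h[h]) * sum(w_o[k][h] * d_k[k] for k in range(n_out))
                   for h in range(n_hidden)]

            # Weight updates: delta_w = eta * delta_j * x_ji
            for k in range(n_out):
                for h in range(n_hidden):
                    w_o[k][h] += eta * d_k[k] * o_h[h]
            for h in range(n_hidden):
                for i in range(n_in):
                    w_h[h][i] += eta * d_h[h] * x[i]
    return w_h, w_o

# Hypothetical usage with a tiny made-up data set
# (the third input is held at 1 so it can act as a bias input):
data = [([0, 0, 1], [0]), ([0, 1, 1], [1]), ([1, 0, 1], [1]), ([1, 1, 1], [0])]
w_h, w_o = train(data, n_in=3, n_hidden=3, n_out=1)
```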

 

Convergence:

The error surface of a multilayer network may contain many different local minima, and gradient descent can become trapped in any of them. As a result, Backpropagation over multilayer networks is only guaranteed to converge toward some local minimum of the error function, and not necessarily to the global minimum error.

The following are some common strategies used to try to solve the problem of local minima:

 

 

The stochastic approximation to gradient descent effectively descends a distinct error surface for each training example, relying on the average of these to approximate the gradient for the entire training set.

 

These error surfaces will generally have distinct local minima, making it less likely that the process will become trapped in one of them.
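As a rough illustration of the difference (for a single linear unit trained with the delta rule rather than a full multilayer network; the data and learning rate are made up), true gradient descent and its stochastic approximation can be sketched as follows:

```python
# Sketch contrasting true (batch) gradient descent with its stochastic
# approximation, for a single linear unit trained with the delta rule.

data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]   # (input x, target t)
eta = 0.02

# True gradient descent: one update per pass, using the error surface
# defined over the whole training set.
w = 0.0
for _ in range(100):
    grad = sum(-(t - w * x) * x for x, t in data)
    w -= eta * grad

# Stochastic gradient descent: one update per example, each step
# descending the error surface of that single example.
w_sgd = 0.0
for _ in range(100):
    for x, t in data:
        grad = -(t - w_sgd * x) * x
        w_sgd -= eta * grad

print(w, w_sgd)   # both settle near the same weight, but follow different paths
```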

 

 
