
Backpropagation Algorithm and Convergence

 

Using a technique known as the delta rule, or gradient descent, the backpropagation algorithm searches for the minimum of the error function in weight space. The weights that minimize the error function are then regarded as a solution to the learning problem.

 

In this blog, we’ll have a look at the algorithm for backpropagation and at the concept of convergence.

 

Algorithm For Backpropagation: 

The approach starts by building a network with the necessary number of hidden and output units and initializing all network weights to small random values.

 

The main loop of the algorithm then iterates over the training instances using this fixed network topology.

 

It applies the network to each training example, determines the network's output error for that example, computes the gradient of this error with respect to the weights, and then updates all network weights.

 

This gradient descent phase is repeated until the network performs satisfactorily (sometimes thousands of times, using the same training samples each time).
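The loop described above can be sketched in Python. This is a minimal illustration, not code from the original text: the 2-2-1 architecture, the OR training data, and all names (`TinyNet`, `train`, `eta`) are assumptions made for this example.

```python
# A minimal sketch of the backpropagation loop described above, for a
# hypothetical network with 2 inputs, 2 sigmoid hidden units, and 1
# sigmoid output unit. All names here are illustrative assumptions.
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_weights(n):
    # Initialize weights to small random values, as the text prescribes.
    return [random.uniform(-0.05, 0.05) for _ in range(n)]

class TinyNet:
    """2 inputs -> 2 hidden sigmoid units -> 1 sigmoid output."""
    def __init__(self):
        self.w_h = [init_weights(3) for _ in range(2)]  # hidden weights (index 0 = bias)
        self.w_o = init_weights(3)                      # output weights (index 0 = bias)

    def forward(self, x):
        h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in self.w_h]
        o = sigmoid(self.w_o[0] + self.w_o[1] * h[0] + self.w_o[2] * h[1])
        return h, o

    def train(self, data, eta=0.5, epochs=2000):
        for _ in range(epochs):            # outer loop: repeat until satisfactory
            for x, t in data:              # inner loop: one update per example
                h, o = self.forward(x)
                delta_o = (t - o) * o * (1 - o)                 # output error term
                delta_h = [h[j] * (1 - h[j]) * self.w_o[j + 1] * delta_o
                           for j in range(2)]                   # hidden error terms
                # Gradient-descent weight updates.
                self.w_o[0] += eta * delta_o
                for j in range(2):
                    self.w_o[j + 1] += eta * delta_o * h[j]
                    self.w_h[j][0] += eta * delta_h[j]
                    self.w_h[j][1] += eta * delta_h[j] * x[0]
                    self.w_h[j][2] += eta * delta_h[j] * x[1]

def total_error(net, data):
    return sum((t - net.forward(x)[1]) ** 2 for x, t in data) / 2

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # boolean OR
net = TinyNet()
before = total_error(net, data)
net.train(data)
after = total_error(net, data)
```

After repeating the gradient descent step a few thousand times, the total error on the training set should be far smaller than at initialization.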

 

The weight-update rule used here is comparable to the delta training rule. Like the delta rule, it changes each weight in proportion to the learning rate η, the input value xji to which the weight is applied, and an error term for the unit’s output.

 

The main change is that the error (t – o) in the delta rule is replaced by a more complicated error term δj, whose exact form derives from the derivation of the weight-tuning rule.

 

To get a sense of how it works, consider how the error term δk is computed for each network output unit k.

 

δk is simply the (tk – ok) term from the delta rule multiplied by the quantity ok(1 – ok), which is the derivative of the sigmoid squashing function.
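As a hypothetical numerical check of this formula (the function name `output_delta` is an assumption, not from the original text):

```python
# Output-unit error term for a sigmoid unit:
#   delta_k = (t_k - o_k) * o_k * (1 - o_k)
# where o_k * (1 - o_k) is the derivative of the sigmoid at the unit's output.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_delta(t_k, o_k):
    # (target - output) scaled by the sigmoid derivative o_k * (1 - o_k)
    return (t_k - o_k) * o_k * (1 - o_k)

o = sigmoid(0.3)            # some unit output
d = output_delta(1.0, o)    # positive: output should move toward 1
```

Note that the sigmoid-derivative factor ok(1 – ok) is largest at ok = 0.5 and shrinks toward zero as the output saturates.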

 

Computing the error term for each hidden unit h is less direct. Because training examples provide target values tk only for the network outputs, no target values are explicitly available to signal the error of the hidden unit values.

 

Rather, the error term δh for hidden unit h is determined by summing the error terms δk of each output unit influenced by h, weighting each by wkh, the weight from hidden unit h to output unit k. This weight characterizes the degree to which hidden unit h is “responsible” for the error in output unit k.
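The same weighted sum can be sketched directly (the helper name `hidden_delta` and the example values are assumptions for illustration):

```python
# Hidden-unit error term for a sigmoid unit:
#   delta_h = o_h * (1 - o_h) * sum_k( w_kh * delta_k )
# i.e. the downstream output deltas, each weighted by the connection
# from hidden unit h to output unit k.
def hidden_delta(o_h, downstream):
    """downstream: list of (w_kh, delta_k) pairs for outputs fed by h."""
    return o_h * (1 - o_h) * sum(w * d for w, d in downstream)

# Example: hidden output 0.5, two downstream output units with
# weights +1.0 and -1.0 and deltas 0.2 and 0.1 respectively.
dh = hidden_delta(0.5, [(1.0, 0.2), (-1.0, 0.1)])
```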

 

Convergence:

  • The BACKPROPAGATION technique implements a gradient descent search through the space of possible network weights, iteratively reducing the error E between the training example target values and the network outputs.

 

  • Gradient descent can become caught in any of the many possible local minima that exist on the error surface for multilayer networks. As a result, BACKPROPAGATION over multilayer networks can only converge to a local minimum in E, not to the global minimum error.

 

  • In practice, however, local minima cause less trouble than one might fear. Consider that networks with large numbers of weights correspond to error surfaces in very high-dimensional spaces (one dimension per weight).

 

  • When gradient descent reaches a local minimum with respect to one of these weights, it is not necessarily in a local minimum with respect to the other weights.

 

  • In fact, the more weights in the network, the more dimensions the error surface has, and the more “escape routes” gradient descent has for moving away from a local minimum with respect to any single weight.

 

  • For a second perspective on local minima, consider how the network weights evolve as the number of training iterations grows.

 

  • If the network weights are initialized to values near zero, then during the early stages of gradient descent the network represents a very smooth function that is approximately linear in its inputs.

 

  • This is because when the weights are close to zero, each unit’s weighted input stays near zero, where the sigmoid threshold function is almost linear.

 

  • Only after the weights have had time to grow will the network be able to represent highly nonlinear functions.
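The near-linearity of the sigmoid around zero can be verified numerically. This is an illustrative check (the tangent line 0.5 + z/4 is the first-order Taylor expansion of the sigmoid at z = 0, a standard fact, not something stated in the original text):

```python
# Near z = 0 the sigmoid is close to its tangent line 0.5 + z/4, so a
# network whose weights keep every unit's input small computes a nearly
# linear function of its inputs.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_approx(z):
    return 0.5 + z / 4.0  # first-order Taylor expansion at z = 0

# Small inputs (small weights): the linear approximation is excellent.
small = max(abs(sigmoid(z) - linear_approx(z)) for z in (-0.1, -0.05, 0.05, 0.1))
# Large input (mature weights): the approximation breaks down.
large = abs(sigmoid(3.0) - linear_approx(3.0))
```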

 

The following are some common strategies used to try to solve the problem of local minima:

  • Add a momentum term to the weight-update rule. Momentum can sometimes carry the gradient descent procedure through narrow local minima.
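A minimal sketch of the momentum idea, assuming the standard form Δw(n) = η·gradient + α·Δw(n−1); the names `eta`, `alpha`, and `momentum_step` are illustrative, not from the original text:

```python
# Momentum weight update: the current step blends the present gradient
# with the previous step, so motion persists through flat regions.
def momentum_step(gradient, prev_delta, eta=0.1, alpha=0.9):
    return eta * gradient + alpha * prev_delta

# Even when the current gradient drops to zero (a flat spot or a small
# local minimum), accumulated momentum keeps the weight moving.
prev = momentum_step(gradient=1.0, prev_delta=0.0)
step_in_flat_region = momentum_step(gradient=0.0, prev_delta=prev)
```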

 

  • Use stochastic gradient descent instead of true gradient descent.

 

The stochastic approximation to gradient descent descends a distinct error surface for each training example, relying on the average of these to approximate the gradient for the whole training set.

 

These error surfaces will generally have different local minima, making it less likely that the process will become trapped in any one of them.
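The two variants can be contrasted on a toy one-parameter model o = w·x with squared error; everything here (the data, `batch_step`, `stochastic_pass`) is an illustrative assumption, not from the original text:

```python
# True (batch) gradient descent averages the gradient over the whole
# training set before each update; the stochastic version updates after
# every example, descending a different per-example error surface each time.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # (x, t) pairs, roughly t = 2x

def batch_step(w, data, eta=0.02):
    grad = sum((w * x - t) * x for x, t in data) / len(data)
    return w - eta * grad

def stochastic_pass(w, data, eta=0.02):
    for x, t in data:
        w -= eta * (w * x - t) * x   # one update per example
    return w

w_batch = w_sgd = 0.0
for _ in range(200):
    w_batch = batch_step(w_batch, data)
    w_sgd = stochastic_pass(w_sgd, data)
```

Both runs settle near the least-squares solution for this data; the stochastic run merely takes a noisier path to get there.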

 

  • Train several networks on the same data, but with different random initial weights for each network. If the different training runs lead to different local minima, the network with the best performance over a separate validation data set can be chosen.
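The multiple-restart strategy can be sketched on a toy one-dimensional error surface with both a local and a global minimum; the surface, the helper names, and the number of restarts are all assumptions made for illustration:

```python
# Run gradient descent from several random starting weights on a toy
# non-convex error surface, then keep the run with the lowest final error
# (standing in for "best performance on a validation set").
import random

random.seed(1)

def error(w):
    # Illustrative surface: local minimum near w = -1.1 (error ~ 1.9),
    # global minimum near w = 1.3 (error ~ -0.5).
    return w**4 - 3 * w**2 - w + 3

def grad(w):
    return 4 * w**3 - 6 * w - 1

def descend(w, eta=0.01, steps=500):
    for _ in range(steps):
        w -= eta * grad(w)
    return w

starts = [random.uniform(-2.0, 2.0) for _ in range(5)]  # different random inits
finals = [descend(w0) for w0 in starts]
best = min(finals, key=error)   # pick the run with the lowest final error
```

Runs that start in the basin of the poor local minimum get stuck there, but selecting the best final result across restarts recovers the global minimum.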