
Backpropagation – Generalization

 

The goal of backpropagation is to obtain the partial derivatives of the cost function C with respect to each weight w and bias b in the network. It is the standard supervised learning approach for training Multilayer Perceptrons, a class of Artificial Neural Networks.
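For context, once these partial derivatives are available, each weight and bias is adjusted by the standard gradient descent rule, where η denotes the learning rate (this is the usual textbook update, stated here only for reference):

```latex
w \leftarrow w - \eta \, \frac{\partial C}{\partial w},
\qquad
b \leftarrow b - \eta \, \frac{\partial C}{\partial b}
```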

 

In this blog, we’ll have a look at the concept of Generalization.

 

Why Generalization? 

What is an appropriate condition for the weight update loop to be terminated? 

 

– One option is to keep training until the error E on the training examples drops below a certain level. 

 

– This is a poor criterion, because BACKPROPAGATION is prone to overfitting the training examples at the expense of generalization accuracy over unseen cases.

 

To see the hazards of minimizing the error over the training data, consider how the error E varies with the number of weight-update iterations.

 

The figure below depicts this behaviour for two typical BACKPROPAGATION applications. Consider the top plot first.

 

The lower of the two lines shows the error E over the training set decreasing monotonically as the number of gradient descent iterations grows. The top line shows the error E measured over a separate validation set of examples, distinct from the training examples.

 

This line represents the network’s generalization accuracy, or how well it matches examples outside of the training data.

 

The graph: 

For two independent robot perception tasks, the plots show error E as a function of the number of weight updates. In both learning tasks, the error E over the training examples decreases monotonically, since gradient descent minimizes this measure of error.

 

The error over the separate “validation” set of examples typically decreases at first, but may later increase due to overfitting the training examples.

 

The network with the lowest error over the validation set is the one most likely to generalize correctly to unseen data. As the second plot shows, however, one must be careful not to stop training too soon when the validation set error first begins to increase, since it may decrease again later.
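A minimal sketch of this validation-based stopping rule is shown below, assuming a toy regression task, a small one-hidden-layer network, and a simple “patience” counter; none of these details come from the article itself:

```python
import numpy as np

# Minimal sketch of validation-based early stopping for a one-hidden-layer
# network trained with backpropagation. The toy data, architecture, learning
# rate, and patience value are illustrative assumptions.

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) plus noise, split into training / validation.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

# Small MLP: 1 input -> 16 tanh hidden units -> 1 output.
W1 = 0.1 * rng.standard_normal((1, 16)); b1 = np.zeros(16)
W2 = 0.1 * rng.standard_normal((16, 1)); b2 = np.zeros(1)
lr = 0.01

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

def mse(X, y):
    return float(np.mean((forward(X)[1] - y) ** 2))

best_val, best_params, patience, bad_epochs = np.inf, None, 50, 0

for epoch in range(5000):
    # Forward pass and squared-error gradients (backpropagation).
    h, out = forward(X_tr)
    d_out = 2 * (out - y_tr) / len(X_tr)      # dE/d(output)
    dW2 = h.T @ d_out; db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)       # backprop through tanh
    dW1 = X_tr.T @ d_h; db1 = d_h.sum(axis=0)

    # Gradient-descent weight update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    # Early stopping: remember the weights with the lowest validation error
    # and stop once it has not improved for `patience` consecutive epochs.
    val = mse(X_va, y_va)
    if val < best_val:
        best_val = val
        best_params = (W1.copy(), b1.copy(), W2.copy(), b2.copy())
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

W1, b1, W2, b2 = best_params
print(f"stopped after epoch {epoch}, best validation MSE = {best_val:.4f}")
```

The weights finally returned are those recorded at the epoch with the lowest validation error, not the weights at the iteration where training happened to stop.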

 

How does it work? 

  • Even while the error over the training examples continues to decrease, the error measured over the validation examples typically decreases at first and then increases; in other words, generalization accuracy first improves and then deteriorates.

 

  • This happens when the weights are adjusted to fit training instances that aren’t typical of the whole distribution.

 

  • Overfitting tends to occur during later iterations rather than earlier ones, for the following reason.

 

  • The network weights are initially set to small random values. With weights of nearly equal value, only very smooth decision surfaces can be described.

 

  • As training proceeds, some weights grow in magnitude in order to reduce the error over the training data, and the complexity of the learned decision surface increases with the number of weight-tuning iterations.

 

How to address the overfitting problem?

There are several approaches for dealing with the overfitting problem in BACKPROPAGATION learning.

 

Weight decay is a technique that decreases each weight by some small factor during each iteration.

 

This is equivalent to adding to the definition of E a penalty term corresponding to the total magnitude of the network weights. The motivation for this approach is to keep weight values small, biasing learning against complex decision surfaces.
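As a concrete illustration, the update below folds an L2 penalty into the ordinary gradient-descent step; the function name and the values of lr and lam are assumptions made for the sketch, not values from the article:

```python
import numpy as np

# Sketch of weight decay: the gradient-descent update shrinks every weight by
# a small factor each iteration, which is equivalent to adding an L2 penalty
# (lam / 2) * sum(w**2) to the error E (its gradient is lam * w).

def weight_decay_step(weights: np.ndarray, gradient: np.ndarray,
                      lr: float = 0.01, lam: float = 1e-4) -> np.ndarray:
    """One gradient-descent step with L2 weight decay."""
    # w <- w - lr * (dE/dw + lam * w)  ==  (1 - lr * lam) * w - lr * dE/dw
    return weights - lr * (gradient + lam * weights)

# Usage: with no task gradient, only the decay term acts and each weight
# shrinks toward zero (lr and lam are exaggerated here to make it visible).
w = np.array([1.0, -2.0, 0.5])
print(weight_decay_step(w, np.zeros_like(w), lr=0.1, lam=0.5))
```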

 

Unfortunately, the problem of overfitting is most acute for small training sets.

 

A k-fold cross-validation strategy, in which cross-validation is conducted k times, is sometimes employed in these instances. In one variant of this approach, the m available examples are partitioned into k disjoint subsets, each of size m/k. 

 

The cross-validation technique is then repeated k times, each time with a new subset as the validation set and the other subsets combined as the training set. 

 

As a result, each example is used in the validation set in one of the experiments and in the training set in the remaining k – 1 experiments.

 

In each experiment, the validation set is used to determine the number of iterations I that yields the best performance.

 

The mean of these k estimates, Ī, is then computed, and a final run of BACKPROPAGATION is performed with no validation set, training on all m examples for Ī iterations.
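The procedure can be sketched as follows; the one-parameter linear model, the toy data, and the constants (m = 60, k = 6) are assumptions chosen only to make the sketch runnable:

```python
import numpy as np

# Sketch of k-fold cross-validation used to choose the number of training
# iterations, followed by a final run on all m examples.

rng = np.random.default_rng(1)
m, k, max_iters, lr = 60, 6, 300, 0.05

# Toy data: y = 2x + noise.
X = rng.uniform(-1, 1, size=m)
y = 2.0 * X + 0.3 * rng.standard_normal(m)

def train(x, t, n_iters, x_val=None, t_val=None):
    """Fit y = w * x by gradient descent; optionally record validation MSE."""
    w, val_errors = 0.0, []
    for _ in range(n_iters):
        w -= lr * np.mean(2 * (w * x - t) * x)
        if x_val is not None:
            val_errors.append(np.mean((w * x_val - t_val) ** 2))
    return w, val_errors

# Partition the m examples into k disjoint subsets of size m / k.
folds = np.array_split(rng.permutation(m), k)

best_iters = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Each experiment uses one subset for validation and the rest for training,
    # and records the iteration count with the lowest validation error.
    _, errs = train(X[train_idx], y[train_idx], max_iters, X[val_idx], y[val_idx])
    best_iters.append(1 + int(np.argmin(errs)))

# Mean of the k estimates, then a final training run on all m examples.
I = int(round(np.mean(best_iters)))
final_w, _ = train(X, y, I)
print(f"chosen iteration count I = {I}, final weight = {final_w:.3f}")
```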

 

Remarks on the Backpropagation Algorithm: 

We have five remarks in total:

 

Convergence and local minima 

Backpropagation operates on multilayer networks: it propagates the error backward through the layers and updates the weights accordingly.

 

Because the neurons of adjacent layers are interconnected, the error signal propagates back through the network so that every weight receives an update.

 

Using the backpropagation algorithm, we minimize the error by modifying the weights via gradient descent. However, this minimization is guaranteed to converge only to a local minimum of the error, not necessarily the global minimum.

 

The Representational Power of the Feed-Forward Networks 

Representational power refers to the set of functions that a feed-forward network can represent. It depends on the depth and width of the network.

 

The standard results concern three classes of functions: boolean functions, continuous functions, and arbitrary functions, each of which can be represented (or approximated arbitrarily well) by a feed-forward network with sufficiently many hidden units and layers.

 

Hypothesis Space Search and Inductive Bias

Inductive Bias

Hypothesis Space search

 

Hidden Layer Representation

By using the backpropagation algorithm, the network can discover new features at the hidden layer that are not explicitly represented in the input.
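A classic illustration of this idea is the 8-3-8 identity task, in which eight one-hot inputs must be reproduced at the output through only three hidden units, forcing the hidden layer to invent a compact encoding not present in the input. The sketch below is a hedged reconstruction of that setup; the architecture and training constants are chosen only for illustration:

```python
import numpy as np

# 8-3-8 "identity" network: eight one-hot patterns mapped to themselves
# through three hidden units, so the hidden layer must learn its own code.

rng = np.random.default_rng(42)
X = np.eye(8)                      # eight one-hot input patterns = targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.uniform(-0.5, 0.5, (8, 3)); b1 = np.zeros(3)
W2 = rng.uniform(-0.5, 0.5, (3, 8)); b2 = np.zeros(8)
lr = 0.3

for _ in range(20000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass for squared error, using sigmoid'(z) = s * (1 - s).
    d_out = (out - X) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

# The learned hidden activations act as a compact, roughly binary code
# for the eight inputs, a feature never given explicitly in the data.
print(np.round(sigmoid(X @ W1 + b1), 2))
```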

 

Generalization, Overfitting and Stopping criteria 

Generalization

Overfitting

Stopping criteria

 
