
Gradient descent and Delta Rule 

 

If a set of data points can be separated into two groups by a straight line, the data is said to be linearly separable; data points that cannot be split into two groups this way are non-linearly separable. (In more than two dimensions, the separating boundary is a hyperplane rather than a line.)

Figure: (a) a linearly separable training set; (b) a non-linearly separable training set.
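As a concrete illustration of the distinction (not taken from the original figures): the Boolean AND function is linearly separable, while XOR is not. The sketch below brute-forces a coarse grid of candidate lines w1·x1 + w2·x2 = b and checks whether any of them puts the two classes on opposite sides; the dataset names, grid, and helper function are invented for this illustration.

```python
import itertools

# Truth tables for AND (linearly separable) and XOR (not separable).
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def separable(data):
    """Return True if some line w1*x1 + w2*x2 = b puts the positive and
    negative points on opposite sides. A coarse grid search suffices here."""
    grid = [i / 4 for i in range(-8, 9)]  # candidate coefficients in [-2, 2]
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 > b) == bool(t) for (x1, x2), t in data.items()):
            return True
    return False

print(separable(AND))  # True  -- e.g. the line x1 + x2 = 1.5 works
print(separable(XOR))  # False -- no straight line separates XOR
```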

 

When the training instances are linearly separable, the perceptron training rule finds a weight vector that classifies them all correctly; however, if the examples are not linearly separable, it may fail to converge.

 

The delta rule, a second training rule, is designed to overcome this difficulty. In this blog, we'll take a brief look at gradient descent and the delta rule.

 

The delta rule converges toward a best-fit approximation to the target concept even when the training instances are not linearly separable.

 

Delta Rule’s Main Idea:

The delta rule's main idea is to use gradient descent to search the hypothesis space of possible weight vectors and find the weights that best fit the training instances.

 

This rule is significant because gradient descent provides the basis for the BACKPROPAGATION algorithm, which can train networks with many interconnected units.

 

DERIVATION OF DELTA RULE: 

To better understand the delta training rule, consider the task of training an unthresholded perceptron, that is, a linear unit whose output o is given by

o(\vec{x}) = \vec{w} \cdot \vec{x} \quad (4.1)
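As a minimal sketch of this unit in Python, assuming w and x are plain lists of numbers and that x[0] is fixed at 1 so that w[0] plays the role of the threshold/bias term (a common convention, not stated above):

```python
def linear_unit_output(w, x):
    """Linear unit: o = w . x = w0*x0 + w1*x1 + ... + wn*xn (Equation 4.1)."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

# Example: weights (0.5, 1.0, -1.0) applied to input (1, 2, 3), with x0 = 1.
print(linear_unit_output([0.5, 1.0, -1.0], [1, 2, 3]))  # 0.5 + 2.0 - 3.0 = -0.5
```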

 

To derive a weight learning rule for linear units, let's start by specifying a measure for the training error of a hypothesis (weight vector) relative to the training instances. Although there are many ways to define this error, one common measure that will prove particularly useful is

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \quad (4.2)

where D is the set of training instances, t_d is the target output for training example d, and o_d is the linear unit's output for training example d.

 

According to this definition, E(w⃗) is half the squared difference between the target output t_d and the linear unit's output o_d, summed over all training examples.
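Equation (4.2) translates directly into code. This sketch reuses linear_unit_output from the snippet above and assumes the training set is a list of (x, t) pairs:

```python
def training_error(w, examples):
    """E(w) = 1/2 * sum over d in D of (t_d - o_d)^2 (Equation 4.2)."""
    return 0.5 * sum((t - linear_unit_output(w, x)) ** 2 for x, t in examples)

# Example: two training pairs; the error is 0 only if the unit fits both exactly.
examples = [([1, 2], 3), ([1, 4], 5)]
print(training_error([1.0, 1.0], examples))  # (3-3)^2/2 + (5-5)^2/2 = 0.0
```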

 

How do you compute the steepest descent direction along the error surface?

 

The steepest descent direction can be found by computing the derivative of E with respect to each component of the vector w⃗. This vector derivative is called the gradient of E with respect to w⃗, written ∇E(w⃗):

\nabla E(\vec{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right] \quad (4.3)

The gradient specifies the direction of steepest increase in E, so the gradient descent training rule is

\vec{w} \leftarrow \vec{w} + \Delta\vec{w}, \quad \text{where} \quad \Delta\vec{w} = -\eta \, \nabla E(\vec{w}) \quad (4.4)

Here η, the learning rate, is a positive constant that determines the step size in the gradient descent search.

 

The negative sign is present because we want the weight vector to move in the direction that decreases E.
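A single update of this form might look as follows; grad_E here is a stand-in for any function returning ∇E(w) as a list (a hypothetical parameter, made concrete a little further below):

```python
def gradient_descent_step(w, grad_E, eta=0.05):
    """One step of w <- w - eta * grad E(w), per Equation (4.4)."""
    gradient = grad_E(w)
    return [w_i - eta * g_i for w_i, g_i in zip(w, gradient)]

# Example usage (once an error_gradient function exists, as sketched below):
# w = gradient_descent_step(w, lambda v: error_gradient(v, examples))
```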

 

This training rule can also be written in terms of its individual components:

w_i \leftarrow w_i + \Delta w_i, \quad \text{where} \quad \Delta w_i = -\eta \, \frac{\partial E}{\partial w_i} \quad (4.5)

To build a practical algorithm for iteratively updating weights according to Equation (4.5), we need an efficient way of computing this gradient at each step.

 

Differentiating E from Equation (4.2) yields the vector of derivatives that make up the gradient:

\frac{\partial E}{\partial w_i}
= \frac{\partial}{\partial w_i} \, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
= \frac{1}{2} \sum_{d \in D} 2\,(t_d - o_d) \, \frac{\partial}{\partial w_i} \left( t_d - \vec{w} \cdot \vec{x}_d \right)
= \sum_{d \in D} (t_d - o_d)(-x_{id}) \quad (4.6)

where x_{id} denotes the single input component x_i for training example d.
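Equation (4.6) also translates directly, again reusing linear_unit_output and the list-of-(x, t)-pairs convention from the earlier sketches:

```python
def error_gradient(w, examples):
    """Components dE/dw_i = sum over d of (t_d - o_d) * (-x_id) (Equation 4.6)."""
    gradient = [0.0] * len(w)
    for x, t in examples:
        o = linear_unit_output(w, x)
        for i in range(len(w)):
            gradient[i] += (t - o) * (-x[i])
    return gradient
```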

 

We now have an equation that gives ∂E/∂w_i in terms of the linear unit inputs x_{id}, outputs o_d, and target values t_d associated with the training examples.

 

Substituting Equation (4.6) into Equation (4.5) yields the weight update rule for gradient descent:

\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id} \quad (4.7)
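Since Equation (4.7) is just −η times the gradient components of Equation (4.6), the full set of weight updates can be computed from the error_gradient sketch above:

```python
def weight_updates(w, examples, eta=0.05):
    """Delta w_i = eta * sum over d of (t_d - o_d) * x_id (Equation 4.7)."""
    return [-eta * g for g in error_gradient(w, examples)]
```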

Gradient-descent algorithm for training a linear unit:

GRADIENT-DESCENT(training_examples, η)
Each training example is a pair ⟨x⃗, t⟩, where x⃗ is the vector of input values and t is the target output value; η is the learning rate.

- Initialize each w_i to some small random value.
- Until the termination condition is met, do:
    - Initialize each Δw_i to zero.
    - For each ⟨x⃗, t⟩ in training_examples, do:
        - Input the instance x⃗ to the unit and compute the output o.
        - For each linear unit weight w_i: Δw_i ← Δw_i + η (t − o) x_i    (T4.1)
    - For each linear unit weight w_i: w_i ← w_i + Δw_i    (T4.2)

To implement the stochastic approximation to gradient descent, Equation (T4.2) is deleted and Equation (T4.1) is replaced by

w_i \leftarrow w_i + \eta \, (t - o) \, x_i

so that the weights are updated incrementally after each individual training example.
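A sketch of that stochastic (incremental) variant, under the same conventions as the earlier snippets; note that each example triggers an immediate weight update instead of contributing to an accumulated sum:

```python
import random

def train_stochastic(examples, eta=0.05, epochs=100):
    """Incremental gradient descent: w_i <- w_i + eta * (t - o) * x_i
    after every single training example (the delta/LMS rule)."""
    w = [random.uniform(-0.05, 0.05) for _ in range(len(examples[0][0]))]
    for _ in range(epochs):
        for x, t in examples:
            o = linear_unit_output(w, x)
            w = [w_i + eta * (t - o) * x_i for w_i, x_i in zip(w, x)]
    return w
```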

To recapitulate, the gradient descent algorithm for training linear units proceeds as follows: choose a random initial weight vector.

 

Apply the linear unit to all training examples, then compute Δw_i for each weight according to Equation (4.7).

 

Update each weight w_i by adding Δw_i to it, and then repeat the process.
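Putting these steps together, here is a sketch of the batch algorithm, reusing linear_unit_output and assuming each x carries a leading 1 for the bias weight; the fixed epoch count stands in for a real termination condition:

```python
import random

def train_linear_unit(examples, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit, per the recap above."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # small random start
    for _ in range(epochs):
        delta = [0.0] * n
        for x, t in examples:                      # accumulate over all of D
            o = linear_unit_output(w, x)
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]   # Equation (4.7) / (T4.1)
        w = [w_i + d_i for w_i, d_i in zip(w, delta)]  # update step (T4.2)
    return w

# Example: learn t = 1 + 2*x1 from a few points (x = [1, x1] includes the bias input).
data = [([1, 0], 1), ([1, 1], 3), ([1, 2], 5)]
print(train_linear_unit(data, eta=0.05, epochs=500))  # approaches [1.0, 2.0]
```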

 

The pseudocode above summarizes the complete algorithm. Because the error surface contains only a single global minimum, this algorithm converges to a weight vector with minimum error, regardless of whether the training instances are linearly separable, provided the learning rate η is small enough.

 

If η is too large, the gradient descent search risks overstepping the minimum on the error surface rather than settling into it.

 

As a result, one common modification to the method is to gradually reduce the value of η as the number of gradient descent steps grows.
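The decay schedule itself is a design choice; a simple 1/(1 + k/τ) rule such as the one below is one common option (the constant τ here is illustrative):

```python
def decayed_eta(eta0, step, tau=50.0):
    """Shrink the learning rate as the step count k grows: eta_k = eta0 / (1 + k/tau)."""
    return eta0 / (1.0 + step / tau)

# eta drops from 0.05 toward 0 as training proceeds.
print([round(decayed_eta(0.05, k), 4) for k in (0, 50, 200)])  # [0.05, 0.025, 0.01]
```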