
Multilayer Neural Networks 

 

A single-layer neural network can classify only linearly separable data; it cannot handle non-linearly separable data. Multilayer neural networks are therefore needed to work with data that is not linearly separable.

Figure (a): Training set is linearly separable

Figure (b): Training set is non-linearly separable
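The classic non-linearly separable case is XOR: no single line separates its two classes, so a single-layer network cannot learn it. A minimal sketch illustrating this (the training loop, learning rate, and epoch count are illustrative assumptions, not from the source):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # inputs
t = np.array([0, 1, 1, 0])                      # XOR targets (not linearly separable)

w = np.zeros(2)
b = 0.0
for epoch in range(1000):                       # perceptron training loop
    for x, target in zip(X, t):
        o = int(w @ x + b > 0)                  # threshold unit output
        w += 0.1 * (target - o) * x             # perceptron weight update
        b += 0.1 * (target - o)

preds = [int(w @ x + b > 0) for x in X]
print(preds, "vs targets", list(t))             # never matches all four targets

No matter how long the loop runs, a single linear threshold unit cannot reproduce all four XOR targets.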

 

A multilayer neural network has one or more hidden layers. Hidden layers, whose neurons are not directly connected to the output, are used in multilayer networks to solve the classification problem for non-linear data.

 

The hidden layers can be understood geometrically as extra hyperplanes that increase the network's separation capability. Typical multilayer network designs are shown in the Figure below.

 

This design raises a new challenge: how to train the hidden units, whose desired outputs are not known. This problem is solved using the backpropagation technique.

 

Backpropagation technique:

Given a network with a fixed set of units and interconnections, the BACKPROPAGATION algorithm learns the weights for a multilayer network.

 

It uses gradient descent to attempt to minimize the squared error between the network's output values and the target values for those outputs.

 

Because we are now considering networks with multiple output units rather than a single unit, we begin by redefining E to sum the errors over all of the network's output units:

E(w) = 1/2 Σd∈D Σk∈outputs (tkd – okd)²

where tkd and okd are the target and output values associated with the kth output unit and training example d, and outputs is the set of output units in the network.
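A minimal numeric sketch of this error for a small set of examples (the array shapes and values are illustrative, not from the source):

import numpy as np

# target values t[d, k] and network outputs o[d, k] for two examples d and two output units k
t = np.array([[1.0, 0.0], [0.0, 1.0]])
o = np.array([[0.8, 0.3], [0.2, 0.6]])

E = 0.5 * np.sum((t - o) ** 2)    # half the sum of squared errors over all d and all output units k
print(E)                          # 0.5 * (0.04 + 0.09 + 0.04 + 0.16) = 0.165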

 

In contrast to the parabolic error surface of a single unit, which has a single minimum, the error surface for multilayer networks may contain many local minima.

 

This means that gradient descent is guaranteed only to converge toward some local minimum, not necessarily the global minimum error.

 

Despite this obstacle, BACKPROPAGATION has been found to produce excellent results in a variety of real-world applications.

 

The version of the algorithm presented here applies to layered feedforward networks with two layers of sigmoid units, where each unit in a layer is connected to all units in the preceding layer.
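A minimal forward-pass sketch of such a network (the layer sizes, weight ranges, and input values are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2

# every unit is connected to all units of the previous layer (plus a bias term)
W_hidden = rng.uniform(-0.05, 0.05, size=(n_hidden, n_in + 1))
W_out = rng.uniform(-0.05, 0.05, size=(n_out, n_hidden + 1))

def forward(x):
    x = np.append(x, 1.0)          # append constant bias input
    h = sigmoid(W_hidden @ x)      # hidden-layer sigmoid outputs
    h = np.append(h, 1.0)          # bias input for the output layer
    o = sigmoid(W_out @ h)         # output-layer sigmoid outputs
    return h, o

h, o = forward(np.array([0.1, 0.9, 0.5]))
print(o)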

 

Each node in the network is given an index (for example, an integer), where a “node” is either a network input or the output of a network unit.

 

The input from node i to unit j is denoted by xji, while the associated weight is denoted by wji.

 

The error term associated with unit n is denoted by δn. As we will see later, it plays a role analogous to the quantity (t – o) from our earlier discussion of the delta training rule.

 

ALGORITHM: 

The algorithm begins by constructing a network with the desired number of hidden and output units and initializing all network weights to small random values.

 

The main loop of the algorithm then iterates over the training instances using this fixed network topology.

 

For each training example, it applies the network to the example, computes the error of the network output for that example, computes the gradient with respect to this error, and then updates all the weights in the network.

 

This gradient descent step is iterated, often thousands of times, reusing the same training examples each time, until the network performs acceptably.

 

The weight-update rule used here is similar to the delta training rule. Like the delta rule, it updates each weight in proportion to the learning rate η, the input value xji to which the weight is applied, and the error in the output of the unit.

 

The main difference is that the error (t – o) in the delta rule is replaced by a more complex error term δj, whose exact form follows from the derivation of the weight-tuning rule.
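Written out in the notation above, the update applied to each weight takes the form

Δwji = η δj xji,   so that   wji ← wji + Δwji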

 

To get a sense of how it works, consider how δk is computed for each network output unit k.

 

δk is simply the (tk – ok) term from the delta rule, multiplied by the quantity ok(1 – ok), which is the derivative of the sigmoid squashing function:

δk = ok(1 – ok)(tk – ok)
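For instance, with an illustrative target tk = 1.0 and output ok = 0.8, this gives δk = 0.8 × (1 – 0.8) × (1.0 – 0.8) = 0.032.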

 

The δh value for each hidden unit h cannot be computed in the same way, however, because training examples provide target values tk only for the network outputs; no target values are directly available to indicate the error of the hidden units' values.

 

Instead, the error term for hidden unit h is computed by summing the error terms δk of the output units influenced by h, weighting each δk by wkh, the weight from hidden unit h to output unit k. This weight characterizes the degree to which hidden unit h is "responsible" for the error in output unit k:

δh = oh(1 – oh) Σk∈outputs wkh δk
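Putting the pieces together, here is a minimal sketch of one BACKPROPAGATION update on a single training example, reusing the two-layer sigmoid network from the earlier forward-pass sketch (the layer sizes, learning rate, and data are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2
eta = 0.1                                            # learning rate

W_hidden = rng.uniform(-0.05, 0.05, size=(n_hidden, n_in + 1))
W_out = rng.uniform(-0.05, 0.05, size=(n_out, n_hidden + 1))

x = np.array([0.1, 0.9, 0.5])                        # one training example
t = np.array([1.0, 0.0])                             # its target values

# forward pass
x_b = np.append(x, 1.0)                              # input plus bias
h = sigmoid(W_hidden @ x_b)                          # hidden-unit outputs
h_b = np.append(h, 1.0)                              # hidden outputs plus bias
o = sigmoid(W_out @ h_b)                             # output-unit outputs

# error terms
delta_k = o * (1 - o) * (t - o)                      # δk = ok(1 – ok)(tk – ok)
delta_h = h * (1 - h) * (W_out[:, :n_hidden].T @ delta_k)   # δh = oh(1 – oh) Σk wkh δk

# weight updates: wji <- wji + η δj xji
W_out += eta * np.outer(delta_k, h_b)
W_hidden += eta * np.outer(delta_h, x_b)

In the full algorithm this single-example step is simply repeated over all training examples, for as many passes as needed, as described in the main loop above.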