/    /  Gradient Descent

Gradient Descent

Gradient descent is an optimization algorithm used to minimize or reduce some function by iteratively moving in the direction of steepest descent or minimum point as defined by the negative of the gradient.

Let us understand about Gradient Descent in a more practical way to know more detail.

Consider the figure below in the context of a cost function. Our goal is to move from the mountain in the top corner (high cost) to the dark blue sea in the bottom (low cost). The arrows denote the direction of steepest descent (negative gradient) from any given point, the direction that decreases the cost function as early as possible.

Gradient Descent 1 (i2tutorials)

Starting from the top of the mountain, we have to take our step downwards in the direction specified by the negative gradient which means to the minimum point. Following, we recalculate the negative gradient and take another step in the direction it specifies. We continue this process iteratively until we get to the bottom of our graph which means a local minimum.

Learning rate

The size of these steps towards minimum point is called the learning rate. With a high learning rate (steep slope), we can cover more ground each step, but we risk passing the lowest point since the slope of the hill is constantly changing. With a very low learning rate (less steep), we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more accurate, but calculating the gradient is time-consuming, so it will take us a very long time to reach the bottom.

Gradient Descent 2 (i2tutorials)

Cost function

A Loss Functions or cost function tells us how accurately predictions can be made for a given set of parameters. The cost function has its own curve and own gradients. The slope of this curve tells us upgradation of parameters to make the model more accurate.

Now run gradient descent using new cost function. There are two mutable parameters in our cost function, mm (weight) and bb (bias). As we need to consider the influence of weight and bias on the final prediction, we have to use partial derivatives. Compute the partial derivatives of the cost function with respect to each parameter and save the results in a gradient.

Metrics behind the cost function

f(m,b)=1N∑i=1n(yi−(mxi+b))2f(m,b)=1N∑i=1n(yi−(mxi+b))2

The gradient can be calculated as:

f′(m,b)=⎡⎣dfdmdfdb⎤⎦=[1N∑−2xi(yi−(mxi+b))1N∑−2(yi−(mxi+b))]f′(m,b)=[dfdmdfdb]=[1N∑−2xi(yi−(mxi+b))1N∑−2(yi−(mxi+b))]

Solution for a gradient can be obtained by iteration of our data points using our new weight (mm) and bias (bb) values and calculate the partial derivatives. The generated new gradient gives the slope of cost function at current position and the direction we have to move to update our parameters. The size of our update is measured by the learning rate.

There are various implementations of Gradient Descent which are used for minimizing cost functions. They are

Batch Gradient Descent

Stochastic Gradient Descent

Mini Batch Gradient Descent

Batch Gradient Descent

It processes all the training data for each iteration of gradient descent. If the number of training examples or data is large, then batch gradient descent is very expensive. This is the disadvantage of Batch gradient Descent. Hence, in case of large training examples it is better to use stochastic gradient descent or mini-batch gradient descent.

Gradient Descent 3 (i2tutorials)

Stochastic Gradient Descent 

This is one of the types of gradient descents which processes one training data per one iteration. Hence, the parameters are being updated after each iteration. This is comparatively faster than batch gradient descent. When the number of training examples or data is large, even then it processes only one example per iteration which can be additional overhead for the system as the number of iterations will be quite large.

Gradient Descent 4 (i2tutorials)

Mini Batch gradient descent

Gradient Descent 5 (i2tutorials)

It which works faster than both batch gradient descent and stochastic gradient descent. Here b examples where b<m are processed per iteration. Even if the number of training examples is large, it is processed in batches of b training examples at a time. Therefore, it works for larger training examples and that too with lesser number of iterations.

Gradient Descent 6 (i2tutorials)