Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, which is defined by the negative of the gradient.

Let us look at gradient descent in a more practical way.

Consider the figure below in the context of a cost function. Our goal is to move from the mountain in the top corner (high cost) to the dark blue sea at the bottom (low cost). The arrows denote the direction of steepest descent (negative gradient) from any given point, the direction that decreases the cost function as quickly as possible.

Starting from the top of the mountain, we take a step downhill in the direction specified by the negative gradient. We then recalculate the negative gradient at the new point and take another step in the direction it specifies. We continue this process iteratively until we reach the bottom of the graph, a local minimum.
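The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production optimizer; the function name `gradient_descent` and the example objective are ours, chosen for clarity.

```python
def gradient_descent(gradient, start, learning_rate=0.1, iterations=100):
    """Repeatedly step opposite the gradient from a starting point."""
    x = start
    for _ in range(iterations):
        x -= learning_rate * gradient(x)  # step in the negative-gradient direction
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# The minimum is at x = 3, and the iterates converge toward it.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

Each pass of the loop is one "step down the mountain": evaluate the gradient where we stand, then move a small distance against it.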

### Learning rate

The size of these steps toward the minimum is called the learning rate. With a high learning rate, we cover more ground with each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient, because we recalculate it so frequently. A low learning rate is more precise, but recalculating the gradient that often is time-consuming, so it will take a very long time to reach the bottom.
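The trade-off can be seen on a toy objective. The sketch below (our own illustration) applies one gradient-descent update rule to f(x) = x² with two different learning rates: a small rate makes slow, stable progress, while an overly large rate overshoots the minimum on every step and diverges.

```python
def step(x, learning_rate):
    # One gradient-descent update on f(x) = x^2, whose gradient is 2x.
    return x - learning_rate * 2 * x

x_small, x_large = 5.0, 5.0
for _ in range(20):
    x_small = step(x_small, 0.1)  # small rate: steadily approaches the minimum at 0
    x_large = step(x_large, 1.1)  # too large: each step overshoots, |x| grows
```

After 20 steps, `x_small` is close to the minimum at 0, while `x_large` has moved far away from it.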

### Cost function

A loss function (or cost function) tells us how accurate the model's predictions are for a given set of parameters. The cost function has its own curve and its own gradients. The slope of this curve tells us how to update the parameters to make the model more accurate.
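For the simple line-fitting model used in the rest of this section, the cost function is the mean squared error. A minimal sketch (the helper name `mse_cost` and the sample data are ours):

```python
def mse_cost(m, b, xs, ys):
    """Mean squared error of the line y = m*x + b over the data points."""
    n = len(xs)
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly on the line y = 2x

# A perfect fit (m=2, b=0) has zero cost; any worse fit costs more.
perfect = mse_cost(2.0, 0.0, xs, ys)
worse = mse_cost(1.0, 0.0, xs, ys)
```

Gradient descent will move `m` and `b` in whichever direction lowers this number.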

Now we run gradient descent using this cost function. There are two mutable parameters in our cost function, m (weight) and b (bias). Since we need to account for the influence of each parameter on the final prediction, we use partial derivatives: compute the partial derivative of the cost function with respect to each parameter and store the results in a gradient.

### Metrics behind the cost function

$$
f(m,b)=\frac{1}{N}\sum_{i=1}^{n}\left(y_i-(mx_i+b)\right)^2
$$

The gradient can be calculated as:

$$
f'(m,b)=
\begin{bmatrix}
\dfrac{df}{dm}\\[6pt]
\dfrac{df}{db}
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{1}{N}\sum -2x_i\left(y_i-(mx_i+b)\right)\\[6pt]
\dfrac{1}{N}\sum -2\left(y_i-(mx_i+b)\right)
\end{bmatrix}
$$

To solve for the gradient, we iterate over our data points using the current weight (m) and bias (b) values and compute the partial derivatives. The resulting gradient gives the slope of the cost function at our current position and the direction in which to update our parameters. The size of the update is controlled by the learning rate.
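Putting the pieces together, one update step translates the two partial derivatives above directly into code. This is a sketch under our own naming (`gradient_step`) and synthetic data; a real implementation would typically vectorize with NumPy.

```python
def gradient_step(m, b, xs, ys, learning_rate):
    """One gradient-descent update of m and b for the MSE cost."""
    n = len(xs)
    # Partial derivatives from the gradient formula above.
    dm = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
    db = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
    # Move against the gradient, scaled by the learning rate.
    return m - learning_rate * dm, b - learning_rate * db

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated by y = 2x + 1

m, b = 0.0, 0.0
for _ in range(5000):
    m, b = gradient_step(m, b, xs, ys, learning_rate=0.01)
```

After enough iterations, `m` and `b` converge to the line that generated the data (m ≈ 2, b ≈ 1).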

There are various implementations of gradient descent that are used to minimize cost functions. They are: