Machine Learning- Regularization


Regularization

Regularization is one of the important concepts in Machine Learning. It addresses overfitting of the data, which can lead to decreased model performance on new data. In regression, it constrains or shrinks the coefficient estimates towards zero. Regularization reduces the complexity of the regression function without actually reducing the degree of the underlying polynomial function.

The regularization technique is based on the fact that if the highest-order terms in a polynomial have very small coefficients, the function behaves approximately like a polynomial of smaller degree.
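To make this concrete, here is a minimal sketch in plain Python. The coefficients and the helper name poly are made up for illustration; the point is only that a cubic with a tiny x**3 coefficient stays very close to the quadratic obtained by dropping that term.

```python
def poly(coefs, x):
    """Evaluate c0 + c1*x + c2*x**2 + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coefs):
        result = result * x + c
    return result

# Hypothetical coefficients, lowest order first: 1 + 2x + 3x^2 + 0.000001x^3
cubic = [1.0, 2.0, 3.0, 1e-6]
quadratic = cubic[:3]          # the same polynomial with the cubic term dropped

# Compare the two functions on a grid over [-1, 1]
xs = [i / 10 for i in range(-10, 11)]
max_gap = max(abs(poly(cubic, x) - poly(quadratic, x)) for x in xs)
print(max_gap)                 # largest difference between the two curves
```

On this interval the two curves differ by at most about the size of the small coefficient, so the cubic behaves like a degree-2 polynomial for practical purposes.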

Generally, regularization is done by adding a complexity term to the cost function, so that the cost becomes higher as the complexity of the underlying polynomial function increases.

In regularization, there are two main techniques, both of which deal with overfitting of the data. They are:

L1 Regularization or Lasso Regression
L2 Regularization or Ridge Regression
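As a rough sketch of what "adding a complexity term to the cost function" can look like, the snippet below computes a cost as RSS plus a weighted penalty, with the L1 penalty as the sum of absolute coefficients and the L2 penalty as the sum of squared coefficients. The function names and the toy numbers are assumptions made for illustration, and the coefficient lists are assumed to exclude the intercept.

```python
def rss(y_true, y_pred):
    """Residual sum of squares between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def l1_penalty(coefs):
    """Sum of absolute values of the coefficients (lasso-style penalty)."""
    return sum(abs(c) for c in coefs)

def l2_penalty(coefs):
    """Sum of squares of the coefficients (ridge-style penalty)."""
    return sum(c ** 2 for c in coefs)

def regularized_cost(y_true, y_pred, coefs, alpha, penalty):
    """Fit term plus alpha times a complexity term."""
    return rss(y_true, y_pred) + alpha * penalty(coefs)

# Toy values: the same fit quality, but two different coefficient sizes
y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 1.9, 3.2]
small = [0.5, -0.5]
large = [5.0, -5.0]

for name, penalty in (("L1", l1_penalty), ("L2", l2_penalty)):
    print(name,
          regularized_cost(y_true, y_pred, small, 1.0, penalty),
          regularized_cost(y_true, y_pred, large, 1.0, penalty))
```

For an identical fit, the model with larger coefficients receives the higher cost under either penalty, which is what pushes the optimizer towards simpler functions.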

Let us study them in more detail.

Lasso Regression / L1 Regularization

LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is almost identical to ridge regression; the only difference is that the penalty uses the absolute values of the weights rather than their squares.

Lasso regression performs L1 regularization. It is a regression analysis method that performs both variable selection and regularization in order to improve prediction accuracy. It helps prevent overfitting of the data, which is a problem in the analysis of the data.

Thus, lasso regression optimizes the following:

Lasso objective = RSS + α * (sum of absolute values of coefficients)

Here, RSS is the residual sum of squares, and α plays the same role as in ridge regression: it controls the trade-off between minimizing RSS and keeping the magnitude of the coefficients small.

α = 0: same coefficients as simple linear regression.

α = ∞: all coefficients are zero.

0 < α < ∞: coefficients lie between 0 and those of simple linear regression.
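The three cases above can be checked on a deliberately tiny problem. The sketch below assumes a single feature and no intercept, in which case the lasso objective sum((y_i - b*x_i)**2) + α*|b| has a closed-form minimizer obtained by soft-thresholding. The function names and the data are hypothetical, chosen only to illustrate the behaviour.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_1d(x, y, alpha):
    """Minimize sum((y_i - b*x_i)**2) + alpha*|b| for one feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Setting the derivative to zero gives b = soft_threshold(sxy, alpha/2) / sxx
    return soft_threshold(sxy, alpha / 2.0) / sxx

# Toy data generated from y = 2x, so the least-squares slope is 2
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for alpha in (0.0, 10.0, 1000.0):
    print(alpha, lasso_1d(x, y, alpha))
```

With α = 0 the least-squares slope is recovered, a moderate α gives a slope between 0 and that value, and a sufficiently large α drives the coefficient exactly to zero. In this one-feature setting that happens at a finite α, not only in the limit.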

Why Lasso Regression?

When we have little data relative to the number of features, a linear model tends to overfit: it fits the training data closely, including its noise, and its accuracy on new data drops.

Have you ever tried to wear over-sized clothes?

A person in an extra-large dress is carrying far more material than needed. A model is in a similar position when features keep being added only to decrease the cost function: many of its coefficients are surplus.

If we cannot get rid of this problem, it affects the model performance. Here, lasso regression comes into the picture. It penalizes the size of the coefficients and can shrink some of them exactly to zero, which removes the corresponding features from the model.

L1 regularization adds a penalty equivalent to the absolute value of the magnitude of the coefficients:

Minimization objective = LS Obj + α * (sum of absolute value of coefficients)

Here ‘LS Obj’ refers to ‘least squares objective’, i.e. the linear regression objective without regularization.

The objective minimized by lasso regression, written with the shrinkage factor λ in place of α, is

RSS + λ * (|β1| + |β2| + … + |βp|)

Lasso can equivalently be written as a constrained problem: minimize RSS subject to the sum of the absolute values of the coefficients being less than or equal to s. For every value of the shrinkage factor λ there exists a corresponding constant s. These inequalities are also referred to as constraint functions.

With two coefficients, the constraint becomes |β1| + |β2| ≤ s. This implies that the lasso coefficients have the smallest RSS (loss function) among all points that lie within the diamond given by |β1| + |β2| ≤ s.
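The diamond picture can be explored with a brute-force search. The sketch below is an illustration under made-up assumptions: the toy RSS corresponds to an identity design matrix with y = (2.0, 0.5), and the budget s = 1 and the grid resolution are arbitrary choices. It simply evaluates RSS on a grid of points inside the diamond and keeps the best one.

```python
def rss(b1, b2):
    # Toy setup (assumed): X is the 2x2 identity and y = (2.0, 0.5), so the
    # unconstrained least-squares solution is (b1, b2) = (2.0, 0.5).
    return (2.0 - b1) ** 2 + (0.5 - b2) ** 2

s = 1.0       # size of the diamond |b1| + |b2| <= s
steps = 20    # grid resolution: candidates are multiples of s / steps

best = None
for i in range(-steps, steps + 1):
    for j in range(-steps, steps + 1):
        if abs(i) + abs(j) > steps:   # point lies outside the diamond
            continue
        b1, b2 = s * i / steps, s * j / steps
        candidate = (rss(b1, b2), b1, b2)
        if best is None or candidate < best:
            best = candidate

best_rss, best_b1, best_b2 = best
print(best_b1, best_b2, best_rss)
```

In this toy case the best point sits on a corner of the diamond, where the second coefficient is exactly zero. This is the geometric reason lasso tends to perform variable selection.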

Ridge Regression / L2 Regularization

Ridge regression is a specialized technique used to analyze multiple regression data that suffers from multicollinearity. Multicollinearity means that one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. It occurs when there are high correlations between two or more predictor variables.

Ridge regression is also called L2 regularization. It helps prevent overfitting of the data, which is a major problem in the analysis of the data.

Why Ridge Regression?

Ridge regression is an extension of linear regression. The basic idea of a linear regression model revolves around minimizing the value of the cost function: the lower the cost function value, the better the model fits the training data. The cost function (loss function) measures the error between the model's predictions and the observed values.

By increasing the number of features, we can decrease the cost function. But if we keep adding features, the model starts fitting the training data set too closely, which leads to overfitting and hurts the model performance on new data. To overcome this problem, we use ridge regression.

It performs L2 regularization, i.e. adds a penalty equivalent to the square of the magnitude of the coefficients:
Minimization objective = LS Obj + α * (sum of squares of coefficients)

Have you ever tried to fit into under-sized clothes?

A person squeezing into an extra-small dress illustrates the overfitting problem: the fit is too tight. The same problem occurs with a dataset if you keep increasing the number of features to decrease the cost function.

Overfitting happens in linear models when dealing with many features. If we cannot get rid of this problem, some features can be more destructive than helpful, and information repeated by other features adds noise to the model. Here, ridge regression comes into the picture. It reduces the overfitting problem by penalizing large coefficients.

To fix the problem of overfitting, we need to balance two things:
1. How well the function/model fits the data.
2. The magnitude of the coefficients.

The metric used for ridge regression is

RSS + λ * (β1² + β2² + … + βp²)

where the RSS is modified by adding the shrinkage quantity. The coefficients are now estimated by minimizing this function. Here, λ is the tuning parameter that decides how much we penalize the flexibility of our model. An increase in the flexibility of a model is represented by an increase in its coefficients, so to minimize the above function these coefficients need to be small. This is how the ridge regression technique prevents coefficients from rising too high. The intercept β0 is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0, and it is not shrunk.

When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression are equal to the least squares estimates. However, as λ→∞, the effect of the shrinkage penalty grows, and the ridge regression coefficient estimates approach zero. Selecting a good value of λ is critical. The shrinkage penalty is the squared L2 norm of the coefficient vector, which is why this method is also known as L2 regularization.
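The effect of λ can be seen on a small example. The sketch below assumes a single feature with an intercept, where minimizing sum((y_i - b0 - b1*x_i)**2) + λ*b1**2 has a simple closed form: the slope is shrunk by λ in the denominator, while the intercept is recovered from the means and is not penalized. The function name and the data are hypothetical.

```python
def ridge_1d(x, y, lam):
    """Minimize sum((y_i - b0 - b1*x_i)**2) + lam * b1**2 for one feature."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / (sxx + lam)        # slope: shrunk towards zero as lam grows
    b0 = y_mean - b1 * x_mean     # intercept: not penalized
    return b0, b1

# Toy data generated from y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

for lam in (0.0, 10.0, 1e9):
    print(lam, ridge_1d(x, y, lam))
```

With λ = 0 the least squares fit (intercept 1, slope 2) is recovered. As λ grows, the slope approaches zero without ever reaching it exactly, and the intercept approaches the mean of the response.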

The formula used to perform ridge regression is given below.