Bagging and Boosting
Bootstrapping means random sampling with replacement. A bootstrap sample is a subset drawn from the original dataset in which each selected observation is put back before the next draw, so the same observation can appear more than once. Repeating this many times shows how a statistic such as the mean or standard deviation varies across samples, which helps us understand the bias and variance of an estimate.
To estimate the mean of a population, we need a sample of n values (x):
mean(x) = 1/n * sum(x)
If our sample is small, the sample mean will contain some error. The bootstrap procedure lets us re-estimate the mean and see how much it varies:
- Create many random sub-samples of the dataset, drawing with replacement so that the same value can be selected more than once.
- Compute the mean of each sub-sample.
- Average the collected means and use that as the estimated mean for the data.
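The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `bootstrap_mean`, the sample values, and the choice of 1000 resamples are assumptions made for the example, not part of the procedure itself.

```python
import random
import statistics

def bootstrap_mean(data, n_resamples=1000, seed=0):
    """Return (estimated mean, standard error) from bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n values with replacement: the same value may be picked repeatedly.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    # Average of the collected means, plus their spread as a standard-error estimate.
    return statistics.fmean(means), statistics.stdev(means)

sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]  # made-up values
estimate, std_error = bootstrap_mean(sample)
print(f"plain sample mean: {statistics.fmean(sample):.3f}")
print(f"bootstrap mean:    {estimate:.3f} (std. error about {std_error:.3f})")
```

The averaged bootstrap mean stays close to the plain sample mean. The extra information is the spread of the resampled means, which indicates how much the estimate could move if the data were collected again.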
Bootstrap aggregation, also called bagging, is a simple yet powerful ensemble method. It applies the bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.
In bagging, several subsets of the data are created from the training sample, chosen randomly with replacement.
Each subset is used to train its own decision tree.
Decision trees suffer from both bias and variance.
Simple trees suffer from large bias.
Complex trees suffer from large variance.
Several decision trees are combined to get a more reliable result than a single decision tree would give.
Bagging is used to reduce the variance of a decision tree.
When bagging with decision trees, we are less worried about individual trees overfitting the training data. For this reason, and for efficiency, the individual trees are grown deep and are not pruned. Such trees have high variance and low bias, which suits bagging because averaging many trees reduces the variance. The main parameter for bagged decision trees is the number of bootstrap samples, and hence the number of trees to include. It can be chosen by increasing the number of trees from run to run until the accuracy stops improving.
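To make the idea concrete, here is a minimal sketch of bagging with deep, unpruned regression trees, written in plain Python. Everything in it is an assumption chosen for illustration: the one-feature tree (`grow_tree`), the noisy sine data, and the choice of 25 trees. It is not a reference implementation.

```python
import math
import random
import statistics

def sse(ys):
    """Sum of squared deviations of ys from their mean."""
    m = statistics.fmean(ys)
    return sum((y - m) ** 2 for y in ys)

def grow_tree(points):
    """Grow an unpruned regression tree on (x, y) pairs with one feature x."""
    xs = sorted({x for x, _ in points})
    if len(xs) == 1:
        # Nothing left to split on: this node becomes a leaf.
        return statistics.fmean([y for _, y in points])
    best = None
    for lo, hi in zip(xs, xs[1:]):
        left = [(x, y) for x, y in points if x <= lo]
        right = [(x, y) for x, y in points if x > lo]
        cost = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or cost < best[0]:
            best = (cost, (lo + hi) / 2, left, right)
    _, threshold, left, right = best
    return (threshold, grow_tree(left), grow_tree(right))

def predict(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def bagged_trees(points, n_trees=25, seed=0):
    """Train one deep tree per bootstrap sample of the training data."""
    rng = random.Random(seed)
    return [grow_tree(rng.choices(points, k=len(points))) for _ in range(n_trees)]

def bagged_predict(trees, x):
    """Aggregate by averaging the predictions of all trees."""
    return statistics.fmean([predict(t, x) for t in trees])

data_rng = random.Random(1)

def make_data(n):
    """Noisy samples of sin(x); purely synthetic."""
    xs = [data_rng.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + data_rng.gauss(0, 0.3)) for x in xs]

train, test = make_data(60), make_data(200)
trees = bagged_trees(train)

def mse(predict_fn):
    return statistics.fmean([(predict_fn(x) - y) ** 2 for x, y in test])

single_mse = statistics.fmean([mse(lambda x, t=t: predict(t, x)) for t in trees])
bagged_mse = mse(lambda x: bagged_predict(trees, x))
print(f"average error of a single tree: {single_mse:.4f}")
print(f"error of the bagged ensemble:   {bagged_mse:.4f}")
```

For squared error, the averaged prediction can never be worse than the average error of the individual trees, which is why the deep, overfitted trees are acceptable here.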
Random forest, a bagging method that also samples features, works as follows:
- Suppose the training set has N observations and M features. A sample is drawn from the training set randomly with replacement.
- A subset of the M features is selected at random, and the feature that gives the best split is used to split the node. This is repeated at every node.
- The tree is grown to its largest possible size.
- The steps above are repeated n times, and the prediction is the aggregate (a vote or an average) of the predictions of the n trees.
Advantages:
- It reduces over-fitting of the model.
- It handles high-dimensional data very well.
- It maintains accuracy even when some data is missing.
Disadvantage:
- Because the final prediction is the mean of the predictions of the individual trees, it may not give precise values for classification and regression models.
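The two sampling steps in the tree-building procedure above (rows drawn with replacement, then a random subset of features at each split) can be sketched as follows. The helper names are hypothetical, and the default of roughly the square root of M features per split is a common heuristic assumed here, not something the text prescribes.

```python
import math
import random

def bootstrap_rows(n_rows, rng):
    """Row indices of one bootstrap sample: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, rng, k=None):
    """Random subset of feature indices to consider at a single split."""
    if k is None:
        # Assumed default: about sqrt(M) features per split.
        k = max(1, round(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)  # no feature is repeated

rng = random.Random(0)
N, M = 10, 9  # toy sizes: N observations, M features

rows = bootstrap_rows(N, rng)
features = candidate_features(M, rng)
print("rows used for this tree:     ", rows)
print("features tried at this split:", features)
```

Each tree gets its own call to `bootstrap_rows`, and each node inside that tree gets a fresh call to `candidate_features`, which is what keeps the trees from all making the same splits.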
Boosting is an ensemble method for improving the model predictions of any given learning algorithm. The idea of boosting is to train weak learners sequentially, each trying to correct its predecessor. Boosting is used to create a collection of predictors.
It refers to a group of algorithms that use weighted averages to convert weak learners into stronger learners. Bagging runs each model independently and then aggregates the outputs at the end without preference for any model. Boosting, in contrast, is all about teamwork: each model that runs dictates what the next model will focus on.
In this technique, learners are trained sequentially: early learners fit simple models to the data, and the data is then analyzed for errors that later learners try to correct. This process turns weak learners into a better-performing model.
Box 1: Every data point starts with an equal weight, and a decision stump (D1) is applied to classify the points as plus (+) or minus (-). D1 produces a vertical line on the left side of the box. This line incorrectly predicts three plus (+) points as minus (-), so we assign higher weights to these three points and apply another decision stump.
Box 2: The three misclassified plus (+) points are now drawn larger than the rest, reflecting their higher weights. The second decision stump (D2) tries to predict them correctly: a vertical line on the right side of the box now classifies them correctly, but it misclassifies three minus (-) points. Again, we assign higher weights to those three minus (-) points and apply another decision stump.
Box 3: Here, the three minus (-) points carry higher weights. A decision stump (D3) is applied to classify these previously misclassified observations correctly. This time it produces a horizontal line that separates plus (+) from minus (-), driven by the higher weights of the misclassified observations.
Box 4: D1, D2 and D3 are combined into a strong predictor whose rule is more complex than that of any individual weak learner. The combined model classifies the observations considerably better than any single weak learner does.
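The reweighting scheme described in the four boxes is the one used by AdaBoost. Below is a minimal sketch with decision stumps in plain Python. The ten coordinates are made up for illustration and are not taken from the figure, so the order in which the vertical and horizontal lines are found need not match the boxes. The formula for the stump weight `alpha` is the standard AdaBoost choice, assumed here because the text does not give one.

```python
import math

def best_stump(X, y, w):
    """Axis-aligned threshold rule with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        values = sorted({row[feature] for row in X})
        thresholds = [values[0] - 1] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for sign in (1, -1):
                pred = [sign if row[feature] > threshold else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best

def stump_predict(stump, row):
    _, feature, threshold, sign = stump
    return sign if row[feature] > threshold else -sign

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1 / n] * n  # Box 1: every point starts with the same weight
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = max(stump[0], 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # better stumps get a larger say
        ensemble.append((alpha, stump))
        # Boxes 2 and 3: raise the weight of misclassified points, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Box 4: weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Made-up toy data: five plus (+1) points followed by five minus (-1) points.
X = [(1, 5), (2, 2), (4, 8), (5, 7), (6, 9), (3, 3), (4, 1), (5, 4), (8, 6), (9, 2)]
y = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]

ensemble = adaboost(X, y, rounds=3)
accuracy = sum(predict(ensemble, row) == yi for row, yi in zip(X, y)) / len(X)
for alpha, (err, feature, threshold, sign) in ensemble:
    print(f"feature {feature}, threshold {threshold}, weighted error {err:.3f}, alpha {alpha:.3f}")
print("training accuracy:", accuracy)
```

No single stump separates this toy data, but the weighted vote of three stumps does, which mirrors the point made in Box 4.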
One simple boosting procedure works in four steps:
- In the first step, draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
- Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously misclassified, to train a weak learner C2.
- In the third step, find the training samples d3 in the training set D on which C1 and C2 disagree, and use them to train a third weak learner C3.
- Finally, combine all the weak learners via majority voting.
Advantages:
- Boosting supports different loss functions.
- It works well with interactions.
Disadvantages:
- It is prone to over-fitting.
- It requires careful tuning of several hyper-parameters.
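The four-step procedure with the learners C1, C2 and C3 can be sketched as below. Several details are assumptions made for this illustration: the weak learner is a one-feature threshold stump, the "previously misclassified" samples are taken to be those that C1 gets wrong on the whole training set D, and if C1 and C2 never disagree the sketch falls back to training C3 on all of D.

```python
import random

def fit_stump(points):
    """Best threshold rule on 1-D (x, label) pairs with labels in {-1, +1}."""
    xs = sorted({x for x, _ in points})
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in thresholds:
        for sign in (1, -1):
            errors = sum(1 for x, y in points if (sign if x > t else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, best_t, best_sign = best
    return lambda x: best_sign if x > best_t else -best_sign

def boost_three(D, subset_size, seed=0):
    rng = random.Random(seed)
    # Step 1: random subset d1, drawn without replacement, trains C1.
    d1 = rng.sample(D, subset_size)
    c1 = fit_stump(d1)
    # Step 2: a fresh random subset plus half of the samples C1 got wrong.
    wrong = [(x, y) for x, y in D if c1(x) != y]
    d2 = rng.sample(D, subset_size) + rng.sample(wrong, len(wrong) // 2)
    c2 = fit_stump(d2)
    # Step 3: samples on which C1 and C2 disagree (fallback to D is an assumption).
    d3 = [(x, y) for x, y in D if c1(x) != c2(x)] or D
    c3 = fit_stump(d3)

    # Step 4: majority vote; the sum of three +/-1 votes is never zero.
    def vote(x):
        return 1 if c1(x) + c2(x) + c3(x) > 0 else -1
    return vote

# Toy data: label +1 inside the interval [6, 13], -1 outside.
D = [(x, 1 if 6 <= x <= 13 else -1) for x in range(20)]
vote = boost_three(D, subset_size=10)
accuracy = sum(vote(x) == y for x, y in D) / len(D)
print("training accuracy of the majority vote:", accuracy)
```

A single stump cannot describe an interval, so the interest here is in how the second and third learners are pushed toward the samples the first one handled badly.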
Differences Between Bagging and Boosting –
| No. | Bagging | Boosting |
|---|---|---|
| 1. | The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types. |
| 2. | Aims to decrease variance, not bias. | Aims to decrease bias, not variance. |
| 3. | Each model receives equal weight. | Models are weighted according to their performance. |
| 4. | Each model is built independently. | New models are influenced by the performance of previously built models. |
| 5. | Different training data subsets are randomly drawn with replacement from the entire training dataset. | Every new subset contains the elements that were misclassified by previous models. |
| 6. | Bagging tries to solve the over-fitting problem. | Boosting tries to reduce bias. |
| 7. | If the classifier is unstable (high variance), then apply bagging. | If the classifier is stable and simple (high bias), then apply boosting. |
| 8. | Example: random forest. | Example: gradient boosting. |