
Machine Learning - Random Forest

Random Forest

 

Random forest consists of a large number of individual decision trees that operate as a group, or ensemble. Each individual tree in the random forest outputs a predicted class, and the class with the most votes becomes our model’s prediction.
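As a minimal sketch of this voting step (the individual tree predictions below are made up purely for illustration), the forest's output is simply the most common class among the trees' outputs:

    from collections import Counter

    # Hypothetical predictions from five individual trees for one sample
    tree_predictions = ["cat", "dog", "cat", "cat", "dog"]

    # The forest predicts the class with the most votes
    forest_prediction = Counter(tree_predictions).most_common(1)[0][0]
    print(forest_prediction)  # -> cat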

 

Although decision trees are convenient and easily implemented, they lack accuracy and suffer from overfitting. Decision trees work very effectively on the training data that was used to build them, but they are not flexible when it comes to classifying new samples, so the accuracy during the testing phase can be very low. This happens because of a process called overfitting.

 

This means that the noise in the training data is memorized and learned as concepts by the model. These concepts do not apply to the testing data, so they negatively impact the model’s ability to classify new data and thereby reduce the accuracy on the testing data.

 

This is where Random Forest comes in. It is based on reducing the variance of the predictions by combining the results of multiple decision trees built on different samples of the data set.

 

The fundamental concept behind random forest is that a large number of relatively uncorrelated models, or trees, operating as a group will outperform any of the individual basic models.

Feature Randomness

In a decision tree, to split a node we consider every possible feature and select the one that produces the greatest separation between the observations in the left node and those in the right node. In contrast, each tree in a random forest can select only from a random subset of features. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and greater diversity. So, in a random forest, we end up with trees that are not only trained on different sets of data but also use different features to make decisions.
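A minimal sketch of this idea in Python (NumPy only; the number of features is made up): at a given split, the tree may consider only a random subset of the feature indices rather than all of them.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features = 16                            # assumed total number of features
    n_to_consider = int(np.sqrt(n_features))   # common choice for classification

    # Feature indices a single node is allowed to consider for its split
    candidate_features = rng.choice(n_features, size=n_to_consider, replace=False)
    print(candidate_features)                  # e.g. four randomly chosen indices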

 

The random forest combines a number of decision trees, trains each one on a slightly different set of observations, and splits the nodes in each tree while considering only a limited number of features. The final predictions of the random forest are made by averaging the predictions of each individual tree.
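As a short end-to-end sketch (using scikit-learn's RandomForestClassifier on a built-in toy dataset; the parameter values are illustrative, not prescriptive):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 trees, each trained on a bootstrap sample and using random feature subsets
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)

    # Each tree votes; the forest reports the majority class
    print(forest.score(X_test, y_test))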

 

Random Forest mainly uses two key concepts. They are:

  1. Random sampling of training data points when building trees.
  2. Random subsets of features considered when splitting nodes.

 

Random sampling of training observations

While training, each tree in a random forest learns from a random sample of the data points. The samples are drawn with replacement, which means some data points will be used multiple times in a single tree. Although each tree might have high variance with respect to its particular set of training data, by training each tree on a different sample, the forest as a whole ends up with lower variance, and not at the cost of increased bias.
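A minimal sketch of sampling with replacement (NumPy; the sample size is made up). Some row indices appear more than once in the bootstrap sample, while others are left out entirely.

    import numpy as np

    rng = np.random.default_rng(42)
    n_samples = 10

    # Draw indices with replacement: duplicates are expected
    bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)
    print(bootstrap_indices)   # e.g. an array of 10 indices in which some values repeat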

 

While testing, predictions are made by averaging the predictions of each decision tree. This procedure of training each individual learner on a different bootstrapped subset of the data and then averaging the predictions is known as bagging, short for bootstrap aggregating.
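Bagging can be sketched directly with scikit-learn's BaggingClassifier, whose default base learner is a decision tree (the dataset and parameters here are only illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 50 decision trees, each fit on a bootstrap sample of the training data,
    # with their predictions combined by voting
    bagger = BaggingClassifier(n_estimators=50, random_state=0)
    bagger.fit(X_train, y_train)
    print(bagger.score(X_test, y_test))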

 

Random Subsets of features for splitting nodes

The other main concept behind the random forest is that only a subset of all the features is considered for splitting each node in each decision tree. Generally, this is set to sqrt(n_features) for classification, so at each node in each tree only that many randomly chosen features are considered for splitting the node. Even though the random forest can still overfit, it is able to generalize much better to the testing data than a single decision tree: the random forest has lower variance while maintaining the same low bias as a decision tree.
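A small comparison sketch (on a toy dataset; the exact scores will vary) of a single decision tree versus a random forest restricted to sqrt(n_features) per split:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    forest = RandomForestClassifier(max_features="sqrt", random_state=0).fit(X_train, y_train)

    # The forest usually generalizes better to the held-out test data
    print("single tree  :", tree.score(X_test, y_test))
    print("random forest:", forest.score(X_test, y_test))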

 

Creating A Random Forest

 

Step 1: Create a Bootstrapped Data Set

Bootstrapping is a resampling method used to estimate quantities from a dataset by repeatedly re-sampling it. To create a bootstrapped data set, we select random samples from the original data set with replacement, so the same sample can be selected more than once.
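For a tabular data set, the same idea can be sketched with pandas (the column names and values below are made up):

    import pandas as pd

    data = pd.DataFrame({
        "feature_1": [0, 1, 1, 0],
        "feature_2": [1, 1, 0, 1],
        "label":     [0, 1, 1, 0],
    })

    # Bootstrapped data set: same number of rows, sampled with replacement,
    # so some original rows may appear more than once and others not at all
    bootstrapped = data.sample(frac=1.0, replace=True, random_state=0)
    print(bootstrapped)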

 

Step 2: Creating Decision Trees

Build a decision tree on the bootstrapped data set, but at each node consider only a random subset of the features when choosing the split.

 

Step 3: Go back to Step 1 and Repeat

Create a new bootstrapped data set and build another tree on it. Repeating this many times produces a wide variety of trees, which is what makes the forest effective.

 

Step 4: Predicting the outcome of a new data point

Run the new data point down every tree in the forest and record each tree’s prediction. The class that receives the most votes becomes the forest’s prediction.

 

Step 5: Evaluate the Model

Measure how accurately the forest classifies data it was not trained on, for example a held-out testing set.
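The five steps can also be sketched by hand (a simplified, illustrative implementation using NumPy and scikit-learn decision trees, not a production-grade one):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    trees = []
    for _ in range(100):                                   # Step 3: repeat many times
        # Step 1: bootstrapped data set (sample rows with replacement)
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        # Step 2: decision tree that considers sqrt(n_features) features per split
        tree = DecisionTreeClassifier(max_features="sqrt")
        trees.append(tree.fit(X_train[idx], y_train[idx]))

    # Step 4: predict new data points by majority vote over all trees
    votes = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
    majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

    # Step 5: evaluate the forest on data it was never trained on
    print("test accuracy:", np.mean(majority == y_test))

In practice, scikit-learn's RandomForestClassifier performs all of these steps internally.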

 

Pros and Cons of Random Forest

 

Pros

  1. Combining many trees reduces variance and overfitting compared with a single decision tree.
  2. Generalizes better to unseen (testing) data while keeping the low bias of a decision tree.
  3. Trees are trained on different samples and different features, so their errors are largely uncorrelated.

 

Cons

  1. Training and prediction are slower and more memory-intensive than for a single decision tree, since many trees must be built and evaluated.
  2. The resulting model is harder to interpret than a single decision tree.
