
Machine Learning - Random Forest

Random Forest

 

Random forest consists of a large number of individual decision trees that operate as a group, or ensemble. Each individual tree in the random forest outputs a predicted class, and the class with the most votes becomes our model’s prediction.
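As a minimal sketch of this voting step (the individual tree predictions below are made up purely for illustration), the forest's output is simply the most common class among the trees' outputs:

    from collections import Counter

    # Hypothetical predictions from five individual trees for one sample
    tree_predictions = ["cat", "dog", "cat", "cat", "dog"]

    # The forest predicts the class with the most votes
    forest_prediction = Counter(tree_predictions).most_common(1)[0][0]
    print(forest_prediction)  # -> cat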

 

Although decision trees are convenient and easily implemented, they lack accuracy and suffer from overfitting. Decision trees work very effectively on the training data that was used to build them, but they are not flexible when it comes to classifying new samples, so the accuracy during the testing phase can be very low. This happens because of a process called overfitting.

 

This means that the noise in the training data is memorized and learned as concepts by the model. These concepts do not apply to the testing data, so they negatively impact the model’s ability to classify new data and thereby reduce the accuracy on the testing data.

 

This is where Random Forest comes in. It is based on reducing the variance of the predictions by combining the results of multiple decision trees built on different samples of the data set.

 

The fundamental concept behind random forest is that a large number of relatively uncorrelated models, or trees, operating as a group will outperform any of the individual basic models.

Feature Randomness

In a decision tree, to split a node we consider every possible feature and select the one that produces the greatest separation between the observations in the left node and those in the right node. In contrast, each tree in a random forest can select only from a random subset of features. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and greater diversity. So, in a random forest, we end up with trees that are not only trained on different sets of data but also use different features to make decisions.
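A minimal sketch of this idea in Python (NumPy only; the number of features is made up): at a given split, the tree may consider only a random subset of the feature indices rather than all of them.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features = 16                            # assumed total number of features
    n_to_consider = int(np.sqrt(n_features))   # common choice for classification

    # Feature indices a single node is allowed to consider for its split
    candidate_features = rng.choice(n_features, size=n_to_consider, replace=False)
    print(candidate_features)                  # e.g. four randomly chosen indices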

 

The random forest combines a number of decision trees, trains each one on a slightly different set of observations, and splits the nodes in each tree while considering only a limited number of features. The final predictions of the random forest are made by averaging the predictions of each individual tree.
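As a short end-to-end sketch (using scikit-learn's RandomForestClassifier on a built-in toy dataset; the parameter values are illustrative, not prescriptive):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 trees, each trained on a bootstrap sample and using random feature subsets
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)

    # Each tree votes; the forest reports the majority class
    print(forest.score(X_test, y_test))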

 

Random Forest mainly uses two key concepts. They are:

  1. Random sampling of training data points when building trees.
  2. Random subsets of features considered when splitting nodes.

 

Random sampling of training observations

While training, each tree in a random forest learns from a random sample of the data points. The samples are drawn with replacement, which means some data points will be used multiple times in a single tree. Although each tree might have high variance with respect to its particular set of training data, by training each tree on a different sample, the forest as a whole ends up with lower variance, and not at the cost of increased bias.
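A minimal sketch of sampling with replacement (NumPy; the sample size is made up). Some row indices appear more than once in the bootstrap sample, while others are left out entirely.

    import numpy as np

    rng = np.random.default_rng(42)
    n_samples = 10

    # Draw indices with replacement: duplicates are expected
    bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)
    print(bootstrap_indices)   # e.g. an array of 10 indices in which some values repeat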

 

While testing, predictions are made by averaging the predictions of each decision tree. This procedure of training each individual learner on a different bootstrapped subset of the data and then averaging the predictions is known as bagging, short for bootstrap aggregating.
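Bagging can be sketched directly with scikit-learn's BaggingClassifier, whose default base learner is a decision tree (the dataset and parameters here are only illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 50 decision trees, each fit on a bootstrap sample of the training data,
    # with their predictions combined by voting
    bagger = BaggingClassifier(n_estimators=50, random_state=0)
    bagger.fit(X_train, y_train)
    print(bagger.score(X_test, y_test))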

 

Random Subsets of features for splitting nodes

The other main concept behind the random forest is that only a subset of all the features is considered for splitting each node in each decision tree. Generally, this is set to sqrt(n_features) for classification, so at each node in each tree only that many randomly chosen features are considered for splitting the node. Even though the random forest can still overfit, it is able to generalize much better to the testing data than a single decision tree: the random forest has lower variance while maintaining the same low bias as a decision tree.
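A small comparison sketch (on a toy dataset; the exact scores will vary) of a single decision tree versus a random forest restricted to sqrt(n_features) per split:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    forest = RandomForestClassifier(max_features="sqrt", random_state=0).fit(X_train, y_train)

    # The forest usually generalizes better to the held-out test data
    print("single tree  :", tree.score(X_test, y_test))
    print("random forest:", forest.score(X_test, y_test))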

 

Creating A Random Forest

 

Step 1: Create a Bootstrapped Data Set

Bootstrapping is a resampling method used to estimate quantities from a dataset by repeatedly re-sampling it. To create a bootstrapped data set, we select random samples from the original data set with replacement, so the same sample can be selected more than once.
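For a tabular data set, the same idea can be sketched with pandas (the column names and values below are made up):

    import pandas as pd

    data = pd.DataFrame({
        "feature_1": [0, 1, 1, 0],
        "feature_2": [1, 1, 0, 1],
        "label":     [0, 1, 1, 0],
    })

    # Bootstrapped data set: same number of rows, sampled with replacement,
    # so some original rows may appear more than once and others not at all
    bootstrapped = data.sample(frac=1.0, replace=True, random_state=0)
    print(bootstrapped)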

 

Step 2: Creating Decision Trees

Build a decision tree on the bootstrapped data set, but at each node consider only a random subset of the features when choosing the split.

 

Step 3: Go back to Step 1 and Repeat

Create a new bootstrapped data set and build another tree on it. Repeating this many times produces a wide variety of trees, which is what makes the forest effective.

 

Step 4: Predicting the outcome of a new data point

Run the new data point down every tree in the forest and record each tree’s prediction. The class that receives the most votes becomes the forest’s prediction.

 

Step 5: Evaluate the Model

Measure how accurately the forest classifies data it was not trained on, for example a held-out testing set.
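The five steps can also be sketched by hand (a simplified, illustrative implementation using NumPy and scikit-learn decision trees, not a production-grade one):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    trees = []
    for _ in range(100):                                   # Step 3: repeat many times
        # Step 1: bootstrapped data set (sample rows with replacement)
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        # Step 2: decision tree that considers sqrt(n_features) features per split
        tree = DecisionTreeClassifier(max_features="sqrt")
        trees.append(tree.fit(X_train[idx], y_train[idx]))

    # Step 4: predict new data points by majority vote over all trees
    votes = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
    majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

    # Step 5: evaluate the forest on data it was never trained on
    print("test accuracy:", np.mean(majority == y_test))

In practice, scikit-learn's RandomForestClassifier performs all of these steps internally.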

 

Pros and Cons of Random Forest

 

Pros

  1. Combining many trees reduces variance and overfitting compared with a single decision tree.
  2. Generalizes better to unseen (testing) data while keeping the low bias of a decision tree.
  3. Trees are trained on different samples and different features, so their errors are largely uncorrelated.

 

Cons

  1. Training and prediction are slower and more memory-intensive than for a single decision tree, since many trees must be built and evaluated.
  2. The resulting model is harder to interpret than a single decision tree.
