Simple Ways to Split a Decision Tree in Machine Learning

October 10, 2020

What is a decision tree?

Decision trees are a machine learning technique for making predictions. They are built by repeatedly splitting training data into smaller and smaller samples. This post will explain how these splits are chosen.

The decision tree algorithm belongs to the family of supervised learning algorithms. Unlike some other supervised learning algorithms, it can be used to solve both regression and classification problems.

Basic Decision Tree Terminologies

The following terms describe the parts of a decision tree:

Parent and Child Node: A node that gets divided into sub-nodes is known as a Parent Node, and these sub-nodes are known as Child Nodes. Since a node can be divided into multiple sub-nodes, it can act as the parent of several child nodes.
Root Node: The first node of a decision tree. It has no parent node and represents the entire population or sample.
Leaf / Terminal Nodes: Nodes that do not have any child nodes are known as Terminal or Leaf Nodes.
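To make these terms concrete, here is a minimal sketch of a tree structure in Python. The `Node` class, its `name` and `children` fields, and the `leaf_names` helper are hypothetical names chosen for illustration, not part of any library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Illustrative tree node (hypothetical class, not a library API)."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A leaf (terminal) node has no child nodes.
        return not self.children


def leaf_names(node: Node) -> List[str]:
    """Collect the names of all leaf nodes below (and including) `node`."""
    if node.is_leaf():
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names


# The root has no parent and stands for the whole sample.
left = Node("left child")
right = Node("right child", children=[Node("leaf A"), Node("leaf B")])
root = Node("root", children=[left, right])

print(leaf_names(root))
```

Here `root` is the parent of `left` and `right`, and `right` is in turn the parent of two leaves.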

What is Node Splitting in a Decision Tree & Why is it Done?

In decision trees, the training data is passed from the root node down to the leaves. The data is recursively split according to the predictor variables so that the child nodes are more “pure” in terms of the outcome variable.

Node splitting is therefore a key concept to understand. Node splitting, or simply splitting, is the process of dividing a node into multiple sub-nodes to create relatively pure nodes.

There are multiple ways of doing this, which can be broadly divided into two categories based on the type of target variable:

Continuous Target Variable: Reduction in Variance
Categorical Target Variable: Gini Impurity, Information Gain, Chi-Square

Decision Tree Splitting Method #1: Reduction in Variance

Reduction in Variance is a method for splitting a node when the target variable is continuous, i.e., in regression problems. It is so called because it uses variance as the measure for deciding which feature a node is split on.

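The variance of a node is computed as:

$$\text{Variance} = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

where $x_i$ are the target values in the node, $\mu$ is their mean, and $N$ is the number of values.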

Variance is used for calculating the homogeneity of a node. If a node is entirely homogeneous, then the variance is zero.

Here are the steps to split a decision tree by means of reduction in variance:

1. For each split, individually calculate the variance of each child node
2. Calculate the variance of each split as the weighted average variance of child nodes
3. Select the split with the lowest variance
4. Perform steps 1-3 until completely homogeneous nodes are achieved
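Steps 1-3 can be sketched in a few lines of Python. This is only an illustration: the target values and the names `split_a` and `split_b` are made up, and the population variance (dividing by N) is assumed.

```python
from statistics import pvariance


def weighted_variance(children):
    """Weighted average of the population variances of the child nodes."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * pvariance(child) for child in children)


# Hypothetical target values of the child nodes produced by two candidate splits.
split_a = [[10.0, 12.0, 11.0], [30.0, 32.0, 31.0]]
split_b = [[10.0, 30.0, 11.0], [12.0, 32.0, 31.0]]

candidates = {"split_a": split_a, "split_b": split_b}
scores = {name: weighted_variance(children) for name, children in candidates.items()}

# Step 3: the split with the lowest weighted variance wins.
best = min(scores, key=scores.get)
print(scores, best)
```

`split_a` groups similar target values together, so its child nodes have a much lower variance than those of `split_b`.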

Decision Tree Splitting Method #2: Information Gain

Now, what if we have a categorical target variable? Reduction in variance won't quite cut it.
Well, the answer to that is Information Gain. Information Gain is used for splitting the nodes when the target variable is categorical. It is based on the concept of entropy and is given by:

$$\text{Information Gain} = 1 - \text{Entropy}$$

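Entropy measures the impurity of a node: the lower the entropy, the higher the purity of the node. The entropy of a homogeneous node is zero. Since we subtract entropy from 1, the Information Gain is higher for purer nodes, with a maximum value of 1. (More generally, Information Gain is the entropy of the parent node minus the weighted average entropy of the child nodes; the 1 - Entropy form matches it when the parent of a two-class problem is evenly mixed and so has an entropy of 1.) Now, let's take a look at the formula for calculating the entropy:

$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where $p_i$ is the proportion of samples in the node that belong to class $i$, and $n$ is the number of classes.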

Steps to split a decision tree with Information Gain:

1. For each split, individually calculate the entropy of each child node
2. Calculate the entropy of each split as the weighted average entropy of child nodes
3. Select the split with the lowest entropy or highest information gain
4. Until you achieve homogeneous nodes, repeat steps 1-3
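As a rough sketch of steps 1-3, the snippet below scores two hypothetical splits of a two-class node. The labels and split names are invented for illustration, and Information Gain is computed as 1 minus the weighted entropy, which assumes a binary target and base-2 logarithms.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (base 2) of the class labels in one node."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def weighted_entropy(children):
    """Weighted average entropy of the child nodes of a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * entropy(child) for child in children)


# Hypothetical child nodes produced by two candidate splits of the same parent.
split_a = [["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"]]
split_b = [["yes", "no", "yes", "no"], ["yes", "no", "no", "yes"]]

candidates = {"split_a": split_a, "split_b": split_b}
scores = {name: weighted_entropy(children) for name, children in candidates.items()}
gains = {name: 1 - score for name, score in scores.items()}

# Step 3: lowest entropy, which is the same as highest information gain.
best = min(scores, key=scores.get)
print(scores, gains, best)
```

`split_b` leaves both children evenly mixed (entropy 1, gain 0), so `split_a` is selected.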

Decision Tree Splitting Method #3: Gini Impurity

Gini Impurity is a method for splitting the nodes when the target variable is categorical. It is one of the most popular and simplest ways to split a decision tree. The Gini Impurity value is:

$$\text{Gini Impurity} = 1 - \text{Gini}$$

Wait – what is Gini?

Gini is the probability of correctly labeling a randomly chosen element if it is labeled randomly according to the distribution of labels in the node. The formula for Gini is:

$$\text{Gini} = \sum_{i=1}^{n} p_i^2$$

And Gini Impurity is:

$$\text{Gini Impurity} = 1 - \sum_{i=1}^{n} p_i^2$$

where $p_i$ is the proportion of samples in the node that belong to class $i$, and $n$ is the number of classes.

The lower the Gini Impurity, the higher the homogeneity of the node. The Gini Impurity of a pure node is zero. Now, you might be thinking: if we already know about Information Gain, why do we need Gini Impurity?

Gini Impurity is often preferred to Information Gain because it does not involve logarithms, which are comparatively expensive to compute.

Here are the steps to split a decision tree with Gini Impurity:

The steps are similar to those for Information Gain:

1. For each split, individually calculate the Gini Impurity of each child node
2. Calculate the Gini Impurity of each split as the weighted average Gini Impurity of child nodes
3. Select the split with the lowest value of Gini Impurity
4. Until you achieve homogeneous nodes, repeat steps 1-3
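The same procedure can be sketched for Gini Impurity. Again, the labels and the names `split_a` and `split_b` are hypothetical and only serve to show the arithmetic.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini Impurity of one node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(children):
    """Weighted average Gini Impurity of the child nodes of a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini_impurity(child) for child in children)


# Hypothetical child nodes produced by two candidate splits of the same parent.
split_a = [["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"]]
split_b = [["yes", "no", "yes", "no"], ["yes", "no", "no", "yes"]]

candidates = {"split_a": split_a, "split_b": split_b}
scores = {name: weighted_gini(children) for name, children in candidates.items()}

# Step 3: the split with the lowest weighted Gini Impurity wins.
best = min(scores, key=scores.get)
print(scores, best)
```

Note that no logarithm is needed here, only counting, division, and squaring.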

Decision Tree Splitting Method #4: Chi-Square

Chi-Square is another method of splitting nodes in a decision tree for datasets with a categorical target variable. It can split a node into two or more child nodes. It is based on the statistical significance of the differences between the parent node and the child nodes.

In its standard (Pearson) form, the Chi-Square value is:

$$\text{Chi-Square} = \frac{(\text{Actual} - \text{Expected})^2}{\text{Expected}}$$

Here, Expected is the expected count for a class in a child node, based on the distribution of classes in the parent node, and Actual is the actual count for that class in the child node.

The above formula gives us the value of Chi-Square for a single class. Take the sum of the Chi-Square values for all the classes in a node to calculate the Chi-Square for that node. The higher the value, the greater the difference between the parent and child nodes, i.e., the higher the homogeneity of the child nodes.

Here are the steps to split a decision tree with Chi-Square:

1. For each split, individually calculate the Chi-Square value of each child node by taking the sum of Chi-Square values for each class in a node
2. Calculate the Chi-Square value of each split as the sum of Chi-Square values for all the child nodes
3. Select the split with the highest Chi-Square value
4. Until you achieve homogeneous nodes, repeat steps 1-3
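The sketch below walks through steps 1-3 for two hypothetical splits. It assumes the standard Pearson form, (Actual - Expected)^2 / Expected, for each class, takes the parent distribution to be the pooled labels of the child nodes, and assumes every child node is non-empty. The function and variable names are illustrative only.

```python
from collections import Counter


def chi_square_of_split(children):
    """Sum of per-class Chi-Square values over all child nodes of a split."""
    # Class counts of the parent node, obtained by pooling the children.
    parent = Counter()
    for child in children:
        parent.update(child)
    n_parent = sum(parent.values())

    total = 0.0
    for child in children:
        actual = Counter(child)
        for cls, parent_count in parent.items():
            # Expected count if the child had the same class mix as the parent.
            expected = len(child) * parent_count / n_parent
            total += (actual[cls] - expected) ** 2 / expected
    return total


# Hypothetical child nodes produced by two candidate splits of the same parent.
split_a = [["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"]]
split_b = [["yes", "no", "yes", "no"], ["yes", "no", "no", "yes"]]

candidates = {"split_a": split_a, "split_b": split_b}
scores = {name: chi_square_of_split(children) for name, children in candidates.items()}

# Step 3: the split with the highest Chi-Square value wins.
best = max(scores, key=scores.get)
print(scores, best)
```

The children of `split_b` have exactly the parent's class mix, so its Chi-Square value is zero and `split_a` is preferred.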
