Home / Machine Learning – Tutorial / Machine Learning- Decision Trees

Decision Trees

Decision tree is a decision tool that uses a tree-like graph to represent their possible consequences or outcomes, including chance event outcomes, resource costs, and effectiveness. It is a like flowchart structure in which each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a decision taken after computing all attributes.

Decision Tree algorithms are considered to be one of the best and mostly used supervised learning methods. Tree type of methods empower predictive models with high accuracy, stability and easy understanding. Decision trees can also map non-linear relationships quite well. They can handle both regression and classification problems. Hence, they are referred as CART (Classification and Regression Trees).

Notation used in Decision trees:

Root Node: It represents whole population or sample and further gets divided into more homogeneous sets.
Splitting: Procedure of dividing a node into two or more sub-nodes.
Decision Node: When a node splits further into sub-nodes, then the node is called decision node.
Leaf/ Terminal Node: Nodes which cannot be split further is called Leaf or Terminal node.
Pruning: Process of removing sub-nodes of a decision node, is called pruning. To make it opposite process of splitting.
Branch / Sub-Tree: A part of entire tree is called branch or sub-tree.
Parent and Child Node: A node, which is split into sub-nodes is called parent whereas sub-nodes are the child of parent node.

Types of Decision Trees

Decision Trees are classified based on type of Target variables. It is classified into two types:

Categorical Variable Decision Tree: Decision Tree which has target as categorical variable is called a Categorical variable decision tree.
Continuous Variable Decision Tree: Decision Tree which has target as continuous variable then it is called Continuous Variable Decision Tree.

Attribute Selection Measures

If the dataset consists of N attributes or features then deciding which attribute to put at the basis and at nodes. By just randomly selecting any node to be the basis can’t solve the difficulty. By following random approach, it’s going to give us bad results with low accuracy or performance.

For selecting attributes, we consider following measures:

Entropy

Information gain

Gini index

Gain Ratio

Reduction in Variance

Chi-Square

These criteria will calculate values for each attribute. The values are sorted, and attributes are placed within the tree by following the order i.e., the attribute with a high value is placed at the basis.

While using Information Gain as a main point, we assume attributes to be categorical, and for the Gini index, attributes are assumed to be continuous.

Entropy

Entropy may be a measure of the randomness within the information being processed. the upper the entropy, the harder it’s to draw any conclusions from that information. Flipping a coin is an example of an action that gives information that’s random.

The Entropy will be maximum when the probability is 0.5 because it displays perfect randomness in the data and there is no chance if perfectly defining the outcome.

A node with zero entropy is known as leaf node.

Mathematically Entropy for 1 attribute is represented as:

Where S → Current state, and Pi → Probability of an event i of state S or Percentage of class i in a node of state S.

Mathematically Entropy for multiple attributes is represented as:

where T→ Current state and X → Selected attribute

Information Gain

Information gain or IG is a statistical property which measures how better the given attribute differentiate the training examples according to their target classification. Building a decision tree is all about finding an attribute that gives the outcome as highest information gain and the smallest entropy.

Information gain is a reduce in entropy. It calculates the difference between entropy before split and average entropy after split of the dataset based on given attribute values.

Mathematically, IG is represented as:

In a much simpler way, we can conclude that:

Where before is the dataset before the split, K is the number of subsets generated by the split, and j after is subset j after the split.

Gini Index

Gini index is a type of cost function which is used to evaluate splits in the dataset. It is computed by subtracting the sum of the squared probabilities of each class from one. It supports larger partitions and easy to implement whereas information gain favors smaller partitions with distinct values.

Gini Index deals with the categorical target variable Success or Failure. It performs only Binary separations.

Gini index is proportional to Homogeneity, higher the value of Gini index higher the homogeneity.

Steps to Calculate Gini index

Compute Gini for sub-nodes, using the formula for success(p) and failure(q) (p²+q²).
Calculate the Gini index for split using the weighted Gini score of each split node.
(CART) Classification and Regression Tree uses the Gini index method to create split points.

Gain ratio

Information gain is biased towards choosing attributes with a large number of values as root nodes which means it only prefers the attribute with a large number of distinct values.

Gain ratio overcomes the limitations of information gain by taking into account the number of branches that would result before making the split. It corrects information gain by taking the intrinsic information of a split into consideration.

Where before is the dataset before the split, K is the number of subsets generated by the split, and j after is subset j after the split.

Reduction in Variance

Reduction in variance is an algorithm which can be used for continuous target variables that is regression problems. This algorithm uses the general formula of variance to choose the best split. The split which has lower variance is selected as the criteria to split the population:

Where X-bar is the mean of the values, X is actual and n is the number of values.

Chi-Square

CHAID abbreviated as Chi-squared Automatic Interaction Detector. It finds out the statistical significance of the differences between child nodes and parent node. We can measure it by the sum of squares of standardized differences between observed and expected frequencies of the target variable.

It deals with the categorical target variable Success or Failure. It can able to perform two or more splits. Higher the Chi-Square value, higher the statistical significance of differences between child node and Parent node.

Mathematically, Chi-squared is represented as:

Steps to Calculate Chi-square for a split:

Compute Chi-square for an individual node by calculating the deviation for Success and Failure.
Calculated Chi-square of Split using Sum of all Chi-square of success and Failure of each split node.

Advantages of Decision Tree:

Easy to Understand: Decision tree output is very easy to understand. It may not require any statistical knowledge to read and analyze them. Its graphical representation is very spontaneous and users can easily relate their hypothesis.
Useful in Data exploration: Decision tree is one of the fastest ways to identify most significant variables and relation between them. With the help of decision trees, we can create new variables / features that has better power to predict target variable. It is also used in data exploration stage.
Decision trees implicitly perform variable broadcast or feature selection.
Decision trees need relatively little effort from users for data preparation.
Less data cleaning required: It does not require more data cleaning as some other modeling techniques. It cannot be influenced by outliers and missing values to a fair degree.
Data type is not a constraint: It can handle both numerical and categorical variables.
Non-Parametric Method: Decision tree is considered to be a non-parametric method which means that decision trees have no assumptions about the space distribution and the classifier structure.
Non-linear relationships between parameters do not affect tree performance.
The number of hyper-parameters to be tuned is almost null.

Disadvantages of Decision Tree:

Over fitting: Decision tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.
Not fit for continuous variables: While working with continuous numerical variables, decision tree tends to lose information.
Decision trees can be unstable because small variations in the data might results to generate a completely different.
Greedy algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.
Decision tree learners create biased trees if some classes dominate.
Information gain in a decision tree with categorical variables gives a biased response for attributes with greater no. of categories which results poor performance.
Generally, it gives low prediction accuracy for a dataset as compared to other machine learning algorithms.
Calculations can become complex when there are many class labels.