
ID3 Algorithm and Hypothesis space in Decision Tree Learning

 

The collection of potential decision trees is the hypothesis space searched by ID3. ID3 searches this hypothesis space in a hill-climbing fashion, starting with the empty tree and moving on to increasingly detailed hypotheses in pursuit of a decision tree that properly classifies the training data.

 

In this blog, we’ll have a look at the Hypothesis space in Decision Trees and the ID3 Algorithm. 

 

ID3 Algorithm: 

The ID3 algorithm (Iterative Dichotomiser 3) is a classification technique that builds a decision tree greedily, at each step picking the attribute that delivers the highest Information Gain (IG), or equivalently the lowest Entropy (H).

 

What are Information Gain and Entropy? 

Information Gain: 

Information gain measures the change in entropy after a dataset is split on an attribute.

 

It establishes how much information a feature provides about a class.

 

A node is split, and the decision tree is grown, according to the value of information gain.

 

A decision tree algorithm always tries to maximize information gain, so the node/attribute with the greatest information gain is split first. 

 

The formula for Information Gain: 

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

 

Entropy is a metric for measuring the degree of impurity in a given attribute. It denotes the randomness in the data. The following formula may be used to compute entropy:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,

S stands for the total set of samples,

P(yes) denotes the probability of a "yes" answer, and

P(no) denotes the probability of a "no" answer.
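To make the two formulas concrete, here is a minimal Python sketch of entropy and information gain for categorical attributes (the function names entropy and information_gain are illustrative, not from any library):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, e.g. ['yes', 'no', 'yes']."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy(S) minus the weighted average entropy of the subsets
    produced by splitting on the attribute at attribute_index."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    weighted_avg = sum((len(subset) / total) * entropy(subset)
                       for subset in subsets.values())
    return entropy(labels) - weighted_avg

# Toy example: one attribute "Wind" (index 0) and a yes/no class label.
rows = [["weak"], ["strong"], ["weak"], ["strong"]]
labels = ["yes", "yes", "no", "yes"]
print(information_gain(rows, labels, 0))   # about 0.311
```

In this toy dataset, splitting on Wind removes about 0.311 bits of uncertainty about the class label.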

 

Steps of the ID3 Algorithm: 

 

Determine the entropy for each of the category values.

 

Calculate the information gain for each feature, split on the attribute with the greatest gain, and repeat the process for every branch until the samples are classified. A recursive sketch of these steps is shown below.
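The sketch below reuses the entropy and information_gain helpers from the earlier example; the names id3 and attribute_indices are illustrative, and the code assumes categorical attributes:

```python
from collections import Counter

def id3(rows, labels, attribute_indices):
    """Grow a decision tree as nested dicts; leaves are class labels.
    Relies on the information_gain helper defined above."""
    if len(set(labels)) == 1:          # pure node: every sample has the same class
        return labels[0]
    if not attribute_indices:          # no attributes left: fall back to a majority vote
        return Counter(labels).most_common(1)[0][0]

    # Greedy step: split on the attribute with the greatest information gain.
    best = max(attribute_indices,
               key=lambda i: information_gain(rows, labels, i))
    remaining = [i for i in attribute_indices if i != best]

    tree = {best: {}}
    for value in set(row[best] for row in rows):
        branch = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branch_rows, branch_labels = zip(*branch)
        tree[best][value] = id3(list(branch_rows), list(branch_labels), remaining)
    return tree
```

Because each split is chosen greedily and never revisited, this search moves through the hypothesis space of decision trees in the hill-climbing fashion described above.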

 

Characteristics of ID3: 

 

Overfitting: 

Good generalization is the desired property in our decision trees (and, indeed, in all classification problems), as we noted before. 

 

This means we want a model fitted to the labeled training data to make predictions that are as accurate on new, unseen observations as they are on the training data.
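One common way to check generalization is to compare accuracy on the training data with accuracy on held-out data. A small sketch using scikit-learn (whose DecisionTreeClassifier implements CART rather than ID3, but the check is the same):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A fully grown tree can memorize the training data, so compare its
# accuracy on the training set with its accuracy on unseen test data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy: ", tree.score(X_test, y_test))
```

A large gap between the two scores is the usual symptom of overfitting; limiting tree depth or pruning narrows it.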

 

Capabilities and Limitations of ID3:

 

Hypothesis Space Search by ID3: 

 

