
Evaluating Hypotheses: Comparing learning algorithms

 

When we train a model, there are many learning algorithms we could use, and we want to choose the algorithm that performs best over the full data distribution, not just over the particular training and testing datasets at hand.

 

To find the best learning algorithm, we compare different learning algorithms against each other. In this blog, we'll look at some considerations we need to keep in mind when comparing learning algorithms.

 

Why rely on statistical methods to compare learning algorithms?

The mean performance of machine learning models is commonly calculated using k-fold cross-validation.
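As a concrete illustration, here is a minimal sketch of that computation, assuming scikit-learn; the synthetic dataset and the two classifiers are illustrative stand-ins, not choices from the original text:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the available sample of examples.
X, y = make_classification(n_samples=500, random_state=0)

# Two candidate learning algorithms, LA and LB.
clf_a = LogisticRegression(max_iter=1000)
clf_b = DecisionTreeClassifier(random_state=0)

# Mean accuracy of each algorithm over k = 10 folds.
scores_a = cross_val_score(clf_a, X, y, cv=10)
scores_b = cross_val_score(clf_b, X, y, cv=10)
print(scores_a.mean(), scores_b.mean())
```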

 

We expect the algorithm with the better average performance to be the better choice. But what if the difference in average performance is only a statistical anomaly arising from the particular data we evaluated on?

 

To determine whether the difference in mean performance between any two algorithms is real or not, a statistical hypothesis test is used.
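To see how such a spurious difference can arise, here is a tiny simulation (a hypothetical setup using NumPy, not from the original text): two algorithms with identical true accuracy still show different observed accuracies on finite test sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two simulated algorithms with the SAME true accuracy of 0.80,
# each evaluated on its own finite test set of 200 examples.
n_test, true_accuracy = 200, 0.80
acc_a = rng.binomial(n_test, true_accuracy) / n_test
acc_b = rng.binomial(n_test, true_accuracy) / n_test

# The observed accuracies usually differ even though the underlying
# performance is identical, so a raw gap in means can mislead.
print(acc_a, acc_b)
```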

 

Comparing Learning Algorithms:

We want to know which of two learning algorithms, LA and LB, is the better approach on average for learning a specific target function f. Here, "on average" means averaging the performance of these two algorithms over all training sets of size n that could be drawn from the underlying instance distribution D.

 

In other words, we want to estimate the expected value of the difference in their errors:

$$\underset{S \subset D}{E}\left[\,error_D(L_A(S)) - error_D(L_B(S))\,\right] \qquad (5.14)$$

 

where L(S) denotes the hypothesis output by learning method L when given the sample S of training data, and the subscript S ⊂ D indicates that the expected value is taken over samples S (of size n) drawn according to the underlying instance distribution D.

 

In practice, when comparing learning algorithms we have only a limited sample D0 of data to work with.

 

D0 can be divided into two sets: a training set S0 and a disjoint test set T0. The training data is used to train both LA and LB (the learning algorithms), and the test data is used to assess the accuracy of the two resulting hypotheses. In other words, we measure the quantity

$$error_{T_0}(L_A(S_0)) - error_{T_0}(L_B(S_0)) \qquad (5.15)$$
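A minimal sketch of this single-split estimator, again assuming scikit-learn with illustrative data and classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Split D0 into a training set S0 and a disjoint test set T0.
X_s0, X_t0, y_s0, y_t0 = train_test_split(X, y, test_size=0.3, random_state=0)

# Train both learning algorithms on S0.
h_a = LogisticRegression(max_iter=1000).fit(X_s0, y_s0)
h_b = DecisionTreeClassifier(random_state=0).fit(X_s0, y_s0)

# error_T0(LA(S0)) - error_T0(LB(S0)), the quantity in Equation (5.15).
diff = (1 - h_a.score(X_t0, y_t0)) - (1 - h_b.score(X_t0, y_t0))
print(diff)
```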

 

There are two major distinctions between this estimator and the quantity in Equation (5.14):

  • First, we use error_T0(h) to approximate error_D(h).
  • Second, rather than taking the expected value of this difference over all samples S drawn from the distribution D, we measure the difference in errors for just one training set S0.

 

To improve on the estimator in Equation (5.15):

  • repeatedly partition the data D0 into disjoint training and test sets, and take the mean of the test-set error differences from these individual trials.

 

The procedure of Table 5.5 partitions D0 into k disjoint test sets T1, T2, ..., Tk of equal size, trains both algorithms on the remaining data Si = D0 − Ti in each iteration, and returns the mean $\bar{\delta}$ of the per-fold error differences $\delta_i$. This quantity $\bar{\delta}$ can be taken as an estimate of the desired quantity from Equation (5.14). More appropriately, we can view $\bar{\delta}$ as an estimate of the quantity

$$\underset{S \subset D_0}{E}\left[\,error_D(L_A(S)) - error_D(L_B(S))\,\right] \qquad (5.16)$$

 

where S represents a random sample of size ((k−1)/k)·|D0| drawn uniformly from D0.
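Here is a minimal sketch of the Table 5.5 procedure, assuming scikit-learn; the dataset and classifiers are illustrative stand-ins for D0, LA, and LB:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
k = 10

# Partition D0 into k disjoint test sets T1, ..., Tk of equal size.
deltas = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    # Si <- D0 - Ti; train both algorithms on Si.
    h_a = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    h_b = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    # delta_i <- error_Ti(h_a) - error_Ti(h_b)
    err_a = 1 - h_a.score(X[test_idx], y[test_idx])
    err_b = 1 - h_b.score(X[test_idx], y[test_idx])
    deltas.append(err_a - err_b)

# The procedure returns the mean of the per-fold differences.
delta_bar = np.mean(deltas)
print(delta_bar)
```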

 

The approximate N% confidence interval for estimating the quantity in Equation (5.16) using $\bar{\delta}$ is given by

$$\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}} \qquad (5.17)$$

where $t_{N,k-1}$ is a constant that plays a role analogous to that of $z_N$ in our earlier confidence interval expressions, and where $s_{\bar{\delta}}$ is an estimate of the standard deviation of the distribution governing $\bar{\delta}$. In particular, $s_{\bar{\delta}}$ is defined as

$$s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} \left(\delta_i - \bar{\delta}\right)^2} \qquad (5.18)$$

 

Notice that the constant $t_{N,k-1}$ in Equation (5.17) has two subscripts. The first specifies the desired confidence level, as it did for our earlier constant $z_N$.

 

The second parameter, called the number of degrees of freedom and usually denoted by ν, is related to the number of independent random events that go into producing the value of the random variable $\bar{\delta}$.

 

In this case, the number of degrees of freedom is k − 1. Table 5.6 contains selected values for the parameter t. Notice that as k → ∞, the value of $t_{N,k-1}$ approaches the constant $z_N$.
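Continuing the earlier sketch, the interval in Equations (5.17) and (5.18) might be computed from the per-fold differences as follows, assuming SciPy (`deltas` is the list built in the previous code block):

```python
import numpy as np
from scipy import stats

def t_confidence_interval(deltas, confidence=0.95):
    """Approximate N% confidence interval of Equation (5.17)."""
    deltas = np.asarray(deltas)
    k = len(deltas)
    delta_bar = deltas.mean()
    # s_delta_bar as defined in Equation (5.18).
    s = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))
    # t_{N, k-1}: two-sided critical value with k - 1 degrees of freedom.
    t = stats.t.ppf(1 - (1 - confidence) / 2, df=k - 1)
    return delta_bar - t * s, delta_bar + t * s
```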

Paired tests are tests in which the hypotheses are assessed over identical samples.

 

Because any variation in observed errors is due to differences between the hypotheses rather than to differences in the makeup of the samples, paired tests typically produce tighter confidence intervals.

 

 

Paired t-Tests:

Consider the following estimation problem to better understand the justification for the confidence interval estimate given by Equation (5.17):

 

 

  • We are given the observed values of a set of independent, identically distributed random variables Y1, Y2, …, Yk.
  • We wish to estimate the mean μ of the probability distribution governing these Yi.
  • Our estimator will be the sample mean $\bar{Y} = \frac{1}{k}\sum_{i=1}^{k} Y_i$.

 

In particular, consider the case in which the individual Yi follow a Normal distribution.

 

In this idealized version of the method, we modify the procedure of Table 5.5 as follows.

 

Instead of drawing from the fixed sample D0, each iteration through the loop generates a new random training set Si and a new random test set Ti by drawing from the underlying instance distribution D.

 

In particular, the $\delta_i$ measured by the procedure now correspond to the independent, identically distributed random variables Yi.

 

The mean 𝜇 of their distribution corresponds to the expected difference in error between the two learning methods [i.e., Equation (5.14)]. 

 

The sample mean $\bar{Y}$ is the quantity $\bar{\delta}$ computed by this idealized version of the method.

  • The t distribution is a bell-shaped distribution similar to the Normal distribution, but wider and shorter to reflect the greater variance introduced by using $s_{\bar{\delta}}$ to approximate the true standard deviation σ.
  • The t distribution approaches the Normal distribution as k approaches infinity.
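Putting this into practice, the paired t-test can be run directly on the per-fold differences. A minimal sketch assuming SciPy (the `deltas` values below are illustrative placeholders; in practice, reuse the list computed by the k-fold sketch above):

```python
from scipy import stats

# Per-fold error differences delta_i from the Table 5.5 procedure.
deltas = [0.02, -0.01, 0.03, 0.01, 0.00, 0.02, 0.04, -0.02, 0.01, 0.02]

# A paired t-test is a one-sample t-test on the differences:
# H0: the mean difference in error between LA and LB is zero.
res = stats.ttest_1samp(deltas, popmean=0.0)
print(res.statistic, res.pvalue)
```

A small p-value indicates that the observed difference in mean error between the two algorithms is unlikely to be a statistical fluke.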

 

Reference

Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997. Chapter 5: Evaluating Hypotheses (Comparing Learning Algorithms).