
Machine Learning – Instance-based Learning: k-Nearest Neighbor Algorithm – 2: Distance-Weighted Nearest Neighbor Algorithm


 

The most fundamental instance-based method is the k-NEAREST NEIGHBOR (kNN) algorithm. It assumes that all instances correspond to points in n-dimensional space. In this blog, we'll take a closer look at distance-weighted kNN and the curse of dimensionality.

 

Weighted kNN is a modified form of the k-nearest neighbors algorithm. The choice of the hyperparameter k is one of the many factors that influence the performance of kNN.

 

The method will be more sensitive to outliers if k is too small. If k is set too high, the neighborhood may contain too many points from different classes.
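As a minimal sketch of how k might be tuned (plain NumPy; the helper names knn_predict and choose_k and the candidate values are illustrative, not part of the original text):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Plain (unweighted) kNN: majority vote among the k closest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances to the query
    nearest = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # most common class among them

def choose_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9, 15)):
    """Pick the k with the highest accuracy on a held-out validation split."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        preds = np.array([knn_predict(X_train, y_train, x, k) for x in X_val])
        acc = np.mean(preds == y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```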

 

For many practical applications, distance-weighted k-NN is a highly effective inductive inference method. It is robust to noisy training data and works very well when given a sufficiently large set of training examples.

 

It can smooth out the influence of isolated noisy training samples by calculating the weighted average of the k neighbors closest to the query location.

 

The inductive bias is the assumption that the classification of an instance xq will be most similar to the classifications of other instances that are nearby in Euclidean distance.

 

In the distance-weighted Nearest Neighbor algorithm, the contribution of each of the k neighbors is weighted according to its distance from the query point xq, with closer neighbors receiving greater weight.

 

 

 

The classification rule weights each neighbor's vote by the inverse square of its distance: f(xq) ← argmax over v of the sum, over the k neighbors, of wi δ(v, f(xi)), where wi = 1 / d(xq, xi)². If the query point xq exactly matches one or more training instances xi, the distance is zero and this weight is undefined; in that case, if f(xi) = c for all such exactly matching training instances, then f(xq) ← c.
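Here is a minimal NumPy sketch of this rule (the function name weighted_knn_predict and the default k are illustrative):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5):
    """Distance-weighted kNN: each of the k nearest neighbors votes with weight 1 / d(xq, xi)^2."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]

    # Special case: the query exactly matches one or more training points,
    # so the 1/d^2 weight would be undefined; return their (majority) label.
    exact = nearest[dists[nearest] == 0.0]
    if exact.size > 0:
        labels, counts = np.unique(y_train[exact], return_counts=True)
        return labels[np.argmax(counts)]

    weights = 1.0 / dists[nearest] ** 2
    votes = {}
    for idx, w in zip(nearest, weights):
        votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + w
    return max(votes, key=votes.get)   # class with the largest total weight
```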

 

The only drawback of considering all training examples is that the classifier will run more slowly. When all training instances are considered in classifying a new query instance, the algorithm is referred to as a global method.

 

If only the training examples nearest to the query are considered, we call it a local method. When the rule is applied as a global method to all training examples, it is known as Shepard's method.
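Assuming the weighted_knn_predict sketch above and some training arrays X_train, y_train and a query point x_query, the only difference between the local and the global (Shepard's) variant is how many training examples are allowed to vote:

```python
# Local method: only the 5 nearest neighbors contribute to the vote.
y_local = weighted_knn_predict(X_train, y_train, x_query, k=5)

# Global method (Shepard's method): every training example contributes,
# with influence falling off as 1 / distance^2.
y_global = weighted_knn_predict(X_train, y_train, x_query, k=len(X_train))
```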

 

Let’s have a look at the CURSE of Dimensionality.

 

CURSE of Dimensionality:

The difficulty produced by the exponential rise in volume associated with adding more dimensions to Euclidean space is known as the curse of dimensionality.

 

In practice, the error tends to grow as the number of features grows. The term also refers to the fact that algorithms are harder to design for high-dimensional data and typically have running times that grow with the number of dimensions.

 

A larger number of dimensions theoretically allows for more information to be stored, but in practice, it seldom helps since real-world data contains more noise and redundancy.
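One quick way to see this effect is to sample random points in a unit hypercube and compare the nearest and farthest distances from a query point; in this small NumPy sketch (dimensions and sample counts are arbitrary), the ratio approaches 1 as the dimensionality grows, meaning all points start to look almost equally far away:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))          # 1000 random points in the d-dimensional unit cube
    q = rng.random(d)                  # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    # As d grows, min/max -> 1: the contrast between near and far neighbors vanishes.
    print(f"d={d:4d}  nearest/farthest = {dists.min() / dists.max():.3f}")
```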

 

Consider a k-NN setting in which each instance is described by 20 attributes, but only two of them are relevant to determining the classification for the particular target function.

 

In this situation, instances that have identical values for the two relevant attributes may nevertheless be far apart in the 20-dimensional instance space. As a result, the k-NEAREST NEIGHBOR similarity metric, which depends on all 20 attributes, will be misleading.

 

The large number of irrelevant attributes will dominate the distance between neighbors. The curse of dimensionality refers to this difficulty that arises when many irrelevant attributes are present, and nearest-neighbor techniques are particularly vulnerable to it.
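To make this concrete with made-up data (all names, sizes, and the noise level below are illustrative): the label depends only on two informative features, and appending 18 irrelevant noise features noticeably hurts a plain kNN classifier.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, n_noise):
    """Two informative features determine the class; n_noise irrelevant features are appended."""
    X_info = rng.normal(size=(n, 2))
    y = (X_info[:, 0] + X_info[:, 1] > 0).astype(int)
    X_noise = rng.normal(size=(n, n_noise))
    return np.hstack([X_info, X_noise]), y

def knn_accuracy(X_train, y_train, X_test, y_test, k=5):
    correct = 0
    for x, y_true in zip(X_test, y_test):
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        correct += int(np.argmax(np.bincount(y_train[nearest])) == y_true)
    return correct / len(y_test)

for n_noise in (0, 18):      # 2 relevant attributes alone vs. 2 relevant + 18 irrelevant
    X, y = make_data(600, n_noise)
    print(n_noise, knn_accuracy(X[:400], y[:400], X[400:], y[400:]))
```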

 

Why is it difficult to analyze high-dimensional data?

The difficulty in analyzing high-dimensional data arises from the combination of two factors. First, high-dimensional spaces have geometric properties that are counter-intuitive and far removed from those of the two- or three-dimensional spaces we are used to. Second, data analysis tools are usually designed with intuitions and examples from low-dimensional spaces in mind.

The problem is that those tools are also used when the data is high-dimensional and more complex, which increases the risk of losing intuition about a tool's behavior and drawing inaccurate conclusions.

 

How to Overcome the Dimensionality Curse

1. Weigh each property differently:

One intriguing approach to solving this problem is to weight each attribute differently when computing the distance between two instances.

 

This is equivalent to stretching the axes in Euclidean space: shortening the axes that correspond to less relevant attributes and lengthening the axes that correspond to more relevant ones.

 

A cross-validation technique may be used to automatically calculate the amount by which each axis should be extended.

 

To see how, suppose we stretch (multiply) the jth axis by some factor zj, where the values z1, ..., zn are chosen to minimize the true classification error of the learning algorithm. This true error can, in turn, be estimated using cross-validation.

 

One approach, then, is to select a random subset of the available data as training examples and determine the values of z1, ..., zn that result in the lowest error when classifying the remaining examples.

 

Repeating this process several times improves the estimate of these weighting factors. Stretching the axes in this way provides a mechanism for reducing the influence of irrelevant attributes on the k-NN distance metric.
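A rough sketch of this scheme (random search stands in for a more careful optimizer; the function names and ranges are illustrative): candidate stretch factors z1, ..., zn are drawn at random, scored on a held-out half of the data, and the best-scoring set is kept.

```python
import numpy as np

rng = np.random.default_rng(2)

def stretched_knn_error(z, X_train, y_train, X_val, y_val, k=5):
    """Validation error of kNN when the jth axis is stretched (multiplied) by z[j]."""
    Xt, Xv = X_train * z, X_val * z
    errors = 0
    for x, y_true in zip(Xv, y_val):
        nearest = np.argsort(np.linalg.norm(Xt - x, axis=1))[:k]
        errors += int(np.argmax(np.bincount(y_train[nearest])) != y_true)
    return errors / len(y_val)

def search_stretch_factors(X, y, n_trials=200, k=5):
    """Random search over stretch factors, scored on a random train/validation split."""
    idx = rng.permutation(len(X))
    tr, va = idx[: len(X) // 2], idx[len(X) // 2:]
    best_z, best_err = np.ones(X.shape[1]), np.inf
    for _ in range(n_trials):
        z = rng.uniform(0.0, 2.0, size=X.shape[1])     # candidate axis-stretch factors z1..zn
        err = stretched_knn_error(z, X[tr], y[tr], X[va], y[va], k)
        if err < best_err:
            best_z, best_err = z, err
    return best_z      # repeat with fresh random splits to refine the estimate
```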

 

2. Remove the attributes that aren’t important

A more drastic option is to remove the least relevant attributes from the instance space entirely. This corresponds to setting some of the zi scaling factors to zero.

 

Moore and Lee (1994) show how efficient cross-validation methods can be used to identify relevant subsets of the attributes.

 

In particular, they explore strategies based on leave-one-out cross-validation, in which a collection of m training examples is repeatedly divided, in all possible ways, into a training set of size m − 1 and a test set of size 1.

 

Because no extra training effort is required each time the training set is redefined, this leave-one-out strategy is simple to apply in k-NEAREST NEIGHBOR algorithms.
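As a hedged sketch of this idea (greedy backward elimination is a simple stand-in for the efficient subset-selection methods referred to above; the names are illustrative), the leave-one-out error of kNN is cheap to compute because no retraining is needed when one example is held out:

```python
import numpy as np

def loo_error(X, y, feature_mask, k=3):
    """Leave-one-out kNN error using only the attributes where feature_mask is True."""
    Xf = X[:, feature_mask]
    errors = 0
    for i in range(len(X)):
        dists = np.linalg.norm(Xf - Xf[i], axis=1)
        dists[i] = np.inf                               # hold example i out as the test set of size 1
        nearest = np.argsort(dists)[:k]
        errors += int(np.argmax(np.bincount(y[nearest])) != y[i])
    return errors / len(X)

def backward_feature_selection(X, y, k=3):
    """Greedily drop the attribute whose removal most reduces the leave-one-out error."""
    mask = np.ones(X.shape[1], dtype=bool)
    best = loo_error(X, y, mask, k)
    improved = True
    while improved and mask.sum() > 1:
        improved = False
        for j in np.flatnonzero(mask):
            trial = mask.copy()
            trial[j] = False
            err = loo_error(X, y, trial, k)
            if err < best:
                best, mask, improved = err, trial, True
    return mask          # boolean mask over the attributes judged relevant
```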

 

One could also stretch each axis by a different amount at different locations over the instance space. However, as the number of degrees of freedom available to the algorithm for redefining its distance metric grows, so does the risk of overfitting. As a result, stretching the axes locally is a much less common strategy.

 

