K-Nearest Neighbors is one of the most important classification algorithms in machine learning. It belongs to the supervised learning family and is widely applied in pattern recognition, data mining and intrusion detection.
The k-nearest-neighbor is a “lazy learner” algorithm: it does not build a model from the training set until a prediction is actually requested.
The KNN algorithm can be used for both classification and regression problems, though it is more widely used for classification. When building a KNN model, only a few parameters need to be chosen to improve the model's performance.
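To make the idea concrete, here is a minimal sketch of KNN classification in plain Python (the function name `knn_predict` and the toy data are illustrative, not from the original text): find the k training points nearest to the query and take a majority vote of their labels.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    dists.sort(key=lambda pair: pair[0])
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.0), (4.2, 3.9)]
y = ["a", "a", "b", "b"]
print(knn_predict(X, y, (1.1, 1.0), k=3))  # two "a" points are closest → "a"
```

Note that "training" here is just storing the data; all the work happens at prediction time, which is exactly what makes KNN a lazy learner.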
K, the number of neighbors
There is no single best value for K in the algorithm. For classification models, especially when only two classes are present, an odd number is generally chosen for K so that votes cannot tie. Increasing the value of K tends to smooth the decision boundaries and avoid overfitting, at the cost of some resolution.
There are different methods for measuring the distance between two points, and the differences between them become more important in higher dimensions. The most commonly used method is Euclidean distance, which applies the Pythagorean theorem. Another metric is Manhattan distance, which sums the distance travelled along each axis instead of along the diagonal. A third distance measure is Minkowski distance, which generalizes the other two.
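The Minkowski formula itself appears to have been lost from the text; its standard form, consistent with the q parameter discussed next, is:

```latex
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{q} \right)^{1/q}
```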
When the value of q is set to 1, the Minkowski formula is the same as Manhattan distance, and when q is set to 2 it reduces to the Euclidean distance measure.
An important point to note is that all three distance measures are valid only for continuous variables. For categorical variables, we have to use the Hamming distance instead. When a dataset contains a mix of numerical and categorical variables, this raises the need to standardize the numerical variables to the range between 0 and 1.
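The three metrics above, plus Hamming distance for categorical values, can be sketched in a few lines (the function names are illustrative):

```python
def minkowski(x, y, q):
    """Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def hamming(x, y):
    """Hamming distance: number of positions where categorical values differ."""
    return sum(a != b for a, b in zip(x, y))

print(minkowski((0, 0), (3, 4), q=2))          # Euclidean: 5.0
print(minkowski((0, 0), (3, 4), q=1))          # Manhattan: 7.0
print(hamming(("red", "S"), ("red", "M")))     # 1 differing category
```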
By examining the data thoroughly, we can choose the optimal value for K. Generally, a larger value of K is preferable because it reduces the overall noise, though this is not guaranteed. Another way to determine the optimal value is cross-validation, which evaluates candidate K values on an independent dataset. For most datasets, the optimal value of K is usually between 3 and 10.
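One simple form of cross-validation for choosing K is leave-one-out: predict each point from all the others and pick the K with the highest accuracy. A minimal sketch, with illustrative toy data and function names:

```python
from collections import Counter
import math

def knn_predict(X, y, x_new, k):
    """Majority vote among the k nearest training points."""
    dists = sorted((math.dist(xi, x_new), yi) for xi, yi in zip(X, y))
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

def loo_accuracy(X, y, k):
    """Leave-one-out cross-validation: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        hits += knn_predict(X_rest, y_rest, X[i], k) == y[i]
    return hits / len(X)

X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["a", "a", "a", "b", "b", "b"]
best_k = max([3, 5], key=lambda k: loo_accuracy(X, y, k))
print(best_k)  # k=3 classifies every held-out point correctly; k=5 does not
```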
With weights, points that are nearer count more than points that are farther away. The algorithm still looks at all k nearest neighbors, but the closer neighbors carry more of the vote than those farther away. This isn't a perfect solution, and there is still a risk of overfitting the data.
Now our predictions extend to the edge of the data set, and they move much closer to the individual points. The weighted method works reasonably when you are between points, but as you get closer to any particular point, that point's value has more influence on the algorithm's prediction. If you get close enough to a point, it is almost like setting K to one, since that point dominates the prediction.
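Distance weighting can be implemented by weighting each neighbor's vote by the inverse of its distance; one common sketch (the epsilon term is a small assumption to avoid division by zero when the query coincides with a training point):

```python
import math
from collections import defaultdict

def weighted_knn_predict(X, y, x_new, k=3, eps=1e-9):
    """Vote with weight 1/distance so closer neighbors count more."""
    dists = sorted((math.dist(xi, x_new), yi) for xi, yi in zip(X, y))
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + eps)  # eps avoids division by zero
    return max(votes, key=votes.get)

X = [(0, 0), (1, 0), (1, 1)]
y = ["a", "b", "b"]
# A plain majority among k=3 would pick "b"; the much closer "a" point wins.
print(weighted_knn_predict(X, y, (0.05, 0.0), k=3))  # → "a"
```

This illustrates the behavior described above: very near a training point, that point's weight dwarfs the others, much like K = 1.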
Finally, KNN models do not work well when different feature variables have different scales. We have to scale or normalize the features for the model to make good predictions.
- KNN models are easy to implement.
- It can handle non-linearities well.
- The model fits quickly.
- No parameters or values need to be calculated during training.
- While the model is quick to set up, it is slow to predict: to predict an outcome for a new value, it must search through all the points in its training dataset to find the nearest ones.
- For large datasets, KNN can therefore be a relatively slow method.
- It is a lazy learner method.