K Fold Cross Validation
Cross Validation is a technique used to evaluate machine learning models on a limited data sample. It involves reserving a specific sample of the dataset on which the model is not trained, and then using that sample to test the model. It is commonly used in machine learning to compare and choose a model for a given predictive modeling problem, because it is easy to understand and implement. Its skill estimates are also generally less biased, that is, less optimistic, than those from other methods such as a single train/test split.
The K Fold Cross Validation method ensures that every observation from the original dataset appears in both the training set and the test set at some point: each observation is used for testing exactly once and for training K-1 times. This makes it one of the best methods when the input data is very limited.
The K Fold Cross Validation procedure has a single parameter, K, which is the number of groups (folds) that the data sample is split into; this is where the name K Fold Cross Validation comes from. When a specific value for K is chosen, it may be used in place of K in the name of the procedure, so that K=10 becomes 10-fold cross validation.
The steps followed in K Fold Cross Validation are discussed below:
- Split the entire data randomly into K folds. K should be neither too small nor too large; a value between 5 and 10 is a common choice, depending on the size of the data.
- A higher value of K gives a less biased estimate of model performance, but the estimate has larger variance and the procedure costs more, because more models are trained. A lower value of K behaves more like a single train/test split.
- For each of the K folds, build the model on the other K-1 folds and validate it on the held-out fold.
- Record the score or error of the predictions on each held-out fold.
- Repeat this process until each of the K Folds has served as the test set.
- The average of the K recorded errors is called the cross validation error, and it serves as the performance metric of the model.
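The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`k_fold_indices`, `cross_validate`), the toy "predict the training mean" model, and the use of mean squared error as the score are all hypothetical choices made for the example.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, predict, seed=0):
    """Return the per-fold mean squared errors and their average."""
    folds = k_fold_indices(len(xs), k, seed)
    fold_errors = []
    for i, test_idx in enumerate(folds):
        # Build the model on the other K-1 folds.
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        # Validate on the held-out fold and record its error.
        errors = [(predict(model, xs[j]) - ys[j]) ** 2 for j in test_idx]
        fold_errors.append(mean(errors))
    # The average of the K recorded errors is the cross validation error.
    return fold_errors, mean(fold_errors)

# Hypothetical toy model: always predicts the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_errors, cv_error = cross_validate(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
print(len(fold_errors))  # one recorded error per fold
print(cv_error)          # cross validation error
```

Passing `fit` and `predict` as functions keeps the splitting logic independent of the model, so the same loop can be reused to compare several candidate models on identical folds.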
Finally, the model is tested on the reserved sample before it is finalized. In short, Cross Validation is a statistical approach for estimating the skill of models.
There are a number of variations of the K Fold Cross Validation procedure. The commonly used variations are explained below:
- Train/Test Split
At one extreme, K may be set to 2 (not 1, since K=1 would not split the data at all), so that a single train/test split is created to evaluate the model.
- LOOCV (Leave One Out Cross Validation)
At the other extreme, K may be set to the total number of observations in the dataset, so that each observation is held out of the training data exactly once. This is called Leave One Out Cross Validation (LOOCV).
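As a rough sketch of Leave One Out Cross Validation, the following plain Python generates one split per observation, which is the same as setting K to the number of observations. The function names and the toy "predict the training mean" model are assumptions made for illustration only.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_idx, test_idx) pairs; each observation is the test set exactly once."""
    for held_out in range(n_samples):
        train_idx = [j for j in range(n_samples) if j != held_out]
        yield train_idx, [held_out]

def loocv_error(ys):
    """Mean squared error of a toy 'predict the training mean' model under LOOCV."""
    squared_errors = []
    for train_idx, test_idx in leave_one_out_splits(len(ys)):
        # The model is rebuilt for every split, using all but one observation.
        prediction = sum(ys[j] for j in train_idx) / len(train_idx)
        squared_errors.append((prediction - ys[test_idx[0]]) ** 2)
    return sum(squared_errors) / len(squared_errors)

ys = [1.0, 2.0, 3.0, 4.0]
splits = list(leave_one_out_splits(len(ys)))
print(len(splits))      # one split per observation, so K equals n
print(loocv_error(ys))
```

Because a model is rebuilt once per observation, the cost of this variation grows with the size of the dataset, which is why it tends to be reserved for small samples.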
- Stratified Cross Validation
The splitting of data into folds may be governed by criteria such as ensuring that every fold has the same proportion of observations with a given categorical value, for example the class outcome value. This procedure is called Stratified Cross Validation.
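One simple way to approximate stratification is to group the observations by class and deal each class's indices across the folds in turn. The sketch below assumes this round-robin scheme; the function name `stratified_folds` and the `spam`/`ham` labels are hypothetical, and real libraries may balance the folds differently.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's indices round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical imbalanced dataset: 10 "spam" and 40 "ham" observations.
labels = ["spam"] * 10 + ["ham"] * 40
folds = stratified_folds(labels, k=5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

With these class counts, every fold receives 2 `spam` and 8 `ham` observations, matching the 1:4 ratio of the full dataset. When class counts are not divisible by K, this sketch only keeps the ratios approximately equal.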
- Repeated Cross Validation
In this variation, the K Fold Cross Validation procedure is repeated n times. The data sample is shuffled before each repetition, which results in a different split of the sample each time.
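The repeated procedure can be sketched as an outer loop around ordinary K-fold splitting, with a fresh shuffle before each repetition. The generator name `repeated_k_fold` and its parameters are assumptions for this example rather than an established API.

```python
import random

def repeated_k_fold(n_samples, k, n_repeats, seed=0):
    """Yield (repeat, fold_no, train_idx, test_idx), reshuffling before each repeat."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a new shuffle gives a new split for this repeat
        folds = [indices[i::k] for i in range(k)]
        for fold_no, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != fold_no for j in fold]
            yield repeat, fold_no, train_idx, test_idx

splits = list(repeated_k_fold(n_samples=12, k=3, n_repeats=4))
print(len(splits))  # k * n_repeats train/test splits in total
```

Averaging the scores over all K times n splits would then give a single estimate that depends less on any one random partition of the data.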