Overfitting and Underfitting
Overfitting in Machine Learning
Overfitting refers to a model that learns the training data too well.
It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data. In other words, the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data, which harms the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models, which have more flexibility when learning a target function. For this reason, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain the amount of detail the model learns.
A common cause of overfitting is the use of nonparametric, nonlinear methods: because these machine learning algorithms have more freedom in building the model from the dataset, they can build unrealistic models.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible and prone to overfitting the training data. This problem can be addressed by pruning the tree after it has been learned, removing some of the detail it has picked up.
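As a minimal sketch of this idea (using scikit-learn and synthetic data not taken from the article), limiting a tree's depth is a simple form of pruning that constrains how much detail the model can memorize:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data: a sine wave plus random fluctuations
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 200).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.3, size=200)

# An unconstrained tree grows until it memorizes the noise ...
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
# ... while limiting depth (a simple form of pruning) constrains the detail learned
pruned = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

print("leaves:", deep.get_n_leaves(), "vs", pruned.get_n_leaves())
print("training R^2:", round(deep.score(X, y), 3), "vs", round(pruned.score(X, y), 3))
```

The unconstrained tree fits the training data almost perfectly, a telltale sign that it has learned the noise as well as the signal; the pruned tree's lower training score is the price of better generalization.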
Underfitting in Machine Learning
Underfitting refers to a model that can neither learn the training data nor generalize to new data. A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data, which lowers the accuracy of the model. Underfitting simply means that the model does not fit the data well. It generally happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases the rules the model can express are too simple for the data, so it will probably make many wrong predictions, especially on testing data that goes beyond the training data. Underfitting can be avoided by gathering more data and by increasing the model's flexibility, for example by adding more informative features or using a less constrained model.
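A hedged illustration of the linear-model-on-nonlinear-data case (scikit-learn, synthetic data chosen for this sketch): a straight line cannot capture a quadratic trend, while adding a squared feature gives the model the flexibility it needs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear (quadratic) data
rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, (300, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=300)

# A straight line cannot capture the curvature: underfitting
linear = LinearRegression().fit(X, y)
# A degree-2 polynomial model captures the trend
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))
print("quadratic R^2:", round(quadratic.score(X, y), 3))
```

The linear model scores near zero even on its own training data, the signature of underfitting: the model is too simple for the trend, not merely unlucky on new data.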
The basic method for avoiding overfitting and underfitting is validation.
We need to create a model with the best settings (here, the polynomial degree), but we don't want to repeatedly cycle through training and testing on the test data. We need some sort of pre-check to use for model optimization and evaluation. This pre-check is known as a validation set.
A basic approach is to use a validation set in addition to the training and testing sets. This raises a few problems of its own: we could end up overfitting to the validation set, and we would have less training data. A smarter way to implement the validation concept is k-fold cross-validation.
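The basic three-way split can be sketched as follows (scikit-learn's `train_test_split`, with toy data and illustrative 60/20/20 proportions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset of 100 samples
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve off a held-out test set (20%) ...
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ... then split the remainder into training (60%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Note how the validation set eats into the data available for training, which is exactly the drawback cross-validation addresses.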
The idea of cross-validation is that rather than using a separate validation set, we split the training set into a number of subsets, called folds.
For example, let's take five folds. We perform a series of train-and-evaluate rounds: each time we train on 4 of the folds and evaluate on the 5th, called the hold-out fold. We repeat this process 5 times, each time using a different fold for evaluation. Finally, we average the scores across the folds to estimate the overall performance of a given model. This lets us optimize the model before deployment without using additional data.
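The rotation described above can be sketched with scikit-learn's `KFold` (toy data, 20 samples in 5 folds of 4):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # 20 toy samples

kf = KFold(n_splits=5, shuffle=True, random_state=0)
splits = list(kf.split(X))
for i, (train_idx, hold_idx) in enumerate(splits):
    # Each round trains on 4 folds (16 samples) and holds out the 5th (4 samples)
    print(f"round {i + 1}: train on {len(train_idx)}, hold out {len(hold_idx)}")
```

Every sample appears in exactly one hold-out fold across the 5 rounds, so all of the data contributes to both training and evaluation.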
Here, we can use cross-validation to choose the best model: we create models with a range of degrees and evaluate each one using 5-fold cross-validation. The model with the lowest cross-validation error should perform best on the testing data and achieve a balance between underfitting and overfitting. To cover a wide range, we can try degrees from 1 to 40. To compare models, we compute the mean squared error, the average squared distance between the predictions and the real values. The table below shows the cross-validation results ordered by lowest error, and the graph shows all the results with error on the y-axis.
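This degree search can be sketched with scikit-learn (synthetic sine data, and a smaller degree range of 1-10 than the article's 1-40, to keep the example short):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy sine data (illustrative, not the article's dataset)
rng = np.random.RandomState(42)
X = rng.uniform(0, 1, 120).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=120)

cv_mse = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # cross_val_score maximizes a score, so sklearn exposes negated MSE
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best = min(cv_mse, key=cv_mse.get)
print("best degree:", best, "cv mse:", round(cv_mse[best], 4))
```

The degree-1 model underfits the sine curve and its cross-validation error stays well above the noise floor; the best degree sits in between the extremes.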
Cross Validation Results
The chart shows the cross-validation error for the underfit and overfit models. A model with 4 degrees appears to be optimal. To test this result, we can build a 4-degree model and view its training and testing predictions.
To verify that we have the optimal model, we can also plot the training and testing curves. These curves show the model setting we tuned (the polynomial degree) on the x-axis and both the training and testing error on the y-axis. An underfit model will have high training and high testing error, while an overfit model will have extremely low training error but high testing error.
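The numbers behind such curves can be computed as follows (a sketch with scikit-learn and synthetic sine data; a small training set is used deliberately so that overfitting shows up at high degrees):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)

def make_data(n):
    """Noisy sine samples (illustrative data, not the article's)."""
    X = rng.uniform(0, 1, n).reshape(-1, 1)
    y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.3, size=n)
    return X, y

X_train, y_train = make_data(30)   # small training set makes overfitting visible
X_test, y_test = make_data(200)

train_err, test_err = {}, {}
for d in range(1, 16):
    model = make_pipeline(PolynomialFeatures(d), LinearRegression()).fit(X_train, y_train)
    train_err[d] = mean_squared_error(y_train, model.predict(X_train))
    test_err[d] = mean_squared_error(y_test, model.predict(X_test))

# Training error keeps falling with flexibility; testing error falls, then rises
print("train:", round(train_err[1], 3), "->", round(train_err[15], 3))
print("test:", round(test_err[1], 3), "min", round(min(test_err.values()), 3),
      "degree 15:", round(test_err[15], 3))
```

Plotting `train_err` and `test_err` against the degree reproduces the characteristic picture: both errors high on the left (underfitting), and a widening gap between low training error and high testing error on the right (overfitting).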
This graph illustrates the problem of overfitting and underfitting well. As the flexibility of the model increases with the polynomial degree, the training error continually decreases. However, the error on the testing set only decreases as we add flexibility up to a certain point; in this case, that occurs at 5 degrees. As the flexibility increases beyond this point, the testing error increases because the model has memorized the training data along with its noise. Although the exact metrics depend on the testing set, the best model from cross-validation will generally outperform the other models.