Model selection is the process of choosing between different machine learning methods, or between different hyperparameters or feature sets for the same method, given a training dataset.
There are different approaches to model selection for a machine learning model. They are discussed below.
- Hyperparameters
- Learning method
- Model Evaluation
We can select the best hyperparameters for a specific machine learning method. Hyperparameters are the parameters of the learning method that we have to specify before fitting the model, whereas model parameters are the parameters that result from the fitting.
In a logistic regression model, for example, the regularization strength is a hyperparameter that has to be specified before fitting, whereas the coefficients of the fitted model are the model parameters. Finding the right hyperparameters plays a crucial role in how well the model performs on the given data.
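To make the distinction concrete, here is a minimal sketch of a one-feature logistic regression trained by gradient descent with an L2 penalty. The function name `fit_logistic`, the toy data, the learning rate and the epoch count are all assumptions made for illustration; only `reg_strength` plays the role of the hyperparameter discussed above.

```python
import math

def fit_logistic(xs, ys, reg_strength, lr=0.1, epochs=500):
    """Fit a 1-feature logistic regression with an L2 penalty by gradient descent.

    reg_strength is the hyperparameter (chosen before fitting);
    the returned (weight, bias) are the model parameters (learned from data).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += (p - y)
        # Average log-loss gradient plus the gradient of the L2 penalty on w.
        w -= lr * (grad_w / n + reg_strength * w)
        b -= lr * (grad_b / n)
    return w, b

# Toy, linearly separable data (illustrative only).
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

weak_w, weak_b = fit_logistic(xs, ys, reg_strength=0.01)
strong_w, strong_b = fit_logistic(xs, ys, reg_strength=1.0)
print("weak regularization   ->", weak_w, weak_b)
print("strong regularization -> ", strong_w, strong_b)
```

The same data gives different coefficients depending on the value picked for `reg_strength`, which is why that value has to be selected rather than learned. Libraries may parameterize it differently; scikit-learn's `LogisticRegression`, for instance, takes the inverse of the regularization strength as `C`.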
We may also choose the best learning method, together with its optimal hyperparameters, for a particular machine learning problem. We will refer to this as algorithm selection.
To get the best performance, we need to choose one model among several candidates by evaluating their performance. A simple way is to split the data randomly into a training dataset and a testing dataset, for example in a 70%/30% ratio, then train the model on the training dataset and observe its performance on the testing dataset. It is better still to also hold out a validation dataset, for example by splitting the data into 60% training, 20% validation and 20% testing. Then, in addition to the test error, we can also measure the validation error.
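A random 60/20/20 split like the one described can be sketched with the standard library alone. The helper name `train_val_test_split` and the fixed seed are assumptions for the example.

```python
import random

def train_val_test_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of data and cut it into train / validation / test parts."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left over
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, so different models are compared on exactly the same validation and test examples.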
The validation set is mainly used to tune hyperparameters. We don’t tune them on the training set because that can result in overfitting, and we don’t tune them on the testing data because that results in an overly optimistic estimate of generalization. Hence, we keep a separate validation set for hyperparameter tuning. We can also use these errors to diagnose a model that is not performing well. If the training error is large and the validation or test error is also large, the model is underfitting (high bias). If the training error is small but the validation or test error is large, the model is overfitting (high variance).
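The diagnosis rule can be written down as a small helper. The thresholds `high_error` and `max_gap` below are purely illustrative assumptions; what counts as a "large" error depends on the task and the metric.

```python
def diagnose(train_error, val_error, high_error=0.2, max_gap=0.1):
    """Label a fit from its training and validation errors.

    high_error and max_gap are illustrative thresholds, not standard values.
    """
    if train_error >= high_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting (high bias)"
    if val_error - train_error > max_gap:
        # Good on training data, much worse on held-out data.
        return "overfitting (high variance)"
    return "reasonable fit"

print(diagnose(0.35, 0.38))
print(diagnose(0.02, 0.30))
print(diagnose(0.05, 0.08))
```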
The test dataset is used to estimate the generalization error. It cannot be used for training in any sense, which includes tuning hyperparameters. We should not evaluate on the test dataset and then go back and tweak things, since that gives an overly optimistic estimate of the generalization error.
Some of the ways of evaluating a model’s performance on our known data are:
In k-fold cross-validation, the training dataset is divided into k folds. In each of k rounds, we train on k-1 folds and validate on the remaining fold, then average the k results. There is also leave-one-out cross-validation, where k=n and n is the number of data points.
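The following sketch shows one way to generate the folds and average the validation error. The names `k_fold_indices` and `cross_val_score`, and the toy "predict the mean" model, are assumptions for the example; in practice the data is usually shuffled before being cut into folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_score(xs, ys, k, fit, error):
    """Average the validation error over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(error(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mean_abs_error(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x for x in xs]
print(cross_val_score(xs, ys, k=5, fit=fit_mean, error=mean_abs_error))
# Leave-one-out is the special case k == number of data points.
print(cross_val_score(xs, ys, k=len(xs), fit=fit_mean, error=mean_abs_error))
```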
In bootstrapping, new datasets are generated by sampling the original dataset with replacement. The model is trained on each bootstrapped dataset and validated on the data points that were not selected.
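A minimal sketch of bootstrap validation, assuming a toy model that predicts the mean of its training targets. The helper name `bootstrap_split`, the number of rounds and the seed are illustrative choices.

```python
import random

def bootstrap_split(n_samples, rng):
    """Sample n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(0)
ys = [float(v) for v in range(20)]

errors = []
for _ in range(200):
    in_bag, out_of_bag = bootstrap_split(len(ys), rng)
    if not out_of_bag:  # very unlikely, but then there is nothing to validate on
        continue
    # Toy model: predict the mean of the bootstrapped targets.
    prediction = sum(ys[i] for i in in_bag) / len(in_bag)
    oob_error = sum(abs(ys[i] - prediction) for i in out_of_bag) / len(out_of_bag)
    errors.append(oob_error)

print(sum(errors) / len(errors))
```

Because sampling is done with replacement, some points appear several times in a bootstrapped dataset and others not at all; the latter are the "unselected" points used for validation.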
Validation and Testing
Validation is the phase in which we tune our model and its hyperparameters. After tuning, we have to test the model on a new set of data. This simulates the model’s performance on completely new data, which is the most important quality of a model.
Evaluating Regression models
Mean absolute error
Median absolute error
Root mean squared error
Coefficient of determination
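The four regression metrics listed above can be computed directly from paired true and predicted values. This is a plain-Python sketch; the function name `regression_metrics` and the sample values are assumptions for illustration.

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """Return MAE, median absolute error, RMSE and R^2 for paired values."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    abs_res = [abs(r) for r in residuals]
    mae = sum(abs_res) / len(abs_res)
    medae = statistics.median(abs_res)
    ss_res = sum(r * r for r in residuals)          # residual sum of squares
    rmse = math.sqrt(ss_res / len(residuals))
    mean_true = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "medae": medae, "rmse": rmse, "r2": r2}

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(regression_metrics(y_true, y_pred))
```

The median absolute error is less sensitive to a few large mistakes than the mean absolute error, while the root mean squared error penalizes large mistakes more heavily than either.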
Evaluating classification models
Area under the curve (AUC)
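For a binary classifier, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way; the function name `roc_auc` and the sample data are assumptions, and the pairwise loop is meant for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.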
Hyperparameter selection plays a key role in model selection. Without a reliable and practical systematic process for optimizing hyperparameters, it is nothing less than an art. However, there are some automated techniques that are quite useful in this regard.
Grid search means searching through combinations of different hyperparameter values and observing which combination performs best. Usually, each hyperparameter is searched over a specific interval or scale (for example linear or logarithmic), depending on the hyperparameter. Because each combination is evaluated independently, the search is easy to parallelize.
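A grid search can be sketched in a few lines with `itertools.product`. The scoring function `toy_validation_score` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation score", and the grid values are illustrative.

```python
from itertools import product

def grid_search(param_grid, score):
    """Evaluate every combination in param_grid and return the best one.

    param_grid maps a hyperparameter name to the list of values to try;
    score takes keyword arguments and returns a number (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Hypothetical stand-in for training a model and scoring it on validation data.
def toy_validation_score(learning_rate, reg_strength):
    return -(learning_rate - 0.1) ** 2 - (reg_strength - 1.0) ** 2

grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],  # logarithmic scale
    "reg_strength": [0.01, 0.1, 1.0, 10.0],
}
best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params, best_score)
```

The number of evaluations is the product of the list lengths (16 here), so the cost grows quickly as hyperparameters are added.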
Random search is similar to grid search, but instead of trying every combination it samples combinations randomly from the full grid, which takes much less time.
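Here is a minimal random search sketch. As a variation on the description above, it samples each hyperparameter log-uniformly from a continuous range rather than from a fixed grid; the range, the helper names and the toy scoring function are all assumptions for illustration.

```python
import random

def random_search(sample_params, score, n_trials, seed=0):
    """Score n_trials randomly sampled hyperparameter settings; keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

def sample_params(rng):
    # Sample both values log-uniformly between 1e-3 and 1e1 (an assumed range).
    return {
        "learning_rate": 10 ** rng.uniform(-3, 1),
        "reg_strength": 10 ** rng.uniform(-3, 1),
    }

# Hypothetical stand-in for training a model and scoring it on validation data.
def toy_validation_score(learning_rate, reg_strength):
    return -(learning_rate - 0.1) ** 2 - (reg_strength - 1.0) ** 2

best_params, best_score = random_search(sample_params, toy_validation_score, n_trials=60)
print(best_params, best_score)
```

Unlike grid search, the budget `n_trials` is chosen directly and does not grow with the number of hyperparameters.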
Suppose we try $n$ random hyperparameter combinations. The probability that all $n$ fall outside the top 5% of combinations is $(1-0.05)^n$, so the probability that at least one is in the top 5% is $1-(1-0.05)^n$. If we want to find one of these combinations 95% of the time, we set $1-(1-0.05)^n \geq 0.95$, which gives $n \geq \ln(0.05)/\ln(0.95) \approx 58.4$. So about 60 random hyperparameter combinations are enough to have a 95% chance of finding at least one that yields top 5% performance for the model.
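The number of trials can be checked numerically. Solving the inequality exactly gives 59 as the smallest integer, which the text rounds to 60; the function name `trials_needed` is an assumption for the example.

```python
import math

def trials_needed(top_fraction=0.05, success_prob=0.95):
    """Smallest n with 1 - (1 - top_fraction)**n >= success_prob."""
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - top_fraction))

n = trials_needed()
print(n, 1 - 0.95 ** n)
```

The same function shows how the budget scales: aiming for the top 1% of combinations instead of the top 5% requires several times as many trials.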
Bayesian Hyperparameter Optimization
We can use Bayesian optimization to select good hyperparameters for us. We place a Gaussian process prior on the function that maps hyperparameters to model performance, and use the results of the hyperparameters evaluated so far as observations to compute a posterior distribution. Then we select the next hyperparameters to try by optimizing an acquisition function, such as the expected improvement over the current best result or the Gaussian process upper confidence bound (UCB). The acquisition function turns the model posterior into a utility function, and this is what we use to decide which set of hyperparameters to try next.
Basic idea: model the generalization performance of an algorithm as a smooth function of its hyperparameters, then try to find its maximum.
It has two parts:
Exploration: evaluate this function on sets of hyperparameters where the outcome is most uncertain
Exploitation: evaluate this function on sets of hyperparameters which seem likely to output high values
These two steps are repeated until convergence.
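The trade-off between exploration and exploitation shows up directly in the acquisition functions mentioned earlier. The sketch below evaluates expected improvement and UCB for three hypothetical candidate points, given only the posterior mean and standard deviation at each one. It assumes we are maximizing a validation score; the candidate values and `kappa=2.0` are illustrative, and fitting the Gaussian process itself is left out.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best_so_far):
    """Expected improvement (maximization) at a point with posterior N(mean, std^2)."""
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * normal_cdf(z) + std * normal_pdf(z)

def upper_confidence_bound(mean, std, kappa=2.0):
    """UCB acquisition; kappa (an assumed value) weights exploration."""
    return mean + kappa * std

best_so_far = 0.80
candidates = {
    "exploit": (0.82, 0.01),  # slightly better mean, almost no uncertainty
    "explore": (0.75, 0.10),  # worse mean, large uncertainty
    "neither": (0.70, 0.01),  # worse mean, almost no uncertainty
}
for name, (mean, std) in candidates.items():
    ei = expected_improvement(mean, std, best_so_far)
    ucb = upper_confidence_bound(mean, std)
    print(name, ei, ucb)
```

Both the confident, slightly better point and the uncertain point receive a meaningful expected improvement, while the point that is confidently worse scores close to zero. This is how a single utility function can reward either exploration or exploitation.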
This is faster than grid search because it makes “educated” guesses as to where the optimal set of hyperparameters might be, as opposed to brute-force searching through the entire space.
One problem is that computing the results of a hyperparameter sample can be very expensive (for instance, if you are training a large neural network).
We use a Gaussian process because its properties allow us to compute marginals and conditionals in closed form.