How to deal with imbalanced datasets in machine learning?
There are 5 different methods for dealing with imbalanced datasets:
- Change the performance metric
- Change the algorithm
- Oversample minority class
- Undersample majority class
- Generate synthetic samples
1. Change the performance metric
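Accuracy is not the best metric to use when evaluating models on imbalanced datasets, as it can be very misleading: a model that always predicts the majority class can score high accuracy while never detecting the minority class. Metrics that can provide better insight include the confusion matrix, precision, recall, and the F1 score.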
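To make the point concrete, here is a minimal sketch in plain Python that compares accuracy with precision, recall, and F1 for a classifier that always predicts the majority class. The labels are made up for illustration, and the helper names `confusion_counts` and `precision_recall_f1` are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1); each is 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: 95 majority examples (0) and 5 minority examples (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

# prints: accuracy=0.95 precision=0.00 recall=0.00 f1=0.00
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Accuracy looks strong at 0.95, yet the model never finds a single minority example. In practice you would likely use the ready-made `confusion_matrix`, `precision_score`, `recall_score`, and `f1_score` functions from `sklearn.metrics` rather than hand-written helpers.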
2. Change the algorithm
In every machine learning problem, it’s a good rule of thumb to try a variety of algorithms, and this can be especially beneficial for imbalanced datasets. Decision trees frequently perform well on imbalanced data.
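One way to try several algorithms is to fit each on the same split and compare a metric that focuses on the minority class. The sketch below assumes scikit-learn is installed and uses a synthetic dataset from `make_classification`; the two candidate models and the 95/5 class ratio are illustrative choices, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical data: roughly 95% majority class, 5% minority class.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
# Stratify so that train and test keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Fit every candidate on the same data and score it with F1 on class 1.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: F1 on minority class = {score:.3f}")
```

Which model wins depends on the data, so the scores from this synthetic example should not be read as a general ranking.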
3. Resampling techniques: oversample minority class
Oversampling means adding more copies of minority-class observations to the data. It can be a good choice when you don’t have much data to work with.
Always split the data into train and test sets before applying oversampling techniques. Oversampling before splitting can place the exact same observations in both the train and test sets, which lets the model memorize those points and makes test performance look better than it really is.
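The split-then-oversample order can be sketched in plain Python as follows. The data is made up, `oversample_minority` is a hypothetical helper that assumes a binary problem, and the every-fifth-row split is used only to keep the example deterministic.

```python
import random


def oversample_minority(rows, labels, minority_label, seed=0):
    """Randomly duplicate minority rows until the two classes are the same size.

    Assumes a binary problem with at least one minority row.
    """
    rng = random.Random(seed)
    minority = [r for r, l in zip(rows, labels) if l == minority_label]
    n_majority = len(labels) - len(minority)
    n_extra = max(n_majority - len(minority), 0)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(rows) + extra, list(labels) + [minority_label] * n_extra


# Hypothetical data: 90 majority rows (label 0) and 10 minority rows (label 1).
rows = [[float(i)] for i in range(100)]
labels = [0] * 90 + [1] * 10

# 1) Split first. For a deterministic illustration every 5th row goes to the
#    test set; in practice a shuffled, stratified split is the usual choice.
test_idx = [i for i in range(len(rows)) if i % 5 == 0]
train_idx = [i for i in range(len(rows)) if i % 5 != 0]
X_train = [rows[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [rows[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

# 2) Oversample the training set only; the test set is left untouched.
X_train_bal, y_train_bal = oversample_minority(X_train, y_train, minority_label=1)

print("train before:", y_train.count(0), "majority /", y_train.count(1), "minority")
print("train after: ", y_train_bal.count(0), "majority /", y_train_bal.count(1), "minority")
```

Because the duplicates are drawn from the training rows alone, no test observation can appear in the training set. With scikit-learn, `sklearn.utils.resample` called with `replace=True` can do the duplication step instead of the hand-written helper.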
4. Resampling techniques: undersample majority class
Undersampling means removing some observations of the majority class from the data. It is a good choice when you have a large amount of data. A disadvantage is that important information may be discarded, which could lead to underfitting and poor generalization to the test set.
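A minimal sketch of random undersampling in plain Python is shown below. The data is made up and `undersample_majority` is a hypothetical helper; as with oversampling, it would normally be applied to the training set only.

```python
import random


def undersample_majority(rows, labels, majority_label, seed=0):
    """Randomly drop majority rows until they match the number of other rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, l in enumerate(labels) if l == majority_label]
    other_idx = [i for i, l in enumerate(labels) if l != majority_label]
    keep_n = min(len(majority_idx), len(other_idx))
    # Sample without replacement so that no majority row is kept twice.
    kept_majority = rng.sample(majority_idx, keep_n)
    keep = sorted(kept_majority + other_idx)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical training data: 90 majority rows (0) and 10 minority rows (1).
train_rows = [[float(i)] for i in range(100)]
train_labels = [0] * 90 + [1] * 10

small_rows, small_labels = undersample_majority(train_rows, train_labels, majority_label=0)

print("rows kept:", len(small_rows))
print("majority kept:", small_labels.count(0), "minority kept:", small_labels.count(1))
```

In this example 80 of the 90 majority rows are thrown away, which illustrates the information-loss risk described above.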
5. Generate synthetic samples
This technique creates synthetic samples of the minority class instead of duplicating existing ones. We will use imblearn’s SMOTE (Synthetic Minority Oversampling Technique) to generate them. SMOTE uses a nearest neighbors algorithm to generate new, synthetic data that we can use for training our model.
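To illustrate the nearest-neighbors idea, here is a simplified sketch of the interpolation step behind SMOTE, written in plain Python. It is not imblearn's implementation: `smote_sketch`, its parameters, and the sample points are all hypothetical, and it assumes at least two minority points.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random point on the line segment between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class points with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_sketch(minority, n_new=6)

for point in new_points:
    print([round(v, 3) for v in point])
```

Each synthetic point lies between two real minority points, so it is new data rather than an exact copy. With imblearn itself, the usual call is `SMOTE(random_state=0).fit_resample(X_train, y_train)` after importing `SMOTE` from `imblearn.over_sampling`, applied to the training set only.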