- e.g. Flower Species 150: 50 Setosa, 50 Virginica, and 50 Versicolor
Here, observations, samples, and rows represent the same thing.
- Randomly duplicate the minority class samples so the model sees them more often during training.
- Resample the minority class by setting the number of samples to match the majority class.
- Combine the upsampled minority samples with the original majority samples.
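The upsampling steps above can be sketched with scikit-learn's `resample` utility; the toy DataFrame and the names `df`, `majority`, and `minority` are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 90 majority ("A") vs 10 minority ("B") rows
df = pd.DataFrame({
    "feature": range(100),
    "label": ["A"] * 90 + ["B"] * 10,
})

majority = df[df["label"] == "A"]
minority = df[df["label"] == "B"]

# Randomly duplicate minority rows until they match the majority count
minority_upsampled = resample(
    minority,
    replace=True,              # sample with replacement (duplicates allowed)
    n_samples=len(majority),   # match the majority class size
    random_state=42,
)

# Combine the upsampled minority with the original majority
balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())  # A: 90, B: 90
```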
- Randomly remove/drop the majority class samples to make the class distribution balanced.
- Resample the majority class by setting the number of samples to match the minority class.
- Combine the downsampled majority samples with the original minority samples.
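Downsampling follows the same pattern, but samples the majority class without replacement; again, the toy DataFrame and variable names are assumptions.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 90 majority ("A") vs 10 minority ("B") rows
df = pd.DataFrame({
    "feature": range(100),
    "label": ["A"] * 90 + ["B"] * 10,
})

majority = df[df["label"] == "A"]
minority = df[df["label"] == "B"]

# Randomly drop majority rows until they match the minority count
majority_downsampled = resample(
    majority,
    replace=False,            # sample without replacement (no duplicates)
    n_samples=len(minority),  # match the minority class size
    random_state=42,
)

# Combine the downsampled majority with the original minority
balanced = pd.concat([majority_downsampled, minority])
print(balanced["label"].value_counts())  # A: 10, B: 10
```

Note the trade-off: downsampling discards potentially useful majority-class information, while upsampling risks overfitting to duplicated minority rows.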
- Precision, Recall / True Positive Rate (TPR), and F1 Score are better metrics for imbalanced datasets.
- F1 Score: Harmonic mean of recall and precision (keeps the balance between recall and precision)
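A small sketch of why these metrics matter on imbalanced data, using scikit-learn's metric functions on hypothetical predictions (the `y_true`/`y_pred` arrays are made up for illustration): accuracy looks high even though half the minority class is missed.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical imbalanced test set: 90 negatives, 10 positives (minority)
y_true = [0] * 90 + [1] * 10
# Predictions: 2 false positives, and only 5 of the 10 minority samples found
y_pred = [0] * 88 + [1] * 2 + [0] * 5 + [1] * 5

print(accuracy_score(y_true, y_pred))   # 0.93 -- misleadingly high
print(precision_score(y_true, y_pred))  # 5/7  ~= 0.714
print(recall_score(y_true, y_pred))     # 0.5  -- half the minority is missed
print(f1_score(y_true, y_pred))         # 10/17 ~= 0.588 (harmonic mean)
```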
- Stratified K-Fold Cross Validation: each fold preserves the same proportion of each class label as the full dataset.
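A minimal sketch of stratified splitting with scikit-learn's `StratifiedKFold`; the 90/10 toy labels are an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical 90/10 imbalanced labels
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Each test fold keeps the 90:10 ratio -> 18 majority, 2 minority
    fold_counts.append(np.bincount(y[test_idx]))

print(fold_counts)  # five arrays of [18, 2]
```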
- Random Forest and KNN can perform well with imbalanced data.
- RF combines predictions from multiple decision trees, each trained on a different bootstrap resample of the data.
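The bagging idea above can be sketched with scikit-learn's `RandomForestClassifier`; the `make_classification` toy data is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 90/10 imbalanced toy data
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

# bootstrap=True (the default): each of the 100 trees is fit on a
# different random resample of the rows; their votes are combined
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=42)
rf.fit(X, y)
print(len(rf.estimators_))  # 100 individually trained trees
```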
- Parameter class_weight (we can specify a higher weight for the minority class)
- Forces the model to pay more attention to the minority class samples during training.
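A minimal sketch of `class_weight` with scikit-learn; `LogisticRegression` and the `make_classification` toy data are assumptions, and the same parameter exists on many other estimators (e.g. `RandomForestClassifier`).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical 90/10 imbalanced toy data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# class_weight="balanced" sets each weight to n_samples / (n_classes * count),
# so errors on the minority class cost roughly 9x more here.
# An explicit mapping also works, e.g. class_weight={0: 1, 1: 9}.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
print(clf.classes_)
```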