Imbalanced Classification Problems
In binary classification, strongly imbalanced classes often lead to unsatisfactory predictions for new observations, especially for the small class. In this context, imbalanced classes simply means that the number of observations of one class (usually the positive or majority class) by far exceeds the number of observations of the other class (usually the negative or minority class). This setting occurs fairly often in practice and in various disciplines such as credit scoring, fraud detection, medical diagnostics, and churn management.
Most classification methods work best when the number of observations per class is roughly equal. The problem with imbalanced classes is that, because of the dominance of the majority class, classifiers tend to treat cases of the minority class as noise and therefore predict the majority class far more often. To give more weight to the cases of the minority class, numerous correction methods tackle the imbalanced classification problem. These methods can generally be divided into cost-based and sampling-based approaches. All methods supported by mlr are introduced below.
Sampling-based approaches
The basic idea of sampling methods is to simply adjust the proportion of the classes in order to increase the weight of the minority class observations within the model.
The sampling-based approaches can be divided further into three different categories:
- Undersampling methods: Elimination of randomly chosen cases of the majority class to decrease their effect on the classifier. All cases of the minority class are kept.
- Oversampling methods: Generation of additional cases (copies, artificial observations) of the minority class to increase their effect on the classifier. All cases of the majority class are kept.
- Hybrid methods: Mixture of under- and oversampling strategies.
All these methods directly access the underlying data and "rearrange" it. In this way the sampling is done as part of the preprocessing and can therefore be combined with every appropriate classifier.
mlr currently supports the first two approaches.
(Simple) over- and undersampling
As mentioned above, undersampling always refers to the majority class, while oversampling affects the minority class. With undersampling, randomly chosen observations of the majority class are eliminated. With (simple) oversampling, all observations of the minority class are considered at least once when fitting the model. In addition, exact copies of minority class cases are created by random sampling with replacement.
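The two sampling schemes described above can be sketched in a few lines of base R. This is only an illustration of the idea, not mlr's implementation: the helper names `undersample_df` and `oversample_df` and the toy data are assumptions made for this example.

```r
# Minimal base-R sketch of random under- and oversampling.
# `undersample_df` and `oversample_df` are hypothetical helpers, not mlr functions.
undersample_df <- function(data, target, majority, rate) {
  stopifnot(rate > 0, rate <= 1)
  maj_idx <- which(data[[target]] == majority)
  n_keep <- round(length(maj_idx) * rate)
  keep <- maj_idx[sample.int(length(maj_idx), n_keep)]
  # Keep every minority case plus the randomly retained majority cases.
  data[sort(c(which(data[[target]] != majority), keep)), , drop = FALSE]
}

oversample_df <- function(data, target, minority, rate) {
  stopifnot(rate >= 1)
  min_idx <- which(data[[target]] == minority)
  n_extra <- round(length(min_idx) * rate) - length(min_idx)
  # Every original row is kept; the extra rows are exact copies of
  # minority cases drawn with replacement.
  extra <- min_idx[sample.int(length(min_idx), n_extra, replace = TRUE)]
  data[c(seq_len(nrow(data)), extra), , drop = FALSE]
}

set.seed(1)
toy <- data.frame(
  x = c(rnorm(20, mean = 1), rnorm(400, mean = 2)),
  class = factor(rep(c("A", "B"), times = c(20, 400)))
)
over <- oversample_df(toy, "class", minority = "A", rate = 4)
under <- undersample_df(toy, "class", majority = "B", rate = 1 / 4)
table(over$class)   # A: 80, B: 400
table(under$class)  # A: 20, B: 100
```

Both helpers return a new data frame and leave the input untouched, which mirrors the idea that the sampling is a preprocessing step independent of the classifier.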
First, let's take a look at the effect for a classification task. Based on a simulated ClassifTask with imbalanced classes, two new tasks (task.over, task.under) are created via the mlr functions oversample and undersample, respectively.
library(mlr)  # provides makeClassifTask, oversample, undersample, ...

data.imbal.train = rbind(
  data.frame(x = rnorm(100, mean = 1), class = "A"),
  data.frame(x = rnorm(5000, mean = 2), class = "B")
)
task = makeClassifTask(data = data.imbal.train, target = "class")
task.over = oversample(task, rate = 8)
task.under = undersample(task, rate = 1/8)
table(getTaskTargets(task))
#>
#>    A    B
#>  100 5000
table(getTaskTargets(task.over))
#>
#>    A    B
#>  800 5000
table(getTaskTargets(task.under))
#>
#>   A   B
#> 100 625
Please note that the undersampling rate has to be between 0 and 1, where 1 means no undersampling and 0.5 implies a reduction of the majority class size to 50 percent. Correspondingly, the oversampling rate must be greater than or equal to 1, where 1 means no oversampling and 2 would result in doubling the minority class size.
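The arithmetic behind the rates can be checked with plain R. The sketch below uses the class sizes of the simulated task (100 and 5000); the variable names are chosen only for this illustration.

```r
n_min <- 100   # class A in the simulated task
n_maj <- 5000  # class B in the simulated task

# Class sizes implied by the rates used in the example.
n_min_over  <- round(n_min * 8)      # 800
n_maj_under <- round(n_maj * 1 / 8)  # 625

# Rates that would balance the two classes exactly.
rate_over_balanced  <- n_maj / n_min  # 50
rate_under_balanced <- n_min / n_maj  # 0.02

# Majority-to-minority ratio before and after resampling.
ratios <- c(
  original = n_maj / n_min,        # 50
  over     = n_maj / n_min_over,   # 6.25
  under    = n_maj_under / n_min   # 6.25
)
ratios
```

With a rate of 8 (or 1/8) the classes are therefore still imbalanced, only far less so than in the original data.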
As a result, performance on new data should improve, at least for the minority class.
lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, task)
mod.over = train(lrn, task.over)
mod.under = train(lrn, task.under)
data.imbal.test = rbind(
  data.frame(x = rnorm(10, mean = 1), class = "A"),
  data.frame(x = rnorm(500, mean = 2), class = "B")
)
performance(predict(mod, newdata = data.imbal.test), measures = list(mmce, ber, auc))
#>       mmce        ber        auc
#> 0.01960784 0.50000000 0.50000000
performance(predict(mod.over, newdata = data.imbal.test), measures = list(mmce, ber, auc))
#>       mmce        ber        auc
#> 0.04509804 0.41500000 0.58500000
performance(predict(mod.under, newdata = data.imbal.test), measures = list(mmce, ber, auc))
#>       mmce        ber        auc
#> 0.05098039 0.41800000 0.70550000
In this case the performance measure has to be chosen very carefully. The misclassification rate (mmce) evaluates only the overall accuracy of the predictions and is dominated by the majority class, so the balanced error rate (ber) and the area under the ROC curve (auc) might be more suitable here, as the misclassifications within each class are taken into account separately.
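To see why mmce can be misleading, consider a classifier that always predicts the majority class on a test set with 10 cases of class A and 500 of class B. The following base-R sketch computes both measures by hand; the formulas are written out here for illustration rather than taken from mlr's code.

```r
truth <- factor(rep(c("A", "B"), times = c(10, 500)))
pred  <- factor(rep("B", 510), levels = levels(truth))  # always predict B

# mmce: share of all observations that are misclassified.
mmce_manual <- mean(pred != truth)

# ber: error rate computed separately per true class, then averaged.
per_class_err <- tapply(pred != truth, truth, mean)
ber_manual <- mean(per_class_err)

mmce_manual  # 0.01960784
ber_manual   # 0.5
```

The misclassification rate looks excellent although every minority case is wrong, whereas the balanced error rate shows that the classifier is no better than guessing.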
Over- and undersampling wrappers
Alternatively, mlr also offers the integration of over- and undersampling via a wrapper approach. This way over- and undersampling can be applied to already existing learners to extend their functionality.
The example given above is repeated once again, but this time with extended learners instead of modified tasks (see makeOversampleWrapper and makeUndersampleWrapper). Just like before the undersampling rate has to be between 0 and 1, while the oversampling rate has a lower boundary of 1.
lrn.over = makeOversampleWrapper(lrn, osw.rate = 8)
lrn.under = makeUndersampleWrapper(lrn, usw.rate = 1/8)
mod = train(lrn, task)
mod.over = train(lrn.over, task)
mod.under = train(lrn.under, task)
performance(predict(mod, newdata = data.imbal.test), measures = list(mmce, ber, auc))
#>       mmce        ber        auc
#> 0.01960784 0.50000000 0.50000000
performance(predict(mod.over, newdata = data.imbal.test), measures = list(mmce, ber, auc))
#>       mmce        ber        auc
#> 0.03333333 0.40900000 0.72020000
performance(predict(mod.under, newdata = data.imbal.test), measures = list(mmce, ber, auc))
#>       mmce        ber        auc
#> 0.04509804 0.41500000 0.71660000
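The wrapper idea itself, attaching the sampling step to the learner instead of the data, can be illustrated with a closure in base R. This is a conceptual sketch only: `with_undersampling` and the stand-in learner `count_classes` are hypothetical names and do not reflect how mlr implements its wrappers.

```r
# Hypothetical helper: wraps a training function so that the majority class is
# undersampled each time the wrapped function is called.
with_undersampling <- function(train_fun, target, majority, rate) {
  function(data, ...) {
    maj_idx <- which(data[[target]] == majority)
    n_keep <- round(length(maj_idx) * rate)
    keep <- maj_idx[sample.int(length(maj_idx), n_keep)]
    reduced <- data[c(which(data[[target]] != majority), keep), , drop = FALSE]
    train_fun(reduced, ...)
  }
}

# A stand-in "learner" that only reports the class counts it was trained on.
count_classes <- function(data, ...) table(data$class)

set.seed(1)
toy <- data.frame(
  x = rnorm(420),
  class = factor(rep(c("A", "B"), times = c(20, 400)))
)
train_under <- with_undersampling(count_classes, "class", "B", rate = 1 / 4)
seen <- train_under(toy)
seen              # A: 20, B: 100
table(toy$class)  # the data itself is unchanged: A: 20, B: 400
```

The data set stays as it is; only what the learner sees during training changes, so the same task can be reused with wrapped and unwrapped learners.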
Extensions to oversampling
Two extensions to (simple) oversampling are available in mlr.
1. SMOTE (Synthetic Minority Oversampling Technique)
Because duplicating minority class observations can lead to overfitting, SMOTE constructs the "new cases" differently. For each new observation, one randomly chosen minority class observation and one of its randomly chosen nearest neighbours are interpolated to create a new artificial observation of the minority class. The smote function in mlr handles numeric as well as factor features, as the Gower distance is used for the nearest neighbour calculation. The factor level of the new artificial case is sampled from the given levels of the two input observations.
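The interpolation step can be sketched in base R for the simplest case. Note the assumptions: the hypothetical helper `smote_numeric` handles numeric features only and uses the Euclidean distance instead of the Gower distance, so it illustrates the principle rather than reproducing mlr's smote function.

```r
# Hypothetical helper: SMOTE-style interpolation for numeric features only.
# Assumption: Euclidean distance is used here instead of the Gower distance.
smote_numeric <- function(X, n_new, nn = 5) {
  X <- as.matrix(X)
  stopifnot(nrow(X) > nn)
  D <- as.matrix(dist(X))  # pairwise distances between minority cases
  diag(D) <- Inf           # a case is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                  dimnames = list(NULL, colnames(X)))
  for (i in seq_len(n_new)) {
    base <- sample.int(nrow(X), 1)               # random minority case
    neigh_pool <- order(D[base, ])[seq_len(nn)]  # its nn nearest neighbours
    neigh <- neigh_pool[sample.int(nn, 1)]       # pick one of them at random
    gap <- runif(1)                              # position between the two cases
    synth[i, ] <- X[base, ] + gap * (X[neigh, ] - X[base, ])
  }
  synth
}

set.seed(1)
minority <- cbind(x1 = rnorm(20, mean = 1), x2 = rnorm(20, mean = -1))
new_cases <- smote_numeric(minority, n_new = 140, nn = 5)
dim(new_cases)  # 140 2
```

Each artificial case lies on the line segment between two existing minority cases, so the new points stay within the region already covered by the minority class instead of being exact copies.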
Analogous to oversampling, SMOTE preprocessing is possible via modification of the task.
task.smote = smote(task, rate = 8, nn = 5)
table(getTaskTargets(task))
#>
#>    A    B
#>  100 5000
table(getTaskTargets(task.smote))
#>
#>    A    B
#>  800 5000
Alternatively, a new wrapped learner can be created via makeSMOTEWrapper.
lrn.smote = makeSMOTEWrapper(lrn, sw.rate = 8, sw.nn = 5)
mod.smote = train(lrn.smote, task)
performance(predict(mod.smote, newdata = data.imbal.test), measures = list(mmce, ber, auc))
#>       mmce        ber        auc
#> 0.04509804 0.41500000 0.71660000
By default the number of nearest neighbours considered within the algorithm is set to 5.
2. Overbagging
Another extension of oversampling consists in the combination of sampling with the bagging approach. For each iteration of the bagging process, minority class observations are oversampled with a given rate in obw.rate. The majority class cases can either all be taken into account for each iteration (obw.maxcl = "all") or bootstrapped with replacement to increase variability between training data sets during iterations (obw.maxcl = "boot").
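How the training sets of the individual bagging iterations might be drawn can be sketched in base R. The helper `overbag_indices` is hypothetical, and the exact sampling details (for example, drawing all minority cases with replacement) are an assumption made for this illustration, not a statement about mlr's internals.

```r
# Hypothetical helper: row indices of the training set for each bagging iteration.
overbag_indices <- function(y, minority, rate, iters, maxcl = c("all", "boot")) {
  maxcl <- match.arg(maxcl)
  min_idx <- which(y == minority)
  maj_idx <- which(y != minority)
  lapply(seq_len(iters), function(i) {
    # Oversample the minority class with replacement (assumed scheme).
    n_min_new <- round(length(min_idx) * rate)
    min_part <- min_idx[sample.int(length(min_idx), n_min_new, replace = TRUE)]
    # Majority class: either all cases or a bootstrap sample of the same size.
    if (maxcl == "all") {
      maj_part <- maj_idx
    } else {
      maj_part <- maj_idx[sample.int(length(maj_idx), replace = TRUE)]
    }
    c(min_part, maj_part)
  })
}

set.seed(1)
y <- factor(rep(c("A", "B"), times = c(10, 200)))
bags <- overbag_indices(y, minority = "A", rate = 8, iters = 3, maxcl = "boot")
class_counts <- sapply(bags, function(idx) table(y[idx]))
class_counts  # each column (iteration): A = 80, B = 200
```

Every iteration sees the same class sizes but a different selection of rows, which is what creates the variability between the fitted models.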
The construction of the Overbagging Wrapper works similarly to makeBaggingWrapper. First, an existing mlr learner has to be passed to makeOverBaggingWrapper. The number of iterations or fitted models can be set via obw.iters.
lrn = makeLearner("classif.rpart", predict.type = "response")
obw.lrn = makeOverBaggingWrapper(lrn, obw.rate = 8, obw.iters = 3)
For binary classification the prediction is based on majority voting to create a discrete label. Corresponding probabilities are predicted by considering the proportions of all the predicted labels. Please note that the benefit of the sampling process is highly dependent on the specific learner as shown in the following example.
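The aggregation step described above can be written out in base R. The vote matrix below contains made-up labels for three observations and three bagged models; it only illustrates majority voting and vote proportions and is not taken from an actual model.

```r
# Labels predicted by the individual bagged models
# (rows: observations, columns: models); made-up values for illustration.
votes <- matrix(c("A", "B", "B",
                  "B", "B", "B",
                  "A", "A", "B"),
                ncol = 3, byrow = TRUE)
classes <- c("A", "B")

# Probability of each class = share of models voting for it.
prob <- t(apply(votes, 1, function(v) {
  as.numeric(table(factor(v, levels = classes))) / length(v)
}))
colnames(prob) <- classes

# Discrete label = class with the most votes.
label <- classes[max.col(prob, ties.method = "first")]

round(prob, 3)
label  # "B" "B" "A"
```

With only a few iterations the resulting probabilities are coarse (here multiples of 1/3), which is worth keeping in mind when a probability-based measure such as auc is evaluated.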
First, let's take a look at the tree learner with and without overbagging:
lrn = setPredictType(lrn, "prob")
rdesc = makeResampleDesc("CV", iters = 5)
r1 = resample(learner = lrn, task = task, resampling = rdesc, show.info = FALSE,
  measures = list(mmce, ber, auc))
r1$aggr
#> mmce.test.mean  ber.test.mean  auc.test.mean
#>     0.01960784     0.50000000     0.50000000
obw.lrn = setPredictType(obw.lrn, "prob")
r2 = resample(learner = obw.lrn, task = task, resampling = rdesc, show.info = FALSE,
  measures = list(mmce, ber, auc))
r2$aggr
#> mmce.test.mean  ber.test.mean  auc.test.mean
#>     0.04470588     0.43611719     0.58535862
Now let's consider a random forest as initial learner:
lrn = makeLearner("classif.randomForest")
obw.lrn = makeOverBaggingWrapper(lrn, obw.rate = 8, obw.iters = 3)
lrn = setPredictType(lrn, "prob")
r1 = resample(learner = lrn, task = task, resampling = rdesc, show.info = FALSE,
  measures = list(mmce, ber, auc))
r1$aggr
#> mmce.test.mean  ber.test.mean  auc.test.mean
#>     0.03509804     0.46089748     0.58514212
obw.lrn = setPredictType(obw.lrn, "prob")
r2 = resample(learner = obw.lrn, task = task, resampling = rdesc, show.info = FALSE,
  measures = list(mmce, ber, auc))
r2$aggr
#> mmce.test.mean  ber.test.mean  auc.test.mean
#>     0.04098039     0.45961754     0.54926842
While overbagging slightly improves the ber and auc of the decision tree (at the cost of a higher mmce), the auc decreases in the second example when additional overbagging is applied. As the random forest itself is already a strong learner (and a bagged one as well), a further bagging step isn't very helpful here and usually won't improve the model.
Cost-based approaches
In contrast to sampling, cost-based approaches usually require particular learners, which can deal with different class-dependent costs (Cost-Sensitive Classification).
Weighted classes wrapper
Another approach independent of the underlying classifier is to assign the costs as class weights, so that each observation receives a weight, depending on the class it belongs to. Similar to the sampling-based approaches, the effect of the minority class observations is thereby increased simply by a higher weight of these instances and vice versa for majority class observations.
In this way every learner which supports weights can be extended through the wrapper approach. If the learner does not have a direct parameter for class weights, but supports observation weights, the weights depending on the class are internally set in the wrapper.
lrn = makeLearner("classif.logreg")
wcw.lrn = makeWeightedClassesWrapper(lrn, wcw.weight = 0.01)
For binary classification, the single number passed to the classifier corresponds to the weight of the positive / majority class, while the negative / minority class receives a weight of 1. So actually, no real costs are used within this approach, but the cost ratio is taken into account.
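How a single class weight translates into observation weights can be sketched in base R. The assumption here, following the convention in the text, is that the majority class "B" is the positive class; which class actually counts as positive depends on how the task is defined. The variable names are illustrative.

```r
y <- factor(rep(c("A", "B"), times = c(100, 5000)))
positive <- "B"     # assumption: the majority class is the positive class
wcw_weight <- 0.01  # weight of the positive class; the other class gets 1

# One weight per observation, determined only by its class.
obs_weights <- ifelse(y == positive, wcw_weight, 1)

# Total weight per class: the 5000 majority cases now count less
# than the 100 minority cases.
class_mass <- tapply(obs_weights, y, sum)
class_mass  # A: 100, B: 50
```

Only the ratio of the two weights matters: weights of 0.01 and 1 express the same relative importance as weights of 1 and 100.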
If the underlying learner already has a parameter for class weighting (e.g., class.weights in "classif.ksvm"), the wcw.weight is basically passed to the specific class weighting parameter.
lrn = makeLearner("classif.ksvm")
wcw.lrn = makeWeightedClassesWrapper(lrn, wcw.weight = 0.01)