Benchmark Experiments
In a benchmark experiment, different learning methods are applied to one or several data sets with the aim of comparing and ranking the algorithms with respect to one or more performance measures.
In mlr a benchmark experiment can be conducted by calling function benchmark on a list of Learners and a list of Tasks. benchmark basically executes resample for each combination of Learner and Task. You can specify an individual resampling strategy for each Task and select one or multiple performance measures to be calculated.
Conducting benchmark experiments
We start with a small example. Two learners, linear discriminant analysis (lda) and a classification tree (rpart), are applied to one classification problem (sonar.task). As resampling strategy we choose "Holdout". The performance is thus calculated on a single randomly sampled test data set.
In the example below we create a resample description (ResampleDesc), which is automatically instantiated by benchmark. The instantiation is done only once per Task, i.e., the same training and test sets are used for all learners. It is also possible to directly pass a ResampleInstance.
If you would like to use a fixed test data set instead of a randomly selected one, you can create a suitable ResampleInstance through function makeFixedHoldoutInstance.
## Two learners to be compared
lrns = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
## Choose the resampling strategy
rdesc = makeResampleDesc("Holdout")
## Conduct the benchmark experiment
bmr = benchmark(lrns, sonar.task, rdesc)
#> Task: Sonar-example, Learner: classif.lda
#> Resampling: holdout
#> Measures: mmce
#> [Resample] iter 1: 0.3000000
#>
#> Aggregated Result: mmce.test.mean=0.3000000
#>
#> Task: Sonar-example, Learner: classif.rpart
#> Resampling: holdout
#> Measures: mmce
#> [Resample] iter 1: 0.2857143
#>
#> Aggregated Result: mmce.test.mean=0.2857143
#>
bmr
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.3000000
#> 2 Sonar-example classif.rpart 0.2857143
For convenience, if you don't want to pass any additional arguments to makeLearner, you don't need to generate the Learners explicitly, but it's sufficient to provide the learner name. In the above example we could also have written:
## Vector of strings
lrns = c("classif.lda", "classif.rpart")
## A mixed list of Learner objects and strings works, too
lrns = list(makeLearner("classif.lda", predict.type = "prob"), "classif.rpart")
bmr = benchmark(lrns, sonar.task, rdesc)
#> Task: Sonar-example, Learner: classif.lda
#> Resampling: holdout
#> Measures: mmce
#> [Resample] iter 1: 0.2571429
#>
#> Aggregated Result: mmce.test.mean=0.2571429
#>
#> Task: Sonar-example, Learner: classif.rpart
#> Resampling: holdout
#> Measures: mmce
#> [Resample] iter 1: 0.2714286
#>
#> Aggregated Result: mmce.test.mean=0.2714286
#>
bmr
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.2571429
#> 2 Sonar-example classif.rpart 0.2714286
In the printed summary table every row corresponds to one pair of Task and Learner. The entries show the mean misclassification error (mmce), the default performance measure for classification, on the test data set.
The result bmr is an object of class BenchmarkResult. Basically, it contains a list of lists of ResampleResult objects, first ordered by Task and then by Learner.
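The fixed test set option mentioned earlier can be sketched as follows. This is a minimal illustration, not a recommended split: the choice of every third observation as test data and the names `n`, `test.inds`, `train.inds`, `rin.fixed`, and `bmr.fixed` are assumptions made for this example only.

```r
library(mlr)

## Hypothetical fixed split: every third observation goes to the test set,
## all remaining observations are used for training
n = getTaskSize(sonar.task)
test.inds = seq(1, n, by = 3)
train.inds = setdiff(seq_len(n), test.inds)

## Build a ResampleInstance with exactly this train/test partition
rin.fixed = makeFixedHoldoutInstance(train.inds = train.inds,
  test.inds = test.inds, size = n)

## Pass the instance instead of a resample description
lrns.fixed = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
bmr.fixed = benchmark(lrns.fixed, sonar.task, rin.fixed, show.info = FALSE)
bmr.fixed
```

Because the instance is fixed, repeated calls evaluate both learners on the same test observations regardless of the random seed.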
Making experiments reproducible
Typically, we would want our experiment results to be reproducible. mlr obeys the set.seed function, so make sure to use set.seed at the beginning of your script if you would like your results to be reproducible.
Note that if you are using parallel computing, you may need to adjust how you call set.seed depending on your use case. One possibility is to use set.seed(123, "L'Ecuyer") in order to ensure the results are reproducible for each child process. See the examples in mclapply for more information on reproducibility and parallel computing.
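As a small sanity check of the sequential case, the sketch below instantiates the same holdout description twice after setting the same seed and compares the resulting splits. The seed value 123 and the names `rdesc.h`, `rin1`, and `rin2` are arbitrary choices for this illustration; it does not cover the parallel setting.

```r
library(mlr)

rdesc.h = makeResampleDesc("Holdout")

## Same seed before each instantiation
set.seed(123)
rin1 = makeResampleInstance(rdesc.h, sonar.task)
set.seed(123)
rin2 = makeResampleInstance(rdesc.h, sonar.task)

## Compare the training and test indices of the two instances
same.train = identical(rin1$train.inds, rin2$train.inds)
same.test = identical(rin1$test.inds, rin2$test.inds)
c(same.train, same.test)
```

If both comparisons are TRUE, any benchmark run on these instances sees identical training and test sets.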
Accessing benchmark results
mlr provides several accessor functions, named getBMR<WhatToExtract>, that let you retrieve information for further analyses. This includes, for example, the performances or predictions of the learning algorithms under consideration.
Learner performances
Let's have a look at the benchmark result above. getBMRPerformances returns individual performances in resampling runs, while getBMRAggrPerformances gives the aggregated values.
getBMRPerformances(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> iter mmce
#> 1 1 0.2571429
#>
#> $`Sonar-example`$classif.rpart
#> iter mmce
#> 1 1 0.2714286
getBMRAggrPerformances(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> mmce.test.mean
#> 0.2571429
#>
#> $`Sonar-example`$classif.rpart
#> mmce.test.mean
#> 0.2714286
Since we used holdout as resampling strategy, individual and aggregated performance values coincide.
By default, nearly all "getter" functions return a nested list, with the first level indicating the task and the second level indicating the learner. If only a single learner or, as in our case, a single task is considered, setting drop = TRUE simplifies the result to a flat list.
getBMRPerformances(bmr, drop = TRUE)
#> $classif.lda
#> iter mmce
#> 1 1 0.2571429
#>
#> $classif.rpart
#> iter mmce
#> 1 1 0.2714286
Often it is more convenient to work with data.frames. You can easily convert the result structure by setting as.df = TRUE.
getBMRPerformances(bmr, as.df = TRUE)
#> task.id learner.id iter mmce
#> 1 Sonar-example classif.lda 1 0.2571429
#> 2 Sonar-example classif.rpart 1 0.2714286
getBMRAggrPerformances(bmr, as.df = TRUE)
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.2571429
#> 2 Sonar-example classif.rpart 0.2714286
Predictions
By default, the BenchmarkResult contains the learner predictions. If you do not want to keep them, e.g., to conserve memory, set keep.pred = FALSE when calling benchmark.
You can access the predictions using function getBMRPredictions. By default, you get a nested list of ResamplePrediction objects. As above, you can use the drop or as.df options to simplify the result.
getBMRPredictions(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: prob
#> threshold: M=0.50,R=0.50
#> time (mean): 0.01
#> id truth prob.M prob.R response iter set
#> 1 127 M 0.98673000 0.013270001 M 1 test
#> 2 159 M 0.99659179 0.003408211 M 1 test
#> 3 81 R 0.55436799 0.445632009 M 1 test
#> 4 207 M 0.98660766 0.013392337 M 1 test
#> 5 74 R 0.94120073 0.058799272 M 1 test
#> 6 154 M 0.03862365 0.961376347 R 1 test
#> ... (#rows: 70, #cols: 7)
#>
#> $`Sonar-example`$classif.rpart
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: response
#> threshold:
#> time (mean): 0.01
#> id truth response iter set
#> 1 127 M R 1 test
#> 2 159 M R 1 test
#> 3 81 R R 1 test
#> 4 207 M M 1 test
#> 5 74 R R 1 test
#> 6 154 M M 1 test
#> ... (#rows: 70, #cols: 5)
head(getBMRPredictions(bmr, as.df = TRUE))
#> task.id learner.id id truth prob.M prob.R response iter
#> 1 Sonar-example classif.lda 127 M 0.98673000 0.013270001 M 1
#> 2 Sonar-example classif.lda 159 M 0.99659179 0.003408211 M 1
#> 3 Sonar-example classif.lda 81 R 0.55436799 0.445632009 M 1
#> 4 Sonar-example classif.lda 207 M 0.98660766 0.013392337 M 1
#> 5 Sonar-example classif.lda 74 R 0.94120073 0.058799272 M 1
#> 6 Sonar-example classif.lda 154 M 0.03862365 0.961376347 R 1
#> set
#> 1 test
#> 2 test
#> 3 test
#> 4 test
#> 5 test
#> 6 test
It is also easy to access results for certain learners or tasks via their IDs. For this purpose many "getter" functions have a learner.ids and a task.ids argument.
head(getBMRPredictions(bmr, learner.ids = "classif.rpart", as.df = TRUE))
#> task.id learner.id id truth response iter set
#> 1 Sonar-example classif.rpart 127 M R 1 test
#> 2 Sonar-example classif.rpart 159 M R 1 test
#> 3 Sonar-example classif.rpart 81 R R 1 test
#> 4 Sonar-example classif.rpart 207 M M 1 test
#> 5 Sonar-example classif.rpart 74 R R 1 test
#> 6 Sonar-example classif.rpart 154 M M 1 test
If you don't like the default IDs, you can set the IDs of learners and tasks via the id option of makeLearner and make*Task. Moreover, you can conveniently change the ID of a Learner via function setLearnerId.
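A brief sketch of the three ways to set IDs described above. The IDs "tree", "cart", and "iris-small" are made up for this illustration.

```r
library(mlr)

## Set the learner ID at construction time
lrn = makeLearner("classif.rpart", id = "tree")
lrn$id

## Change the ID of an existing Learner
lrn = setLearnerId(lrn, "cart")
lrn$id

## Set the task ID when creating a Task
task = makeClassifTask(id = "iris-small", data = iris, target = "Species")
getTaskId(task)
```

These IDs are what appear in the task.id and learner.id columns of a benchmark result and what the learner.ids and task.ids arguments expect.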
IDs
The IDs of all Learners, Tasks and Measures in a benchmark experiment can be retrieved as follows:
getBMRTaskIds(bmr)
#> [1] "Sonar-example"
getBMRLearnerIds(bmr)
#> [1] "classif.lda" "classif.rpart"
getBMRMeasureIds(bmr)
#> [1] "mmce"
Fitted models
By default, the BenchmarkResult also contains the fitted models for all learners on all tasks. If you do not want to keep them, set models = FALSE when calling benchmark. The fitted models can be retrieved by function getBMRModels. It returns a (possibly nested) list of WrappedModel objects.
getBMRModels(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> $`Sonar-example`$classif.lda[[1]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters:
#>
#>
#> $`Sonar-example`$classif.rpart
#> $`Sonar-example`$classif.rpart[[1]]
#> Model for learner.id=classif.rpart; learner.class=classif.rpart
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters: xval=0
getBMRModels(bmr, drop = TRUE)
#> $classif.lda
#> $classif.lda[[1]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters:
#>
#>
#> $classif.rpart
#> $classif.rpart[[1]]
#> Model for learner.id=classif.rpart; learner.class=classif.rpart
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters: xval=0
getBMRModels(bmr, learner.ids = "classif.lda")
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> $`Sonar-example`$classif.lda[[1]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters:
Learners and measures
Moreover, you can extract the employed Learners and Measures.
getBMRLearners(bmr)
#> $classif.lda
#> Learner classif.lda from package MASS
#> Type: classif
#> Name: Linear Discriminant Analysis; Short name: lda
#> Class: classif.lda
#> Properties: twoclass,multiclass,numerics,factors,prob
#> Predict-Type: prob
#> Hyperparameters:
#>
#>
#> $classif.rpart
#> Learner classif.rpart from package rpart
#> Type: classif
#> Name: Decision Tree; Short name: rpart
#> Class: classif.rpart
#> Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights,featimp
#> Predict-Type: response
#> Hyperparameters: xval=0
getBMRMeasures(bmr)
#> [[1]]
#> Name: Mean misclassification error
#> Performance measure: mmce
#> Properties: classif,classif.multi,req.pred,req.truth
#> Minimize: TRUE
#> Best: 0; Worst: 1
#> Aggregated by: test.mean
#> Arguments:
#> Note: Defined as: mean(response != truth)
Merging benchmark results
Sometimes after completing a benchmark experiment it turns out that you want to extend it by another Learner or another Task. In this case you can perform an additional benchmark experiment and then use function mergeBenchmarkResults to combine the results into a single BenchmarkResult object that can be accessed and analyzed as usual.
For example in the benchmark experiment above we applied lda and rpart to the sonar.task. We now perform a second experiment using a random forest and quadratic discriminant analysis (qda) and merge the results.
## First benchmark result
bmr
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.2571429
#> 2 Sonar-example classif.rpart 0.2714286
## Benchmark experiment for the additional learners
lrns2 = list(makeLearner("classif.randomForest"), makeLearner("classif.qda"))
bmr2 = benchmark(lrns2, sonar.task, rdesc, show.info = FALSE)
bmr2
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.randomForest 0.1428571
#> 2 Sonar-example classif.qda 0.2714286
## Merge the results
mergeBenchmarkResults(list(bmr, bmr2))
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.2571429
#> 2 Sonar-example classif.rpart 0.2714286
#> 3 Sonar-example classif.randomForest 0.1428571
#> 4 Sonar-example classif.qda 0.2714286
Note that in the above examples in each case a resample description was passed to the benchmark function. For this reason lda and rpart were most likely evaluated on a different training/test set pair than random forest and qda.
Differing training/test set pairs across learners pose an additional source of variation in the results, which can make it harder to detect actual performance differences between learners. Therefore, if you suspect that you will have to extend your benchmark experiment by another Learner later on it's probably easiest to work with ResampleInstances from the start. These can be stored and used for any additional experiments.
Alternatively, if you used a resample description in the first benchmark experiment you could also extract the ResampleInstances from the BenchmarkResult bmr and pass these to all further benchmark calls.
rin = getBMRPredictions(bmr)[[1]][[1]]$instance
rin
#> Resample instance for 208 cases.
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
## Benchmark experiment for the additional learners (random forest and qda)
bmr3 = benchmark(lrns2, sonar.task, rin, show.info = FALSE)
bmr3
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.randomForest 0.2000000
#> 2 Sonar-example classif.qda 0.5142857
## Merge the results
mergeBenchmarkResults(list(bmr, bmr3))
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.2571429
#> 2 Sonar-example classif.rpart 0.2714286
#> 3 Sonar-example classif.randomForest 0.2000000
#> 4 Sonar-example classif.qda 0.5142857
Benchmark analysis and visualization
mlr offers several ways to analyze the results of a benchmark experiment. This includes visualization, ranking of learning algorithms and hypothesis tests to assess performance differences between learners.
In order to demonstrate the functionality we conduct a slightly larger benchmark experiment with three learning algorithms that are applied to five classification tasks.
Example: Comparing lda, rpart and random forest
We consider linear discriminant analysis (lda), classification trees (rpart), and random forests (randomForest). Since the default learner IDs are a little long, we choose shorter names in the R code below.
We use five classification tasks. Three are already provided by mlr, two more data sets are taken from package mlbench and converted to Tasks by function convertMLBenchObjToTask.
For all tasks 10-fold cross-validation is chosen as resampling strategy. This is achieved by passing a single resample description to benchmark, which is then instantiated automatically once for each Task. This way, the same instance is used for all learners applied to a single task.
It is also possible to choose a different resampling strategy for each Task by passing a list of the same length as the number of tasks that can contain both resample descriptions and resample instances.
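As a brief aside, the per-task variant just described might look like the following sketch. The two-task setup and the names `tasks2`, `rin.sonar`, `resamplings`, `bmr.mixed`, and `perf.mixed` are assumptions for this illustration and are not part of the larger experiment in this section.

```r
library(mlr)

## Two tasks shipped with mlr
tasks2 = list(iris.task, sonar.task)

## One resampling entry per task, in the same order as the tasks:
## a description for iris (instantiated by benchmark) and a
## ready-made instance for sonar
rin.sonar = makeResampleInstance(makeResampleDesc("Holdout"), sonar.task)
resamplings = list(makeResampleDesc("CV", iters = 3), rin.sonar)

bmr.mixed = benchmark(list(makeLearner("classif.rpart")), tasks2, resamplings,
  show.info = FALSE)

## One row per task, learner and resampling iteration
perf.mixed = getBMRPerformances(bmr.mixed, as.df = TRUE)
perf.mixed
```

With 3-fold cross-validation on iris and a single holdout split on sonar, the performance table should contain three rows for iris and one for sonar.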
We use the mean misclassification error mmce as primary performance measure, but also calculate the balanced error rate (ber) and the training time (timetrain).
## Create a list of learners
lrns = list(
makeLearner("classif.lda", id = "lda"),
makeLearner("classif.rpart", id = "rpart"),
makeLearner("classif.randomForest", id = "randomForest")
)
## Get additional Tasks from package mlbench
ring.task = convertMLBenchObjToTask("mlbench.ringnorm", n = 600)
wave.task = convertMLBenchObjToTask("mlbench.waveform", n = 600)
tasks = list(iris.task, sonar.task, pid.task, ring.task, wave.task)
rdesc = makeResampleDesc("CV", iters = 10)
meas = list(mmce, ber, timetrain)
bmr = benchmark(lrns, tasks, rdesc, meas, show.info = FALSE)
bmr
#> task.id learner.id mmce.test.mean ber.test.mean
#> 1 iris-example lda 0.02000000 0.02222222
#> 2 iris-example rpart 0.08000000 0.07555556
#> 3 iris-example randomForest 0.05333333 0.05250000
#> 4 mlbench.ringnorm lda 0.35000000 0.34605671
#> 5 mlbench.ringnorm rpart 0.17333333 0.17313632
#> 6 mlbench.ringnorm randomForest 0.05833333 0.05806121
#> 7 mlbench.waveform lda 0.19000000 0.18257244
#> 8 mlbench.waveform rpart 0.28833333 0.28765247
#> 9 mlbench.waveform randomForest 0.16500000 0.16306057
#> 10 PimaIndiansDiabetes-example lda 0.22778537 0.27148893
#> 11 PimaIndiansDiabetes-example rpart 0.25133288 0.28967870
#> 12 PimaIndiansDiabetes-example randomForest 0.23685919 0.27543146
#> 13 Sonar-example lda 0.24619048 0.23986694
#> 14 Sonar-example rpart 0.30785714 0.31153361
#> 15 Sonar-example randomForest 0.17785714 0.17442696
#> timetrain.test.mean
#> 1 0.0037
#> 2 0.0053
#> 3 0.0290
#> 4 0.0094
#> 5 0.0138
#> 6 0.4969
#> 7 0.0093
#> 8 0.0124
#> 9 0.4521
#> 10 0.0055
#> 11 0.0078
#> 12 0.3805
#> 13 0.0199
#> 14 0.0162
#> 15 0.2520
From the aggregated performance values we can see that for the iris- and PimaIndiansDiabetes-example linear discriminant analysis performs well while for all other tasks the random forest seems superior. Training takes longer for the random forest than for the other learners.
In order to draw any conclusions from the average performances at least their variability has to be taken into account or, preferably, the distribution of performance values across resampling iterations.
The individual performances on the 10 folds for every task, learner, and measure are retrieved below.
perf = getBMRPerformances(bmr, as.df = TRUE)
head(perf)
#> task.id learner.id iter mmce ber timetrain
#> 1 iris-example lda 1 0.0000000 0.0000000 0.004
#> 2 iris-example lda 2 0.1333333 0.1666667 0.004
#> 3 iris-example lda 3 0.0000000 0.0000000 0.003
#> 4 iris-example lda 4 0.0000000 0.0000000 0.003
#> 5 iris-example lda 5 0.0000000 0.0000000 0.004
#> 6 iris-example lda 6 0.0000000 0.0000000 0.004
A closer look at the result reveals that the random forest outperforms the classification tree in every instance, while linear discriminant analysis performs better than rpart most of the time. Additionally lda sometimes even beats the random forest. With increasing size of such benchmark experiments, those tables become almost unreadable and hard to comprehend.
mlr features some plotting functions to visualize results of benchmark experiments that you might find useful. Moreover, mlr offers statistical hypothesis tests to assess performance differences between learners.
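Before turning to plots, the variability across folds can also be summarized numerically. The sketch below runs a small stand-alone benchmark (so that it does not depend on objects created earlier) and computes the mean and standard deviation of mmce per task and learner with base R's aggregate. All names ending in `.small` and the name `spread` are hypothetical.

```r
library(mlr)

## Small stand-alone benchmark: two learners, one task, 10-fold CV
lrns.small = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
rdesc.small = makeResampleDesc("CV", iters = 10)
bmr.small = benchmark(lrns.small, iris.task, rdesc.small, show.info = FALSE)

## Per-fold performances as a data.frame
perf.small = getBMRPerformances(bmr.small, as.df = TRUE)

## Mean and standard deviation of mmce across folds, per task and learner
spread = aggregate(mmce ~ task.id + learner.id, data = perf.small,
  FUN = function(x) c(mean = mean(x), sd = sd(x)))
spread
```

The same aggregate call can be applied to the data.frame returned by getBMRPerformances for any BenchmarkResult.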
Integrated plots
Plots are generated using ggplot2. Further customization, such as renaming plot elements or changing colors, is easily possible.
Visualizing performances
plotBMRBoxplots creates box or violin plots which show the distribution of performance values across resampling iterations for one performance measure and for all learners and tasks (and thus visualize the output of getBMRPerformances).
Below are both variants, box and violin plots. The first plot shows the mmce and the second plot the balanced error rate (ber). Moreover, in the second plot we color the boxes according to the learner.ids.
plotBMRBoxplots(bmr, measure = mmce)
plotBMRBoxplots(bmr, measure = ber, style = "violin", pretty.names = FALSE) +
aes(color = learner.id) +
theme(strip.text.x = element_text(size = 8))
Note that by default the measure names and the learner short.names are used as axis labels.
mmce$name
#> [1] "Mean misclassification error"
mmce$id
#> [1] "mmce"
getBMRLearnerIds(bmr)
#> [1] "lda" "rpart" "randomForest"
getBMRLearnerShortNames(bmr)
#> [1] "lda" "rpart" "rf"
If you prefer the ids like, e.g., mmce and ber, set pretty.names = FALSE (as done for the second plot). Of course you can also use the ggplot2 functionality like the ylab function to choose completely different labels.
One question which comes up quite often is how to change the panel headers (which default to the Task IDs) and the learner names on the x-axis. For example, looking at the above plots we would like to remove the "example" suffixes and the "mlbench" prefixes from the panel headers. Moreover, we want uppercase learner labels. Currently, probably the simplest solution is to change the factor levels of the plotted data as shown below.
plt = plotBMRBoxplots(bmr, measure = mmce)
head(plt$data)
#> task.id learner.id iter mmce ber timetrain
#> 1 iris-example lda 1 0.0000000 0.0000000 0.004
#> 2 iris-example lda 2 0.1333333 0.1666667 0.004
#> 3 iris-example lda 3 0.0000000 0.0000000 0.003
#> 4 iris-example lda 4 0.0000000 0.0000000 0.003
#> 5 iris-example lda 5 0.0000000 0.0000000 0.004
#> 6 iris-example lda 6 0.0000000 0.0000000 0.004
levels(plt$data$task.id) = c("Iris", "Ringnorm", "Waveform", "Diabetes", "Sonar")
levels(plt$data$learner.id) = c("LDA", "CART", "RF")
plt + ylab("Error rate")
Visualizing aggregated performances
The aggregated performance values (resulting from getBMRAggrPerformances) can be visualized by function plotBMRSummary. This plot draws one line for each task on which the aggregated values of one performance measure for all learners are displayed. By default, the first measure in the list of Measures passed to benchmark is used, in our example mmce. Moreover, a small vertical jitter is added to prevent overplotting.
plotBMRSummary(bmr)
Calculating and visualizing ranks
In addition to the absolute performance, relative performance, i.e., the ranking of the learners, is usually of interest and might provide valuable additional insight.
Function convertBMRToRankMatrix calculates ranks based on aggregated learner performances of one measure. We choose the mean misclassification error (mmce). The rank structure can be visualized by plotBMRRanksAsBarChart.
m = convertBMRToRankMatrix(bmr, mmce)
m
#> iris-example mlbench.ringnorm mlbench.waveform
#> lda 1 3 2
#> rpart 3 2 3
#> randomForest 2 1 1
#> PimaIndiansDiabetes-example Sonar-example
#> lda 1 2
#> rpart 3 3
#> randomForest 2 1
Methods with best performance, i.e., with lowest mmce, are assigned the lowest rank. Linear discriminant analysis is best for the iris and PimaIndiansDiabetes-examples while the random forest shows best results on the remaining tasks.
plotBMRRanksAsBarChart with option pos = "tile" shows a corresponding heat map. The ranks are displayed on the x-axis and the learners are color-coded.
plotBMRRanksAsBarChart(bmr, pos = "tile")
A similar plot can also be obtained via plotBMRSummary. With option trafo = "rank" the ranks are displayed instead of the aggregated performances.
plotBMRSummary(bmr, trafo = "rank", jitter = 0)
Alternatively, you can draw stacked bar charts (the default) or bar charts with juxtaposed bars (pos = "dodge") that are better suited to compare the frequencies of learners within and across ranks.
plotBMRRanksAsBarChart(bmr)
plotBMRRanksAsBarChart(bmr, pos = "dodge")
Comparing learners using hypothesis tests
Many researchers feel the need to display an algorithm's superiority by employing some sort of hypothesis testing. As non-parametric tests seem better suited for such benchmark results, the tests provided in mlr are the overall Friedman test and the Friedman-Nemenyi post hoc test.
While the overall Friedman test, based on friedman.test from the stats package, tests the null hypothesis that there is no difference between the employed learners, the post hoc Friedman-Nemenyi test tests for significant differences between all pairs of learners. Non-parametric tests often have less power than their parametric counterparts, but fewer assumptions about underlying distributions have to be made. This often means many data sets are needed in order to be able to show significant differences at reasonable significance levels.
In our example, we want to compare the three learners on the selected data sets. First we might want to test the hypothesis whether there is a difference between the learners.
friedmanTestBMR(bmr)
#>
#> Friedman rank sum test
#>
#> data: mmce.test.mean and learner.id and task.id
#> Friedman chi-squared = 5.2, df = 2, p-value = 0.07427
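The test statistic above can be cross-checked with base R alone. The sketch below copies the aggregated mmce values from the benchmark summary table shown earlier into a matrix (tasks as blocks in rows, learners as groups in columns) and calls friedman.test from the stats package directly. The object names `mmce.mean`, `ranks`, and `ft` are chosen for this illustration, and the row names are shortened versions of the task IDs.

```r
## Aggregated mmce values copied from the benchmark summary table
## (rows: tasks, columns: learners)
mmce.mean = matrix(
  c(0.02000000, 0.08000000, 0.05333333,
    0.35000000, 0.17333333, 0.05833333,
    0.19000000, 0.28833333, 0.16500000,
    0.22778537, 0.25133288, 0.23685919,
    0.24619048, 0.30785714, 0.17785714),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("iris", "ringnorm", "waveform", "PimaIndiansDiabetes", "Sonar"),
    c("lda", "rpart", "randomForest")))

## Rank the learners within each task (rank 1 = lowest mmce)
ranks = t(apply(mmce.mean, 1, rank))
colMeans(ranks)

## Friedman rank sum test on the task-by-learner matrix
ft = friedman.test(mmce.mean)
ft$statistic
ft$p.value
```

The rank sums of 9, 14, and 7 for lda, rpart, and randomForest give a Friedman statistic of 12 / (5 * 3 * 4) * (81 + 196 + 49) - 3 * 5 * 4 = 5.2, which agrees with the value reported by friedmanTestBMR.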
In order to keep the computation time for this tutorial small, the Learners are only evaluated on five tasks. This also means that we operate on a relatively lenient significance level of $\alpha = 0.1$. As we can reject the null hypothesis of the Friedman test at this significance level, we might now want to test where these differences lie exactly.
friedmanPostHocTestBMR(bmr, p.value = 0.1)
#>
#> Pairwise comparisons using Nemenyi multiple comparison test
#> with q approximation for unreplicated blocked data
#>
#> data: mmce.test.mean and learner.id and task.id
#>
#> lda rpart
#> rpart 0.254 -
#> randomForest 0.802 0.069
#>
#> P value adjustment method: none
At this level of significance, we can reject the null hypothesis that there is no performance difference between the decision tree (rpart) and the random forest.
Critical differences diagram
In order to visualize differently performing learners, a critical differences diagram can be plotted, using either the Nemenyi test (test = "nemenyi") or the Bonferroni-Dunn test (test = "bd").
The mean rank of learners is displayed on the x-axis.
- Choosing test = "nemenyi" compares all pairs of Learners to each other, thus the output are groups of not significantly different learners. The diagram connects all groups of learners where the mean ranks do not differ by more than the critical differences. Learners that are not connected by a bar are significantly different, and the learner(s) with the lower mean rank can be considered "better" at the chosen significance level.
- Choosing test = "bd" performs a pairwise comparison with a baseline. An interval which extends by the given critical difference in both directions is drawn around the Learner chosen as baseline, though only comparisons with the baseline are possible. All learners within the interval are not significantly different, while the baseline can be considered better or worse than a given learner which is outside of the interval.
The critical difference $CD$ is calculated by $$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}},$$ where $N$ denotes the number of tasks, $k$ is the number of learners, and $q_\alpha$ comes from the studentized range statistic divided by $\sqrt{2}$. For details see Demsar (2006).
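For the Nemenyi variant, the critical difference can be reproduced by hand in base R. The sketch below assumes the formula from Demsar (2006), CD = q_alpha * sqrt(k (k + 1) / (6 N)), and approximates the infinite-degrees-of-freedom studentized range quantile with a large df value. The mean ranks are computed from the rank matrix shown earlier in this section. The variable names are chosen for this illustration.

```r
k = 3        # number of learners
N = 5        # number of tasks
alpha = 0.1  # significance level

## Studentized range quantile divided by sqrt(2);
## a large df approximates the infinite-df limit
q.alpha = qtukey(1 - alpha, nmeans = k, df = 1e6) / sqrt(2)
cd = q.alpha * sqrt(k * (k + 1) / (6 * N))
cd

## Mean ranks over the five tasks, taken from the rank matrix above
mean.ranks = c(lda = 1.8, rpart = 2.8, randomForest = 1.4)

## Pairs whose mean ranks differ by more than the critical difference
rank.diff = abs(outer(mean.ranks, mean.ranks, "-"))
signif.pairs = rank.diff > cd
signif.pairs
```

With these inputs only rpart and randomForest differ by more than the critical difference (a mean rank gap of 1.4), which is in line with the post hoc test result reported above.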
Function generateCritDifferencesData does all necessary calculations while function plotCritDifferences draws the plot. See the tutorial page about visualization for details on data generation and plotting functions.
## Nemenyi test
g = generateCritDifferencesData(bmr, p.value = 0.1, test = "nemenyi")
plotCritDifferences(g) + coord_cartesian(xlim = c(-1,5), ylim = c(0,2))
## Bonferroni-Dunn test
g = generateCritDifferencesData(bmr, p.value = 0.1, test = "bd", baseline = "randomForest")
plotCritDifferences(g) + coord_cartesian(xlim = c(-1,5), ylim = c(0,2))
Custom plots
You can easily generate your own visualizations by customizing the ggplot objects returned by the plots above, by retrieving the data from the ggplot objects and using them as a basis for your own plots, or by relying on the data.frames returned by getBMRPerformances or getBMRAggrPerformances. Here are some examples.
Instead of boxplots (as in plotBMRBoxplots) we could create density plots to show the performance values resulting from individual resampling iterations.
perf = getBMRPerformances(bmr, as.df = TRUE)
## Density plots for two tasks
qplot(mmce, colour = learner.id, facets = . ~ task.id,
data = perf[perf$task.id %in% c("iris-example", "Sonar-example"),], geom = "density") +
theme(strip.text.x = element_text(size = 8))
In order to plot multiple performance measures in parallel, perf is reshaped to long format. Below we generate grouped boxplots showing the error rate (mmce) and the training time (timetrain).
## Compare mmce and timetrain
df = reshape2::melt(perf, id.vars = c("task.id", "learner.id", "iter"))
df = df[df$variable != "ber",]
head(df)
#> task.id learner.id iter variable value
#> 1 iris-example lda 1 mmce 0.0000000
#> 2 iris-example lda 2 mmce 0.1333333
#> 3 iris-example lda 3 mmce 0.0000000
#> 4 iris-example lda 4 mmce 0.0000000
#> 5 iris-example lda 5 mmce 0.0000000
#> 6 iris-example lda 6 mmce 0.0000000
qplot(variable, value, data = df, colour = learner.id, geom = "boxplot",
xlab = "measure", ylab = "performance") +
facet_wrap(~ task.id, nrow = 2)
It might also be useful to assess if learner performances in single resampling iterations, i.e., in one fold, are related. This might help to gain further insight, for example by having a closer look at train and test sets from iterations where one learner performs exceptionally well while another one is fairly bad. Moreover, this might be useful for the construction of ensembles of learning algorithms. Below, function ggpairs from package GGally is used to generate a scatterplot matrix of mean misclassification errors (mmce) on the Sonar data set.
perf = getBMRPerformances(bmr, task.ids = "Sonar-example", as.df = TRUE)
df = reshape2::melt(perf, id.vars = c("task.id", "learner.id", "iter"))
df = df[df$variable == "mmce",]
df = reshape2::dcast(df, task.id + iter ~ variable + learner.id)
head(df)
#> task.id iter mmce_lda mmce_rpart mmce_randomForest
#> 1 Sonar-example 1 0.2857143 0.2857143 0.14285714
#> 2 Sonar-example 2 0.2380952 0.2380952 0.23809524
#> 3 Sonar-example 3 0.3333333 0.2857143 0.28571429
#> 4 Sonar-example 4 0.2380952 0.3333333 0.04761905
#> 5 Sonar-example 5 0.1428571 0.2857143 0.19047619
#> 6 Sonar-example 6 0.4000000 0.4500000 0.25000000
GGally::ggpairs(df, 3:5)
Further comments
- Note that for supervised classification mlr offers some more plots that operate on BenchmarkResult objects and allow you to compare the performance of learning algorithms. See for example the tutorial page on ROC curves and functions generateThreshVsPerfData, plotROCCurves, and plotViperCharts as well as the page about classifier calibration and function generateCalibrationData.
- In the examples shown in this section we applied "raw" learning algorithms, but often things are more complicated. At the very least, many learners have hyperparameters that need to be tuned to get sensible results. Reliable performance estimates can be obtained by nested resampling, i.e., by doing the tuning in an inner resampling loop while estimating the performance in an outer loop. Moreover, you might want to combine learners with pre-processing steps like imputation, scaling, outlier removal, dimensionality reduction or feature selection and so on. All this can be easily done using mlr's wrapper functionality. The general principle is explained in the section about wrapped learners in the Advanced part of this tutorial. There are also several sections devoted to common pre-processing steps.
- Benchmark experiments can very quickly become computationally demanding. mlr offers some possibilities for parallelization.