Data Preprocessing
Data preprocessing refers to any transformation of the data done before applying a learning algorithm. Examples include finding and resolving inconsistencies, imputing missing values, identifying, removing or replacing outliers, discretizing numerical data, generating numerical dummy variables for categorical data, transformations such as standardization of predictors or Box-Cox, dimensionality reduction, and feature extraction and/or selection.
mlr offers several options for data preprocessing. Some of the following simple methods to change a Task (or data.frame) were already mentioned on the page about learning tasks:
- capLargeValues: Convert large/infinite numeric values.
- createDummyFeatures: Generate dummy variables for factor features.
- dropFeatures: Remove selected features.
- joinClassLevels: Only for classification: Merge existing classes to new, larger classes.
- mergeSmallFactorLevels: Merge infrequent levels of factor features.
- normalizeFeatures: Normalize features by different methods, e.g., standardization or scaling to a certain range.
- removeConstantFeatures: Remove constant features.
- subsetTask: Remove observations and/or features from a Task.
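As a brief illustration, the sketch below applies three of these functions to a small data.frame. The data and its column names are invented for the example, and it assumes that the mlr package is installed.

```r
library(mlr)

# Toy data, invented for illustration: one constant column and one factor
df = data.frame(
  x1 = c(1, 2, 3, 4),
  x2 = c(5, 5, 5, 5),
  f = factor(c("a", "b", "a", "b")),
  y = c(0.1, 0.4, 0.35, 0.8)
)

# Drop constant features (here x2), but never the target
df = removeConstantFeatures(df, dont.rm = "y")
# Standardize the numeric features, leaving the target untouched
df = normalizeFeatures(df, target = "y", method = "standardize")
# Replace the factor f by numeric dummy columns
df = createDummyFeatures(df, target = "y")
str(df)
```

All three calls return a modified copy, so the original data.frame is only changed because the result is assigned back to df.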
Moreover, there are tutorial pages devoted to
Fusing learners with preprocessing
mlr's wrapper functionality lets you combine learners with preprocessing steps. This means that the preprocessing "belongs" to the learner and is done any time the learner is trained or predictions are made.
This is, on the one hand, very practical. You don't need to change any data or learning Tasks and it's quite easy to combine different learners with different preprocessing steps.
On the other hand, this helps to avoid a common mistake in evaluating the performance of a learner with preprocessing: preprocessing is often seen as completely independent of the learning algorithm applied later. When estimating the performance of a learner, e.g., by cross-validation, all preprocessing is done beforehand on the full data set and only training/predicting the learner is done on the train/test sets. Depending on what exactly is done as preprocessing, this can lead to overoptimistic results. For example, if imputation by the mean is done on the whole data set before evaluating the learner performance, you are using information from the test data during training, which can cause overoptimistic performance results.
To clarify things, one should distinguish between data-dependent and data-independent preprocessing steps: data-dependent steps in some way learn from the data and give different results when applied to different data sets. Data-independent steps always lead to the same results. Clearly, correcting errors in the data or removing data columns like Ids that should not be used for learning is data-independent. Imputation of missing values by the mean, as mentioned above, is data-dependent. Imputation by a fixed constant, however, is not.
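The distinction can be made concrete in base R. The following is only a sketch: the built-in iris data, the added id column and the split into two halves are arbitrary choices for illustration.

```r
# Two disjoint subsets of the same data (the split is arbitrary)
d = data.frame(id = seq_len(nrow(iris)), iris[, 1:4])
a = d[seq(1, nrow(d), by = 2), ]
b = d[seq(2, nrow(d), by = 2), ]

# Data-dependent: the centering value is learned from the data,
# so it differs between the two subsets
center.a = mean(a$Sepal.Length)
center.b = mean(b$Sepal.Length)
print(c(center.a, center.b))

# Data-independent: dropping the id column does the same thing
# no matter which rows are present
drop.id = function(x) x[, setdiff(colnames(x), "id"), drop = FALSE]
print(identical(colnames(drop.id(a)), colnames(drop.id(b))))
```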
To get an honest estimate of learner performance combined with preprocessing, all data-dependent preprocessing steps must be included in the resampling. This is done automatically when fusing a learner with preprocessing.
To this end mlr provides two wrappers:
- makePreprocWrapperCaret is an interface to all preprocessing options offered by caret's preProcess function.
- makePreprocWrapper lets you write your own custom preprocessing methods by defining the actions to be taken before training and before prediction.
As mentioned above, the specified preprocessing steps then "belong" to the wrapped Learner. In contrast to the preprocessing options listed above, like normalizeFeatures,
- the Task itself remains unchanged,
- the preprocessing is not done globally, i.e., for the whole data set, but for every pair of training/test data sets in, e.g., resampling,
- any parameters controlling the preprocessing, such as the percentage of outliers to be removed, can be tuned together with the base learner parameters.
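What "for every pair of training/test data sets" means can be sketched by hand. The base R loop below only illustrates the idea and is not mlr's implementation; the data set (airquality), the linear model and the three folds are assumptions chosen for brevity.

```r
set.seed(1)
# Built-in data with missing values in the feature Solar.R;
# rows with a missing target (Ozone) are dropped up front
d = airquality[!is.na(airquality$Ozone), c("Ozone", "Solar.R", "Wind", "Temp")]
folds = sample(rep(1:3, length.out = nrow(d)))

mse = numeric(3)
for (i in 1:3) {
  train = d[folds != i, ]
  test = d[folds == i, ]
  # Data-dependent step: the imputation value is estimated
  # on the training set only ...
  imp = mean(train$Solar.R, na.rm = TRUE)
  train$Solar.R[is.na(train$Solar.R)] = imp
  # ... and then reused unchanged for the test set
  test$Solar.R[is.na(test$Solar.R)] = imp
  fit = lm(Ozone ~ Solar.R + Wind + Temp, data = train)
  mse[i] = mean((test$Ozone - predict(fit, newdata = test))^2)
}
mean(mse)
```

Because the imputation value is recomputed inside the loop, no information from a test fold enters the corresponding training step.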
We start with some examples for makePreprocWrapperCaret.
Preprocessing with makePreprocWrapperCaret
makePreprocWrapperCaret is an interface to caret's preProcess function that provides many different options like imputation of missing values, data transformations such as scaling the features to a certain range or Box-Cox, and dimensionality reduction via Independent or Principal Component Analysis. For all possible options see the help page of function preProcess.
Note that the usage of makePreprocWrapperCaret is slightly different from that of preProcess.
- makePreprocWrapperCaret takes (almost) the same formal arguments as preProcess, but their names are prefixed by ppc.
- The only exception: makePreprocWrapperCaret does not have a method argument. Instead all preprocessing options that would be passed to preProcess's method argument are given as individual logical parameters to makePreprocWrapperCaret.
For example the following call to preProcess
preProcess(x, method = c("knnImpute", "pca"), pcaComp = 10)
with x being a matrix or data.frame would thus translate into
makePreprocWrapperCaret(learner, ppc.knnImpute = TRUE, ppc.pca = TRUE, ppc.pcaComp = 10)
where learner is a mlr Learner or the name of a learner class like "classif.lda".
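The naming rule can also be written down as a small function. Note that methodToPpcArgs is a hypothetical helper made up for this illustration; it is not part of mlr or caret.

```r
# Hypothetical helper (not part of mlr or caret): turn a preProcess-style
# method vector plus further arguments into ppc.-prefixed arguments
methodToPpcArgs = function(method, ...) {
  # Each entry of method becomes a logical flag set to TRUE
  flags = setNames(as.list(rep(TRUE, length(method))), paste0("ppc.", method))
  # All other arguments keep their value and only get the prefix
  extra = list(...)
  if (length(extra) > 0)
    names(extra) = paste0("ppc.", names(extra))
  c(flags, extra)
}

ppc.args = methodToPpcArgs(c("knnImpute", "pca"), pcaComp = 10)
str(ppc.args)
```

With mlr loaded, such a list should be usable via do.call(makePreprocWrapperCaret, c(list(learner = "classif.lda"), ppc.args)).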
If you enable multiple preprocessing options (like knn imputation and principal component analysis above) these are executed in a certain order detailed on the help page of function preProcess.
In the following we show an example where principal components analysis (PCA) is used for dimensionality reduction. This should never be applied blindly, but can be beneficial with learners that struggle with high-dimensional data or that can profit from rotating the data.
We consider the sonar.task, which poses a binary classification problem with 208 observations and 60 features.
sonar.task
#> Supervised task: Sonar-example
#> Type: classif
#> Target: Class
#> Observations: 208
#> Features:
#> numerics factors ordered functionals
#> 60 0 0 0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 2
#> M R
#> 111 97
#> Positive class: M
Below we fuse quadratic discriminant analysis from package MASS with a principal components preprocessing step. The threshold is set to 0.9, i.e., the principal components necessary to explain a cumulative percentage of 90% of the total variance are kept. The data are automatically standardized prior to PCA.
lrn = makePreprocWrapperCaret("classif.qda", ppc.pca = TRUE, ppc.thresh = 0.9)
lrn
#> Learner classif.qda.preproc from package MASS
#> Type: classif
#> Name: ; Short name:
#> Class: PreprocWrapperCaret
#> Properties: twoclass,multiclass,numerics,factors,prob
#> Predict-Type: response
#> Hyperparameters: ppc.BoxCox=FALSE,ppc.YeoJohnson=FALSE,ppc.expoTrans=FALSE,ppc.center=TRUE,ppc.scale=TRUE,ppc.range=FALSE,ppc.knnImpute=FALSE,ppc.bagImpute=FALSE,ppc.medianImpute=FALSE,ppc.pca=TRUE,ppc.ica=FALSE,ppc.spatialSign=FALSE,ppc.thresh=0.9,ppc.na.remove=TRUE,ppc.k=5,ppc.fudge=0.2,ppc.numUnique=3
The wrapped learner is trained on the sonar.task. By inspecting the underlying qda model, we see that the first 22 principal components have been used for training.
mod = train(lrn, sonar.task)
mod
#> Model for learner.id=classif.qda.preproc; learner.class=PreprocWrapperCaret
#> Trained on: task.id = Sonar-example; obs = 208; features = 60
#> Hyperparameters: ppc.BoxCox=FALSE,ppc.YeoJohnson=FALSE,ppc.expoTrans=FALSE,ppc.center=TRUE,ppc.scale=TRUE,ppc.range=FALSE,ppc.knnImpute=FALSE,ppc.bagImpute=FALSE,ppc.medianImpute=FALSE,ppc.pca=TRUE,ppc.ica=FALSE,ppc.spatialSign=FALSE,ppc.thresh=0.9,ppc.na.remove=TRUE,ppc.k=5,ppc.fudge=0.2,ppc.numUnique=3
getLearnerModel(mod)
#> Model for learner.id=classif.qda; learner.class=classif.qda
#> Trained on: task.id = Sonar-example; obs = 208; features = 22
#> Hyperparameters:
getLearnerModel(mod, more.unwrap = TRUE)
#> Call:
#> qda(f, data = getTaskData(.task, .subset, recode.target = "drop.levels"))
#>
#> Prior probabilities of groups:
#> M R
#> 0.5336538 0.4663462
#>
#> Group means:
#> PC1 PC2 PC3 PC4 PC5 PC6
#> M 0.5976122 -0.8058235 0.9773518 0.03794232 -0.04568166 -0.06721702
#> R -0.6838655 0.9221279 -1.1184128 -0.04341853 0.05227489 0.07691845
#> PC7 PC8 PC9 PC10 PC11 PC12
#> M 0.2278162 -0.01034406 -0.2530606 -0.1793157 -0.04084466 -0.0004789888
#> R -0.2606969 0.01183702 0.2895848 0.2051963 0.04673977 0.0005481212
#> PC13 PC14 PC15 PC16 PC17 PC18
#> M -0.06138758 -0.1057137 0.02808048 0.05215865 -0.07453265 0.03869042
#> R 0.07024765 0.1209713 -0.03213333 -0.05968671 0.08528994 -0.04427460
#> PC19 PC20 PC21 PC22
#> M -0.01192247 0.006098658 0.01263492 -0.001224809
#> R 0.01364323 -0.006978877 -0.01445851 0.001401586
Below the performances of qda with and without PCA preprocessing are compared in a benchmark experiment. Note that we use stratified resampling to prevent errors in qda caused by too few observations from either class.
rin = makeResampleInstance("CV", iters = 3, stratify = TRUE, task = sonar.task)
res = benchmark(list("classif.qda", lrn), sonar.task, rin, show.info = FALSE)
res
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.qda 0.3941339
#> 2 Sonar-example classif.qda.preproc 0.2643202
In this case PCA preprocessing turns out to be clearly beneficial for quadratic discriminant analysis: the mean misclassification error drops from about 0.39 to about 0.26.
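The idea behind the stratified resampling used above can be sketched in base R. This is not mlr's internal code; the toy target simply mirrors the class sizes of sonar.task, and the fold assignment is one possible way to stratify.

```r
set.seed(1)
# Binary toy target with the class sizes of sonar.task
y = factor(rep(c("M", "R"), times = c(111, 97)))

# Assign fold numbers separately within each class, so that every fold
# receives roughly the same share of each class
folds = integer(length(y))
for (cl in levels(y)) {
  idx = which(y == cl)
  folds[idx] = sample(rep(1:3, length.out = length(idx)))
}
tab = table(y, folds)
print(tab)
```

Each of the three folds then contains about a third of the observations of either class, so no training set is left with too few observations of one class.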
Joint tuning of preprocessing options and learner parameters
Let's see if we can optimize this a bit. The threshold value of 0.9 above was chosen arbitrarily and led to 22 out of 60 principal components. But maybe a lower or higher number of principal components should be used. Moreover, qda has several options that control how the class covariance matrices or class probabilities are estimated.
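How a variance threshold translates into a number of components can be sketched with base R's prcomp. The built-in USArrests data is an arbitrary stand-in for the sonar data, and the selection rule shown here is an assumption about what the threshold means; caret's exact rule may differ in detail.

```r
# Standardize and rotate a built-in data set (arbitrary choice)
pca = prcomp(USArrests, center = TRUE, scale. = TRUE)

# Cumulative share of the total variance explained by the leading components
cum.var = cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components that reaches the 90% threshold
thresh = 0.9
n.comp = which(cum.var >= thresh)[1]
print(round(cum.var, 3))
print(n.comp)
```

Raising thresh can only keep the same number of components or more, which is why the threshold and the number of components are two ways of controlling the same thing.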
Those preprocessing and learner parameters can be tuned jointly. Before doing this let's first get an overview of all the parameters of the wrapped learner using function getParamSet.
getParamSet(lrn)
#> Type len Def Constr Req
#> ppc.BoxCox logical - FALSE - -
#> ppc.YeoJohnson logical - FALSE - -
#> ppc.expoTrans logical - FALSE - -
#> ppc.center logical - TRUE - -
#> ppc.scale logical - TRUE - -
#> ppc.range logical - FALSE - -
#> ppc.knnImpute logical - FALSE - -
#> ppc.bagImpute logical - FALSE - -
#> ppc.medianImpute logical - FALSE - -
#> ppc.pca logical - FALSE - -
#> ppc.ica logical - FALSE - -
#> ppc.spatialSign logical - FALSE - -
#> ppc.thresh numeric - 0.95 0 to Inf -
#> ppc.pcaComp integer - - 1 to Inf -
#> ppc.na.remove logical - TRUE - -
#> ppc.k integer - 5 1 to Inf -
#> ppc.fudge numeric - 0.2 0 to Inf -
#> ppc.numUnique integer - 3 1 to Inf -
#> ppc.n.comp integer - - 1 to Inf -
#> method discrete - moment moment,mle,mve,t -
#> nu numeric - 5 2 to Inf Y
#> predict.method discrete - plug-in plug-in,predictive,debiased -
#> Tunable Trafo
#> ppc.BoxCox TRUE -
#> ppc.YeoJohnson TRUE -
#> ppc.expoTrans TRUE -
#> ppc.center TRUE -
#> ppc.scale TRUE -
#> ppc.range TRUE -
#> ppc.knnImpute TRUE -
#> ppc.bagImpute TRUE -
#> ppc.medianImpute TRUE -
#> ppc.pca TRUE -
#> ppc.ica TRUE -
#> ppc.spatialSign TRUE -
#> ppc.thresh TRUE -
#> ppc.pcaComp TRUE -
#> ppc.na.remove TRUE -
#> ppc.k TRUE -
#> ppc.fudge TRUE -
#> ppc.numUnique TRUE -
#> ppc.n.comp TRUE -
#> method TRUE -
#> nu TRUE -
#> predict.method TRUE -
The parameters prefixed by ppc. belong to preprocessing. method, nu and predict.method are qda parameters.
Instead of tuning the PCA threshold (ppc.thresh) we tune the number of principal components (ppc.pcaComp) directly. Moreover, for qda we try two different ways to estimate the posterior probabilities (parameter predict.method): the usual plug-in estimates and unbiased estimates.
We perform a grid search and set the resolution to 10. This is for demonstration. You might want to use a finer resolution.
ps = makeParamSet(
  makeIntegerParam("ppc.pcaComp", lower = 1, upper = getTaskNFeats(sonar.task)),
  makeDiscreteParam("predict.method", values = c("plug-in", "debiased"))
)
ctrl = makeTuneControlGrid(resolution = 10)
res = tuneParams(lrn, sonar.task, rin, par.set = ps, control = ctrl, show.info = FALSE)
res
#> Tune result:
#> Op. pars: ppc.pcaComp=8; predict.method=plug-in
#> mmce.test.mean=0.1920635
as.data.frame(res$opt.path)[1:3]
#> ppc.pcaComp predict.method mmce.test.mean
#> 1 1 plug-in 0.4757074
#> 2 8 plug-in 0.1920635
#> 3 14 plug-in 0.2162871
#> 4 21 plug-in 0.2643202
#> 5 27 plug-in 0.2454106
#> 6 34 plug-in 0.2645273
#> 7 40 plug-in 0.2742581
#> 8 47 plug-in 0.3173223
#> 9 53 plug-in 0.3512767
#> 10 60 plug-in 0.3941339
#> 11 1 debiased 0.5336094
#> 12 8 debiased 0.2450656
#> 13 14 debiased 0.2403037
#> 14 21 debiased 0.2546584
#> 15 27 debiased 0.3075224
#> 16 34 debiased 0.3172533
#> 17 40 debiased 0.3125604
#> 18 47 debiased 0.2979986
#> 19 53 debiased 0.3079365
#> 20 60 debiased 0.3654244
There seems to be a preference for a lower number of principal components (<27) for both "plug-in" and "debiased" with "plug-in" achieving slightly lower error rates.
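The ppc.pcaComp values in the opt.path above follow from the resolution of 10. The sketch below reproduces them in base R; that mlr constructs its grid in exactly this way is an assumption, although the result agrees with the values shown.

```r
# Ten evenly spaced values between the bounds of ppc.pcaComp (1 and 60),
# rounded to integers
grid.pcaComp = round(seq(1, 60, length.out = 10))
print(grid.pcaComp)

# Each value is combined with both settings of predict.method
n.settings = length(grid.pcaComp) * 2
print(n.settings)
```

A finer resolution would fill the gaps between these values at the cost of more model fits.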
Writing a custom preprocessing wrapper
If the options offered by makePreprocWrapperCaret are not enough, you can write your own preprocessing wrapper using function makePreprocWrapper.
As described in the tutorial section about wrapped learners, wrappers are implemented using a train and a predict method. In the case of preprocessing wrappers these methods specify how to transform the data before training and before prediction and are completely user-defined.
Below we show how to create a preprocessing wrapper that centers and scales the data before training/predicting. Some learning methods, such as k nearest neighbors, support vector machines or neural networks, usually require scaled features. Many, but not all, have a built-in scaling option where the training data set is scaled before model fitting and the test data set is scaled accordingly, that is, by using the scaling parameters from the training stage, before making predictions. In the following we show how to add a scaling option to a Learner by coupling it with function scale.
Note that we chose this simple example for demonstration. Centering/scaling the data is also possible with makePreprocWrapperCaret.
Specifying the train function
The train function has to be a function with the following arguments:
- data is a data.frame with columns for all features and the target variable.
- target is a string and denotes the name of the target variable in data.
- args is a list of further arguments and parameters that influence the preprocessing.
It must return a list with elements $data and $control, where $data is the preprocessed data set and $control stores all information required to preprocess the data before prediction.
The train function for the scaling example is given below. It calls scale on the numerical features and returns the scaled training data and the corresponding scaling parameters.
args contains the center and scale arguments of function scale and slot $control stores the scaling parameters to be used in the prediction stage.
Regarding the latter note that the center and scale arguments of scale can be either a logical value or a numeric vector of length equal to the number of the numeric columns in data, respectively.
If a logical value was passed to args we store the column means and standard deviations/root mean squares in the $center and $scale slots of the returned $control object.
trainfun = function(data, target, args = list(center = TRUE, scale = TRUE)) {
  ## Identify numerical features
  cns = colnames(data)
  nums = setdiff(cns[sapply(data, is.numeric)], target)
  ## Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = args$center, scale = args$scale)
  ## Store the scaling parameters in control
  ## These are needed to preprocess the data before prediction
  control = args
  if (is.logical(control$center) && control$center)
    control$center = attr(x, "scaled:center")
  if (is.logical(control$scale) && control$scale)
    control$scale = attr(x, "scaled:scale")
  ## Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(list(data = data, control = control))
}
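The attributes read by the train function, and the way they can be passed back to scale, can be inspected in isolation. The following base R sketch uses the built-in iris data with an arbitrary split into a training part and a "new" part.

```r
# Split a built-in data set into a training and a "new" part (arbitrary split)
x.train = as.matrix(iris[1:100, 1:4])
x.new = as.matrix(iris[101:150, 1:4])

# Scaling the training data stores the parameters as attributes
x.scaled = scale(x.train, center = TRUE, scale = TRUE)
ctr = attr(x.scaled, "scaled:center")
sds = attr(x.scaled, "scaled:scale")

# Passing these numeric vectors back to scale() applies the training
# parameters to new data instead of re-estimating them
x.new.scaled = scale(x.new, center = ctr, scale = sds)
print(round(colMeans(x.new.scaled), 2))
```

The column means of the new data are generally not zero after this step, which is intended: the new data are expressed on the scale of the training data.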
Specifying the predict function
The predict function has the following arguments:
- data is a data.frame containing only feature values (as for prediction the target values naturally are not known).
- target is a string indicating the name of the target variable.
- args are the args that were passed to the train function.
- control is the object returned by the train function.
It returns the preprocessed data.
In our scaling example the predict function scales the numerical features using the parameters from the training stage stored in control.
predictfun = function(data, target, args, control) {
  ## Identify numerical features
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  ## Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  ## Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(data)
}
Creating the preprocessing wrapper
Below we create a preprocessing wrapper with a regression neural network (which itself does not have a scaling option) as base learner.
The train and predict functions defined above are passed to makePreprocWrapper via the train and predict arguments. par.vals is a list of parameter values that is relayed to the args argument of the train function.
lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun,
  par.vals = list(center = TRUE, scale = TRUE))
lrn
#> Learner regr.nnet.preproc from package nnet
#> Type: regr
#> Name: ; Short name:
#> Class: PreprocWrapper
#> Properties: numerics,factors,weights
#> Predict-Type: response
#> Hyperparameters: size=3,trace=FALSE,decay=0.01
Let's compare the cross-validated mean squared error (mse) on the Boston Housing data set with and without scaling.
rdesc = makeResampleDesc("CV", iters = 3)
r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r
#> Resample Result
#> Task: BostonHousing-example
#> Learner: regr.nnet.preproc
#> Aggr perf: mse.test.mean=20.6204314
#> Runtime: 0.171712
lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r
#> Resample Result
#> Task: BostonHousing-example
#> Learner: regr.nnet
#> Aggr perf: mse.test.mean=55.0639088
#> Runtime: 0.101538
Joint tuning of preprocessing and learner parameters
Often it's not clear which preprocessing options work best with a certain learning algorithm. As already shown for the number of principal components in makePreprocWrapperCaret, we can easily tune them together with other hyperparameters of the learner.
In our scaling example we can try if nnet works best with both centering and scaling the data or if it's better to omit one of the two operations or do no preprocessing at all.
In order to tune center and scale we have to add appropriate LearnerParams to the parameter set of the wrapped learner.
As mentioned above scale allows for numeric and logical center and scale arguments. As we want to use the latter option we declare center and scale as logical learner parameters.
lrn = makeLearner("regr.nnet", trace = FALSE)
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun,
  par.set = makeParamSet(
    makeLogicalLearnerParam("center"),
    makeLogicalLearnerParam("scale")
  ),
  par.vals = list(center = TRUE, scale = TRUE))
lrn
#> Learner regr.nnet.preproc from package nnet
#> Type: regr
#> Name: ; Short name:
#> Class: PreprocWrapper
#> Properties: numerics,factors,weights
#> Predict-Type: response
#> Hyperparameters: size=3,trace=FALSE,center=TRUE,scale=TRUE
getParamSet(lrn)
#> Type len Def Constr Req Tunable Trafo
#> center logical - - - - TRUE -
#> scale logical - - - - TRUE -
#> size integer - 3 0 to Inf - TRUE -
#> maxit integer - 100 1 to Inf - TRUE -
#> linout logical - FALSE - Y TRUE -
#> entropy logical - FALSE - Y TRUE -
#> softmax logical - FALSE - Y TRUE -
#> censored logical - FALSE - Y TRUE -
#> skip logical - FALSE - - TRUE -
#> rang numeric - 0.7 -Inf to Inf - TRUE -
#> decay numeric - 0 0 to Inf - TRUE -
#> Hess logical - FALSE - - TRUE -
#> trace logical - TRUE - - FALSE -
#> MaxNWts integer - 1000 1 to Inf - FALSE -
#> abstol numeric - 0.0001 -Inf to Inf - TRUE -
#> reltol numeric - 1e-08 -Inf to Inf - TRUE -
Now we do a simple grid search for the decay parameter of nnet and the center and scale parameters.
rdesc = makeResampleDesc("Holdout")
ps = makeParamSet(
  makeDiscreteParam("decay", c(0, 0.05, 0.1)),
  makeLogicalParam("center"),
  makeLogicalParam("scale")
)
ctrl = makeTuneControlGrid()
res = tuneParams(lrn, bh.task, rdesc, par.set = ps, control = ctrl, show.info = FALSE)
res
#> Tune result:
#> Op. pars: decay=0.05; center=FALSE; scale=TRUE
#> mse.test.mean=14.8430579
as.data.frame(res$opt.path)
#> decay center scale mse.test.mean dob eol error.message exec.time
#> 1 0 TRUE TRUE 49.38128 1 NA <NA> 0.042
#> 2 0.05 TRUE TRUE 20.64761 2 NA <NA> 0.049
#> 3 0.1 TRUE TRUE 22.42986 3 NA <NA> 0.049
#> 4 0 FALSE TRUE 96.25474 4 NA <NA> 0.028
#> 5 0.05 FALSE TRUE 14.84306 5 NA <NA> 0.053
#> 6 0.1 FALSE TRUE 16.65383 6 NA <NA> 0.048
#> 7 0 TRUE FALSE 40.51518 7 NA <NA> 0.061
#> 8 0.05 TRUE FALSE 68.00069 8 NA <NA> 0.048
#> 9 0.1 TRUE FALSE 55.42210 9 NA <NA> 0.049
#> 10 0 FALSE FALSE 96.25474 10 NA <NA> 0.026
#> 11 0.05 FALSE FALSE 56.25758 11 NA <NA> 0.050
#> 12 0.1 FALSE FALSE 42.85529 12 NA <NA> 0.047
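The 12 rows of the opt.path are all combinations of the three decay values with the two logical parameters. The base R sketch below enumerates the same combinations; that its row order agrees with the output above is an observation from that output, not a documented guarantee.

```r
# All combinations of the three tuned parameters; decay varies fastest
grid = expand.grid(
  decay = c(0, 0.05, 0.1),
  center = c(TRUE, FALSE),
  scale = c(TRUE, FALSE)
)
print(grid)
nrow(grid)
```

Since a grid search fits one model per row and resampling iteration, the number of fits grows multiplicatively with every preprocessing parameter that is added to the search space.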
Preprocessing wrapper functions
If you have written a preprocessing wrapper that you might want to use from time to time, it's a good idea to encapsulate it in its own function as shown below. If you think your preprocessing method is something others might want to use as well and should be integrated into mlr, just contact us.
makePreprocWrapperScale = function(learner, center = TRUE, scale = TRUE) {
  ## Scale the numerical features and remember the scaling parameters
  trainfun = function(data, target, args = list(center = TRUE, scale = TRUE)) {
    cns = colnames(data)
    nums = setdiff(cns[sapply(data, is.numeric)], target)
    x = as.matrix(data[, nums, drop = FALSE])
    x = scale(x, center = args$center, scale = args$scale)
    control = args
    if (is.logical(control$center) && control$center)
      control$center = attr(x, "scaled:center")
    if (is.logical(control$scale) && control$scale)
      control$scale = attr(x, "scaled:scale")
    data = data[, setdiff(cns, nums), drop = FALSE]
    data = cbind(data, as.data.frame(x))
    return(list(data = data, control = control))
  }
  ## Apply the scaling parameters from the training stage
  predictfun = function(data, target, args, control) {
    cns = colnames(data)
    nums = cns[sapply(data, is.numeric)]
    x = as.matrix(data[, nums, drop = FALSE])
    x = scale(x, center = control$center, scale = control$scale)
    data = data[, setdiff(cns, nums), drop = FALSE]
    data = cbind(data, as.data.frame(x))
    return(data)
  }
  makePreprocWrapper(
    learner,
    train = trainfun,
    predict = predictfun,
    par.set = makeParamSet(
      makeLogicalLearnerParam("center"),
      makeLogicalLearnerParam("scale")
    ),
    par.vals = list(center = center, scale = scale)
  )
}
lrn = makePreprocWrapperScale("classif.lda")
train(lrn, iris.task)
#> Model for learner.id=classif.lda.preproc; learner.class=PreprocWrapper
#> Trained on: task.id = iris-example; obs = 150; features = 4
#> Hyperparameters: center=TRUE,scale=TRUE