mlrCPO is a package with some depth to it, so it comes with a few vignettes that each explain different aspects of its operation. These are the current document (“First Steps”), offering a short introduction and information on where to get started; “mlrCPO Core”, describing all the functions and tools offered by mlrCPO that are independent of specific CPOs; “CPOs Built Into mlrCPO”, listing all CPOs included in the mlrCPO package; and “Building Custom CPOs”, describing the process of creating new CPOs that offer new functionality.
All vignettes also have a “compact version” with the R output suppressed for readability. They are linked in the navigation section at the top.
All vignettes assume that mlrCPO (and therefore its requirement mlr) is installed successfully and loaded using library("mlrCPO"). Help with installation is provided on the project’s GitHub page.
“Composable Preprocessing Operators” (“CPOs”) are an extension for the mlr (“Machine Learning in R”) project that represents preprocessing operations as R objects. These CPO objects can be composed to form complex operations, applied to data sets, and attached to mlr Learner objects to generate machine learning pipelines that combine preprocessing and model fitting.
“Preprocessing”, as understood by mlrCPO, is any manipulation of data used in a machine learning process to get it from its form as found in the wild into a form more fitting for the machine learning algorithm (“Learner”) used for model fitting. It is important to keep track of the exact method of preprocessing, so that the same method can be performed when the resulting model is used to make predictions on new data. It is also important, when evaluating preprocessing methods, e.g. using resampling, that the parameters of these methods are independent of the validation data set and depend only on the training data set.
mlrCPO tries to support the user in all these aspects of preprocessing:
It provides CPOs that can perform many different operations. Operations that go beyond the provided toolset can be implemented in custom CPOs.
It provides “CPOTrained” objects that represent the preprocessing done on training data, so that it can be re-applied in the same way to new prediction data.
It provides “Learner” objects that represent the entire machine learning pipeline to be tuned and evaluated.
At the centre of mlrCPO are “CPO” objects. To get a CPO object, it is necessary to call a CPO Constructor. A CPO Constructor sets up the parameters of a CPO and provides further options for its behaviour. Internally, CPO Constructors are functions that have a common interface and a friendly printer method.
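A minimal sketch of calling a CPO Constructor, assuming mlrCPO is installed. cpoScale and its center parameter are part of mlrCPO; the variable names are illustrative only.

```r
library("mlrCPO")

# cpoScale is a CPO Constructor; calling it like a function creates a CPO.
cpo.default = cpoScale()                  # default parameters
cpo.nocenter = cpoScale(center = FALSE)   # same operation, centering switched off

class(cpoScale)      # the constructor itself
class(cpo.default)   # the resulting CPO object
print(cpo.nocenter)  # the printer shows the operation and its parameter values
```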
CPOs exist first to be applied to data. Every CPO represents a certain data transformation, and this transformation is performed when the CPO is applied. This can be done using the applyCPO function, or the %>>% operator. CPOs can be applied to data.frame objects, and to mlr Task objects.
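As an illustration of applying a CPO, the following sketch scales the built-in iris data in three equivalent ways. The choice of cpoScale and of the iris data is an assumption made for this example.

```r
library("mlrCPO")

# Apply with the operator ...
scaled.op = iris %>>% cpoScale()
# ... or with the function; the CPO comes first, the data second.
scaled.fn = applyCPO(cpoScale(), iris)

# CPOs also work on mlr Tasks; the result is again a Task.
task = makeClassifTask(data = iris, target = "Species")
scaled.task = task %>>% cpoScale()

colMeans(scaled.op[1:4])  # numeric columns are now centered around zero
```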
A useful feature of CPOs is that they can be concatenated to form new operations. Two CPOs can be combined using the composeCPO function or, as before, the %>>% operator. When two CPOs are combined, the product is a new CPO that can itself be composed or applied. The result of a composition represents the operation of first applying the first CPO and then the second CPO. Therefore, data %>>% (cpo1 %>>% cpo2) is the same as (data %>>% cpo1) %>>% cpo2.
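The associativity described above can be checked with a short sketch. cpoScale and cpoPca stand in for cpo1 and cpo2, and only the numeric iris columns are used; both are assumptions of this example.

```r
library("mlrCPO")

cpo1 = cpoScale()
cpo2 = cpoPca()
data = iris[1:4]

res.composed = data %>>% (cpo1 %>>% cpo2)     # compose first, then apply
res.stepwise = (data %>>% cpo1) %>>% cpo2     # apply one after the other
res.function = applyCPO(composeCPO(cpo1, cpo2), data)  # function form

head(res.composed)
```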
The last example shows that it is sometimes not a good idea to have a CPO affect the whole dataset. Therefore, when a CPO is created, it is possible to choose what columns the CPO should affect. The CPO Constructor has a variety of parameters, starting with affect., that can be used to choose what columns the CPO operates on. These can be used, for example, to prevent cpoAsNumeric from influencing the Species column.
A more convenient method in this case, however, is to use an mlr “Task”, which keeps track of the target column. “Feature Operation” CPOs (like all the ones shown so far) do not influence the target column.
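The two ways of protecting a column can be sketched as follows. The affect.names and affect.invert parameters belong to the affect.* family mentioned above; the iris data and the variable names are assumptions of this example.

```r
library("mlrCPO")

# Without restriction, cpoAsNumeric also converts the Species factor.
all.cols = iris %>>% cpoAsNumeric()

# Select "Species" and invert the selection: every column except Species.
keep.species = iris %>>% cpoAsNumeric(affect.names = "Species",
  affect.invert = TRUE)

# affect.* works for any CPO, e.g. scaling a single column only.
one.col = iris %>>% cpoScale(affect.names = "Sepal.Length")

# In a Task, Species is the target and is left alone automatically.
task = makeClassifTask(data = iris, target = "Species")
task.data = getTaskData(task %>>% cpoAsNumeric())
```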
When performing preprocessing, it is sometimes necessary to change a small aspect of a long preprocessing pipeline. Instead of having to re-construct the whole pipeline, mlrCPO offers the possibility to change hyperparameters of a CPO. This makes it very easy, e.g., to tune preprocessing in combination with a machine learning algorithm. Hyperparameters of CPOs can be manipulated in the same way as they are manipulated for mlr Learners, using getParamSet (to list the parameters), getHyperPars (to list the parameter values), and setHyperPars (to change these values). To get the parameter set of a CPO, it is also possible to use verbose printing using the ! (exclamation mark) operator.
CPOs use copy semantics, therefore setHyperPars creates a copy of a CPO that has the changed hyperparameters.
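A short sketch of the hyperparameter interface and its copy semantics, using cpoScale as an assumed example. Its hyperparameters carry the default id "scale" as a prefix.

```r
library("mlrCPO")

cpo = cpoScale()
getParamSet(cpo)   # parameters with their types
getHyperPars(cpo)  # current values: scale.center and scale.scale

# setHyperPars returns a modified copy; `cpo` itself is unchanged.
cpo2 = setHyperPars(cpo, scale.scale = FALSE)

getHyperPars(cpo)$scale.scale
getHyperPars(cpo2)$scale.scale
```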
When chaining many CPOs, it is possible for the many hyperparameters to lead to very cluttered ParamSets, or even for hyperparameter names to clash. mlrCPO has two remedies for that.
First, any CPO also has an id that is always prepended to the hyperparameter names. It can be set during construction, using the id parameter, or changed later using setCPOId. The latter only works on primitive, i.e. not compound, CPOs. Set the id to NULL to use the CPO’s hyperparameters without a prefix.
The second remedy against hyperparameter clashes is different “exports” of hyperparameters: The hyperparameters that can be changed using setHyperPars, i.e. that are exported by a CPO, are a subset of the parameters of the CPOConstructor. For each kind of CPO, there is a standard set of parameters that are exported, but during construction, it is possible to influence the parameters that actually get exported via the export parameter. export can be one of a set of standard export settings (among them “export.all” and “export.none”) or a character vector of the parameters to export.
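Both remedies can be sketched with cpoScale, which is an assumed example here. The ids "a", "b" and "first" are arbitrary illustrative names.

```r
library("mlrCPO")

# Remedy 1: ids. Two cpoScale steps would clash without distinct ids.
pipeline = cpoScale(id = "a") %>>% cpoScale(id = "b")
names(getHyperPars(pipeline))

renamed = setCPOId(cpoScale(), "first")  # change the id after construction
noprefix = cpoScale(id = NULL)           # no prefix at all
names(getHyperPars(noprefix))

# Remedy 2: exports. Nothing is exported, so nothing can clash.
hidden = cpoScale(export = "export.none")
getHyperPars(hidden)

# Export a chosen subset by naming the constructor parameters.
partial = cpoScale(export = "center")
names(getHyperPars(partial))
```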
Manipulating data for preprocessing itself is relatively easy. A challenge comes when one wants to integrate preprocessing into a machine-learning pipeline: The same preprocessing steps that are performed on the training data need to be performed on the new prediction data. However, the transformation performed for prediction often needs information from the training step. For example, if training entails performing PCA, then for prediction the data must not undergo another PCA; instead, it needs to be rotated by the rotation matrix found by the training PCA. The process of obtaining the rotation matrix will be called “training” the CPO, and the object that contains the trained information is called CPOTrained. For preprocessing operations that operate only on features of a task (as opposed to the target column), the CPOTrained will always be applied to new incoming data, and hence be of class CPORetrafo and called a “retrafo” object. To obtain this retrafo object, one can use retrafo(). Retrafo objects can be applied to data just as CPOs can, by using the %>>% operator.
To show that ret actually represents the exact same preprocessing operation, we can feed the first line of iris.demo back to it, to verify that the transformation is the same. We obviously would not have gotten there by feeding the first line to the untrained CPO directly.
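The difference between a CPO and its retrafo can be sketched with a train/new-data split. The split of iris into the first 100 and last 50 rows and the use of cpoScale are assumptions of this example, not the vignette's own data.

```r
library("mlrCPO")

train.data = iris[1:100, 1:4]
new.data = iris[101:150, 1:4]

# Applying the CPO "trains" it: means and standard deviations of train.data.
trained = train.data %>>% cpoScale()
ret = retrafo(trained)

# The retrafo re-uses the training statistics on new data ...
new.by.retrafo = new.data %>>% ret
# ... whereas the CPO itself would re-estimate them from new.data.
new.by.cpo = new.data %>>% cpoScale()

mean(new.by.retrafo$Sepal.Length)  # not zero: centered with training mean
mean(new.by.cpo$Sepal.Length)      # zero up to rounding
```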
CPOTrained objects associated with an object are automatically chained when another CPO is applied. To prevent this from happening, it is necessary to “clear” the retrafos and inverters associated with the object using clearRI(). clearRI has no influence on the CPO operations themselves, and the resulting data is the same.
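The chaining behaviour and the effect of clearRI() can be sketched as follows, again with cpoScale and cpoPca on the numeric iris columns as assumed examples. as.list() splits a CPOTrained into its primitive steps.

```r
library("mlrCPO")

step1 = iris[1:4] %>>% cpoScale()

# The retrafo of step1 is carried along and chained with the PCA retrafo.
chained = step1 %>>% cpoPca()
length(as.list(retrafo(chained)))  # scale step and pca step

# After clearRI(), only the PCA retrafo is attached to the result.
fresh = clearRI(step1) %>>% cpoPca()
length(as.list(retrafo(fresh)))    # pca step only
```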
It is also possible to chain CPOTrained objects using %>>%. This can be useful if the trafo chain loses access to the retrafo attribute for some reason. In general, it is only recommended to compose CPOTrained objects that were created in the same process and in the correct order, since they are usually closely associated with the training data in a particular place within the preprocessing chain.
So far, only CPOs that change the feature columns of a Task were introduced (“Feature Operation CPOs”, FOCPOs). There is another class of CPOs, “Target Operation CPOs” or TOCPOs, that can change a Task’s target columns.
This comes at the cost of some complexity when performing prediction: Since the training data that was ultimately fed into a Learner had a transformed target column, the predictions made by the resulting model will not be directly comparable to the original target values. Consider a CPO that log-transforms the target variable of a regression Task. The predictions made with a Learner on a log-transformed target variable will be in log-space and need to be exponentiated (or otherwise re-transformed). This inversion operation is represented by an “inverter” object that is attached to a transformation result similarly to a retrafo object, and can be obtained using the inverter() function. It is of class CPOInverter, a subclass of CPOTrained.
The inverter object is used by the invert() function that inverts the prediction made by a model trained on the transformed task, and re-transforms this prediction to fit the space of the original target data. The inverter object caches the “truth” of the data being inverted (iris.logd, in the example), so invert can give information on the truth of the inverted data.
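A sketch of the inversion workflow with cpoLogTrafoRegr. The regression task (predicting Petal.Width from the other numeric iris columns), the "regr.lm" learner and the variable names are assumptions of this example and need not match the vignette's own iris.regr and iris.logd objects.

```r
library("mlrCPO")

regr.task = makeRegrTask(data = iris[1:4], target = "Petal.Width")

# Log-transform the target; the inverter is attached to the result.
logd = regr.task %>>% cpoLogTrafoRegr()
inv = inverter(logd)

# Train and predict in log-space.
model = train(makeLearner("regr.lm"), logd)
pred.log = predict(model, logd)

# Map the prediction back to the original target space.
pred = invert(inv, pred.log)
head(pred$data)
```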
This procedure can also be done with new incoming data. In general, more than just the cpoLogTrafoRegr operation could be done on the iris.regr task in the example, so to perform the complete preprocessing and inversion, one needs to use the retrafo object as well. When applying the retrafo object, a new inverter object is generated, which is specific to the exact new data that was retransformed.
cpoLogTrafoRegr is a special case of TOCPO in that its inversion operation is constant: It does not depend on the new incoming data, so in theory it is not necessary to get a new inverter object for every piece of data that is being transformed. Therefore, it is possible to use the retrafo object for inversion in this case. However, the “truth” column will not be available in this case.
Whether a retrafo object is capable of performing inversion can be checked with the getCPOTrainedCapability() function. It returns a vector with named elements "retrafo" and "invert", indicating whether a CPOTrained is capable of performing retrafo or inversion. A 1 indicates that the object can perform the action and has an effect, a 0 indicates that the action would have no effect (but also throws no error), and a -1 means that the object is not capable of performing the action.
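The capability check can be sketched on the retrafo and inverter of a log-transformed task. The task definition is an assumption of this example; the printed values are left for the reader to inspect rather than stated here.

```r
library("mlrCPO")

regr.task = makeRegrTask(data = iris[1:4], target = "Petal.Width")
logd = regr.task %>>% cpoLogTrafoRegr()

cap.retrafo = getCPOTrainedCapability(retrafo(logd))
cap.inverter = getCPOTrainedCapability(inverter(logd))

cap.retrafo   # named vector with elements "retrafo" and "invert"
cap.inverter
```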
As an example of a CPO that does not have a constant inverter, consider cpoRegrResiduals, which fits a regression model on training data and returns the residuals of this fit. When performing prediction, the invert action is to add predictions by the CPO’s model to the incoming predictions made by a model trained on the residuals.
Besides FOCPOs and TOCPOs, there are also “Retrafoless” CPOs (ROCPOs). These only perform operations in the training part of a machine learning pipeline, but in turn are the only CPOs that may change the number of rows in a dataset.
CPOs that may change the number of rows in a dataset. The goal of ROCPOs is to change the number of data samples, but not to transform the data or target values themselves. Examples of ROCPOs are
There is no retrafo or inverter associated with the result. Instead, both of them are NULLCPO.
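A sketch of a retrafoless operation, assuming cpoSample with a size argument as the subsampling CPO. The seed and the sample size of 3 are arbitrary illustrative choices.

```r
library("mlrCPO")

set.seed(1)
# Subsample three rows; the number of rows changes, the values do not.
sampled = iris %>>% cpoSample(size = 3)
sampled

# Nothing needs to be re-applied or inverted at prediction time.
retrafo(sampled)
inverter(sampled)
```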
Until now, the CPOs have been invoked explicitly to manipulate data and get retrafo and inverter objects. It is good to be aware of the data flows in a machine learning process involving preprocessing, but mlrCPO makes it very easy to automate this. It is possible to attach a CPO to a Learner using attachCPO or the %>>% operator. When a CPO is attached to a Learner, a CPOLearner is created. The CPOLearner performs the preprocessing operation dictated by the CPO before training the underlying model, and stores and uses the retrafo and inverter objects necessary during prediction. It is possible to attach compound CPOs, and it is possible to attach further CPOs to a CPOLearner to extend the preprocessing pipeline. Exported hyperparameters of a CPO are also present in a CPOLearner and can be changed using setHyperPars, as usual with other Learners.
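A sketch of a CPOLearner that combines a feature operation, a target operation and a model. The task, the "regr.lm" learner and the particular CPOs are assumptions of this example, not the pipeline used elsewhere in this vignette.

```r
library("mlrCPO")

regr.task = makeRegrTask(data = iris[1:4], target = "Petal.Width")

# Attach with the operator ...
lrn = cpoScale() %>>% cpoLogTrafoRegr() %>>% makeLearner("regr.lm")
# ... or with the function, one CPO at a time.
lrn.fn = attachCPO(cpoLogTrafoRegr(), makeLearner("regr.lm"))

# The CPO's exported hyperparameters are now Learner hyperparameters.
lrn = setHyperPars(lrn, scale.center = FALSE)

# Training and prediction handle retrafo and inversion internally:
# the response is already back on the original Petal.Width scale.
model = train(lrn, regr.task)
pred = predict(model, regr.task)
head(pred$data)
```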
Recreating the pipeline from General Inverters with a CPOLearner looks like the following. Note the prediction pred made in the end is identical to the one made above.
It is possible to get the retrafo object from a model trained with a CPOLearner using the retrafo() function. In this example, it is identical to the retrafo(iris.resid) obtained in the example in General Inverters.
Since the hyperparameters of a CPO are present in a CPOLearner, it is possible to tune hyperparameters of preprocessing operations. This can be done using the tuneParams() function and works identically to tuning common Learner hyperparameters.
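A sketch of tuning preprocessing hyperparameters, assuming the rpart package is available for the "classif.rpart" learner. The grid over the two logical cpoScale parameters, the cv3 resampling and the seed are illustrative choices only.

```r
library("mlrCPO")

lrn = cpoScale() %>>% makeLearner("classif.rpart")

# Search space over the CPO's exported hyperparameters.
ps = makeParamSet(
  makeLogicalParam("scale.center"),
  makeLogicalParam("scale.scale"))

set.seed(1)
res = tuneParams(lrn, iris.task, resampling = cv3, par.set = ps,
  control = makeTuneControlGrid(), show.info = FALSE)

res$x  # best combination found on the grid
```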
Besides the %>>% operator, there are a few related operators which are short forms of operations that otherwise take more typing.
%<<% is similar to %>>% but works in the other direction. a %>>% b is the same as b %<<% a.
%<>>% and %<<<% are the %>>% and %<<% operators, combined with assignment. a %<>>% b is the same as a = a %>>% b. These operators perform the operations on their right before they do the assignment, so it is not necessary to use parentheses when writing a = a %>>% b %>>% c as a %<>>% b %>>% c.
%>|% and %|<% feed data into a CPO and get the retrafo. data %>|% a is the same as retrafo(data %>>% a). The %>|% operator performs the operation on its right before getting the retrafo, so it is not necessary to use parentheses when writing retrafo(data %>>% a %>>% b) as data %>|% a %>>% b.
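The operator shortcuts can be sketched side by side with their long forms. The names a, b and pipeline are illustrative, and cpoScale and cpoPca are assumed stand-ins for arbitrary CPOs.

```r
library("mlrCPO")

a = cpoScale()
b = cpoPca()
data = iris[1:4]

forward = a %>>% b
backward = b %<<% a     # same pipeline, written right to left

pipeline = a
pipeline %<>>% b        # same as: pipeline = pipeline %>>% b

ret.short = data %>|% a           # same as the line below
ret.long = retrafo(data %>>% a)
```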
As described before, it is possible to compose CPOs to create relatively complex preprocessing pipelines. It is therefore necessary to have tools to inspect a CPO pipeline or related objects.
The first line of attack when inspecting a CPO is always to print it. print(x, verbose = TRUE) will often print more information about a CPO than the ordinary print function. A shorthand alias for this is the exclamation point “!”. When verbosely printing a CPOConstructor, the transformation functions are shown. When verbosely printing a CPO, the constituent elements are separately printed, each showing their parameter sets.
When working with compound CPOs, it is sometimes necessary to manipulate a CPO inside a compound CPO pipeline. For this purpose, the as.list() generic is implemented for both CPO and CPOTrained for splitting a pipeline into a list of the primitive elements. The inverse is pipeCPO(), which takes a list of CPO or CPOTrained and concatenates them using %>>%.
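Inspecting and editing a pipeline can be sketched as follows. The three-step pipeline is an assumed example; the edit changes a hyperparameter of the first element and then rebuilds the pipeline.

```r
library("mlrCPO")

pipeline = cpoScale() %>>% cpoPca() %>>% cpoAsNumeric()

print(pipeline, verbose = TRUE)  # verbose print
!pipeline                        # shorthand for the same

# Split into primitive CPOs, change one of them, and put them back together.
parts = as.list(pipeline)
parts[[1]] = setHyperPars(parts[[1]], scale.center = FALSE)
rebuilt = pipeCPO(parts)

getHyperPars(rebuilt)$scale.center
```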
CPOTrained objects contain information about the retrafo or inversion to be performed for a CPO. It is possible to access this information using getCPOTrainedState(). The “state” of a CPOTrained object often contains a $data slot with information about the expected input and output format (“ShapeInfo”) of incoming data, a slot for each of its hyperparameters, and a $control slot that is specific to the CPO in question. The cpoPca state, for example, contains the PCA rotation matrix and a vector for scaling and centering. The contents of a state’s $control object are described in a CPO’s help page.
It is even possible to change the “state” of a CPOTrained and construct a new CPOTrained using makeCPOTrainedFromState(). This is fairly advanced usage and only recommended for users familiar with the inner workings of the particular CPO. If we get familiar with the cpoPca CPO using the verbose print (!cpoPca) to look at the retrafo function, we notice that the control$center and control$scale values are given to a call of scale(). If we want to create a new CPOTrained that does not perform centering or scaling before applying the rotation matrix, we can change these values.
Comparing this to the original “repca” retrafo shows that the result of applying repca has generally smaller values because of the centering.
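The state manipulation described above can be sketched as follows. The data (numeric iris columns) and the name nosc.repca are assumptions of this example; the control slot names follow the description of the cpoPca state given in the text.

```r
library("mlrCPO")

data = iris[1:4]
repca = retrafo(data %>>% cpoPca())

# Inspect the state, then switch off centering and scaling in a copy.
state = getCPOTrainedState(repca)
names(state)
state$control$center = FALSE
state$control$scale = FALSE

# Rebuild a retrafo from the modified state.
nosc.repca = makeCPOTrainedFromState(cpoPca, state)

head(data %>>% repca)       # centered before rotation
head(data %>>% nosc.repca)  # rotated without centering
```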
There is a large and growing variety of CPOs that perform many different operations. It is advisable to browse through CPOs Built Into mlrCPO for an overview. To get a list of all built-in CPOs, use listCPO(). A few important or “meta” CPOs that can be used to influence the behaviour of other CPOs are described here.
The value associated with “no operation” is the NULLCPO value. It is the neutral element of the %>>% operations, and the value of retrafo() and inverter() when there are otherwise no associated retrafo or inverter values.
The multiplexer makes it possible to combine many CPOs into one, with an extra selected.cpo parameter that chooses between them. This makes it possible to tune over many different preprocessing configurations at once.
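A sketch of the multiplexer, assuming cpoMultiplex accepts a list of CPO Constructors and names the options after their ids ("scale" and "pca" here). The choice of the two CPOs is illustrative.

```r
library("mlrCPO")

cpm = cpoMultiplex(list(cpoScale, cpoPca))
!cpm  # shows selected.cpo plus the parameters of both options

data = iris[1:4]
# The same object performs either operation, depending on selected.cpo.
out.scale = data %>>% setHyperPars(cpm, selected.cpo = "scale")
out.pca = data %>>% setHyperPars(cpm, selected.cpo = "pca")

head(out.scale)
head(out.pca)
```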
The CPO applicator is a simple CPO with one parameter, which is itself a CPO that gets applied to the data. This is different from a multiplexer in that its parameter is free and can take any value that behaves like a CPO. On the downside, this does not expose the argument’s parameters to the outside.
Attaching the CPO applicator to a learner gives this learner a “cpo” hyperparameter that can be set to any CPO.
cpoCbind takes other CPOs as operations and cbinds their results. The cbinder makes it possible to build DAGs of CPOs that perform different operations on data and paste the results next to each other. It is often useful to combine cpoCbind with cpoSelect to filter out columns that would otherwise be duplicated.
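A sketch of such a DAG: the Sepal columns are scaled, the scaled columns are additionally fed into a PCA, and the untouched Petal columns are appended. The ids "first" and "second" and the variable names are illustrative assumptions.

```r
library("mlrCPO")

# Branch 1: select the Sepal columns and scale them.
scaled.sepal = cpoSelect(pattern = "Sepal", id = "first") %>>%
  cpoScale(id = "scale")
# Branch 2: the same scaled columns, followed by a PCA.
scaled.pca = scaled.sepal %>>% cpoPca()
# Branch 3: the Petal columns, unchanged.
petal = cpoSelect(pattern = "Petal", id = "second")

cbinder = cpoCbind(scaled.sepal, scaled.pca, petal)
result = iris %>>% cbinder
head(result)

!cbinder  # verbose print of the resulting graph
```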
cpoCbind recognises that "scale" happens before "pca", but is also fed to the result directly. The verbose print draws a (crude) ASCII-art graph.
Even though CPOs are very flexible and can be combined in many ways, it may be necessary to create completely custom CPOs. Custom CPOs can be created using makeCPO() and related functions. “Building Custom CPOs” is a wide topic which has its own vignette.
CPOs are built using CPOConstructors by calling them like functions. CPOConstructors can be found by using listCPO() or consulting the relevant vignette.
Verbose printing of CPOs and many related objects is available using the ! (exclamation mark) operator.
CPOs export hyperparameters that are accessible using getHyperPars(), and mutable using setHyperPars(). Which parameters are exported can be controlled using the export parameter during construction.
CPOs can be composed (composeCPO()), applied to data (applyCPO()) and attached to Learners (attachCPO()) using special functions for each of these operations, or using the general %>>% operator.
There are three kinds of CPO: FOCPO (Feature Operation CPOs), TOCPO (Target Operation CPOs) and ROCPO (Retrafoless CPOs). The first may only change feature columns, the second only target columns. While the last one may change both feature and target values and even the number of rows of a dataset, it does so with the understanding that new “prediction” data will not be transformed by it and is thus mainly useful for subsampling.
Data transformed by a CPO has a retrafo CPOTrained object associated with it that can be retrieved using retrafo() and used to transform new prediction data in a similar way as the original training data.
CPOTrained objects can themselves be composed using %>>%, although it is only recommended to compose CPOTrained objects in the same order as they were created, and only if they were created in the same preprocessing pipeline.
CPOTrained objects can be inspected using getCPOTrainedState(), and re-built with changed state using makeCPOTrainedFromState().
Inverter CPOTrained objects can be retrieved using inverter(). An inverter is also created during application of a retrafo CPOTrained. While retrafo CPOTrained are created during training and used on every prediction data set, inverter CPOTrained are created anew during each CPOTrained application and are closely associated with the data that they were created with.
CPOTrained objects associated with data are stored in their “attributes” and are automatically chained when more CPOs are applied. clearRI() is used to remove the associated CPOTrained objects and prevent this chaining.
CPOs can be attached to Learners to get CPOLearners which automatically transform training and prediction data and perform prediction inversion.
CPOLearners have the Learner’s and the CPO’s hyperparameters and can thus be manipulated using setHyperPars(), and can be tuned using tuneParams().
Special CPOs described above are NULLCPO (the neutral element of %>>%), the multiplexer, the CPO applicator and the cbinder.
Custom CPOs can be created using makeCPO and similar functions. These are described in their own vignette.