Introduction

Regression trees, and their bootstrapped derivatives, are among the most popular off-the-shelf machine learning techniques. Their popularity stems from their high interpretability and strong predictive power. The original regression tree methodology of Breiman et al (1984) considered scalar or categorical output variables. In modern applications this is a significant limitation, since many modern datasets contain output variables that are multidimensional, such as vectors, or even infinite dimensional, such as curves or functions. To expand the applicability of regression trees to such complex output data, Segal (1992) and Segal (2011) developed strategies for growing regression trees when the outputs are vectors, while Yu and Lambert (1998) applied the same strategies in the context of infinite dimensional output variables. While these extensions are theoretically sound and were shown to perform well in their respective publications, freely available computer code implementing them was never released, which significantly limited their accessibility and adoption in the scientific and engineering communities. In this document we introduce the fTree package, which implements all of the aforementioned extensions of regression trees along with several other experimental techniques proposed in Grujic (2017). We start with a brief review of regression trees and then proceed to discuss the extension strategies for high dimensional output data. At the end of this document we also discuss a strategy for growing regression trees when the input data are functions. The package was developed for both scientific and engineering use, and special emphasis was given to extensibility, which we discuss last.

Regression Trees

Consider a training set \(\mathcal{T}=\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}\), where \(\boldsymbol{x}_i\) is a vector in \(R^p\) and \(y_i\) is a scalar output (\(y_i \in R\)), and let \(f: \boldsymbol{x} \rightarrow y\) be the function that maps \(\boldsymbol{x}_i\)’s to \(y_i\)’s. The idea of regression trees is to partition the \(p\)-dimensional input space spanned by the \(\boldsymbol{x}_i\)’s into \(M\) disjoint sub-regions \(R_m\), and then, within every sub-region, approximate the true function \(f\) with some local function \(f_m\). This local function can be a constant (i.e. the local mean), a local regression or even a Gaussian process. Input space partitioning can be performed in many ways; one of the most widely adopted schemes is the binary recursive splitting method proposed by Breiman et al (1984). The method is binary because it considers binary splits of \(R^p\) along one predictor (covariate) at a time. It is recursive because regions are recursively split into sub-regions until some stopping criterion is met, or no training data is left to split. The procedure is also greedy since, in determining the best split of a region, it does not consider the quality of later splits. To determine the best split of a region, the method relies on a user-specified cost function \(G\). For example, when considering some split \(s\) along predictor \(p\) of region \(R_m\) into two sub-regions \(R_{ml}\) and \(R_{mr}\), the quality of the split is computed as follows: \[ Q_{s,p}^m = G(R_m) - [G(R_{ml}) + G(R_{mr})] \] The splitting procedure computes the quality of all possible splits on all available predictors and selects the one with the highest quality, after which it proceeds to further refine the newly formed sub-regions. For trees with scalar outputs that approximate the true function with the local mean, the most appropriate cost function is the sum of squared residuals \(G_{sse}(R_m)=\sum_{y_i\in R_m}(y_i-\mu_{m})^2\).
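To make the splitting step concrete, the following is a small, self-contained illustration (not taken from the fTree package) of how the quality of a candidate split can be evaluated for scalar outputs with the sum-of-squared-residuals cost. The toy data and the names G_sse and splitQuality are hypothetical and used only here.

# Illustrative sketch only: evaluating candidate splits of one predictor
# for a scalar output with the sum-of-squared-residuals cost.
set.seed(1)
x <- runif(50)                                    # one continuous predictor
y <- ifelse(x < 0.4, 1, 3) + rnorm(50, sd = 0.2)  # scalar response

G_sse <- function(y) sum((y - mean(y))^2)         # cost of a node

splitQuality <- function(s, x, y) {
  left  <- y[x <= s]
  right <- y[x >  s]
  # Q = G(parent) - [G(left child) + G(right child)]
  G_sse(y) - (G_sse(left) + G_sse(right))
}

# Move the split point over the predictor's range and keep the best split
candidates <- head(sort(unique(x)), -1)
Q <- sapply(candidates, splitQuality, x = x, y = y)
bestSplit <- candidates[which.max(Q)]             # should fall near x = 0.4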

Regression trees are highly interpretable. Every step of the recursive partitioning procedure can be recorded in the form of a decision tree that puts the whole splitting process into visual perspective. An example of a regression tree produced on an input space spanned by two parameters (\(X_1\) and \(X_2\)) is given in the figure below.

An example of a regression tree. Left: 2D input space. Right: The corresponding regression tree

Recursive splitting can be performed on virtually any type of input parameter. Splitting on continuous input parameters is trivial: the data is simply ordered and the split point is moved from the lowest to the highest point of the parameter's range. Categorical predictors come in two flavors, orderable and unorderable. For orderable categorical predictors the splitting procedure is the same as for continuous ones, while for unorderable predictors one has to consider different combinations of categories, which becomes tedious and numerically demanding for a large number of categories. The current version of fTree allows for orderable predictors only; the implementation of routines for splitting on unorderable categorical parameters is left for a future version. However, at the end of this document we show a procedure for converting high dimensional predictors into an orderable sequence that can be used directly with the current version of the fTree package.

In the fTree package we fit regression trees with the ftree function, which takes the following inputs (a minimal example call is given after the list):

  • .X - a data frame of continuous predictors. The number of rows equals the number of observations, while the number of columns equals the number of predictors.
  • .Y - the output data. The code expects a data frame with functions or vectors stacked in columns. It is required that nrow(.X) == ncol(.Y).
  • .D - a user provided distance matrix (see below for more details).
  • .SIGMA_inv - a user provided inverse covariance matrix. This parameter is needed only for the mahalanobis cost function (see below for more details).
  • cost.type - specifies the cost function to use in the tree fitting procedure. All available cost functions are explained in detail below.
  • tree.type - specifies the desired tree type. Available options are: single, which fits a single tree; bagging, which fits a bootstrapped tree; and randomforest, which fits a random forest.
  • nBoot - specifies the number of bootstrapped samples to draw when tree.type="bagging" or tree.type="randomforest" is used.
  • nP - specifies the number of predictors to randomly consider at each split in the randomforest fitting paradigm.
  • .minSplit - the minimum number of observations that must be contained within a node before a split is attempted.
  • .minBucket - the minimum number of elements in leaf nodes.
  • .cp - a split is accepted if its improvement in quality is at least .cp * goodness, where goodness is the value of the cost function computed on the entire dataset (i.e. before splitting).
  • ArgStep - the argument (time) step used for numerical integration of functional data when the cost function is sse.
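For orientation, here is a minimal, illustrative call combining several of these arguments. It uses the WELL_PARAMETERS and OIL objects and the train indices introduced in the next section, and the stopping-rule values simply restate the defaults reported later by str(fTree.sse).

# An illustrative call; WELL_PARAMETERS, OIL and train are introduced below,
# and .minSplit/.minBucket/.cp restate the default values.
fit <- ftree(.X = WELL_PARAMETERS[train, ], .Y = OIL[, train],
             cost.type = "l2square", tree.type = "single",
             .minSplit = 20, .minBucket = 7, .cp = 0.005)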

What follows is a detailed discussion of the implemented cost functions and the practical aspects of working with each of them. In all our examples we use a shale dataset with 13 predictors and 178 oil and gas production profiles (see the figure below). For more details about the dataset please consult chapter 4 of Grujic (2017).

par(mfrow=c(1,2))
matplot(OIL, type="l", col="blue", lty=1, xlab="Time (days)", ylab="STB/day", main="Oil Rates")
matplot(GAS, type="l", col="red", lty=1, xlab="Time (days)", ylab="MMCF/day", main="Gas Rates")

train = 1:100

Summary statistics of the 13 predictors:

                      min            sd           mean        median          max
GeolX_Rel          865.00  1.929534e+04   6.225931e+04     62301.550     89424.20
GeolY_Rel       -66243.30  1.401062e+04  -1.852100e+04    -17293.100        23.50
ProdVert200          0.00  7.817580e-02   8.382980e-02         0.070         0.29
ProdVert300          0.00  1.558110e-01   2.143085e-01         0.220         0.58
StagesPumped         8.00  7.863530e+00   2.427128e+01        20.000        45.00
ComplProppant  1276781.00  1.344911e+06   3.002629e+06   2781521.500   8410537.00
StimLength        2194.00  1.572431e+03   4.653128e+03      4030.000      9526.00
TotalFluid.bbl   40507.74  4.073533e+04   9.414582e+04     76605.785    259617.81
AmtSlickwater        0.00  3.792931e+04   7.121995e+04     65948.235    199703.87
AmtCrosslink         0.00  1.749231e+04   2.119711e+04     14408.865     71978.74
AmtAcid              0.00  1.229993e+02   4.314894e+01         0.000       738.11
AmtLinear            0.00  3.671645e+03   1.665078e+03       792.955     35233.12
AmtResin             0.00  7.466893e+04   5.730479e+04     24187.500    360878.00
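In case the reader wants to reproduce a summary of this form, one possible sketch is given below. It assumes WELL_PARAMETERS holds one well per row and one predictor per column; whether the table above was computed on the full dataset or on the training subset is not specified here, so the code is purely illustrative.

# A sketch for producing a summary table like the one above.
summaryStats <- t(apply(WELL_PARAMETERS, 2, function(v)
  c(min = min(v), sd = sd(v), mean = mean(v), median = median(v), max = max(v))))
summaryStats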

Functional and Multivariate Regression Trees

Squared L2 Norm as a Cost Function

When the outputs are vectors \(\boldsymbol{y}\in R^n\), the simplest cost function can be formulated as follows:

\[G(m) = \sum_{i=1}^{N_m}\|\boldsymbol{y}_i - \boldsymbol{\mu}_m\|^2\] Where:

  • \(m\) denotes the \(m\)-th node/region,
  • \(N_m\) is the number of outputs contained in node \(m\), and
  • \(\boldsymbol{\mu}_m\) is the mean of the outputs contained in node \(m\).

In the ftree function we refer to this cost function as l2square to indicate that it represents a squared L2 norm. The following code fits a regression tree with the l2square cost function:

fTree.l2squared <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], cost.type = 'l2square', tree.type = 'single')

L2 Norm as a Cost Function

Similarly, we can also use a plain L2 norm as the cost function:

\[G(m) = \sum_{i=1}^{N_m}\|\boldsymbol{y}_i - \boldsymbol{\mu}_m\|\]

In the ftree function we refer to this cost function as l2norm. Note that this cost function requires the computation of square roots and is therefore much slower than the previously considered squared L2 norm; the produced regression trees will, in general, differ as well. The following code fits a tree with the l2norm cost function:

fTree.l2norm <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], cost.type = 'l2norm', tree.type = 'single')

Mahalanobis Cost Function

Another type of cost function for multivariate outputs was proposed by Segal (1992). It computes the sum of squared Mahalanobis distances between the outputs and the region's mean vector:

\[G(m)=\sum_{i=1}^{N_m}(\boldsymbol{y}_i-\boldsymbol{\mu}_m)^t\Sigma_m^{-1}(\boldsymbol{y}_i-\boldsymbol{\mu}_m)\] In the ftree function we refer to this cost function as mahalanobis. In the current version of the code, the mahalanobis cost function is the most computationally expensive of all the considered cost functions. In addition, the user needs to ensure that the empirical covariance matrix \(\Sigma_m\) is positive definite. When the covariance matrix is estimated from the training data, positive definiteness is not guaranteed, and in the case of indefiniteness an approximation has to be used. In the ftree function we use the nearPD function from the Matrix package to replace an indefinite covariance matrix with the nearest positive definite matrix. We also allow for a user provided inverse of the covariance matrix, which can be passed in through the .SIGMA_inv variable.

# By default, ftree computes the covariance matrix internally with nearPD.
fTree.mahalanobis <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], cost.type = 'mahalanobis', tree.type = 'single')

# Or you can compute the inverse covariance matrix externally and pass it in via .SIGMA_inv
library(Matrix)  # for nearPD
SIGMA_inv = as.matrix(solve(nearPD(cov(t(OIL[,train])))$mat))
fTree.mahalanobis <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], .SIGMA_inv = SIGMA_inv, cost.type = 'mahalanobis', tree.type = 'single')

SSE Cost Function

When the output data are curves \(y(t), t \in T\), one can proceed in several ways. The simplest approach is to treat the functions as vectors and employ one of the previously introduced cost functions. The dimensionality of the problem can also be reduced by projecting the functional data onto functional principal components and working with the fPC scores and the l2square or l2norm cost functions. Another approach is to use the sum of the integrated squared residuals as the cost function:

\[G(m)=\sum_{i=1}^{N_m}\int_{T}\big(y_i(t)-\mu_m(t)\big)^2\,dt\]

In the ftree function, this cost function is computed with numerical (trapezoidal) integration using a user provided time step ArgStep. We refer to it as sse. Note that the sse cost function produces results very similar to those of the l2square cost function when applied to the same data.

fTree.sse <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], cost.type = 'sse', tree.type = 'single')
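As an aside, the dimensionality-reduction route mentioned above (growing the tree on principal component scores rather than on the raw curves) can be sketched as follows. Here ordinary prcomp on the discretized curves is used as a simple stand-in for a proper functional PCA, and fTree.fpc is a hypothetical name.

# A rough sketch of the fPC-scores route: project the discretized curves onto
# their leading principal components and grow a tree on the scores.
pca    <- prcomp(t(OIL[, train]))   # one curve per row after transposing
scores <- pca$x[, 1:3]              # keep the first three components
fTree.fpc <- ftree(.X = WELL_PARAMETERS[train, ], .Y = t(scores),
                   cost.type = 'l2square', tree.type = 'single')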

A Similarity Distance-Based Cost Function

An alternative and numerically efficient way of growing regression trees with complex output data is by means of similarity distances. Starting from a matrix of similarity distances \(D_m\) computed between the responses of some region (node) \(m\), we can define the following distance based cost function:

\[G(m) = \sum_{i=1}^{N_m}d_{i,\mathrm{medoid}}\] where \(d_{i,\mathrm{medoid}}\) is the distance of the \(i\)-th output from the most central output of the region \(m\) (the medoid), i.e. the output that minimizes the total distance to all other outputs in the region. In the ftree function we refer to this cost function as rdist. The rdist cost function is the most computationally efficient of all the cost functions outlined in this document: the distances are computed only once, before the tree is grown, and the algorithm then simply subsets the distance matrix at each iteration. This approach is also very flexible, since one can fit and test many different regression trees simply by changing the type of the similarity distance. Another convenience of this cost function is that it allows for computation with compound similarity distance matrices. For example, to fit a tree on multivariate functional data one would compute one distance matrix for each level of the functional data, scale the matrices to [0,1], and sum them up to produce a compound distance matrix. The compound distance matrix is then used with the rdist cost function to fit a regression tree.

There are two ways to grow regression trees with distances in fTree. The first approach is to compute a similarity distance matrix between all responses and pass it to the ftree function via the .D parameter. The second approach is to pass the raw functional or vector data via the .Y variable and set .D to the desired distance type, for example .D='euclidean'. The data contained in .Y and the distance type are then passed to the generic dist function prior to growing the regression tree.

# ftree computes the similarity distance matrix internally with `dist`.
fTree.rdistO <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], .D = "euclidean", cost.type = 'rdist', tree.type = 'single')

# Or you can compute your similarity distance matrix externally and pass it in via the .D parameter
D = as.matrix(dist(t(OIL[,train]), method = "euclidean"))
fTree.rdistO <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], .D = D, cost.type = 'rdist', tree.type = 'single')

If one wants to grow regression trees with multiple outputs then one can simply combine the distance matrices computed on each output type. Here we will demonstrate that approach on the oil and gas curves we introduced previously.

D.oil = as.matrix(dist(t(OIL[,train]), method = "euclidean"))
D.gas = as.matrix(dist(t(GAS[,train]), method = "euclidean"))
D.oil = D.oil/max(D.oil)
D.gas = D.gas/max(D.gas)
fTree.rdistOG <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], .D = D.oil + D.gas, cost.type = 'rdist', tree.type = 'single')

Note that in this mode we still pass the oil rates to the ftree function via the .Y variable. The oil rates are not used for growing the tree in the rdist mode; they are kept only for prediction purposes (more on that later).

The ftree function returns a list of the following elements:

str(fTree.sse, 1)
## List of 8
##  $ covariates: num [1:100, 1:13] 45458 56394 62597 60223 60839 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ functions : num [1:600, 1:100] 362 359 355 352 348 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ costType  : chr "sse"
##  $ treeType  : chr "single"
##  $ minSplit  : num 20
##  $ minBucket : num 7
##  $ cP        : num 0.005
##  $ trees     :List of 1
##  - attr(*, "class")= chr [1:2] "fTree" "list"

The trees slot contains the fitted tree, saved in the form of a nested list. In this particular example the length of the trees slot is one, since the fitted tree.type was single. In the case of bootstrapped trees the length of the trees object equals the value of the nBoot parameter. Each element of the nested list represents one node with the following elements (a small traversal example is given after the list):

str(fTree.sse$trees[[1]], 1)
## List of 11
##  $ indices       : num [1:99, 1] 1 2 3 4 5 6 7 8 9 10 ...
##  $ nodeGoodness  : num 1.41e+08
##  $ isLeaf        : int 0
##  $ nLeaves       : int 9
##  $ depth         : int 0
##  $ midpoint      : num 8
##  $ splitsGoodness: num [1:3, 1:13] 1.35e+08 8.84e+07 4.70e+07 1.22e+08 7.01e+07 ...
##  $ bestPredictor : int 5
##  $ splitPoint    : chr "3485688.000000"
##  $ left          :List of 11
##  $ right         :List of 11

Where:

  • indices is a vector of indices of the data contained within the node.
  • nodeGoodness is the value of the cost function computed on the outputs of the current node.
  • isLeaf is a boolean flag specifying whether the node is a leaf node. This is used in prediction mode.
  • nLeaves is the number of child leaves of the current node.
  • depth is the depth of the node in the recursive partitioning hierarchy.
  • midpoint is a visualization parameter.
  • bestPredictor is the index of the predictor that splits the current node.
  • splitsGoodness is a matrix holding, for each predictor, the best split and its associated quality for the current node. Note that splitPoint and bestPredictor are derived from this matrix, which is also used in the variable importance computations.
  • splitPoint is the value of bestPredictor at which the current node is split.
  • left is a sublist containing the data and nodes that fall on the "left" after splitting the current node on bestPredictor at splitPoint.
  • right is the analogous sublist containing the data and nodes that fall on the right.
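To illustrate how this nested structure can be traversed, here is a minimal sketch (not part of the package). collectLeafIndices is a hypothetical helper, and it assumes that isLeaf == 1 marks terminal nodes.

# A minimal, illustrative traversal of the nested node list: recursively
# collect the data indices stored in every leaf node.
collectLeafIndices <- function(node) {
  if (node$isLeaf == 1) {
    return(list(as.vector(node$indices)))   # leaf: return its indices
  }
  # internal node: recurse into the left and right sub-trees
  c(collectLeafIndices(node$left), collectLeafIndices(node$right))
}

leafIndices <- collectLeafIndices(fTree.sse$trees[[1]])
length(leafIndices)   # should match the nLeaves slot of the root node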

How To Select a Cost Function?

The selection of the cost function is problem dependent. In some settings a simple cost function such as l2square will do a good job, while in others a more complex distance computation (e.g. a Modified Hausdorff distance) has to be performed together with the rdist cost function. The best way is to try different options and assess the results in terms of whether or not the modeling objectives are being met. To ease the selection process we expose two convenience functions that enable easier data mining of the fitted regression trees. The first function takes a node flag as an input and returns the indices of the data contained in the specified node. The returned indices are relative to the original ordering of the data, so the user can plot the data with custom routines; for example, one might plot vectors in low dimensional multidimensional scaling plots (e.g. cmdscale) or simply use matplot on the subsetted functional data.

The second function uses the same node flagging as the first, but its output is a plot that marks the selected node in the tree topology. The combination of these two functions enables easy data mining of the fitted regression tree and an easy visual assessment of the chosen cost function(s).

node = "0ll"
plotFtree(fTree.sse, .labSize = 2, .horizontal = T, .ylimLow = -1.5, .node = node, .round = 2)
indices <- getNodeIndices(fTree.sse$trees[[1]], node)
matplot(OIL[,train], type="l", col="black", lty=1, xlab="Time (days)", ylab="STB/day")
matplot(OIL[,train][,indices$rightInd], type = "l", col="blue", lty=1, add=TRUE)
matplot(OIL[,train][,indices$leftInd], type = "l", col="red", lty=1, add=TRUE)

Bootstrapped Regression Trees

There are two modeling paradigms when it comes to bootstrapped regression trees. The first paradigm bootstraps the training set and fits one regression tree to each bootstrapped sample. This paradigm is referred to as “bagging” both in the literature and in the ftree function (i.e. tree.type='bagging'). The second paradigm adds another layer of randomization: it bootstraps the observations and then, at each split of the tree fitting procedure, considers only a randomly selected subset of nP predictors. This paradigm is commonly referred to as a “random forest”, both in the code and in the literature (i.e. tree.type='randomforest'). Bootstrapped regression trees can be fitted with all of the aforementioned cost functions simply by changing the tree.type parameter. For example:

fTree.rdistO <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], .D = D.oil, cost.type = 'rdist', tree.type = 'randomforest', nBoot = 200)
fTree.rdistOG <- ftree(.X = WELL_PARAMETERS[train,], .Y=OIL[,train], .D = D.oil + D.gas, cost.type = 'rdist', tree.type = 'randomforest')

Variable Importance

In the current version of the code, variable importances can be computed only for the single and bagging tree types; in a future version we plan to implement the permutation scheme for random forests. Surrogate splitting is not performed in the current version; instead, for each node we save the quality of the best split produced by each predictor. The variable importance computation then simply averages, over all nodes, the quality of each predictor's best split. In other words:

\[ S_p = \frac{1}{M}\sum_{m=1}^{M}\frac{1}{N_m}\max_{s}\, Q_{s,p}^{m} \] Where:

  • \(Q_{s,p}^m\) is the quality of split \(s\) of node \(m\) on predictor \(p\),
  • \(M\) is the number of split nodes in the tree and
  • \(N_m\) is the number of training points within node \(m\).

The following code computes variable importance plots for the two distance based trees we fitted earlier.

varSensitivity(fTree.rdistO)
varSensitivity(fTree.rdistOG)

Making Predictions With fTree

To make predictions with the fTree package we use the predictFtree function. The function takes two inputs: a fitted fTree structure and a data frame .Xnew containing the input parameters of the new data. The function returns a list of length nrow(.Xnew). In the case of bootstrapped trees, each element of the list is a data frame whose number of rows equals the nBoot parameter used when building the tree (with ftree).

fTree.rdistO.predict <- predictFtree(fTree.rdistO, .Xnew = WELL_PARAMETERS[-train,])
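Since each list element here holds an ensemble of nBoot predicted curves (fTree.rdistO was refitted above with tree.type='randomforest'), one may want to summarize it. The following is an illustrative sketch; summarizePrediction is a hypothetical helper, not a package function.

# Illustrative post-processing (not part of fTree): summarize each well's
# ensemble of predicted curves by its point-wise 5%, 50% and 95% quantiles.
summarizePrediction <- function(pred) {
  apply(as.matrix(pred), 2, quantile, probs = c(0.05, 0.5, 0.95))
}
oil.bands <- lapply(fTree.rdistO.predict, summarizePrediction)
dim(oil.bands[[1]])   # 3 x number of time steps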

In the case of a multivariate, similarity distance based tree, the current version of fTree cannot forecast both outputs at the same time. Instead we need to employ a small workaround, demonstrated in the following code.

fTree.rdistOG.predictO <- predictFtree(fTree.rdistOG, .Xnew = WELL_PARAMETERS[-train,])
fTree.rdistOG.Gas <- fTree.rdistOG
fTree.rdistOG.Gas$functions <- GAS[,train]
fTree.rdistOG.predictG <- predictFtree(fTree.rdistOG.Gas, .Xnew = WELL_PARAMETERS[-train,])

We can plot the results with the following code:

par(mfrow=c(4,2))
par(mar=c(4,4,1,1)+0.1)
for(i in 1:4){
  matplot(t(fTree.rdistOG.predictO[[i]]), type="l", col="blue", lty=1,
          xlab="Time (days)", ylab="STB/day", ylim=c(0,700))
  lines(OIL[,-train][,i], col="black", lty=1, lwd=2)

  matplot(t(fTree.rdistOG.predictG[[i]]), type="l", col="red", lty=1,
          xlab="Time (days)", ylab="MSCF/day", ylim=c(0,1600))
  lines(GAS[,-train][,i], col="black", lty=1, lwd=2)
}
A few forecasts of oil and gas rates. Each row is one well, left column are oil rates while the right column are gas rates. Colored lines are the predictions, black lines are the true responses

Functional Predictors

The package does not implement routines for growing regression trees with functional predictors; however, the data can be pre-processed so that it is presented to the ftree function as a continuous predictor (a code sketch of this workflow is given after the list below). The procedure we propose for such a data transformation is as follows.

  1. Compute similarity distances between the observations of the complex predictor (e.g. with the dist function).
  2. Cluster the functional or complex predictor data with hierarchical clustering (e.g. with the hclust function).
  3. Use the Bar-Joseph leaf-ordering algorithm to re-order the leaves of the hierarchical clustering tree.
  4. Present the resulting ordering of the complex predictor as a continuous predictor to the ftree function.
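The following is a minimal sketch of this workflow (not part of the package). It assumes the functional predictor is stored column-wise in a matrix FUNC_PRED (a hypothetical name) and uses the seriation package, whose "OLO" method performs Bar-Joseph style optimal leaf ordering; any equivalent leaf-ordering implementation can be substituted.

# A sketch of the ordering workflow above; FUNC_PRED is a hypothetical matrix
# holding one functional predictor observation per column.
library(seriation)

d   <- dist(t(FUNC_PRED))            # step 1: similarity distances
olo <- seriate(d, method = "OLO")    # steps 2-3: hclust + optimal leaf ordering
perm <- get_order(olo)               # permutation of the observations

# step 4: the rank of each observation in the reordered dendrogram becomes
# an ordinary continuous predictor
orderedPredictor <- order(perm)
# myX$funcPredOrder <- orderedPredictor   # then append to the .X data frame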

The following is a visual representation of the procedure.

An example of ordering of complex predictors. A - Raw unordered functional predictor data. B - Low dimensional (MDS) representation of the data colored by original (old) ordering. C - Hierarchical clustering performed on the raw data. D - Leaf reordered hierarchical clustering dendrogram. E - A low dimensional representation (MDS) colored by the new ordering of the data. F - A plot of the original data colored by the new ordering. (from Grujic (2017))

Writing Custom Cost Functions (Under Development)