The ‘fdagstat’ package

The fdagstat package is an expansion of the well-known gstat package by Edzer Pebesma. The package implements the recently developed trace-based methods of functional geostatistics. In particular, it implements the ordinary trace-kriging method by Giraldo et al (2011), the universal trace-kriging method by Menafoglio et al (2013), and the universal trace-co-kriging method by Grujic et al (2017). Several applications were kept in mind during the development of the package:

  1. Spatial or low-dimensional interpolation of functional data, as in Giraldo et al (2011) and Menafoglio et al (2013)
  2. High-dimensional interpolation of univariate and multivariate functional data. One example of this application is functional meta-modeling or multi-fidelity meta-modeling, as outlined in Grujic et al (2017)
  3. Mixed interpolation by means of functional regression kriging, where the drift term depends on several covariates and the functional residual is assumed to be spatially correlated.

In addition to the implemented trace-based routines, the package is also capable of scalar interpolation. In particular, it implements universal and ordinary kriging and co-kriging methodologies (Chiles and Delfiner (1999)). Therefore, the package can also be used for functional principal component (fpc) based methods of functional geostatistics, such as universal co-kriging of fpc scores by Menafoglio et al (2016) and universal co-kriging of fpc scores of multivariate functional data by Bohorquez et al (2016).

All routines contained in the package were developed with efficient practical and scientific usage in mind. For example, our predict routine always exports the kriging weights, which was not the case in the original gstat implementation. Additionally, we use basic data structures such as lists, matrices, and vectors for easy access to, and modification of, all modeling parameters.

In this document we demonstrate the capabilities of the package by constructing two very basic functional meta-models. The functional data used in this tutorial were generated with a subsurface flow simulator (SLB Eclipse) by varying three input parameters. A detailed description of the subsurface model can be found in Grujic et al (2017).

The data set used and analyzed in this tutorial is freely available and it is provided within the package (data("3ParamFWPR")).

Universal Trace Kriging (UTrK)

In this tutorial we construct a functional meta-model with the universal trace-kriging methodology by Menafoglio et al (2013). We start with a brief exploration of the 3ParamFWPR dataset provided within the fdagstat package. In the figure below, the left panel shows the “input” space spanned by the parameters that were varied as part of the numerical simulation experiment. The right panel shows all of the available functional data, originally produced by the reservoir simulator and pre-processed through a simple data transformation (for details see Grujic et al (2017)).

library(fdagstat)
data("3ParamFWPR")
Left: Input space spanned by three parameters. Right: Functional output data

The input space was generated with random Latin hypercube sampling; hence the input parameters are uncorrelated. In this analysis the parameters are treated as fixed, non-random quantities. As part of the pre-processing, all input parameters were scaled to [0,1], following common practice in meta-modeling (see Sacks et al (1989)).
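
For completeness, a minimal sketch of such a [0,1] rescaling is shown below. This is illustrative only: the DOE object shipped with the package is already scaled, and scale01 is a hypothetical helper, not part of fdagstat.

# Illustrative min-max rescaling of each input parameter to [0, 1]
# (the packaged DOE is already scaled, so this step is shown for reference only).
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
DOE01   <- as.data.frame(lapply(DOE, scale01))
summary(DOE01)  # every column now ranges between 0 and 1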

Next we split the dataset into training and testing subsets, and proceed to fit a universal trace-kriging model, following Menafoglio et al (2013).

set.seed(0)
train <- sample(nrow(DOE), floor(0.7*nrow(DOE)), replace = FALSE)

Initialization

The first step with the fdagstat package is to create an fstat data structure, which is a simple R list. This is accomplished with the fstat function. This function is analogous to the gstat function from the gstat package.

g <- fstat(NULL, vName = "FWPR", Coordinates = DOE[train,], Functions = Functions[,train], scalar = FALSE)

Here, the first input to the fstat function is NULL since we are initializing a new fstat data structure. The second parameter is the variable name. The assigned variable name is used by all subroutines when processing the data; for example, the titles in variogram plots correspond to the variable name assigned during the fstat initialization phase. The third parameter is a data frame containing the coordinates of the data, with each row corresponding to a data point. If covariates are to be considered in the drift linear model, they should be included as additional columns of this same data frame. The fourth parameter is a data frame containing the evaluations of the functional data on an equispaced grid of arbitrary step: each column corresponds to a data point, and each row to a value of the abscissa along which the data are evaluated. Hence, the number of rows in the Coordinates object must equal the number of columns in the Functions object. The fifth parameter, scalar, is a boolean that specifies whether the response is scalar or functional. By default it is set to FALSE, meaning a functional response variable, as in this case. When set to TRUE the code expects a column vector as the fourth parameter, and all subsequently called geostatistical routines work in scalar (conventional) geostatistics mode.
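
As a quick sanity check before initialization (an illustrative snippet, not part of the package), one can verify this dimensional requirement explicitly:

# Each row of the coordinates must correspond to one column of the functions.
stopifnot(nrow(DOE[train, ]) == ncol(Functions[, train]))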

Drift Estimation

Next we estimate the functional drift and compute the functional residuals.

g <- estimateDrift("~.", g, Intercept = TRUE)

The first input to the estimateDrift function is a formula object. The formula specifies which columns of the data frame Coordinates (previously provided to the fstat function) should be used in the drift computation; those coordinates are included as regressors in a function-on-scalar regression model, see Menafoglio et al (2013). In this particular case we chose to work with all coordinates. In other modeling workflows, such as regression kriging, one might consider only a subset of coordinates as covariates for the drift. The second parameter is an existing fstat data structure to which the drift should be added. The Intercept argument indicates whether to fit the regression with or without an intercept. Note that, by default, estimateDrift performs the estimation by ordinary least squares (OLS); for a generalized least squares (GLS) estimate, see below.
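
As an illustration of such a regression-kriging style drift, the formula could be restricted to a subset of the coordinate columns. This is a sketch only: the choice of the first two columns is arbitrary, and the formula is built from whatever names the DOE data frame actually has.

# Illustrative: drift built from the first two coordinate columns only.
# Adapt the selected columns to the covariates relevant for your model.
subsetFormula <- paste("~", paste(names(DOE)[1:2], collapse = " + "))
gSubset <- estimateDrift(subsetFormula, g, Intercept = TRUE)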

Trace Variogram Estimation

Our next step is to compute the trace variogram on the functional residuals. This is accomplished with the fvariogram function as follows

g <- fvariogram("~.", g, Nlags = 100, LagMax = 1, ArgStep = 1, useResidual = TRUE, comments=FALSE)

The first input parameter is a formula specifying which columns of the Coordinates object should be used to compute the distances entering the variogram computation. For our meta-model, the formula for the variogram computation is the same as the formula used for the drift. The second parameter is the fstat structure containing the data and the drift object. Note that in stationary cases one can skip drift modeling and proceed directly to variogram computation with the fvariogram function; in that case the code simply computes the variograms on the raw data. The third parameter, Nlags, specifies the number of spatial lags to be used in the variogram computation. The fourth parameter, LagMax, is the maximum distance between pairs of data points used when computing the empirical trace-variogram. The fifth parameter, ArgStep, specifies the sampling step of the Functions object provided upon fstat initialization; this parameter is important because the code uses the trapezoidal rule to compute the integrals appearing in the trace-variogram. Finally, the sixth parameter specifies whether to compute the variogram on the residuals or on the data themselves. In the stationary case this choice does not influence the analysis; in the presence of non-stationarity, however, the trace-variogram estimated directly from the data may be severely biased.

In this example, we are suppressing the comments with the comments=FALSE flag.
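
As mentioned above, in a stationary setting one may skip drift estimation altogether. A minimal sketch of that workflow, computing the trace-variogram directly on the raw data under the assumption that no drift is needed, would be:

# Stationary sketch (illustrative): empirical trace-variogram on the raw data,
# with no drift estimated, by switching off the residual-based computation.
gStat <- fstat(NULL, vName = "FWPR", Coordinates = DOE[train, ],
               Functions = Functions[, train], scalar = FALSE)
gStat <- fvariogram("~.", gStat, Nlags = 100, LagMax = 1, ArgStep = 1,
                    useResidual = FALSE, comments = FALSE)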

Once the variogram computation is complete we can visualize the variogram with the following function:

plotVariogram(g)
Empirical trace variogram

Note that the plotVariogram function uses ggplot2 routines for plotting. The function can also directly incorporate a pre-defined ggplot2 theme, or return a ggplot2 object that can be further modified with ggplot2 commands. For more information, see help(plotVariogram) and the few examples below.

Black

print(plotVariogram(g, ggReturn = TRUE) + scale_colour_manual(values=c("black"))) 

Orange Black

print(plotVariogram(g, ggReturn = TRUE) + scale_colour_manual(values=c("black")) +  theme(strip.background = element_rect("orange")))

Change Titles

print(plotVariogram(g, ggReturn = TRUE) + scale_colour_manual(values=c("black")) +  theme(strip.background = element_rect("orange")) + ggtitle("New Title") + xlab('New X') + ylab('New y'))

Point Size

print(plotVariogram(g, npSize = TRUE, ggReturn = TRUE) + scale_colour_manual(values=c("black")) +  theme(strip.background = element_rect("orange")) + ggtitle("New Title") + xlab('New X') + ylab('New y'))

The empirical variogram has a clear quadratic behavior at 0. Therefore, we proceed to fit a Gaussian model with the fitVariograms function.

g <- fitVariograms(g, model=vgm(2e+09, "Gau", 0.75, 0))
plotVariogram(g)
Fitted variogram

The fitVariograms function is similar to the previously introduced functions, in the sense that it takes an existing fstat structure and adds a variogram model to it. In the backend, fitVariograms uses gstat’s variogram-fitting capabilities; therefore, the second parameter of fitVariograms is produced by the vgm function of the gstat package. The vgm function takes at minimum three inputs: the partial sill, the model type, and the range; a nugget can be supplied as a fourth argument. Note that more complex (i.e. nested) variograms can also be specified with vgm by adding another vgm object through the add.to parameter; see help(vgm) for more information.
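
For instance, a nested model could be specified as follows. This is an illustrative sketch: the sills, ranges, and structure types below are arbitrary and not tuned to these data.

# Illustrative nested model: a Gaussian structure added on top of an
# exponential structure with a nugget (all values are arbitrary).
nestedModel <- vgm(1e+09, "Gau", 0.75, add.to = vgm(5e+08, "Exp", 0.25, nugget = 0))
gNested <- fitVariograms(g, model = nestedModel)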

Finally, we proceed to use the fitted model to predict the functions of the test set. To this end, we first need to compute the covariance matrix between the training points. This can be achieved with the addCovariance function:

g <- addCovariance(g, type = 'omni')

This will add a covariance slot to the fstat structure:

str(g, 1)
## List of 6
##  $ data       :List of 1
##  $ variogram  :Classes 'gstatVariogram' and 'data.frame':    99 obs. of  7 variables:
##  $ model      :List of 1
##  $ drift      :List of 1
##   ..- attr(*, "type")= chr "OLS"
##   ..- attr(*, "formula")=List of 1
##  $ KrigFormula: chr "~."
##  $ covariance :List of 1
##   ..- attr(*, "covType")= chr "omni"
##  - attr(*, "class")= chr "fstat"

Note that the modeling workflow by Menafoglio et al (2013) estimates the drift with generalized least squares (GLS), through an iterative algorithm, and the package can accomplish this. Prior to fitting a GLS model one needs an initial estimate of the covariance matrix, which can be obtained as in the previous steps: first the drift is estimated via ordinary least squares (estimateDrift function), then the trace-variogram is estimated empirically from the residuals (fvariogram function), a valid model is fitted (fitVariograms function), and a covariance matrix is added (addCovariance function). To obtain the GLS estimate of the drift (and the corresponding residuals), one then calls estimateDrift one more time with .type set to GLS. In this case the parameters of the drift model are estimated by generalized least squares, based on the current estimate of the covariance matrix. Given the new drift estimate, the variogram modeling is repeated as before on the updated residuals.

g <- estimateDrift("~.", g, .type = "GLS", Intercept = TRUE)
g <- fvariogram("~.", g, Nlags = 100, LagMax = 1, ArgStep = 1, useResidual = TRUE, comments=FALSE)
g <- addCovariance(g, 'omni')

Note that Menafoglio et al (2013) recommend iterating the procedure a few times. The current version of the package does not automate this; to obtain an improved GLS estimate, the above code should be iterated a few times. Moreover, depending on the magnitude of the covariances and of the coordinates, one may need to rescale the covariance structure prior to running the addCovariance function. A possible way of doing this is to rescale the sills of the fitted variogram object.

g$model$omni$FWPR$psill <- g$model$omni$FWPR$psill/sum(g$model$omni$FWPR$psill)
g <- addCovariance(g)

Note that the kriging predictor is invariant under this kind of rescaling.
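
Putting the pieces together, a minimal sketch of iterating the GLS update a few times is shown below. The number of iterations is arbitrary, and one may also re-fit the variogram model (via fitVariograms) inside the loop.

# Illustrative iteration of the GLS drift / trace-variogram / covariance update.
for (iter in 1:3) {
  g <- estimateDrift("~.", g, .type = "GLS", Intercept = TRUE)
  g <- fvariogram("~.", g, Nlags = 100, LagMax = 1, ArgStep = 1,
                  useResidual = TRUE, comments = FALSE)
  g <- addCovariance(g, 'omni')
}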

Forecasting New Functions

To predict new data we use the predictFstat function:

forecasts <- predictFstat(g, .newCoordinates = DOE[-train, ], .what = "FWPR", .type = "UK")

The first input to the function is the fstat structure containing the model and the pre-computed covariance structure. The second input is a data frame with the coordinates of the target points; the values at the target points of any covariates must be provided as columns of the .newCoordinates argument, and the variables inside .newCoordinates must be the same as those provided in the Coordinates object upon fstat initialization. The third input specifies which variable to predict: the name of the target response must match one of the variable names provided during the fstat initialization phase. Finally, the fourth argument specifies which type of kriging to perform. Possible options for univariate functional data are OK and UK, specifying ordinary trace-kriging by Giraldo et al (2011) and universal trace-kriging by Menafoglio et al (2013), respectively. The object returned by predictFstat is a list with the following slots:

str(forecasts, 1)
## List of 5
##  $ C       : num [1:323, 1:323] 1.75e+09 4.69e+08 7.95e+07 2.38e+08 5.82e+06 ...
##   ..- attr(*, "sill")= num 1.75e+09
##  $ c       : num [1:323, 1:139] 8.21e+08 7.73e+08 9.38e+07 1.62e+07 3.09e+06 ...
##  $ Variance: num [1:139] 54792251 60290402 48610804 41949969 45166585 ...
##  $ Forecast: num [1:750, 1:139] -27275 -27169 -27063 -26958 -26852 ...
##  $ Weights : num [1:323, 1:139] -0.044 -0.02381 -0.00466 -0.00709 0.00184 ...

We plot the forecasted functions with the following code:

matplot(forecasts$Forecast, type = "l", col="blue", xlab="Time(days)", ylab="FWPR (stb/day)")
matplot(Functions[,-train], type="l", col="red", add = TRUE)
legend('bottomright', c("True", "Forecasted"), col=c("red","blue"), lty=1, lwd=1)
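
As a quick quantitative check, one can compute an overall root-mean-square error on the test set. This is an illustrative snippet, assuming the columns of forecasts$Forecast are ordered as the rows of DOE[-train, ], i.e. as the columns of Functions[, -train].

# Illustrative accuracy measure: RMSE over all abscissa values and test functions.
testErrors <- forecasts$Forecast - as.matrix(Functions[, -train])
sqrt(mean(testErrors^2))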

Universal Co-Kriging of Functional Principal Component Scores (UCoK)

An alternative approach to universal trace-kriging was outlined in Menafoglio et al (2016), as a generalization of the stationary approach by Nerini et al (2010). The method starts by representing the functional data through a multivariate vector of scores, obtained via functional principal component analysis. It then builds a universal co-kriging predictor on these scores and forecasts the functional data at unobserved locations or design points through their predicted scores. The fdagstat package can fit such models, as we demonstrate next.

Functional Data Analysis

We start by performing functional principal component analysis of the training set. Here we choose to work with the first two functional principal components, since together they explain more than 95% of the variance in the data.

library(fda)
basis <- create.bspline.basis(c(1,750), 30, 4)
expandedFunctions <- smooth.basis(1:750, as.matrix(Functions[,train]), basis)
expandedFunctions.pca <- pca.fd(expandedFunctions$fd, 2)
MEAN <- eval.fd(1:750, expandedFunctions.pca$meanfd)
FPCS <- eval.fd(1:750, expandedFunctions.pca$harmonics)
FPCA Results
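
As a quick check of the variance claim above, the proportion of variance captured by the retained components can be read from the varprop slot returned by pca.fd:

# Proportion of total variance explained by the two retained components.
sum(expandedFunctions.pca$varprop)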

Note that one may also resort to purely numerical routines to perform functional principal component analysis, without projecting the data onto a functional basis as we did here. In the following, for illustrative purposes, we work with standardized functional principal component scores. This is mainly because the Linear Model of Coregionalization (LMC, Goovaerts (1993)) is often easier to fit when all variables are expressed on the same scale.

sd1 <- sd(expandedFunctions.pca$scores[,1])
mn1 <- mean( expandedFunctions.pca$scores[,1])
sd2 <- sd(expandedFunctions.pca$scores[,2])
mn2 <- mean( expandedFunctions.pca$scores[,2])
  
FWPR_pc1 <- (expandedFunctions.pca$scores[,1] - mn1)/sd1
FWPR_pc2 <- (expandedFunctions.pca$scores[,2] - mn2)/sd2

In this case, to recover the original scale, one needs to back-scale the predicted scores before building the functional prediction, as we show below.

Initialization and Variogram Fitting

The initialization of the fstat object is performed in the same way as before, except for the argument scalar, which now needs to be set to scalar=TRUE to indicate that we are working with scalar data.

g <- fstat(NULL, 'FWPR_pc1', DOE[train,], as.data.frame(FWPR_pc1), scalar=TRUE)
g <- fstat(g, 'FWPR_pc2', DOE[train,], as.data.frame(FWPR_pc2), scalar=TRUE)

The rest of the modeling methodology is analogous to the universal trace kriging workflow presented earlier:

g <- estimateDrift("~.", g, Intercept = TRUE)
g <- fvariogram("~.", g, Nlags = 100, LagMax = 1, 
                useResidual = TRUE, comments=FALSE)
plotVariogram(g)
Variograms of the fpc scores

Note that in this case we have three variograms: two omni-directional auto-variograms (FWPR_pc1 and FWPR_pc2), and one cross-variogram (FWPR_pc1.FWPR_pc2). To fit the three variograms we use the linear model of coregionalization (LMC, Goovaerts (1993)). In the gstat package this is accomplished with the fit.lmc function, which is called in the backend of fitVariograms. Since fit.lmc allows for different ranges in the variograms, which violates the assumptions of the linear model of coregionalization, here we set fitRanges to FALSE to fit a numerically valid model. Hence, the ranges are not optimized but kept fixed at their initial values.

g <- fitVariograms(g, model = vgm(0.3, "Gau", 0.35, 0), forceNugget = TRUE, fitRanges = FALSE)
plotVariogram(g)
Fitted variogram models

g <- addCovariance(g, 'omni')

Future versions of the package will contain improved routines for LMC fitting, allowing more advanced and more flexible multivariate model fitting.

Forecasting New Functions

Forecasting is analogous to the universal trace kriging example shown before.

forecast.pc1 <- predictFstat(g, .newCoordinates = DOE[-train,], .what = "FWPR_pc1", .type = "UcoK", algIndependent = TRUE)
forecast.pc2 <- predictFstat(g, .newCoordinates = DOE[-train,], .what = "FWPR_pc2", .type = "UcoK", algIndependent = TRUE)

Note that in this case the predictFstat function is called twice, since both scores need to be predicted in order to reconstruct the functions. Moreover, we are using universal co-kriging (.type = "UcoK") with algebraically independent drifts (algIndependent = TRUE). Readers who are not familiar with the different types of co-kriging can refer to Chiles and Delfiner (1999).

Finally, we use the forecasted principal component scores to construct the forecasts of unobserved functions. The following code achieves this, and plots the results.

# Bring back to the original scale:
forecast.pc1$Forecast <- (forecast.pc1$Forecast * sd1) + mn1
forecast.pc2$Forecast <- (forecast.pc2$Forecast * sd2) + mn2

# Construct functions:
forecast <- NULL
for(i in 1:length(forecast.pc1$Forecast)){
  forecast <- rbind(forecast, t(MEAN + FPCS %*% c(forecast.pc1$Forecast[i], forecast.pc2$Forecast[i])))
}
matplot(t(forecast), type="l", col="blue", xlab="Time (days)", ylab="FWPR (bbl/day)")
matplot(Functions[,-train], type = "l", col="red", add=TRUE)
legend('bottomright', c("True", "Forecasted"), lty=1, col=c("red", "blue"))
UCoK forecasts
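
As a side note, the reconstruction loop above can also be written in matrix form. This is an illustrative alternative, assuming forecast.pc1$Forecast and forecast.pc2$Forecast hold the back-scaled predicted scores as numeric vectors (or one-column matrices).

# Illustrative matrix-form reconstruction, equivalent to the loop above.
scores      <- rbind(as.numeric(forecast.pc1$Forecast),
                     as.numeric(forecast.pc2$Forecast))    # 2 x nTest
forecastMat <- as.numeric(MEAN) + FPCS %*% scores          # 750 x nTest
all.equal(unname(forecastMat), unname(t(forecast)))        # should report TRUE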