The fdagstat package is an extension of the well-known gstat package by Edzer Pebesma. It implements the recently developed trace-based methods of functional geostatistics. In particular, it implements the ordinary trace-kriging method by Giraldo et al (2011), the universal trace-kriging method by Menafoglio et al (2013), and the universal trace-co-kriging method by Grujic et al (2017). Several applications were kept in mind during the development of the package.
In addition to the implemented trace-based routines, the package is also capable of scalar interpolation. In particular, it implements universal and ordinary kriging and co-kriging methodologies (Chiles and Delfiner (1999)). Therefore, the package can also fit functional principal component (fpc) based methods of functional geostatistics, such as universal co-kriging of fpc scores (Menafoglio et al (2016)) and universal co-kriging of fpc scores of multivariate functional data (Bohorquez et al (2016)).
All routines in the package were developed for efficient practical and scientific use. For example, our predict routine always exports the kriging weights, which was not the case in the original gstat implementation. Additionally, we use basic data structures such as lists, matrices, and vectors for easy access and even easier modification of all modeling parameters.
In this document we demonstrate the capabilities of the package by constructing two very basic functional meta-models. The functional data used in this tutorial were generated with a subsurface flow simulator (SLB Eclipse) by varying three input parameters. A detailed description of the subsurface model can be found in Grujic et al (2017).
The data set used and analyzed in this tutorial is freely available and is provided within the package (data("3ParamFWPR")).
In this tutorial we construct a functional meta-model with the universal trace-kriging methodology by Menafoglio et al (2013). We start with a brief exploration of the 3ParamFWPR dataset provided within the fdagstat package. In the figure below, we plot the “input” space spanned by the parameters that were varied as part of the numerical simulation experiment. The right panel of the figure represents all of the available functional data, originally produced by the reservoir simulator and pre-processed through a simple data transformation (for details see Grujic et al (2017)).
library(fdagstat)
data("3ParamFWPR")
Left: Input space spanned by three parameters. Right: Functional output data
The input space was generated with random Latin hypercube sampling, hence the input parameters are uncorrelated. In this analysis, the parameters are considered fixed, non-random objects. As part of the pre-processing, all input parameters were scaled to [0,1], following common practice in meta-modeling (see Sacks et al (1989)).
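As an illustration, the [0,1] scaling described above can be sketched in base R. The data frame and its column names below are made up for the example; the DOE object shipped with the package is already scaled.

```r
# Min-max scaling of each input parameter to [0, 1]; the data frame
# and its columns are illustrative, not the package's DOE object.
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))

toyDOE <- data.frame(perm = runif(10, 50, 500),
                     poro = runif(10, 0.05, 0.35))
toyDOE01 <- as.data.frame(lapply(toyDOE, scale01))
sapply(toyDOE01, range)   # every column now spans exactly [0, 1]
```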
Next we split the dataset into training and testing subsets, and proceed to fit a universal trace-kriging model, following Menafoglio et al (2013).
set.seed(0)
train <- sample(nrow(DOE), floor(0.7*nrow(DOE)), replace = FALSE)
The first step with the fdagstat package is to create an fstat data structure, which is a simple R list. This is accomplished with the fstat function, which is analogous to the gstat function from the gstat package.
g <- fstat(NULL, vName = "FWPR", Coordinates = DOE[train,], Functions = Functions[,train], scalar = FALSE)
Here, the first input to the fstat function is NULL since we are initializing a new fstat data structure. The second parameter is the variable name. The assigned variable name is used in all subroutines when processing the data; for example, the titles in variogram plots correspond to the variable name assigned during the fstat initialization phase. The third parameter is a data frame that contains the coordinates of the data, each row corresponding to a data point. If covariates are to be considered in the linear drift model, they should be included as additional columns of the same data frame of coordinates. The fourth parameter is a data frame containing the evaluations of the functional data on an equispaced grid of arbitrary step. Here, each column of the data frame corresponds to a data point, and each row to a value of the abscissa along which the data are evaluated. Hence, the number of rows in the Coordinates object must equal the number of columns in the Functions object. The fifth parameter, scalar, is a boolean which specifies the type of the fourth parameter. By default it is set to FALSE, meaning a functional response variable, as in this case. When set to TRUE, the code expects a column vector for the fourth parameter, and all subsequently called geostatistical routines work in scalar (conventional) geostatistics mode.
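The dimension convention can be sketched with toy objects (the names and data below are made up; only the commented fstat call mirrors the one above):

```r
# Each row of Coordinates is a data point; each column of Functions is
# the corresponding curve evaluated on an equispaced grid.
n  <- 5                                   # number of data points
tt <- seq(0, 1, length.out = 100)         # evaluation grid
toyCoords <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
toyFuns   <- as.data.frame(sapply(1:n, function(i) i * sin(2 * pi * tt)))
stopifnot(nrow(toyCoords) == ncol(toyFuns))   # required by fstat
# g <- fstat(NULL, vName = "toy", Coordinates = toyCoords,
#            Functions = toyFuns, scalar = FALSE)
```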
Next we estimate the functional drift and compute the functional residuals.
g <- estimateDrift("~.", g, Intercept = TRUE)
The first input to the estimateDrift function is a formula object. The formula specifies which columns of the data frame Coordinates (previously provided to the fstat function) should be used in the drift computation. Those coordinates are included as regressors in a function-on-scalar regression model, see Menafoglio et al (2013). In this particular case, we chose to work with all coordinates; in other modeling workflows, such as regression kriging, one might consider only a subset of coordinates as covariates for the drift. The second parameter is an existing fstat data structure to which the drift should be added. The fourth parameter, Intercept, indicates whether to fit the regression with or without an intercept. Note that, by default, estimateDrift performs the estimation by ordinary least squares (OLS); for a generalized least squares (GLS) estimate, see below.
Our next step is to compute the trace variogram of the functional residuals. This is accomplished with the fvariogram function, as follows:
g <- fvariogram("~.", g, Nlags = 100, LagMax = 1, ArgStep = 1, useResidual = TRUE, comments=FALSE)
The first input parameter is a formula specifying which columns of the Coordinates object should be used to compute the distances in the variogram computation. For our meta-model, we use the same formula for the variogram computation as for the drift computation. The second parameter is the fstat structure that contains the data and the drift object. Note that in stationary cases one can skip drift modeling and proceed directly to variogram computation with the fvariogram function; in that case the code simply computes the variograms on the raw data. The third parameter, Nlags, specifies the number of spatial lags to be used in the variogram computation. The fourth parameter, LagMax, is the maximum distance among pairs of data up to which the empirical trace variogram is computed. The fifth parameter, ArgStep, specifies the sampling step of the Functions object provided upon fstat initialization; this parameter is important because the code uses the trapezoidal rule to compute the integrals in the trace variogram computation. Finally, the sixth parameter specifies whether to compute the variogram from the residuals or from the data themselves. Note that, in the case of stationarity, this choice does not influence the analysis; in the presence of non-stationarity, however, the trace variogram estimated from the data may be severely biased. In this example, we suppress the console output with the comments=FALSE flag.
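To make the role of ArgStep concrete, here is a minimal sketch (not the package's internal code) of the trapezoidal rule applied to the integral inside the empirical trace variogram, i.e. the integrated squared difference between two curves:

```r
# Trapezoidal approximation of the integrated squared difference
# between two curves sampled on an equispaced grid with step argStep.
trapzSqDiff <- function(f1, f2, argStep) {
  d2 <- (f1 - f2)^2
  argStep * (sum(d2) - 0.5 * (d2[1] + d2[length(d2)]))
}

tt <- seq(0, 1, by = 0.01)
f1 <- sin(2 * pi * tt)
f2 <- sin(2 * pi * tt) + 1       # constant vertical shift of 1
trapzSqDiff(f1, f2, argStep = 0.01)   # exactly 1 on [0, 1]
```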
Once the variogram computation is complete we can visualize the variogram with the following function:
plotVariogram(g)
Empirical trace variogram
Note that the plotVariogram function uses the ggplot2 package for plotting. The function can also directly incorporate a pre-defined ggplot2 theme, or return a ggplot2 object that can be further modified with ggplot2 commands. For more information see the help(plotVariogram) page, and see below for a few examples.
print(plotVariogram(g, ggReturn = TRUE) + scale_colour_manual(values=c("black")))
print(plotVariogram(g, ggReturn = TRUE) + scale_colour_manual(values=c("black")) + theme(strip.background = element_rect("orange")))
print(plotVariogram(g, ggReturn = TRUE) + scale_colour_manual(values=c("black")) + theme(strip.background = element_rect("orange")) + ggtitle("New Title") + xlab('New X') + ylab('New y'))
print(plotVariogram(g, npSize = TRUE, ggReturn = TRUE) + scale_colour_manual(values=c("black")) + theme(strip.background = element_rect("orange")) + ggtitle("New Title") + xlab('New X') + ylab('New y'))
The empirical variogram has a clear quadratic behavior near 0. Therefore, we proceed to fit a Gaussian model with the fitVariograms function.
g <- fitVariograms(g, model=vgm(2e+09, "Gau", 0.75, 0))
plotVariogram(g)
Fitted variogram
The fitVariograms function is very similar to the previously introduced functions, in the sense that it takes an existing fstat structure and adds a variogram model to it. In the backend, fitVariograms uses gstat's capabilities for variogram fitting. Therefore, the second parameter to fitVariograms is produced by the vgm function of the gstat package. The vgm function takes four inputs: the variogram sill, the variogram type, the variogram range, and the nugget. Note that more complex (i.e. nested) variograms can also be specified with the vgm function by adding another vgm object via the add.to parameter; see help(vgm) for more info.
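For example, a nested Gaussian-plus-spherical model could be specified as follows (the sill, range, and nugget values here are illustrative only):

```r
library(gstat)

# A nested model: a Gaussian structure added on top of a spherical
# structure with a nugget; all parameter values are made up.
nestedModel <- vgm(1.0e+09, "Gau", 0.75,
                   add.to = vgm(5.0e+08, "Sph", 0.25, 1.0e+07))
```

Passing nestedModel as the model argument of fitVariograms would then fit both structures jointly.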
Finally, we proceed to use the fitted model to predict the functions of the test set. To this end, we first need to compute the covariance matrix between the training points. This is achieved with the addCovariance function:
g <- addCovariance(g, type = 'omni')
This adds a covariance slot to the fstat structure:
str(g, 1)
## List of 6
## $ data :List of 1
## $ variogram :Classes 'gstatVariogram' and 'data.frame': 99 obs. of 7 variables:
## $ model :List of 1
## $ drift :List of 1
## ..- attr(*, "type")= chr "OLS"
## ..- attr(*, "formula")=List of 1
## $ KrigFormula: chr "~."
## $ covariance :List of 1
## ..- attr(*, "covType")= chr "omni"
## - attr(*, "class")= chr "fstat"
Note that the modeling workflow of Menafoglio et al (2013) considers drifts estimated with generalized least squares (GLS), through an iterative algorithm. The package has the capability to accomplish this. Before describing the procedure, it is important to note that prior to fitting a GLS model one needs an initial estimate of the covariance matrix. This can be obtained as in the previous steps: first the drift is estimated via ordinary least squares (the estimateDrift function), second the trace variogram is estimated empirically from the residuals (the fvariogram function), then a valid model is fitted (the fitVariograms function) and a covariance matrix is added (the addCovariance function). To obtain the GLS estimate of the residuals, one is left to call the estimateDrift function one more time with .type set to "GLS". In this case, the parameters of the model are estimated by generalized least squares, based on the current estimate of the covariance matrix. Given the new estimate of the drift, the variogram modeling is repeated as before on the updated residuals.
g <- estimateDrift("~.", g, .type = "GLS", Intercept = TRUE)
g <- fvariogram("~.", g, Nlags = 100, LagMax = 1, ArgStep = 1, useResidual = TRUE, comments=FALSE)
g <- addCovariance(g, 'omni')
Note that Menafoglio et al (2013) recommend iterating the procedure a few times. The current version of the fdagstat package does not automate this; to obtain an improved GLS estimate, the above code should be iterated a few times. Moreover, depending on the magnitude of the covariances and of the coordinates, one may need to rescale the covariance structure prior to running the addCovariance function. A possible way of rescaling the covariance is to rescale the sills of the fitted variogram object.
g$model$omni$FWPR_pc1$psill <- g$model$omni$FWPR_pc1$psill/sum(g$model$omni$FWPR_pc1$psill)
g <- addCovariance(g)
Note that kriging is invariant to this kind of rescaling.
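Putting the pieces together, the recommended iteration can be sketched as follows (a sketch only: the number of iterations and the variogram starting values are illustrative, and g is the fstat object built in the previous steps):

```r
# Alternate GLS drift estimation and variogram re-fitting a few times,
# re-using the functions introduced earlier in this tutorial.
for (it in 1:3) {
  g <- estimateDrift("~.", g, .type = "GLS", Intercept = TRUE)
  g <- fvariogram("~.", g, Nlags = 100, LagMax = 1, ArgStep = 1,
                  useResidual = TRUE, comments = FALSE)
  g <- fitVariograms(g, model = vgm(2e+09, "Gau", 0.75, 0))
  g <- addCovariance(g, 'omni')
}
```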
To predict new data we use the predictFstat
function:
forecasts <- predictFstat(g, .newCoordinates = DOE[-train, ], .what = "FWPR", .type = "UK")
The first input to the function is the fstat
structure that contains the model and the pre-computed covariance structure. The second input is a data frame with the coordinates of the target points. Note that the value at target points of possible covariates should be provided as columns in the .newCoordinates
argument. The variables inside the provided .newCooordinates
object must be the same as the ones provided inside the .Coordinates
object upon the fstat
structure initialization. The third input parameter specifies which variable to predict. The name of the target response must match one of the variable names provided during the fstat
structure initialization phase. Finally, the fourth argument specifies which type of kriging the user wants to perform. Possible options for single variate functional data are OK
or UK
specifying ordinary trace-kriging by Giraldo et al (2011), and universal trace kriging by Menafoglio et al (2013), respectively. The object returned by the predictFstat
function is a list with the following slots:
str(forecasts, 1)
## List of 5
## $ C : num [1:323, 1:323] 1.75e+09 4.69e+08 7.95e+07 2.38e+08 5.82e+06 ...
## ..- attr(*, "sill")= num 1.75e+09
## $ c : num [1:323, 1:139] 8.21e+08 7.73e+08 9.38e+07 1.62e+07 3.09e+06 ...
## $ Variance: num [1:139] 54792251 60290402 48610804 41949969 45166585 ...
## $ Forecast: num [1:750, 1:139] -27275 -27169 -27063 -26958 -26852 ...
## $ Weights : num [1:323, 1:139] -0.044 -0.02381 -0.00466 -0.00709 0.00184 ...
We plot the forecasted functions with the following code:
matplot(forecasts$Forecast, type = "l", col="blue", xlab="Time (days)", ylab="FWPR (stb/day)")
matplot(Functions[,-train], type="l", col="red", add = TRUE)
legend('bottomright', c("True", "Forecasted"), col=c("red","blue"), lty=1, lwd=1)
An alternative approach to universal trace kriging was outlined in Menafoglio et al (2016), as a generalization of the stationary approach by Nerini et al (2010). The method starts by representing the functional data through a multivariate vector (i.e., scores), by using functional principal component analysis. The method then proceeds to build a universal co-kriging predictor on these scores and to forecast functional data at unobserved locations or design points via their predicted scores. The fdagstat package has the capability of fitting such models, as we demonstrate next.
We start by performing functional principal component analysis of the training set. Here we choose to work with the first two functional principal components, since they describe more than 95% of the variance in the data.
library(fda)
basis <- create.bspline.basis(c(1,750), 30, 4)
expandedFunctions <- smooth.basis(1:750, as.matrix(Functions[,train]), basis)
expandedFunctions.pca <- pca.fd(expandedFunctions$fd, 2)
MEAN <- eval.fd(1:750, expandedFunctions.pca$meanfd)
FPCS <- eval.fd(1:750, expandedFunctions.pca$harmonics)
FPCA Results
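The claim that two components are enough can be checked from the varprop slot returned by pca.fd (a quick check, assuming the objects created above):

```r
# Cumulative proportion of variance explained by the retained
# functional principal components; it should exceed 0.95 here.
cumsum(expandedFunctions.pca$varprop)
```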
Note that one may also resort to numerical routines to perform functional principal component analysis, without projecting over a functional basis as we did here. In the following, for illustrative purposes, we work with standardized functional principal component scores. This is mainly because the Linear Model of Coregionalization (LMC, Goovaerts (1993)) is often easier to fit when all variables are expressed on the same scale.
sd1 <- sd(expandedFunctions.pca$scores[,1])
mn1 <- mean( expandedFunctions.pca$scores[,1])
sd2 <- sd(expandedFunctions.pca$scores[,2])
mn2 <- mean( expandedFunctions.pca$scores[,2])
FPWR_pc1 <- (expandedFunctions.pca$scores[,1] - mn1)/sd1
FPWR_pc2 <- (expandedFunctions.pca$scores[,2] - mn2)/sd2
In this case, to recover the original scale, one needs to back-scale the predictions before building the functional forecast, as we show below.
The initialization of the fstat object is performed in the same way as before, except for the argument scalar, which now needs to be set to TRUE to indicate that we are working with scalar data.
g <- fstat(NULL, 'FWPR_pc1', DOE[train,], as.data.frame(FPWR_pc1), scalar=TRUE)
g <- fstat(g, 'FWPR_pc2', DOE[train,], as.data.frame(FPWR_pc2), scalar=TRUE)
The rest of the modeling methodology is analogous to the universal trace kriging workflow presented earlier:
g <- estimateDrift("~.", g, Intercept = TRUE)
g <- fvariogram("~.", g, Nlags = 100, LagMax = 1,
useResidual = TRUE, comments=FALSE)
plotVariogram(g)
Variograms of the fpc scores
Note that in this case we have three variograms: two omni-directional auto-variograms (FWPR_pc1 and FWPR_pc2), and one cross-variogram (FWPR_pc1.FWPR_pc2). To fit the three variograms we use the linear model of coregionalization (LMC, Goovaerts (1993)). In the gstat package this is accomplished with the fit.lmc function, which is called in the backend of fitVariograms. Since the fit.lmc function allows for variable ranges in the variograms, which violates the assumptions of the linear model of coregionalization, here we set fitRanges to FALSE to fit a numerically valid model. Hence, the ranges are not optimized, but kept fixed at their initial values.
g <- fitVariograms(g, model = vgm(0.3, "Gau", 0.35, 0), forceNugget = TRUE, fitRanges = FALSE)
plotVariogram(g)
Fitted variogram models
g <- addCovariance(g, 'omni')
Future versions of the package will contain improved routines for LMC fitting, performing more advanced and more flexible multivariate model fitting.
Forecasting is analogous to the universal trace kriging example shown before.
forecast.pc1 <- predictFstat(g, .newCoordinates = DOE[-train,], .what = "FWPR_pc1", .type = "UcoK", algIndependent = TRUE)
forecast.pc2 <- predictFstat(g, .newCoordinates = DOE[-train,], .what = "FWPR_pc2", .type = "UcoK", algIndependent = TRUE)
Note that in this case the predictFstat function is called twice, since both scores need to be predicted to reconstruct the function. Moreover, we use universal co-kriging (.type="UcoK") with algebraically independent drifts (algIndependent = TRUE). Readers who are not familiar with the different types of co-kriging can refer to Chiles and Delfiner (1999).
Finally, we use the forecasted principal component scores to construct the forecasts of unobserved functions. The following code achieves this, and plots the results.
# Bring back to the original scale:
forecast.pc1$Forecast <- (forecast.pc1$Forecast * sd1) + mn1
forecast.pc2$Forecast <- (forecast.pc2$Forecast * sd2) + mn2
# Construct functions:
forecast <- NULL
for(i in 1:length(forecast.pc1$Forecast)){
forecast <- rbind(forecast, t(MEAN + FPCS %*% c(forecast.pc1$Forecast[i], forecast.pc2$Forecast[i])))
}
matplot(t(forecast), type="l", col="blue", xlab="Time (days)", ylab="FWPR (bbl/day)")
matplot(Functions[,-train], type = "l", col="red", add=TRUE)
legend('bottomright', c("True", "Forecasted"), lty=1, col=c("red", "blue"))
UCoK forecasts
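As a side note, the reconstruction loop above can be replaced by a single matrix expression. The toy check below (with made-up dimensions and names, so as not to clash with the tutorial objects) verifies that the two forms agree:

```r
# Toy mean curve (m x 1), components (m x 2) and scores (n x 2).
set.seed(1)
m <- 10; n <- 3
toyMEAN  <- matrix(sin(seq(0, 1, length.out = m)), ncol = 1)
toyFPCS  <- matrix(rnorm(2 * m), nrow = m, ncol = 2)
toyScore <- matrix(rnorm(2 * n), nrow = n, ncol = 2)

# Loop form, as in the tutorial:
loopRec <- NULL
for (i in 1:n)
  loopRec <- rbind(loopRec, t(toyMEAN + toyFPCS %*% toyScore[i, ]))

# Vectorized form: the mean vector recycles down the columns of the
# m x n matrix of score-weighted components.
vecRec <- t(as.vector(toyMEAN) + toyFPCS %*% t(toyScore))

stopifnot(isTRUE(all.equal(loopRec, vecRec)))
```

The vectorized form avoids growing a matrix inside a loop, which becomes noticeable for large test sets.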