1 Introduction

The document contains the practical for the course `Co-data learning: the group-regularized ridge and elastic net’ on 13 April 2018, given during the COSTNET Networking Biostatistics meeting and training school in Milan from 12-13 April 2018. The requirements for the course are basic knowledge of R and statistics. The course consists of two part: theory and practice. The theoretical part in is in the form of a lecture, while the practical part is in the form of an R tutorial. This document contains the R tutorial.

The course is concerned with two R packages: GRridge (Wiel et al. 2016) and gren. This document contains the tutorial part for the gren package. This package enables the user to enhance classification performance and variable selection of the logistic elastic net through the inclusion of external information. The information comes in the form of a partitioning of the features into non-overlapping group. Example of such co-data are:

  1. Response-independent summaries in the primary data (e.g. standard deviation).
  2. Feature groupings from prior knowledge (e.g. pathways from public data bases).
  3. Feature-specific summaries from an independent study (e.g. p-values).

In this practical we will analyse two datasets using the gren package to demonstrate the possibilities.

2 Getting started

For this practical, we advise you to create an R script in the same folder as this file (and the other files for the practical). To run the code in this document simply copy and paste it to this R script and execute it from there.

A development version of the gren package is available online. Installation from Github is as follows:

## the devtools package is required for installation from Github
if(!("devtools" %in% installed.packages()[, 1])) {
  install.packages("devtools")
}
library(devtools)

## check if the correct version of Rcpp is installed
if(!("Rcpp" %in% installed.packages()[, 1])) {
  install.packages("Rcpp")
} else if(package_version(packageDescription("Rcpp")$Version) < 
          package_version("0.12.14")) {
  install.packages("Rcpp")
}

## install gren from github
install_github("magnusmunch/gren")
library(gren)

## the Hmisc package is used in the second data example
if(!("Hmisc" %in% installed.packages()[, 1])) {
  install.packages("Hmisc")
}
library(Hmisc)

Note that we need version at least version 0.12.14 of Rcpp, so we check whether this is installed (or a newer version).

It might be useful to clear the global environment before starting

rm(list = ls())

Lastly, make sure that the data files are in the working directory. The easiest way to do this when working in Rstudio is to choose Session > Set Working Directory > To Source File Location and make sure that the files are in the same folder as your script.

3 Example 1: microRNAs in colon cancer

3.1 Data

In this part, we will be working with a dataset containing deep sequenced microRNAs for a cohort of treated colorectal cancer patients. The data is read into the global environment with the following piece of code:

## load the data
load("gren_data_mir_colon.Rdata")

MicroRNAs are small non-coding strands of RNA that are believed to be important in epigenetic functioning. The microRNAs are contained in the mirCol object in the form of a matrix. The rows of the matrix contain the samples, while the columns contain the microRNAs. Run

dim(mirCol)

to find the dimensions of the object containing the microRNAs. Note that gren requires an \(n \times p\) input matrix, compared to \(p \times n\) in GRridge, with \(n\) and \(p\) the number of samples and features, respectively.

EXERCISE How many microRNAs do we have? And what is the number of subjects?

The aim of the analysis is to classify treatment response, coded as either non-progressive/remission or progressive. The responses are given in the respCol object in the form of a factor.

head(respCol)
table(respCol)

EXERCISE What is the number of non-progressive/remission patients? And the number of progressive patients?

In addition there are four clinical covariates that we may include in the analysis. They are contained in the data.frame named unpenCol. the first covariate denotes prior use of adjuvent therapy, a binary variable named adjth. The second is the type of systemic treatment in the form of a ternary variable called thscheme. The third is the numeric vector age and the fourth is the binary variable that codes for primary tumour differentiation called pcrcdiff.

head(unpenCol)

## we check distribution of the clinical variables
apply(unpenCol[, c(1, 2, 4)], 2, table)
hist(unpenCol$age, main="Distribution of ages", xlab="Age")

EXERCISE How many clinical covariates are there?

EXERCISE Do we want to penalise these clinical covariates? Why/why not?

Lastly, there is a factor named mirExpr, which contains the group-memberships of the microRNAs. There are three groups: non differentially expressed microRNAs named nonExpr, medium differentially expressed microRNAS (0.001 \(<\) FDR \(\leq\) 0.05) named medExpr, and highly differentially expressed microRNAs (FDR \(<\) 0.001) called highExpr.These groups are based on a preliminary experiment in a different group of patients, where metastatic colorectal tumour tissue and primary colorectal tissue were compared to normal non-colorectal tissue and primary colorectal tissue, respectively (Neerincx et al. 2015). Run the following code to inspect the partitioning:

head(mirExpr)
table(mirExpr)

EXERCISE What are the sizes of the different groups of microRNAs?

We expect that incorporation of this partitioning enhances therapy response classification, because tumor-specific miRNAs might be more relevant than non-specific ones.

EXERCISE What group do you expect to receive the largest/smallest penalty?

3.2 Estimation

After inspection we standardise the microRNA expression levels. Note that the microRNA expression levels have already been pre-processed somewhat, so they are not counts anymore. But since we standardize them anyway, it doesn’t really matter.

mirColScaled <- scale(mirCol)

Now that we have looked at the data and standardised the expression levels, we can estimate a group-regularized elastic net model using the gren function. gren takes several arguments, the most important ones are x, y, m, unpenalized, and partitions. These arguments constitute the data used for the estimation. For an explanation of these arguments and others use ?gren.

By default, gren estimates a binary logistic regression model. If the outcome data is not binary, but a sum of binary experiments, we may create a numeric vector m that contains the number of Bernoulli trials for each observation. Since our outcome here is binary, we do not specify it here.

In order to estimate the model, we have to specify a penalty mixing parameter alpha. This parameter determines the proportion of \(L_1\) and \(L_2\)-norm penalty. alpha=0 is a ridge model, while alpha=1 is the lasso model. We will start with the default value alpha=0.5. If lambda=NULL (default) it is estimated by cross-validation of the regular elastic net model. Furthermore we will include an unpenalized intercept by setting intercept=TRUE (default).

EXERCISE Can you think of any reason why don’t penalise the intercept?

The following code fits a group-regularized elastic net model (we set a seed, to make the results reproducible). Note that it may take two or three minutes, depending on your computer.

set.seed(1)
## estimate the group-regularized elastic net model
fitGrenCol <- gren(x=mirColScaled, y=respCol, unpenalized=unpenCol, 
                   partitions=list(expression=mirExpr), alpha=0.5)

3.3 Interpreting results

Next we look at the results. The estimated penalty multipliers may be retrieved from the object and plotted. We add a dotted line at one, to indicate the not group-regularized setting:

barplot(fitGrenCol$lambdag$expression, main="Estimated penalty multipliers",
        ylab=expression(paste(lambda[g], "'")),
        names.arg=c("Not expressed", "Medium differentially \n expressed",
                    "Highly differentially \n expressed"))
abline(h=1, lty=2)

EXERCISE Why does the dotted line correspond to a regular (not group-regularized) model?

EXERCISE Do the estimated penalty multipliers follow the expected pattern?

It might also be interesting to compare the model parameter estimates by the group-regularized regression with the regular elastic net estimates:

## extract the estimates from te fitted object
estRegular <- coef(fitGrenCol$freq.model$regular, s=fitGrenCol$lambda)[, 1]
estGroupreg <- coef(fitGrenCol$freq.model$groupreg, s=fitGrenCol$lambda)[, 1]

## plot the estimates in a scatterplot
plot(estRegular, estGroupreg, col=mirExpr, xlab=expression(hat(beta)[regular]),
     ylab=expression(hat(beta)[group]), 
     main="Regular vs group-regularized estimates")
legend("topleft", pch=1, col=c(1:3),
       legend=c("Not expressed", "Medium differentially expressed",
                "Highly differentially expressed"))

names(which(estRegular!=0))
names(which(estGroupreg!=0))

intersect(names(estRegular)[which(estRegular!=0)[-c(1:6)]],
          names(estGroupreg)[which(estGroupreg!=0)[-c(1:6)]])

EXERCISE Which model selects more microRNAs, the group-regularized model or the regular model? What is the overlap between the two?

To reduce the costs of clinical predictions, it is often desirable to obtain a sparse final model. In addition, under equal model sizes, the comparison of two models is easier to interpret. Feature selection is possible by setting psel to the desired number of features and is done by adjusting the global lambda after the penalty multipliers have been estimated. Here, we select 50 features. Again, this will likely take two or three minutes.

set.seed(2)
fitGrenColSel <- gren(x=mirColScaled, y=respCol, unpenalized=unpenCol,
                      partitions=list(expression=mirExpr), alpha=0.5, psel=50)

## extract the estimates from the model object
estRegularSel <- coef(fitGrenColSel$freq.model$regular, 
                      s=fitGrenColSel$lambda)[, 1]
estGroupregSel <- coef(fitGrenColSel$freq.model$groupreg, 
                       s=fitGrenColSel$lambda)[, 1]
## plot the estimates in a scatterplot
plot(estRegularSel, estGroupregSel, col=mirExpr, 
     xlab=expression(hat(beta)[regular]), ylab=expression(hat(beta)[group]), 
     main="Regular vs group-regularized estimates, with model size 50")
legend("topleft", pch=1, col=c(1:3),
       legend=c("Not expressed", "Medium differentially expressed",
                "Highly differentially expressed"))

names(which(estRegularSel!=0))
names(which(estGroupregSel!=0))

intersect(names(estRegularSel)[which(estRegularSel!=0)[-c(1:6)]],
          names(estGroupregSel)[which(estGroupregSel!=0)[-c(1:6)]])

EXERCISE How did the results change compared to before? Is there more or less overlap between the selected microRNAs?

3.4 Predictive performance

Since logistic regression is a classification method, we may compare classification performance of the regular and group-regularized elastic net. To avoid overestimation of performance we would like to use an indepenent data set to calculate the performance measures on. The simplest way of accomplishing this is by randomly splitting the data set in two parts, one for estimation and one for classification performance evaluation. But since we have a limited number of patients, splitting them in half would leave us with a very small dataset for estimation. With such a small data set we run the risk of bad estimators for the model. To avoid this we use cross-validation to estimate performance. Using cross-validation, we will still slightly overestimate performance, but since we are comparing two methods, we expect that this cancels out in the comparison.

The gren package contains a convenience function for performance cross-validation: cv.gren. This function is generally faster than naive cross-validation. An important extra argument is fix.lambda. If it is set to TRUE, the global lambda is fixed throughout the cross-validation iterations, which saves time, but slightly increases overestimation of performance. We compare with the regular elastic net by setting compare=TRUE. To make the comparison fair, we select the same number of features using psel=50.

If time is limited run the following line to load the estimated model:

## if the cross-validation takes too long, run the following line
load("gren_models_mir_colon.Rdata")

Otherwise the following code estimates the performance by cross-validation (may take six or seven minutes to run):

set.seed(3)
fitCvgrenCol <- cv.gren(x=mirColScaled, y=respCol, alpha=0.5, 
                        partitions=list(expression=mirExpr),
                        unpenalized=unpenCol, psel=50, nfolds.out=10)

Due to time constraints, we used 10-fold cross-validation to estimate performance by setting nfolds.out=10. It is of course possible to use more folds by changing this value.

To assess predictive performance we compare the ROCs of the group-regularized elastic net with the regular elastic net.

## comparing ROCs for group-regularized and regular elastic net
par(pty="s")
plot(pROC::roc(respCol, as.numeric(fitCvgrenCol$groupreg$pred)), print.auc=TRUE, 
     print.auc.x=0.3, print.auc.y=0.2)
plot(pROC::roc(respCol, as.numeric(fitCvgrenCol$regular$pred)), add=TRUE, col=2, 
     print.auc=TRUE, print.auc.x=0.3, print.auc.y=0.1)
legend("topleft", legend=c("Group-regularized", "Regular"), col=c(1:2), lty=1)
par(pty="m")

EXERCISE Do you think that the inclusion of the co-data helped with predictive performance here?

It might also be informative to plot the (cross-validated) predicted probabilites of the two models:

plot(as.numeric(fitCvgrenCol$regular$pred), 
     as.numeric(fitCvgrenCol$groupreg$pred), col=respCol, 
     main="Predicted probabilities for Colon cancer data",
     xlab="Predictions regular model", 
     ylab="Predictions group-regularized model")
legend("topleft", legend=c("progressive", "non-progressive/remission"), 
       col=c(1:2), pch=1)

par(mfrow=c(1, 2))
hist(as.numeric(fitCvgrenCol$groupreg$pred), main="Group-regularized",
     xlab="Predicted probabilities")
hist(as.numeric(fitCvgrenCol$regular$pred), main="Not group-regularized",
     xlab="Predicted probabilities")
par(mfrow=c(1, 1))

EXERCISE In what way does the inclusion of the co-data change the predicted probabilities?

4 Example 2: microRNAs in cervical cancer

4.1 Data

In this study microRNAs from women were collected through self-samples. These women were either cervical cancer disease free or with high grade cervical lesions (CIN2/3). These lesions have a high risk for progression to cancer and hence need accurate detection. The data consists of (transformed) sequenced microRNAs. We first load the data:

## load the data and inspect the different data object
load("gren_data_mir_cervical.Rdata")

The mircoRNAs are given as a matrix object named mirCerv. The dependent variable is in the form of a factor called respCerv. In addition, there is the mirCons object, which contains the conservation status of the microRNAs. The conservation status comes in three levels: not conserved, conserved across most mammals, and broadly conserved across most vertebrates. We may include conservation status as co-data, since we expect that conserved microRNAs are more important for body functioning and consequently might be important in CIN2/3 development.

4.2 Two partitions

We may create additional co-data by examining the microRNAs themselves. Here we create a co-data set based on the sample standard deviations of the microRNAs:

## create 10 groups of standard deviations and 10 groups of abundances
std <- apply(mirCerv, 2, sd)
mirStd <- cut2(std, g=5)

EXERCISE What do you think the relation between the standard deviation and penalty parameter of the microRNAs should be?

We may include this co-data in the estimation by extending the partitions argument:

## create a partitions objected that contains the different co-data sets
partitions <- list(conservation=mirCons, std=mirStd)

We may also specify an extra argument monotone, a list containing two logical vectors. The first indicates the whether the corresponding partitions penalty multipliers are monotone, while the second indicates whether the corresponding parititions penalty multipliers are monotonically decreasing with group number or not

EXERCISE Would you impose a monotonicity constraint on the standard deviations partition? And if so, in what direction would the monotonicity be?

To include a monotonicity constraint create the following object:

## create a partitions objected that contains the different co-data sets
monotone <- list(c(FALSE, TRUE), c(TRUE, TRUE))

To estimate the model with the monotonicity constraint, alpha=0.5, and with the selection of 50 features, run the following chunk of code (this may take one or two minutes):

## prepare the data and estimate a model
mirCervScaled <- scale(mirCerv)

set.seed(4)
fitGrenCervSel <- gren(x=mirCervScaled, y=respCerv, partitions=partitions, 
                       alpha=0.5, monotone=monotone, psel=50)

Again, we may inspect the penalty multipliers. This time we have two sets, so we create two barplots:

par(mfrow=c(1, 2))
barplot(fitGrenCervSel$lambdag$conservation, 
        main="Penalty multipliers conservation", 
        ylab=expression(paste(lambda[g], "'")),
        names.arg=c("Not conserved", "Conserved in mammals",
                    "Broadly conserved \n across vertebrates"))
barplot(fitGrenCervSel$lambdag$std, 
        main="Penalty multipliers standard deviation", 
        ylab=expression(paste(lambda[g], "'")),
        names.arg=levels(mirStd))
par(mfrow=c(1, 1))

EXERCISE Are the penalty multipliers of the conservation status according to expectation? And what about the standard deviation penalty multipliers?

Since we now have two partitions, it is not straightforward to see what the total penalty multiplication for each feature is. We calculate and inspect the total penalty multiplication per feature by:

## create a vector with the penalty multipliers per feature in it
multvec <- fitGrenCervSel$lambdag$conservation[
  match(mirCons, names(fitGrenCervSel$lambdag$conservation))]*
  fitGrenCervSel$lambdag$std[match(mirStd, names(fitGrenCervSel$lambdag$std))]

## and create a histogram of the multipliers
hist(multvec, main="Total penalty multiplier per feature", xlab="Penalty multiplier")

EXERCISE Can you discern the different groups from the two partitions from the histogram? Why/why not?

Again, we may use cross validation to estimate predictive performance of the model. If time is limited run the following line to load the estimated model:

## if the cross-validation takes too long, run the following line
load("gren_models_mir_cervical.Rdata")

Otherwise the following code estimates the performance by cross-validation (takes about twelve minutes to estimate):

set.seed(5)
fitCvgrenCerv <- cv.gren(x=mirCervScaled, y=respCerv, alpha=0.5, 
                         partitions=partitions, monotone=monotone, psel=50, 
                         nfolds.out=10)

Again, we look at ROC plots to compare the performance of the group-regularized elastic net with the non group-regularized elastic net:

## comparing ROCs for group-regularized and regular elastic net
par(pty="s")
plot(pROC::roc(respCerv, as.numeric(fitCvgrenCerv$groupreg$pred)), 
     print.auc=TRUE, print.auc.x=0.3, print.auc.y=0.2)
plot(pROC::roc(respCerv, as.numeric(fitCvgrenCerv$regular$pred)), add=TRUE, 
     col=2, print.auc=TRUE, print.auc.x=0.3, print.auc.y=0.1)
legend("topleft", legend=c("Group-regularized", "Regular"), col=c(1:2), lty=1)
par(pty="m")

EXERCISE Do you think that the inclusion of the co-data helped with predictive performance here?

It might also be informative to plot the (cross-validated) predicted probabilites of the two models:

plot(as.numeric(fitCvgrenCerv$regular$pred), 
     as.numeric(fitCvgrenCerv$groupreg$pred), col=respCerv, 
     main="Predicted probabilities for Cervical cancer data",
     xlab="Predictions regular model", 
     ylab="Predictions group-regularized model")
legend("topleft", legend=c("CIN3", "normal"), 
       col=c(1:2), pch=1)

par(mfrow=c(1, 2))
hist(as.numeric(fitCvgrenCerv$groupreg$pred), main="Group-regularized",
     xlab="Predicted probabilities")
hist(as.numeric(fitCvgrenCerv$regular$pred), main="Not group-regularized",
     xlab="Predicted probabilities")
par(mfrow=c(1, 1))

EXERCISE In what way does the inclusion of the co-data change the predicted probabilities?

References

Neerincx, M., D. L. S. Sie, M. A. van de Wiel, N. C. T. van Grieken, J. D. Burggraaf, H. Dekker, P. P. Eijk, et al. 2015. “MiR Expression Profiles of Paired Primary Colorectal Cancer and Metastases by Next-Generation Sequencing.” Oncogenesis 4 (10): e170. doi:10.1038/oncsis.2015.29.

Wiel, M. A. van de, T. G. Lien, W. Verlaat, W. N. van Wieringen, and S. M. Wilting. 2016. “Better Prediction by Use of Co-Data: Adaptive Group-Regularized Ridge Regression.” Statistics in Medicine 35 (3): 368–81. doi:10.1002/sim.6732.