Introduction

This is a preliminary draft of the vignette for the randomForestExplainer package. I use the package to explore a random forest built on data concerning glioblastoma brain cancer, generated by the TCGA Research Network. The forest is meant to predict whether a patient survived one year after diagnosis using data on the expression of various genes. This is a typical example of a problem with a large number of variables and a small number of observations, as the data set contains many more genes than patients.

We will use the following packages:

library(randomForest)
library(randomForestExplainer)

Data

The data come from The Cancer Genome Atlas Project and contain observations for 125 patients who have been diagnosed with glioblastoma. There are 16118 columns in the data: death1y (1 if the patient died within a year of diagnosis, 0 otherwise), Cluster (the subtype of the tumor), age (age in years at the time of diagnosis) and 16115 columns containing numerical measurements of gene expression (column names are those of the corresponding genes). Our aim is to build a random forest that predicts death1y and to single out the genes whose expression drives the prediction.

Before we proceed, we have to clarify the term gene expression in order to understand what we are actually modelling. The definition of a gene depends on context; in our analysis a gene is a region of the DNA. During the cell cycle, different regions of the DNA are transcribed with different speed and efficiency. We treat the amount of transcripts related to a given region as a random variable, and the gene expression measured in a single experiment is a realization of that variable.

Moreover, we need to resolve the issue of missing data: in order to use all of our 125 observations we simply discard the 1989 out of 16117 predictors that contain missing values, which still leaves a plethora of variables for our analysis.

load("GlioblastomaWide.rda")
GlioblastomaWide <- GlioblastomaWide[, -1]
GlioblastomaWide$death1y <- as.factor(GlioblastomaWide$death1y)
remove <- colSums(is.na(GlioblastomaWide))
GlioblastomaWide <- GlioblastomaWide[, remove == 0]
rm(remove)
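
As a quick sanity check, not part of the original analysis, we can confirm that all 125 observations survived the cleaning and inspect the balance of the response:

# dimensions of the cleaned data: 125 rows should remain
dim(GlioblastomaWide)
# class balance of the binary response
table(GlioblastomaWide$death1y)
# should be FALSE after discarding incomplete columns
anyNA(GlioblastomaWide)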

Random forest

As our next step we use the randomForest::randomForest function to train a forest of 10000 trees, with option localImp = TRUE and \(r=\lfloor\sqrt{14128}\rfloor=118\) candidate variables for each split (the default value of mtry for classification). The reason for growing such a big forest is that we are interested in distinguishing important predictors, and in that case the more trees in a forest the better: more trees create more opportunities for good variables to be chosen for splitting (i.e., increase the sample on which statistics such as mean minimal depth are calculated). As growing the forest below takes a long time, I load a saved copy from disk instead of rebuilding it:

# set.seed(2017)
# forest <- randomForest(death1y ~ ., data = GlioblastomaWide, ntree = 10000, localImp = TRUE)
# save(forest, file = "GlioblastomaWide_forest.rda")
load("GlioblastomaWide_forest.rda")

A large number of trees matters here because, with thousands of variables, the probability of any single variable being considered for a split of a given node is low, and we cannot say anything about the importance of a variable that was never even considered for a split; the forest therefore needs to be big enough to give each variable a reasonable number of such instances. Moreover, as we will analyze various measures of variable importance, we set localImp = TRUE so that local importance measures are stored in the forest.
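
As a preview of the analysis that follows, such a forest can be passed directly to randomForestExplainer; a minimal sketch using two of the package's core functions (plotting the top ten variables is an arbitrary choice):

# distribution of minimal depth of every variable over all trees
min_depth_frame <- min_depth_distribution(forest)
plot_min_depth_distribution(min_depth_frame, k = 10)
# a data frame of various importance measures for each variable
importance_frame <- measure_importance(forest)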

plot(forest, main = "Learning curve of the forest")
legend("topright", c("error for 'dead'", "misclassification error", "error for 'alive'"),
       lty = c(1, 1, 1), col = c("green", "black", "red"))