--- title: "Introduction to Solving Biological Problems Using R - Day 2" author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' output: html_notebook: toc: yes toc_float: yes --- #4. Programming in R ## Motivation From the previous exercise, you should see how we can easily adapt our markdown scripts: - e.g. ESR1 versus GATA3 - But what if we want to analyse many genes? - It would be tedious to create a new markdown document for every gene - ...and prone to error too ##Introducing loops - Many programming languages have ways of doing the same thing many times, perhaps changing some variable each time. This is called **looping** - Loops are not used in R so often, because we can usually achieve the same thing using vector calculations - For example, to add two vectors together, we do not need to add each pair of elements one by one, we can just add the vectors ```{r} x <- 1:10 y <- 11:20 x+y ``` - But there are some situations where R functions can not take vectors as input. For example, `t.test()` will only test one gene at a time - What if we wanted to test multiple genes? For completeness, we can re-run the R code to import the data ```{r} geneAnnotation <- read.delim("gene.description.txt",stringsAsFactors = FALSE) patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE) normalizedValues <- read.delim("gene.expression.txt") ``` - We could run the following code to perform t-tests on the first two genes ```{r eval=FALSE} t.test(as.numeric(normalizedValues[1,]) ~ factor(patientMetadata$er)) t.test(as.numeric(normalizedValues[2,]) ~ factor(patientMetadata$er)) ``` - But for many genes this will be boring to type, difficult to change, and prone to error - As we are doing the same thing multiple times, but with a different index each time, we can use a **loop** instead ##Loops: Commands and flow control - R has two basic types of loop + a **`for`** loop: run some code on every value in a vector + a **`while`** loop: run some code while some condition is true (*hardly ever used!*) `for` ```{r eval=FALSE} for(i in 1:10) { print(i) } ``` `while` ```{r eval=FALSE} i <- 1 while(i <= 10 ) { print(i) i <- i + 1 } ``` - Here's how we might use a `for` loop to test the first 10 genes ```{r} for(i in 1:10) { t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er)) } ``` - This is *exactly* the same as: ```{r eval=FALSE} i <- 1 t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er)) i <- 2 t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er)) i <- 3 ###....etc....#### ``` ## Storing results However, this for loop is doing the calculations but not storing the results - The output of `t.test()` is an object with data placed in different slots + the `names()` of the object tells us what data we can retrieve, and what variable name to use ```{r} testResult <- t.test(as.numeric(normalizedValues[1,]) ~ factor(patientMetadata$er)) names(testResult) testResult$statistic ``` - When using a loop, we often create an empty "dummy" variable - This is used store the results at each stage of the loop ```{r} stats <- NULL for(i in 1:10) { testResult <- t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er)) stats[i] <- testResult$statistic } stats ``` ## Practical application Previously we have identified probes on chromosome 8 - Lets say that we want to do a t-test for each gene on chromosome 8 ```{r} chr8Genes <- geneAnnotation[geneAnnotation$Chromosome=="chr8",] head(chr8Genes) chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),] head(chr8GenesOrd) ``` - The first step is to extract the expression values for chromosome 8 genes from our expression matrix, which has expression values for all genes - We can use the `match` function to tell us which rows in the matrix correspond to chromosome 8 genes ```{r} match(chr8GenesOrd$probe, rownames(normalizedValues)) chr8Expression <- normalizedValues[match(chr8GenesOrd$probe, rownames(normalizedValues)),] dim(chr8Expression) ``` We are now ready to write the for loop ## Exercise: - Create a for loop to perform to test if the expression level of each gene on chromosome 8 is significantly different between ER positive and negative samples - Store the ***p-value*** from each individual test - How many genes have a p-value < 0.05? - N.B. Our code will be more robust if we store the number of chromosome 8 genes as a variable + if the data change, the code should still run ```{r} chr8Genes <- geneAnnotation[geneAnnotation$Chromosome=="chr8",] chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),] chr8Expression <- normalizedValues[match(chr8GenesOrd$probe, rownames(normalizedValues)),] ### Your Answer Here ### ``` ##Conditional branching: Commands and flow control - Use an `if` statement for any kind of condition testing - Different outcomes can be selected based on a condition within brackets ``` if (condition) { ... do this ... } else { ... do something else ... } ``` - `condition` is any logical value, and can contain multiple conditions. + e.g. `(a == 2 & b < 5)`, this is a compound conditional argument - The condition should return a *single* value of `TRUE` or `FALSE` ## Other conditional tests - There are various tests that can check the type of data stored in a variable + these tend to be called **`is...()`**. + try *tab-complete* on `is.` ```{r} is.numeric(10) is.numeric("TEN") is.character(10) ``` - `is.na()` is useful for seeing if an `NA` value is found + cannot use `== NA`! - could be used to check if a gene symbol is found in the data before proceeding with statistical test ```{r} match("foo", geneAnnotation$HUGO.gene.symbol) is.na(match("foo", geneAnnotation$HUGO.gene.symbol)) ``` - Using the **`for`** loop we wrote before, we could add some code to plot the expression of each gene + a boxplot would be ideal - However, we might only want plots for genes with a "significant" pvalue - Here's how we can use an `if` statement to test for this + for each iteration of the the loop: 1. test if the p-value from the test is below 0.05 or not 2. if the p-value is less than 0.05 make a boxplot 3. if not, do nothing ```{r} pdf("Chromosome8Genes.pdf") pvals <- NULL for (i in 1:18) { testResult <- t.test(as.numeric(chr8Expression[i,]) ~ factor(patientMetadata$er)) pvals[i] <- testResult$p.value if(testResult$p.value < 0.05){ boxplot(as.numeric(chr8Expression[i,]) ~ factor(patientMetadata$er), main=chr8Genes$HUGO.gene.symbol[i]) } } pvals dev.off() ``` ##Code formatting avoids bugs! Compare: ```{r eval=FALSE} f<-26 while(f!=0){ print(letters[f]) f<-f-1} ``` to: ```{r eval=FALSE} f <- 26 while(f != 0 ){ print(letters[f]) f <- f-1 } ``` - The code between brackets `{}` *always* is *indented*, this clearly separates what is executed once, and what is run multiple times - Trailing bracket `}` always alone on the line at the same indentation level as the initial bracket `{` - Use white spaces to divide the horizontal space between units of your code, e.g. around assignments, comparisons # Making a heatmap - A heatmap is often used to visualise how the expression level of a set of genes vary between conditions - Making the plot is actually quite straightforward + providing you have processed the data appropriately! - Let's take a list of "most-variable genes" + see below for how we identified such genes - `heatmap` requires a matrix object rather than a data frame ```{r} genelist <- c("CLIC6","TFF3","PDZK1","SCUBE2","CYP2B6","HOXB13","NAT1","LY6D","SLC7A2") probes <- geneAnnotation$probe[match(genelist, geneAnnotation$HUGO.gene.symbol)] probes exprows <- match(probes, rownames(normalizedValues)) heatmap(as.matrix(normalizedValues[exprows,])) ``` ## Heatmap adjustments - We can provide a colour legend for the samples - Adjust colour of cells - Label the rows according to gene name - Don't print the sample names as they are too cluttered ```{r} library(RColorBrewer) sampcol <- rep("blue", ncol(normalizedValues)) sampcol[patientMetadata$er == 1 ] <- "yellow" rbPal <- brewer.pal(10, "RdBu") heatmap(as.matrix(normalizedValues[exprows,]), ColSideColors = sampcol, col=rbPal, labRow = genelist,labCol="") ``` - see also + `heatmap.2` from `library(gplots)`; `example(heatmap.2)` + `heatmap.plus` from `library(heatmap.plus)`; `example(heatmap.plus)` ## (Supplementary) Choosing the genes for the heatmap Often when using R you will come across a convenient shortcut function that can save you many hours of coding and frustration. - the `genefilter` package in Bioconductor contains many useful methods for filtering genomic datasets - you can install this package with the following commands ```{r eval=FALSE} source("http://www.bioconductor.org/biocLite.R") biocLite("genefilter") ``` - the `rowSds` function will calculate the standard deviation for each row in a numeric matrix - the output will be vector, with each element being the standard deviation for a corresponding gene ```{r} library(genefilter) geneVar <- rowSds(normalizedValues) geneVar[1] sd(normalizedValues[1,]) ``` - we can now `order` this matrix to get the subset with `[]` to get the indices of the most-variable genes (10 in this case). - the same indices can be used to subset the gene annotation data frame + we can do this because the annotation data frame and expression matrix are in the same order ```{r} topVar <- order(geneVar,decreasing = TRUE)[1:10] topVar geneAnnotation[topVar,] ```