--- title: "Differential Expression Tutorial" author: "Mark Dunning; mark 'dot' dunning 'at' cruk.cam.ac.uk, Oscar Rueda; oscar 'dot' rueda 'at' cruk.cam.ac.uk" date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' output: html_document --- # Importing the data The data from this experiment comprises nine paired tumor/normal colon tissues on Illumina HT12\_v3 gene expression Beadchips. These data were generated to inform a comparison of technologies for microRNA profiling. However, we will only use the mRNA data here. ```{r echo=FALSE,message=FALSE} library(GEOquery) library(limma) ``` ```{r cache=TRUE} library(GEOquery) library(limma) url <- "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE33126/GSE33126_series_matrix.txt.gz" filenm <- "data/GSE33126_series_matrix.txt.gz" if(!file.exists(filenm)) download.file(url, destfile=filenm) gse <- getGEO(filename=filenm) gse ``` ```{r} head(exprs(gse)) ``` ****** ##Q Do the data look to be normalised? ****** ```{r} exprs(gse) <- log2(exprs(gse)) boxplot(exprs(gse),outline=FALSE) ``` Inspect the clinical variables ****** ##Q What column has the information about whether is sample is a tumour or normal? ##Q How many do we have in each group? ****** ```{r echo=FALSE} ## Your answer here ## ``` ****** ##Q Create a design matrix for the experiment, on paper if you like ****** # Simple t-tests For completeness, we note that t-statistics and corresponding p-values can be generated quickly using the `genefilter` package. ```{r} pd <- pData(gse) SampleGroup <- pd$source_name_ch1 library(genefilter) destats <- rowttests(exprs(gse),fac=SampleGroup) head(destats) ``` - Can we remember the drawbacks of this approach? A useful function is the `model.matrix`, which understands the linear model syntax of R (i.e. using the tilde `~` symbol) ```{r} design <- model.matrix(~0+SampleGroup) colnames(design) <- c("Normal","Tumour") ``` By-hand, we would do; ```{r} design <- matrix(nrow=18,ncol=2) design[,1] <- c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1) design[,2] <- c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0) colnames(design) <- c("Normal","Tumour") ``` or a shortcut in this would use the `rep` function. ```{r} design[,1] <- rep(c(0,1),by=9) design[,2] <- rep(c(1,0),by=9) ``` Prior to the differential expression analysis, we will *filter* the data so only the top 50% most-variable genes get analysed, which [has been shown](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2906865/) to increase our power to detect differential expression. We can do this using the `varFilter` function in the `genefilter` package. ```{r} library(genefilter) gse.expFilt <- varFilter(gse) gse.expFilt ``` The `lmFit` funcion is used to fit the model to the data. The output has various components, most of which you probably don't need to know about. ```{r} fit <- lmFit(exprs(gse.expFilt), design) names(fit) dim(fit$coefficients) head(fit$coefficients) ``` ****** ##Q. What is the interpretation of the `coeff` item in the output ****** Now we define the contrast ```{r} contrasts <- makeContrasts(Tumour - Normal, levels=design) fit2 <- contrasts.fit(fit, contrasts) head(fit2$coeff) ``` Finally, apply the *empirical Bayes'* step ```{r} fit2 <- eBayes(fit2) fit2 ``` We usually get our first look at the results by using the `topTable` command ```{r} topTable(fit2) ``` The rows in this table are ordered according the *B-statistic*, or log-odds from the empirical Bayes' analysis. You should be cautious when interpreting these values. Although a B-statistic of 0 corresponds to a 50/50 chance of a gene being DE, this is contingent on a certain percentage of genes in the experiment being DE. In experiments such as the one we are looked at here, there are a large number of DE genes. In this case genes with B-statistic < 0 could still be DE. Thus, you should use the B-statistic as a ranking criteria and not attempt to compare the values between experiments. # Including annotation ```{r} anno <- fData(gse.expFilt) head(anno)[,1:5] anno <- anno[,c("Symbol","Entrez_Gene_ID","Chromosome","Cytoband")] fit2$genes <- anno topTable(fit2) ``` # Diagnostic checks At this point it is useful to take stock and do some diagnostic checks ```{r} dotchart(exprs(gse)["ILMN_1704294",]) ``` ```{r} boxplot(exprs(gse)["ILMN_1704294",]~SampleGroup) ``` To find how many genes are DE we can use the `decideTests` function, which also applies multiple-test correction. *Make a note of how many genes are DE, you will need this later* ```{r} decideTests(fit2) table(decideTests(fit2)) sum(abs(decideTests(fit2))==1) ``` The "*Volcano Plot*" function is a common way of visualising the results of a DE analysis. The $x$ axis shows the log-fold change and the $y$ axis is some measure of statistical significance, which in this case is the log-odds, or "B" statistic. A characteristic "volcano" shape should be seen. ```{r} volcanoplot(fit2) ``` The function also has options to highlight the postions of the top N genes from the test. ```{r} volcanoplot(fit2,highlight=10,names = fit2$genes$"Symbol") ``` # Exporting the results ```{r} write.fit(fit2, file = "de-results.txt",adjust="BH") ``` # Getting the results for a particular gene The `topTable` function by default only prints the top 10 DE genes. However, by changing the arguments we can make it print the statistics for all the genes that we test. ```{r} testResults <- topTable(fit2, number=nrow(fit2)) ``` This gives us the opportunity to look-up the results for particular genes of interest. Lets say we're interested in "OCIAD2" (for no particular reason...) ```{r} testResults[which(testResults$Symbol == "OCIAD2"),] ``` If we have a list of genes, the `%in%` function can be used. ```{r} mylist <- c("LOC441066","ARF3","FMNL3","CSF1R","XLKD1","TTRAP","DMWD","SYNPO2L","PILRB","LAMP3") testResults[which(testResults$Symbol %in% mylist),] ``` # Further filtering and ordering ****** - Q. Create a data frame containing genes with i) adjusted p-value < 0.05 ii) log-fold change > 1. How many genes does this contain? - Q. Now order this data frame according to chromosome ****** ```{r echo=FALSE} ## Your Answer here ## ``` # Improved analysis: Doing a paired analysis ****** - Q. What would the design matrix look like to incorporate patient ID into the model - Q. Create this model using the `model.matrix` function ****** ```{r echo=FALSE} ## Your answer here ## ``` ****** - Q. Proceed to fit your model to the data - Q. How many genes are DE? How does this compare to the previous model? ****** ```{r} ## Your answer here ## ``` # The estrogen dataset Recall the estrogen dataset from yesterday ```{r} library(affy) targetsFile <- "estrogen/estrogen.txt" pd <- read.AnnotatedDataFrame(targetsFile,header=TRUE,sep="",row.names=1) pData(pd) ER <- pData(pd)$estrogen Time <- factor(pData(pd)$time.h) design <- model.matrix(~ER+Time) design design2 <- model.matrix(~ER*Time) design2 ``` # Reading and normalising the data We read the data and then apply rma normalisation. What are the steps? ```{r} raw <-ReadAffy(celfile.path = "estrogen", filenames=rownames(pData(pd)),phenoData = pd) raw eset <- rma(raw) eset ``` # Fitting the models Fit the model with and without interaction. ```{r} library(limma) fit1 <- lmFit(eset, design) fit1 <- eBayes(fit1) topTable(fit1, coef=2) fit2 <- lmFit(eset, design2) fit2 <- eBayes(fit2) topTable(fit2, coef=2) ``` ****** ## Q. Which one do you prefer? Why? ****** ```{r echo=FALSE} table(decideTests(fit2)[,4]) ```