---
title: Overview of linear modelling
author: Aaron Lun and Catalina Vallejos
date: 11 December 2016
---

```{r, echo=FALSE, results="hide"}
knitr::opts_chunk$set(error=FALSE)
```

# Introduction

## What are linear models?

This practical will walk you through the set-up of linear models.
These are very flexible models that can accommodate a range of experimental designs.
Each observation is described as a linear sum of predictive variables plus a normally-distributed error term:

```
Y_i = B_0 + B_1 * X_i1 + B_2 * X_i2 + ... + B_n * X_in + e_i
```

Here, `Y_i` is the observation for sample _i_;
`X_ij` is the value of predictor _j_ for sample _i_;
`B_j` is the coefficient for predictor _j_ (`B_0` is a special case, known as the "intercept", where all `X_i0 = 1`);
and `e_i` is the error term.
Linear modelling assumes that the error (i.e., the difference between each observation and its mean) is:

- random and independent across observations
- normally distributed, with mean zero and a constant standard deviation.

It also assumes that the model was correctly specified, i.e., the right predictors were used.

## Demonstrating with linear regression on plant heights

This setup is best illustrated by the specific case of simple linear regression.
Let's say we have plants of differing ages and their heights:

```
#   data.frame(Plant=LETTERS[1:5], Age=1:5*2, Height=rnorm(5, 1:5, sd=0.1))
  Plant Age   Height
1     A   2 1.052562
2     B   4 1.811050
3     C   6 2.925804
4     D   8 3.770163
5     E  10 4.995787
```

If we want to fit a line to the height against the age, we would consider the following model:

```
Height = Intercept + Slope * Age + Error
```

This can be rearranged into the general format of a linear model for each plant _i_:

```
Height_i = B_0 + B_1 * Age_i + e_i
```

`B_0` is the intercept, corresponding to a predictor `X_i0` with a value of 1 for all _i_.
The `Age_i` term is the other predictor, and has a value of 2 for plant A, 4 for plant B, etc.
`B_1` is the coefficient for the `Age_i` predictor, representing the gradient/slope of the line.
`e_i` is the error term that is i.i.d. normal for all _i_.

Here, `Age_i` is a continuous "covariate".
This differs from categorical "factors", which will be discussed in more detail later.
Also note that the model specification assumes that the trend is linear.

# Setting up a linear model with covariates

Let's say we measure the amount of product generated by a reaction over time.

```{r}
conc <- c(1.61, 2.27, 2.92, 3.40, 5.27, 6.15, 6.69, 8.48, 8.31, 9.35) # uM
time <- c( 0.5,  1.0,  1.5,  2.0,  2.5,  3.0,  3.5,  4.0,  4.5,  5.0) # min
```

We can fit a linear model easily:

```{r}
fit <- lm(conc ~ time)
summary(fit)
```

What do the coefficients represent, and how do I get their values?

```{r}
coef(fit)
```

Each observation has a "fitted value", representing the value that is estimated from the fitted line.
How do I get the fitted values?

```{r}
fitted(fit)
```

How do I get the residual for each observation?
What is the relationship between the fitted values, residuals and observations?

```{r}
residuals(fit)
fitted(fit) + residuals(fit)
```

How can I get the R^2^?
What does this represent?
How is it related to Pearson's correlation?

```{r}
cor(time, conc)^2
summary(fit)$r.squared
```
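To make the definition of R^2^ more concrete, here is a small optional sketch (not part of the original exercise) that recomputes it by hand from the `fit` object above, as one minus the ratio of the residual sum of squares to the total sum of squares; the intermediate variable names are just for illustration.

```{r}
# Recomputing R^2 manually (illustrative only): 1 - SS_residual/SS_total.
ss.res <- sum(residuals(fit)^2)
ss.tot <- sum((conc - mean(conc))^2)
1 - ss.res/ss.tot # should match summary(fit)$r.squared
```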
How can I get the p-values, and what test do the p-values correspond to?
Each p-value represents the evidence against the null hypothesis that the corresponding coefficient is equal to zero.
Does the p-value for the intercept have any meaning?

```{r}
summary(fit)$coefficients
```

How can I plot the data, along with the line of best fit?

```{r}
plot(time, conc, xlab="Time (min)", ylab="Concentration (uM)")
abline(a=coef(fit)[1], b=coef(fit)[2], col="red", lwd=2)
r2 <- round(summary(fit)$r.squared, 3)
text(min(time), max(conc), bquote(R^2*"="*.(r2)), pos=4, col="red")
```

__Advanced:__ How do I model non-linear trends with respect to the covariate?
Using splines!
Interested readers can try to decipher the following code.

```{r}
# Reaction rates for an enzyme at varying temperatures.
rate <- c(3.7, 10.8, 18.0, 23.7, 28.9, 26.9, 26.6, 25.0, 18.0, 10.7) # uM/min
temp <- c(  5,   10,   15,   20,   25,   30,   35,   40,   45,   50) # degrees Celsius

# Fitting a linear model with a spline basis matrix with 3 degrees of freedom.
library(splines)
fit <- lm(rate ~ ns(temp, df=3))

# Looks pretty good.
plot(temp, rate, xlab="Temperature (dC)", ylab="Reaction rate (uM/min)")
lines(temp, fitted(fit), col="red")
```

# Experimental designs with categorical factors

Let's say we have an experiment measuring the weights of:

- untreated individuals
- individuals treated with drug A
- individuals treated with drug Z
- individuals treated with drugs A and Z

I might set up my linear model with three predictor terms:

- `B_0`, the intercept, representing the weight of my untreated group
- `B_A`, representing the weight change upon drug A treatment
- `B_Z`, representing the weight change upon drug Z treatment

If I have untreated individuals, the predictor variables `X_A` and `X_Z` are both zero, because they were not treated with either drug.
As such, the weight of individuals in this group can be described as:

```
Y_i = B_0 + B_A * 0 + B_Z * 0 + e_i = B_0 + e_i
```

If I have individuals treated with drug A only, `X_A = 1` while `X_Z = 0`.
Thus, the weight of individuals in this group can be described as:

```
Y_i = B_0 + B_A * 1 + B_Z * 0 + e_i = B_0 + B_A + e_i
```

Conversely, if I have individuals treated with drug Z only, we get:

```
Y_i = B_0 + B_A * 0 + B_Z * 1 + e_i = B_0 + B_Z + e_i
```

Finally, if I have individuals treated with both drugs, we get:

```
Y_i = B_0 + B_A * 1 + B_Z * 1 + e_i = B_0 + B_A + B_Z + e_i
```

Let's say that I fit this linear model and I end up with estimated values of:

- `B_0 = 60`
- `B_A = 10`
- `B_Z = 5`

This allows me to calculate the average (i.e., "fitted value") for each group of individuals in this model:

- for the untreated individuals, the average is just `B_0 = 60`
- for the drug A-treated individuals, the average is `B_0 + B_A = 70`
- for the drug Z-treated individuals, the average is `B_0 + B_Z = 65`
- for the drug A and Z-treated individuals, the average is `B_0 + B_A + B_Z = 75`

In short, the relationship between the dependent variable (weight) and my independent variables (drug treatment) is a linear sum of the coefficients.

# Setting up a linear model with factors

## What are factors?

Factors are categorical variables:

```{r}
hair.col <- factor(c("red", "red", "brown", "black", "pink", "brown"))
hair.col
fruit <- factor(c("apple", "orange", "orange", "apple", "banana"))
fruit
```

These have "levels" specifying the different types of categories that need to be considered.

```{r}
levels(hair.col)
levels(fruit)
```

For modelling purposes, the first level is important, as it determines the "baseline" against which all other levels are compared.
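To see exactly how a factor enters a linear model, one optional trick (not part of the original exercise) is to inspect the design matrix that R builds from it with `model.matrix`, using the `hair.col` factor from above: the baseline level is absorbed into the intercept, while every other level gets its own 0/1 indicator column.

```{r}
# The first (baseline) level is represented by the intercept;
# each remaining level gets a 0/1 indicator column.
model.matrix(~hair.col)
```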
We can change the first level by doing this:

```{r}
relevel(hair.col, "red")
```

## Using a simple two-group design

Let's have a look at some data on the height of bean sprouts, grown with and without fertilizer.

```{r}
height     <- c( 50,  61,  55,  54,  72,  62,  49,  68,  57)
fertilizer <- c("N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y")
```

We specify that we want our no-fertilizer group to be the baseline.

```{r}
fertilizer <- relevel(factor(fertilizer), "N") # 'N' is first.
fertilizer
```

Let's fit a linear model:

```{r}
fit <- lm(height ~ fertilizer)
summary(fit)
```

What are my coefficients in this model, and what do they represent?
What is the relationship between the fitted values and the coefficients?
What is the relationship between the fitted values, residuals, and `height`?

```{r}
coef(fit)
mean(height[fertilizer=="N"])
mean(height[fertilizer=="Y"])
fitted(fit)
fitted(fit) + residuals(fit)
```

How can I get the p-values, and what test do the p-values correspond to?
Does the p-value for the intercept have any meaning?

```{r}
summary(fit)$coefficients
```

Does this test look familiar (hint: `t.test`)?

```{r}
t.test(height[fertilizer=="Y"], height[fertilizer=="N"], var.equal=TRUE)
```

## Using a multi-group design

What if I repeated the experiment with both high and low levels of fertilizer?

```{r}
height     <- c( 57,  62,  53,  44,  52,  62,  49,  68,  77,  81,  83)
fertilizer <- c("N", "N", "N", "N", "L", "L", "L", "L", "H", "H", "H")
```

Again, specifying the no-fertilizer group as the baseline:

```{r}
fertilizer <- relevel(factor(fertilizer), "N")
fertilizer
```

And now fitting our linear model:

```{r}
fit <- lm(height ~ fertilizer)
summary(fit)
```

What are my coefficients in this model, and what do they represent?
What test do the p-values correspond to?

```{r}
coef(fit)
mean(height[fertilizer=="N"])
mean(height[fertilizer=="L"])
mean(height[fertilizer=="H"])
```

Why is this better than doing a t-test between the groups (hint: variance estimation and degrees of freedom)?

```{r}
in.N <- fertilizer=="N"
in.H <- fertilizer=="H"
t.test(height[in.N], height[in.H], var.equal=TRUE)$p.value
```

How can we test for differences between the high and low treatment groups?
We first "relevel" the factor so that the high-treatment level is the baseline:

```{r}
fertilizer2 <- relevel(fertilizer, ref="H")
fit2 <- lm(height ~ fertilizer2)
summary(fit2)
```

How can I test for _any_ differences between the three fertilizer treatments?
That is, the null hypothesis is that the mean heights for all groups are the same.
We perform an analysis-of-variance (ANOVA) on our fitted model:

```{r}
anova(fit)
```

How do we interpret this?
It's effectively asking whether any levels in the entire `fertilizer` factor have an effect.
Why do this instead of just combining the pairwise comparisons between groups?

## Using a paired-samples design

Let's have a look at some matched patients with histology scores for healthy and diseased tissue.

```{r}
scores  <- c(1.42, 1.98, 2.27, 3.29, 3.03, 4.19, 4.48, 3.14, 4.55, 4.50)
status  <- c( "H",  "H",  "H",  "H",  "H",  "D",  "D",  "D",  "D",  "D")
patient <- c( "A",  "B",  "C",  "D",  "E",  "A",  "B",  "C",  "D",  "E")
```

Let's set the healthy status as the reference level for our `status` factor.
We'll also set the first patient "A" as the reference for our `patient` factor.

```{r}
status <- relevel(factor(status), "H")
patient <- relevel(factor(patient), "A")
```

Let's build our linear model:

```{r}
fit <- lm(scores ~ patient + status)
summary(fit)
```

What does each of the coefficients represent?
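If the interpretation is not obvious, an optional hint (not part of the original exercise) is to look at the design matrix for this paired model with `model.matrix`, using the `patient` and `status` factors defined above: each non-baseline patient gets its own indicator column, plus one column for the disease effect.

```{r}
# One 0/1 indicator column per non-baseline patient, plus one for disease status.
model.matrix(~ patient + status)
```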
Which coefficient is of interest to us, and what is it testing?

```{r}
mean(scores[status=="D"] - scores[status=="H"])
```

Does this test sound familiar?

```{r}
t.test(scores[status=="H"], scores[status=="D"], paired=TRUE)
```

## Using a blocking factor

What happens if each patient has multiple diseases?

```{r}
scores  <- c(2.31, 0.73, 2.50, 2.90, 4.57, 4.68, 4.81, 5.37, 7.80)
status  <- c( "H",  "H",  "H", "D1", "D1", "D1", "D2", "D2", "D2")
patient <- c( "X",  "Y",  "Z",  "X",  "Y",  "Z",  "X",  "Y",  "Z")
```

Let's set the healthy status as the reference level for `status`, and our first patient as the reference level for `patient`:

```{r}
status <- relevel(factor(status), ref="H")
patient <- relevel(factor(patient), ref="X")
```

Let's build our linear model:

```{r}
fit <- lm(scores ~ patient + status)
summary(fit)
```

What does each of the coefficients represent?
Which coefficient is of interest to us, and what is it testing?

```{r}
mean(scores[status=="D1"] - scores[status=="H"])
mean(scores[status=="D2"] - scores[status=="H"])
```

How can I test for differences between D1 and D2?

```{r}
status2 <- relevel(status, ref="D1")
fit2 <- lm(scores ~ patient + status2)
summary(fit2) # look at "status2D2"
```

How can I test for _any_ differences between my healthy/disease conditions?
That is, my null hypothesis is that no disease has an effect on the score.

```{r}
anova(fit)
```

## Covariates vs factors

### Differences in the behaviour of `lm`

Consider the following example.

```{r}
stuff <- c(0.97, 0.82, -0.03, 0.88, 0.64, 1.26, -0.07, 0.47, 0.14, 1.21)
group <- c(   1,    1,     2,    2,    3,    3,     4,    4,    5,    5)
```

What's the difference between:

```{r}
fit1 <- lm(stuff ~ group)
```

... and:

```{r}
g <- factor(group)
fit2 <- lm(stuff ~ g)
```

`fit1` treats `group` as a covariate, and fits a line to `stuff` against the numeric value of `group`.
`fit2` treats `group` as a factor, and estimates a separate mean for each level of `group`.

```{r}
summary(fit1)
summary(fit2)
```

__Moral of the story:__ if you want factors, make sure your terms are factors!
If you want covariates, make sure your terms are covariates!

Of course, it's also possible to have covariates and factors in the same model.
Consider the following example with the weights of mice of different ages in different groups:

```{r}
weights <- c(10, 12, 13, 13, 15, 17, 18, 18, 19)
group   <- c( 1,  1,  1,  2,  2,  2,  3,  3,  3)
age     <- c( 1,  2,  3,  1,  2,  3,  1,  2,  3)
```

We can fit a linear model with `group` as a factor and `age` as a covariate.
What do the coefficients here represent?

```{r}
g <- relevel(factor(group), ref="1")
fit <- lm(weights ~ g + age)
summary(fit)
```

### It's a trap! Watching out for factors

Behold the following lines of code:

```{r}
x <- c(2,3,4,5,6)
which(x==3)
LETTERS[x]
```

What happens when we do this?

```{r}
f <- factor(x)
which(f==3)
LETTERS[f]
```

This happens because the interpretation of a factor depends on its levels (see `levels(f)`).
In the above example, even though `f` shows `2` as its first element, that element is actually stored as `1`, i.e., the index of the _first_ level.
Comparisons (i.e., `==`) use the character "surface" values, while sorting and subsetting use the underlying "storage" values.
Thus, subsetting `LETTERS` with `f` pulls out the first letter rather than the second (and so on for the other elements), which is probably not what was expected.
In short, be very careful when dealing with factors.
They have their uses, but it's advisable to avoid them altogether unless you really need them, e.g., in `lm`.
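If you want to see the two layers for yourself, here is a small optional peek at the internals (not part of the original exercise), using base R's `levels` and `unclass` on the `f` factor above.

```{r}
# The "surface" values are the level labels; the "storage" values are
# integer codes that index into the vector of levels.
levels(f)
unclass(f)
```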
How can we convert factors to normal vectors?

```{r}
as.character(f) # This coerces it to a character vector.
as.integer(levels(f))[f] # This coerces it into an integer vector.
as.integer(f) # This returns the underlying storage values!
```

`read.table` will automatically load strings as factors.
How can we avoid this if we don't want factors?

```{r}
example.file <- "blah.txt"
writeLines(con=example.file, c("John", "Sue", "Mary"))
read.table(example.file)[,1]
read.table(example.file, stringsAsFactors=FALSE)[,1]
```

# Diagnosing model quality

## Checking for normality in the residuals

Fit the following multi-group model of the seed weights of plants of different genotypes.

```{r}
genotype <- rep(c("AA", "Aa", "aa"), each=10)
seedsize <- c(3.6, 5.6, 4.7, 4.3, 4.5, 6.0, 5.9, 4.8, 5.0, 6.9,
              2.9, 4.9, 4.3, 2.8, 2.6, 5.0, 3.6, 1.8, 4.2, 4.4,
              0.4, 2.7, 3.4, 1.0, 1.4, 3.7, 3.0, 1.8, 1.5, 2.6)
genotype <- relevel(factor(genotype), ref="aa")
fit <- lm(seedsize ~ genotype)
```

How can we check if the residuals are normally distributed (hint: `qqnorm`)?

```{r}
resid <- residuals(fit)
qqnorm(resid)
qqline(resid)
```

What about this second data set?

```{r}
seedsize2 <- c(1.2, 3.2, 0.6, 1.2, 3.2, 0.4, 5.6, 2.2, 0.6, 7.0,
               3.8, 0.6, 1.8, 0.8, 1.2, 2.4, 3.4, 6.8, 1.4, 5.0,
               0.8, 9.0, 0.2, 1.0, 0.5, 0.2, 6.6, 2.0, 0.2, 2.0)
fit2 <- lm(seedsize2 ~ genotype)
resid2 <- residuals(fit2)
qqnorm(resid2)
qqline(resid2)
```

Sometimes the data can be transformed to make it look more normal, and thus more appropriate for modelling.
The log-transformation is usually a good place to start (see Box-Cox transformations for more general cases):

```{r}
log.seedsize2 <- log(seedsize2)
lfit2 <- lm(log.seedsize2 ~ genotype)
lresid2 <- residuals(lfit2)
qqnorm(lresid2)
qqline(lresid2)
```

## Checking for heteroskedasticity

Recall that one of the assumptions of the linear model is that the errors are normally distributed with the same variance.
This is not true when heteroskedasticity is present.

> Heteroskedasticity = variability in the variance of our dependent variable across the range of values of our independent variables.

Let's re-use our `fit` object and have a look at the relationship between the residuals and the fitted values.

```{r}
plot(fitted(fit), residuals(fit), xlab="Fitted values", ylab="Residuals")
```

But what about this data set?

```{r}
seedsize3 <- c(3.6, 4.4, 6.7, 4.5, 4.8, 4.1, 4.4, 5.0, 5.6, 5.1,
               2.9, 2.2, 1.7, 1.9, 2.0, 2.7, 1.4, 2.2, 1.4, 2.1,
               0.8, 1.1, 1.1, 1.0, 1.0, 0.9, 1.0, 0.8, 1.0, 1.0)
fit3 <- lm(seedsize3 ~ genotype)
plot(fitted(fit3), residuals(fit3), xlab="Fitted values", ylab="Residuals")
```

Heteroskedasticity often occurs in real positive data because large values tend to vary more than small values.
To fix this, we need to "stabilize" the variance using a transformation.
Again, the log-transformation is a good place to start.

```{r}
log.seedsize3 <- log(seedsize3)
lfit3 <- lm(log.seedsize3 ~ genotype)
plot(fitted(lfit3), residuals(lfit3), xlab="Fitted values", ylab="Residuals")
```

## Checking for additional terms

Let's say, in our above example, that the first two plants from every genotype were processed in one batch, the next two plants in another batch, and so on.

```{r}
batch <- c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E",
           "A", "A", "B", "B", "C", "C", "D", "D", "E", "E",
           "A", "A", "B", "B", "C", "C", "D", "D", "E", "E")
batch <- factor(batch)
```
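As a quick optional sanity check (not part of the original exercise), we can tabulate `batch` against `genotype` with base R's `table` to confirm that every batch contains plants from all three genotypes, so that a batch effect can be estimated alongside the genotype effect.

```{r}
# Each batch should contain two plants from each genotype.
table(batch, genotype)
```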
We can check whether there is a batch effect in our `fit` object by examining the distribution of the residuals for each batch.

```{r}
plot(as.integer(batch), residuals(fit), xaxt="n", ylab="Residuals", xlab="Batch")
axis(side=1, at=1:5, levels(batch))
```

What about this data set?

```{r}
seedsize4 <- c(3.6, 5.6, 2.7, 2.3, 5.5, 7.0, 4.9, 3.8, 8.0, 9.9,
               2.9, 4.9, 2.3, 0.8, 3.6, 6.0, 2.6, 0.8, 7.2, 7.4,
               0.4, 2.7, 1.4, 0.5, 2.4, 4.7, 2.0, 0.8, 4.5, 5.6)
fit4 <- lm(seedsize4 ~ genotype)
plot(as.integer(batch), residuals(fit4), xaxt="n", ylab="Residuals", xlab="Batch")
axis(side=1, at=1:5, levels(batch))
```

How do we fix this?

```{r}
fit4b <- lm(seedsize4 ~ genotype + batch)
plot(as.integer(batch), residuals(fit4b), xaxt="n", ylab="Residuals", xlab="Batch")
axis(side=1, at=1:5, levels(batch))
```

# Session information

```{r}
sessionInfo()
```