2. Statistics
Built-in support for statistics
- R is a statistical programming language:
- Classical statistical tests are built-in
- Statistical modeling functions are built-in
- Regression analysis is fully supported
- Additional mathematical packages are available (
MASS
, Waves, sparse matrices, etc)
Distribution functions
- Most commonly used distributions are built-in, functions have stereotypical names, e.g. for normal distribution:
pnorm
- cumulative distribution for x
qnorm
- inverse of pnorm (from probability gives x)
dnorm
- distribution density
rnorm
- random number from normal distribution
Available for variety of distributions: punif
(uniform), pbinom
(binomial), pnbinom
(negative binomial), ppois
(poisson), pgeom
(geometric), phyper
(hyper-geometric), pt
(T distribution), pf (F distribution)
10 random values from the Normal distribution with mean 10 and standard deviation 5:
rnorm(10, mean=10, sd=5)
[1] 5.693620 4.429255 0.442925 9.223783 14.217208 10.117813 10.052955 9.347432
[9] 7.879201 3.260119
- The probability of drawing exactly 10 from this distribution:
dnorm(10, mean=10, sd=5)
[1] 0.07978846
dnorm(100, mean=10, sd=5)
[1] 3.517499e-72
- The probability of drawing a value smaller than 10:
pnorm(10, mean=10, sd=5)
[1] 0.5
qnorm(0.5, mean=10, sd=5)
[1] 10
- How many standard deviations for statistical significance?
qnorm(0.95, mean=0, sd=1)
[1] 1.644854
Example
Recall our histogram of Wind Speed from yesterday:
- The data look to be roughly normally-distributed
- An assumption we rely on for various statistical tests
weather <- read.csv("ozone.csv")
hist(weather$Wind, col="steelblue", xlab="Wind Speed",
main="Distribution of Wind Speed",
breaks = 20, freq=FALSE)
Create a normal distribution curve
- If our data are normally-distributed, we can calculate the probability of drawing particular values.
windMean <- mean(weather$Wind)
windSD <- sd(weather$Wind)
dnorm(10, mean=windMean, sd=windSD)
[1] 0.1132311
- We can overlay this on the histogram using
points
as we just saw:
hist(weather$Wind, col="steelblue", xlab="Wind Speed",
main="Distribution of Wind Speed",
breaks = 20, freq=FALSE)
points(10, dnorm(10, mean=windMean, sd=windSD),
col="red", pch=16)
- We can repeat the calculation for a vector of values
- remember that functions in R are often vectorized
- use
lines
in this case rather than points
xs <- c(0,5,10,15,20)
ys <- dnorm(xs, mean=windMean, sd=windSD)
hist(weather$Wind, col="steelblue", xlab="Wind Speed",
main="Distribution of Wind Speed",
breaks = 20, freq=FALSE)
lines(xs, ys, col="red")
- For a smoother curve, use a longer vector:
- we can generate x values using the
seq()
function
- Inspecting the data in this manner is usually acceptable to assess normality
- the fit doesn’t have to be exact
- the shapiro test is also available
?shapiro.test
(but not really recommended by statisticians)
hist(weather$Wind,col="steelblue",xlab="Wind Speed",
main="Distribution of Wind Speed",breaks = 20,freq=FALSE)
xs <- seq(0,20, length.out = 10000)
ys <- dnorm(xs, mean=windMean,sd=windSD)
lines(xs,ys,col="red")
Simple testing
- If we want to compute the probability of observing a particular Wind Speed, from the same distribution, we can use the standard formula to calculate a t statistic:
\[t = \frac{\bar{x} -\mu_0}{s / \sqrt(n)}\]
- Say a Wind Speed of 2; which from the histogram seems to be unlikely
t <- (windMean - 2) / (windSD/sqrt(length(weather$Wind)))
t
[1] 27.93897
- …or use the
t.test()
function to compute the statistic and corresponding p-value
t.test(weather$Wind, mu=2)
One Sample t-test
data: weather$Wind
t = 27.939, df = 152, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 2
95 percent confidence interval:
9.394804 10.520229
sample estimates:
mean of x
9.957516
- A variety of tests are supported the R authors have tried to make them as consistent as possible
?var.test
?t.test
?wilcox.test
?prop.test
?cor.test
?chisq.test
?fisher.test
- Bottom-line: Pretty much any statistical test you care to name will probably be in R
- This is not supposed to be a statistics course (sorry!)
- None of them are particular harder than others to use
- The difficulty is deciding which test to use:
- whether the assumptions of the test are met, etc.
- Consult your local statistician if not sure
- An upcoming course that will help
- Some good references:
- R will do whatever you ask it, it will never check the assumptions or help you to interpret the result
Example analysis
- We have already seen that men in our
patients
dataset tend to be heavier than women
- We can test this formally in R
- N.B. the weight of people in a population tends to be normally distributed, so we can probably be safe to use a parametric test
We need to run this if we don’t have the patients data in our R environment
patients <- read.delim("patient-info.txt")
We can test if the variance of the two groups is the same
- this will influence what flavour of t-test we use
var.test(patients$Weight~patients$Sex)
F test to compare two variances
data: patients$Weight by patients$Sex
F = 0.14216, num df = 49, denom df = 44, p-value = 3.59e-10
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.07900337 0.25344664
sample estimates:
ratio of variances
0.1421572
t.test(patients$Weight~patients$Sex, var.equal=FALSE)
Welch Two Sample t-test
data: patients$Weight by patients$Sex
t = -11.204, df = 55.168, p-value = 7.786e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-15.62079 -10.88094
sample estimates:
mean in group Female mean in group Male
68.95980 82.21067
If were unwilling to make a assumption of normality, a non-parametric test could be used.
wilcox.test(patients$Weight~patients$Sex)
Wilcoxon rank sum test with continuity correction
data: patients$Weight by patients$Sex
W = 59, p-value = 1.993e-15
alternative hypothesis: true location shift is not equal to 0
Linear Regression
x <- c(1, 2.3, 3.1, 4.8, 5.6, 6.3)
y <- c(2.6, 2.8, 3.1, 4.7, 5.1, 5.3)
plot(x,y, xlim=c(0,10), ylim=c(0,10))
- The
~
is used to define a formula; i.e. “y is given by x”
- Take care about the order of
x
and y
in the plot
and lm
expressions
plot(x,y, xlim=c(0,10), ylim=c(0,10))
myModel <- lm(y~x)
abline(myModel, col="red")
The generic summary
function give an overview of the model
- particular components are accessible if we know their names
summary(myModel)
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5 6
0.33159 -0.22785 -0.39520 0.21169 0.14434 -0.06458
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.68422 0.29056 5.796 0.0044 **
x 0.58418 0.06786 8.608 0.0010 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3114 on 4 degrees of freedom
Multiple R-squared: 0.9488, Adjusted R-squared: 0.936
F-statistic: 74.1 on 1 and 4 DF, p-value: 0.001001
names(myModel) # Names of the objects within myModel
[1] "coefficients" "residuals" "effects" "rank" "fitted.values"
[6] "assign" "qr" "df.residual" "xlevels" "call"
[11] "terms" "model"
- alternatively, various helper functions are provided.
coef(myModel) # Coefficients
(Intercept) x
1.6842239 0.5841843
resid(myModel) # Residuals
1 2 3 4 5 6
0.33159186 -0.22784770 -0.39519512 0.21169160 0.14434418 -0.06458482
fitted(myModel) # Fitted values
1 2 3 4 5 6
2.268408 3.027848 3.495195 4.488308 4.955656 5.364585
residuals(myModel) + fitted(myModel) # what values does this give?
1 2 3 4 5 6
2.6 2.8 3.1 4.7 5.1 5.3
You can also get some diagnostic information on the model.
- Some explanation can be found here and here
par(mfrow=c(2,2))
plot(myModel)
- R has a very powerful formula syntax for describing statistical models
- Suppose we had two explanatory variables
x
and z
, and one response variable y
- We can describe a relationship between, say,
y
and x
using a tilde ~
, placing the response variable on the left of the tilde and the explanatory variables on the right:
- It is very easy to extend this syntax to do multiple regressions, ANOVAs, to include interactions, and to do many other common modelling tasks. For example
y~x #If x is continuous, this is linear regression
y ~ x
y~x #If x is categorical, ANOVA
y ~ x
y~x+z #If x and z are continuous, multiple regression
y ~ x + z
y~x+z #If x and z are categorical, two-way ANOVA
y ~ x + z
y~x+z+x:z # : is the symbol for the interaction term
y ~ x + z + x:z
y~x*z # * is a shorthand for x+z+x:z
y ~ x * z
Exercise: Exercise 6
- There are suggestions that Ozone level could be influenced by Temperature:
plot(weather$Temp, weather$Ozone,xlab="Temperature",ylab="Ozone level",pch=16)
- Perform a linear regression analysis to assess this:
- Fit the linear model and print a summary of the output
- Plot the two variables and overlay a best-fit line
- What is the equation of the best-fit line in the form \[ y = ax + b\]
- Calculate the correlation between the two variables using the function cor (?cor)
- can you add text to the plot to show the correlation coefficient?
- HINT: The
paste
can be used to join strings of text together, or variables
paste("Hello","World")
[1] "Hello World"
age <- 35
paste("My age is", age)
[1] "My age is 35"
## Your Answer Here ##
---
title: "Introduction to Solving Biological Problems Using R - Day 2"
author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
  Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
---


# 2. Statistics
##Built-in support for statistics
- R is a statistical programming language:
    + Classical statistical tests are built-in
    + Statistical modeling functions are built-in
    + Regression analysis is fully supported
    + Additional mathematical packages are available (`MASS`, Waves, sparse matrices, etc)
  
##Distribution functions  
- Most commonly used distributions are built-in, functions have stereotypical names, e.g. for normal distribution:
    + **`pnorm`** - cumulative distribution for x
    + **`qnorm`** - inverse of pnorm (from probability gives x)
    + **`dnorm`** - distribution density
    + **`rnorm`** - random number from normal distribution
  
![distributions](images/distributions.png)  
  
- Available for variety of distributions: `punif` (uniform), `pbinom` (binomial), `pnbinom` (negative binomial), `ppois` (poisson), `pgeom` (geometric), `phyper` (hyper-geometric), `pt` (T distribution), pf (F distribution) 


- 10 random values from the Normal distribution with mean 10 and standard deviation 5:
```{r}
rnorm(10, mean=10, sd=5)
```
- The probability of drawing exactly 10 from this distribution:
```{r}
dnorm(10, mean=10, sd=5)
```

```{r}
dnorm(100, mean=10, sd=5)
```


- The probability of drawing a value smaller than 10:
```{r}
pnorm(10, mean=10, sd=5)
```
- The inverse of `pnorm()`:
```{r}
qnorm(0.5, mean=10, sd=5)
```
- How many standard deviations for statistical significance?
```{r}
qnorm(0.95, mean=0, sd=1)
```

## Example

Recall our histogram of Wind Speed from yesterday:

- The data look to be roughly normally-distributed
- An assumption we rely on for various statistical tests

```{r}
weather <- read.csv("ozone.csv")
hist(weather$Wind, col="steelblue", xlab="Wind Speed",
     main="Distribution of Wind Speed",
     breaks = 20, freq=FALSE)
```

## Create a normal distribution curve

- If our data are normally-distributed, we can calculate the probability of drawing particular values.
      + e.g. a Wind Speed of 10

```{r}
windMean <- mean(weather$Wind)
windSD <- sd(weather$Wind)
dnorm(10, mean=windMean, sd=windSD)
```

- We can overlay this on the histogram using `points` as we just saw:
```{r}
hist(weather$Wind, col="steelblue", xlab="Wind Speed",
     main="Distribution of Wind Speed",
     breaks = 20, freq=FALSE)
points(10, dnorm(10, mean=windMean, sd=windSD),
       col="red", pch=16)
```

- We can repeat the calculation for a vector of values
    + remember that functions in R are often ***vectorized***
    + use `lines` in this case rather than `points`
    
```{r}
xs <- c(0,5,10,15,20)
ys <- dnorm(xs, mean=windMean, sd=windSD)
hist(weather$Wind, col="steelblue", xlab="Wind Speed",
     main="Distribution of Wind Speed",
     breaks = 20, freq=FALSE)
lines(xs, ys, col="red")
```




- For a smoother curve, use a longer vector:
    + we can generate x values using the `seq()` function
- Inspecting the data in this manner is usually acceptable to assess normality
    + the fit doesn't have to be exact
    + the shapiro test is also available `?shapiro.test` (but not really recommended by statisticians)


```{r}
hist(weather$Wind,col="steelblue",xlab="Wind Speed",
     main="Distribution of Wind Speed",breaks = 20,freq=FALSE)
xs <- seq(0,20, length.out = 10000)
ys <- dnorm(xs, mean=windMean,sd=windSD)
lines(xs,ys,col="red")
```

## Simple testing

- If we want to compute the probability of observing a particular Wind Speed, from the same distribution, we can use the standard formula to calculate a t statistic:

$$t = \frac{\bar{x} -\mu_0}{s / \sqrt(n)}$$

- Say a Wind Speed of 2; which from the histogram seems to be unlikely

```{r}
t <- (windMean - 2) / (windSD/sqrt(length(weather$Wind)))
t
```

- ...or use the **`t.test()`** function to compute the statistic and corresponding p-value

```{r}
t.test(weather$Wind, mu=2)
```

- A variety of tests are supported the R authors have tried to make them as consistent as possible

```{r}
?var.test
?t.test
?wilcox.test
?prop.test
?cor.test
?chisq.test
?fisher.test
```

    
- Bottom-line: Pretty much any statistical test you care to name will probably be in R
    + This is not supposed to be a statistics course (sorry!)
    + None of them are particular harder than others to use
    + The difficulty is deciding which test to use:
        + whether the assumptions of the test are met, etc.
    + Consult your local statistician if not sure
    + An upcoming course that will help
        + [Introduction to Statistical Analysis](http://bioinformatics-core-shared-training.github.io/IntroductionToStats/)
    + Some good references:
        + [Statistical Analysis Using R (Course from the Babaraham Bioinformatics Core)](http://training.csx.cam.ac.uk/bioinformatics/event/1827771)
        + [Quick-R guide to stats](http://www.statmethods.net/stats/index.html)
        + [Simple R eBook](https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf)
        + [R wiki](https://en.wikibooks.org/wiki/R_Programming/Descriptive_Statistics)
        + [R tutor](http://www.r-tutor.com/elementary-statistics)
        + [Statistical Cheatsheet](https://rawgit.com/bioinformatics-core-shared-training/intermediate-stats/master/cheatsheet.pdf)
   - R will do whatever you ask it, it will never check the assumptions or help you to interpret the result

![](images/clippy.jpg)

## Example analysis

- We have already seen that men in our `patients` dataset tend to be heavier than women
- We can **test this formally** in R
    + N.B. the weight of people in a population tends to be normally distributed, so we can probably be safe to use a parametric test
    
![](images/males-versus-females.png)




We need to run this if we don't have the patients data in our R environment
```{r}
patients <- read.delim("patient-info.txt")
```

We can test if the variance of the two groups is the same

+ this will influence what flavour of t-test we use

```{r}
var.test(patients$Weight~patients$Sex)
```


```{r}
t.test(patients$Weight~patients$Sex, var.equal=FALSE)
```

If were unwilling to make a assumption of normality, a non-parametric test could be used.

```{r}
wilcox.test(patients$Weight~patients$Sex)
```

## Linear Regression

- Linear modeling is supported by the function `lm()`:
    + `example(lm)`
- The output assumes you know a fair bit about the subject
- `lm` is really useful for plotting lines of best fit to XY data, in order to determine intercept, gradient and Pearson's correlation coefficient
    + This is very easy in R
- Three steps to plotting with a best fit line:

    + Plot XY scatter-plot data
    + Fit a linear model
    + Add bestfit line data to plot with `abline()` function
- Let's see a toy example:-

```{r}
x <- c(1, 2.3, 3.1, 4.8, 5.6, 6.3)
y <- c(2.6, 2.8, 3.1, 4.7, 5.1, 5.3)
plot(x,y, xlim=c(0,10), ylim=c(0,10))
```

- The `~` is used to define a formula; i.e. "y is given by x"
    + Take care about the order of `x` and `y` in the `plot` and `lm` expressions

```{r}
plot(x,y, xlim=c(0,10), ylim=c(0,10))
myModel <- lm(y~x)
abline(myModel, col="red")
```

The generic `summary` function give an overview of the model

- particular components are accessible if we know their names

```{r}
summary(myModel)
names(myModel)  # Names of the objects within myModel
```

- alternatively, various helper functions are provided.

```{r}
coef(myModel)   # Coefficients
resid(myModel)  # Residuals
fitted(myModel) # Fitted values
residuals(myModel) + fitted(myModel) # what values does this give?
```

You can also get some diagnostic information on the model.

- Some explanation can be found [here](http://data.library.virginia.edu/diagnostic-plots/) and [here](http://strata.uga.edu/6370/rtips/regressionPlots.html)

```{r}
par(mfrow=c(2,2)) 
plot(myModel)
```

- R has a very powerful formula syntax for describing statistical models
- Suppose we had two explanatory variables `x` and `z`, and one response variable `y`
- We can describe a relationship between, say, `y` and `x` using a tilde `~`, placing the response variable on the left of the tilde and the explanatory variables on the right:
    + y~x
- It is very easy to extend this syntax to do multiple regressions, ANOVAs, to include interactions, and to do many other common modelling tasks. For example

```{r}
y~x       #If x is continuous, this is linear regression
y~x       #If x is categorical, ANOVA
y~x+z     #If x and z are continuous, multiple regression
y~x+z     #If x and z are categorical, two-way ANOVA
y~x+z+x:z # : is the symbol for the interaction term
y~x*z     # * is a shorthand for x+z+x:z
```

## Exercise: Exercise 6

- There are suggestions that Ozone level could be influenced by Temperature:
```{r}
plot(weather$Temp, weather$Ozone,xlab="Temperature",ylab="Ozone level",pch=16)
```

- Perform a linear regression analysis to assess this:
    + Fit the linear model and print a summary of the output
    + Plot the two variables and overlay a best-fit line
    + What is the equation of the best-fit line in the form
      $$ y = ax + b$$
- Calculate the correlation between the two variables using the function cor (?cor)
    + can you add text to the plot to show the correlation coefficient?
    + HINT: The `paste` can be used to join strings of text together, or variables

```{r}
paste("Hello","World")
age <- 35
paste("My age is", age)
```


```{r}

## Your Answer Here ##

```

![](images/exercise6.png)
![](images/exercise6b.png)


## More words of warning

***Correlation != Causation***

![](images/spurious.png)

http://tylervigen.com/spurious-correlations


![](images/nobel-prize.jpeg)

[So if I want to win a nobel prize, I should eat even more chocolate?!?!?](http://www.businessinsider.com/chocolate-consumption-vs-nobel-prizes-2014-4?IR=T)

But no-one would ever take such trends seriously....would they?


![Cutting-down on Ice Cream was recommended as a safeguard against polio!](images/icecreamvspolio.jpg)

