Data preparation

Quick recap of how to download data from a URL and load it into R.

rm(list = ls())
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ade4)
## Warning: package 'ade4' was built under R version 3.4.4
library(ggplot2)

# Download the data provided by your collaborator
dir.create("data")
## Warning in dir.create("data"): 'data' already exists
download.file("https://github.com/tavareshugo/data_carpentry_extras/blob/master/rnaseq_visualisation/rnaseq_data.RData?raw=true", destfile = "data/rnaseq_data.RData")

# Load the data
load("data/rnaseq_data.RData")

What have we downloaded?

ls()
## [1] "norm_cts"    "raw_cts"     "sample_info" "test_result"
head(norm_cts)
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003   9.758847   9.450683   9.883149   9.663635  10.195880
## ENSG00000000005   6.758878   6.758878   6.758878   6.758878   6.758878
## ENSG00000000419   9.355133   9.600204   9.505742   9.542258   9.447583
## ENSG00000000457   8.795024   8.729316   8.683145   8.762721   8.625484
## ENSG00000000460   7.797652   7.820791   7.556019   7.741701   7.860172
## ENSG00000000938   6.758878   6.758878   6.939283   6.758878   6.886583
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003   9.895816  10.024567   9.657639
## ENSG00000000005   6.758878   6.758878   6.758878
## ENSG00000000419   9.593435   9.347560   9.528572
## ENSG00000000457   8.733469   8.791749   8.754535
## ENSG00000000460   7.673760   7.981447   7.838440
## ENSG00000000938   6.758878   6.758878   6.758878

Let’s recap some of the data manipulation that we saw in the previous session.

# convert the matrix to a tibble, keeping the gene IDs as a column
norm_cts_tbl <- as_tibble(norm_cts, rownames = "gene")
# reshape to long format: one row per gene per sample
norm_cts_long <- norm_cts_tbl %>% 
  gather(sample_id, cts, SRR1039508:SRR1039521)
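
For reference, in newer versions of tidyr (1.0 and above) the same reshaping can be written with pivot_longer(); the tidyr 0.8.0 session shown above predates it, so gather() is used here.

# equivalent reshaping with tidyr >= 1.0 (not available in the tidyr 0.8.0 session above)
norm_cts_long <- norm_cts_tbl %>% 
  pivot_longer(SRR1039508:SRR1039521, names_to = "sample_id", values_to = "cts")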

Histograms

Histograms are a tool to get to know your data better, for example by learning about its distribution.

ggplot(norm_cts_long, aes(x=cts)) + geom_histogram(binwidth=1)

Now we know for sure that these data do not follow a Gaussian distribution (as people who have worked with RNA-Seq data before may have expected).

We can also check whether data distributions look similar across samples.

ggplot(norm_cts_long, aes(x=cts, fill=sample_id)) + geom_histogram(binwidth=1, position="dodge")
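
If the dodged bars are hard to read, one alternative (a sketch, not part of the original material) is a frequency polygon, which draws one line per sample:

# one line per sample instead of side-by-side bars
ggplot(norm_cts_long, aes(x=cts, colour=sample_id)) + geom_freqpoly(binwidth=1)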

Conclusion

From the histograms we can conclude that the data distributions are not Gaussian, but are quite similar across samples.

Scatter plots

So far we have looked at RNA-Seq data that Hugo has preprocessed for us (hence the name norm_cts) using a sophisticated method (DESeq2). In the following, we will also take a look at the raw data (raw_cts) and compare them to the normalized version to explore some properties of the normalization that was applied.

Let’s start by preparing the raw data the same way we did for the normalized count data.

raw_cts_tbl <- as_tibble(raw_cts, rownames = "gene")
raw_cts_long <- raw_cts_tbl %>% 
  gather(sample_id, cts, SRR1039508:SRR1039521)

The summary function is quite handy here as a first approach to exploring differences between raw and normalized data.

summary(norm_cts_long)
##      gene            sample_id              cts        
##  Length:512816      Length:512816      Min.   : 6.759  
##  Class :character   Class :character   1st Qu.: 6.759  
##  Mode  :character   Mode  :character   Median : 6.759  
##                                        Mean   : 7.397  
##                                        3rd Qu.: 7.181  
##                                        Max.   :18.733
summary(raw_cts_long)
##      gene            sample_id              cts          
##  Length:512816      Length:512816      Min.   :     0.0  
##  Class :character   Class :character   1st Qu.:     0.0  
##  Mode  :character   Mode  :character   Median :     0.0  
##                                        Mean   :   342.3  
##                                        3rd Qu.:     9.0  
##                                        Max.   :513766.0

Observations:

- raw and normalized data live on VERY different scales
- raw_cts_long has a few very large count values, while most values are actually small (< 10)
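
If we wanted to quantify that last point, a quick check (a sketch, output not shown) could be the proportion of raw counts below 10:

# proportion of raw counts that are smaller than 10
mean(raw_cts_long$cts < 10)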

Scatter plots are useful, for example, to visually inspect the similarity (correlation) between variables.

ggplot(norm_cts_tbl, aes(x=SRR1039508, y=SRR1039509)) +  geom_point()

ggplot(raw_cts_tbl, aes(x=SRR1039508, y=SRR1039509)) +  geom_point()

Exercise:

Due to the few extreme counts in the raw data, our scatter plot may be more informative on a log scale (Look for the “Scales” section in the ggplot2 cheat sheet)…
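
One possible solution uses log10 scales on both axes (note that zero counts cannot be shown on a log scale, so ggplot2 will warn about them):

# raw counts on log10 scales; zero counts produce infinite values and are dropped with a warning
ggplot(raw_cts_tbl, aes(x=SRR1039508, y=SRR1039509)) + geom_point() + scale_x_log10() + scale_y_log10()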

Examining the mean-variance relationship

mean_var_raw <- raw_cts_long %>% 
  group_by(gene) %>%
  summarise(g_mean = mean(cts), g_var = var(cts))
ggplot(mean_var_raw, aes(x=g_mean, y=g_var)) + geom_point()

Exercise

Relabel the axes of the plot to “Mean” and “Variance” (Hint: Look for the “Labels” section of the ggplot2 cheat sheet).
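
A minimal sketch of one way to do this, using labs():

# relabel the axes of the mean-variance plot
ggplot(mean_var_raw, aes(x=g_mean, y=g_var)) + geom_point() + labs(x = "Mean", y = "Variance")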

This plot reveals a strong dependency between a gene’s mean count (expression) and its variance (statisticians would call this heteroscedasticity), which can be a problem in many downstream comparisons across genes.

Let’s do a mean-variance plot for the normalized data…

# the same mean-variance summary, this time for the normalized counts
mean_var_norm <- norm_cts_long %>% 
  group_by(gene) %>%
  summarise(g_mean = mean(cts), g_var = var(cts))

ggplot(mean_var_norm, aes(x=g_mean, y=g_var)) + geom_point()

Exercise:

Make the dots transparent by providing an alpha argument to geom_point().
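
For example (the exact alpha value is a matter of taste):

# semi-transparent points help where many genes overlap
ggplot(mean_var_norm, aes(x=g_mean, y=g_var)) + geom_point(alpha = 0.1)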

Conclusion

These mean-variance plots confirm that Hugo’s (i.e. DESeq2’s) normalization approach achieves its main goal of stabilizing the variance: the normalized data no longer show a (strong) dependency between a gene’s mean and its variance.

Principal Component Analysis (PCA)

PCA is a transformation of high-dimensional data into an orthogonal basis such that the first principal component (PC, aka “axis”) is aligned with the largest source of variance, the second PC with the largest remaining source of variance, and so on. This makes high-dimensional data more amenable to visual exploration, as we can examine projections onto the first two (or few) PCs.

image: Link to the original animation

In R there are several ways to compute a PCA. Here we will use the implementation from the ade4 package (which provides many more ordination methods).

# transpose so that samples are rows; scannf = FALSE skips the interactive screeplot, nf = 5 keeps 5 axes
pc_obj <- dudi.pca(t(norm_cts), scannf = FALSE, nf = 5)
# sample coordinates ("scores") on the principal axes
pc_frame <- pc_obj$li

head(pc_frame)
##                Axis1      Axis2      Axis3      Axis4      Axis5
## SRR1039508 -77.65457   76.53370  -78.45581   5.553696  -4.082898
## SRR1039509  82.08869   69.00058 -134.75801  17.248325   6.216420
## SRR1039512 -88.32018   10.87250   56.63397  89.930343 -75.154150
## SRR1039513 106.68187   12.42528   87.10566  96.681992  50.049044
## SRR1039516 -95.49831 -102.48941  -19.70441  -5.130251 111.736966
## SRR1039517  57.56416 -152.38918  -39.70910 -22.243042 -79.884526
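
How much variance does each PC capture? The dudi.pca object stores the eigenvalues in its eig component, so a rough sketch (output not shown) would be:

# proportion of the total variance captured by each principal component
prop_var <- pc_obj$eig / sum(pc_obj$eig)
head(prop_var)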

Let’s convert the PCA result (the sample coordinates stored in pc_frame) into a tibble, recalling some of the tools we’ve used previously.

pc_tbl <- as_tibble(pc_frame, rownames = "sample")
head(pc_tbl)
## # A tibble: 6 x 6
##   sample      Axis1  Axis2  Axis3  Axis4   Axis5
##   <chr>       <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1 SRR1039508  -77.7   76.5  -78.5   5.55   -4.08
## 2 SRR1039509   82.1   69.0 -135.   17.2     6.22
## 3 SRR1039512  -88.3   10.9   56.6  89.9   -75.2 
## 4 SRR1039513  107.    12.4   87.1  96.7    50.0 
## 5 SRR1039516  -95.5 -102.   -19.7  -5.13  112.  
## 6 SRR1039517   57.6 -152.   -39.7 -22.2   -79.9
sample_pc <- full_join(sample_info, pc_tbl, by = c("sample" = "sample"))
head(sample_pc)
## # A tibble: 6 x 8
##   sample     cell    dex    Axis1  Axis2  Axis3  Axis4   Axis5
##   <chr>      <chr>   <chr>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1 SRR1039508 N61311  untrt  -77.7   76.5  -78.5   5.55   -4.08
## 2 SRR1039509 N61311  trt     82.1   69.0 -135.   17.2     6.22
## 3 SRR1039512 N052611 untrt  -88.3   10.9   56.6  89.9   -75.2 
## 4 SRR1039513 N052611 trt    107.    12.4   87.1  96.7    50.0 
## 5 SRR1039516 N080611 untrt  -95.5 -102.   -19.7  -5.13  112.  
## 6 SRR1039517 N080611 trt     57.6 -152.   -39.7 -22.2   -79.9

Visualizing the PCA

Since the PCA “rotated” our data so that the first two dimensions (PCs) correspond to the largest sources of variance, plotting these can be a very useful exploratory tool.

ggplot(sample_pc, aes(x=Axis1, y=Axis2)) + geom_point()

We can colour the samples by treatment.

ggplot(sample_pc, aes(x=Axis1, y=Axis2)) + geom_point(aes(colour = factor(dex)))

This clearly shows that the first principal component separates the samples by the treatment.

We can also colour by cell line…

ggplot(sample_pc, aes(x=Axis1, y=Axis2)) + geom_point(aes(colour = factor(cell)))

… which reveals that the cell line corresponds to the second principal component.

Exercise

Encode the treatment by colour and the cell line by shape (by supplying another argument inside geom_point(aes(...))).
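
One possible solution (a sketch):

# treatment encoded by colour, cell line by shape
ggplot(sample_pc, aes(x=Axis1, y=Axis2)) + geom_point(aes(colour = factor(dex), shape = factor(cell)))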

Conclusion

Together this illustrates how PCA can reveal the dominant sources of variation in a data set and whether these relate to the biological signal of interest or potentially to technical differences.

Bonus level

For RNA-Seq experts: the data normalization routine mentioned above was aware of both the treatment and cell groupings, which might have contributed to the separation seen here. As an exercise, you could repeat the above PCA on the raw counts.
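
A rough sketch of how that exercise might start (not run here; depending on the raw counts you may need to drop zero-variance genes or set scale = FALSE in dudi.pca):

# PCA on the raw counts, prepared the same way as before
pc_raw <- dudi.pca(t(raw_cts), scannf = FALSE, nf = 5)
pc_raw_tbl <- as_tibble(pc_raw$li, rownames = "sample")
sample_pc_raw <- full_join(sample_info, pc_raw_tbl, by = "sample")

ggplot(sample_pc_raw, aes(x=Axis1, y=Axis2)) + geom_point(aes(colour = factor(dex)))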