Quick recap of how to download data from a URL and load it into R.
rm(list = ls())
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ade4)
## Warning: package 'ade4' was built under R version 3.4.4
library(ggplot2)
# Download the data provided by your collaborator
dir.create("data")
## Warning in dir.create("data"): 'data' already exists
download.file("https://github.com/tavareshugo/data_carpentry_extras/blob/master/rnaseq_visualisation/rnaseq_data.RData?raw=true", destfile = "data/rnaseq_data.RData")
# Load the data
load("data/rnaseq_data.RData")
What have we downloaded?
ls()
## [1] "norm_cts" "raw_cts" "sample_info" "test_result"
head(norm_cts)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 9.758847 9.450683 9.883149 9.663635 10.195880
## ENSG00000000005 6.758878 6.758878 6.758878 6.758878 6.758878
## ENSG00000000419 9.355133 9.600204 9.505742 9.542258 9.447583
## ENSG00000000457 8.795024 8.729316 8.683145 8.762721 8.625484
## ENSG00000000460 7.797652 7.820791 7.556019 7.741701 7.860172
## ENSG00000000938 6.758878 6.758878 6.939283 6.758878 6.886583
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 9.895816 10.024567 9.657639
## ENSG00000000005 6.758878 6.758878 6.758878
## ENSG00000000419 9.593435 9.347560 9.528572
## ENSG00000000457 8.733469 8.791749 8.754535
## ENSG00000000460 7.673760 7.981447 7.838440
## ENSG00000000938 6.758878 6.758878 6.758878
Let’s recapitulate some of the data manipulation which we have seen in the previous session
norm_cts_tbl <- as_tibble(norm_cts, rownames = "gene")
norm_cts_long <- norm_cts_tbl %>%
gather(sample_id, cts, SRR1039508:SRR1039521)
Histograms are a tool to get to know your data better and e.g. learn about their distributions.
ggplot(norm_cts_long, aes(x=cts)) + geom_histogram(binwidth=1)
Now we know for sure that this data is not following a Gaussian distribution (as people who have worked with RNA-Seq data before may have expected).
We can also check whether data distributions look similar across samples.
ggplot(norm_cts_long, aes(x=cts, fill=sample_id)) + geom_histogram(binwidth=1, position="dodge")
From the histogram plot one can conclude that data distributions are not Gaussian, but quite similar across samples.
So far we have looked at RNA-Seq data that Hugo has preprocessed for us (thus norm_cts
) using a sophisticated method (DESeq2). In the following, we will also take a look at the raw data (raw_cts
) and compare these to the normalized version to explore some properties of the data normalization applied.
Let’s start by preparing the raw data the same way we have done for the normalized count data.
raw_cts_tbl <- as_tibble(raw_cts, rownames = "gene")
raw_cts_long <- raw_cts_tbl %>%
gather(sample_id, cts, SRR1039508:SRR1039521)
The summary
function is quite handy here as a first approach to exploring differences between raw and normalized data.
summary(norm_cts_long)
## gene sample_id cts
## Length:512816 Length:512816 Min. : 6.759
## Class :character Class :character 1st Qu.: 6.759
## Mode :character Mode :character Median : 6.759
## Mean : 7.397
## 3rd Qu.: 7.181
## Max. :18.733
summary(raw_cts_long)
## gene sample_id cts
## Length:512816 Length:512816 Min. : 0.0
## Class :character Class :character 1st Qu.: 0.0
## Mode :character Mode :character Median : 0.0
## Mean : 342.3
## 3rd Qu.: 9.0
## Max. :513766.0
Observations - raw and normalized data live on VERY different scales - the raw_cts_long
has a few very large count values while most values are actually small (<10).
Scatter plots are useful e.g. to visually inspect similarity (correlation) between variables.
ggplot(norm_cts_tbl, aes(x=SRR1039508, y=SRR1039509)) + geom_point()
ggplot(raw_cts_tbl, aes(x=SRR1039508, y=SRR1039509)) + geom_point()
Due to the few extreme counts in the raw data, our scatter plot may be more informative on a log scale (Look for the “Scales” section in the ggplot2 cheat sheet)…
mean_var_raw <- raw_cts_long %>%
group_by(gene) %>%
summarise(g_mean = mean(cts), g_var = var(cts))
ggplot(mean_var_raw, aes(x=g_mean, y=g_var)) + geom_point()
Relabel the axes of the plot to “Mean” and “Variance” (Hint: Look for the “Labels” section of the ggplot2 cheat sheet).
This plot reveals a strong dependency between a gene’s mean count (expression) and its variance (statisticians would call this heteroscedasticity) - which can be a problem in many downstream comparisons across genes.
Let’s do a mean-variance plot for the normalized data…
# a bit more convoluted data preparation for plotting the normalized counts
mean_var_norm <- norm_cts_long %>%
group_by(gene) %>%
summarise(g_mean = mean(cts), g_var = var(cts))
ggplot(mean_var_norm, aes(x=g_mean, y=g_var)) + geom_point()
Make the dots transparent by providing an alpha
argument to geom_point()
.
These mean-variance plots verify that Hugo’s (ie. DESeq2’s) normalization approach is effectively achieving its main goal: to stabilize the variance, as the normalised data do no longer show a (strong) dependency between a gene’s mean and its variance.
PCA is a transformation of high-dimensional data into an orthogonal basis such that first principal component (PC, aka “axis”“) is aligned with the largest source of variance, the second PC to the largest remaining source of variance and so on. This makes high-dimensional data more amenable to visual exploration, as we can examine projections to the first two (or few) PCs.
image: Link to the original animation
In R there are several possibilities to compute a PCA. Here we will use the implementation from the ade4 package (which provides many more ordination methods).
pc_obj <- dudi.pca(t(norm_cts), scannf = F, nf = 5)
pc_frame <- pc_obj$li
head(pc_frame)
## Axis1 Axis2 Axis3 Axis4 Axis5
## SRR1039508 -77.65457 76.53370 -78.45581 5.553696 -4.082898
## SRR1039509 82.08869 69.00058 -134.75801 17.248325 6.216420
## SRR1039512 -88.32018 10.87250 56.63397 89.930343 -75.154150
## SRR1039513 106.68187 12.42528 87.10566 96.681992 50.049044
## SRR1039516 -95.49831 -102.48941 -19.70441 -5.130251 111.736966
## SRR1039517 57.56416 -152.38918 -39.70910 -22.243042 -79.884526
Let’s convert the PCA result (obtained from the original matrix) into a tibble - recalling some of the things we’ve used previously.
pc_tbl <- as_tibble(pc_frame, rownames = "sample")
head(pc_tbl)
## # A tibble: 6 x 6
## sample Axis1 Axis2 Axis3 Axis4 Axis5
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SRR1039508 -77.7 76.5 -78.5 5.55 -4.08
## 2 SRR1039509 82.1 69.0 -135. 17.2 6.22
## 3 SRR1039512 -88.3 10.9 56.6 89.9 -75.2
## 4 SRR1039513 107. 12.4 87.1 96.7 50.0
## 5 SRR1039516 -95.5 -102. -19.7 -5.13 112.
## 6 SRR1039517 57.6 -152. -39.7 -22.2 -79.9
sample_pc <- full_join(sample_info, pc_tbl, by = c("sample" = "sample"))
head(sample_pc)
## # A tibble: 6 x 8
## sample cell dex Axis1 Axis2 Axis3 Axis4 Axis5
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SRR1039508 N61311 untrt -77.7 76.5 -78.5 5.55 -4.08
## 2 SRR1039509 N61311 trt 82.1 69.0 -135. 17.2 6.22
## 3 SRR1039512 N052611 untrt -88.3 10.9 56.6 89.9 -75.2
## 4 SRR1039513 N052611 trt 107. 12.4 87.1 96.7 50.0
## 5 SRR1039516 N080611 untrt -95.5 -102. -19.7 -5.13 112.
## 6 SRR1039517 N080611 trt 57.6 -152. -39.7 -22.2 -79.9
Since the PCA “rotated” our data so that the first two dimensions (PCs) correspond to the largest sources of variance in our data, plotting these can be a very useful exploratory tool.
ggplot(sample_pc, aes(x=Axis1, y=Axis2)) + geom_point()
We can colour the samples by treatment.
ggplot(sample_pc, aes(x=Axis1, y=Axis2)) + geom_point(aes(colour = factor(dex)))
This clearly shows that the first principal component separates the samples by the treatment.
We can also colour by cell line…
ggplot(sample_pc, aes(x=Axis1, y=Axis2)) + geom_point(aes(colour = factor(cell)))
… which reveals that this corresponds to the second pricipal component.
Encode the treatment by plot colour and the cell line by shape (by supplying another argument to geom_point(aes(...))
.)
Together this illustrates how PCA can reveal the dominant sources of variation in a data set and whether these relate to the biological signal of interest or potentially to technical differences.
For RNA-Seq experts: the data normalization routine mentioned above was aware of both the treatment and cell groupings and might have contributed to the separation seen. As an exercise you could repeat the above PCA on the raw counts.