Note: This lesson has been used as an extra in some Data Carpentry in R courses, and thus assumes that some of those lessons were covered beforehand.
matrix
object and how to convert it to a tibble
In this lesson, we will apply some of the skills that we’ve gained so far to manipulate and explore a dataset from an RNAseq experiment.
This lesson uses data from an experiment included in the airway
R/Bioconductor package. Very briefly:
This package provides […] read counts in genes for an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone.
There are many dedicated packages to deal with RNAseq data, mostly within the Bioconductor package repository.
This lesson is not about analysing RNAseq data (that would be a topic for a whole course!), but rather to show you how the data manipulation principles learned so far can be applied to explore these kind of data.
If you are doing RNAseq analysis, you should use dedicated packages and workflows, which implement models to account for particular features of these data.
Let’s say that you did this experiment yourself, and that a bioinformatician analysed it and provided you with three things:
The data are provided within an RData
file, which you can download and read into your R session.
First, let’s clean our workspace (remove any objects created before), and also load the tidyverse
package.
It’s also a good idea to check if your working directory is correctly set to your workshop folder, which should contain a data
folder within it (you can use getwd()
to check your working directory).
# Clean your workspace by removing any objects created beforehand
rm(list = ls())
# Load the tidyverse package
library(tidyverse)
# Download the data provided by your collaborator
dir.create("data")
download.file("https://github.com/tavareshugo/data_carpentry_extras/blob/master/rnaseq_visualisation/rnaseq_data.RData?raw=true",
destfile = "data/rnaseq_data.RData",
method = "wb")
# Load the data
load("data/rnaseq_data.RData")
Now that you have these data, you want to produce the following two visualisations:
Exercise:
Familiarise yourself with these data:
- what kind of objects were you provided with? (hint:
class()
)- How many samples did you have?
- How many genes do you have gene expression levels for?
- How many genes were differentially expressed at the 5% false-discovery rate?
- After looking at these data and their formats, think about the steps you’d have to take in order to produce the graphs above.
To produce the plots above, we need to do a few things:
tibble
objectgene
, sample_id
, cts
, rather than one column per samplesample_info
tabletest_result
table and classify them as up- or down-regulated. Join this table to the previous one.You might have noticed that norm_cts
is a matrix
object. We haven’t found these before, and to produce either plot, it is convenient to start by converting our matrix of gene expression to a tibble
(data.frame
).
Matrices are a bit similar to data.frame
, but they only contain values of a single type, in this case numeric values (whereas in a data.frame
different columns can contain different types of data).
# Look at the first 10 rows of norm_cts
norm_cts[1:10, ]
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 9.758847 9.450683 9.883149 9.663635 10.195880
## ENSG00000000005 6.758878 6.758878 6.758878 6.758878 6.758878
## ENSG00000000419 9.355133 9.600204 9.505742 9.542258 9.447583
## ENSG00000000457 8.795024 8.729316 8.683145 8.762721 8.625484
## ENSG00000000460 7.797652 7.820791 7.556019 7.741701 7.860172
## ENSG00000000938 6.758878 6.758878 6.939283 6.758878 6.886583
## ENSG00000000971 11.726659 12.076530 12.412430 12.679605 12.531782
## ENSG00000001036 10.652133 10.444215 10.713297 10.573383 10.469123
## ENSG00000001084 9.465607 9.280221 9.460261 9.876335 9.813546
## ENSG00000001167 9.183190 8.826912 9.204851 8.819579 9.569653
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 9.895816 10.024567 9.657639
## ENSG00000000005 6.758878 6.758878 6.758878
## ENSG00000000419 9.593435 9.347560 9.528572
## ENSG00000000457 8.733469 8.791749 8.754535
## ENSG00000000460 7.673760 7.981447 7.838440
## ENSG00000000938 6.758878 6.758878 6.758878
## ENSG00000000971 12.983172 12.510744 13.083466
## ENSG00000001036 10.271080 10.719016 10.432815
## ENSG00000001084 9.472570 9.907291 9.891182
## ENSG00000001167 9.264339 9.198719 8.897247
You will also notice that gene names are contained in the row names of this matrix.
To convert this matrix into a tibble
object we can use the function as_tibble()
:
# Convert matrix to tibble
as_tibble(norm_cts)
## # A tibble: 64,102 x 8
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9.76 9.45 9.88 9.66 10.2 9.90
## 2 6.76 6.76 6.76 6.76 6.76 6.76
## 3 9.36 9.60 9.51 9.54 9.45 9.59
## 4 8.80 8.73 8.68 8.76 8.63 8.73
## 5 7.80 7.82 7.56 7.74 7.86 7.67
## 6 6.76 6.76 6.94 6.76 6.89 6.76
## 7 11.7 12.1 12.4 12.7 12.5 13.0
## 8 10.7 10.4 10.7 10.6 10.5 10.3
## 9 9.47 9.28 9.46 9.88 9.81 9.47
## 10 9.18 8.83 9.20 8.82 9.57 9.26
## # ... with 64,092 more rows, and 2 more variables: SRR1039520 <dbl>,
## # SRR1039521 <dbl>
But now we’ve lost our gene names! If we look at the function’s help (?as_tibble
), we can see that there’s a way to solve this problem:
# Convert matrix to tibble - add colnames to a new column called "gene"
cts_tbl <- as_tibble(norm_cts, rownames = "gene")
cts_tbl
## # A tibble: 64,102 x 9
## gene SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG… 9.76 9.45 9.88 9.66 10.2 9.90
## 2 ENSG… 6.76 6.76 6.76 6.76 6.76 6.76
## 3 ENSG… 9.36 9.60 9.51 9.54 9.45 9.59
## 4 ENSG… 8.80 8.73 8.68 8.76 8.63 8.73
## 5 ENSG… 7.80 7.82 7.56 7.74 7.86 7.67
## 6 ENSG… 6.76 6.76 6.94 6.76 6.89 6.76
## 7 ENSG… 11.7 12.1 12.4 12.7 12.5 13.0
## 8 ENSG… 10.7 10.4 10.7 10.6 10.5 10.3
## 9 ENSG… 9.47 9.28 9.46 9.88 9.81 9.47
## 10 ENSG… 9.18 8.83 9.20 8.82 9.57 9.26
## # ... with 64,092 more rows, and 2 more variables: SRR1039520 <dbl>,
## # SRR1039521 <dbl>
There’s two functions that allow us to convert tables from a “wide” to a “long” format and vice-versa: gather()
and spread()
.
If you notice, what we want in our case is to gather
our gene expression columns. We can do this by giving gather()
four things:
Like so:
# "gather" the counts data
cts_long <- cts_tbl %>%
gather(sample_id, cts, SRR1039508:SRR1039521)
cts_long
## # A tibble: 512,816 x 3
## gene sample_id cts
## <chr> <chr> <dbl>
## 1 ENSG00000000003 SRR1039508 9.76
## 2 ENSG00000000005 SRR1039508 6.76
## 3 ENSG00000000419 SRR1039508 9.36
## 4 ENSG00000000457 SRR1039508 8.80
## 5 ENSG00000000460 SRR1039508 7.80
## 6 ENSG00000000938 SRR1039508 6.76
## 7 ENSG00000000971 SRR1039508 11.7
## 8 ENSG00000001036 SRR1039508 10.7
## 9 ENSG00000001084 SRR1039508 9.47
## 10 ENSG00000001167 SRR1039508 9.18
## # ... with 512,806 more rows
If we wanted to do the reverse, we could use the spread()
function:
cts_long %>%
spread(sample_id, cts)
## # A tibble: 64,102 x 9
## gene SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG… 9.76 9.45 9.88 9.66 10.2 9.90
## 2 ENSG… 6.76 6.76 6.76 6.76 6.76 6.76
## 3 ENSG… 9.36 9.60 9.51 9.54 9.45 9.59
## 4 ENSG… 8.80 8.73 8.68 8.76 8.63 8.73
## 5 ENSG… 7.80 7.82 7.56 7.74 7.86 7.67
## 6 ENSG… 6.76 6.76 6.94 6.76 6.89 6.76
## 7 ENSG… 11.7 12.1 12.4 12.7 12.5 13.0
## 8 ENSG… 10.7 10.4 10.7 10.6 10.5 10.3
## 9 ENSG… 9.47 9.28 9.46 9.88 9.81 9.47
## 10 ENSG… 9.18 8.83 9.20 8.82 9.57 9.26
## # ... with 64,092 more rows, and 2 more variables: SRR1039520 <dbl>,
## # SRR1039521 <dbl>
(see here for another example of using these functions)
The next thing we want to do is to add information about each sample to our gene expression table.
sample_info
## # A tibble: 8 x 3
## sample cell dex
## <chr> <chr> <chr>
## 1 SRR1039508 N61311 untrt
## 2 SRR1039509 N61311 trt
## 3 SRR1039512 N052611 untrt
## 4 SRR1039513 N052611 trt
## 5 SRR1039516 N080611 untrt
## 6 SRR1039517 N080611 trt
## 7 SRR1039520 N061011 untrt
## 8 SRR1039521 N061011 trt
We can do this by joining the cts_long
table with the sample_info
table.
Joining tables is an important task that is often needed in multi-layered data.
There are several different kinds of joins that can be performed. Let’s look at the “Combine Data Sets” section of the dplyr cheatsheet to understand them better.
In our case, we know that all samples in our counts table also occur in the sample_info
table, so there’s a few different kinds of joins that would work.
For safety, let’s use full_join()
, to ensure we retain all data:
cts_long <- full_join(cts_long, sample_info, by = c("sample_id" = "sample"))
cts_long
## # A tibble: 512,816 x 5
## gene sample_id cts cell dex
## <chr> <chr> <dbl> <chr> <chr>
## 1 ENSG00000000003 SRR1039508 9.76 N61311 untrt
## 2 ENSG00000000005 SRR1039508 6.76 N61311 untrt
## 3 ENSG00000000419 SRR1039508 9.36 N61311 untrt
## 4 ENSG00000000457 SRR1039508 8.80 N61311 untrt
## 5 ENSG00000000460 SRR1039508 7.80 N61311 untrt
## 6 ENSG00000000938 SRR1039508 6.76 N61311 untrt
## 7 ENSG00000000971 SRR1039508 11.7 N61311 untrt
## 8 ENSG00000001036 SRR1039508 10.7 N61311 untrt
## 9 ENSG00000001084 SRR1039508 9.47 N61311 untrt
## 10 ENSG00000001167 SRR1039508 9.18 N61311 untrt
## # ... with 512,806 more rows
Notice how, in this case, the columns we wanted to be matched between the two tables had different names. Also, we can join tables by more than one column, which is not necessary here.
Now we want to add information about the test results for each gene to our gene expression table. However, we only want to retain genes that are likely to have a response to the treatment.
Exercise:
- Using the
test_result
table, make a new table calledgenes_of_interest
that contains only those genes withpadj < 0.05
. Keep only the columnsgene
,log2FoldChange
.- Create a new table called
cts_filtered
by joining the previous table with thects_long
table. Retain only the filtered genes.
cts_filtered
## # A tibble: 1,496 x 6
## gene sample_id cts cell dex log2FoldChange
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 ENSG00000013293 SRR1039508 8.82 N61311 untrt -2.65
## 2 ENSG00000013293 SRR1039509 7.45 N61311 trt -2.65
## 3 ENSG00000013293 SRR1039512 9.14 N052611 untrt -2.65
## 4 ENSG00000013293 SRR1039513 7.99 N052611 trt -2.65
## 5 ENSG00000013293 SRR1039516 9.39 N080611 untrt -2.65
## 6 ENSG00000013293 SRR1039517 7.75 N080611 trt -2.65
## 7 ENSG00000013293 SRR1039520 9.67 N061011 untrt -2.65
## 8 ENSG00000013293 SRR1039521 7.95 N061011 trt -2.65
## 9 ENSG00000013297 SRR1039508 12.7 N61311 untrt -1.67
## 10 ENSG00000013297 SRR1039509 10.9 N61311 trt -1.67
## # ... with 1,486 more rows
Finally, we want to classify our genes according to whether they were up- or down-regulated after the treatement. These genes would have, respectively, a positive or negative log2(fold-change) between the two conditions.
We can therefore use the log2FoldChange
column to create a new column that will classify our genes as being of either class, depending on whether its value is positive or negative. This type of operation can be done using the ifelse()
function.
cts_filtered <- cts_filtered %>%
mutate(up_or_down = ifelse(log2FoldChange > 0, "up-regulated", "down-regulated"))
cts_filtered
## # A tibble: 1,496 x 7
## gene sample_id cts cell dex log2FoldChange up_or_down
## <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 ENSG00000013293 SRR1039508 8.82 N61311 untrt -2.65 down-regu…
## 2 ENSG00000013293 SRR1039509 7.45 N61311 trt -2.65 down-regu…
## 3 ENSG00000013293 SRR1039512 9.14 N0526… untrt -2.65 down-regu…
## 4 ENSG00000013293 SRR1039513 7.99 N0526… trt -2.65 down-regu…
## 5 ENSG00000013293 SRR1039516 9.39 N0806… untrt -2.65 down-regu…
## 6 ENSG00000013293 SRR1039517 7.75 N0806… trt -2.65 down-regu…
## 7 ENSG00000013293 SRR1039520 9.67 N0610… untrt -2.65 down-regu…
## 8 ENSG00000013293 SRR1039521 7.95 N0610… trt -2.65 down-regu…
## 9 ENSG00000013297 SRR1039508 12.7 N61311 untrt -1.67 down-regu…
## 10 ENSG00000013297 SRR1039509 10.9 N61311 trt -1.67 down-regu…
## # ... with 1,486 more rows
Now, we can apply the “split-apply-combine” strategy that we’ve already learned about, to summarise the data per gene.
Exercise:
Using
group_by
andsummarise
create a new table calledcts_per_gene
with the mean expression of each gene in each treatment.
## # A tibble: 374 x 4
## gene dex up_or_down mean_cts
## <chr> <chr> <chr> <dbl>
## 1 ENSG00000013293 trt down-regulated 7.79
## 2 ENSG00000013293 untrt down-regulated 9.26
## 3 ENSG00000013297 trt down-regulated 10.8
## 4 ENSG00000013297 untrt down-regulated 12.4
## 5 ENSG00000025423 trt down-regulated 7.80
## 6 ENSG00000025423 untrt down-regulated 8.75
## 7 ENSG00000035664 trt up-regulated 10.1
## 8 ENSG00000035664 untrt up-regulated 8.90
## 9 ENSG00000046653 trt up-regulated 10.7
## 10 ENSG00000046653 untrt up-regulated 8.87
## # ... with 364 more rows
With that final table, we are now ready to produce our plot!
ggplot(cts_per_gene, aes(dex, mean_cts)) +
geom_line(aes(group = gene)) +
facet_grid(~ up_or_down)
Notice how we needed to specify the option group = gene
, to tell geom_line
that each line should connect the values referring to the same gene. Otherwise, the function would not know what points should be connected.
The plot above is quite close to what we want, but we notice that the labels on the x-axis appear in the opposite order of what we might want.
To solve this, we can convert the dex
variable to a factor, specifying the levels in the right order:
cts_per_gene %>%
mutate(dex = factor(dex, levels = c("untrt", "trt"))) %>%
ggplot(aes(dex, mean_cts)) +
geom_line(aes(group = gene), alpha = 0.5) +
facet_grid(~ up_or_down)
Let’s look at how using factors can be very useful to improve our display of information.
Let’s try to make a “heatmap” representation of our genes’ expression in every sample:
ggplot(cts_filtered, aes(sample_id, gene)) +
geom_tile(aes(fill = cts))
From the graph above, we can first see that we have an aesthetic problem with our axis labels: neither of them are readable! We can fix this by customizing our plot using the theme()
function:
# Make the text on the x-axis at an angle
# Remove the text from the y-axis
ggplot(cts_filtered, aes(sample_id, gene)) +
geom_tile(aes(fill = cts)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_blank())
Now it’s easy to see that the order in which the samples appear is alphabetical, but ideally we would like it if samples appeared grouped according to their treatment.
## # A tibble: 8 x 2
## dex sample
## <chr> <chr>
## 1 trt SRR1039509
## 2 trt SRR1039513
## 3 trt SRR1039517
## 4 trt SRR1039521
## 5 untrt SRR1039508
## 6 untrt SRR1039512
## 7 untrt SRR1039516
## 8 untrt SRR1039520
We know we can change the order of the labels if we convert the sample_id
variable into a factor. And in this case, we would like the order of the samples to be according to the order of the dex
variable.
Of course we could do it by hand, but imagine you had hundreds of samples!
Here’s a simplified example of how we can order a factor based on another variable:
animal <- c("mouse", "elephant", "cat", "mouse", "elephant")
weight <- c(10, 1e6, 10000, 10, 1e6)
# Default order of levels is alphabetical
factor(animal)
## [1] mouse elephant cat mouse elephant
## Levels: cat elephant mouse
# We can use the reorder function to order the levels based on a numeric variable
reorder(animal, weight)
## [1] mouse elephant cat mouse elephant
## attr(,"scores")
## cat elephant mouse
## 1e+04 1e+06 1e+01
## Levels: mouse cat elephant
But how about this example?
animal <- c("mouse", "elephant", "cat", "mouse", "elephant")
animal_order <- c("1st", "3rd", "2nd", "1st", "3rd")
# Order levels of animals based on the size
reorder(animal, animal_order)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## [1] mouse elephant cat mouse elephant
## attr(,"scores")
## cat elephant mouse
## NA NA NA
## Levels: cat elephant mouse
This throws a warning and does not reorder the levels based on the order we want.
This is because the reorder()
function works best if the variable we’re using to define the order of the levels is numeric. In this case, we could do the following:
# Define the rank of each string in `animal_ranks`
rank(animal_order)
## [1] 1.5 4.5 3.0 1.5 4.5
# Use this ranking of our character vector inside reorder
reorder(animal, rank(animal_order))
## [1] mouse elephant cat mouse elephant
## attr(,"scores")
## cat elephant mouse
## 3.0 4.5 1.5
## Levels: mouse cat elephant
Using the trick above, we can order our samples according to their treatment:
cts_filtered %>%
mutate(sample_id = reorder(sample_id, rank(dex))) %>%
ggplot(aes(sample_id, gene)) +
geom_tile(aes(fill = cts)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_blank())
This is already much more structured. However, we still have the problem that our genes are also not ordered logically. Preferably we would like them ordered according to their fold-change between treatments.
Exercise:
Using the same trick as above, order the genes according to the
log2FoldChange
column - you should obtain this much more structured graph:
There’s one final thing that will improve our “heatmap” representation of the data. We could transform our data to ensure that the different genes are represented on a comparable scale.
You could imagine that two genes might have changed by the same magnitude (say, doubled the expression after treatment), but their base mean levels of expression might have been quite different. For example, one gene might have changed from 10 to 20 and another from 100 to 200.
If what we’re interested in is the relative change in expression, then those two genes will appear more different than they really are in our heatmap.
A useful data transformation in this case is to center and scale each genes’ expression by their mean and standard deviation, respectively. The values thus obtained are known as z-scores, and can be interpreted as the “number of standard deviations away from the mean”.
A positive z-score means that the gene’s expression was above the average across samples, whereas a negative one means it was below average. A value of zero means the gene’s expression was exactly average.
In R, we can use the function scale
to do this transformation. Let’s re-do our heatmap, but center and scale each gene’s expression first:
cts_filtered %>%
mutate(sample_id = reorder(sample_id, rank(dex)),
gene = reorder(gene, log2FoldChange)) %>%
group_by(gene) %>%
mutate(cts = scale(cts)) %>%
ggplot(aes(sample_id, gene)) +
geom_tile(aes(fill = cts)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank()) +
scale_fill_gradient2(low = "blue3", mid = "grey", high = "red3") +
labs(x = "Sample", fill = "Z-score")
And that’s it! Notice that because of the symmetry of z-scores, they are very suited to a “divergent colour scale”, which we’ve set with the function scale_fill_gradient2
to emphasize positive and negative values in a more intuitive way.
Also notice how we’ve further improved the appearance of the plot by removing the y-axis label and ticks, and customised our labels.