Note: This lesson has been used as an extra in some Data Carpentry in R courses, and thus assumes that some of those lessons were covered beforehand.

Lesson objectives

  • Introducing a matrix object and how to convert it to a tibble
  • Reformat data from wide to long format (and vice-versa)
  • Join tables
  • Using factors to improve data visualisation

In this lesson, we will apply some of the skills that we’ve gained so far to manipulate and explore a dataset from an RNAseq experiment.

This lesson uses data from an experiment included in the airway R/Bioconductor package. Very briefly:

This package provides […] read counts in genes for an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone.

Important note

There are many dedicated packages to deal with RNAseq data, mostly within the Bioconductor package repository.

This lesson is not about analysing RNAseq data (that would be a topic for a whole course!), but rather to show you how the data manipulation principles learned so far can be applied to explore these kind of data.

If you are doing RNAseq analysis, you should use dedicated packages and workflows, which implement models to account for particular features of these data.

Getting started

Let’s say that you did this experiment yourself, and that a bioinformatician analysed it and provided you with three things:

  • Normalised read counts for all genes (a measure of the genes’ expression)
  • Information about each sample
  • A table with results from a statistical test that assesses the likelihood of your data assuming no difference between treated and untreated cells.

The data are provided within an RData file, which you can download and read into your R session.

First, let’s clean our workspace (remove any objects created before), and also load the tidyverse package.

It’s also a good idea to check if your working directory is correctly set to your workshop folder, which should contain a data folder within it (you can use getwd() to check your working directory).

# Clean your workspace by removing any objects created beforehand
rm(list = ls())

# Load the tidyverse package
library(tidyverse)

# Download the data provided by your collaborator
dir.create("data")
download.file("https://github.com/tavareshugo/data_carpentry_extras/blob/master/rnaseq_visualisation/rnaseq_data.RData?raw=true",
              destfile = "data/rnaseq_data.RData",
              method = "wb")

# Load the data
load("data/rnaseq_data.RData")

Now that you have these data, you want to produce the following two visualisations:

Exercise:

Familiarise yourself with these data:

  • what kind of objects were you provided with? (hint: class())
  • How many samples did you have?
  • How many genes do you have gene expression levels for?
  • How many genes were differentially expressed at the 5% false-discovery rate?
  • After looking at these data and their formats, think about the steps you’d have to take in order to produce the graphs above.

Preparing data for visualisation

To produce the plots above, we need to do a few things:

  1. Convert the matrix of normalised counts to a tibble object
  2. Convert that table to “long” format, i.e. with 3 columns: gene, sample_id, cts, rather than one column per sample
  3. Join this table with the sample_info table
  4. Filter “significant” genes from test_result table and classify them as up- or down-regulated. Join this table to the previous one.
  5. Summarise the normalised counts to obtain the average expression per treatment and sample

Matrix object

You might have noticed that norm_cts is a matrix object. We haven’t found these before, and to produce either plot, it is convenient to start by converting our matrix of gene expression to a tibble (data.frame).

Matrices are a bit similar to data.frame, but they only contain values of a single type, in this case numeric values (whereas in a data.frame different columns can contain different types of data).

# Look at the first 10 rows of norm_cts
norm_cts[1:10, ]
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003   9.758847   9.450683   9.883149   9.663635  10.195880
## ENSG00000000005   6.758878   6.758878   6.758878   6.758878   6.758878
## ENSG00000000419   9.355133   9.600204   9.505742   9.542258   9.447583
## ENSG00000000457   8.795024   8.729316   8.683145   8.762721   8.625484
## ENSG00000000460   7.797652   7.820791   7.556019   7.741701   7.860172
## ENSG00000000938   6.758878   6.758878   6.939283   6.758878   6.886583
## ENSG00000000971  11.726659  12.076530  12.412430  12.679605  12.531782
## ENSG00000001036  10.652133  10.444215  10.713297  10.573383  10.469123
## ENSG00000001084   9.465607   9.280221   9.460261   9.876335   9.813546
## ENSG00000001167   9.183190   8.826912   9.204851   8.819579   9.569653
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003   9.895816  10.024567   9.657639
## ENSG00000000005   6.758878   6.758878   6.758878
## ENSG00000000419   9.593435   9.347560   9.528572
## ENSG00000000457   8.733469   8.791749   8.754535
## ENSG00000000460   7.673760   7.981447   7.838440
## ENSG00000000938   6.758878   6.758878   6.758878
## ENSG00000000971  12.983172  12.510744  13.083466
## ENSG00000001036  10.271080  10.719016  10.432815
## ENSG00000001084   9.472570   9.907291   9.891182
## ENSG00000001167   9.264339   9.198719   8.897247

You will also notice that gene names are contained in the row names of this matrix.

To convert this matrix into a tibble object we can use the function as_tibble():

# Convert matrix to tibble
as_tibble(norm_cts)
## # A tibble: 64,102 x 8
##    SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
##         <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1       9.76       9.45       9.88       9.66      10.2        9.90
##  2       6.76       6.76       6.76       6.76       6.76       6.76
##  3       9.36       9.60       9.51       9.54       9.45       9.59
##  4       8.80       8.73       8.68       8.76       8.63       8.73
##  5       7.80       7.82       7.56       7.74       7.86       7.67
##  6       6.76       6.76       6.94       6.76       6.89       6.76
##  7      11.7       12.1       12.4       12.7       12.5       13.0 
##  8      10.7       10.4       10.7       10.6       10.5       10.3 
##  9       9.47       9.28       9.46       9.88       9.81       9.47
## 10       9.18       8.83       9.20       8.82       9.57       9.26
## # ... with 64,092 more rows, and 2 more variables: SRR1039520 <dbl>,
## #   SRR1039521 <dbl>

But now we’ve lost our gene names! If we look at the function’s help (?as_tibble), we can see that there’s a way to solve this problem:

# Convert matrix to tibble - add colnames to a new column called "gene"
cts_tbl <- as_tibble(norm_cts, rownames = "gene")
cts_tbl
## # A tibble: 64,102 x 9
##    gene  SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
##    <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1 ENSG…       9.76       9.45       9.88       9.66      10.2        9.90
##  2 ENSG…       6.76       6.76       6.76       6.76       6.76       6.76
##  3 ENSG…       9.36       9.60       9.51       9.54       9.45       9.59
##  4 ENSG…       8.80       8.73       8.68       8.76       8.63       8.73
##  5 ENSG…       7.80       7.82       7.56       7.74       7.86       7.67
##  6 ENSG…       6.76       6.76       6.94       6.76       6.89       6.76
##  7 ENSG…      11.7       12.1       12.4       12.7       12.5       13.0 
##  8 ENSG…      10.7       10.4       10.7       10.6       10.5       10.3 
##  9 ENSG…       9.47       9.28       9.46       9.88       9.81       9.47
## 10 ENSG…       9.18       8.83       9.20       8.82       9.57       9.26
## # ... with 64,092 more rows, and 2 more variables: SRR1039520 <dbl>,
## #   SRR1039521 <dbl>

Reshaping data to “long” format

There’s two functions that allow us to convert tables from a “wide” to a “long” format and vice-versa: gather() and spread().

If you notice, what we want in our case is to gather our gene expression columns. We can do this by giving gather() four things:

  • The data, i.e. our table of gene expression
  • A name for the new column that will contain the old column names (the “key” column)
  • A name for the new column that will contain our expression counts (the “values” column)
  • The names of the columns that we want to gather in this way

Like so:

# "gather" the counts data
cts_long <- cts_tbl %>% 
  gather(sample_id, cts, SRR1039508:SRR1039521)

cts_long
## # A tibble: 512,816 x 3
##    gene            sample_id    cts
##    <chr>           <chr>      <dbl>
##  1 ENSG00000000003 SRR1039508  9.76
##  2 ENSG00000000005 SRR1039508  6.76
##  3 ENSG00000000419 SRR1039508  9.36
##  4 ENSG00000000457 SRR1039508  8.80
##  5 ENSG00000000460 SRR1039508  7.80
##  6 ENSG00000000938 SRR1039508  6.76
##  7 ENSG00000000971 SRR1039508 11.7 
##  8 ENSG00000001036 SRR1039508 10.7 
##  9 ENSG00000001084 SRR1039508  9.47
## 10 ENSG00000001167 SRR1039508  9.18
## # ... with 512,806 more rows

If we wanted to do the reverse, we could use the spread() function:

cts_long %>% 
  spread(sample_id, cts)
## # A tibble: 64,102 x 9
##    gene  SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
##    <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1 ENSG…       9.76       9.45       9.88       9.66      10.2        9.90
##  2 ENSG…       6.76       6.76       6.76       6.76       6.76       6.76
##  3 ENSG…       9.36       9.60       9.51       9.54       9.45       9.59
##  4 ENSG…       8.80       8.73       8.68       8.76       8.63       8.73
##  5 ENSG…       7.80       7.82       7.56       7.74       7.86       7.67
##  6 ENSG…       6.76       6.76       6.94       6.76       6.89       6.76
##  7 ENSG…      11.7       12.1       12.4       12.7       12.5       13.0 
##  8 ENSG…      10.7       10.4       10.7       10.6       10.5       10.3 
##  9 ENSG…       9.47       9.28       9.46       9.88       9.81       9.47
## 10 ENSG…       9.18       8.83       9.20       8.82       9.57       9.26
## # ... with 64,092 more rows, and 2 more variables: SRR1039520 <dbl>,
## #   SRR1039521 <dbl>

(see here for another example of using these functions)

Joining tables together

The next thing we want to do is to add information about each sample to our gene expression table.

sample_info
## # A tibble: 8 x 3
##   sample     cell    dex  
##   <chr>      <chr>   <chr>
## 1 SRR1039508 N61311  untrt
## 2 SRR1039509 N61311  trt  
## 3 SRR1039512 N052611 untrt
## 4 SRR1039513 N052611 trt  
## 5 SRR1039516 N080611 untrt
## 6 SRR1039517 N080611 trt  
## 7 SRR1039520 N061011 untrt
## 8 SRR1039521 N061011 trt

We can do this by joining the cts_long table with the sample_info table.

Joining tables is an important task that is often needed in multi-layered data.

There are several different kinds of joins that can be performed. Let’s look at the “Combine Data Sets” section of the dplyr cheatsheet to understand them better.

In our case, we know that all samples in our counts table also occur in the sample_info table, so there’s a few different kinds of joins that would work.

For safety, let’s use full_join(), to ensure we retain all data:

cts_long <- full_join(cts_long, sample_info, by = c("sample_id" = "sample"))

cts_long
## # A tibble: 512,816 x 5
##    gene            sample_id    cts cell   dex  
##    <chr>           <chr>      <dbl> <chr>  <chr>
##  1 ENSG00000000003 SRR1039508  9.76 N61311 untrt
##  2 ENSG00000000005 SRR1039508  6.76 N61311 untrt
##  3 ENSG00000000419 SRR1039508  9.36 N61311 untrt
##  4 ENSG00000000457 SRR1039508  8.80 N61311 untrt
##  5 ENSG00000000460 SRR1039508  7.80 N61311 untrt
##  6 ENSG00000000938 SRR1039508  6.76 N61311 untrt
##  7 ENSG00000000971 SRR1039508 11.7  N61311 untrt
##  8 ENSG00000001036 SRR1039508 10.7  N61311 untrt
##  9 ENSG00000001084 SRR1039508  9.47 N61311 untrt
## 10 ENSG00000001167 SRR1039508  9.18 N61311 untrt
## # ... with 512,806 more rows

Notice how, in this case, the columns we wanted to be matched between the two tables had different names. Also, we can join tables by more than one column, which is not necessary here.

Filter and classify genes

Now we want to add information about the test results for each gene to our gene expression table. However, we only want to retain genes that are likely to have a response to the treatment.

Exercise:

  • Using the test_result table, make a new table called genes_of_interest that contains only those genes with padj < 0.05. Keep only the columns gene, log2FoldChange.
  • Create a new table called cts_filtered by joining the previous table with the cts_long table. Retain only the filtered genes.
cts_filtered
## # A tibble: 1,496 x 6
##    gene            sample_id    cts cell    dex   log2FoldChange
##    <chr>           <chr>      <dbl> <chr>   <chr>          <dbl>
##  1 ENSG00000013293 SRR1039508  8.82 N61311  untrt          -2.65
##  2 ENSG00000013293 SRR1039509  7.45 N61311  trt            -2.65
##  3 ENSG00000013293 SRR1039512  9.14 N052611 untrt          -2.65
##  4 ENSG00000013293 SRR1039513  7.99 N052611 trt            -2.65
##  5 ENSG00000013293 SRR1039516  9.39 N080611 untrt          -2.65
##  6 ENSG00000013293 SRR1039517  7.75 N080611 trt            -2.65
##  7 ENSG00000013293 SRR1039520  9.67 N061011 untrt          -2.65
##  8 ENSG00000013293 SRR1039521  7.95 N061011 trt            -2.65
##  9 ENSG00000013297 SRR1039508 12.7  N61311  untrt          -1.67
## 10 ENSG00000013297 SRR1039509 10.9  N61311  trt            -1.67
## # ... with 1,486 more rows

Finally, we want to classify our genes according to whether they were up- or down-regulated after the treatement. These genes would have, respectively, a positive or negative log2(fold-change) between the two conditions.

We can therefore use the log2FoldChange column to create a new column that will classify our genes as being of either class, depending on whether its value is positive or negative. This type of operation can be done using the ifelse() function.

cts_filtered <- cts_filtered %>% 
  mutate(up_or_down = ifelse(log2FoldChange > 0, "up-regulated", "down-regulated"))

cts_filtered
## # A tibble: 1,496 x 7
##    gene            sample_id    cts cell   dex   log2FoldChange up_or_down
##    <chr>           <chr>      <dbl> <chr>  <chr>          <dbl> <chr>     
##  1 ENSG00000013293 SRR1039508  8.82 N61311 untrt          -2.65 down-regu…
##  2 ENSG00000013293 SRR1039509  7.45 N61311 trt            -2.65 down-regu…
##  3 ENSG00000013293 SRR1039512  9.14 N0526… untrt          -2.65 down-regu…
##  4 ENSG00000013293 SRR1039513  7.99 N0526… trt            -2.65 down-regu…
##  5 ENSG00000013293 SRR1039516  9.39 N0806… untrt          -2.65 down-regu…
##  6 ENSG00000013293 SRR1039517  7.75 N0806… trt            -2.65 down-regu…
##  7 ENSG00000013293 SRR1039520  9.67 N0610… untrt          -2.65 down-regu…
##  8 ENSG00000013293 SRR1039521  7.95 N0610… trt            -2.65 down-regu…
##  9 ENSG00000013297 SRR1039508 12.7  N61311 untrt          -1.67 down-regu…
## 10 ENSG00000013297 SRR1039509 10.9  N61311 trt            -1.67 down-regu…
## # ... with 1,486 more rows

Calculate mean expression per gene

Now, we can apply the “split-apply-combine” strategy that we’ve already learned about, to summarise the data per gene.

Exercise:

Using group_by and summarise create a new table called cts_per_gene with the mean expression of each gene in each treatment.

## # A tibble: 374 x 4
##    gene            dex   up_or_down     mean_cts
##    <chr>           <chr> <chr>             <dbl>
##  1 ENSG00000013293 trt   down-regulated     7.79
##  2 ENSG00000013293 untrt down-regulated     9.26
##  3 ENSG00000013297 trt   down-regulated    10.8 
##  4 ENSG00000013297 untrt down-regulated    12.4 
##  5 ENSG00000025423 trt   down-regulated     7.80
##  6 ENSG00000025423 untrt down-regulated     8.75
##  7 ENSG00000035664 trt   up-regulated      10.1 
##  8 ENSG00000035664 untrt up-regulated       8.90
##  9 ENSG00000046653 trt   up-regulated      10.7 
## 10 ENSG00000046653 untrt up-regulated       8.87
## # ... with 364 more rows

Making our plots

With that final table, we are now ready to produce our plot!

ggplot(cts_per_gene, aes(dex, mean_cts)) +
  geom_line(aes(group = gene)) +
  facet_grid(~ up_or_down)

Notice how we needed to specify the option group = gene, to tell geom_line that each line should connect the values referring to the same gene. Otherwise, the function would not know what points should be connected.

Using factors to order values in plots

The plot above is quite close to what we want, but we notice that the labels on the x-axis appear in the opposite order of what we might want.

To solve this, we can convert the dex variable to a factor, specifying the levels in the right order:

cts_per_gene %>% 
  mutate(dex = factor(dex, levels = c("untrt", "trt"))) %>% 
  ggplot(aes(dex, mean_cts)) +
  geom_line(aes(group = gene), alpha = 0.5) +
  facet_grid(~ up_or_down)

Ordering factors based on other variables

Let’s look at how using factors can be very useful to improve our display of information.

Let’s try to make a “heatmap” representation of our genes’ expression in every sample:

ggplot(cts_filtered, aes(sample_id, gene)) +
  geom_tile(aes(fill = cts))

From the graph above, we can first see that we have an aesthetic problem with our axis labels: neither of them are readable! We can fix this by customizing our plot using the theme() function:

# Make the text on the x-axis at an angle
# Remove the text from the y-axis
ggplot(cts_filtered, aes(sample_id, gene)) +
  geom_tile(aes(fill = cts)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank())

Now it’s easy to see that the order in which the samples appear is alphabetical, but ideally we would like it if samples appeared grouped according to their treatment.

## # A tibble: 8 x 2
##   dex   sample    
##   <chr> <chr>     
## 1 trt   SRR1039509
## 2 trt   SRR1039513
## 3 trt   SRR1039517
## 4 trt   SRR1039521
## 5 untrt SRR1039508
## 6 untrt SRR1039512
## 7 untrt SRR1039516
## 8 untrt SRR1039520

We know we can change the order of the labels if we convert the sample_id variable into a factor. And in this case, we would like the order of the samples to be according to the order of the dex variable.

Of course we could do it by hand, but imagine you had hundreds of samples!

Here’s a simplified example of how we can order a factor based on another variable:

animal <- c("mouse", "elephant", "cat", "mouse", "elephant")
weight <- c(10, 1e6, 10000, 10, 1e6)

# Default order of levels is alphabetical
factor(animal)
## [1] mouse    elephant cat      mouse    elephant
## Levels: cat elephant mouse
# We can use the reorder function to order the levels based on a numeric variable
reorder(animal, weight)
## [1] mouse    elephant cat      mouse    elephant
## attr(,"scores")
##      cat elephant    mouse 
##    1e+04    1e+06    1e+01 
## Levels: mouse cat elephant

But how about this example?

animal <- c("mouse", "elephant", "cat", "mouse", "elephant")
animal_order <- c("1st", "3rd", "2nd", "1st", "3rd")

# Order levels of animals based on the size
reorder(animal, animal_order)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## [1] mouse    elephant cat      mouse    elephant
## attr(,"scores")
##      cat elephant    mouse 
##       NA       NA       NA 
## Levels: cat elephant mouse

This throws a warning and does not reorder the levels based on the order we want.

This is because the reorder() function works best if the variable we’re using to define the order of the levels is numeric. In this case, we could do the following:

# Define the rank of each string in `animal_ranks`
rank(animal_order)
## [1] 1.5 4.5 3.0 1.5 4.5
# Use this ranking of our character vector inside reorder
reorder(animal, rank(animal_order))
## [1] mouse    elephant cat      mouse    elephant
## attr(,"scores")
##      cat elephant    mouse 
##      3.0      4.5      1.5 
## Levels: mouse cat elephant

Using the trick above, we can order our samples according to their treatment:

cts_filtered %>% 
  mutate(sample_id = reorder(sample_id, rank(dex))) %>% 
  ggplot(aes(sample_id, gene)) +
  geom_tile(aes(fill = cts)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank())

This is already much more structured. However, we still have the problem that our genes are also not ordered logically. Preferably we would like them ordered according to their fold-change between treatments.

Exercise:

Using the same trick as above, order the genes according to the log2FoldChange column - you should obtain this much more structured graph:

Transforming expression data for visualization

There’s one final thing that will improve our “heatmap” representation of the data. We could transform our data to ensure that the different genes are represented on a comparable scale.

You could imagine that two genes might have changed by the same magnitude (say, doubled the expression after treatment), but their base mean levels of expression might have been quite different. For example, one gene might have changed from 10 to 20 and another from 100 to 200.

If what we’re interested in is the relative change in expression, then those two genes will appear more different than they really are in our heatmap.

A useful data transformation in this case is to center and scale each genes’ expression by their mean and standard deviation, respectively. The values thus obtained are known as z-scores, and can be interpreted as the “number of standard deviations away from the mean”.

A positive z-score means that the gene’s expression was above the average across samples, whereas a negative one means it was below average. A value of zero means the gene’s expression was exactly average.

In R, we can use the function scale to do this transformation. Let’s re-do our heatmap, but center and scale each gene’s expression first:

cts_filtered %>% 
  mutate(sample_id = reorder(sample_id, rank(dex)),
         gene = reorder(gene, log2FoldChange)) %>% 
  group_by(gene) %>% 
  mutate(cts = scale(cts)) %>% 
  ggplot(aes(sample_id, gene)) +
  geom_tile(aes(fill = cts)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank()) + 
  scale_fill_gradient2(low = "blue3", mid = "grey", high = "red3") +
  labs(x = "Sample", fill = "Z-score")

And that’s it! Notice that because of the symmetry of z-scores, they are very suited to a “divergent colour scale”, which we’ve set with the function scale_fill_gradient2 to emphasize positive and negative values in a more intuitive way.

Also notice how we’ve further improved the appearance of the plot by removing the y-axis label and ticks, and customised our labels.