Hands-on Practice

Overview

  • Understanding the contents of a .fastq file
  • How .fastq files can be accessed in Bioconductor
    • (should you ever need to
  • What information in a .fastq file can inform us about quality
  • How complex data structures can be assessed in Bioconductor

Two example .fastq files have been provided for you in the folder /home/participant/Course_Materials/Day1/. As we discussed, they can be viewed without any specialist software. If you are curious how these fastq files were generated, you can see the Appendix for details.

We can launch the CRUK Docker from the Desktop and navigate to the directory containing the files

cd /home/participant/Course_Materials/Day1/

Then print the first few lines using the standard unix command head

head -n 12 sample.fq1
@SRR081708.237649/1
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG
+
N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############
@SRR081679.742077/1
CTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCAGCCTAACCCGAACGCT
+
####################################################################
@SRR079476.668457/1
CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
+
S=;?<>>><>=??>=@=?@?=@>@@@>A?@A@>A>@A@>@>@@@>A>@@@>@>@@?>@=>@@=?>@?Q
head -n 12 sample.fq2
@SRR081708.237649/2
GACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTAACCCTAA
+
S=<====<<>=><?=?=?>==@??;?>@@@=??@@????@??@?>?@@<@>@'@=?=??=<=>?>?=Q
@SRR081679.742077/2
GGATTGGGTTAGGGTTCGGGTTAGGGGTAGGGTTAGGGTTAGGGCTAGGGTTAGGGTTAGGGGTAGGG
+
####################################################################
@SRR079476.668457/2
TTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGGTTAGGGGTAGGGTTAGGGTTG
+
S=;>>>:=;?>@;=<?@@<><?@@;@<AA@=<BA?;=:?>>4A9?@@<<:<;:"=8?;;2=7<:=###

The unix command wc can count the number of lines in a file with the option -l

wc -l sample.fq1
25162140 sample.fq1

We will now use the ShortRead Bioconductor package to understand these files a bit more. Please note that we are using this package for exploration and to demonstrate what the data look like. We probably wouldn’t want to perform these tasks in a production-level analysis, and there are far superior tools for visualising fastq files (as we will see later).

Getting started

Load RStudio….

Working directory

Session -> Set Working Directory -> Choose Directory

/home/participant/Course_Materials/Day1/

Template

/home/participant/Course_Materials/Day1/Session2-template.Rmd



Exercise

Import the example fastq file using the ShortRead package.

  • How many reads are found?
  • How long is each sequenced read?
library(ShortRead)
fq <- readFastq("sample.fq1")
fq



The object we have just created is typical of how data are presented in Bioconductor packages; typing the name of the object fq prints a summary of the contents to the screen, and we have to use a series of functions to extract the contents. Such functions are referred to as accessor functions.

One such function is called sread and allows us to access the sequencing reads. Conveniently, the function prints a summary to the screen so that only the first five and last five reads in the file are displayed. The type of this object is a DNAStringSet, which is a very efficient way of storing DNA sequences in Bioconductor. The Biostrings package has lots of useful tools to manipulate these objects. We can save the result of sread to a variable for futher processing.

sread(fq)
##   A DNAStringSet instance of length 6290535
##           width seq
##       [1]    68 GGGTTAGGGTTAGGGTTAGGGTTAGGGT...TAGGGTTAGGGTTAGGGTTAGGGTTAGG
##       [2]    68 CTAAGCCTAACCCTAACCCTAACCCTAA...CCCTAACCCTCAGCCTAACCCGAACGCT
##       [3]    68 CTAACCCTAACCCTAACCCTAACCCTAA...CCCTAACCCTAACCCTAACCCTAACCCT
##       [4]    68 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...GGGTTAGGGTTAGGGCTAGGGTTAGGGT
##       [5]    68 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...AGGGTTAGGGTTAGGGTTAGGGGTAGGG
##       ...   ... ...
## [6290531]    68 TGGGTACTTTGATTTTTTATGTACAATA...TTTGGTTACTTTGATATTTTATTTAGAG
## [6290532]    68 CTTGATATTTTATGTACAGTATACAATA...ACTTTGATATTTTATGTACAGAATAAAA
## [6290533]    68 ACTCCACTCAACAACATTCCATTTCATT...ACACAACTCCATTCCCCTCCACTCCACT
## [6290534]    68 CACTTCATCACATTCCATTCGATTTCAT...GAATCTTTCAATTTCATTCCATTCGATT
## [6290535]    68 CAAATGGAATGAAATGGAACGGAGTGGA...GGAATCGAATGGAATGGAATCAAACGGA
myreads <- sread(fq)

Even though the data they represent are much more complex, the way in which we subset the data should be familiar to us. This means there are no new rules to learn when it comes to accessing the data.

Try the following:-

myreads[1:5]
##   A DNAStringSet instance of length 5
##     width seq
## [1]    68 GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG...GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG
## [2]    68 CTAAGCCTAACCCTAACCCTAACCCTAACCC...TAACCCTAACCCTCAGCCTAACCCGAACGCT
## [3]    68 CTAACCCTAACCCTAACCCTAACCCTAACCC...TAACCCTAACCCTAACCCTAACCCTAACCCT
## [4]    68 GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG...TTAGGGTTAGGGTTAGGGCTAGGGTTAGGGT
## [5]    68 GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG...GTTAGGGTTAGGGTTAGGGTTAGGGGTAGGG

We can treat the myreads object as a vector and use the [] in the normal fashion. However, it might be tempting to think that two indexes could be used to subset both the read index ,and the position in the read.

myreads[1:5,1:10]

Alas, the object is not designed to work in this way and we get an error message.

## Error in myreads[1:5, 1:3] : invalid subsetting

If we want to chop-up the strings in this manner, we can use the subseq function. The is a specialised function that is much-more efficient that the general substr function in base R.




Exercise

  • Use subseq to print the first five bases of the sequenced reads
firstBases <- subseq(myreads,1,5)
firstBases
##   A DNAStringSet instance of length 6290535
##           width seq
##       [1]     5 GGGTT
##       [2]     5 CTAAG
##       [3]     5 CTAAC
##       [4]     5 GTTAG
##       [5]     5 GTTAG
##       ...   ... ...
## [6290531]     5 TGGGT
## [6290532]     5 CTTGA
## [6290533]     5 ACTCC
## [6290534]     5 CACTT
## [6290535]     5 CAAAT
  • How many reads start with five A’s?
    • HINT: we can test for equality using == which gives a logical (TRUE or FALSE)
    • HINT: table can be used to get counts of how many different values occur in a vector
  • What is the most-common five bases at the start of the reads
    • HINT: sort will produce a sorted version of a vector



Another useful function is letterFrequency, which seeks to tabulate the number of letters in each sequence in the form of a matrix.

af <- letterFrequency(sread(fq), letters=c("A","C","G","T","N"))
head(af)
##       A  C  G  T N
## [1,] 11  0 35 22 0
## [2,] 21 32  4 11 0
## [3,] 22 34  0 12 0
## [4,] 11  1 35 21 0
## [5,] 10  0 37 21 0
## [6,] 12  0 32 24 0

This matrix can be accessed in the usual manner. For instance, we could get the number of A bases for each read, as a vector, and produce a histogram with the following two lines:-

aS <- af[,1]
hist(aS)




Exercise

  • Create a vector to represent the “GC” content of each read
  • What is the mean number of GC bases?
  • Visualise the GC distribution using a histogram
  • What reads have more than 60 GC bases?

The following R output is for your reference

##   A DNAStringSet instance of length 1
##     width seq
## [1]    68 ATTTGTTGATTAGAAGCTCACAATTGGCAGG...TGGCTCNNNNNNNNAGGCAGAACACTTTGGG

##   A DNAStringSet instance of length 84
##      width seq
##  [1]    68 GGCGGCGGCGGCGGCGGCGGCAGGGGCTGCT...CGGCCTTGGGGCCGCGGCGCGGCCGCTGGG
##  [2]    68 GGGCGGCGGCGGGGACCCTGGCTGGGAGCGC...CCCGCGGGGGGGGGCGGCGGGGGGCTCACG
##  [3]    68 GGACGGGGCGGCTGGCCGACCCCCCCCCCCC...CCCGGGTGGGGGGGGGGGCCGGCGCGAGGG
##  [4]    68 CCGGACGGGGCGGCTGGCCGGGCGGGGGGGC...CCCCCCCCTCCCCCCCCGACGGGGCGGCCG
##  [5]    68 CCCGCCCGGCCAGCCGCCCCGTCCGGGGGGG...GGGGGTACGCCCCCCGACCGGCCCGGCTCC
##  ...   ... ...
## [80]    68 ACCGGCCTCTGGGGCCGCCCGGGCTGCCGGG...CGGCCGGGCCGGGGCCCGGGGGGGGACCCC
## [81]    68 GGGGCAGCCGGGGTTCCTGGCGCTCCGGGGG...GCGGCCGCCGGGGGGGCCGCCGGGCCGCAG
## [82]    68 TGGCGGCCGGCGCCCGTCCTCGGGGGCAGCC...GCCCCGCCGGCGGCCGCCACTCCGCCGCGC
## [83]    68 GCGGAGGCCGGGGCCGCAGAGGCCGGGGCCG...CGGGGCCGCAGAGGCCGGGGGCGGCGGCGG
## [84]    68 GGGCGGGGACAGAGAGGCGGGCGCGCCCCCG...CCGCCCAGGCCCCCAAGGGGGCGGGGGGCG



Note that at this point we have not performed any manipulation on the reads themselves. If we wanted to make persistent changes to the object, such as creating a subset, we would have to overwrite the myreads object (not recommended!) or create a new object

myreads
##   A DNAStringSet instance of length 6290535
##           width seq
##       [1]    68 GGGTTAGGGTTAGGGTTAGGGTTAGGGT...TAGGGTTAGGGTTAGGGTTAGGGTTAGG
##       [2]    68 CTAAGCCTAACCCTAACCCTAACCCTAA...CCCTAACCCTCAGCCTAACCCGAACGCT
##       [3]    68 CTAACCCTAACCCTAACCCTAACCCTAA...CCCTAACCCTAACCCTAACCCTAACCCT
##       [4]    68 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...GGGTTAGGGTTAGGGCTAGGGTTAGGGT
##       [5]    68 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...AGGGTTAGGGTTAGGGTTAGGGGTAGGG
##       ...   ... ...
## [6290531]    68 TGGGTACTTTGATTTTTTATGTACAATA...TTTGGTTACTTTGATATTTTATTTAGAG
## [6290532]    68 CTTGATATTTTATGTACAGTATACAATA...ACTTTGATATTTTATGTACAGAATAAAA
## [6290533]    68 ACTCCACTCAACAACATTCCATTTCATT...ACACAACTCCATTCCCCTCCACTCCACT
## [6290534]    68 CACTTCATCACATTCCATTCGATTTCAT...GAATCTTTCAATTTCATTCCATTCGATT
## [6290535]    68 CAAATGGAATGAAATGGAACGGAGTGGA...GGAATCGAATGGAATGGAATCAAACGGA

We can also look in more detail at the number of bases that were called during each round of sequencing

abc <- alphabetByCycle(myreads,alphabet = c("A","C","G","T","N"))
head(abc)[,1:5]
##         cycle
## alphabet    [,1]    [,2]    [,3]    [,4]    [,5]
##        A 1873688 1810280 1742078 1728768 1730331
##        C 1424249 1312001 1379743 1376224 1350739
##        G 1189956 1415207 1382478 1400930 1423660
##        T 1802642 1753047 1786236 1784613 1785805
##        N       0       0       0       0       0

The following plot can be used to illustrate the number of each DNA base observed at each position

plot(abc[1,],type="n",ylim=c(0,2000000))
text(abc[1,1:68],label="A",col="red")
text(abc[2,1:68],label="C",col="orange")
text(abc[3,1:68],label="G",col="blue")
text(abc[4,1:68],label="T",col="green")

Also in the .fastq file is the quality associated with each base call. This is especially important when it comes to calling SNVs, as we need to be sure of the bases we observe. The quality function is used to access these data.

quality(fq)
## class: FastqQuality
## quality:
##   A BStringSet instance of length 6290535
##           width seq
##       [1]    68 N?>:<9>>>:=;>>?<>:@?>;==@@@>...>4B=<>>.@>?<@;?#############
##       [2]    68 ############################...############################
##       [3]    68 S=;?<>>><>=??>=@=?@?=@>@@@>A...>@@@>A>@@@>@>@@?>@=>@@=?>@?Q
##       [4]    68 ############################...############################
##       [5]    68 O9<:=;:7<8===8=:>=?:>9?>>9=8...=@<==77?;=##################
##       ...   ... ...
## [6290531]    68 S>??:<=>>>>>=?????=>@==>>@>>...>@@@A>@>>@??@>><=>>><=>?;@?S
## [6290532]    68 S>=>=<;<=>><;?==>>?==><>>@>>...>????@?>>=@??=>A==8=@>==;@@@
## [6290533]    68 S<<=><<=><?==@=>=>?@>=>???=@...>>@=>@??@=><9=;>???9==;?>###
## [6290534]    68 S=<>=><<>==>=>??>>?@9?>??@>>...:?A?A@?@@?@???A?>>??>=?@8?=P
## [6290535]    68 S=>><>?=><>>?@>?@>@>9A?@>@@?...9A@@>A9??>@@@@=AA?@=?>??>8@S
myquals <- quality(fq)

As we noted before, a variey of different schemes can be used to encode the qualities. Fortunately, with the help of the encoding function, we can make a guess about what scale is being used in our file

encoding(quality(fq))
##  !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /  0  1  2  3  4  5  6  7  8  9 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
##  :  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J 
## 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41



Exercise

  • What can you say about the quality of base calls in the first five reads?
  • What reads seem to have lower quality?
    • Is it a particular part of the read that is low quality?
    • Or the whole read?



Quality scores can often be used to motivate the need for trimming, or to use an aligner that can incorporate quality information. We can use the following two commands to obtain a more-familiar matrix of numeric values to represent quality

qm <- as(quality(fq), "matrix")
head(qm)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]   45   30   29   25   27   24   29   29   29    25    28    26    29
## [2,]    2    2    2    2    2    2    2    2    2     2     2     2     2
## [3,]   50   28   26   30   27   29   29   29   27    29    28    30    30
## [4,]    2    2    2    2    2    2    2    2    2     2     2     2     2
## [5,]   46   24   27   25   28   26   25   22   27    23    28    28    28
## [6,]   50   29   26   30   30   29   26   28   26    30    29    30    28
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]    29    30    27    29    25    31    30    29    26    28    28
## [2,]     2     2     2     2     2     2     2     2     2     2     2
## [3,]    29    28    31    28    30    31    30    28    31    29    31
## [4,]     2     2     2     2     2     2     2     2     2     2     2
## [5,]    23    28    25    29    28    30    25    29    24    30    29
## [6,]    29    25    30    31    30    29    30    29    32    32    32
##      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
## [1,]    31    31    31    29    30    28    32    32    32    27    29
## [2,]     2     2     2     2     2     2     2     2     2     2     2
## [3,]    31    31    29    32    30    31    32    31    29    32    29
## [4,]     2     2     2     2     2     2     2     2     2     2     2
## [5,]    29    24    28    23    32    30    28    22    26    29    29
## [6,]    27    31    28    32    31    32    28    31    28    31    31
##      [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## [1,]    28    32    31    30    21    29    19    33    28    27    29
## [2,]     2     2     2     2     2     2     2     2     2     2     2
## [3,]    31    32    31    29    31    29    31    31    31    29    32
## [4,]     2     2     2     2     2     2     2     2     2     2     2
## [5,]    32    31    31    29    28    28    31    27    28    28    22
## [6,]    32    28    31    27    31    32    29    31    27    31    30
##      [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57]
## [1,]    29    13    31    29    30    27    31    26    30     2     2
## [2,]     2     2     2     2     2     2     2     2     2     2     2
## [3,]    29    31    31    31    29    31    29    31    31    30    29
## [4,]     2     2     2     2     2     2     2     2     2     2     2
## [5,]    22    30    26    28     2     2     2     2     2     2     2
## [6,]    27    27    29    22    33    30    27     2     2     2     2
##      [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65] [,66] [,67] [,68]
## [1,]     2     2     2     2     2     2     2     2     2     2     2
## [2,]     2     2     2     2     2     2     2     2     2     2     2
## [3,]    31    28    29    31    31    28    30    29    31    30    48
## [4,]     2     2     2     2     2     2     2     2     2     2     2
## [5,]     2     2     2     2     2     2     2     2     2     2     2
## [6,]     2     2     2     2     2     2     2     2     2     2     2

To look at global trends, it is useful to look at the mean quality at each base. The colMeans function in R provides a quick way of being able to compute the average (mean) of each column in a matrix.

qualmeans <- colMeans(qm)
plot(qualmeans)

Hopefully this section was useful in familiarising yourself with the contents of a fastq file and give ideas about how R and Bioconductor can interface with sequencing data. However, in practice we would not normally look into fastq files in such depth in R.

Appendix

For your reference, here is how the example files were created

First, we download an example bam file from the 1000 genomes project

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/other_exome_alignments/NA06984/exome_alignment/NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam

Then the file is downsampled to give a much reduced set of reads. The original file can be removed

java -jar $PICARD DownsampleSam I=NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam O=random.bam P=0.1 VALIDATION_STRINGENCY=SILENT

rm NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam

For convenience, we filter the file to keep only reads that are properly paired. Then picard is used again to extract the fastq data from the file. As the original alignments were a mix of 68 and 76 bases, we trim all the reads to 68 bases to processing easier.

samtools view -f 0x02 -b random.bam > paired.bam 

java -jar $PICARD SamToFastq I=paired.bam F=sample.fq1 VALIDATION_STRINGENCY=SILENT F2=sample.fq2 R1_MAX_BASES=68 R2_MAX_BASES=68