.fastq
file.fastq
files can be accessed in Bioconductor
.fastq
file can inform us about qualityTwo example .fastq
files have been provided for you in the folder /home/participant/Course_Materials/Day1/
. As we discussed, they can be viewed without any specialist software. If you are curious how these fastq files were generated, you can see the Appendix for details.
We can launch the CRUK Docker from the Desktop and navigate to the directory containing the files
cd /home/participant/Course_Materials/Day1/
Then print the first few lines using the standard unix command head
head -n 12 sample.fq1
@SRR081708.237649/1
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG
+
N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############
@SRR081679.742077/1
CTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCAGCCTAACCCGAACGCT
+
####################################################################
@SRR079476.668457/1
CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
+
S=;?<>>><>=??>=@=?@?=@>@@@>A?@A@>A>@A@>@>@@@>A>@@@>@>@@?>@=>@@=?>@?Q
head -n 12 sample.fq2
@SRR081708.237649/2
GACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTAACCCTAA
+
S=<====<<>=><?=?=?>==@??;?>@@@=??@@????@??@?>?@@<@>@'@=?=??=<=>?>?=Q
@SRR081679.742077/2
GGATTGGGTTAGGGTTCGGGTTAGGGGTAGGGTTAGGGTTAGGGCTAGGGTTAGGGTTAGGGGTAGGG
+
####################################################################
@SRR079476.668457/2
TTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGGTTAGGGGTAGGGTTAGGGTTG
+
S=;>>>:=;?>@;=<?@@<><?@@;@<AA@=<BA?;=:?>>4A9?@@<<:<;:"=8?;;2=7<:=###
The unix command wc
can count the number of lines in a file with the option -l
wc -l sample.fq1
25162140 sample.fq1
We will now use the ShortRead
Bioconductor package to understand these files a bit more. Please note that we are using this package for exploration and to demonstrate what the data look like. We probably wouldn’t want to perform these tasks in a production-level analysis, and there are far superior tools for visualising fastq files (as we will see later).
Load RStudio….
Session -> Set Working Directory -> Choose Directory
/home/participant/Course_Materials/Day1/
/home/participant/Course_Materials/Day1/Session2-template.Rmd
Import the example fastq
file using the ShortRead
package.
library(ShortRead)
fq <- readFastq("sample.fq1")
fq
The object we have just created is typical of how data are presented in Bioconductor packages; typing the name of the object fq
prints a summary of the contents to the screen, and we have to use a series of functions to extract the contents. Such functions are referred to as accessor functions.
One such function is called sread
and allows us to access the sequencing reads. Conveniently, the function prints a summary to the screen so that only the first five and last five reads in the file are displayed. The type of this object is a DNAStringSet
, which is a very efficient way of storing DNA sequences in Bioconductor. The Biostrings
package has lots of useful tools to manipulate these objects. We can save the result of sread
to a variable for futher processing.
sread(fq)
## A DNAStringSet instance of length 6290535
## width seq
## [1] 68 GGGTTAGGGTTAGGGTTAGGGTTAGGGT...TAGGGTTAGGGTTAGGGTTAGGGTTAGG
## [2] 68 CTAAGCCTAACCCTAACCCTAACCCTAA...CCCTAACCCTCAGCCTAACCCGAACGCT
## [3] 68 CTAACCCTAACCCTAACCCTAACCCTAA...CCCTAACCCTAACCCTAACCCTAACCCT
## [4] 68 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...GGGTTAGGGTTAGGGCTAGGGTTAGGGT
## [5] 68 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...AGGGTTAGGGTTAGGGTTAGGGGTAGGG
## ... ... ...
## [6290531] 68 TGGGTACTTTGATTTTTTATGTACAATA...TTTGGTTACTTTGATATTTTATTTAGAG
## [6290532] 68 CTTGATATTTTATGTACAGTATACAATA...ACTTTGATATTTTATGTACAGAATAAAA
## [6290533] 68 ACTCCACTCAACAACATTCCATTTCATT...ACACAACTCCATTCCCCTCCACTCCACT
## [6290534] 68 CACTTCATCACATTCCATTCGATTTCAT...GAATCTTTCAATTTCATTCCATTCGATT
## [6290535] 68 CAAATGGAATGAAATGGAACGGAGTGGA...GGAATCGAATGGAATGGAATCAAACGGA
myreads <- sread(fq)
Even though the data they represent are much more complex, the way in which we subset the data should be familiar to us. This means there are no new rules to learn when it comes to accessing the data.
Try the following:-
myreads[1:5]
## A DNAStringSet instance of length 5
## width seq
## [1] 68 GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG...GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG
## [2] 68 CTAAGCCTAACCCTAACCCTAACCCTAACCC...TAACCCTAACCCTCAGCCTAACCCGAACGCT
## [3] 68 CTAACCCTAACCCTAACCCTAACCCTAACCC...TAACCCTAACCCTAACCCTAACCCTAACCCT
## [4] 68 GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG...TTAGGGTTAGGGTTAGGGCTAGGGTTAGGGT
## [5] 68 GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG...GTTAGGGTTAGGGTTAGGGTTAGGGGTAGGG
We can treat the myreads
object as a vector and use the []
in the normal fashion. However, it might be tempting to think that two indexes could be used to subset both the read index ,and the position in the read.
myreads[1:5,1:10]
Alas, the object is not designed to work in this way and we get an error message.
## Error in myreads[1:5, 1:3] : invalid subsetting
If we want to chop-up the strings in this manner, we can use the subseq
function. The is a specialised function that is much-more efficient that the general substr
function in base R.
subseq
to print the first five bases of the sequenced readsfirstBases <- subseq(myreads,1,5)
firstBases
## A DNAStringSet instance of length 6290535
## width seq
## [1] 5 GGGTT
## [2] 5 CTAAG
## [3] 5 CTAAC
## [4] 5 GTTAG
## [5] 5 GTTAG
## ... ... ...
## [6290531] 5 TGGGT
## [6290532] 5 CTTGA
## [6290533] 5 ACTCC
## [6290534] 5 CACTT
## [6290535] 5 CAAAT
==
which gives a logical (TRUE
or FALSE
)table
can be used to get counts of how many different values occur in a vectorsort
will produce a sorted version of a vectorAnother useful function is letterFrequency
, which seeks to tabulate the number of letters in each sequence in the form of a matrix.
af <- letterFrequency(sread(fq), letters=c("A","C","G","T","N"))
head(af)
## A C G T N
## [1,] 11 0 35 22 0
## [2,] 21 32 4 11 0
## [3,] 22 34 0 12 0
## [4,] 11 1 35 21 0
## [5,] 10 0 37 21 0
## [6,] 12 0 32 24 0
This matrix can be accessed in the usual manner. For instance, we could get the number of A
bases for each read, as a vector, and produce a histogram with the following two lines:-
aS <- af[,1]
hist(aS)
The following R output is for your reference
## A DNAStringSet instance of length 1
## width seq
## [1] 68 ATTTGTTGATTAGAAGCTCACAATTGGCAGG...TGGCTCNNNNNNNNAGGCAGAACACTTTGGG
## A DNAStringSet instance of length 84
## width seq
## [1] 68 GGCGGCGGCGGCGGCGGCGGCAGGGGCTGCT...CGGCCTTGGGGCCGCGGCGCGGCCGCTGGG
## [2] 68 GGGCGGCGGCGGGGACCCTGGCTGGGAGCGC...CCCGCGGGGGGGGGCGGCGGGGGGCTCACG
## [3] 68 GGACGGGGCGGCTGGCCGACCCCCCCCCCCC...CCCGGGTGGGGGGGGGGGCCGGCGCGAGGG
## [4] 68 CCGGACGGGGCGGCTGGCCGGGCGGGGGGGC...CCCCCCCCTCCCCCCCCGACGGGGCGGCCG
## [5] 68 CCCGCCCGGCCAGCCGCCCCGTCCGGGGGGG...GGGGGTACGCCCCCCGACCGGCCCGGCTCC
## ... ... ...
## [80] 68 ACCGGCCTCTGGGGCCGCCCGGGCTGCCGGG...CGGCCGGGCCGGGGCCCGGGGGGGGACCCC
## [81] 68 GGGGCAGCCGGGGTTCCTGGCGCTCCGGGGG...GCGGCCGCCGGGGGGGCCGCCGGGCCGCAG
## [82] 68 TGGCGGCCGGCGCCCGTCCTCGGGGGCAGCC...GCCCCGCCGGCGGCCGCCACTCCGCCGCGC
## [83] 68 GCGGAGGCCGGGGCCGCAGAGGCCGGGGCCG...CGGGGCCGCAGAGGCCGGGGGCGGCGGCGG
## [84] 68 GGGCGGGGACAGAGAGGCGGGCGCGCCCCCG...CCGCCCAGGCCCCCAAGGGGGCGGGGGGCG
Note that at this point we have not performed any manipulation on the reads themselves. If we wanted to make persistent changes to the object, such as creating a subset, we would have to overwrite the myreads
object (not recommended!) or create a new object
myreads
## A DNAStringSet instance of length 6290535
## width seq
## [1] 68 GGGTTAGGGTTAGGGTTAGGGTTAGGGT...TAGGGTTAGGGTTAGGGTTAGGGTTAGG
## [2] 68 CTAAGCCTAACCCTAACCCTAACCCTAA...CCCTAACCCTCAGCCTAACCCGAACGCT
## [3] 68 CTAACCCTAACCCTAACCCTAACCCTAA...CCCTAACCCTAACCCTAACCCTAACCCT
## [4] 68 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...GGGTTAGGGTTAGGGCTAGGGTTAGGGT
## [5] 68 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...AGGGTTAGGGTTAGGGTTAGGGGTAGGG
## ... ... ...
## [6290531] 68 TGGGTACTTTGATTTTTTATGTACAATA...TTTGGTTACTTTGATATTTTATTTAGAG
## [6290532] 68 CTTGATATTTTATGTACAGTATACAATA...ACTTTGATATTTTATGTACAGAATAAAA
## [6290533] 68 ACTCCACTCAACAACATTCCATTTCATT...ACACAACTCCATTCCCCTCCACTCCACT
## [6290534] 68 CACTTCATCACATTCCATTCGATTTCAT...GAATCTTTCAATTTCATTCCATTCGATT
## [6290535] 68 CAAATGGAATGAAATGGAACGGAGTGGA...GGAATCGAATGGAATGGAATCAAACGGA
We can also look in more detail at the number of bases that were called during each round of sequencing
abc <- alphabetByCycle(myreads,alphabet = c("A","C","G","T","N"))
head(abc)[,1:5]
## cycle
## alphabet [,1] [,2] [,3] [,4] [,5]
## A 1873688 1810280 1742078 1728768 1730331
## C 1424249 1312001 1379743 1376224 1350739
## G 1189956 1415207 1382478 1400930 1423660
## T 1802642 1753047 1786236 1784613 1785805
## N 0 0 0 0 0
The following plot can be used to illustrate the number of each DNA base observed at each position
plot(abc[1,],type="n",ylim=c(0,2000000))
text(abc[1,1:68],label="A",col="red")
text(abc[2,1:68],label="C",col="orange")
text(abc[3,1:68],label="G",col="blue")
text(abc[4,1:68],label="T",col="green")
Also in the .fastq
file is the quality associated with each base call. This is especially important when it comes to calling SNVs, as we need to be sure of the bases we observe. The quality
function is used to access these data.
quality(fq)
## class: FastqQuality
## quality:
## A BStringSet instance of length 6290535
## width seq
## [1] 68 N?>:<9>>>:=;>>?<>:@?>;==@@@>...>4B=<>>.@>?<@;?#############
## [2] 68 ############################...############################
## [3] 68 S=;?<>>><>=??>=@=?@?=@>@@@>A...>@@@>A>@@@>@>@@?>@=>@@=?>@?Q
## [4] 68 ############################...############################
## [5] 68 O9<:=;:7<8===8=:>=?:>9?>>9=8...=@<==77?;=##################
## ... ... ...
## [6290531] 68 S>??:<=>>>>>=?????=>@==>>@>>...>@@@A>@>>@??@>><=>>><=>?;@?S
## [6290532] 68 S>=>=<;<=>><;?==>>?==><>>@>>...>????@?>>=@??=>A==8=@>==;@@@
## [6290533] 68 S<<=><<=><?==@=>=>?@>=>???=@...>>@=>@??@=><9=;>???9==;?>###
## [6290534] 68 S=<>=><<>==>=>??>>?@9?>??@>>...:?A?A@?@@?@???A?>>??>=?@8?=P
## [6290535] 68 S=>><>?=><>>?@>?@>@>9A?@>@@?...9A@@>A9??>@@@@=AA?@=?>??>8@S
myquals <- quality(fq)
As we noted before, a variey of different schemes can be used to encode the qualities. Fortunately, with the help of the encoding
function, we can make a guess about what scale is being used in our file
encoding(quality(fq))
## ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## : ; < = > ? @ A B C D E F G H I J
## 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
Quality scores can often be used to motivate the need for trimming, or to use an aligner that can incorporate quality information. We can use the following two commands to obtain a more-familiar matrix of numeric values to represent quality
qm <- as(quality(fq), "matrix")
head(qm)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 45 30 29 25 27 24 29 29 29 25 28 26 29
## [2,] 2 2 2 2 2 2 2 2 2 2 2 2 2
## [3,] 50 28 26 30 27 29 29 29 27 29 28 30 30
## [4,] 2 2 2 2 2 2 2 2 2 2 2 2 2
## [5,] 46 24 27 25 28 26 25 22 27 23 28 28 28
## [6,] 50 29 26 30 30 29 26 28 26 30 29 30 28
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] 29 30 27 29 25 31 30 29 26 28 28
## [2,] 2 2 2 2 2 2 2 2 2 2 2
## [3,] 29 28 31 28 30 31 30 28 31 29 31
## [4,] 2 2 2 2 2 2 2 2 2 2 2
## [5,] 23 28 25 29 28 30 25 29 24 30 29
## [6,] 29 25 30 31 30 29 30 29 32 32 32
## [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
## [1,] 31 31 31 29 30 28 32 32 32 27 29
## [2,] 2 2 2 2 2 2 2 2 2 2 2
## [3,] 31 31 29 32 30 31 32 31 29 32 29
## [4,] 2 2 2 2 2 2 2 2 2 2 2
## [5,] 29 24 28 23 32 30 28 22 26 29 29
## [6,] 27 31 28 32 31 32 28 31 28 31 31
## [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## [1,] 28 32 31 30 21 29 19 33 28 27 29
## [2,] 2 2 2 2 2 2 2 2 2 2 2
## [3,] 31 32 31 29 31 29 31 31 31 29 32
## [4,] 2 2 2 2 2 2 2 2 2 2 2
## [5,] 32 31 31 29 28 28 31 27 28 28 22
## [6,] 32 28 31 27 31 32 29 31 27 31 30
## [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57]
## [1,] 29 13 31 29 30 27 31 26 30 2 2
## [2,] 2 2 2 2 2 2 2 2 2 2 2
## [3,] 29 31 31 31 29 31 29 31 31 30 29
## [4,] 2 2 2 2 2 2 2 2 2 2 2
## [5,] 22 30 26 28 2 2 2 2 2 2 2
## [6,] 27 27 29 22 33 30 27 2 2 2 2
## [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65] [,66] [,67] [,68]
## [1,] 2 2 2 2 2 2 2 2 2 2 2
## [2,] 2 2 2 2 2 2 2 2 2 2 2
## [3,] 31 28 29 31 31 28 30 29 31 30 48
## [4,] 2 2 2 2 2 2 2 2 2 2 2
## [5,] 2 2 2 2 2 2 2 2 2 2 2
## [6,] 2 2 2 2 2 2 2 2 2 2 2
To look at global trends, it is useful to look at the mean quality at each base. The colMeans
function in R provides a quick way of being able to compute the average (mean) of each column in a matrix.
qualmeans <- colMeans(qm)
plot(qualmeans)
Hopefully this section was useful in familiarising yourself with the contents of a fastq
file and give ideas about how R and Bioconductor can interface with sequencing data. However, in practice we would not normally look into fastq files in such depth in R.
For your reference, here is how the example files were created
First, we download an example bam file from the 1000 genomes project
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/other_exome_alignments/NA06984/exome_alignment/NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam
Then the file is downsampled to give a much reduced set of reads. The original file can be removed
java -jar $PICARD DownsampleSam I=NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam O=random.bam P=0.1 VALIDATION_STRINGENCY=SILENT
rm NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam
For convenience, we filter the file to keep only reads that are properly paired. Then picard is used again to extract the fastq data from the file. As the original alignments were a mix of 68 and 76 bases, we trim all the reads to 68 bases to processing easier.
samtools view -f 0x02 -b random.bam > paired.bam
java -jar $PICARD SamToFastq I=paired.bam F=sample.fq1 VALIDATION_STRINGENCY=SILENT F2=sample.fq2 R1_MAX_BASES=68 R2_MAX_BASES=68