(Acknowledgement to Ines De Santiago for her session at the previous summer school)
FastQC from Babraham Bioinformatics Core has emerged as the standard tool for performing quality assessment on sequencing reads
The manual for fastqc
is available online and is very comprehensive; especially the parts which describe particular sections of the report. The authors also run a “QCfail” blog which discusses some sequencing QC errors they have encountered and how they were diagnosed.
A “traffic light” system is used to draw your attention to sections of the report that require further investigation. However, it is worth bearing in mind that fastqc
is designed to be run on fastq files from any type of sequencing experiment and has no knowledge of the particular library preparation, or conditions that you are studying. It could be that you expect high levels of duplication or GC content. Always consider the nature of your study before taking any drastic action!
Also, fastqc
will not actually do anything to your data. If you decide to trim or remove contamination for your samples, you will need to use another tool.
fastqc
reportSome simple statistics about the composition of your file, which can be useful to see if it has guessed the encoding correctly and identified the correct number of reads. This section of the report is designed never to give a warning message
This section of the report is probably the one that receives most attention. It’s generally accepted that there is a degradation of quality over the duration of a sequencing run, but the extent to which the quality “drops-off” should be monitored. A boxplot is produced for every base-position in the read and the central line and yellow box represent the median and inter-quartile range in the usual manner.
Ideally, the plot should look something like following:-
However, a warning will be triggered if the lower quartile (25% of the data) of any base in less than 10, or if the median for any base is less than 25. A failure (red cross in the traffic light system) occurs if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
With this section of the report, we are checking to see if there is a population of sequences that have low quality values. A warning occurs when the mean quality is below 27, whereas a failure indicates a mean below 20.
This is one area of the report where you need to exercise some caution, because the results may depend on the type of sequencing you are performing. This course is going to be focussed on whole-genome (or exome) DNA sequencing, which should result in an even distribution of bases along the read length.
Sometimes you will get biased sequencing composition at the start of reads due to adaptor contamination, which would tend to be flagged elsewhere in the report
And some biases are inevitable due to the type of sequencing being performed. e.g. RNA-seq is know to introduce biases at the start of reads due to “random” hexamer priming
Or in the cases of a RRBS library, you will have converted most of the C bases to T, so would expect something along the following lines:-
A familiar bell-shaped normal distribution should be seen. However, the location of the mean (central peak) will depend on the species being sequenced. fastqc
has no way of knowing what species to expect, so may produce a warning or error message for an otherwise good-quality sequencing run.
Significant deviations from normal would tend to signify a problem with the library though, or contamination from different species
Recall that an N
base is recorded when we are not confident enough to make a particular base call. We might sometimes see some N
appearing at the end of a sequence due to the expected drop in quality, although this should be a low proportion.
As you might expect, this will tell you if all your reads are not the same length. There might be entirely plausible reasons for having reads of different length (e.g. reads generated from multiple runs), so again we may wish to ignore any warnings that arise from this part of the report.
Again, we need to exercise caution when interpreting these plots and traffic lights as they as based on a library that is diverse and not enriched for any particular property.
A perfectly good and respectable RNA-seq library may have duplication rates above 50%
This section of the report can highlight contamination or lower-than-expected diversity
The tool will attempt to identify and adaptors present in the data from common ones that it knows about
Although the overrepresented sequences section above will identify sequences that are exactly duplicated, it may miss other problems in the data such as i) long sequences with poor quality ii) repeated sequences appearing at various places
This section reports the positional bias of the 6 most-biased k-mers, and whereabouts they occur in the read
“Random” hexamer primer in RNA-seq libraries are sometimes not as random as we would hope
Described in detail in Hansen et al. (2010)
fastqc
fastqc
can be run from the CRUK Docker as follows;
cd /home/participant/Course_Materials/Day1/
fastqc sample.fq1
As a result, you should get two files in your working directory; sample.fq1_fastqc.zip
and sample.fq1_fastqc.html
. The .zip
file contains all the metrics that fastqc
computes, should you wish to perform extra manipulation and visualisation beyond what fastqc
offers.
The html
file can be opened in a web browser. If you follow the link on the Desktop to Course_Materials
you should be able to navigate to the folder Day1/
and see the files we have created.
fastqc
report for our example file sample.fq1
If time allows, we will cover how to trim our data
Based on these plots we may want to trim our data; fastqc will not do this for us.
Popular choices are fastx_toolit, trimmomatic and cutadapt; all of which should be installed on your computers. As implied by the name, cutadapt
is useful for removing adaptor sequences.
fastx_toolkit
allows us to remove reads that do not have a high enough proportion of high-quality reads.
fastq_quality_filter -v Q 33 -q 20 -p 75 -i sample.fq1 -o sample_filtered.fq
the options were used:-
-i: input file
-o: output file
-v: report the number of sequences
-Q: 33, determines the input quality ASCII offset
-q: 20, the quality value required
-p 75: the percentage of bases required to have that quality value
the output should be something like…
Quality cut-off: 20
Minimum percentage: 75
Input: 6290535 reads.
Output: 5399767 reads.
discarded 890768 (14%) low-quality reads.
It can also do straightforward trimming to a particular read length
fastx_trimmer -v -f 1 -l 60 -i Day1/sample.fq1 -o sample.fastx.trimmed.fq1
the options specified were:-
-f: first base to keep
-l: last base to keep
-i: input file
-o: output file
-v: report number of sequences
Trimmomatic is interesting as it allows for differently-sized reads.
To use trimmomatic
we can run the following in a Terminal. Here $TRIMMOMATIC
is used to refer to the particular location that the tool has been installed to.
java -jar $TRIMMOMATIC SE -phred33 sample.fq1 sample.trimmed.fq1 TRAILING:3 MINLEN:30
SE: run single-end mode
-phred33: use the offset of 33 to interpret quality scores
TRAILING:3 cut bases from the end of the read if below 3
For this output file it is instructive to load using ShortRead
trimmed.fq <- readFastq("sample.trimmed.fq1")
trimmed.fq
## class: ShortReadQ
## length: 5971322 reads; width: 30..68 cycles
How many were removed?
length(trimmed.fq)/length(fq)
## [1] 0.949255
summary(width(sread(trimmed.fq)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 68.00 68.00 64.58 68.00 68.00
hist(width(sread(trimmed.fq)))
We can verify why some of the reads were removed
old.ids <- as.character(id(fq))
ids.left <- as.character(id(trimmed.fq))
missing.ids <- setdiff(old.ids,ids.left)
myquals <- quality(fq)
myquals[old.ids %in% missing.ids]
## class: FastqQuality
## quality:
## A BStringSet instance of length 319213
## width seq
## [1] 68 #############################...############################
## [2] 68 #############################...############################
## [3] 68 #############################...############################
## [4] 68 #############################...############################
## [5] 68 L5;8<=>:>;<?><<;7>?##########...############################
## ... ... ...
## [319209] 68 L<:9=;=9:=6;8;>>:@9=>>?######...############################
## [319210] 68 #############################...############################
## [319211] 68 H<.3*;:<<<###################...############################
## [319212] 68 #############################...############################
## [319213] 68 #############################...############################
fastqc
and observe what effect this has on the resultsfastqc sample.trimmed.fq1
fastqc sample.fastx.trimmed.fq1
fastqc sample.filtered.fq1
For completeness, we will show the commands used to aligned these two example fastq files.
Note, that are assuming that you are still located in the Day1
directory.
cd /home/participant/Course_Materials/Day1
We have previously downloaded a reference genome from the 1000 genomes into a directory raw_data
wget http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta -P ../ref_data/
cd ../ref_data
gunzip ref_data/human_g1k_v37.fasta.gz
bwa
can index the unzipped .fasta
file. The option -p
allows us how we want to prefix all the files that the indexing procedure creates. [N.B this step will take a long time]
bwa index -p hg19 -a bwtsw Homo_sapiens_assembly19.fasta
Here’s one we made earlier!
ls -l ../ref_data/
The command to perform the alignment is bwa aln
which has many options. We are going to stick with the default.
bwa aln
Each .fastq
file needs to be aligned separately. The output from this step is in a binary format that we cannot interpret directly.
bwa aln ../ref_data/hg19 sample.fq1 > sample_1.sai
bwa aln ../ref_data/hg19 sample.fq2 > sample_2.sai
We need to make use of the bwa sampe
command to create a human-readable format. We give it the files from the previous step and the fastq files.
bwa sampe ../ref_data/hg19 sample_1.sai sample_2.sai sample.fq1 sample.fq2 > sample.sam
The output is a .sam
file that will be described in the next section