Fastqc Primer

(Acknowledgement to Ines De Santiago for her session at the previous summer school)

FastQC from Babraham Bioinformatics Core has emerged as the standard tool for performing quality assessment on sequencing reads

The manual for fastqc is available online and is very comprehensive; especially the parts which describe particular sections of the report. The authors also run a “QCfail” blog which discusses some sequencing QC errors they have encountered and how they were diagnosed.

A “traffic light” system is used to draw your attention to sections of the report that require further investigation. However, it is worth bearing in mind that fastqc is designed to be run on fastq files from any type of sequencing experiment and has no knowledge of the particular library preparation, or conditions that you are studying. It could be that you expect high levels of duplication or GC content. Always consider the nature of your study before taking any drastic action!

Also, fastqc will not actually do anything to your data. If you decide to trim or remove contamination for your samples, you will need to use another tool.

The sections of a `fastqc` report

Basic Statistics

Some simple statistics about the composition of your file, which can be useful to see if it has guessed the encoding correctly and identified the correct number of reads. This section of the report is designed never to give a warning message

Per-base sequence quality

This section of the report is probably the one that receives most attention. It’s generally accepted that there is a degradation of quality over the duration of a sequencing run, but the extent to which the quality “drops-off” should be monitored. A boxplot is produced for every base-position in the read and the central line and yellow box represent the median and inter-quartile range in the usual manner.

Ideally, the plot should look something like following:-

However, a warning will be triggered if the lower quartile (25% of the data) of any base in less than 10, or if the median for any base is less than 25. A failure (red cross in the traffic light system) occurs if the lower quartile for any base is less than 5, or if the median for any base is less than 20.

Per-sequence quality scores

With this section of the report, we are checking to see if there is a population of sequences that have low quality values. A warning occurs when the mean quality is below 27, whereas a failure indicates a mean below 20.

Per-base sequence content

This is one area of the report where you need to exercise some caution, because the results may depend on the type of sequencing you are performing. This course is going to be focussed on whole-genome (or exome) DNA sequencing, which should result in an even distribution of bases along the read length.

Sometimes you will get biased sequencing composition at the start of reads due to adaptor contamination, which would tend to be flagged elsewhere in the report

And some biases are inevitable due to the type of sequencing being performed. e.g. RNA-seq is know to introduce biases at the start of reads due to “random” hexamer priming

Or in the cases of a RRBS library, you will have converted most of the C bases to T, so would expect something along the following lines:-

Per-sequence GC content

A familiar bell-shaped normal distribution should be seen. However, the location of the mean (central peak) will depend on the species being sequenced. fastqc has no way of knowing what species to expect, so may produce a warning or error message for an otherwise good-quality sequencing run.

Significant deviations from normal would tend to signify a problem with the library though, or contamination from different species

Per-base N content

Recall that an N base is recorded when we are not confident enough to make a particular base call. We might sometimes see some Nappearing at the end of a sequence due to the expected drop in quality, although this should be a low proportion.

Sequence length distribution

As you might expect, this will tell you if all your reads are not the same length. There might be entirely plausible reasons for having reads of different length (e.g. reads generated from multiple runs), so again we may wish to ignore any warnings that arise from this part of the report.

Duplicate Sequences

Again, we need to exercise caution when interpreting these plots and traffic lights as they as based on a library that is diverse and not enriched for any particular property.

A perfectly good and respectable RNA-seq library may have duplication rates above 50%

Overrepresented Sequences

This section of the report can highlight contamination or lower-than-expected diversity

The tool will attempt to identify and adaptors present in the data from common ones that it knows about

Kmer Content

Although the overrepresented sequences section above will identify sequences that are exactly duplicated, it may miss other problems in the data such as i) long sequences with poor quality ii) repeated sequences appearing at various places

This section reports the positional bias of the 6 most-biased k-mers, and whereabouts they occur in the read

“Random” hexamer primer in RNA-seq libraries are sometimes not as random as we would hope

Described in detail in Hansen et al. (2010)

Running `fastqc`

fastqc can be run from the CRUK Docker as follows;

cd /home/participant/Course_Materials/Day1/
fastqc sample.fq1

As a result, you should get two files in your working directory; sample.fq1_fastqc.zip and sample.fq1_fastqc.html. The .zip file contains all the metrics that fastqc computes, should you wish to perform extra manipulation and visualisation beyond what fastqc offers.

The html file can be opened in a web browser. If you follow the link on the Desktop to Course_Materials you should be able to navigate to the folder Day1/ and see the files we have created.

Exercise

View the fastqc report for our example file sample.fq1
Are there any areas of concern?

Optional / advanced section

If time allows, we will cover how to trim our data

Based on these plots we may want to trim our data; fastqc will not do this for us.

Popular choices are fastx_toolit, trimmomatic and cutadapt; all of which should be installed on your computers. As implied by the name, cutadapt is useful for removing adaptor sequences.

fastx_toolkit allows us to remove reads that do not have a high enough proportion of high-quality reads.

fastq_quality_filter -v Q 33 -q 20 -p 75 -i sample.fq1 -o sample_filtered.fq

the options were used:-

-i: input file
-o: output file
-v: report the number of sequences
-Q:  33, determines the input quality ASCII offset
-q: 20, the quality value required
-p 75: the percentage of bases required to have that quality value

the output should be something like…

Quality cut-off: 20
Minimum percentage: 75
Input: 6290535 reads.
Output: 5399767 reads.
discarded 890768 (14%) low-quality reads.

It can also do straightforward trimming to a particular read length

fastx_trimmer -v -f 1 -l 60 -i Day1/sample.fq1 -o sample.fastx.trimmed.fq1

the options specified were:-

-f: first base to keep
-l: last base to keep
-i: input file
-o: output file
-v: report number of sequences

Trimmomatic is interesting as it allows for differently-sized reads.

To use trimmomatic we can run the following in a Terminal. Here $TRIMMOMATIC is used to refer to the particular location that the tool has been installed to.

java -jar $TRIMMOMATIC SE -phred33 sample.fq1 sample.trimmed.fq1 TRAILING:3 MINLEN:30

SE: run single-end mode
-phred33: use the offset of 33 to interpret quality scores
TRAILING:3 cut bases from the end of the read if below 3

For this output file it is instructive to load using ShortRead

trimmed.fq <- readFastq("sample.trimmed.fq1")
trimmed.fq

## class: ShortReadQ
## length: 5971322 reads; width: 30..68 cycles

How many were removed?

length(trimmed.fq)/length(fq)

## [1] 0.949255

summary(width(sread(trimmed.fq)))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   68.00   68.00   64.58   68.00   68.00

hist(width(sread(trimmed.fq)))

We can verify why some of the reads were removed

old.ids <- as.character(id(fq))
ids.left <- as.character(id(trimmed.fq))

missing.ids <- setdiff(old.ids,ids.left)
myquals <- quality(fq)
myquals[old.ids %in% missing.ids]

## class: FastqQuality
## quality:
##   A BStringSet instance of length 319213
##          width seq
##      [1]    68 #############################...############################
##      [2]    68 #############################...############################
##      [3]    68 #############################...############################
##      [4]    68 #############################...############################
##      [5]    68 L5;8<=>:>;<?><<;7>?##########...############################
##      ...   ... ...
## [319209]    68 L<:9=;=9:=6;8;>>:@9=>>?######...############################
## [319210]    68 #############################...############################
## [319211]    68 H<.3*;:<<<###################...############################
## [319212]    68 #############################...############################
## [319213]    68 #############################...############################

Exercise

Re-run fastqc and observe what effect this has on the results

fastqc sample.trimmed.fq1
fastqc sample.fastx.trimmed.fq1
fastqc sample.filtered.fq1

Alignment

Locating where each generated sequence came from in the genome
Outside the scope of this course
Usually perfomed automatically by a sequencing service
For most of what follows in the course, we will assume alignment has been performed and we are dealing with aligned data
- Popular aligners
- bwa http://bio-bwa.sourceforge.net/
- bowtie http://bowtie-bio.sourceforge.net/index.shtml
- novoalign http://www.novocraft.com/products/novoalign/
- stampy http://www.well.ox.ac.uk/project-stampy
- many, many more…..

For completeness, we will show the commands used to aligned these two example fastq files.

Note, that are assuming that you are still located in the Day1 directory.

cd /home/participant/Course_Materials/Day1

We have previously downloaded a reference genome from the 1000 genomes into a directory raw_data

wget http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta -P ../ref_data/
cd ../ref_data
gunzip ref_data/human_g1k_v37.fasta.gz

bwa can index the unzipped .fasta file. The option -p allows us how we want to prefix all the files that the indexing procedure creates. [N.B this step will take a long time]

bwa index -p hg19 -a bwtsw Homo_sapiens_assembly19.fasta

Here’s one we made earlier!

ls -l ../ref_data/

The command to perform the alignment is bwa aln which has many options. We are going to stick with the default.

bwa aln

Each .fastq file needs to be aligned separately. The output from this step is in a binary format that we cannot interpret directly.

bwa aln ../ref_data/hg19 sample.fq1 > sample_1.sai

bwa aln ../ref_data/hg19 sample.fq2 > sample_2.sai

We need to make use of the bwa sampe command to create a human-readable format. We give it the files from the previous step and the fastq files.

bwa sampe ../ref_data/hg19 sample_1.sai sample_2.sai sample.fq1 sample.fq2 > sample.sam

The output is a .sam file that will be described in the next section

Understanding Sequencing Reads

Mark Dunning

Last modified: 23 Jul 2016

Fastqc Primer

The sections of a `fastqc` report

Running `fastqc`

Exercise

Optional / advanced section

Exercise

Alignment

Understanding Sequencing Reads

Mark Dunning

Last modified: 23 Jul 2016

Fastqc Primer

The sections of a fastqc report

Running fastqc

Exercise

Optional / advanced section

Exercise

Alignment

The sections of a `fastqc` report

Running `fastqc`