(Acknowledgement to Ines De Santiago for her session at the previous summer school)
FastQC, from the Babraham Bioinformatics Core, has emerged as the standard tool for performing quality assessment on sequencing reads.
The manual for fastqc is available online and is very comprehensive, especially the parts that describe particular sections of the report. The authors also run a “QCfail” blog, which discusses some sequencing QC errors they have encountered and how they were diagnosed.
A “traffic light” system is used to draw your attention to sections of the report that require further investigation. However, it is worth bearing in mind that fastqc is designed to be run on fastq files from any type of sequencing experiment and has no knowledge of the particular library preparation or the conditions that you are studying. It could be that you expect high levels of duplication or an unusual GC content. Always consider the nature of your study before taking any drastic action!
Also, fastqc will not actually do anything to your data. If you decide to trim reads or remove contamination from your samples, you will need to use another tool.
fastqc report
The first section of the report gives some simple statistics about the composition of your file, which can be useful for checking that fastqc has guessed the encoding correctly and identified the correct number of reads. This section of the report is designed never to give a warning message.
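To make these statistics concrete, here is a minimal Python sketch that counts the reads in a FASTQ stream and guesses the quality offset from the lowest quality character seen. The function name `basic_stats` and the offset heuristic are assumptions made for illustration; this is not how fastqc itself is implemented, and the sketch assumes a well-formed, uncompressed file with four lines per record.

```python
import io


def basic_stats(handle):
    """Count reads and summarise read lengths and quality characters.

    Assumes a well-formed, uncompressed FASTQ with four lines per record.
    """
    n_reads = 0
    lengths = set()
    codes = set()
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        n_reads += 1
        lengths.add(len(seq))
        codes.update(ord(c) for c in qual)
    # Simplified heuristic (an assumption): a character below ASCII 64
    # can only occur when the quality offset is 33.
    if not codes:
        offset = None
    else:
        offset = 33 if min(codes) < 64 else 64
    return {
        "reads": n_reads,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "guessed_offset": offset,
    }


# Two toy reads; '#' (ASCII 35) can only be a Phred+33 character.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nII#I\n")
stats = basic_stats(example)
print(stats)
```

Comparing the read count with what the sequencing facility reported is a quick check that a file was not truncated during transfer.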
The per-base sequence quality section of the report is probably the one that receives the most attention. It’s generally accepted that there is a degradation of quality over the duration of a sequencing run, but the extent to which the quality drops off should be monitored. A boxplot is produced for every base position in the read, and the central line and yellow box represent the median and inter-quartile range in the usual manner.
Ideally, the plot should look something like the following:
However, a warning will be triggered if the lower quartile (25% of the data) of any base is less than 10, or if the median for any base is less than 25. A failure (red cross in the traffic light system) occurs if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
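The thresholds above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not fastqc's own code: the function name `per_base_flag` is hypothetical, qualities are assumed to be Phred+33 encoded, and the lower quartile is computed with the standard library's "inclusive" method, which may differ slightly from the percentile calculation fastqc uses.

```python
import statistics


def per_base_flag(quality_strings, offset=33):
    """Return 'pass', 'warn' or 'fail' using the thresholds quoted above."""
    n_positions = max(len(q) for q in quality_strings)
    flag = "pass"
    for pos in range(n_positions):
        # Quality scores of every read that is long enough to cover pos.
        scores = [ord(q[pos]) - offset for q in quality_strings if len(q) > pos]
        median = statistics.median(scores)
        if len(scores) > 1:
            lower_q = statistics.quantiles(scores, n=4, method="inclusive")[0]
        else:
            lower_q = scores[0]
        if lower_q < 5 or median < 20:
            return "fail"  # one failing position fails the whole section
        if lower_q < 10 or median < 25:
            flag = "warn"
    return flag


print(per_base_flag(["IIII"] * 4))   # every base is Q40
print(per_base_flag(["5555"] * 4))   # every base is Q20
print(per_base_flag(["II#I", "II#I", "IIII"]))  # third base is mostly Q2
```

Note that a single poor position is enough to change the flag, which is why a late drop in quality at the end of long reads so often produces a warning.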
With the per-sequence quality scores section of the report, we are checking to see if there is a population of sequences that have low quality values. A warning occurs when the most frequently observed mean quality is below 27, whereas a failure indicates that it is below 20.
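As a rough sketch of this check, the Python below computes a mean quality for each read, tallies the means, and compares the most common value with the two thresholds. The function name `per_sequence_flag` is hypothetical, Phred+33 encoding is assumed, and rounding each mean down to an integer before tallying is a simplification chosen for this example.

```python
from collections import Counter


def per_sequence_flag(quality_strings, offset=33):
    """Flag a set of reads by their most frequently observed mean quality."""
    means = []
    for qual in quality_strings:
        if not qual:
            continue  # ignore empty reads
        scores = [ord(c) - offset for c in qual]
        means.append(sum(scores) // len(scores))  # mean, rounded down
    # Tally the per-read means and take the most common one.
    most_common_mean, _ = Counter(means).most_common(1)[0]
    if most_common_mean < 20:
        return most_common_mean, "fail"
    if most_common_mean < 27:
        return most_common_mean, "warn"
    return most_common_mean, "pass"


print(per_sequence_flag(["IIII", "IIII", "####"]))  # mostly Q40 reads
print(per_sequence_flag(["5555", "5555", "IIII"]))  # mostly Q20 reads
print(per_sequence_flag(["####", "####", "IIII"]))  # mostly Q2 reads
```

Plotting the full tally rather than only its peak is what reveals a second, low-quality population of reads, for example from one poorly imaged region of a flowcell.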
The per-base sequence content section is one area of the report where you need to exercise some caution, because the results may depend on the type of sequencing you are performing. This course is going to be focussed on whole-genome (or exome) DNA sequencing, which should result in an even distribution of bases along the read length.
Sometimes you will get biased sequence composition at the start of reads due to adaptor contamination, which would tend to be flagged elsewhere in the report.
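To see what an "even distribution of bases" means in practice, the following Python sketch tabulates the percentage of A, C, G and T at each position across a set of reads. The function name `base_composition` is hypothetical, and ignoring N calls when computing percentages is an assumption made for this example.

```python
from collections import Counter


def base_composition(sequences):
    """Return one dict per read position giving the % of A, C, G and T."""
    n_positions = max(len(s) for s in sequences)
    table = []
    for pos in range(n_positions):
        counts = Counter(s[pos].upper() for s in sequences if len(s) > pos)
        # Only A, C, G and T contribute to the total (Ns are ignored).
        total = sum(counts[base] for base in "ACGT")
        table.append(
            {base: (100.0 * counts[base] / total if total else 0.0)
             for base in "ACGT"}
        )
    return table


# Four toy reads that all start with 'A', mimicking a biased first position.
reads = ["ACGT", "ACGA", "ATGC", "AGGG"]
composition = base_composition(reads)
for pos, row in enumerate(composition, start=1):
    print(pos, row)
```

In an unbiased whole-genome library the four lines stay roughly parallel along the read, so a position where one base dominates, as in the first column of the toy reads, is the kind of pattern worth investigating.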