Introduction to NGS data

Mark Dunning

Last modified: 08 Apr 2016

Course Introduction

About the Course

Further disclaimer

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” R.A. Fisher, 1938

If you haven’t designed your experiment properly, then all the bioinformatics we teach you won’t help. Consult your local statistician, preferably not the day before your grant is due!

Course Outline

Day 1

Day 2

Day 3

Historical context

Cast your minds back a few years...

Plenty of success stories with microarrays

What did we learn from arrays?

Reproducibility is key

Two biostatisticians (later termed ‘forensic bioinformaticians’) from M.D. Anderson used R extensively in their re-analysis and investigation of a clinical prognostication paper from Duke. The subsequent scandal put reproducible research on the map.

Keith Baggerly’s talk from Cambridge in 2010 is highly recommended.

Why do sequencing?

Microarrays vs sequencing

The cost of sequencing

Reports of the death of microarrays

Reports of the death of microarrays: greatly exaggerated?

http://core-genomics.blogspot.co.uk/2014/08/seqc-kills-microarrays-not-quite.html

What are NGS data?

Illumina sequencing *

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

* Other sequencing technologies are available

Paired-end

Multiplexing

Image processing

firecrest

Base-calling

bustard

Alignment

Data formats

Raw reads - fastq

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

~250 million reads (sequences) per HiSeq lane
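
Each read in a fastq file occupies four lines: a name starting with @, the bases, a separator starting with +, and one quality character per base. A minimal base-R sketch of pulling these apart for the record shown above; `parse_fastq_record` is a hypothetical helper written for illustration, not a function from any package.

```r
# The four lines of the example record
record <- c(
  "@SEQ_ID",
  "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
  "+",
  "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
)

# Hypothetical helper: check the record layout and return its parts
parse_fastq_record <- function(lines) {
  stopifnot(length(lines) == 4,
            startsWith(lines[1], "@"),
            startsWith(lines[3], "+"))
  list(id = substring(lines[1], 2),   # drop the leading "@"
       sequence = lines[2],
       quality = lines[4])
}

read1 <- parse_fastq_record(record)
read1$id
## [1] "SEQ_ID"
nchar(read1$sequence)
## [1] 60
nchar(read1$quality)
## [1] 60
```

For real files, a dedicated package is a better choice than hand-written parsing; this only illustrates the layout.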

Fastq sequence names

@HWUSI-EAS100R:6:73:941:1973#0/1
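
The sequence name encodes where the read came from. As a sketch, assuming the older Illumina naming layout (instrument, flowcell lane, tile, x and y coordinates of the cluster, index number, member of a pair), the name above can be split on its separators. The field labels are chosen here for illustration.

```r
read_name <- "@HWUSI-EAS100R:6:73:941:1973#0/1"

# Drop the leading "@", then split on ":", "#" and "/"
fields <- strsplit(sub("^@", "", read_name), "[:#/]")[[1]]

# Labels are assumptions based on the older Illumina naming scheme
names(fields) <- c("instrument", "lane", "tile", "x", "y",
                   "index", "pair_member")
fields[["lane"]]
## [1] "6"
```

Newer Illumina software writes a different name layout, so the labels would need adjusting for recent data.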

Fastq quality scores

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
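
Each quality character stands for a Phred score Q, related to the probability p that the base call is wrong by Q = -10 log10(p). A minimal sketch of the conversion, assuming the Phred+33 (Sanger) encoding in which "!" corresponds to Q = 0; older Illumina pipelines used a different offset.

```r
quality <- "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"

# ASCII code of each character minus the offset (33 is an assumption)
phred <- utf8ToInt(quality) - 33
phred[1:5]
## [1] 0 6 6 9 7
range(phred)
## [1]  0 37

# Probability that each base call is incorrect
error_prob <- 10^(-phred / 10)
```

A score of 30 corresponds to an error probability of 1 in 1000.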

phred

Useful for quality control

fastqc

Aligned reads - sam

@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
HWI-ST1001:137:C12FPACXX:7:1115:14131:66670     0       chr1    12805   1       42M4I5M *
0       0       TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG     
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ     
AS:i:-28        XN:i:0  XM:i:2  XO:i:1  XG:i:4  NM:i:6  MD:Z:2C41C2     YT:Z:UU NH:i:3  
CC:Z:chr15      CP:i:102518319  XS:A:+  HI:i:0

Sam format - key columns

HWI-ST1001:137:C12FPACXX:7:1115:14131:66670     0       chr1    12805   1       42M4I5M *
0       0       TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG     
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ     
AS:i:-28        XN:i:0  XM:i:2  XO:i:1  XG:i:4  NM:i:6  MD:Z:2C41C2     YT:Z:UU NH:i:3  
CC:Z:chr15      CP:i:102518319  XS:A:+  HI:i:0
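
A sam record is a single tab-separated line. The first eleven columns are mandatory, and any further columns are optional tags. A sketch in base R that rebuilds the record above (keeping only two of its tags for brevity), labels the mandatory columns with the names used in the SAM specification, and checks that the CIGAR string agrees with the read length.

```r
# Rebuild the example alignment as one tab-separated line
sam_line <- paste(
  "HWI-ST1001:137:C12FPACXX:7:1115:14131:66670", "0", "chr1", "12805", "1",
  "42M4I5M", "*", "0", "0",
  "TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG",
  "CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ",
  "AS:i:-28", "NM:i:6",
  sep = "\t")

fields <- strsplit(sam_line, "\t", fixed = TRUE)[[1]]
mandatory <- setNames(fields[1:11],
                      c("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                        "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"))
tags <- fields[-(1:11)]

# Split the CIGAR string into operation lengths and operation types
cigar <- mandatory[["CIGAR"]]
op_len <- as.integer(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
op_type <- regmatches(cigar, gregexpr("[MIDNSHP=X]", cigar))[[1]]

# Operations that consume bases of the read: 42M + 4I + 5M
query_len <- sum(op_len[op_type %in% c("M", "I", "S", "=", "X")])
query_len
## [1] 51
nchar(mandatory[["SEQ"]])
## [1] 51
```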

Sam file flags

Derivation

Property                     ReadHasProperty  Binary  MultiplyBy
isPaired                     TRUE             1       1
isProperPair                 TRUE             1       2
isUnmappedQuery              FALSE            0       4
hasUnmappedMate              FALSE            0       8
isMinusStrand                FALSE            0       16
isMateMinusStrand            TRUE             1       32
isFirstMateRead              TRUE             1       64
isSecondMateRead             FALSE            0       128
isSecondaryAlignment         FALSE            0       256
isNotPassingQualityControls  FALSE            0       512
isDuplicate                  FALSE            0       1024

Value of flag is given by 1x1 + 1x2 + 0x4 + 0x8 + 0x16 + 1x32 + 1x64 + 0x128 + 0x256 + 0x512 + 0x1024 = 99
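
The derivation can be run in reverse: given a flag, a bitwise AND against each power of two recovers the properties that are set. A minimal base-R sketch; `decode_flag` is a hypothetical helper, and the property names follow the table above.

```r
# Each property corresponds to one bit of the flag
flag_bits <- c(isPaired = 1L, isProperPair = 2L, isUnmappedQuery = 4L,
               hasUnmappedMate = 8L, isMinusStrand = 16L,
               isMateMinusStrand = 32L, isFirstMateRead = 64L,
               isSecondMateRead = 128L, isSecondaryAlignment = 256L,
               isNotPassingQualityControls = 512L, isDuplicate = 1024L)

# Hypothetical helper: TRUE for every property whose bit is set in the flag
decode_flag <- function(flag) {
  set <- bitwAnd(rep(as.integer(flag), length(flag_bits)), flag_bits) != 0
  names(set) <- names(flag_bits)
  set
}

properties <- decode_flag(99)
names(properties)[properties]
# isPaired, isProperPair, isMateMinusStrand, isFirstMateRead

# Adding the set bits back together recovers the flag
sum(flag_bits[properties])
## [1] 99
```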

See also

samtools flagstat

$ samtools flagstat NA19914.chr22.bam
2109857 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
40096 + 0 duplicates
2064356 + 0 mapped (97.84%:-nan%)
2011540 + 0 paired in sequencing
1005911 + 0 read1
1005629 + 0 read2
1903650 + 0 properly paired (94.64%:-nan%)
1920538 + 0 with itself and mate mapped
45501 + 0 singletons (2.26%:-nan%)
5134 + 0 with mate mapped to a different chr
4794 + 0 with mate mapped to a different chr (mapQ>=5)

Aligned reads - bam

samtools view mysequences.bam | head

Post-processing of aligned files

Aligned files in IGV

igv

Crash-course in R

Advantages of R

The R programming language is now recognised beyond the academic community as an effective solution for data analysis and visualisation. Notable users of R include Facebook, Google, Microsoft (who recently invested in a commercial provider of R), and the New York Times.

Key features

Support for R

RStudio

R recap

R can do simple numerical calculations

2  + 2
## [1] 4
sqrt(25)
## [1] 5

Here, sqrt is a function and the number 25 is the argument passed to it. Functions can have multiple arguments.
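
As a short illustration of multiple arguments, the base R functions round and seq each take more than one. Naming the arguments, as done here, is optional but makes the call easier to read.

```r
rounded <- round(3.14159, digits = 2)
rounded
## [1] 3.14
evens <- seq(from = 2, to = 10, by = 2)
evens
## [1]  2  4  6  8 10
```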

Variables

We can save the result of a computation as a variable using the assignment operator <-

x <- sqrt(25)
x + 5
## [1] 10
y <- x + 5
y
## [1] 10

Vectors

A vector combines multiple values into a single object. The resulting object is indexed, and particular values can be retrieved using the [] operator

vec <- c(1,2,3,6)
vec[1]
## [1] 1
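
The [] operator also accepts more than a single index. A brief sketch using the same vec: a vector of positions, a logical condition, and a negative index to drop an element.

```r
vec <- c(1,2,3,6)
vec[c(1,4)]
## [1] 1 6
vec[vec > 2]
## [1] 3 6
vec[-1]
## [1] 2 3 6
```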

Vectors

Calculations can be performed on vectors

vec*2
## [1]  2  4  6 12
mean(vec)
## [1] 3
sum(vec)
## [1] 12

Data frames

These can be used to represent familiar tabular (row and column) data

df <- data.frame(A = c(1,2,3,6), B = c(7,8,10,12))
df
##   A  B
## 1 1  7
## 2 2  8
## 3 3 10
## 4 6 12

Data frames

Columns do not need to share the same data type

df <- data.frame(A = c(1,2,3,6), 
                 B = month.name[c(7,8,10,12)])
df
##   A        B
## 1 1     July
## 2 2   August
## 3 3  October
## 4 6 December

Data frames

We can subset data frames using [], this time specifying both a row index and a column index

df[1,2]
## [1] July
## Levels: August December July October
df[2,1]
## [1] 2

Data frames

df[1,]
##   A    B
## 1 1 July
df[,2]
## [1] July     August   October  December
## Levels: August December July October

Leave the row index blank to get all rows, or the column index blank to get all columns

Data frames

Another way to access columns is by the $ operator

df$A
## [1] 1 2 3 6
df$B
## [1] July     August   October  December
## Levels: August December July October
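
Row indices, the $ operator and logical conditions can be combined to keep only the rows that meet a criterion. A small sketch on the same data frame; the threshold of 2 is arbitrary and chosen for illustration.

```r
df <- data.frame(A = c(1,2,3,6), 
                 B = month.name[c(7,8,10,12)])

# Keep the rows where column A is greater than 2, and all columns
subset_df <- df[df$A > 2, ]
subset_df
##   A        B
## 3 3  October
## 4 6 December
```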

Plotting

All your favourite types of plot can be created in R
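
A minimal sketch of three common base-R plots, drawn from simulated data (the values and groups are made up for illustration). The plots are written to a temporary PDF so that the code also runs outside an interactive session; in RStudio the pdf and dev.off lines can be dropped to draw in the plotting pane.

```r
set.seed(1)                               # make the simulated data repeatable
values <- rnorm(100)                      # 100 random values
groups <- factor(rep(c("A", "B"), each = 50))

plot_file <- tempfile(fileext = ".pdf")   # temporary output file
pdf(plot_file)
hist(values)                              # histogram
boxplot(values ~ groups)                  # one box per group
plot(values[1:50], values[51:100])        # scatter plot
dev.off()
```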

Plotting

The Bioconductor project

The Bioconductor project

Many of the packages are by well-respected authors and get lots of citations.

Downloading a package

Each package has its own landing page, e.g. http://bioconductor.org/packages/release/bioc/html/beadarray.html. Here you’ll find:

Introducing the practical